Modalities
Modalities represent the sources of information that the platform processes and analyzes in a video.
Visual includes:
- Actions, objects, and events in the video.
- Text that appears on screen (through OCR).
- Brand logos and visual elements.
Audio includes:
- Ambient sounds, music, and sound effects.
- Human speech and conversations (Marengo 2.7).
- Non-speech audio only (Marengo 3.0). For speech content, use the transcription modality.
Transcription includes (Marengo 3.0 only):
- Spoken words extracted from the audio track.
You specify modalities through different parameters depending on your task:
- Model options: when you create an index.
- Search options: when you search videos.
- Embedding option: when you retrieve embeddings.
Model options
When you create an index, specify which modalities the platform must process. You can include the following values in the model_options array:
- visual: To process visual content.
- audio: To process audio content.
You can enable one or both model options. The platform processes only the modalities you specify.
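As a sketch, the body of an index-creation request with both model options enabled might look like the following. The index name and model identifier are placeholders, and the overall request shape is an assumption; consult the API Reference for the exact schema.

```python
# Hypothetical sketch of an index-creation request body.
# "my-index" and "marengo3.0" are placeholder values, not taken from this page.
create_index_body = {
    "index_name": "my-index",
    "models": [
        {
            "model_name": "marengo3.0",            # placeholder model identifier
            "model_options": ["visual", "audio"],  # process both modalities
        }
    ],
}

# The platform processes only the modalities listed in model_options.
enabled = create_index_body["models"][0]["model_options"]
```

Listing only `visual` (or only `audio`) would restrict processing to that single modality.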
Related topics
- Python SDK Reference > Create an index
- Node.js SDK Reference > Create an index
- API Reference > Create an index
Search options
When you search videos, use the search_options parameter to specify which modalities the platform uses to find relevant matches. The values and their behavior depend on the version of the model you’re using.
Marengo 3.0
Marengo 3.0 separates audio into speech and non-speech content.
To find visual content:
Set search_options to visual to search for:
- Actions, objects, and events in the video
- Text that appears on screen (through OCR)
- Brand logos and visual elements
Example use cases:
- Finding scenes with specific objects: “red car in parking lot”
- Locating on-screen text: “company logo on building”
- Identifying actions: “person running”
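A minimal sketch of a visual-only search request follows. Only the `search_options` values come from this page; the index ID and the surrounding field names are assumptions, so check the API Reference for the exact schema.

```python
# Hypothetical search request body: find scenes matching a visual query.
visual_search_body = {
    "index_id": "<your-index-id>",       # placeholder
    "query_text": "red car in parking lot",
    "search_options": ["visual"],        # search visual content only
}
```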
To find non-speech audio:
Set search_options to audio to search for sounds other than human speech:
- Musical tones and melodies
- Beeping, alarms, and mechanical sounds
- Environmental sounds (rain, traffic, nature)
Example use cases:
- Finding background music: “upbeat electronic music”
- Locating sound effects: “door slamming”
- Identifying ambient sounds: “rainfall”
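The equivalent sketch for a non-speech audio search changes only the `search_options` value (the field names besides `search_options` are assumptions):

```python
# Hypothetical search request body: find sounds other than human speech.
audio_search_body = {
    "index_id": "<your-index-id>",   # placeholder
    "query_text": "door slamming",
    "search_options": ["audio"],     # Marengo 3.0: non-speech audio only
}
```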
To find spoken words:
Set search_options to transcription to search the spoken content in your videos.
Example use cases:
- Finding mentions of topics: “climate change discussion”
- Locating product names: “iPhone 15 Pro Max”
- Identifying speakers discussing concepts: “quarterly revenue growth”
Transcription options
Use the transcription_options parameter to specify how the platform matches your query against spoken words:
- lexical: Matches the exact words or phrases in your query, allowing for minor spelling variations.
- semantic: Matches the meaning of your query, even when the spoken words differ.
Exact word matching (lexical)
- Matches the specific words or phrases in your query
- Allows for minor spelling variations
Best for: Product names, technical terminology, proper nouns.
Meaning-based matching (semantic)
- Matches the meaning of your query, even with different wording
- Finds conceptually similar content
Best for: General concepts, topics that can be expressed in multiple ways.
Using both methods (default)
- Specify both lexical and semantic, or omit transcription_options entirely
- Returns the broadest set of results
Best for: Comprehensive searches where you want both exact matches and related content.
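A sketch of a transcription search that uses both matching methods is shown below. Only `search_options`, `transcription_options`, and their values come from this page; the other field names are assumptions.

```python
# Hypothetical request: search spoken content, matching both exact words
# and meaning. Omitting "transcription_options" has the same effect as
# specifying both values.
transcription_search_body = {
    "index_id": "<your-index-id>",                    # placeholder
    "query_text": "quarterly revenue growth",
    "search_options": ["transcription"],
    "transcription_options": ["lexical", "semantic"],
}
```

To match only exact words, set `transcription_options` to `["lexical"]`; for meaning-based matching only, use `["semantic"]`.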
Marengo 2.7
Marengo 2.7 handles all audio (speech and non-speech) as a single modality.
Find visual content
Set search_options to visual to search for:
- Actions, objects, and events in the video
- Text that appears on screen (through OCR)
- Brand logos and visual elements
Find audio content
Set search_options to audio to search all audio, including:
- Ambient sounds and music
- Human speech and conversations
- Sound effects
Combine multiple modalities
You can search across multiple modalities simultaneously by specifying multiple values for the search_options parameter. Control how results are combined using the operator parameter.
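A sketch of a multi-modality search follows. The `operator` value shown ("or") is an assumption about the accepted values, and the field names besides `search_options` and `operator` are placeholders; see the API Reference for the exact schema.

```python
# Hypothetical request: search the visual and audio modalities together.
combined_search_body = {
    "index_id": "<your-index-id>",          # placeholder
    "query_text": "crowd cheering at a concert",
    "search_options": ["visual", "audio"],  # multiple modalities
    "operator": "or",  # assumed value: how per-modality matches are combined
}
```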
Related topics
- Python SDK Reference > Make a search request
- Node.js SDK Reference > Make a search request
- API Reference > Make a search request
Embedding options
When you create video embeddings, specify the types of embeddings the platform must return. Depending on the version of the model, you can include the following values in the embedding_option array:
Marengo 3.0:
- visual: To retrieve visual embeddings.
- audio: To retrieve embeddings for non-verbal audio (musical tones, beeping, environmental sounds).
- transcription: To retrieve embeddings for transcribed speech (the actual words spoken in the video).
Marengo 2.7:
- visual-text: To retrieve visual embeddings optimized for text search.
- audio: To retrieve audio embeddings.
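The per-version values can be sketched as request fragments. Only the `embedding_option` values come from this page; the surrounding request shape is an assumption.

```python
# Hypothetical fragments showing the embedding_option values per model version.
marengo_30_request = {
    "embedding_option": ["visual", "audio", "transcription"],  # Marengo 3.0
}
marengo_27_request = {
    "embedding_option": ["visual-text", "audio"],  # Marengo 2.7
}
```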