Modalities
Modalities represent the sources of information that the platform processes and analyzes in a video.
Visual includes:
- Actions, objects, and events in the video.
- Text that appears on screen (through OCR).
- Brand logos and visual elements.
Audio includes:
- Ambient sounds, music, and sound effects.
- Non-speech audio only. For speech content, use the transcription modality.
Transcription includes:
- Spoken words extracted from the audio track.
You specify modalities through different parameters depending on your task:
- Model options: when you create an index.
- Search options: when you search videos.
- Embedding option: when you retrieve embeddings.
Model options
When you create an index, specify which modalities the platform must process. You can include the following values in the model_options array:
- visual: To process visual content.
- audio: To process audio content.
You can enable one or both model options. The platform processes only the modalities you specify.
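For illustration, the sketch below creates an index with both modalities enabled using Python's requests library. The endpoint path, API version, header name, model name, and payload field names other than model_options are assumptions and may differ from the current API reference.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.3"  # assumption: check the current API version

# Create an index that processes both visual and audio content.
# The platform processes only the modalities listed in model_options.
payload = {
    "index_name": "my-video-index",      # assumption: field name may differ
    "models": [
        {
            "model_name": "marengo2.7",  # assumption: use the model version your plan supports
            "model_options": ["visual", "audio"],
        }
    ],
}

response = requests.post(
    f"{BASE_URL}/indexes",               # assumption: endpoint path may differ
    headers={"x-api-key": API_KEY},      # assumption: header name may differ
    json=payload,
)
response.raise_for_status()
print(response.json())
```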
Related topics
- Python SDK Reference > Create an index
- Node.js SDK Reference > Create an index
- API Reference > Create an index
Search options
When you search videos, use the search_options parameter to specify which modalities the platform uses to find relevant matches.
Marengo separates audio into speech and non-speech content.
To find visual content:
Set search_options to visual to search for:
- Actions, objects, and events in the video
- Text that appears on screen (through OCR)
- Brand logos and visual elements
Example use cases:
- Finding scenes with specific objects: “red car in parking lot”
- Locating on-screen text: “company logo on building”
- Identifying actions: “person running”
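As a hedged sketch, a visual-only search request might look like the following. The endpoint path, header name, response shape, and field names other than search_options are illustrative assumptions.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.3"  # assumption: check the current API version

# Search only the visual modality for a scene described in natural language.
payload = {
    "index_id": "YOUR_INDEX_ID",  # assumption: field name may differ
    "query_text": "red car in parking lot",
    "search_options": ["visual"],
}

response = requests.post(
    f"{BASE_URL}/search",         # assumption: endpoint path may differ
    headers={"x-api-key": API_KEY},
    json=payload,
)
response.raise_for_status()
for clip in response.json().get("data", []):  # assumption: response shape may differ
    print(clip)
```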
To find non-speech audio:
Set search_options to audio to search for sounds other than human speech:
- Musical tones and melodies
- Beeping, alarms, and mechanical sounds
- Environmental sounds (rain, traffic, nature)
Example use cases:
- Finding background music: “upbeat electronic music”
- Locating sound effects: “door slamming”
- Identifying ambient sounds: “rainfall”
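The request shape is the same as in the visual sketch above; only search_options changes. A hypothetical payload:

```python
# Reuses the request shape from the visual search sketch above.
payload = {
    "index_id": "YOUR_INDEX_ID",
    "query_text": "upbeat electronic music",
    "search_options": ["audio"],  # matches non-speech audio only
}
```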
To find spoken words:
Set search_options to transcription to search the spoken content in your videos.
Example use cases:
- Finding mentions of topics: “climate change discussion”
- Locating product names: “iPhone 15 Pro Max”
- Identifying speakers discussing concepts: “quarterly revenue growth”
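Again, only search_options changes from the earlier sketch. A hypothetical payload:

```python
# Reuses the request shape from the visual search sketch above.
payload = {
    "index_id": "YOUR_INDEX_ID",
    "query_text": "quarterly revenue growth",
    "search_options": ["transcription"],  # matches spoken words only
}
```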
Transcription options
Use the transcription_options parameter to specify how the platform matches your query against spoken words:
- lexical: Matches the exact words or phrases in your query, allowing for minor spelling variations.
- semantic: Matches the meaning of your query, even when the spoken words differ.
Exact word matching (lexical)
- Matches the specific words or phrases in your query
- Allows for minor spelling variations
Best for: Product names, technical terminology, proper nouns.
Meaning-based matching (semantic)
- Matches the meaning of your query, even with different wording
- Finds conceptually similar content
Best for: General concepts, topics that can be expressed in multiple ways.
Using both methods (default)
- Specify both lexical and semantic, or omit transcription_options entirely
- Returns the broadest set of results
Best for: Comprehensive searches where you want both exact matches and related content.
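Continuing the hypothetical payload from the search sketches above, transcription_options narrows how spoken words are matched:

```python
# Restrict transcription matching to exact words (lexical).
# Omitting transcription_options entirely applies both methods (the default).
payload = {
    "index_id": "YOUR_INDEX_ID",
    "query_text": "iPhone 15 Pro Max",
    "search_options": ["transcription"],
    "transcription_options": ["lexical"],
}
```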
Combine multiple modalities
You can search across multiple modalities simultaneously by specifying multiple values for the search_options parameter. Control how results are combined using the operator parameter.
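A hypothetical multi-modality payload follows. The operator values shown are assumptions, so check the API reference for the accepted values.

```python
# Search visual and transcription content in a single request.
payload = {
    "index_id": "YOUR_INDEX_ID",
    "query_text": "executive discussing quarterly revenue on stage",
    "search_options": ["visual", "transcription"],
    # assumption: "or" returns matches from any modality,
    # "and" requires matches in all specified modalities
    "operator": "or",
}
```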
Related topics
- Python SDK Reference > Make a search request
- Node.js SDK Reference > Make a search request
- API Reference > Make a search request
Embedding options
When you create video embeddings, specify the types of embeddings the platform must return. You can include the following values in the embedding_option array:
- visual: To retrieve visual embeddings.
- audio: To retrieve embeddings for non-verbal audio (musical tones, beeping, environmental sounds).
- transcription: To retrieve embeddings for transcribed speech (the actual words spoken in the video).
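As a rough sketch, requesting all three embedding types for a video embedding task might look like this. The endpoint path, task-based retrieval flow, header name, and parameter serialization are assumptions that may differ from the current API reference.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.3"  # assumption: check the current API version

# Retrieve video embeddings, requesting all three embedding types.
response = requests.get(
    f"{BASE_URL}/embed/tasks/YOUR_TASK_ID",  # assumption: endpoint and flow may differ
    headers={"x-api-key": API_KEY},          # assumption: header name may differ
    params={"embedding_option": ["visual", "audio", "transcription"]},
)
response.raise_for_status()
print(response.json())
```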