Modalities

Modalities represent the sources of information that the platform processes and analyzes in a video.

Visual includes:

  • Actions, objects, and events in the video.
  • Text that appears on screen (through OCR).
  • Brand logos and visual elements.

Audio includes:

  • Ambient sounds, music, and sound effects.
  • Human speech and conversations (Marengo 2.7).
  • Non-speech audio only (Marengo 3.0). For speech content, use the transcription modality.

Transcription includes (Marengo 3.0 only):

  • Spoken words extracted from the audio track.

You specify modalities through different parameters depending on your task:

  • Model options: when you create an index.
  • Search options: when you search videos.
  • Embedding options: when you retrieve embeddings.

Model options

When you create an index, specify which modalities the platform must process. You can include the following values in the model_options array:

  • visual: To process visual content
  • audio: To process audio content

You can enable one or both model options. The platform processes only the modalities you specify.
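As a sketch, an index-creation request body could carry the model_options array like this. Only model_options and its values come from this page; the index_name and model_name fields and their values are illustrative assumptions, not confirmed API details:

```python
import json

# Sketch of an index-creation request body. Only model_options is
# documented here; index_name and model_name are illustrative assumptions.
create_index_body = {
    "index_name": "product-videos",               # hypothetical index name
    "models": [
        {
            "model_name": "marengo3.0",           # assumed model identifier
            "model_options": ["visual", "audio"], # one or both modalities
        }
    ],
}

print(json.dumps(create_index_body, indent=2))
```

Listing only one value, such as ["visual"], tells the platform to skip audio processing entirely for that index.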

Search options

When you search videos, use the search_options parameter to specify which modalities the platform uses to find relevant matches. The values and their behavior depend on the version of the model you’re using.

Marengo 3.0

Marengo 3.0 separates audio into speech and non-speech content.

Find visual content

Set search_options to visual to search for:

  • Actions, objects, and events in the video
  • Text that appears on screen (through OCR)
  • Brand logos and visual elements

Example use cases:

  • Finding scenes with specific objects: “red car in parking lot”
  • Locating on-screen text: “company logo on building”
  • Identifying actions: “person running”
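A visual-only search request body might look like the following sketch. Apart from search_options and the query text, the field names (index_id, query_text) are assumptions rather than confirmed parameters:

```python
# Visual search: matches actions, objects, on-screen text, and logos.
# Field names other than search_options are illustrative assumptions.
visual_search_body = {
    "index_id": "idx-123",                    # hypothetical index ID
    "query_text": "red car in parking lot",   # example query from this page
    "search_options": ["visual"],             # visual modality only
}

print(visual_search_body["search_options"])
```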

Find non-speech audio

Set search_options to audio to search for sounds other than human speech:

  • Musical tones and melodies
  • Beeping, alarms, and mechanical sounds
  • Environmental sounds (rain, traffic, nature)

Example use cases:

  • Finding background music: “upbeat electronic music”
  • Locating sound effects: “door slamming”
  • Identifying ambient sounds: “rainfall”

Find spoken words

Set search_options to transcription to search the spoken content in your videos.

Example use cases:

  • Finding mentions of topics: “climate change discussion”
  • Locating product names: “iPhone 15 Pro Max”
  • Identifying speakers discussing concepts: “quarterly revenue growth”

Transcription options

Use the transcription_options parameter to specify how the platform matches your query against spoken words:

  • lexical: Matches the exact words or phrases in your query, allowing for minor spelling variations.
  • semantic: Matches the meaning of your query, even when the spoken words differ.

Exact word matching (lexical)

  • Matches the specific words or phrases in your query
  • Allows for minor spelling variations

Best for: Product names, technical terminology, proper nouns.

Meaning-based matching (semantic)

  • Matches the meaning of your query, even with different wording
  • Finds conceptually similar content

Best for: General concepts, topics that can be expressed in multiple ways.

Using both methods (default)

  • Specify both lexical and semantic, or omit transcription_options entirely
  • Returns the broadest set of results

Best for: Comprehensive searches where you want both exact matches and related content.
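The three matching behaviors above can be sketched as request bodies. Only search_options and transcription_options come from this page; the query_text field and the example queries are assumptions for illustration:

```python
# Exact word matching: finds the literal phrase, allowing minor
# spelling variations. Good for product names and proper nouns.
exact_match = {
    "query_text": "iPhone 15 Pro Max",
    "search_options": ["transcription"],
    "transcription_options": ["lexical"],
}

# Meaning-based matching: finds conceptually similar speech even
# when the spoken wording differs from the query.
meaning_match = {
    "query_text": "climate change discussion",
    "search_options": ["transcription"],
    "transcription_options": ["semantic"],
}

# Default: omit transcription_options (or list both values) to get
# the broadest set of results.
broadest_match = {
    "query_text": "quarterly revenue growth",
    "search_options": ["transcription"],
}
```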

Marengo 2.7

Marengo 2.7 handles all audio (speech and non-speech) as a single modality.

Find visual content

Set search_options to visual to search for:

  • Actions, objects, and events in the video
  • Text that appears on screen (through OCR)
  • Brand logos and visual elements

Find audio content

Set search_options to audio to search all audio, including:

  • Ambient sounds and music
  • Human speech and conversations
  • Sound effects

Combine multiple modalities

You can search across multiple modalities simultaneously by specifying multiple values for the search_options parameter. Control how results are combined using the operator parameter.

| search_options                       | operator | transcription_options | Result                                   |
|--------------------------------------|----------|-----------------------|------------------------------------------|
| ["visual", "transcription"]          | or       | lexical               | Product shown OR exact name spoken       |
| ["visual", "transcription"]          | and      | lexical               | Product shown WHILE exact name spoken    |
| ["visual", "transcription"]          | or       | semantic              | Product shown OR discussed (any wording) |
| ["visual", "transcription"]          | and      | semantic              | Product shown WHILE discussed            |
| ["visual", "audio"]                  | or       | N/A                   | Visuals OR sounds (non-speech)           |
| ["visual", "audio"]                  | and      | N/A                   | Visuals WITH sounds together             |
| ["visual", "audio", "transcription"] | or       | Both                  | Any modality matches                     |
| ["visual", "audio", "transcription"] | and      | Both                  | All modalities match simultaneously      |
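A combined search can be sketched as follows. The search_options, operator, and transcription_options parameters and their values come from this page; the index_id and query_text fields are illustrative assumptions:

```python
# Combined search: the result must show the product on screen WHILE
# the exact name is spoken ("and" joins the modalities).
# Field names other than the three documented parameters are assumptions.
combined_search_body = {
    "index_id": "idx-123",                          # hypothetical index ID
    "query_text": "iPhone 15 Pro Max",
    "search_options": ["visual", "transcription"],
    "operator": "and",                              # all modalities must match
    "transcription_options": ["lexical"],           # exact spoken wording
}

# The operator parameter is only meaningful when more than one
# modality is listed in search_options.
assert len(combined_search_body["search_options"]) > 1
```

Switching operator to "or" broadens the same request: a clip matches if the product is shown or its name is spoken, not necessarily both at once.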

Embedding options

When you create video embeddings, specify the types of embeddings the platform must return. Depending on the version of the model, you can include the following values in the embedding_option array:

Marengo 3.0:

  • visual: To retrieve visual embeddings.
  • audio: To retrieve embeddings for non-verbal audio (musical tones, beeping, environmental sounds).
  • transcription: To retrieve embeddings for transcribed speech (the actual words spoken in the video).

Marengo 2.7:

  • visual-text: To retrieve visual embeddings optimized for text search.
  • audio: To retrieve audio embeddings.
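The per-version embedding_option values can be sketched as request bodies. Only embedding_option and its values come from this page; the video_id field is an assumed name for illustration:

```python
# Embedding retrieval sketches per model version. embedding_option
# values are documented; video_id is an illustrative assumption.
marengo_3_request = {
    "video_id": "vid-456",   # hypothetical video ID
    "embedding_option": ["visual", "audio", "transcription"],
}

marengo_27_request = {
    "video_id": "vid-456",
    "embedding_option": ["visual-text", "audio"],
}
```

Requesting fewer values, such as ["visual"] alone, limits the response to that embedding type.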