Modalities represent the sources of information that the platform processes and analyzes in a video.
Visual includes:
Audio includes:
Transcription includes:
You specify modalities through different parameters depending on your task:
When you create an index, specify which modalities the platform must process. You can include the following values in the model_options array:
visual: To process visual content
audio: To process audio content
You can enable one or both model options. The platform processes only the modalities you specify.
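As a minimal sketch of the rule above, the following Python helper assembles a model_options array and rejects any value other than visual and audio. The name build_model_options is hypothetical and not part of any SDK. It only produces the list you would supply when creating an index.

```python
# Hypothetical helper for illustration; not part of the platform's SDK.
ALLOWED_MODEL_OPTIONS = ("visual", "audio")


def build_model_options(*modalities):
    """Return a de-duplicated model_options list, keeping the order given."""
    if not modalities:
        raise ValueError("specify at least one of: visual, audio")
    options = []
    for modality in modalities:
        if modality not in ALLOWED_MODEL_OPTIONS:
            raise ValueError(f"unsupported model option: {modality!r}")
        if modality not in options:
            options.append(modality)
    return options


# Enable both modalities, or only one.
both = build_model_options("visual", "audio")
visual_only = build_model_options("visual")
print(both, visual_only)
```

Validating the values before sending a request makes a mistake such as passing transcription as a model option fail early on the client side.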
When you search videos, use the search_options parameter to specify which modalities the platform uses to find relevant matches.
Marengo separates audio into speech and non-speech content.
To find visual content:
Set search_options to visual to search for:
Example use cases:
To find non-speech audio:
Set search_options to audio to search for sounds other than human speech:
Example use cases:
To find spoken words:
Set search_options to transcription to search the spoken content in your videos.
Example use cases:
Use the transcription_options parameter to specify how the platform matches your query against spoken words:
lexical: Matches the exact words or phrases in your query, allowing for minor spelling variations.
semantic: Matches the meaning of your query, even when the spoken words differ.
Exact word matching (lexical)
Best for: Product names, technical terminology, proper nouns.
Meaning-based matching (semantic)
Best for: General concepts, topics that can be expressed in multiple ways.
Using both methods (default)
Set transcription_options to both lexical and semantic, or omit transcription_options entirely.
Best for: Comprehensive searches where you want both exact matches and related content.
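The three matching choices above can be sketched as a small Python function that builds search parameters for a spoken-word search. This is an illustration under stated assumptions: the function name build_transcription_search and the query_text field name are not taken from this page, and the returned dictionary stands in for whatever request your client actually sends.

```python
# Sketch only: "query_text" is an assumed field name, and this helper
# is hypothetical rather than part of the platform's SDK.
VALID_TRANSCRIPTION_OPTIONS = ("lexical", "semantic")


def build_transcription_search(query, transcription_options=None):
    """Build parameters for searching the spoken content of videos."""
    params = {"query_text": query, "search_options": ["transcription"]}
    if transcription_options is None:
        # Omitting transcription_options uses both methods (the default).
        return params
    for option in transcription_options:
        if option not in VALID_TRANSCRIPTION_OPTIONS:
            raise ValueError(f"unsupported transcription option: {option!r}")
    params["transcription_options"] = list(transcription_options)
    return params


# Exact matching for a product name; default matching for a general topic.
exact = build_transcription_search("Model X200", ["lexical"])
broad = build_transcription_search("how to reset a password")
print(exact)
print(broad)
```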
You can search across multiple modalities simultaneously by specifying multiple values for the search_options parameter. Control how results are combined using the operator parameter.
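A hedged sketch of a multi-modality search follows. The three search_options values come from this page. The operator values "or" and "and", the query_text field name, and the function build_multimodal_search are assumptions made for illustration, so check them against the API reference before relying on them.

```python
# Sketch only. The search_options values are listed on this page; the
# operator values and the "query_text" field name are assumptions.
VALID_SEARCH_OPTIONS = ("visual", "audio", "transcription")
ASSUMED_OPERATORS = ("or", "and")


def build_multimodal_search(query, search_options, operator):
    """Build parameters for a search that spans several modalities."""
    unknown = [o for o in search_options if o not in VALID_SEARCH_OPTIONS]
    if unknown:
        raise ValueError(f"unsupported search options: {unknown}")
    if operator not in ASSUMED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return {
        "query_text": query,
        "search_options": list(search_options),
        "operator": operator,
    }


# Search the visual content and the spoken words in one request.
params = build_multimodal_search(
    "crowd cheering after a goal", ["visual", "transcription"], "or"
)
print(params)
```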
When you create video embeddings, specify the types of embeddings the platform must return. You can include the following values in the embedding_option array:
visual: To retrieve visual embeddings.
audio: To retrieve embeddings for non-verbal audio (musical tones, beeping, environmental sounds).
transcription: To retrieve embeddings for transcribed speech (the actual words spoken in the video).
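Because audio covers only non-verbal sound while transcription covers the spoken words, it is easy to request the wrong one. The sketch below maps descriptive flags to the embedding_option values listed above. The function embedding_option_for and its keyword names are hypothetical and exist only for this example.

```python
# Hypothetical helper for illustration; not part of the platform's SDK.
def embedding_option_for(visual=False, non_speech_audio=False, speech=False):
    """Translate the kinds of content you need into embedding_option values."""
    selected = []
    if visual:
        selected.append("visual")
    if non_speech_audio:
        # Non-verbal audio such as music, beeping, or environmental sounds.
        selected.append("audio")
    if speech:
        # The actual words spoken in the video.
        selected.append("transcription")
    if not selected:
        raise ValueError("request at least one type of embedding")
    return selected


# Visual embeddings plus embeddings for what people say.
embedding_option = embedding_option_for(visual=True, speech=True)
print(embedding_option)
```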