Modalities
Modalities represent the sources of information that the platform processes and analyzes in a video. The platform supports the following modalities:
- Visual: Contains actions, objects, events, text (through Optical Character Recognition, or OCR), and brand logos.
- Audio: Contains ambient sounds, music, and human speech.
You specify modalities through different parameters depending on the operation:
- Model options when you create an index
- Search options when you perform a search
- Embedding options when you create embeddings
Model options
When you create an index, you must specify which modalities the platform processes. This determines what information is extracted and indexed from your videos. You can specify:
- `visual`
- `audio`
You can enable one or both model options, depending on your needs.
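As an illustration, an index-creation request could carry the model options as a list of strings. The field names and values below are a hypothetical sketch, not the platform's exact API; consult the API reference for the actual request shape.

```python
# Hypothetical index-creation payload: enabling both model options tells
# the platform to extract and index visual and audio information.
index_request = {
    "index_name": "my-videos",             # hypothetical index name
    "model_options": ["visual", "audio"],  # one or both modalities
}

# Enabling a single modality is equally valid, e.g. for silent footage:
visual_only_request = {
    "index_name": "silent-footage",
    "model_options": ["visual"],
}
```

Keep in mind that the model options you choose here constrain which search options are available later.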
Search options
When you perform a search, you must specify which modalities the video understanding model uses to find relevant information. You can specify:
- `visual`
- `audio`
Notes
- Search options must be a subset of the model options specified when the index was created. For example, if only the `visual` model option is enabled for your index, you cannot search using the `audio` search option.
- You can combine multiple search options with the `operator` parameter to broaden or narrow your search.
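The subset rule in the first note can be expressed as a small check. The helper below is an illustrative sketch (not part of the platform's SDK) of the validation a client could perform before sending a search request.

```python
def validate_search_options(search_options, model_options):
    """Return True if every requested search option was enabled as a
    model option when the index was created (a platform requirement)."""
    return set(search_options) <= set(model_options)


# An index created with only the "visual" model option cannot be
# searched using the "audio" search option:
assert validate_search_options(["visual"], ["visual"])           # allowed
assert not validate_search_options(["audio"], ["visual"])        # rejected
assert validate_search_options(["visual", "audio"],
                               ["visual", "audio"])              # allowed
```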
Embedding options
When you create video embeddings, you must specify the modalities for which the platform returns embeddings. You can specify:
- `visual-text`: Returns visual embeddings optimized for text search.
- `audio`: Returns audio embeddings.
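As with the other operations, an embedding request could list the desired options as strings. The payload below is a hypothetical sketch; the field names and the source URL are assumptions for illustration only.

```python
# Hypothetical embedding-request payload: request both embedding types,
# so the response contains visual embeddings optimized for text search
# as well as audio embeddings.
embedding_request = {
    "video_url": "https://example.com/video.mp4",      # hypothetical source
    "embedding_options": ["visual-text", "audio"],     # one or both
}
```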