Modalities

Modalities represent the sources of information that the platform processes and analyzes in a video. The platform supports the following modalities:

  • Visual: Contains actions, objects, events, text (through Optical Character Recognition, or OCR), and brand logos.
  • Audio: Contains ambient sounds, music, and human speech.

You specify modalities through different parameters depending on the operation:

  • Model options when you create an index
  • Search options when you perform a search
  • Embedding options when you create embeddings

Model options

When you create an index, you must specify which modalities the platform processes. This determines what information is extracted and indexed from your videos. You can specify:

  • visual
  • audio

You can enable one or both model options, depending on your needs.
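To make the shape of such a request concrete, here is a minimal sketch of an index-creation request body. The field names (`index_name`, `model_options`) are illustrative assumptions, not confirmed API details:

```python
import json

# Hypothetical request body for creating an index. The field names are
# illustrative assumptions; consult your platform's API reference for
# the actual parameter names.
create_index_body = {
    "index_name": "my-videos",
    # Enable one or both modalities, depending on your needs:
    "model_options": ["visual", "audio"],
}

print(json.dumps(create_index_body, indent=2))
```

Enabling only the modalities you need keeps indexing focused on the information you intend to search later.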

Search options

When you perform a search, you must specify which modalities the video understanding model uses to find relevant information. You can specify:

  • visual
  • audio

Notes
  • Search options must be a subset of the model options specified when the index was created. For example, if only the visual model option is enabled for your index, you cannot search using the audio search option.
  • You can combine multiple search options with the operator parameter to broaden or narrow your search.
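The subset rule above can be enforced client-side before sending a request. The helper below is an illustrative sketch (the function and its signature are not part of the platform's API):

```python
# Illustrative helper enforcing the rule that search options must be a
# subset of the model options the index was created with.
VALID_MODALITIES = {"visual", "audio"}

def validate_search_options(search_options, index_model_options):
    """Raise ValueError if a search option is invalid or not enabled on the index."""
    unknown = set(search_options) - VALID_MODALITIES
    if unknown:
        raise ValueError(f"Unknown modalities: {sorted(unknown)}")
    unsupported = set(search_options) - set(index_model_options)
    if unsupported:
        raise ValueError(
            f"Index does not support these search options: {sorted(unsupported)}"
        )
    return True

# An index created with only the visual model option accepts visual searches:
validate_search_options(["visual"], ["visual"])

# ...but rejects audio searches:
try:
    validate_search_options(["audio"], ["visual"])
except ValueError as err:
    print(err)
```

A search request would then carry the validated options alongside the `operator` parameter to combine them.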

Embedding options

When you create video embeddings, you must specify the modalities for which the platform returns embeddings. You can specify:

  • visual-text: Returns visual embeddings optimized for text search.
  • audio: Returns audio embeddings.
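As with the other operations, the embedding options are passed as a list in the request. The sketch below assumes hypothetical field names (`video_url`, `embedding_options`) for illustration only:

```python
import json

# Hypothetical request body for creating video embeddings. The field
# names are illustrative assumptions, not confirmed API details.
embed_request_body = {
    "video_url": "https://example.com/video.mp4",
    # Request one or both embedding types:
    "embedding_options": ["visual-text", "audio"],
}

print(json.dumps(embed_request_body, indent=2))
```

Requesting only `visual-text` is useful when you plan to match embeddings against text queries; add `audio` when you also need to compare against sound or speech.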