Video understanding engines
TwelveLabs’ video understanding engines are a family of deep neural networks built on our multimodal foundation model for video understanding. You can use them for the following downstream tasks (a brief usage sketch follows the list):
- Search using natural language queries
- Zero-shot classification
- Text generation from video
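To make the tasks above concrete, here is a minimal sketch of how they might be invoked over the TwelveLabs HTTP API using Python's `requests` library. The base URL, endpoint paths, field names, and placeholder IDs below are illustrative assumptions, not a definitive reference; consult the API documentation for the exact request formats.

```python
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder: key from your TwelveLabs account
BASE_URL = "https://api.twelvelabs.io/v1.2"   # assumption: current API version
HEADERS = {"x-api-key": API_KEY}

# 1. Search an index using a natural-language query.
search = requests.post(
    f"{BASE_URL}/search",
    headers=HEADERS,
    json={
        "index_id": "YOUR_INDEX_ID",                      # placeholder
        "query": "a crowd cheering after a goal",
        "search_options": ["visual", "conversation"],     # assumption: example options
    },
)
print(search.json())

# 2. Zero-shot classification: label videos against classes defined at request time.
classify = requests.post(
    f"{BASE_URL}/classify",                               # assumption: classification endpoint
    headers=HEADERS,
    json={
        "index_id": "YOUR_INDEX_ID",
        "options": ["visual"],
        "classes": [{"name": "sports highlight", "prompts": ["goal celebration"]}],
    },
)
print(classify.json())

# 3. Generate text (here, a summary) from a video.
generate = requests.post(
    f"{BASE_URL}/summarize",                              # assumption: text-generation endpoint
    headers=HEADERS,
    json={"video_id": "YOUR_VIDEO_ID", "type": "summary"},
)
print(generate.json())
```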
Videos contain multiple types of information, including visuals, sounds, spoken words, and text. The human brain combines these types of information, and the relations among them, to comprehend the overall meaning of a scene. For example, imagine you’re watching a muted video of a person jumping and clapping, both visual cues. You might realize they’re happy, but without the sound you can’t tell why. Unmute the video, and you might realize they’re cheering for a soccer team that just scored a goal.
Thus, an application that analyzes a single type of information can’t provide a comprehensive understanding of a video. TwelveLabs’ video understanding engines, by contrast, analyze and combine information from all of these modalities to interpret the meaning of a video holistically and accurately, similar to how humans watch, listen, and read simultaneously to understand videos.
Our video understanding engines can identify, analyze, and interpret a wide variety of elements in a video across all of these modalities.
Engine types
TwelveLabs provides two distinct engine types: embedding and generative. Each serves a unique purpose in multimodal video understanding, as illustrated in the sketch after this list.
- Embedding engines (Marengo): These engines excel at tasks such as search and classification, enabling enhanced video understanding.
- Generative engines (Pegasus): These engines generate text based on your videos.
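As an illustration of how the two engine types work together, the sketch below creates an index that enables both an embedding (Marengo) engine and a generative (Pegasus) engine. The engine names, option values, and endpoint shown are assumptions based on typical usage; check the API reference for the engine versions currently available.

```python
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder
BASE_URL = "https://api.twelvelabs.io/v1.2"   # assumption: current API version

# Create an index that enables both engine types:
#   - Marengo (embedding) powers search and classification
#   - Pegasus (generative) powers text generation
resp = requests.post(
    f"{BASE_URL}/indexes",
    headers={"x-api-key": API_KEY},
    json={
        "index_name": "my-video-index",
        "engines": [
            {
                "engine_name": "marengo2.6",   # assumption: example embedding engine version
                "engine_options": ["visual", "conversation", "text_in_video"],
            },
            {
                "engine_name": "pegasus1.1",   # assumption: example generative engine version
                "engine_options": ["visual", "conversation"],
            },
        ],
    },
)
print(resp.json())  # the response should include the ID of the newly created index
```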
The following engines are available:
The following engines are no longer supported: