Models
TwelveLabs’ video understanding models are a family of deep neural networks built on our multimodal foundation model. You can use them for the following downstream tasks:
- Search using natural language queries
- Analyze videos to generate text
Videos contain multiple types of information, including visuals, sounds, spoken words, and on-screen text. The human brain combines these types of information, and the relations between them, to comprehend the overall meaning of a scene. For example, suppose you’re watching a video of a person jumping and clapping with the sound muted. From the visual cues alone, you can tell they’re happy, but you can’t tell why. With the sound on, you realize they’re cheering for a soccer team that just scored a goal.
Thus, an application that analyzes only a single type of information can’t provide a comprehensive understanding of a video. TwelveLabs’ video understanding models, however, analyze and combine information from all modalities to interpret a video holistically, much as humans watch, listen, and read simultaneously to understand what they see.
Our video understanding models can identify, analyze, and interpret a wide variety of elements, including but not limited to actions, objects, events, on-screen text, sounds, and spoken words.
Available models
TwelveLabs provides two models, Marengo and Pegasus, for different video understanding tasks. This section describes each model and its capabilities to help you choose the one that fits your needs.
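The snippets in this section are minimal sketches that assume the official TwelveLabs Python SDK (`pip install twelvelabs`). The class, method, and field names shown are assumptions that may differ across SDK and API versions; treat them as illustrative and consult the API reference for the exact interface.

```python
# Minimal setup sketch, assuming the `twelvelabs` Python SDK.
# Class and parameter names may differ by SDK version.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="<YOUR_API_KEY>")

# An index groups your videos and binds them to one or more models.
# The model names and option fields below are illustrative.
index = client.index.create(
    name="my-videos",
    models=[
        {"name": "marengo2.7", "options": ["visual", "audio"]},  # search
        {"name": "pegasus1.2", "options": ["visual", "audio"]},  # text generation
    ],
)
print(f"Created index {index.id}")
```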
Marengo
Task type
- Search for specific content in your videos using natural language queries.
- Create video embeddings for downstream tasks.
Use cases
- Find scenes where a person appears, locate brand logos, search for spoken phrases, identify specific actions or objects.
- Build recommendation systems, perform similarity searches, integrate with custom ML pipelines.
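As an illustration of both task types, the sketch below runs a natural-language search and requests a text embedding. It assumes the `client` and `index` from the setup sketch above; `search.query` and `embed.create`, along with their parameters, are assumptions based on the Python SDK and may vary by version.

```python
# Natural-language search over an indexed video collection.
# Assumes `client` and `index` from the setup sketch above.
result = client.search.query(
    index_id=index.id,
    query_text="a player scores a goal and the crowd cheers",
    options=["visual", "audio"],  # modalities to search across
)
for clip in result.data:
    # Each match is a clip: source video, time range, relevance score.
    print(clip.video_id, clip.start, clip.end, clip.score)

# Embeddings for downstream tasks such as similarity search or
# recommendations (model and parameter names are illustrative).
embedding = client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text="a goal celebration",
)
```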
For more information, see the Marengo page.
Pegasus
Task type
- Analyze videos and generate text based on their content.
Use cases
- Create video summaries, generate social media captions, extract key information, identify when events occur in videos.
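For illustration, the sketch below generates a summary and a custom caption for an already-indexed video. It assumes the `client` from the setup sketch; `generate.summarize` and `generate.text` are assumptions based on the Python SDK and may vary by version, and `<VIDEO_ID>` is a placeholder for a video indexed with a Pegasus model.

```python
# Generate text from an indexed video.
# Assumes `client` from the setup sketch; `<VIDEO_ID>` is a video
# indexed with a Pegasus model.
summary = client.generate.summarize(video_id="<VIDEO_ID>", type="summary")
print(summary.summary)

# Open-ended prompting for custom outputs, such as a social media caption.
caption = client.generate.text(
    video_id="<VIDEO_ID>",
    prompt="Write a one-sentence social media caption for this video.",
)
print(caption.data)
```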
For more information, see the Pegasus page.
Supported languages
The platform supports the following languages for processing visual and audio content, understanding queries or prompts, and generating outputs:
- Full support: English
- Partial support: Arabic, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Vietnamese