Video understanding models
Twelve Labs' video understanding models are a family of deep neural networks built on our multimodal foundation model for video understanding. You can use them for the following downstream tasks (a brief sketch of both tasks follows this list):
- Search using natural language queries
- Generate text from video
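To make the two downstream tasks concrete, here is a minimal sketch using the `twelvelabs` Python SDK. The method and parameter names (`client.search.query`, `client.generate.text`, `query_text`, `options`) reflect a recent SDK version and the index and video IDs are placeholders, so treat this as an illustration rather than a definitive integration and verify the calls against the current API reference.

```python
# A minimal sketch, assuming the `twelvelabs` Python SDK and an index that
# already contains an indexed video. Method and parameter names are taken
# from a recent SDK version and may differ in yours.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="<YOUR_API_KEY>")

INDEX_ID = "<YOUR_INDEX_ID>"  # placeholder: an index you created earlier
VIDEO_ID = "<YOUR_VIDEO_ID>"  # placeholder: a video already indexed in it

# Downstream task 1: search using a natural language query.
search_results = client.search.query(
    index_id=INDEX_ID,
    query_text="a crowd cheering after a goal",
    options=["visual", "audio"],  # modalities to search across
)
for clip in search_results.data:
    print(f"video={clip.video_id} start={clip.start}s end={clip.end}s score={clip.score}")

# Downstream task 2: generate text from a video.
summary = client.generate.text(
    video_id=VIDEO_ID,
    prompt="Summarize the key events in this video.",
)
print(summary.data)
```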
Videos contain multiple types of information, including visuals, sounds, spoken words, and on-screen text. The human brain combines all of these types of information, and the relations between them, to comprehend the overall meaning of a scene. For example, suppose you're watching a video of a person jumping and clapping (both visual cues), but the sound is muted. You might realize they're happy, but you can't understand why without the sound. With the sound unmuted, you could realize they're cheering for a soccer team that scored a goal.
Thus, an application that analyzes a single type of information can't provide a comprehensive understanding of a video. Twelve Labs' video understanding models, however, analyze and combine information from all the modalities to accurately interpret the meaning of a video holistically, similar to how humans watch, listen, and read simultaneously to understand videos.
Our video understanding models can identify, analyze, and interpret a variety of elements, including but not limited to the following (a hedged search example across these modalities follows the table):
Element | Modality | Example |
---|---|---|
People, including famous individuals | Visual | Michael Jordan, Steve Jobs |
Actions | Visual | Running, dancing, kickboxing |
Objects | Visual | Cars, computers, stadiums |
Animals or pets | Visual | Monkeys, cats, horses |
Nature | Visual | Mountains, lakes, forests |
Text displayed on the screen (OCR) | Visual | License plates, handwritten words, number on a player's jersey |
Brand logos | Visual | Nike, Starbucks, Mercedes |
Shot techniques and effects | Visual | Aerial shots, slow motion, time-lapse |
Counting objects | Visual | Number of people in a crowd, items on a shelf, vehicles in traffic |
Sounds | Audio | Chirping (birds), applause, fireworks popping or exploding |
Human speech | Audio | "Good morning. How may I help you?" |
Music | Audio | Ominous music, whistling, lyrics |
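The Modality column above maps onto the modalities you can restrict a search to. The sketch below, again assuming the `twelvelabs` SDK, runs one visual-only query and one audio-only query; the accepted `options` values depend on the models enabled on your index, so confirm them in the API reference.

```python
# A minimal sketch of modality-restricted search, assuming the `twelvelabs`
# SDK. The `options` values mirror the Modality column above; the exact
# accepted values depend on the models enabled on your index.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="<YOUR_API_KEY>")
INDEX_ID = "<YOUR_INDEX_ID>"  # placeholder

# Visual-only query: on-screen text (OCR), logos, objects, actions.
visual_hits = client.search.query(
    index_id=INDEX_ID,
    query_text="a player wearing jersey number 23",
    options=["visual"],
)

# Audio-only query: speech, music, and ambient sound.
audio_hits = client.search.query(
    index_id=INDEX_ID,
    query_text="applause over an ominous soundtrack",
    options=["audio"],
)

# Result attributes (video_id, start, end, score) are assumptions based on a
# recent SDK version.
for clip in list(visual_hits.data) + list(audio_hits.data):
    print(clip.video_id, round(clip.start, 1), round(clip.end, 1), clip.score)
```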
Model types
Twelve Labs provides two distinct model types, embedding and generative, each serving a unique purpose in multimodal video understanding.
- Embedding models (Marengo): These models are proficient at tasks such as search and classification, enabling enhanced video understanding.
- Generative models (Pegasus): These models generate text based on your videos.
The following models are available (a sketch showing how to reference them by name follows the table):
Name | Features | Description |
---|---|---|
Marengo2.7 | Search | This version of the Marengo video understanding model improves accuracy and performance in the following areas: - Multimodal processing that combines visual, audio, and text elements. - Fine-grained image-to-video search: detects brand logos, text, and small objects (as small as 10% of the video frame). - Improved motion search capability. - Counting capabilities. - More nuanced audio comprehension: music, lyrics, sound, and silence. For more details on the new features and improvements in this version, refer to the blog post Introducing Marengo 2.7: Pioneering Multi-Vector Embeddings for Advanced Video Understanding. |
Pegasus1.1 | Video-to-text generation | The 1.1 version of the Pegasus video understanding model provides the following enhancements compared to 1.0: - Improved model accuracy: Enhanced video description and question-answering capabilities, delivering more precise and relevant results. - Fine-grained visual understanding and instruction following: Improved ability to analyze and interpret visual content, enabling more detailed insights. - Streaming support: The ability to generate open-ended text with real-time access to each token as it's generated allows for faster processing and use of the generated content. For details, refer to the Streaming responses page. - Increased maximum prompt length: Expanded from 300 to 1500 characters, allowing for the integration of more complex business logic and detailed examples in prompts. - Extended video duration: The maximum duration of the videos you can upload has been increased from 20 to 30 minutes, allowing for analysis of longer content. |
Marengo-retrieval-2.7 | Embeddings | This version of the Marengo video understanding model creates embeddings that you can use in various downstream tasks. For details, see the Create embeddings section. |
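As a rough illustration of how these model names are used, the sketch below creates an index that enables Marengo 2.7 and Pegasus 1.1, then requests a text embedding from Marengo-retrieval-2.7. The payload shape for index creation (for example, whether the key is `models` or `engines`) and the attributes on the embedding response have varied across API versions, so the field names here are assumptions.

```python
# A minimal sketch of referencing the models by name, assuming the
# `twelvelabs` SDK. The index-creation payload and the embedding response
# attributes are assumptions based on a recent API version; check the
# current API reference before relying on them.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="<YOUR_API_KEY>")

# Create an index that enables the search model (Marengo 2.7) and the
# video-to-text model (Pegasus 1.1).
index = client.index.create(
    name="my-video-index",
    models=[
        {"name": "marengo2.7", "options": ["visual", "audio"]},
        {"name": "pegasus1.1", "options": ["visual", "audio"]},
    ],
)
print("Created index:", index.id)

# Request a text embedding from Marengo-retrieval-2.7, e.g. for use in an
# external vector database.
embedding = client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text="a time-lapse of a city skyline at night",
)
first_segment = embedding.text_embedding.segments[0]  # assumed response shape
print(first_segment.embeddings_float[:8])
```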
The following models are no longer supported:
Name | Features | Notes |
---|---|---|
Pegasus1.0 | Video-to-text generation | Effective July 8, 2024, Pegasus 1.0 is no longer supported. All existing indexes created with Pegasus 1.0 will be automatically upgraded to Pegasus 1.1. No manual intervention is required for this migration process, and all indexes will utilize Pegasus 1.1 upon completion. |