TwelveLabs’ video understanding models consist of a family of deep neural networks built on our multimodal foundation model for video understanding that you can use for the following downstream tasks:

Search using natural language queries
Generate text from video.

Videos contain multiple types of information, including visuals, sounds, spoken words, and texts. The human brain combines all types of information and their relations with each other to comprehend the overall meaning of a scene. For example, you’re watching a video of a person jumping and clapping, both visual cues, but the sound is muted. You might realize they’re happy, but you can’t understand why they’re happy without the sound. However, if the sound is unmuted, you could realize they’re cheering for a soccer team that scored a goal.

Thus, an application that analyzes a single type of information can’t provide a comprehensive understanding of a video. TwelveLabs’ video understanding models, however, analyze and combine information from all the modalities to accurately interpret the meaning of a video holistically, similar to how humans watch, listen, and read simultaneously to understand videos.

Our video understanding models have the ability to identify, analyze, and interpret a variety of elements, including but not limited to the following:

Element	Modality	Example
People, including famous individuals	Visual	Michael Jordan, Steve Jobs
Actions	Visual	Running, dancing, kickboxing
Objects	Visual	Cars, computers, stadiums
Animals or pets	Visual	Monkeys, cats, horses
Nature	Visual	Mountains, lakes, forests
Text displayed on the screen (OCR)	Visual	License plates, handwritten words, number on a player’s jersey
Brand logos	Visual	Nike, Starbucks, Mercedes
Shot techniques and effects	Visual	Aerial shots, slow motion, time-lapse
Counting objects	Visual	Number of people in a crowd, items on a shelf, vehicles in traffic
Sounds	Audio	Chirping (birds), applause, fireworks popping or exploding
Human speech	Audio	”Good morning. How may I help you?”
Music	Audio	Ominous music, whistling, lyrics

Model types

TwelveLabs provides two distinct model types - embedding and generative, each serving unique purposes in multimodal video understanding.

Embedding models (Marengo): These models are proficient at performing tasks such as search and classification, enabling enhanced video understanding.
Generative models (Pegasus): These models generate text based on your videos.

The following models are available:

Name	Features	Description
Marengo2.7	Search	This version of the Marengo video understanding models improves accuracy and performance in the following areas: - Multimodal processing that combines visual, audio, and text elements. - Fine-grained image-to-video search: detect brand logos, text, and small objects (as small as 10% of the video frame). - Improvement in motion search capability. - Counting capabilities. - More nuanced audio comprehension: music, lyrics, sound, and silence. For more details on the new features and improvements in this version, refer to this blog post: Introducing Marengo 2.7: Pioneering Multi-Vector Embeddings for Advanced Video Understanding.
Marengo-retrieval-2.7	Embeddings	This version of the Marengo video understanding model creates embeddings that you can use in various downstream tasks. For details, see the Create embeddings section.
Pegasus 1.2	Video-to-text generation	Pegasus 1.2 offers significant improvements over Pegasus 1.1: - Extended video processing capacity from 30 minutes to 1 hour per video. - The maximum length of a prompt is 2,000 tokens. - Enhanced performance across video-language tasks compared to Pegasus 1.1 and other models of the same size. - More granular visual comprehension of objects, on-screen text, and numerical content. - More accurate temporal grounding and timestamp identification. For example, you can ask questions about the timestamps of certain events.During the preview phase, the model is available only for new and sample indexes. . For more details on the new features and improvements in this version, refer to this blog post: Introducing Pegasus 1.2: An Industry-Grade Video Language Model for Scalable Applications. . For support or feedback, contact support@twelvelabs.io.