Twelve Labs' video understanding engines are a family of deep neural networks built on our multimodal foundation model for video understanding. You can use them for the following downstream tasks:

Search using natural language queries
Zero-shot classification
Text generation from video

Videos contain multiple types of information, including visuals, sounds, spoken words, and text. The human brain combines all of these, along with the relations between them, to comprehend the overall meaning of a scene. For example, suppose you're watching a video of a person jumping and clapping, both visual cues, with the sound muted. You might realize they're happy, but without the sound you can't tell why. With the sound on, you could realize they're cheering for a soccer team that just scored a goal.

Thus, an application that analyzes a single type of information can't provide a comprehensive understanding of a video. Twelve Labs' video understanding engines, however, analyze and combine information from all the modalities to interpret the meaning of a video as a whole, similar to how humans watch, listen, and read simultaneously to understand videos.

Our video understanding engines can identify, analyze, and interpret a variety of elements, including but not limited to the following:

Element	Modality	Example
People, including famous individuals	Visual	Michael Jordan, Steve Jobs
Actions	Visual	Running, dancing, kickboxing
Objects	Visual	Cars, computers, stadiums
Animals or pets	Visual	Monkeys, cats, horses
Nature	Visual	Mountains, lakes, forests
Sounds (excluding human speech)	Visual	Chirping (birds), applause, fireworks popping or exploding
Human speech	Conversation	"Good morning. How may I help you?"
Text displayed on the screen (OCR)	Text in video	License plates, handwritten words, number on a player's jersey
Brand logos	Logo	Nike, Starbucks, Mercedes

Engine Types

Twelve Labs provides two distinct engine types, embedding and generative, each serving a different purpose in multimodal video understanding.

Embedding engines (Marengo): These engines perform tasks such as search and classification.
Generative engines (Pegasus): These engines generate text based on your videos.
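
To make the role of an embedding engine more concrete, the following is a minimal sketch of how embedding-based search and zero-shot classification work in general. It does not call the Twelve Labs API and does not show how Marengo is implemented. The function names, the clip and label names, and the three-dimensional vectors are all hypothetical and invented for illustration; a real engine produces much higher-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Sort (name, score) pairs from most to least similar to query_vec."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; these are NOT real engine outputs.
clip_embeddings = {
    "clip_soccer_goal": [0.9, 0.1, 0.0],
    "clip_cooking_pasta": [0.0, 0.2, 0.9],
    "clip_crowd_cheering": [0.7, 0.6, 0.1],
}

# Search: embed the text query in the same space, then rank the clips.
# This vector stands in for a query such as "a team scoring a goal".
query_embedding = [1.0, 0.2, 0.0]
search_results = rank_by_similarity(query_embedding, clip_embeddings)

# Zero-shot classification: compare one clip against label embeddings
# and pick the closest label, without training on labeled videos.
label_embeddings = {
    "sports": [1.0, 0.3, 0.0],
    "food": [0.0, 0.1, 1.0],
}
predicted_label = rank_by_similarity(
    clip_embeddings["clip_cooking_pasta"], label_embeddings
)[0][0]

for name, score in search_results:
    print(f"{name}: {score:.3f}")
print("predicted label:", predicted_label)
```

In this sketch, search and classification reduce to the same operation: both compare vectors in a shared space, which is why a single embedding engine can serve both tasks.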

The following engines are available:

Name	Features	Description
Marengo2.6	Search and classification	This version of the Marengo video understanding engine provides the following main features: (1) Expanded multimodal capabilities: Marengo supports any-to-any retrieval tasks, including text-to-video, text-to-image, text-to-audio, audio-to-video, and image-to-video. Note, however, that the platform currently supports only the text-to-video search and classification features; other modalities will be supported in a future release. (2) Enhanced temporal localization: a newly introduced reranker model improves temporal localization, which yields more precise search results.
Pegasus1.0	Video-to-text generation	This version of the Pegasus video understanding engine provides the following main features: (1) Precise, detailed, and holistic descriptions: Pegasus supports fine-grained video descriptions and summaries. (2) Question answering capabilities: Pegasus allows you to frame your prompts as questions. Example: "What are the key takeaways of the video?" (3) Diverse video-to-text capabilities: the model uses a multimodal approach that analyzes the whole context of a video, including visuals, sounds, spoken words, and text, and their relationships with one another. This allows the creation of diverse textual outputs, from marketing materials to specialized reports.