Marengo

Marengo is an embedding model for comprehensive video understanding. The current version is Marengo 3.0.

Marengo analyzes multiple modalities in video content, including visuals, audio, and text, to provide a holistic understanding similar to human comprehension.

Key features

  • Multimodal processing: Combines visual, audio, and text elements for comprehensive understanding.
  • Fine-grained search: Detects brand logos, text, and small objects (as small as 10% of the video frame).
  • Motion search: Identifies and analyzes movement within videos.
  • Counting capabilities: Accurately counts objects in video frames.
  • Audio comprehension: Analyzes music, lyrics, sound, and silence.
  • Composed text and image search: Combine text descriptions with images in a single search query for more precise results.
  • Improved cinematography understanding: Enhanced search performance for cinematography terms like zoom, pan, and tracking shot.
  • Sports intelligence: Improved recognition of soccer and basketball actions. Support for baseball, ice hockey, and American football.
  • Faster indexing: Significant performance improvement with the new indexing technology.
  • Extended text processing: Maximum text length increased from 77 to 500 tokens for both search queries and text embeddings.
  • Optimized embeddings: 512-dimensional embeddings for faster processing and reduced storage.
  • Long content support: Process up to four hours of video and audio content while maintaining context.
  • Expanded language support: Query videos in 36 languages plus English (up from 12 plus English).
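
As a quick illustration of the search capability, the sketch below sends a text query over HTTP with Python's requests library. This is a minimal sketch, not the definitive integration: the endpoint version path, header name, request fields, and response shape are assumptions based on the public API reference, and the API key and index ID are placeholders.

```python
# A minimal text-search sketch. The endpoint path, header, and field names
# are assumptions; consult the API reference for the exact request shape.
import requests

API_KEY = "YOUR_API_KEY"    # placeholder
INDEX_ID = "YOUR_INDEX_ID"  # placeholder

resp = requests.post(
    "https://api.twelvelabs.io/v1.3/search",  # version path is an assumption
    headers={"x-api-key": API_KEY},
    json={
        "index_id": INDEX_ID,
        "query_text": "polar bear holding a Coca-Cola bottle",
        "search_options": ["visual", "audio"],  # option names are assumptions
    },
)
resp.raise_for_status()
for clip in resp.json().get("data", []):
    # Each match is expected to carry a video ID, a relevance score, and the
    # start/end offsets (in seconds) of the matching fragment.
    print(clip.get("video_id"), clip.get("score"), clip.get("start"), clip.get("end"))
```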

Use cases

  • Search: Use text, images, video clips, or audio to find specific content. The model supports any-to-any search across multiple modalities.
  • Embeddings: Create video embeddings for various downstream applications.
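
To make the embeddings use case concrete, here is a minimal, self-contained sketch of one common downstream application: ranking stored video-segment vectors against a query vector by cosine similarity. The 512-dimensional size matches the "Optimized embeddings" feature above; the vectors themselves are placeholders that you would normally obtain from the platform and keep in a vector store.

```python
# Ranking stored 512-dimensional segment embeddings against a query embedding
# by cosine similarity. The vectors below are placeholders, not real output.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings (512 floats each, matching Marengo's vector size).
query_vec = [0.03] * 512
segments = {
    "clip_001": [0.03] * 256 + [0.01] * 256,
    "clip_002": [-0.02] * 512,
}

ranked = sorted(segments, key=lambda name: cosine_similarity(query_vec, segments[name]),
                reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query_vec, segments[name]), 4))
```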

Input requirements

The specifications on this page reflect the maximum capabilities of the model. Your actual requirements depend on the upload method and operation you choose. For details about the available upload methods and the corresponding limits, see the Upload and processing methods page.

Video file requirements

  • Duration: 4 sec to 4 hours
  • File size: ≤ 4 GB
  • Resolution: 360x360 to 5184x2160
  • Aspect ratio: Between 1:2.4 and 2.4:1, inclusive. For example, you can use 1:1, 4:3, 4:5, 5:4, 16:9, 9:16, or 17:9.
  • Formats: Any format that FFmpeg supports

Notes
  • If you upload files using publicly accessible URLs, use direct links to raw video files that play without user interaction or custom video players (example: https://example.com/videos/sample-video.mp4). Video hosting platforms like YouTube and cloud storage sharing links are not supported.

  • When using Marengo 3.0, audio and video stream durations must not differ by more than 0.5 seconds (see the validation sketch after these notes).

  • For videos in other formats or if you require different options, contact us at support@twelvelabs.io.
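
The requirements and notes above translate directly into a pre-upload check. Below is a minimal, unofficial sketch that uses ffprobe (shipped with FFmpeg) to verify a file against the limits on this page. The reading of the resolution range for portrait videos is an assumption.

```python
# An unofficial pre-upload check for the video limits above, using ffprobe.
import json
import os
import subprocess

def probe(path: str) -> dict:
    """Return container and stream metadata as parsed JSON."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_streams", "-show_format", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

def validate(path: str) -> None:
    info = probe(path)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    audio = next((s for s in info["streams"] if s["codec_type"] == "audio"), None)

    duration = float(info["format"]["duration"])
    assert 4 <= duration <= 4 * 3600, "duration must be between 4 sec and 4 hours"

    assert os.path.getsize(path) <= 4 * 1024**3, "file must be 4 GB or smaller"

    w, h = video["width"], video["height"]
    # Interpreting 360x360 to 5184x2160 as a 360-pixel minimum per side and a
    # 5184x2160 pixel budget (so portrait equivalents pass); this reading is
    # an assumption.
    assert min(w, h) >= 360 and w * h <= 5184 * 2160, "resolution out of range"

    assert 1 / 2.4 <= w / h <= 2.4, "aspect ratio must be between 1:2.4 and 2.4:1"

    # Marengo 3.0: audio and video stream durations may differ by at most 0.5 s.
    if audio is not None and "duration" in audio and "duration" in video:
        drift = abs(float(video["duration"]) - float(audio["duration"]))
        assert drift <= 0.5, "audio/video durations differ by more than 0.5 s"

validate("sample-video.mp4")
```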

Image file requirements

  • Formats: JPEG, PNG
  • Minimum size: 128x128 pixels
  • Maximum file size: 5 MB

Audio file requirements

  • Formats: WAV (uncompressed), MP3 (lossy), and FLAC (lossless)
  • Maximum duration: 4 hours
  • Maximum file size: 4 GB

Text input requirements

  • Maximum length: 500 tokens

Supported languages

Arabic, Bengali, Chinese (Simplified), Croatian, Cusco Quechua, Czech, Danish, Dutch, English, Farsi, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Maori, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish, Telugu, Thai, Turkish, Ukrainian, and Vietnamese.

Examples

The examples in this section are from the Playground, but the same principles apply when you invoke the API programmatically.

Steve Jobs introducing the iPhone

In the example screenshot below, the query was “How did Steve Jobs introduce the iPhone?”. The Marengo video understanding model used information from the visual and conversational audio modalities to perform the following tasks:

  • Visual recognition of a famous person (Steve Jobs)
  • Joint speech and visual recognition to semantically search for the moment when Steve Jobs introduced the iPhone. Note that semantic search finds information based on the intended meaning of the query rather than its literal words: the platform identified the matching video fragments even though Steve Jobs never explicitly said the words in the query.

Polar bear holding a Coca-Cola bottle

In the example screenshot below, the query was “Polar bear holding a Coca-Cola bottle.” The Marengo video understanding model used information found in the visual and logo modalities to perform the following tasks:

  • Recognition of a cartoon character (polar bear)
  • Identification of an object (bottle)
  • Detection of a specific brand logo (Coca-Cola)
  • Identification of an action (polar bear holding a bottle)

To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.

Using different languages

This section provides examples of using different languages to perform search requests.

Spanish

In the example screenshot below, the query was “¿Cómo presentó Steve Jobs el iPhone?” (“How did Steve Jobs introduce the iPhone?”). The Marengo video understanding model used information from the visual and audio modalities.

Chinese

In the example screenshot below, the query was “猫做有趣的事情” (“Cats doing funny things.”). The Marengo video understanding model used information from the visual modality.

To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.
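
As these examples show, a non-English query changes nothing about the request itself except the query text. The sketch below mirrors the earlier hedged search example; the endpoint path and field names remain assumptions, and the API key and index ID are placeholders.

```python
# Same hedged request shape as the earlier search sketch; only the query
# text changes. Endpoint and field names remain assumptions.
import requests

resp = requests.post(
    "https://api.twelvelabs.io/v1.3/search",    # version path is an assumption
    headers={"x-api-key": "YOUR_API_KEY"},      # placeholder
    json={
        "index_id": "YOUR_INDEX_ID",            # placeholder
        "query_text": "猫做有趣的事情",          # "Cats doing funny things."
        "search_options": ["visual"],           # option name is an assumption
    },
)
print(resp.json())
```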

Support

For support or feedback regarding Marengo, contact support@twelvelabs.io.