Marengo
Marengo is an embedding model for comprehensive video understanding. The current version is Marengo 2.7.
Marengo analyzes multiple modalities in video content, including visuals, audio, and text, to provide a holistic understanding similar to human comprehension.
Key features
- Multimodal processing: Combines visual, audio, and text elements for comprehensive understanding
- Fine-grained search: Detects brand logos, text, and small objects (as small as 10% of the video frame)
- Motion search: Identifies and analyzes movement within videos
- Counting capabilities: Accurately counts objects in video frames
- Audio comprehension: Analyzes music, lyrics, sound, and silence
Use cases
- Search: Use natural language queries to find specific content within videos
- Embeddings: Create video embeddings for various downstream applications
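The search use case is illustrated with concrete queries in the Examples section below. For the embeddings use case, the sketch that follows shows one plausible shape of a request that creates video embeddings. It is a minimal sketch only: the endpoint path, the model_name value, and the field names are assumptions based on common task-based API patterns, so verify them against the API reference.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; use your own key
BASE_URL = "https://api.twelvelabs.io/v1.3"  # assumed API version

# Assumed endpoint and field names; verify against the API reference.
# Creating video embeddings is modeled here as an asynchronous task.
task = requests.post(
    f"{BASE_URL}/embed/tasks",
    headers={"x-api-key": API_KEY},
    data={
        "model_name": "Marengo-retrieval-2.7",        # assumed model identifier
        "video_url": "https://example.com/video.mp4",  # publicly reachable video
    },
)
task.raise_for_status()
task_id = task.json()["_id"]  # assumed response field

# Poll the task until it completes, then fetch the embedding vectors.
status = requests.get(
    f"{BASE_URL}/embed/tasks/{task_id}", headers={"x-api-key": API_KEY}
)
print(status.json())  # expected to include per-segment embedding vectors
```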
Examples
This section contains examples of using the Marengo video understanding model.
Steve Jobs introducing the iPhone
In the example screenshot below, the query was “How did Steve Jobs introduce the iPhone?”. The Marengo video understanding model used information found in the visual and conversation modalities to perform the following tasks:
- Visual recognition of a famous person (Steve Jobs)
- Joint speech and visual recognition to semantically search for the moment when Steve Jobs introduced the iPhone. Note that semantic search matches the intended meaning of the query rather than its literal words, so the platform identified the matching video fragments even though Steve Jobs didn't say those exact words.
To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.
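A request reproducing this search might look like the sketch below. The endpoint, the field names, and the "visual"/"conversation" search options are assumptions modeled on the modalities this example describes; check the API reference for the exact names.

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder; use your own key
INDEX_ID = "YOUR_INDEX_ID"  # placeholder; an index backed by Marengo

# Assumed endpoint and field names; verify against the API reference.
response = requests.post(
    "https://api.twelvelabs.io/v1.3/search",
    headers={"x-api-key": API_KEY},
    data={
        "index_id": INDEX_ID,
        "query_text": "How did Steve Jobs introduce the iPhone?",
        # Restrict matching to the modalities this example relies on.
        "search_options": ["visual", "conversation"],
    },
)
response.raise_for_status()

# Each match identifies a video and the time range where the query matched.
for clip in response.json().get("data", []):
    print(clip["video_id"], clip["start"], clip["end"], clip["score"])
```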
Polar bear holding a Coca-Cola bottle
In the example screenshot below, the query was “Polar bear holding a Coca-Cola bottle.” The Marengo video understanding model used information found in the visual and logo modalities to perform the following tasks:
- Recognition of a cartoon character (polar bear)
- Identification of an object (bottle)
- Detection of a specific brand logo (Coca-Cola)
- Identification of an action (polar bear holding a bottle)
To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.
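Fine-grained matches such as logo detections are easiest to work with when you keep only the strongest clips. The helper below is a hypothetical sketch that assumes a search response shaped like the one in the earlier example (a data list whose entries carry score, start, end, and video_id fields); the threshold is arbitrary.

```python
def strong_matches(search_response: dict, min_score: float = 80.0) -> list[dict]:
    """Keep only clips whose relevance score clears the threshold.

    Assumes the response shape {"data": [{"score": ..., "start": ...,
    "end": ..., "video_id": ...}, ...]} sketched in the earlier example.
    """
    return [
        clip for clip in search_response.get("data", [])
        if clip["score"] >= min_score
    ]

# Example with a hand-built response in the assumed shape:
sample = {"data": [{"video_id": "abc", "start": 12.0, "end": 18.5, "score": 92.4}]}
for clip in strong_matches(sample):
    print(f"{clip['video_id']}: {clip['start']}s-{clip['end']}s (score {clip['score']})")
```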
Using different languages
This section provides examples of using different languages to perform search requests.
Spanish
In the example screenshot below, the query was “¿Cómo presentó Steve Jobs el iPhone?” (“How did Steve Jobs introduce the iPhone?”). The Marengo video understanding model used information from the visual and audio modalities.
To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.
Chinese
In the example screenshot below, the query was “猫做有趣的事情” (“Cats doing funny things.”). The Marengo video understanding model used information from the visual modality.
To see this example in the Playground, ensure you’re logged in, and then open this URL in your browser.
French
In the example screenshot below, the query was “J’ai trouvé la solution” (“I found the solution.”). The Marengo video understanding model used information from the visual modality (text displayed on the screen).
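As these examples show, queries in different languages go through the same search call; in the sketch below, only the query text changes and no language parameter is involved. The endpoint and field names are the same assumptions as in the earlier sketches.

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder; use your own key
INDEX_ID = "YOUR_INDEX_ID"  # placeholder; an index backed by Marengo

queries = [
    "¿Cómo presentó Steve Jobs el iPhone?",  # Spanish
    "猫做有趣的事情",                          # Chinese
    "J’ai trouvé la solution",               # French
]

# Assumed endpoint and field names; verify against the API reference.
for query in queries:
    response = requests.post(
        "https://api.twelvelabs.io/v1.3/search",
        headers={"x-api-key": API_KEY},
        data={
            "index_id": INDEX_ID,
            "query_text": query,  # query text in the target language
            "search_options": ["visual", "audio"],
        },
    )
    response.raise_for_status()
    print(query, "->", len(response.json().get("data", [])), "matches")
```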
Support
For support or feedback regarding Marengo, contact support@twelvelabs.io.