Create embeddings

Note

Version 2.7 of the Marengo video understanding model generates embeddings that are incompatible with those from v2.6, which will be discontinued. If you are using v2.6 embeddings, regenerate them using v2.7.

Use the Embed API to create multimodal embeddings for videos, text, images, and audio files. These embeddings are contextual vector representations that capture interactions between modalities, such as visual expressions, body language, spoken words, and the overall context of a video. You can apply them to downstream tasks such as training custom multimodal models for anomaly detection, diversity sorting, sentiment analysis, or recommendations, or building Retrieval-Augmented Generation (RAG) systems.
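The request below is a minimal sketch of creating a video embedding over HTTP. The base URL, endpoint path, auth header, field names, and model identifier are illustrative assumptions rather than the documented schema; consult the API reference for the exact request format.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: key-based auth header
BASE_URL = "https://api.example.com/v1"  # placeholder base URL

# Assumed request shape: one video URL embedded with an assumed model name.
response = requests.post(
    f"{BASE_URL}/embed",
    headers={"x-api-key": API_KEY},
    json={
        "model_name": "Marengo-2.7",  # assumed model identifier
        "video_url": "https://example.com/clip.mp4",
    },
    timeout=60,
)
response.raise_for_status()
result = response.json()  # assumption: the response body carries the vector(s)
```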

Key features:

  • Native multimodal support: Process all modalities natively without separate models or frame conversion.
  • State-of-the-art performance: Captures motion and temporal information for accurate video interpretation.
  • Unified vector space: Combines embeddings from different modalities for holistic understanding.
  • Fast and reliable: Reduces processing time for large video sets.
  • Flexible segmentation: Generate embeddings for individual video segments or for the entire video (see the sketch after this list).
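As a hedged illustration of segment-level output, the snippet below iterates over a hypothetical response. The `segments`, `start_offset_sec`, `end_offset_sec`, and `embedding` keys are assumptions about the response shape, used here only to show the idea of per-segment vectors alongside a whole-video vector.

```python
# Hypothetical response shape; all key names are assumptions for illustration.
embedding_response = {
    "video_embedding": [0.12, -0.03, 0.44],  # stand-in for a real vector
    "segments": [
        {"start_offset_sec": 0.0, "end_offset_sec": 6.0, "embedding": [0.10, 0.20, 0.30]},
        {"start_offset_sec": 6.0, "end_offset_sec": 12.0, "embedding": [0.25, 0.15, 0.40]},
    ],
}

for seg in embedding_response["segments"]:
    print(
        f"{seg['start_offset_sec']:.1f}s-{seg['end_offset_sec']:.1f}s: "
        f"{len(seg['embedding'])}-dim vector"
    )
```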

Use cases:

  • Anomaly detection: Identify unusual patterns, such as corrupt videos with black backgrounds, to improve dataset quality.
  • Diversity sorting: Organize data for broad representation, reducing bias and improving AI model training.
  • Sentiment analysis: Combine vocal tone, facial expressions, and spoken language for accurate insights, which is particularly useful for customer service.
  • Recommendations: Use embeddings in similarity-based retrieval and ranking systems to power recommendations, as sketched below.
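The sketch below shows how embedding vectors can drive the use cases above, independent of any particular API: cosine similarity ranks a catalog against a query embedding for recommendations, and the lowest-scoring items can be flagged as potential anomalies. The 1024-dimensional random vectors are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, 1024))  # stand-ins for stored video embeddings
query = rng.normal(size=1024)           # stand-in for a query embedding

# Cosine similarity of every catalog item against the query.
scores = (catalog @ query) / (
    np.linalg.norm(catalog, axis=1) * np.linalg.norm(query)
)

ranked = np.argsort(scores)[::-1]
print("Top 5 recommendation candidates:", ranked[:5])
print("5 least similar items (anomaly candidates):", ranked[-5:])
```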

To understand how your usage is measured and billed, see the Pricing page.

Note

The platform can generate embeddings for text, audio, and image content types individually or in any combination within a single API call.
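As a hedged sketch of such a combined call, the request below sends text, an image, and an audio file in one payload. As before, the base URL, auth header, field names, and model identifier are assumptions, not the documented schema.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: key-based auth header
BASE_URL = "https://api.example.com/v1"  # placeholder base URL

# Assumed request shape: multiple modalities combined in a single call.
response = requests.post(
    f"{BASE_URL}/embed",
    headers={"x-api-key": API_KEY},
    json={
        "model_name": "Marengo-2.7",  # assumed model identifier
        "text": "a goalkeeper making a diving save",
        "image_url": "https://example.com/frame.jpg",
        "audio_url": "https://example.com/clip.wav",
    },
    timeout=60,
)
response.raise_for_status()
```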