Create embeddings | TwelveLabs

Use the platform to create multimodal embeddings for video, text, image, and audio content. These embeddings are contextual vector representations that capture interactions between modalities, such as visual expressions, body language, spoken words, and video context. You can apply them to downstream tasks, such as training custom multimodal models for anomaly detection, diversity sorting, sentiment analysis, and recommendations, or building Retrieval-Augmented Generation (RAG) systems.

Key features:

Native multimodal support: Process all modalities natively without separate models or frame conversion.
State-of-the-art performance: Captures motion and temporal information for accurate video interpretation.
Unified vector space: Combines embeddings from different modalities for holistic understanding.
Fast and reliable: Reduces processing time for large video sets.
Flexible segmentation: Generate embeddings for video segments or the entire video.

Use cases:

Anomaly detection: Identify unusual patterns, such as corrupt videos with black backgrounds, to improve dataset quality.
Diversity sorting: Organize data for broad representation, reducing bias and improving AI model training.
Sentiment analysis: Combine vocal tone, facial expressions, and spoken language for accurate insights, which is particularly useful for customer service.
Recommendations: Use embeddings in similarity-based retrieval and ranking systems for recommendations.
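
As a rough illustration of two of the use cases above, the following sketch ranks items by cosine similarity to a query embedding (recommendations) and flags the item least similar to the mean vector (a simple anomaly heuristic). It assumes you have already retrieved the embeddings as plain lists of floats. The function names, the toy three-dimensional vectors, and the centroid heuristic are illustrative assumptions, not part of the platform's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, catalog, top_k=3):
    """Return the top_k (item_id, score) pairs, most similar first."""
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def least_typical(catalog, top_k=1):
    """Return the top_k item ids least similar to the mean vector."""
    vectors = list(catalog.values())
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scored = [(item_id, cosine_similarity(vec, centroid))
              for item_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1])  # lowest similarity first
    return [item_id for item_id, _ in scored[:top_k]]


# Hypothetical toy vectors; real embeddings have far more dimensions.
catalog = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.8, 0.2, 0.1],
    "clip_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

print(rank_by_similarity(query, catalog, top_k=2))  # clip_a first, then clip_b
print(least_typical(catalog))  # clip_c is the candidate anomaly
```

For large collections, a vector database or an approximate nearest-neighbor index would likely replace the linear scan shown here.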

Retention policy

Embeddings created through the async endpoints (/embed-v2/tasks) are stored for seven days. After this period, you must recreate the embedding task to obtain the results again.
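
To avoid requesting results that are no longer stored, you might track when each task was created and check it against the seven-day window. The sketch below is one way to do that on the client side. It assumes the window is counted from the task's creation time and that you record that timestamp yourself; both are assumptions, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # retention window stated above


def expires_at(created_at):
    """Time after which stored embeddings are assumed to be unavailable."""
    return created_at + RETENTION


def needs_recreation(created_at, now=None):
    """True if the retention window has elapsed for this task."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at(created_at)


# Hypothetical creation timestamp recorded by the caller.
created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)

print(expires_at(created).isoformat())
print(needs_recreation(created, now=datetime(2025, 1, 5, tzinfo=timezone.utc)))
print(needs_recreation(created, now=datetime(2025, 1, 9, tzinfo=timezone.utc)))
```

If you need the vectors beyond the retention window, persisting them in your own storage soon after the task completes avoids recreating the task.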

For details on how your usage is measured and billed, see the Pricing page.

Create video embeddings

Create text embeddings

Create audio embeddings

Create single image embeddings

Create text and image embeddings