Use the platform to create multimodal embeddings for video, text, image, and audio content. These embeddings are contextual vector representations that capture interactions between modalities, such as visual expressions, body language, spoken words, and the surrounding video context. You can use them in downstream tasks such as training custom multimodal models (for anomaly detection, diversity sorting, sentiment analysis, or recommendations) or building Retrieval-Augmented Generation (RAG) systems.
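As an illustration of a downstream task, the sketch below ranks stored embeddings by cosine similarity to a query embedding, which is the core retrieval step in a recommendation or RAG pipeline. It does not call the platform: the vectors are made-up three-dimensional toy values, and the names `cosine_similarity`, `rank_by_similarity`, `clip_a`, `clip_b`, and `clip_c` are hypothetical. Real embeddings have many more dimensions, and a production system would typically use a vector database rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Sort (name, score) pairs from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumption): stand-ins for embeddings retrieved from the platform.
query_embedding = [0.9, 0.1, 0.0]
stored_embeddings = {
    "clip_a": [0.8, 0.2, 0.1],
    "clip_b": [0.0, 0.1, 0.9],
    "clip_c": [0.5, 0.5, 0.0],
}

ranking = rank_by_similarity(query_embedding, stored_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity is used here because it compares direction rather than magnitude, which is a common choice for embedding vectors; other distance measures may suit your data better.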
Key features:
Use cases:
Embeddings created through the asynchronous endpoints (/embed-v2/tasks) are stored for seven days. After this period, you must create the embedding task again to retrieve the results.
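If you track task creation times on your side, a small helper can tell you when stored results are about to lapse. The following is a minimal sketch, not part of the platform's API: the names `RETENTION`, `expires_at`, and `is_expired` are hypothetical, and it assumes the seven-day window is counted from the moment the task was created, which you should verify for your account.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the retention window starts when the task is created.
RETENTION = timedelta(days=7)


def expires_at(created_at):
    """Return the time after which the stored embeddings are no longer available."""
    return created_at + RETENTION


def is_expired(created_at, now=None):
    """Return True if the retention window has elapsed at time `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at(created_at)


# Example with fixed timestamps so the result does not depend on the clock.
created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(expires_at(created).isoformat())
print(is_expired(created, now=datetime(2025, 1, 5, 12, 0, tzinfo=timezone.utc)))
print(is_expired(created, now=datetime(2025, 1, 9, 12, 0, tzinfo=timezone.utc)))
```

Copying the embeddings into your own storage before the window closes avoids having to create the task again.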
For details on how your usage is measured and billed, see the Pricing page.