Introduction

Twelve Labs Video Understanding Platform uses artificial intelligence to extract information from videos. The platform identifies and interprets movements, actions, objects, individuals, sounds, on-screen text, and spoken words. Built on top of our state-of-the-art multimodal foundation model optimized for videos, the platform enables you to add rich, contextual video understanding to your applications through developer-friendly APIs.

Key capabilities of Twelve Labs for multimodal video understanding

Twelve Labs Video Understanding Platform equips developers with the following key capabilities:

  • Deep semantic search: Find the exact moment you need within your videos using natural language queries instead of tags or metadata.
  • Zero-shot classification: Use natural language to create your custom taxonomy, facilitating accurate and efficient video classification tailored to your unique use case.
  • Dynamic video-to-text generation: Capture the essence of your videos into concise summaries or custom reports. The platform offers built-in formats to generate the following: titles, topics, summaries, hashtags, chapters, and highlights. Additionally, you can provide a prompt detailing the content and desired output format, such as a police report, to tailor the results to your needs.
  • Intuitive integration: Embed a state-of-the-art multimodal foundation model for video understanding into your application in just a few API calls.
  • Rapid result retrieval: Receive your results within seconds.
  • Scalability: Our cloud-native distributed infrastructure seamlessly processes thousands of concurrent requests.

Twelve Labs’ Advantages

The table below provides a basic comparison between Twelve Labs Video Understanding Platform and other video AI solutions:

  • Simplified API integration: Perform a rich set of video understanding tasks with just a few API calls. This allows you to focus on building your application rather than aggregating data from separate image and speech APIs or managing multiple data sources.
  • Natural language use: Tap into the model's capabilities using everyday language to write queries or prompts. This method is more effective, intuitive, flexible, and accurate than using solely rules, tags, or keywords.
  • Image-to-video search: Perform searches using images as queries and find videos semantically similar to the provided images. This addresses the challenges you face when the existing reverse image search tools yield inconsistent results or when describing the desired results using text is challenging.
  • Multimodal approach: The platform adopts a video-first, multimodal approach, surpassing traditional unimodal models that depend exclusively on text or images, providing a comprehensive understanding of your videos.
  • One-time video indexing for multiple tasks: Index your videos once and create contextual video embeddings that encapsulate semantics for scaling and repurposing, allowing you to search and classify your videos swiftly.
  • Flexible deployment: The platform can adapt to varied business needs, with deployment options spanning on-premise, hybrid, or cloud-based environments.
  • Fine-tuning capabilities: Though our state-of-the-art foundation model for video understanding already yields highly accurate results, we can provide fine-tuning capabilities to help you get more out of the models and achieve better results with only a few examples.

For details on fine-tuning the models or different deployment options, please contact us at [email protected].

Discover Twelve Labs

Experience the key capabilities of the Twelve Labs Video Understanding platform by signing up for a free account.