Platform overview

The Twelve Labs Video Understanding Platform, currently in beta, offers an API suite for integrating a state-of-the-art (“SOTA”) foundation model that understands contextual information from your videos and makes that information accessible to your applications. The API is organized around REST and is compatible with most programming languages. You can also use Postman or other REST clients to send requests and view responses.
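Because the API is organized around REST, a request can be assembled with nothing more than a standard HTTP library. The following is a minimal sketch in Python, not the platform's official client. The base URL, the `/indexes` path, the `x-api-key` header name, and the `build_request` helper are all assumptions made for illustration; check the API reference for the actual values. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical values: replace with the base URL and key for your account.
BASE_URL = "https://api.example.com/v1"  # assumption, not the real endpoint
API_KEY = "YOUR_API_KEY"                 # placeholder


def build_request(path, payload=None, method="GET"):
    """Build (but do not send) a JSON REST request.

    `build_request` is a hypothetical helper, and the `x-api-key`
    header name is an assumption about how the key is passed.
    """
    data = None
    headers = {"x-api-key": API_KEY, "Accept": "application/json"}
    if payload is not None:
        # Serialize the body as JSON and declare its content type.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


request = build_request("/indexes")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Any other REST client, including Postman, can issue the equivalent request.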

Architecture overview

The following diagram illustrates the architecture of the Twelve Labs Video Understanding Platform and how different parts interact:


Indexes

An index is the basic unit for organizing and storing video data. It consists of video embeddings and metadata, and it facilitates information retrieval and processing.

Video Understanding Engines

A video understanding engine consists of a family of deep neural networks built on top of our multimodal foundation model for video understanding, offering search, classification, and summarization capabilities. For each index, you must configure the engines you want to enable. See the Video understanding engines page for more details about the available engines and their capabilities.

Engine options

The engine options define the types of information that a specific engine will process. Currently, the platform provides the following engine options:

  • Visual
  • Conversation
  • Text in video
  • Logo

For more details, see the Engine options page.
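The relationship between an index, its engines, and their engine options can be sketched as a small configuration builder. This is an illustration under stated assumptions rather than the platform's actual request schema: the field names (`index_name`, `engines`, `engine_name`, `engine_options`), the snake_case spelling of the option identifiers, and the engine name `example-engine` are all hypothetical.

```python
import json

# The four engine options listed above, written as snake_case identifiers.
# The exact spelling the API expects is an assumption.
ENGINE_OPTIONS = {"visual", "conversation", "text_in_video", "logo"}


def index_config(index_name, engine_name, options):
    """Return a request body that enables one engine with the given options.

    Field names are hypothetical; consult the API reference for the
    real schema.
    """
    unknown = set(options) - ENGINE_OPTIONS
    if unknown:
        raise ValueError(f"unsupported engine options: {sorted(unknown)}")
    return {
        "index_name": index_name,
        "engines": [
            {"engine_name": engine_name, "engine_options": list(options)}
        ],
    }


# "example-engine" is a placeholder, not a real engine name.
config = index_config("my-videos", "example-engine", ["visual", "conversation"])
print(json.dumps(config, indent=2))
```

Validating the options before sending the request surfaces a misspelled option locally instead of as an error response from the API.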

Query/Prompt Processing Engine

This component processes the following user inputs and returns the corresponding results to your application:

  • Search queries
  • Classification queries
  • Prompts for generating text from video
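From the application's side, the three kinds of input listed above can be modeled as one tagged type that is routed to a matching endpoint. The sketch below is an assumption about how a client might organize this, not a description of how the component works internally. The paths `/search`, `/classify`, and `/generate`, the body fields, and the `UserInput` and `to_request` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class InputKind(Enum):
    SEARCH = "search"      # search queries
    CLASSIFY = "classify"  # classification queries
    GENERATE = "generate"  # prompts for generating text from video


# Hypothetical endpoint paths, one per input kind.
ROUTES = {
    InputKind.SEARCH: "/search",
    InputKind.CLASSIFY: "/classify",
    InputKind.GENERATE: "/generate",
}


@dataclass(frozen=True)
class UserInput:
    kind: InputKind
    text: str       # the query or prompt
    target_id: str  # hypothetical identifier of the index or video


def to_request(user_input):
    """Map a user input to a (path, body) pair for a REST call."""
    body = {"target_id": user_input.target_id, "text": user_input.text}
    return ROUTES[user_input.kind], body


path, body = to_request(UserInput(InputKind.SEARCH, "a dog catching a ball", "index-123"))
print(path, body)
```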