Multimodal language models

When you watch a movie, you typically use multiple senses to experience it. For example, you use your eyes to see the actors and objects on the screen and your ears to hear the dialogue and sounds. Using only one sense, you would miss essential details like body language or conversation. This is similar to how most language models operate - they are usually trained to understand either text, human speech, or images. Still, they cannot integrate multiple forms of information and understand what's happening in a scene.

When a language model processes a form of information, such as a text, it generates a compact numerical representation that defines the meaning of that specific input. These numerical representations are named unimodal embeddings and take the form of real-valued vectors in a multi-dimensional space. They allow computers to perform various downstream tasks such as translation, question answering, or classification.

In contrast, when a multimodal language model processes a video, it generates a multimodal embedding that represents the overall context from all sources of information, such as images, sounds, speech, or text displayed on the screen, and how they relate to one another. By doing so, the model acquires a comprehensive understanding of the video. Once multimodal embeddings are created, they are used for various downstream tasks such as visual question answering, classification, or sentiment analysis.

The Twelve Labs multimodal language model

Twelve Labs has developed a multimodal video understanding technology that creates multimodal embeddings for your videos. These embeddings are highly efficient in terms of storage and computational requirements. They contain all the context of a video and enable fast and scalable task execution without storing the entire video. For more details about the architecture of the platform, see the Architecture overview section.

The model has been trained on a vast amount of video data, and it can recognize entities, actions, patterns, movements, objects, scenes, and other elements present in videos. By integrating information from different modalities, the model can be used for several downstream tasks, such as search using natural language queries and zero-shot classification. For details, see the Search and Classify pages.

Currently, our multimodal model allows you to perform search and classification tasks by providing text inputs. Future versions will expand this functionality, enabling you to use images and videos as inputs.