Frequently asked questions

Navigate to the section that best addresses your query. If you don't find an answer to your question, please contact us.

General questions

This section answers frequently asked general questions.

On what types of data is your model trained?

We trained our foundation model on a few hundred million video-text pairs, currently one of the largest video datasets in the world. The dataset comprises information scraped from the internet and open-source academic benchmarks.

Where do you store your training dataset?

We have a partnership with Oracle Cloud Infrastructure (OCI) for both compute and data storage. We conduct all of our training on OCI, and we store a large number of video-text pairs on OCI's Object Storage platform.

How do you handle user data privacy?

We transform user-uploaded videos into vector embeddings, which are then securely stored in a separate vector database. Note that these embeddings cannot be reverse-engineered back into the original raw video. Additionally, we provide the Playground, a sandbox environment where users can play back their uploaded videos and try out the features of the Twelve Labs Video Understanding Platform through an intuitive web page. We are also actively working toward SOC 2 compliance, ensuring that our practices meet the highest security and privacy standards. Please visit the Privacy Policy page for more information on how we collect, retain, and process your data.

How does your model handle temporal dimension within videos?

We use a technique known as positional encoding, which is employed within the Transformer architecture to convey information about the position of each token in the input sequence. In our case, the tokens correspond to the key scenes within the video. This technique lets the model incorporate sequential information while preserving the parallel processing capability of self-attention within the Transformer architecture.
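
As a generic illustration of the idea (not our production implementation), the sketch below computes the classic sinusoidal positional encodings from the original Transformer paper and adds them to a set of placeholder scene embeddings; the token count and embedding size are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encodings ("Attention Is All You Need")."""
    positions = np.arange(num_positions)[:, np.newaxis]           # (num_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions: cosine
    return encoding

# Placeholder embeddings for 8 key-scene tokens; adding the encodings injects
# each scene's position in the sequence before self-attention is applied.
scene_embeddings = np.random.rand(8, 512)
scene_embeddings_with_position = scene_embeddings + sinusoidal_positional_encoding(8, 512)
print(scene_embeddings_with_position.shape)  # (8, 512)
```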

What is the maximum size of videos that can be stored in one index?

The Developer plan can accommodate up to 10,000 hours of video, whether in a single index or across all indexes combined. For larger volumes, our Enterprise plan is the best fit. Please contact us at [email protected] for more information.

How long does it take to index a video?

Indexing typically completes in 30-40% of the video's duration; for example, a 60-minute video usually finishes indexing in roughly 18 to 24 minutes. Indexing time also depends on the number of concurrent indexing tasks, and delays can occur when too many tasks are processed simultaneously. If you're on the Free plan and need faster indexing, consider upgrading to the Developer plan, which supports more concurrent tasks. We also offer a dedicated cloud deployment option for enterprise customers; please contact us at [email protected] to discuss this option.

Can your model recognize natural sounds in videos?

Yes. The visual option you select when configuring an engine covers both visual and audio information, so the model considers sounds and noises such as gunshots, honking, trains, and thunder. Note that the model learns the correlation between visual objects or situations and the sounds that frequently appear with them.

Can your model recognize text from other languages?

Yes, the model supports multiple languages. See the Supported languages page for details.

How does your visual language model compare to other LLMs?

The platform utilizes a multimodal approach for video understanding. Instead of relying on textual input like traditional LLMs, the platform interprets visuals, sounds, and spoken words to deliver comprehensive and accurate results.

Can I use TwelveLabs with my own LLM or with LangChain?

You can integrate our video-to-text model (Pegasus) with your own LLMs. We also provide an open-source project that demonstrates the integration with LangChain; find out more at twelvelabs-io/tl-jockey.
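
For orientation, here is a minimal sketch of passing Pegasus-generated text to your own LLM through LangChain. It assumes you have already retrieved a summary from the Generate API and installed the langchain-openai integration; the package names, model choice, and prompt are illustrative and may differ from your setup (tl-jockey shows a complete, supported integration).

```python
# Requires the langchain-openai package and an OPENAI_API_KEY environment variable.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Text previously returned by the Generate API for one of your videos (placeholder).
pegasus_summary = "..."

prompt = ChatPromptTemplate.from_template(
    "Here is an automatically generated summary of a video:\n\n{summary}\n\n"
    "Write three follow-up questions a viewer might ask."
)
llm = ChatOpenAI(model="gpt-4o-mini")  # swap in any chat model your stack supports

chain = prompt | llm
print(chain.invoke({"summary": pegasus_summary}).content)
```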

Embed API

This section answers frequently asked questions related to the Embed API.

When should I use the Embed API versus the built-in search?

The Embed API and built-in search service offer different functionalities for working with visual content.

Embed API

  • Generate visual embeddings for:
    • RAG workflows
    • Hybrid search
    • Classification
    • Clustering
  • Use the embeddings as input for your custom models (a minimal sketch follows these lists)
  • Create flexible, domain-specific solutions

Built-in search service

  • Perform semantic searches across multiple modalities:
    • Visual content
    • Conversation (human speech)
    • Text-in-video (OCR)
    • Logo
  • Utilize production-ready, out-of-the-box functionality
  • Ideal for projects not requiring additional customization
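
Once you have retrieved embeddings from the Embed API, you can use them directly in your own pipeline. The sketch below is a generic example of the retrieval step of a RAG or hybrid-search workflow: it ranks a handful of placeholder segment embeddings by cosine similarity to a query embedding. The vectors and the 1024-dimension size are placeholders, not real API output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: in practice, one vector per video segment from the Embed API,
# plus an embedding of your text query.
segment_embeddings = {
    "segment_001": np.random.rand(1024),
    "segment_002": np.random.rand(1024),
    "segment_003": np.random.rand(1024),
}
query_embedding = np.random.rand(1024)

# Rank segments by similarity to the query, e.g. to select context for a RAG prompt.
ranked = sorted(
    segment_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
for segment_id, embedding in ranked:
    print(segment_id, round(cosine_similarity(query_embedding, embedding), 3))
```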

Generate API

This section answers frequently asked questions related to the Generate API.

What LLM does the Generate API suite use?

The Generate API suite employs our foundational Visual Language Model (VLM), which integrates a language encoder to extract multimodal data from videos and a decoder to generate concise text representations.

To use the Generate API suite, do I need to reindex my videos if I already indexed them with Marengo?

Yes. To use the Generate API suite, you must reindex your videos with the Pegasus engine. See the Generate text from video and Pricing pages for details.
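
For orientation only, the sketch below shows the general shape of creating a new index that enables the Pegasus engine over HTTP; you would then upload your videos to that index to reindex them. The API version, field names, engine identifier, and options are assumptions rather than the authoritative request format; confirm them against the API reference before use.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.twelvelabs.io/v1.2"  # version is an assumption; check the API reference

# Field names, engine identifier, and options below are assumptions for illustration.
payload = {
    "index_name": "my-pegasus-index",
    "engines": [
        {
            "engine_name": "pegasus1",
            "engine_options": ["visual", "conversation"],
        }
    ],
}

response = requests.post(
    f"{BASE_URL}/indexes",
    headers={"x-api-key": API_KEY},
    json=payload,
)
response.raise_for_status()
print(response.json())  # the new index ID; upload videos to it to index them with Pegasus
```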