Text and image embeddings

This guide shows how to create a combined embedding from up to 10 images with optional text using the Marengo video understanding model. For a list of available versions, complete specifications, and input requirements, see the Marengo page.

The Marengo video understanding model generates embeddings for all modalities in the same latent space. This shared space enables any-to-any searches across different types of content.

For details on how your usage is measured and billed, see the Pricing page.

Key concepts

This section explains the key concepts and terminology used in this guide:

  • Asset: Your uploaded content. Once created, you can reference the same asset across multiple operations without uploading the file again.
  • Embedding: Vector representation of your content.

Workflow

This guide shows how to upload your images as assets and create a combined embedding synchronously. You can also pass images inline as a URL or base64-encoded data instead of creating assets; both are shown as commented-out lines in the code examples.

Use these embeddings for similarity search, content classification, clustering, recommendations, or Retrieval-Augmented Generation (RAG).
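Because all modalities share one latent space, you can rank results for similarity search by comparing embedding vectors directly. The sketch below is illustrative and not part of the SDK; it uses only the Python standard library and works on the plain lists of floats the platform returns:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Because the modalities share one latent space, the same function can
# compare a text embedding against an image embedding:
# score = cosine_similarity(text_embedding.embedding, image_embedding.embedding)
```

Scores close to 1 indicate highly similar content; scores near 0 indicate unrelated content.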

Prerequisites

  • To use the platform, you need an API key:

    1. If you don’t have an account, sign up for a free account.
    2. Go to the API Keys page.
    3. If you need to create a new key, select the Create API Key button. Enter a name and set the expiration period. The default is 12 months.
    4. Select the Copy icon next to your key to copy it to your clipboard.

  • Install the TwelveLabs SDK for Python by entering the following command:

    pip install twelvelabs
  • Your image files must meet the requirements.

Complete example

Copy and paste the code below, replacing the placeholders surrounded by <> with your values.

from twelvelabs import TwelveLabs, MultiInputRequest, MultiInputMediaSource

# 1. Initialize the client
client = TwelveLabs(api_key="<YOUR_API_KEY>")

# 2. Upload an image
asset = client.assets.create(
    method="url",
    url="<YOUR_IMAGE_URL>"  # Use direct links to raw media files
    # Or use method="direct" and file=open("<PATH_TO_IMAGE_FILE>", "rb") to upload a file from the local file system
)
print(f"Created asset: id={asset.id}")

# 3. Create text and image embeddings
response = client.embed.v_2.create(
    input_type="multi_input",
    model_name="marengo3.0",
    multi_input=MultiInputRequest(
        media_sources=[
            MultiInputMediaSource(
                # name="img1",  # Required when using <@name> placeholders in input_text
                media_type="image",
                asset_id=asset.id,
                # url="<YOUR_IMAGE_URL>",  # Use direct links to raw media files
                # base_64_string="<BASE_64_ENCODED_IMAGE_DATA>",
            ),
            # Add more images as needed (up to 10):
            # MultiInputMediaSource(name="img2", media_type="image", url="<YOUR_IMAGE_URL_2>"),
        ],
        input_text="<YOUR_TEXT>",
        # Use <@name> placeholders to reference specific images:
        # input_text="<@img1> shows X, while <@img2> shows Y",
    ),
)

# 4. Process the results
print(f"Number of embeddings: {len(response.data)}")
for embedding_data in response.data:
    print(f"Embedding dimensions: {len(embedding_data.embedding)}")
    print(f"First 10 values: {embedding_data.embedding[:10]}")

Code explanation

1. Import the SDK and initialize the client

Create a client instance to interact with the TwelveLabs Video Understanding Platform.
Function call: You call the constructor of the TwelveLabs class.
Parameters:

  • api_key: The API key to authenticate your requests to the platform.

Return value: An object of type TwelveLabs configured for making API calls.

2. Upload your images

Upload one or more image files to create assets. For details about the available upload methods and the corresponding limits, see the Upload methods page.
Function call: You call the assets.create function.
Parameters:

  • method: The upload method for your asset. Use url for a publicly accessible URL or direct to upload a local file. This example uses url.
  • url or file: The publicly accessible URL of your image file or an opened file object in binary read mode. This example uses url.

Return value: An object of type Asset. This object contains, among other information, a field named id representing the unique identifier of your asset.

Repeat this step for each image you want to include. In the next step, you can assign a name to each image source to reference it in your text using <@name> placeholders.
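One convention for keeping names and placeholders in sync is to generate both from the same list. The helper below is hypothetical (not part of the SDK), and the `img1`, `img2`, … names are arbitrary; each generated name must also be set as the `name` field on the matching image source:

```python
def build_placeholder_text(descriptions):
    """Assign sequential names (img1, img2, ...) and build an input_text
    that references each image with a <@name> placeholder.

    descriptions: one short description per image, in media_sources order.
    Returns (names, input_text).
    """
    names = [f"img{i + 1}" for i in range(len(descriptions))]
    parts = [f"<@{name}> shows {desc}" for name, desc in zip(names, descriptions)]
    return names, ", while ".join(parts)

names, text = build_placeholder_text(["a street at night", "the same street by day"])
# names -> ["img1", "img2"]
# text  -> "<@img1> shows a street at night, while <@img2> shows the same street by day"
```

Passing the generated names to the corresponding media sources ensures every placeholder in `input_text` resolves to an image.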

3. Create text and image embeddings

Function call: You call the embed.v_2.create function.
Parameters:

  • input_type: The type of content. Set this parameter to multi_input.
  • model_name: The model you want to use. This example uses marengo3.0.
  • multi_input: A MultiInputRequest object containing the following properties:
    • media_sources: A list of up to 10 image sources. Each source accepts the following properties:
      • (Optional) name: A unique identifier for this image. Required when using <@name> placeholders in input_text.
      • media_type: The type of media. Set to "image".
      • Exactly one of the following:
        • asset_id: The unique identifier of an asset from a previous upload. This example uses the asset ID from the previous step.
        • url: The publicly accessible URL of the image file.
        • base_64_string: The base64-encoded image data.
    • input_text: The text for which you wish to create an embedding. Provide plain text for context, or use <@name> placeholders to reference specific images. When using placeholders, set a matching name field on each image source.

Return value: An object of type EmbeddingSuccessResponse containing a field named data, which is a list of embedding objects. Each embedding object includes the following fields:

  • embedding: An array of floats representing the embedding vector.
  • embedding_option: The type of embedding generated.

4. Process the results

This example prints the number of embeddings, their dimensions, and the first 10 values of each embedding.
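Before storing embeddings in a vector database, a common post-processing step is to scale each vector to unit length so that a dot product equals cosine similarity. This helper is a sketch, not part of the SDK:

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale an embedding to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

unit = l2_normalize([3.0, 4.0])
# unit -> [0.6, 0.8]
```

You would apply it to each `embedding_data.embedding` before indexing.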

If you need only a single image with text, you can use the text_image input type as a simpler alternative. For details, see the TextImageInputRequest section in the SDK Reference for Python or Node.js.