Audio embeddings

This guide shows how you can create audio embeddings using the Marengo video understanding model. For a list of available versions, complete specifications and input requirements for each version, see the Marengo page.

The Marengo video understanding model generates embeddings for all modalities in the same latent space. This shared space enables any-to-any searches across different types of content.
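Because all modalities share one latent space, you can compare any two embeddings directly with a standard vector similarity measure such as cosine similarity. The sketch below uses short toy vectors standing in for real embeddings (actual Marengo vectors are much higher-dimensional); the vector values and variable names are illustrative, not output from the platform.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of different modalities
audio_vec = [0.1, 0.9, 0.0]   # e.g., an audio clip embedding
text_vec = [0.2, 0.8, 0.1]    # e.g., a text query embedding
other_vec = [0.9, 0.0, 0.1]   # e.g., an unrelated clip

# Semantically closer content scores higher in the shared space
print(cosine_similarity(audio_vec, text_vec) > cosine_similarity(audio_vec, other_vec))  # True
```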

For details on how your usage is measured and billed, see the Pricing page.

Key concepts

This section explains the key concepts and terminology used in this guide:

  • Asset: Your uploaded content.
  • Embedding: Vector representation of your content.
  • Embedding task: An asynchronous operation for processing your content and creating embeddings. Contains a status and the resulting embeddings when complete.

Workflow

To create audio embeddings, provide your audio content to the platform. You can upload audio files as assets, provide a publicly accessible URL, or use base64-encoded data. The platform processes your audio and returns vector representations of your content. Use these embeddings for similarity search, content classification, clustering, recommendations, or building Retrieval-Augmented Generation (RAG) systems.

For audio files shorter than 10 minutes, you can provide a publicly accessible URL or base64-encoded audio data inline. This method skips the upload step but limits reusability for subsequent operations. See the Short audio files (synchronous) section for an example implementation.

This guide demonstrates how to create embeddings by uploading your audio file as an asset. This approach is the most flexible because you can reuse assets across multiple operations.

Customize your embeddings

You can customize your embeddings in the following ways:

  • Specify the types of embeddings you wish to generate:
    • Audio: Based on sounds and music
    • Transcription: Based on spoken words
  • Choose the embedding scope: clip (per segment) or asset (entire audio file)
  • Define the segment duration by specifying a fixed length in seconds
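As a quick orientation, the options above map onto the request as shown in this plain-dictionary sketch of the request body; in the SDK these fields are wrapped in `AudioInputRequest` and `AudioSegmentation` objects, as the complete example below demonstrates. The asset ID is a placeholder.

```python
# Illustrative, plain-dict view of the audio request fields
audio_request = {
    "media_source": {"asset_id": "<YOUR_ASSET_ID>"},   # placeholder
    "segmentation": {"fixed": {"duration_sec": 6}},    # 6-second segments
    "embedding_option": ["audio", "transcription"],    # both embedding types
    "embedding_scope": ["clip", "asset"],              # per-segment and whole-file
}
print(audio_request["embedding_scope"])
```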

Prerequisites

  • To use the platform, you need an API key:

    1

    If you don’t have an account, sign up for a free account.

    2

    Go to the API Keys page.

    3

    Select the Copy icon next to your key.

  • Depending on the programming language you are using, install the TwelveLabs SDK by entering one of the following commands:

    pip install twelvelabs
  • Your audio files must meet the following requirements:

    • For this guide: Files up to 4 hours
    • Model capabilities: See the complete requirements for duration and supported formats.

    For other upload methods with different limits, see the Upload methods page.

Complete example

Copy and paste the code below, replacing the placeholders surrounded by <> with your values.

import time
from twelvelabs import (
    TwelveLabs,
    AudioInputRequest,
    MediaSource,
    AudioSegmentation,
    AudioSegmentationFixed,
)

# 1. Initialize the client
client = TwelveLabs(api_key="<YOUR_API_KEY>")

# 2. Upload an audio file
asset = client.assets.create(
    method="url",
    url="<YOUR_AUDIO_URL>",  # Use direct links to raw media files.
    # Or use method="direct" and file=open("<PATH_TO_AUDIO_FILE>", "rb") to upload a file from the local file system
)
print(f"Created asset: id={asset.id}")

# 3. Process your audio
task = client.embed.v_2.tasks.create(
    input_type="audio",
    model_name="marengo3.0",
    audio=AudioInputRequest(
        media_source=MediaSource(
            asset_id=asset.id,
            # url="<YOUR_AUDIO_URL>",  # Use direct links to raw media files
            # base_64_string="<BASE_64_ENCODED_DATA>",
        ),
        # start_sec=0,
        # end_sec=60,
        # segmentation=AudioSegmentation(
        #     fixed=AudioSegmentationFixed(
        #         duration_sec=6
        #     )
        # ),
        # embedding_option=["audio", "transcription"],
        # embedding_scope=["clip", "asset"],
    ),
)
print(f"Task ID: {task.id}")

# 4. Poll until the task is ready
while True:
    task = client.embed.v_2.tasks.retrieve(task_id=task.id)

    if task.status == "ready":
        print("Task completed")
        break
    elif task.status == "failed":
        print("Task failed")
        break
    else:
        print("Task still processing...")
        time.sleep(5)

# 5. Process the results
print(f"\n{'='*80}")
print(f"EMBEDDINGS SUMMARY: {len(task.data)} total embeddings")
print(f"{'='*80}\n")

for idx, embedding_data in enumerate(task.data, 1):
    print(f"[{idx}/{len(task.data)}] {embedding_data.embedding_option.upper()} | {embedding_data.embedding_scope.upper()}")
    print(f"├─ Time range: {embedding_data.start_sec}s - {embedding_data.end_sec}s")
    print(f"├─ Dimensions: {len(embedding_data.embedding)}")
    print(f"└─ First 10 values: {embedding_data.embedding[:10]}")
    print()

Code explanation

1

Import the SDK and initialize the client

Create a client instance to interact with the TwelveLabs Video Understanding Platform.
Function call: You call the constructor of the TwelveLabs class.
Parameters:

  • api_key: The API key to authenticate your requests to the platform.

Return value: An object of type TwelveLabs configured for making API calls.

2

Upload an audio file

Upload an audio file to create an asset. For details about the available upload methods and the corresponding limits, see the Upload methods page.
Function call: You call the assets.create function.
Parameters:

  • method: The upload method for your asset. Use url for a publicly accessible URL or direct to upload a local file. This example uses url.
  • url or file: The publicly accessible URL of your audio file or an opened file object in binary read mode. This example uses url.

Return value: An object of type Asset. This object contains, among other information, a field named id representing the unique identifier of your asset.

3

Process your audio

Create an embedding task to start processing your audio. This operation is asynchronous.
Function call: You call the embed.v_2.tasks.create function.
Parameters:

  • input_type: The type of content. Set this parameter to audio.
  • model_name: The model you want to use. This example uses marengo3.0.
  • audio: An object containing the following properties:
    • media_source: An object specifying the source of the audio file. You can specify one of the following:

      • asset_id: The unique identifier of an asset from a previous upload.

      • url: The publicly accessible URL of the audio file.

      • base_64_string: The base64-encoded audio data.

        This example uses the asset ID from the previous step.

    • (Optional) start_sec: The start time in seconds for processing the audio file. By default, the platform processes audio from the beginning.

    • (Optional) end_sec: The end time in seconds for processing the audio file. By default, the platform processes audio to the end of the audio file.

    • (Optional) embedding_option: The types of embeddings to generate. Valid values are audio and transcription. You can specify multiple options to generate different types of embeddings. The default value is ["audio", "transcription"].

    • (Optional) embedding_scope: The scope for which to generate embeddings. Valid values are the following:

      • clip: Generates one embedding for each segment.
      • asset: Generates one embedding for the entire audio file.

      You can specify multiple scopes to generate embeddings at different levels. The default value is ["clip", "asset"].

    • (Optional) segmentation: An object that specifies how the platform divides the audio into segments. Use AudioSegmentation with a fixed property containing a duration_sec field to specify the exact duration in seconds for each segment.

Return value: An object of type TasksCreateResponse containing, among other information, a field named id, which represents the unique identifier of your embedding task. You can use this identifier to track the status of your embedding task.

4

Monitor the status

The platform requires some time to process audio. Poll the status of the embedding task until processing completes. This example uses a loop to check the status every 5 seconds.
Function call: You repeatedly call the embed.v_2.tasks.retrieve function until the task completes.

Parameters:

  • task_id: The unique identifier of your embedding task.

Return value: An object of type EmbeddingTaskResponse containing, among other information, the following fields:

  • status: The current status of the task. The possible values are:
    • processing: The platform is creating the embeddings.
    • ready: Processing is complete. Embeddings are available in the data field.
    • failed: The task failed.
  • data: When the status is ready, this field contains a list of embedding objects. Each embedding object includes:
    • embedding: The embedding vector (a list of floats).
    • embedding_option: The type of embedding (audio or transcription).
    • embedding_scope: The scope of the embedding (clip or asset).
    • start_sec: The start time of the segment in seconds.
    • end_sec: The end time of the segment in seconds.

5

Process the results

This example iterates through the embeddings in the data field and prints the embedding type, scope, time range, dimensions, and the first 10 vector values for each segment.
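Beyond printing, a common next step is to group the returned embeddings by type and scope, for example to index clip-level audio vectors separately from asset-level ones. The sketch below uses a minimal `EmbeddingData` stand-in with the fields described above; the sample values are mock data, not platform output.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Minimal stand-in for the embedding objects in the task's data field
@dataclass
class EmbeddingData:
    embedding_option: str   # "audio" or "transcription"
    embedding_scope: str    # "clip" or "asset"
    start_sec: float
    end_sec: float
    embedding: list = field(default_factory=list)

# Mock results shaped like the platform's response
data = [
    EmbeddingData("audio", "clip", 0.0, 6.0, [0.1, 0.2]),
    EmbeddingData("audio", "clip", 6.0, 12.0, [0.3, 0.4]),
    EmbeddingData("transcription", "asset", 0.0, 12.0, [0.5, 0.6]),
]

# Group embeddings by (type, scope) for separate downstream indexing
groups = defaultdict(list)
for item in data:
    groups[(item.embedding_option, item.embedding_scope)].append(item)

for key, items in groups.items():
    print(key, len(items))
```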

Short audio files (synchronous)

For audio files shorter than 10 minutes, you can use a synchronous approach that returns embeddings immediately without requiring polling.

from twelvelabs import (
    TwelveLabs,
    AudioInputRequest,
    MediaSource,
    # AudioSegmentation,
    # AudioSegmentationFixed,
)

# 1. Initialize the client
client = TwelveLabs(api_key="<YOUR_API_KEY>")

# 2. Upload an audio file
asset = client.assets.create(
    method="url",
    url="<YOUR_AUDIO_URL>",  # Use direct links to raw media files
    # Or use method="direct" and file=open("<PATH_TO_AUDIO_FILE>", "rb") to upload a file from the local file system
)
print(f"Created asset: id={asset.id}")

# 3. Create audio embeddings
response = client.embed.v_2.create(
    input_type="audio",
    model_name="marengo3.0",
    audio=AudioInputRequest(
        media_source=MediaSource(
            asset_id=asset.id,
            # url="<YOUR_AUDIO_URL>",  # Use direct links to raw media files
            # base_64_string="<BASE_64_ENCODED_DATA>",
        ),
        # start_sec=0,
        # end_sec=60,
        # segmentation=AudioSegmentation(
        #     fixed=AudioSegmentationFixed(
        #         duration_sec=6
        #     )
        # ),
        # embedding_option=["audio", "transcription"],
        # embedding_scope=["clip", "asset"],
    ),
)

# 4. Process the results
print(f"\n{'='*80}")
print(f"EMBEDDINGS SUMMARY: {len(response.data)} total embeddings")
print(f"{'='*80}\n")

for idx, embedding_data in enumerate(response.data, 1):
    print(f"[{idx}/{len(response.data)}] {embedding_data.embedding_option.upper()} | {embedding_data.embedding_scope.upper()}")
    print(f"├─ Time range: {embedding_data.start_sec}s - {embedding_data.end_sec}s")
    print(f"├─ Dimensions: {len(embedding_data.embedding)}")
    print(f"└─ First 10 values: {embedding_data.embedding[:10]}")
    print()

All the fields of the audio object work the same way as in the asynchronous approach.