Create sync embeddings

The Embed.V2 interface provides methods to create embeddings synchronously for multimodal content. This endpoint returns embeddings immediately in the response.

Note

This interface only supports Marengo version 3.0 or newer.

When to use this interface:

  • Create embeddings for text, images, audio, or video content
  • Get immediate results without waiting for background processing
  • Process audio or video content up to 10 minutes in duration

Do not use this interface for:

  • Audio or video content longer than 10 minutes. Use the asynchronous embedding tasks method instead.

Methods

Create embeddings

Description: This method synchronously creates embeddings for multimodal content and returns the results immediately in the response.

Text:

  • Maximum length: 500 tokens

Images:

  • Formats: JPEG, PNG
  • Minimum size: 128x128 pixels
  • Maximum file size: 5 MB

Audio and video:

  • Maximum duration: 10 minutes
  • Maximum file size for base64 encoded strings: 36 MB
  • Audio formats: WAV (uncompressed), MP3 (lossy), FLAC (lossless)
  • Video formats: FFmpeg supported formats
  • Video resolution: 360x360 to 5184x2160 pixels
  • Aspect ratio: Between 1:2.4 and 2.4:1

Note

This method is rate-limited. For details, see the Rate limits page.

Function signature:

create(
  request: TwelvelabsApi.embed.CreateEmbeddingsRequest,
  requestOptions?: V2.RequestOptions
): core.HttpResponsePromise<TwelvelabsApi.EmbeddingSuccessResponse>
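A minimal usage sketch for a text embedding. The payload shape follows the request tables below; the client setup and import path are not shown in this reference, so the call site is indicated only as a comment, and just the payload construction runs here:

```typescript
// Request payload for a synchronous text embedding, shaped per
// TwelvelabsApi.embed.CreateEmbeddingsRequest (field names from the tables below).
const request = {
  inputType: "text",
  modelName: "marengo3.0",
  text: { inputText: "A man surfing at sunset" }, // maximum 500 tokens
};

// Hypothetical call site (assumes a configured `client` instance):
// const response = await client.embed.v2.create(request);
// response.data holds the embedding results.

console.log(request.inputType); // → "text"
```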

Parameters:

  • request (TwelvelabsApi.embed.CreateEmbeddingsRequest, required): Parameters for creating embeddings.
  • requestOptions (V2.RequestOptions, optional): Request-specific configuration.

The TwelvelabsApi.embed.CreateEmbeddingsRequest interface has the following properties:

  • inputType (TwelvelabsApi.embed.CreateEmbeddingsRequestInputType, required): The type of content for the embeddings. Values: text, image, text_image, audio, video.
  • modelName (TwelvelabsApi.embed.CreateEmbeddingsRequestModelName, required): The video understanding model you wish to use. Value: marengo3.0.
  • text (TwelvelabsApi.TextInputRequest, optional): Text input configuration. Required when inputType is text. See TextInputRequest for details.
  • image (TwelvelabsApi.ImageInputRequest, optional): Image input configuration. Required when inputType is image. See ImageInputRequest for details.
  • textImage (TwelvelabsApi.TextImageInputRequest, optional): Combined text and image input configuration. Required when inputType is text_image. See TextImageInputRequest for details.
  • audio (TwelvelabsApi.AudioInputRequest, optional): Audio input configuration. Required when inputType is audio. See AudioInputRequest for details.
  • video (TwelvelabsApi.VideoInputRequest, optional): Video input configuration. Required when inputType is video. See VideoInputRequest for details.

TextInputRequest

The TwelvelabsApi.TextInputRequest interface specifies configuration for processing text content. Required when inputType is text.

  • inputText (string, required): The text for which you wish to create an embedding. The maximum length is 500 tokens.

ImageInputRequest

The TwelvelabsApi.ImageInputRequest interface specifies configuration for processing image content. Required when inputType is image.

  • mediaSource (TwelvelabsApi.MediaSource, required): Specifies the source of the image file. See MediaSource for details.

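A payload sketch for an image embedding. The URL is a placeholder; in practice it must be a direct link to a raw JPEG or PNG file that meets the size limits above:

```typescript
// Image embedding payload: mediaSource must provide exactly one source field.
const imageRequest = {
  inputType: "image",
  modelName: "marengo3.0",
  image: {
    mediaSource: {
      url: "https://example.com/media/photo.jpg", // placeholder direct link (JPEG/PNG, ≥128x128, ≤5 MB)
    },
  },
};

console.log(Object.keys(imageRequest.image.mediaSource).length); // → 1
```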
TextImageInputRequest

The TwelvelabsApi.TextImageInputRequest interface specifies configuration for processing combined text and image content. Required when inputType is text_image.

  • mediaSource (TwelvelabsApi.MediaSource, required): Specifies the source of the image file. See MediaSource for details.
  • inputText (string, required): The text for which you wish to create an embedding. The maximum length is 500 tokens.
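A payload sketch for a combined text and image embedding. The URL and text are placeholders:

```typescript
// text_image combines one image source with an input text in a single payload.
const textImageRequest = {
  inputType: "text_image",
  modelName: "marengo3.0",
  textImage: {
    mediaSource: { url: "https://example.com/media/photo.png" }, // placeholder direct link
    inputText: "A red bicycle leaning against a brick wall",     // maximum 500 tokens
  },
};

console.log(textImageRequest.inputType); // → "text_image"
```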

AudioInputRequest

The TwelvelabsApi.AudioInputRequest interface specifies configuration for processing audio content. Required when inputType is audio.

  • mediaSource (TwelvelabsApi.MediaSource, required): Specifies the source of the audio file. See MediaSource for details.
  • startSec (number, optional): The start time in seconds for processing the audio file. Use this parameter to process a portion of the audio file starting from a specific time. Default: 0 (start from the beginning).
  • endSec (number, optional): The end time in seconds for processing the audio file. Use this parameter to process a portion of the audio file ending at a specific time. The end time must be greater than the start time. Default: end of the audio file.
  • segmentation (TwelvelabsApi.AudioSegmentation, optional): Specifies how the platform divides the audio into segments. When combined with embeddingScope=["clip"], creates separate embeddings for each segment. Use this to generate embeddings for specific portions of your audio. See AudioSegmentation for details.
  • embeddingOption (TwelvelabsApi.AudioInputRequestEmbeddingOptionItem[], optional): The types of embeddings you wish to generate. You can specify multiple options to generate different types of embeddings for the same audio. Values:
      - audio: Generates embeddings based on audio content (sounds, music, effects)
      - transcription: Generates embeddings based on transcribed speech
  • embeddingScope (TwelvelabsApi.AudioInputRequestEmbeddingScopeItem[], optional): The scope for which you wish to generate embeddings. You can specify multiple scopes to generate embeddings at different levels. Values:
      - clip: Generates one embedding for each segment
      - asset: Generates one embedding for the entire audio file
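A payload sketch combining the audio options above: a trimmed range, fixed 10-second segments, both embedding types, and both scopes. The URL and time range are placeholders:

```typescript
// Audio embedding payload: fixed 10-second segments, sound-based and
// transcription-based embeddings, at both clip and asset scope.
const audioRequest = {
  inputType: "audio",
  modelName: "marengo3.0",
  audio: {
    mediaSource: { url: "https://example.com/media/podcast.mp3" }, // placeholder direct link
    startSec: 0,
    endSec: 120, // process only the first two minutes
    segmentation: { strategy: "fixed", fixed: { durationSec: 10 } },
    embeddingOption: ["audio", "transcription"],
    embeddingScope: ["clip", "asset"],
  },
};

console.log(audioRequest.audio.segmentation.fixed.durationSec); // → 10
```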

VideoInputRequest

The TwelvelabsApi.VideoInputRequest interface specifies configuration for processing video content. Required when inputType is video.

  • mediaSource (TwelvelabsApi.MediaSource, required): Specifies the source of the video file. See MediaSource for details.
  • startSec (number, optional): The start time in seconds for processing the video file. Use this parameter to process a portion of the video file starting from a specific time. Default: 0 (start from the beginning).
  • endSec (number, optional): The end time in seconds for processing the video file. Use this parameter to process a portion of the video file ending at a specific time. The end time must be greater than the start time. Default: end of the video file.
  • segmentation (TwelvelabsApi.VideoSegmentation, optional): Specifies how the platform divides the video into segments. When combined with embeddingScope=["clip"], creates separate embeddings for each segment. Supports fixed-duration segments or dynamic segmentation that adapts to scene changes. See VideoSegmentation for details.
  • embeddingOption (TwelvelabsApi.VideoInputRequestEmbeddingOptionItem[], optional): The types of embeddings to generate for the video. You can specify multiple options to generate different types of embeddings for the same video. Default: ["visual", "audio", "transcription"]. Values:
      - visual: Generates embeddings based on visual content (scenes, objects, actions)
      - audio: Generates embeddings based on audio content (sounds, music, effects)
      - transcription: Generates embeddings based on transcribed speech
  • embeddingScope (TwelvelabsApi.VideoInputRequestEmbeddingScopeItem[], optional): The scope for which you wish to generate embeddings. You can specify multiple scopes to generate embeddings at different levels. Default: ["clip", "asset"]. Values:
      - clip: Generates one embedding for each segment
      - asset: Generates one embedding for the entire video file. For optimal performance, use this scope for short videos (roughly 10-30 seconds).
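A payload sketch for a video embedding using dynamic (scene-adaptive) segmentation. The asset ID is a placeholder:

```typescript
// Video embedding payload: dynamic segmentation with clip-level embeddings only.
const videoRequest = {
  inputType: "video",
  modelName: "marengo3.0",
  video: {
    mediaSource: { assetId: "64f8a1b2c3d4e5f6a7b8c9d0" }, // placeholder asset ID
    segmentation: { strategy: "dynamic", dynamic: { minDurationSec: 3 } },
    embeddingOption: ["visual", "audio", "transcription"], // the default set
    embeddingScope: ["clip"], // one embedding per scene-based segment
  },
};

console.log(videoRequest.video.segmentation.strategy); // → "dynamic"
```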

MediaSource

The TwelvelabsApi.MediaSource interface specifies the source of the media file. Provide exactly one of the following:

  • base64String (string, optional): The base64-encoded media data.
  • url (string, optional): The publicly accessible URL of the media file. Use direct links to raw media files. Video hosting platforms and cloud storage sharing links are not supported.
  • assetId (string, optional): The unique identifier of an asset from a direct or multipart upload.
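Because exactly one source field must be provided, a client-side check like the following (a hypothetical helper, not part of the SDK) can catch malformed payloads before a request is sent:

```typescript
interface MediaSource {
  base64String?: string;
  url?: string;
  assetId?: string;
}

// Returns true when exactly one of the three source fields is set.
function hasExactlyOneSource(source: MediaSource): boolean {
  const provided = [source.base64String, source.url, source.assetId].filter(
    (v) => v !== undefined
  );
  return provided.length === 1;
}

console.log(hasExactlyOneSource({ url: "https://example.com/clip.mp4" })); // → true
console.log(hasExactlyOneSource({ url: "https://example.com/clip.mp4", assetId: "abc" })); // → false
```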

AudioSegmentation

The TwelvelabsApi.AudioSegmentation interface specifies how the platform divides the audio into segments using fixed-length intervals.

  • strategy ("fixed", required): The segmentation strategy. Value: fixed.
  • fixed (TwelvelabsApi.AudioSegmentationFixed, required): Configuration for fixed segmentation. See AudioSegmentationFixed for details.

AudioSegmentationFixed

The TwelvelabsApi.AudioSegmentationFixed interface configures fixed-length segmentation for audio.

  • durationSec (number, required): The duration in seconds for each segment. The platform divides the audio into segments of this exact length. The final segment may be shorter if the audio duration is not evenly divisible.

Example: With durationSec: 5, a 12-second audio file produces segments: [0-5s], [5-10s], [10-12s].
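The segment boundaries from the example above can be reproduced with a small helper (an illustration of the fixed strategy, not SDK code):

```typescript
// Splits a total duration into fixed-length segments; the final segment
// may be shorter when the duration is not evenly divisible.
function fixedSegments(totalSec: number, durationSec: number): [number, number][] {
  const segments: [number, number][] = [];
  for (let start = 0; start < totalSec; start += durationSec) {
    segments.push([start, Math.min(start + durationSec, totalSec)]);
  }
  return segments;
}

const segments = fixedSegments(12, 5); // [[0, 5], [5, 10], [10, 12]]
```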

VideoSegmentation

The TwelvelabsApi.VideoSegmentation type specifies how the platform divides the video into segments. Use one of the following:

Fixed segmentation: Divides the video into equal-length segments:

  • strategy ("fixed", required): The segmentation strategy. Value: fixed.
  • fixed (TwelvelabsApi.VideoSegmentationFixedFixed, required): Configuration for fixed segmentation. See VideoSegmentationFixedFixed for details.

Dynamic segmentation: Divides the video into adaptive segments based on scene changes:

  • strategy ("dynamic", required): The segmentation strategy. Value: dynamic.
  • dynamic (TwelvelabsApi.VideoSegmentationDynamicDynamic, required): Configuration for dynamic segmentation. See VideoSegmentationDynamicDynamic for details.

VideoSegmentationFixedFixed

The TwelvelabsApi.VideoSegmentationFixedFixed interface configures fixed-length segmentation for video.

  • durationSec (number, required): The duration in seconds for each segment. The platform divides the video into segments of this exact length. The final segment may be shorter if the video duration is not evenly divisible.

Example: With durationSec: 5, a 12-second video produces segments: [0-5s], [5-10s], [10-12s].

VideoSegmentationDynamicDynamic

The TwelvelabsApi.VideoSegmentationDynamicDynamic interface configures dynamic segmentation for video based on scene changes.

  • minDurationSec (number, required): The minimum duration in seconds for each segment. The platform divides the video into segments that are at least this long. Segments adapt to scene changes and content boundaries and may be longer than the minimum.

Example: With minDurationSec: 3, segments might be: [0-3.2s], [3.2-7.8s], [7.8-12.1s].

Return value: Returns an HttpResponsePromise that resolves to a TwelvelabsApi.EmbeddingSuccessResponse object containing the embedding results.

The TwelvelabsApi.EmbeddingSuccessResponse interface contains the following properties:

  • data (TwelvelabsApi.EmbeddingData[]): Array of embedding results.
  • metadata (TwelvelabsApi.EmbeddingMediaMetadata): Metadata about the media content.

The TwelvelabsApi.EmbeddingData interface contains the following properties:

  • embedding (number[]): The embedding vector for the content.
  • embeddingOption (TwelvelabsApi.EmbeddingDataEmbeddingOption): The type of embedding. Values: visual, audio, transcription.
  • embeddingScope (TwelvelabsApi.EmbeddingDataEmbeddingScope): The scope of the embedding. Values: clip, asset.
  • startSec (number): The start time in seconds for this embedding segment.
  • endSec (number): The end time in seconds for this embedding segment.
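A sketch of how the response data might be consumed, using a hand-built stand-in for the response (the embedding values here are fabricated; real ones come from the API call):

```typescript
interface EmbeddingData {
  embedding: number[];
  embeddingOption: "visual" | "audio" | "transcription";
  embeddingScope: "clip" | "asset";
  startSec: number;
  endSec: number;
}

// Stand-in for EmbeddingSuccessResponse.data with fabricated values.
const data: EmbeddingData[] = [
  { embedding: [0.1, -0.2, 0.3], embeddingOption: "visual", embeddingScope: "clip", startSec: 0, endSec: 5 },
  { embedding: [0.4, 0.0, -0.1], embeddingOption: "visual", embeddingScope: "asset", startSec: 0, endSec: 12 },
];

// Keep only clip-level visual embeddings, e.g. to index individual segments.
const clipVectors = data
  .filter((d) => d.embeddingScope === "clip" && d.embeddingOption === "visual")
  .map((d) => d.embedding);

console.log(clipVectors.length); // → 1
```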

API Reference: Create sync embeddings