Pinecone - Multimodal RAG


Summary: This integration combines Twelve Labs' Embed and Generate APIs with Pinecone's hosted vector database to build RAG-based video Q&A applications. It transforms video content into rich embeddings that can be stored, indexed, and queried to extract text answers from unstructured video databases.

Description: The process of performing video-based question answering using Twelve Labs and Pinecone involves the following steps:

  • Generate rich, contextual embeddings from your video content using the Embed API
  • Store and index these embeddings in Pinecone's vector database
  • Perform semantic searches to find relevant video segments
  • Generate natural language responses using the Generate API

This integration also showcases the difference in developer experience between generating text responses with the Generate API and with a leading open-source model, LLaVA-NeXT-Video, allowing you to compare the two approaches and select the most suitable solution for your needs.

Step-by-step guide: Our blog post, Multimodal RAG: Chat with Videos Using Twelve Labs and Pinecone, guides you through the process of creating a RAG-based video Q&A application.

Colab Notebook: TwelveLabs_Pinecone_Chat_with_video.

Integration with Twelve Labs

This section describes how the application uses the Twelve Labs Python SDK with Pinecone to create a video Q&A application. The integration involves the following steps:

  • Video embedding generation using the Embed API
  • Vector database storage and indexing
  • Similarity search for relevant video segments
  • Natural language response generation using the Generate API
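
The code examples below assume that the Twelve Labs and Pinecone clients have already been initialized and that a Pinecone index exists. The following is a minimal setup sketch; the environment variable names, index name, 1024-vector dimension, and serverless region are assumptions you should adjust for your environment:

import os

from pinecone import Pinecone, ServerlessSpec
from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs and Pinecone clients (API keys read from the environment)
twelvelabs_client = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index if it does not already exist. The dimension must match the
# embedding model's output; Marengo-retrieval-2.6 produces 1024-dimensional vectors.
if "twelve-labs" not in pc.list_indexes().names():
    pc.create_index(
        name="twelve-labs",
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )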

Video embeddings

The generate_embedding function generates embeddings for a video file:

def generate_embedding(video_file, engine="Marengo-retrieval-2.6"):
    """
    Generate embeddings for a video file using Twelve Labs API.
    
    Args:
        video_file (str): Path to the video file
        engine (str): Embedding engine name
        
    Returns:
        tuple: (embeddings, time_ranges, scope) for the video
    """
    # Create an embedding task
    task = twelvelabs_client.embed.task.create(
        engine_name=engine,
        video_file=video_file
    )
    print(f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}")
    
    # Monitor task progress
    def on_task_update(task: EmbeddingsTask):
        print(f"  Status={task.status}")
    
    status = task.wait_for_done(
        sleep_interval=2,
        callback=on_task_update
    )
    print(f"Embedding done: {status}")
    
    # Retrieve results
    task_result = twelvelabs_client.embed.task.retrieve(task.id)
    
    # Extract embeddings and metadata
    embeddings = task_result.float
    time_ranges = task_result.time_ranges
    scope = task_result.scope
    
    return embeddings, time_ranges, scope

For details on creating video embeddings, see the Create video embeddings page.

The ingest_data function stores embeddings in Pinecone:

def ingest_data(video_file, index_name="twelve-labs"):
    """
    Generate embeddings and store them in Pinecone.
    
    Args:
        video_file (str): Path to the video file
        index_name (str): Name of the Pinecone index
    """
    # Generate embeddings
    embeddings, time_ranges, scope = generate_embedding(video_file)
    
    # Connect to Pinecone index
    index = pc.Index(index_name)
    
    # Prepare vectors for upsert
    vectors = []
    for i, embedding in enumerate(embeddings):
        vectors.append({
            "id": f"{video_file}_{i}",
            "values": embedding,
            "metadata": {
                "video_file": video_file,
                "time_range": time_ranges[i],
                "scope": scope
            }
        })
    
    # Upsert vectors to Pinecone
    index.upsert(vectors=vectors)
    print(f"Successfully ingested {len(vectors)} embeddings into Pinecone")

Video search

The search_video_segments function creates a text embedding for the question and performs a similarity search against the embeddings already stored in Pinecone to find relevant video segments:

def search_video_segments(question, index_name="twelve-labs", top_k=5):
    """
    Search for relevant video segments based on a question.
    
    Args:
        question (str): Question text
        index_name (str): Name of the Pinecone index
        top_k (int): Number of results to retrieve
        
    Returns:
        list: Relevant video segments and their metadata
    """
    # Generate text embedding for the question
    question_embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.6",
        text=question
    ).text_embedding.float
    
    # Query Pinecone
    index = pc.Index(index_name)
    query_results = index.query(
        vector=question_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # Process and return results
    results = []
    for match in query_results.matches:
        results.append({
            "score": match.score,
            "video_file": match.metadata["video_file"],
            "time_range": match.metadata["time_range"],
            "scope": match.metadata["scope"]
        })
    
    return results

For details on creating text embeddings, see the Create text embeddings page.
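
A short usage sketch follows; the question is only an example:

# Retrieve the five most relevant segments for a question
segments = search_video_segments("When does the presenter demonstrate the new feature?")
for segment in segments:
    print(f"{segment['video_file']} {segment['time_range']} (score {segment['score']:.3f})")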

Natural language responses

After retrieving relevant video segments, the application uses the Generate API to create natural language responses:

def generate_response(question, video_segments):
    """
    Generate a natural language response using Pegasus.
    
    Args:
        question (str): The user's question
        video_segments (list): Relevant video segments from search
        
    Returns:
        str: Generated response based on video content
    """
    # Prepare context from video segments
    context = []
    for segment in video_segments:
        # Get the video clip based on time range
        video_file = segment["video_file"]
        start_time, end_time = segment["time_range"]
        
        # You can extract the clip or use the metadata directly
        context.append({
            "content": f"Video segment from {video_file}, {start_time}s to {end_time}s",
            "score": segment["score"]
        })
    
    # Generate response using Twelve Labs Generate API
    response = twelvelabs_client.generate.create(
        engine_name="Pegasus-1.0",
        prompt=question,
        contexts=context,
        max_tokens=250
    )
    
    return response.generated_text

For details on generating open-ended texts based on your videos, see the Open-ended text page.

Create a complete Q&A function

The application creates a complete Q&A function by combining search and response generation:

def video_qa(question, index_name="twelve-labs"):
    """
    Complete video Q&A pipeline.
    
    Args:
        question (str): User's question
        index_name (str): Pinecone index name
        
    Returns:
        dict: Response with answer and supporting video segments
    """
    # Find relevant video segments
    video_segments = search_video_segments(question, index_name)
    
    # Generate response using Pegasus
    answer = generate_response(question, video_segments)
    
    return {
        "question": question,
        "answer": answer,
        "supporting_segments": video_segments
    }
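
Calling the pipeline end to end then looks like this; the question below is a placeholder:

# Ask a question against the ingested videos and inspect the supporting segments
result = video_qa("What safety precautions are mentioned in the video?")
print("Answer:", result["answer"])
for segment in result["supporting_segments"]:
    print(f"  {segment['video_file']} {segment['time_range']} (score {segment['score']:.3f})")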

Next steps

After reading this page, you have the following options:

  • Customize and use the example: Use the TwelveLabs_Pinecone_Chat_with_video notebook to understand how the integration works. You can make changes and add functionality to suit your specific use case. Below are a few examples:
    • Training a linear adapter on top of the embeddings to better fit your data (see the sketch after this list).
    • Re-ranking videos using Pegasus when clips from different videos are returned.
    • Adding textual summary data for each video to the Pinecone entries to create a hybrid search system, enhancing accuracy using Pinecone's Metadata capabilities.
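
As an illustration of the first suggestion, a linear adapter can be a single trainable projection applied to question embeddings before they are used to query Pinecone. The PyTorch sketch below is an assumption-heavy starting point rather than part of the integration; the 1024 dimension, identity initialization, and loss choice are placeholders you would tune on your own labeled (question, segment) pairs:

import torch
import torch.nn as nn

# A single linear layer that re-projects 1024-dimensional Marengo embeddings.
# Initializing it as the identity means an untrained adapter changes nothing.
adapter = nn.Linear(1024, 1024, bias=False)
nn.init.eye_(adapter.weight)

optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)
loss_fn = nn.CosineEmbeddingLoss()

def train_step(question_embeddings, segment_embeddings, labels):
    """Run one optimization step on batched (question, segment) pairs.

    labels is a tensor of 1.0 for relevant pairs and -1.0 for irrelevant pairs.
    """
    optimizer.zero_grad()
    projected = adapter(question_embeddings)
    loss = loss_fn(projected, segment_embeddings, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At query time, pass the question embedding through the trained adapter
# before calling index.query().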