Pinecone - Multimodal RAG

Summary: This integration combines TwelveLabs’ Embed and Generate APIs with Pinecone’s hosted vector database to build RAG-based video Q&A applications. It transforms video content into rich embeddings that can be stored, indexed, and queried to extract text answers from unstructured video databases.

Description: The process of performing video-based question answering using TwelveLabs and Pinecone involves the following steps:

  • Generate rich, contextual embeddings from your video content using the Embed API
  • Store and index these embeddings in Pinecone’s vector database
  • Perform semantic searches to find relevant video segments
  • Generate natural language responses using the Generate API

This integration also showcases the difference in developer experience between using the Generate API and a leading open-source model, LLaVA-NeXT-Video, to generate text responses, allowing you to compare the two approaches and select the one that best suits your needs.

Step-by-step guide: Our blog post, Multimodal RAG: Chat with Videos Using TwelveLabs and Pinecone, guides you through the process of creating a RAG-based video Q&A application.

Colab Notebook: TwelveLabs_Pinecone_Chat_with_video.

Integration with TwelveLabs

This section describes how the application uses the TwelveLabs Python SDK with Pinecone to create a video Q&A application. The integration comprises the following main steps (a minimal client-setup sketch follows the list):

  • Video embedding generation using the Embed API
  • Vector database storage and indexing
  • Similarity search for relevant video segments
  • Natural language response generation using the Generate API
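
The snippets in this section refer to two pre-initialized clients, twelvelabs_client and pc. The following is a minimal setup sketch, not part of the original notebook: the environment variable names and index settings are placeholders, the EmbeddingsTask import path may vary by SDK version, and the dimension of 1024 assumes Marengo-retrieval-2.6 embeddings (verify against the Embed API reference).

Python
import os

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask  # used by the progress callback below
from pinecone import Pinecone, ServerlessSpec

# Clients referenced throughout this section (placeholder environment variable names).
twelvelabs_client = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index once; 1024 assumes the Marengo-retrieval-2.6 embedding dimension.
index_name = "twelve-labs"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )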

Video embeddings

The generate_embedding function generates embeddings for a video file:

Python
def generate_embedding(video_file, engine="Marengo-retrieval-2.6"):
    """
    Generate embeddings for a video file using TwelveLabs API.

    Args:
        video_file (str): Path to the video file
        engine (str): Embedding engine name

    Returns:
        tuple: Embeddings and metadata
    """
    # Create an embedding task
    task = twelvelabs_client.embed.task.create(
        engine_name=engine,
        video_file=video_file
    )
    print(f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}")

    # Monitor task progress
    def on_task_update(task: EmbeddingsTask):
        print(f"  Status={task.status}")

    status = task.wait_for_done(
        sleep_interval=2,
        callback=on_task_update
    )
    print(f"Embedding done: {status}")

    # Retrieve results
    task_result = twelvelabs_client.embed.task.retrieve(task.id)

    # Extract embeddings and metadata
    embeddings = task_result.float
    time_ranges = task_result.time_ranges
    scope = task_result.scope

    return embeddings, time_ranges, scope

For details on creating video embeddings, see the Create video embeddings page.
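
As a quick check, you might call the function directly and inspect the result; the file path below is a placeholder.

Python
# Hypothetical usage; "my_video.mp4" is a placeholder path.
embeddings, time_ranges, scope = generate_embedding("my_video.mp4")
print(f"Generated {len(embeddings)} clip embeddings")
print(f"First segment: {time_ranges[0]} (scope: {scope})")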

The ingest_data function stores embeddings in Pinecone:

Python
def ingest_data(video_file, index_name="twelve-labs"):
    """
    Generate embeddings and store them in Pinecone.

    Args:
        video_file (str): Path to the video file
        index_name (str): Name of the Pinecone index
    """
    # Generate embeddings
    embeddings, time_ranges, scope = generate_embedding(video_file)

    # Connect to Pinecone index
    index = pc.Index(index_name)

    # Prepare vectors for upsert
    vectors = []
    for i, embedding in enumerate(embeddings):
        vectors.append({
            "id": f"{video_file}_{i}",
            "values": embedding,
            "metadata": {
                "video_file": video_file,
                "time_range": time_ranges[i],
                "scope": scope
            }
        })

    # Upsert vectors to Pinecone
    index.upsert(vectors=vectors)
    print(f"Successfully ingested {len(vectors)} embeddings into Pinecone")

The search_video_segments function embeds the question text and runs a similarity search against the video embeddings already stored in Pinecone to find the most relevant segments:

Python
def search_video_segments(question, index_name="twelve-labs", top_k=5):
    """
    Search for relevant video segments based on a question.

    Args:
        question (str): Question text
        index_name (str): Name of the Pinecone index
        top_k (int): Number of results to retrieve

    Returns:
        list: Relevant video segments and their metadata
    """
    # Generate text embedding for the question
    question_embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.6",
        text=question
    ).text_embedding.float

    # Query Pinecone
    index = pc.Index(index_name)
    query_results = index.query(
        vector=question_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Process and return results
    results = []
    for match in query_results.matches:
        results.append({
            "score": match.score,
            "video_file": match.metadata["video_file"],
            "time_range": match.metadata["time_range"],
            "scope": match.metadata["scope"]
        })

    return results

For details on creating text embeddings, see the Create text embeddings page.
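
For example, you could retrieve the top matches for a question and print where each one occurs; the question string is illustrative.

Python
# Illustrative query against the previously ingested video(s).
segments = search_video_segments("When does the speaker demo the product?", top_k=3)
for seg in segments:
    start_time, end_time = seg["time_range"]
    print(f"{seg['video_file']} [{start_time}s to {end_time}s] score={seg['score']:.3f}")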

Natural language responses

After retrieving relevant video segments, the application uses the Generate API to create natural language responses:

Python
def generate_response(question, video_segments):
    """
    Generate a natural language response using Pegasus.

    Args:
        question (str): The user's question
        video_segments (list): Relevant video segments from search

    Returns:
        str: Generated response based on video content
    """
    # Prepare context from video segments
    context = []
    for segment in video_segments:
        # Get the video clip based on time range
        video_file = segment["video_file"]
        start_time, end_time = segment["time_range"]

        # You can extract the clip or use the metadata directly
        context.append({
            "content": f"Video segment from {video_file}, {start_time}s to {end_time}s",
            "score": segment["score"]
        })

    # Generate response using the TwelveLabs Generate API
    response = twelvelabs_client.generate.create(
        engine_name="Pegasus-1.0",
        prompt=question,
        contexts=context,
        max_tokens=250
    )

    return response.generated_text

For details on generating open-ended texts from videos, see the Open-ended text page.

Create a complete Q&A function

The application creates a complete Q&A function by combining search and response generation:

Python
def video_qa(question, index_name="twelve-labs"):
    """
    Complete video Q&A pipeline.

    Args:
        question (str): User's question
        index_name (str): Pinecone index name

    Returns:
        dict: Response with answer and supporting video segments
    """
    # Find relevant video segments
    video_segments = search_video_segments(question, index_name)

    # Generate response using Pegasus
    answer = generate_response(question, video_segments)

    return {
        "question": question,
        "answer": answer,
        "supporting_segments": video_segments
    }
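
Putting it together, a single call runs retrieval and response generation end to end; the question below is a placeholder.

Python
# Illustrative end-to-end call.
result = video_qa("What topics are covered in the video?")
print("Q:", result["question"])
print("A:", result["answer"])
for seg in result["supporting_segments"]:
    print(f"  supporting clip: {seg['video_file']} at {seg['time_range']}")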

Next steps

After reading this page, you have the following options:

  • Customize and use the example: Use the TwelveLabs_Pinecone_Chat_with_video notebook to understand how the integration works. You can make changes and add functionalities to suit your specific use case. Below are a few examples:
    • Training a linear adapter on top of the embeddings to better fit your data.
    • Re-ranking videos using Pegasus when clips from different videos are returned.
    • Adding textual summary data for each video to the Pinecone entries to create a hybrid search system, enhancing accuracy using Pinecone’s Metadata capabilities (see the sketch after this list).
  • Explore further: Try the applications built by the community or our sample applications to get more insights into the TwelveLabs Video Understanding Platform’s diverse capabilities and learn more about integrating the platform into your applications.
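
As an illustration of the hybrid-search idea above, the sketch below combines vector similarity with a Pinecone metadata filter. It is a minimal sketch, not part of the notebook: it assumes a hypothetical "topics" list was added to each vector's metadata at ingest time, and the question text and filter values are placeholders.

Python
# Hypothetical hybrid query: vector similarity plus a metadata filter.
index = pc.Index("twelve-labs")

question_embedding = twelvelabs_client.embed.create(
    engine_name="Marengo-retrieval-2.6",
    text="How is pricing explained?"
).text_embedding.float

filtered_results = index.query(
    vector=question_embedding,
    top_k=5,
    include_metadata=True,
    filter={"topics": {"$in": ["pricing"]}},  # hypothetical metadata field added at ingest time
)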