Generate text from video

The Generate API suite generates text from your videos. Unlike conventional models limited to unimodal interpretation (for example, summarizing a video from its transcription alone), the Generate API suite takes a multimodal approach: it analyzes the whole context of a video, including visuals, sounds, spoken words, and text, and how they relate to one another. This gives a holistic understanding of your videos and captures nuances that a unimodal interpretation might miss. The model then uses its video-to-text generative capabilities to translate this understanding into the textual representation you want.

The Generate API suite has diverse applications across various sectors:

  • E-learning platforms:
    • Content structuring: Automatically transform lengthy educational videos into well-structured chapters, making it easier for students to navigate and review specific clips.
    • Course summaries: Generate concise summaries of lectures, helping students quickly recall key points without rewatching entire videos.
  • Content creators and marketers:
    • SEO optimization: Suggest SEO-friendly tags for your videos, improving discoverability on search engines like Google or Bing.
    • Social media amplification: Create engaging headlines and descriptions for your videos, increasing reach and engagement on platforms like YouTube, Instagram, and TikTok.
    • Content personalization: Efficiently generate personalized messages and calls to action tailored to different user personas and geographic locations, extending your video content's reach to a broader audience.
  • Media, sports, and broadcasting agencies:
    • Highlight creation: Automatically pinpoint and extract key moments or highlights from long broadcasts, ideal for news segments, sports events, or entertainment shows.
    • Archival and cataloging: Classify and tag vast video archives through segmentation, making it easier to locate and repurpose content.
  • Enterprise knowledge management:
    • Customer or employee meetings: Quickly summarize essential information from an event, such as key actions, the list of attendees, and other main points, and repurpose video content into other formats such as emails, reports, and meeting minutes.
    • Internal training videos: Organize internal training materials into chapters and summaries. Additionally, integrating with the Search API enables employees to discover content effortlessly.
  • Retail and manufacturing:
    • On-site, in-store management: Enable store managers to generate reports and extract insights about key activities directly from video footage, eliminating the need to review CCTV recordings manually.
  • Security and law enforcement:
    • Incident and police reports: Identify and summarize key events and their start and end times from Body-Worn Camera (BWC) footage to pinpoint unusual scenes and expedite report generation.

The Generate API suite offers three endpoints, each designed with a different level of flexibility and customization.

The /gist endpoint

  • Function: Generates topics, titles, and hashtags.
  • Customization: Uses predefined formats.
  • Prompt: Not accepted.
  • Best use: To generate an immediate and straightforward text representation without specific customization.

The /summarize endpoint

  • Function: Generates summaries, chapters, and highlights.
  • Customization: Operates primarily on predefined formats, similar to /gist. However, you can provide a custom prompt that guides the model on how to generate the output.
  • Prompt: Optional. While you can invoke this endpoint without a prompt, providing one allows for tailored outputs.
  • Best use: To balance the efficiency of predefined formats with the option of custom prompts.

The /generate endpoint

  • Function: Generates open-ended texts from videos.
  • Customization: Relies solely on user-defined prompts, ensuring maximum flexibility.
  • Prompt: Required. You must provide clear instructions to guide the model.
  • Best use: For advanced users whose output requirements go beyond the predefined formats.
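
The prompt rules for the three endpoints can be summarized in a small client-side check. The following is a minimal sketch, not part of the platform's SDK: the names PROMPT_POLICY and check_prompt are hypothetical, and the code only encodes the rules stated above (no prompt for /gist, optional for /summarize, required for /generate). It makes no network calls.

```python
from typing import Optional

# Prompt policy per endpoint, taken from the descriptions above.
# The policy labels ("none", "optional", "required") are illustrative names.
PROMPT_POLICY = {
    "/gist": "none",          # predefined formats only
    "/summarize": "optional", # predefined formats, prompt may guide output
    "/generate": "required",  # open-ended, driven entirely by the prompt
}


def check_prompt(endpoint: str, prompt: Optional[str] = None) -> bool:
    """Return True if the prompt usage is valid for the given endpoint.

    Hypothetical helper: raises KeyError for an endpoint not listed above.
    """
    policy = PROMPT_POLICY[endpoint]
    # Treat None, empty, and whitespace-only strings as "no prompt".
    has_prompt = bool(prompt and prompt.strip())
    if policy == "none":
        return not has_prompt
    if policy == "required":
        return has_prompt
    return True  # "optional": valid with or without a prompt
```

For example, check_prompt("/generate") returns False because that endpoint needs a prompt, while check_prompt("/summarize") returns True with or without one.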



  • Your prompts can be instructive or descriptive, or you can phrase them as questions.
  • The platform generates text according to the engine options enabled for your index, which determine the types of information the video understanding engine processes.
    • If both the visual and conversation engine options are enabled, the platform generates text based on both visual and conversational information.
    • If only the visual option is enabled, the platform generates text based only on visual information.
  • The maximum length of a prompt is 1500 characters.
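
These constraints can be checked before a request is sent. The sketch below is illustrative only: validate_prompt and information_sources are hypothetical helper names, the option strings "visual" and "conversation" are assumed labels for the engine options mentioned above, and counting characters with len() (Unicode code points) is an assumption about how the platform measures prompt length. The case where only the conversation option is enabled is not covered by the text above; the helper simply reports whatever is enabled.

```python
from typing import Iterable, List

# Maximum prompt length stated above.
MAX_PROMPT_CHARS = 1500


def validate_prompt(prompt: str) -> str:
    """Return the prompt unchanged, or raise ValueError if it is too long.

    Assumption: length is measured in Unicode code points via len().
    """
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"Prompt has {len(prompt)} characters; "
            f"the maximum is {MAX_PROMPT_CHARS}."
        )
    return prompt


def information_sources(engine_options: Iterable[str]) -> List[str]:
    """List the kinds of information the generated text would draw on.

    Hypothetical helper: keeps only the two options named in this section,
    in a fixed order, and ignores anything else.
    """
    enabled = set(engine_options)
    return [name for name in ("visual", "conversation") if name in enabled]
```

With both options enabled, information_sources(["visual", "conversation"]) lists both sources; with only "visual" enabled, it lists visual information alone, matching the behavior described above.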