Navigate to the section that best addresses your query. If you don't find an answer to your question, please contact us.
This section answers frequently asked general questions.
We trained our foundation model on a few hundred million video-text pairs, currently one of the largest video datasets in the world. The dataset comprises information scraped from the internet and open-source academic benchmarks.
We have a valuable partnership with Oracle Cloud Infrastructure (OCI) for both compute and data storage. We conduct all of our training on OCI, and we store a large number of video-text pairs on OCI's Object Storage platform.
We use positional encoding, a technique employed in the Transformer architecture to convey the position of each token within the input sequence. In this case, the tokens are the key scenes within the video. Positional encoding integrates sequential information into our model while preserving the parallel processing capability of self-attention in the Transformer architecture.
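As a concrete illustration, below is a minimal NumPy sketch of the classic sinusoidal positional encoding from the original Transformer paper ("Attention Is All You Need"). This is a generic textbook formulation, not our platform's actual implementation; the function name and dimensions are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(num_positions)[:, np.newaxis]  # (num_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    # Each pair of dimensions shares a frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return encoding

# Example: encode the order of 16 key scenes as 512-dimensional vectors,
# which would be added to the scene embeddings before self-attention.
scene_order = sinusoidal_positional_encoding(num_positions=16, d_model=512)
print(scene_order.shape)  # (16, 512)
```

Because the encoding is simply added to each token's embedding, self-attention can still process all positions in parallel while each token carries information about its place in the sequence.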
The Developer plan accommodates up to 1,000 hours of video, whether in a single index or across all indexes combined. For larger volumes, our Enterprise plan is better suited. Please contact us for more information at [email protected].
Yes. The visual option, when configuring our engine, covers both visual and audio information. This means the model considers sounds such as gunshots, honking, trains, thunder, and more. Note that the model learns correlations between visual objects or situations and the sounds that frequently accompany them.
We are working on supporting multilingual queries.
The platform takes a multimodal approach to video understanding. Instead of relying solely on textual input like traditional LLMs, it interprets visuals, sounds, and spoken words to deliver comprehensive and accurate results.
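To make this concrete, the sketch below shows one common way multimodal systems combine per-modality signals: each modality is embedded into a shared vector space, and the embeddings are fused into a single video representation that a text query can be matched against. Everything here (the dimension, the late-fusion averaging, the variable names) is an illustrative assumption, not a description of the platform's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # hypothetical shared embedding dimension

# Stand-ins for learned per-modality encoders: each maps raw input
# to a D-dimensional embedding in a shared space.
visual_embedding = rng.standard_normal(D)   # from video frames
audio_embedding = rng.standard_normal(D)    # from the soundtrack
speech_embedding = rng.standard_normal(D)   # from transcribed speech

def fuse(*embeddings: np.ndarray) -> np.ndarray:
    """Late-fusion sketch: average the modality embeddings and re-normalize."""
    fused = np.stack(embeddings).mean(axis=0)
    return fused / np.linalg.norm(fused)

video_representation = fuse(visual_embedding, audio_embedding, speech_embedding)

# A text query embedded into the same space can then be compared against
# the fused video representation, e.g. by cosine similarity.
query_embedding = rng.standard_normal(D)
query_embedding /= np.linalg.norm(query_embedding)
similarity = float(video_representation @ query_embedding)
print(f"cosine similarity: {similarity:.3f}")
```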
We are working on integrating with other LLMs so you can use the LLM of your choice.
This section answers frequently asked questions related to the Generate API Suite.
The Generate API suite employs our foundation Visual Language Model (VLM), which pairs an encoder that extracts multimodal information from videos with a language decoder that generates concise text representations.
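The PyTorch sketch below illustrates the general encoder-decoder shape described above: an encoder ingests video features, and a decoder produces text-token logits conditioned on them. The class name, layer sizes, and structure are hypothetical stand-ins, not our production architecture.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy encoder-decoder sketch: video features in, text-token logits out.

    Purely illustrative; all sizes and layer choices are assumptions.
    """
    def __init__(self, feature_dim=512, d_model=256, vocab_size=1000):
        super().__init__()
        self.project = nn.Linear(feature_dim, d_model)  # map video features into model space
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_model, vocab_size)   # predict the next text token

    def forward(self, video_features, text_tokens):
        # Encode multimodal video features, then decode text conditioned on them
        # (causal masking omitted for brevity).
        memory = self.encoder(self.project(video_features))
        hidden = self.decoder(self.token_embed(text_tokens), memory)
        return self.lm_head(hidden)                     # (batch, seq, vocab)

# Example: one video of 32 feature vectors, decoding over a 10-token prefix.
model = TinyVLM()
logits = model(torch.randn(1, 32, 512), torch.randint(0, 1000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1000])
```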