See the unseen. Know the unknowable.
#1
Video-MME benchmark (>30 min)
+17%
Pegasus 1.5 over Gemini 3.1 Pro
10x
Faster content review and compliance scanning.
4 hrs
Single video, one API call
Results in minutes.
Infrastructure
Ingest multimodal data through a single pipeline at ~60x real-time speed. Index an hour of video in a minute. 10k+ hours per day.
API + SDK
MCP
Integrations

Video-native perception, reasoning, and orchestration
Multimodal Embedding Model. You can't search what you can't see. Marengo turns video into data: spatiotemporal embeddings that make every moment findable by what's actually in it, not metadata someone typed. One index. Every modality. 78.5% composite accuracy. 47 languages.
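The retrieval idea above can be sketched with plain cosine similarity over per-moment embedding vectors. Everything here is illustrative: the toy 4-dimensional vectors and segment labels are made up, and a real index would come from Marengo rather than hand-written numbers.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-moment embeddings for one video (toy 4-d vectors).
segments = {
    "00:00-00:10 goal celebration": [0.9, 0.1, 0.0, 0.2],
    "00:10-00:20 crowd noise":      [0.1, 0.8, 0.3, 0.0],
    "00:20-00:30 interview":        [0.0, 0.2, 0.9, 0.1],
}

def search(query_vec, index):
    # Rank every moment by similarity to the query embedding,
    # so results come from what is in the frame, not typed metadata.
    return sorted(index, key=lambda k: cosine(query_vec, index[k]), reverse=True)

query = [0.85, 0.15, 0.05, 0.25]  # hypothetical embedding of "player scores"
print(search(query, segments)[0])
```

The ranking is purely geometric: the moment whose embedding points in nearly the same direction as the query wins, with no keywords involved.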

Video Language Model. General-purpose models sample frames and guess. Pegasus reasons continuously over the full temporal arc of any asset, up to two hours: tracking entities, causation, and narrative across time. Not a transcript reader.

# Step 1: Import the requests library
import requests

# Step 2: Define the API URL and the specific endpoint
API_URL = "https://api.twelvelabs.io/v1.3"
INDEXES_URL = f"{API_URL}/indexes"

# Step 3: Create the necessary headers for authentication
headers = {
    "x-api-key": "<YOUR_API_KEY>"
}

# Step 4: Prepare the data payload for your API request
INDEX_NAME = "<YOUR_INDEX_NAME>"
data = {
    "index_name": INDEX_NAME,
    "models": [
        {
            "model_name": "marengo3.0",
            "model_options": ["visual", "audio"]
        }
    ]
}

# Step 5: Send the request to create the index
response = requests.post(INDEXES_URL, headers=headers, json=data)
print(response.json())
Others process video. We comprehend it.
GEMINI (AND GENERAL MULTIMODAL LLMS)
2-minute video cap, 80s audio cap
Gemini's embedding API caps at 2 minutes per video and 80 seconds per audio clip. Anything longer must be chunked manually — destroying temporal context.
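What "chunked manually" means in practice can be sketched in a few lines. Durations are in seconds, and the 120-second cap is the limit cited above; the splitting logic itself is generic, not any vendor's API.

```python
def chunk(duration_s, cap_s=120):
    # Split a video into back-to-back windows no longer than the cap.
    chunks = []
    start = 0
    while start < duration_s:
        end = min(start + cap_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 30-minute video becomes 15 isolated 2-minute clips. An action that
# spans a chunk boundary is embedded with no knowledge of its other half,
# which is how temporal context gets destroyed.
print(len(chunk(30 * 60)))  # → 15
```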
Cross-modal embedding collapses
12 distinct audio genres → 1 video result in cross-modal retrieval. Similarity scores cluster between 0.30–0.41 for everything. Calibration fails: unrelated content scores identically to matches.
30.8% structured output failure rate
On news content, Gemini 3.1 Pro fails to produce valid structured JSON nearly a third of the time. Production pipelines require constant fallback handling and retries.
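The fallback handling described above typically looks like a validate-and-retry loop. In this sketch, `call_model` is a hypothetical stand-in for any LLM call that is supposed to return JSON; it deliberately returns a malformed string on the first attempt to simulate the failure mode.

```python
import json

def call_model(prompt, attempt):
    # Stand-in for a structured-output LLM call (hypothetical): fails to
    # produce valid JSON on the first attempt, succeeds afterwards.
    return "not json" if attempt == 0 else '{"headline": "ok"}'

def structured_call(prompt, max_retries=3):
    # Retry until the response parses as valid JSON, else give up.
    for attempt in range(max_retries):
        raw = call_model(prompt, attempt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # fallback path: retry the request
    raise RuntimeError("no valid structured output after retries")

print(structured_call("summarize the clip"))  # {'headline': 'ok'}
```

At a ~30% failure rate, a meaningful share of calls burns at least one retry, which is the pipeline cost the claim above refers to.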
TWELVELABS
Intelligence built at ingest, not at query time
Marengo and Pegasus comprehend every asset the moment it is indexed — building embeddings, structured metadata, and entity relationships that persist. Every query draws on pre-built knowledge, not live inference.
Knowledge that compounds across every asset
Every new asset deepens the knowledge graph. Entity relationships discovered in one video inform retrieval across the entire archive. The system gets smarter with every hour indexed — without you doing anything.
Model upgrades re-process your entire library
When Marengo or Pegasus improves, every asset you've ever indexed automatically gets smarter — no re-upload, no re-indexing, no engineering effort. Historical content becomes more intelligent over time.