See the unseen. Know the unknowable.
#1
Video-MME benchmark (>30 min)
+17%
Pegasus 1.5 over Gemini 3.1 Pro
10x
Faster content review and compliance scanning.
4 hrs
Single video, one API call
Results in minutes.
Infrastructure
Ingest multimodal data through a single pipeline at ~60x real-time speed. Index an hour of video in a minute. 10k+ hours per day.
API + SDK
MCP
Integrations

Video-native perception, reasoning, and orchestration
Multimodal Embedding Model. You can't search what you can't see. Marengo turns video into data: spatiotemporal embeddings that make every moment findable by what's actually in it, not metadata someone typed. One index. Every modality. 78.5% composite accuracy. 47 languages.
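The retrieval idea above can be sketched with plain cosine similarity over per-moment embedding vectors. Everything here is illustrative: the toy 4-dimensional vectors and segment labels are made up, and a real index would come from Marengo rather than hand-written numbers.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-moment embeddings for one video (toy 4-d vectors).
segments = {
    "00:00-00:10 goal celebration": [0.9, 0.1, 0.0, 0.2],
    "00:10-00:20 crowd noise":      [0.1, 0.8, 0.3, 0.0],
    "00:20-00:30 interview":        [0.0, 0.2, 0.9, 0.1],
}

def search(query_vec, index):
    # Rank every moment by similarity to the query embedding,
    # so results come from what is in the frame, not typed metadata.
    return sorted(index, key=lambda k: cosine(query_vec, index[k]), reverse=True)

query = [0.85, 0.15, 0.05, 0.25]  # hypothetical embedding of "player scores"
print(search(query, segments)[0])
```

The ranking is purely geometric: the moment whose embedding points in nearly the same direction as the query wins, with no keywords involved.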

Video Language Model. General-purpose models sample frames and guess. Pegasus reasons continuously over the full temporal arc of any asset, up to two hours: tracking entities, causation, and narrative across time. Not a transcript reader.

# Step 1: Import the requests library
import requests

# Step 2: Define the API URL and the specific endpoint
API_URL = "https://api.twelvelabs.io/v1.3"
INDEXES_URL = f"{API_URL}/indexes"

# Step 3: Create the necessary headers for authentication
headers = {
    "x-api-key": "<YOUR_API_KEY>"
}

# Step 4: Prepare the data payload for your API request
INDEX_NAME = "<YOUR_INDEX_NAME>"
data = {
    "index_name": INDEX_NAME,
    "models": [
        {
            "model_name": "marengo3.0",
            "model_options": ["visual", "audio"]
        }
    ]
}

# Step 5: Send the request to create the index
response = requests.post(INDEXES_URL, headers=headers, json=data)
print(response.json())
Others process video. We comprehend it.
GEMINI (AND GENERAL MULTIMODAL LLMS)
2-minute video cap, 80s audio cap
Gemini's embedding API caps at 2 minutes per video and 80 seconds per audio clip. Anything longer must be chunked manually — destroying temporal context.
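What "chunked manually" means in practice can be sketched in a few lines. Durations are in seconds, and the 120-second cap is the limit cited above; the splitting logic itself is generic, not any vendor's API.

```python
def chunk(duration_s, cap_s=120):
    # Split a video into back-to-back windows no longer than the cap.
    chunks = []
    start = 0
    while start < duration_s:
        end = min(start + cap_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 30-minute video becomes 15 isolated 2-minute clips. An action that
# spans a chunk boundary is embedded with no knowledge of its other half,
# which is how temporal context gets destroyed.
print(len(chunk(30 * 60)))  # → 15
```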
Cross-modal embedding collapses
12 distinct audio genres → 1 video result in cross-modal retrieval. Similarity scores cluster between 0.30–0.41 for everything. Calibration fails: unrelated content scores identically to matches.
30.8% structured output failure rate
On news content, Gemini 3.1 Pro fails to produce valid structured JSON nearly a third of the time. Production pipelines require constant fallback handling and retries.
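The fallback handling described above typically looks like a validate-and-retry loop. In this sketch, `call_model` is a hypothetical stand-in for any LLM call that is supposed to return JSON; it deliberately returns a malformed string on the first attempt to simulate the failure mode.

```python
import json

def call_model(prompt, attempt):
    # Stand-in for a structured-output LLM call (hypothetical): fails to
    # produce valid JSON on the first attempt, succeeds afterwards.
    return "not json" if attempt == 0 else '{"headline": "ok"}'

def structured_call(prompt, max_retries=3):
    # Retry until the response parses as valid JSON, else give up.
    for attempt in range(max_retries):
        raw = call_model(prompt, attempt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # fallback path: retry the request
    raise RuntimeError("no valid structured output after retries")

print(structured_call("summarize the clip"))  # {'headline': 'ok'}
```

At a ~30% failure rate, a meaningful share of calls burns at least one retry, which is the pipeline cost the claim above refers to.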
TWELVELABS
Intelligence built at ingest, not at query time
Marengo and Pegasus comprehend every asset the moment it is indexed — building embeddings, structured metadata, and entity relationships that persist. Every query draws on pre-built knowledge, not live inference.
Knowledge that compounds across every asset
Every new asset deepens the knowledge graph. Entity relationships discovered in one video inform retrieval across the entire archive. The system gets smarter with every hour indexed — without you doing anything.
Model upgrades re-process your entire library
When Marengo or Pegasus improves, every asset you've ever indexed automatically gets smarter — no re-upload, no re-indexing, no engineering effort. Historical content becomes more intelligent over time.