Research-first and human-centered.
Our brains continually process sensory input – helping us understand what has happened and predict what might happen next. This ability, known as perceptual reasoning, forms the basis of human intelligence.
AI, as deployed so far, has bypassed a crucial learning step: building a robust representation of the world from video – the medium that most closely resembles the sensory input that gives rise to human perception.
At TwelveLabs, we’re bridging this gap by training cutting-edge foundation models to learn rich, multimodal representations from video data, then using these representations for high-level reasoning tasks involving language.
Through video-native AI, we’re helping machines learn about the world – and enabling humans to better capture, retrieve, and tell their visual stories.
Perception: Capturing the sensory details through a video-native encoder
Our video-native encoder model, Marengo, is the embodiment of perception. Just as the human sensory organs excel at capturing the world's visual and auditory detail, Marengo analyzes visual frames and their temporal relationships, along with speech and sound – ensuring a thorough understanding of both visual and auditory elements.
This context-aware, video-native representation encoder serves as the foundation for our perceptual reasoning pipeline.
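To make the idea concrete, here is a minimal, hypothetical sketch of what a video-native encoder does in general: per-frame visual features, their temporal order, and an audio track are fused into a single clip-level embedding. The class name, layer choices, and dimensions are illustrative assumptions, not Marengo's actual architecture.

```python
# Illustrative sketch only: a toy "video-native encoder" that fuses visual frames,
# their temporal relationships, and an audio track into one clip-level embedding.
# All names and dimensions are hypothetical; this is NOT Marengo's architecture.
import torch
import torch.nn as nn

class ToyVideoEncoder(nn.Module):
    def __init__(self, frame_dim=512, audio_dim=128, embed_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, embed_dim)               # per-frame visual features
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)  # temporal relationships
        self.audio_proj = nn.Linear(audio_dim, embed_dim)               # speech / sound features
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)                 # joint multimodal representation

    def forward(self, frames, audio):
        # frames: (batch, num_frames, frame_dim); audio: (batch, audio_dim)
        v = self.frame_proj(frames)
        _, h = self.temporal(v)      # summarize the frame sequence over time
        clip_visual = h[-1]          # (batch, embed_dim)
        a = self.audio_proj(audio)
        return self.fuse(torch.cat([clip_visual, a], dim=-1))

# Usage with random tensors standing in for real frame/audio feature extractors.
encoder = ToyVideoEncoder()
clip_embedding = encoder(torch.randn(1, 16, 512), torch.randn(1, 128))
print(clip_embedding.shape)  # torch.Size([1, 256])
```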
Reasoning: Inducing the perceptual reasoning capability through video and language alignment
True video understanding requires the ability to reason about what is perceived. This is where our video-language model, Pegasus, comes into play.
Pegasus merges the reasoning skills learned from large language models (text data) with the perceptual understanding gained from our video encoder model (video data). By aligning these two modalities, Pegasus can perform cross-modal reasoning, inferring meaning and intent from Marengo's rich, multimodal representations.
It’s the synergy between Marengo and Pegasus – the alignment of video and language – that enables perceptual reasoning capabilities in our AI systems. Building on the strengths of both models, we can develop systems that not only perceive and understand the visual world, but also reason about it in a way that resembles human cognition.
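For illustration, the sketch below shows one common pattern for video-language alignment: clip embeddings from a video encoder are projected into a language model's token-embedding space and prepended to the text prompt, so the language model can reason over both. This is a generic, assumed pattern for cross-modal alignment, not a description of Pegasus's actual design; all names and dimensions are hypothetical.

```python
# Illustrative sketch only: align a video encoder with a language model by projecting
# clip embeddings into the LLM's token-embedding space and prepending them to the text.
# This is a generic alignment pattern, not Pegasus's actual design.
import torch
import torch.nn as nn

class ToyVideoLanguageAdapter(nn.Module):
    def __init__(self, video_dim=256, llm_dim=1024, num_video_tokens=8):
        super().__init__()
        # Map one clip embedding to a short sequence of "video tokens" the LLM can read.
        self.proj = nn.Linear(video_dim, llm_dim * num_video_tokens)
        self.num_video_tokens = num_video_tokens
        self.llm_dim = llm_dim

    def forward(self, clip_embedding, text_token_embeddings):
        # clip_embedding: (batch, video_dim); text_token_embeddings: (batch, seq, llm_dim)
        video_tokens = self.proj(clip_embedding).view(-1, self.num_video_tokens, self.llm_dim)
        # The concatenated sequence is what a frozen or fine-tuned LLM would consume.
        return torch.cat([video_tokens, text_token_embeddings], dim=1)

adapter = ToyVideoLanguageAdapter()
fused = adapter(torch.randn(2, 256), torch.randn(2, 12, 1024))
print(fused.shape)  # torch.Size([2, 20, 1024])
```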
Our science team has focused on video and language throughout their careers, earning 5+ wins in global competitions and 100+ publications at top AI conferences.
Rethinking how an AI thinks.
Marengo 3.0: Video Intelligence Turns Video Into Strategic Assets
Context Engineering for Video Understanding
Video Intelligence is Going Agentic
Pegasus 1.2: An Industry-Grade Video Language Model for Scalable Applications
The State of Video-Language Models: Research Insights from the Inaugural NeurIPS Workshop
Marengo 2.7: Pioneering Multi-Vector Embeddings for Advanced Video Understanding
TwelveLabs Embed API Beta
Accelerate Your Film Production with Twelve Labs
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
Jockey: A Conversational Video Agent Powered by TwelveLabs APIs and LangGraph
Semantic Content Discovery for a Post-Production World
Pegasus 1 Beta: Setting New Standards in Video-Language Modeling
Marengo 2.6: A State-of-the-Art Video Foundation Model for Any-to-Any Search
Introducing Video-To-Text and Pegasus-1 (80B)
A Tour of Video Understanding Use Cases
The Multimodal Evolution of Vector Embeddings
What Is Multimodal AI?
The Past, Present, and Future of Video Understanding Applications
What makes Foundation Models special?
Foundation models are going multimodal