We raised $100M to build Video Superintelligence

Jae Lee
TwelveLabs raised $100M to scale the Video Cognition System, the foundation for Video Superintelligence. Co-led by NEA and NAVER Ventures.

July 1, 2026
The Bet
Five years ago, we began with a simple observation:
The world does not happen in text. It happens in motion.
Language is how humans compress reality after the fact. It is powerful and useful, but lossy. A sentence can describe a dog, a goal, a collision, a surgery, a factory defect, or a crime scene. But before any description exists, there is sensory evidence: shape, motion, sound, and sequence. Signals that change over time.
Most AI systems are trained primarily on the compression.
We chose to build on the signal.
Our hypothesis was contrarian at the time: if intelligence is going to understand the physical world, it needs a native representation of video. Not a language model glancing at a few sampled frames. Not metadata wrapped around footage. Not captions pretending to be perception. A system that can perceive, index, retrieve, reason over, and eventually act on recorded reality.
Video is not just another modality. It is the closest digital record we have of reality as it unfolds. It contains space, time, sound, objects, people, intent, context, and consequence.
A caption may say the glass broke. Video preserves the seconds before and after: the hand moving, the object falling, the sound of impact, the reaction that follows. Causality lives in sequence. Video is sequence.
That is the bet TwelveLabs has been making.
What We Built
We built the company around three technical beliefs.
First, perception. A machine needs to turn raw video into meaning without flattening it into text too early. Our embedding model, Marengo, maps visual, audio, speech, and on-screen text signals into one searchable representation. Our video-language model, Pegasus, turns those representations into grounded descriptions, answers, and summaries.
Second, memory. A model that only looks at video at query time does not really know the archive. It is performing a temporary inspection. We built the opposite architecture. When video enters the system, it is understood once, converted into a durable representation, and kept addressable at the exact second of the exact file. The archive stops being passive storage. It becomes machine-readable memory.
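As a rough illustration of the index-once idea, and not a description of the actual TwelveLabs implementation, the sketch below stores each segment with its file name and time range at ingest, then answers later queries from the stored representation alone. The names `Segment`, `VideoMemory`, `ingest`, and `search` are hypothetical, and the three-number embeddings are hand-written placeholders standing in for whatever a real embedding model would produce.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One indexed span of video, addressable by file and second."""
    file: str
    start_s: float
    end_s: float
    embedding: Tuple[float, ...]  # placeholder for a real model's output


def cosine(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VideoMemory:
    """Toy in-memory archive: segments are embedded once, queried many times."""

    def __init__(self) -> None:
        self._segments: List[Segment] = []

    def ingest(self, segment: Segment) -> None:
        # The video is "understood" once here; queries never touch raw footage.
        self._segments.append(segment)

    def search(self, query: Tuple[float, ...], top_k: int = 3) -> List[Segment]:
        # Rank stored segments by similarity to the query embedding.
        ranked = sorted(
            self._segments,
            key=lambda s: cosine(query, s.embedding),
            reverse=True,
        )
        return ranked[:top_k]


memory = VideoMemory()
memory.ingest(Segment("match_01.mp4", 0.0, 10.0, (1.0, 0.0, 0.0)))
memory.ingest(Segment("match_01.mp4", 10.0, 20.0, (0.0, 1.0, 0.0)))
memory.ingest(Segment("factory_cam.mp4", 30.0, 40.0, (0.9, 0.1, 0.0)))

for hit in memory.search((1.0, 0.0, 0.0), top_k=2):
    print(hit.file, hit.start_s, hit.end_s)
```

Every hit carries its own file and time range, which is what lets an answer point back to the exact second of the exact file. A production system would presumably replace the linear scan with an approximate nearest-neighbor index.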
Third, reasoning. Many important questions are not located in one clip. They are distributed across time. What changed across a season? Which pattern appeared before a failure? How did coverage of an event evolve across hundreds of broadcasts? Which moments matter in a library too large for any human team to watch?
Answering these questions requires a system that can search, gather evidence, compare events, and return conclusions grounded in the source footage.
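One of the questions above, which pattern appeared before a failure, can be sketched as a small evidence-gathering step. This is a minimal illustration under stated assumptions: the `Event` records are imagined as the output of an earlier search pass, and the function name `precursors`, the labels, and the file names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """A detected moment, tied to its source file and timestamp."""
    file: str
    time_s: float
    label: str


def precursors(
    events: List[Event], target_label: str, window_s: float
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each non-target label to the (file, time_s) moments where it occurred
    within window_s seconds before a target event in the same file."""
    by_file: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        by_file[event.file].append(event)

    evidence: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for file_events in by_file.values():
        targets = [e for e in file_events if e.label == target_label]
        for target in targets:
            for e in file_events:
                gap = target.time_s - e.time_s
                if e.label != target_label and 0 < gap <= window_s:
                    evidence[e.label].append((e.file, e.time_s))
    return dict(evidence)


events = [
    Event("line_a.mp4", 100.0, "vibration"),
    Event("line_a.mp4", 130.0, "failure"),
    Event("line_b.mp4", 40.0, "vibration"),
    Event("line_b.mp4", 55.0, "smoke"),
    Event("line_b.mp4", 60.0, "failure"),
    Event("line_c.mp4", 10.0, "vibration"),  # no failure follows in this file
]

for label, moments in precursors(events, "failure", window_s=30.0).items():
    print(label, moments)
```

The conclusion (vibration preceded both failures) is returned together with the file and second of each supporting moment, so it stays grounded in the source footage rather than in a summary of it.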
Perception, memory, and reasoning form a loop.
We call this a Video Cognition System. It is not a model demo. It is an architecture for making video computational.
Why Now
This matters because the last decade of AI made text programmable. Language models turned words into tokens, and tokens became the semantic data layer for agents. Documents became context. Chats became workflows. Code became executable knowledge.
Video has not had that moment yet.
The world’s video is still mostly dark matter to machines. It sits in archives, camera systems, broadcasts, films, meetings, factories, hospitals, stadiums, drones, and satellites. It contains an enormous amount of human and physical information, but most of it is still accessed through filenames, folders, captions, transcripts, and human memory.
The richest record of reality is still largely outside the semantic layer that modern AI systems use.
We are changing that. Our goal is to make every second of video addressable, searchable, and usable by agents.
That is the path from video understanding to Video Superintelligence.
The Round
We raised $100 million to accelerate this work: to advance Marengo and Pegasus, to scale the Video Cognition System into the world’s most important video archives, and to build the team that can turn recorded reality into a substrate for AI.
The round was co-led by NEA and NAVER Ventures, with participation from Amazon, alongside Radical Ventures, Korea Investment Partners, Index Ventures, Quadrille Capital, and Red Bull Ventures. Many of them backed us when this thesis still looked unrealistic. They came back because the system is now real.
Come Build With Us
We are hiring researchers, engineers, product builders, and operators who believe the next frontier of AI will not be limited to what humans have written down. It will be built from what actually happened.
The road to Video Superintelligence starts here.
— Jae Lee, CEO and Co-founder, TwelveLabs