Twelve Labs

Prologue: Launching the Twelve Labs Korea Team Blog

유하니

Welcome to the Twelve Labs Korean blog. Here we share the stories of our people and the technology behind building video AI, bringing you firsthand the challenges our Science, MLE, and Engineering teams face every day and the technical approaches they take.

2026. 3. 20.

3 min read

Video is the richest format of data ever created. Text records events, and audio captures moments. But video is different. It shows space and time together, capturing what happened, in what order, and in what context.

Yet AI still struggles to truly understand video. What happens when you put a video into an LLM? It tries to read the video by breaking it down like text, which is like trying to understand a movie just by reading the subtitles. In that process, motion, causality, and the flow of time disappear, and these are the very things that make video what it is. Video understanding is not an extension of text understanding; it is a problem that must be approached differently from the ground up. This is why Twelve Labs exists.

Twelve Labs is a company that builds AI models for video understanding. Text can be represented by words, and images by pixels, but video is a sequence of scenes flowing through time. "What's happening?", "In what order?", "What's the context?" — processing these questions simultaneously is the core of video understanding. In short, it is incredibly difficult.

Our approach is split into two phases. Marengo turns video into searchable information, and Pegasus generates summaries and analysis based on that information. From raw video data to meaningful knowledge, our two models carry this flow end to end.

These two models are built together by three teams.

The Science Team researches the foundational performance of our models. "How should we represent video so that machines can understand it best?" — this is the team that tackles questions without straightforward answers every single day. They are solving problems that are as exciting as they are challenging.

The MLE Team bridges the gap between research and product. No matter how great a model is, it is meaningless unless it runs quickly and reliably in production. From training pipelines to serving optimization, this team is responsible for turning research into reality.

The Engineering Team builds everything that delivers this work to our customers: API design, infrastructure, and the overall product. Most touchpoints where users experience Twelve Labs come from this team.

📌 Three Things to Know Before You Read

Before we dive in, let me introduce three terms that will come up frequently. Keeping these in mind will make everything much easier to follow.

Multimodal AI — AI that processes text, images, audio, and video frames simultaneously. Its architecture is fundamentally different from models that handle only one modality. Having these modalities move together across a timeline is one of the reasons video understanding is so challenging.

https://youtu.be/FS3sotFXqIU?si=m7_BkOvisqOYkPJ1

Semantic Search — A method of searching by meaning rather than keywords. You can find a
