Twelve Labs
No matter how incredible a model is, if it isn't used, it remains nothing more than a novelty.

김성주, 김수정
An engineer who started with a vision to build AI that understands the world. SJ, Twelve Labs' Pegasus Engineering Lead, explains the possibilities and limitations of video data, and why developing AI directly connected to products is so essential.

March 20, 2026
5 min read
Since SJ first encountered AI, video has always been at the center. He did not learn language models first and then transition to video; for him, it was video from the very beginning. Perhaps that is why his way of explaining this work starts not with technology stacks or benchmark numbers but with two more fundamental questions: What does it mean to understand the world? And what do we need to build for that?
Q. Was there a specific trigger that led you to the video field?
Rather than being instantly drawn to video itself, I wanted to build AI that understands the world. To do that, you need data that most closely resembles the world, and currently, among what we have at scale, that is video. We have LiDAR and sensor data, but video is the only format that already exists globally, and that everyone creates and consumes.
However, to be honest, while I believe video is the best data for understanding the world, I don't think we have fully figured out the best way to utilize it yet. We are solving that problem right now.
Q. Why is training AI on video so difficult?
There is one core reason why language models were able to develop explosively: they are trained to predict the next word in a sequence. This means almost all text on the internet serves as pre-labeled data. There is no need for humans to manually label it, allowing large volumes of data to be ingested quickly.
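As a rough illustration of why raw text labels itself, the sketch below turns a sentence into (context, next word) training pairs. It is a minimal sketch under stated assumptions: the function name `next_word_pairs` is hypothetical, and a whitespace split stands in for the subword tokenizer a real language model would use.

```python
def next_word_pairs(text: str) -> list[tuple[list[str], str]]:
    """Build (context, next word) pairs from raw text, with no human labels."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    words = text.split()
    # Every prefix of the sentence is an input; the word that follows is its label.
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("video is the best data")
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why large volumes of text can be ingested without a separate labeling step.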
Video is different. To use a single 50MB video for training, you first have to decode it into individual image frames. Instantly, it swells to 1–2GB. Storage, processing, and resource allocation all explode at once. Even before the user base reaches that scale, you need enterprise-level engineering from day one.
And to turn that into high-quality data suitable for training, you have to design, refine, and iterate on complex pipelines. While it can be automated, the process is incredibly complex and engineering-heavy. It is a task that demands both cutting-edge AI capabilities and heavy-duty engineering.
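The size blow-up from decoding can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the resolution, clip length, and sampling rate are assumptions chosen for the example, not figures from the interview or from the Twelve Labs pipeline, and `decoded_size_bytes` is a hypothetical helper.

```python
def decoded_size_bytes(width: int, height: int, frames: int,
                       channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Size of uncompressed frames: one byte per colour channel per pixel."""
    return width * height * channels * bytes_per_channel * frames


# Assumed example: a 5-minute 720p clip, keeping 2 frames per second.
frames = 5 * 60 * 2
raw = decoded_size_bytes(1280, 720, frames)

# The compressed file size quoted in the interview (50MB).
compressed = 50 * 10**6

print(f"{raw / 1e9:.2f} GB decoded, {raw / compressed:.0f}x the compressed file")
```

Under these assumed settings the decoded frames land in the 1–2GB range; decoding every frame or using a higher resolution would push the total well beyond that.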
As a result, there are only a handful of teams globally tackling this problem to this depth. In my view, it’s realistically just Google. Google has YouTube, a search engine, and a fundamentally different engineering DNA that allows them to do this naturally. For most other AI companies, diving deep into this problem lies outside their core DNA.
Q. While tackling such a difficult challenge, how do you set your direction?
For me, there is only one standard: Is it actually being used?
No matter how novel or superior a technology is compared to what's out there, if it serves no purpose in actual practice, it remains just a novelty. That is why our Video Analysis team is currently focusing on performance optimized for real use cases. Instead of developing aimlessly around open-ended research questions, we want to do exactly what actual users need most, and do it exceptionally well.
What's interesting is that when I ask ML engineers joining us, "Why do you want to join?", almost all of them say, "Because I want to build AI connected to real products." It seems that the engineers who come to us are the ones who feel that building frontier models alone is somehow not enough.
I deeply relate to that feeling. No matter how outstanding a model is, if no one uses it, it just ends as a cool laboratory experiment. Building something that actually works in production is far more meaningful to me.
Q. Don't those two focus areas, frontier models and a product focus, actually clash?
Building what users want doesn't mean giving up on frontier model research. Rather, I see it as moving in the same direction while hitting concrete, practical milestones. A great recent example of this is Claude Code. The acceleration of language models truly took off when they connected to actual products. User feedback flows back into the model, establishing a direction and creating a rapid feedback loop that accelerates the pace of development itself. Research separated from products either loses its way or ends up building something nobody uses.
Q. Tell us about the team. How do you work in the Seoul office?
Our Korean MLE team is completely glocalized. A single team takes full ownership of the entire cycle, from data design to training, serving, and GA (general availability). We don't have to navigate timezone barriers, and because decision-making happens within the team, operations are naturally fast.
In the past, "collaborating with the SF team" was considered a major selling point. But my perspective has changed. A team is not global just because it connects to SF. What makes us a truly global team is having frontier-level competitiveness within this team itself. That is the direction we are headed.
Q. Lastly, what would you like to say to those considering joining us?
(Thinks for a moment)
It's fast, high-density, and you learn a lot. But that's not all.
We are solving problems that are incredibly rare globally. Those problems are directly tied to real products, and you get to work with end-to-end ownership. None of the people who joined us came with prior experience in video AI. Yet they came here, learned rapidly, and are building it together.
If you are someone who wants to build AI that understands the world, and wants to see that AI actually put to work, I believe this is the closest answer you will find in Korea.
SJ is a co-founder of Twelve Labs and the Engineering Lead (MLE) for the Pegasus team. Twelve Labs is looking for engineers to join our journey. → twelvelabs.io/ko/careers




