Twelve Labs
Finding the Boundaries of Meaning: The People Building Video-Understanding Embeddings

김단, 김수
Embeddings are lossy compression. What to extract, and how much to discard while preserving the core meaning? Dan, Lead ML Scientist at Twelve Labs, shares the questions he's been pursuing since 2018, and why he's looking for those answers in video.

April 10, 2026
7 min read
Dan leads the development of Marengo, the video embedding model, and of the video search systems at Twelve Labs, and has researched multimodal embeddings since 2018. Starting with image and text, he is now tackling the challenge of understanding video in semantic units. He chose video because it remains a frontier with questions no one has yet answered decisively, and because Twelve Labs is the only company globally addressing these challenges with this level of focus.
Q. You’ve been researching embeddings since 2018. Did you start with video from the beginning?
In the beginning, it was joint representation of image and text. It was research on projecting two distinct modalities into the same embedding space, and during that time, one question constantly persisted:
“How atomic must the segment be for an embedding to retain meaningful semantic information?”
An embedding is ultimately lossy compression; we compress the world's information and represent it in a single vector. If you represent the entire Lord of the Rings movie as a single 500-dimensional vector, most of the information vanishes. Yet if you slice it too finely, you lose context. Where do we draw that line? This very question ultimately led me to video.
Q. Why video?
It was the most under-researched field. PDFs have layouts; where images and text are positioned is programmatically defined. Web pages have HTML and CSS that explicitly state the relationships between elements.
But what about video? Aside from frames existing sequentially over time, there's no markup language that defines the relationship between those frames. Where does one scene end and another begin? Even in academia, there is still no definitive answer to this question. And because video data is so massive, researching it is virtually impossible without industrial-scale resources. I wanted to tackle the unique problems that can only be solved in this specific environment.
Q. What specifically are you building right now?
Two primary initiatives: first, pushing the precision of video search to the limit, and second, driving the evolution of the embedding model itself.
In search, we are currently focusing heavily on reranking. The baseline search in Marengo ranks results by how similar each result's embedding is to the query embedding. In this approach, the 1st and 5th results cannot 'see' each other; each is compared only with the query. A reranker, on the other hand, gathers the top-K results in one place and instructs the model: "Re-evaluate these jointly against the query." If, with the candidates side by side, the model determines that the 5th result is actually a better match than the 1st, the order changes. This makes search results markedly sharper and more relevant, even for the same query.
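The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not Marengo's implementation: the cosine score in the first stage, the `first_stage` and `rerank` functions, the clip IDs, the 2-D vectors, the captions, and the word-overlap `toy_scorer` are all assumptions invented for the example. A real reranker would be a learned model that compares candidates with one another; the toy scorer only mimics its interface, which receives the whole candidate list in one call.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_stage(query: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Score each clip independently against the query and keep the top k."""
    ranked = sorted(index, key=lambda clip: cosine(query, index[clip]), reverse=True)
    return ranked[:k]


JointScorer = Callable[[str, List[str]], Dict[str, float]]


def rerank(query_text: str, candidates: List[str], scorer: JointScorer) -> List[str]:
    """Pass all candidates to a single scorer call, then sort by its scores."""
    scores = scorer(query_text, candidates)
    return sorted(candidates, key=lambda clip: scores[clip], reverse=True)


# Hypothetical toy data: 2-D embeddings and captions invented for illustration.
INDEX = {"clip_a": [0.9, 0.1], "clip_b": [0.8, 0.6], "clip_c": [0.0, 1.0]}
CAPTIONS = {
    "clip_a": "car chase through the city",
    "clip_b": "a gunshot in a dark alley",
    "clip_c": "two people talking at a table",
}


def toy_scorer(query_text: str, candidates: List[str]) -> Dict[str, float]:
    """Stand-in for a learned reranker: counts words shared with each caption."""
    query_words = set(query_text.split())
    return {c: float(len(query_words & set(CAPTIONS[c].split()))) for c in candidates}


top_k = first_stage([1.0, 0.0], INDEX, k=2)
reranked = rerank("gunshot in an alley", top_k, toy_scorer)
print(top_k)     # ['clip_a', 'clip_b']
print(reranked)  # ['clip_b', 'clip_a']
```

The first stage never looks at more than one clip at a time, so it can run against a very large index. The second stage is more expensive per item, which is why it is applied only to the short top-K list.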
Q. In practice, how critical is the question of 'how to slice' a video?
Consider a gunshot scene. That entire scene might only last 2 seconds. If you simply segment the video at fixed 10-second intervals, those 2 seconds of action get lumped in with completely unrelated scenes before and after. The resulting embedding ends up contaminated with all of that extra context. When you search "find the gunshot scene" against that embedding, it is only natural that your search results will be muddy.
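A toy calculation shows the dilution effect. It assumes one embedding per second of video and a segment embedding formed by mean-pooling those per-second vectors; both are assumptions made for illustration, since the interview does not say how segment embeddings are pooled. The 2-D vectors and the `mean_pool` helper are likewise hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(vectors: List[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Hypothetical 2-D space: axis 0 = "gunshot", axis 1 = "dialogue".
GUNSHOT, DIALOGUE = [1.0, 0.0], [0.0, 1.0]

# Ten seconds of video: the gunshot occupies seconds 4 and 5 only.
frames = [DIALOGUE] * 4 + [GUNSHOT] * 2 + [DIALOGUE] * 4

query = GUNSHOT
fixed_window = mean_pool(frames[0:10])   # one fixed 10-second segment
tight_segment = mean_pool(frames[4:6])   # just the 2-second gunshot

fixed_score = cosine(query, fixed_window)
tight_score = cosine(query, tight_segment)
print(round(fixed_score, 2), round(tight_score, 2))
```

In this toy setup the fixed 10-second window scores roughly 0.24 against the gunshot query, while the tight 2-second segment scores 1.0. The event is still in the window, but eight seconds of unrelated content dominate its embedding.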
This is why we are building a dedicated model to segment videos into semantic units. It is our direct answer to this challenge. Instead of slicing at fixed lengths, the model autonomously detects exact boundaries and structures between scenes. This is the foundational technology that determines not just search precision, but the overall quality of video understanding.
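For intuition only, here is a much simpler heuristic than the dedicated model described above: cut wherever two consecutive per-second embeddings stop resembling each other. The `semantic_segments` function, the 0.5 threshold, and the toy vectors are assumptions for illustration, and a learned segmentation model would not be expected to work this way.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_segments(
    frames: List[Sequence[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return [start, end) spans, cutting where neighbouring embeddings diverge."""
    if not frames:
        return []
    segments = []
    start = 0
    for t in range(1, len(frames)):
        # A sharp drop in similarity between adjacent seconds marks a boundary.
        if cosine(frames[t - 1], frames[t]) < threshold:
            segments.append((start, t))
            start = t
    segments.append((start, len(frames)))
    return segments


# Hypothetical 2-D space: axis 0 = "gunshot", axis 1 = "dialogue".
GUNSHOT, DIALOGUE = [1.0, 0.0], [0.0, 1.0]
frames = [DIALOGUE] * 4 + [GUNSHOT] * 2 + [DIALOGUE] * 4

segments = semantic_segments(frames)
print(segments)  # [(0, 4), (4, 6), (6, 10)]
```

Unlike fixed 10-second windows, the spans follow the content, so the 2-second gunshot gets a segment of its own. Real footage changes gradually and noisily, which is one reason a threshold on adjacent similarity is rarely enough on its own.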
Q. What is the broader vision for Marengo?
It is a hierarchical structure. Slicing videos into semantic units gives each segment its own embedding. But what if we can dynamically recompose those units into high-level parent embeddings?
If you have a soccer video, you can divide it into the first and second half. The first half can then be split into Team A's offensive drives and Team B's offensive drives, which can be further broken down into individual actions. This forms a hierarchy. If a user asks for "first half with multiple turnovers," a broad high-level embedding provides the answer, while a search for "corner kick" points exactly to the smallest, most precise segment. It is a structure where the same system, operating on the same search interface, responds with varying granularities depending on the scope of the query.
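The soccer example can be sketched as a small tree in which every node, from a single action up to the whole match, carries an embedding and is searchable through the same function. Everything here is a hypothetical illustration: the `Node` class, composing a parent as the plain mean of its children, the 3-D toy vectors, and the hand-made query vectors are assumptions, not a description of how Marengo composes or searches embeddings.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Node:
    label: str
    embedding: Optional[List[float]] = None  # given for leaves, derived for parents
    children: List["Node"] = field(default_factory=list)


def compose(node: Node) -> List[float]:
    """Fill in parent embeddings bottom-up as the mean of their children."""
    if node.children:
        child_vecs = [compose(child) for child in node.children]
        node.embedding = [sum(dim) / len(child_vecs) for dim in zip(*child_vecs)]
    return node.embedding


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def search(root: Node, query: Sequence[float]) -> Node:
    """Return the best-matching node at any level of the hierarchy."""
    return max(walk(root), key=lambda n: cosine(query, n.embedding))


# Hypothetical 3-D space: (turnover, corner kick, passing).
match = Node("match", children=[
    Node("first half", children=[
        Node("first half, Team A drive", children=[
            Node("turnover 1", [1.0, 0.0, 0.0]),
            Node("passing sequence 1", [0.0, 0.0, 1.0]),
        ]),
        Node("first half, Team B drive", children=[
            Node("turnover 2", [1.0, 0.0, 0.0]),
            Node("corner kick", [0.0, 1.0, 0.0]),
        ]),
    ]),
    Node("second half", children=[
        Node("second half, Team A drive", children=[
            Node("passing sequence 2", [0.0, 0.0, 1.0]),
            Node("passing sequence 3", [0.0, 0.0, 1.0]),
        ]),
    ]),
])
compose(match)

narrow = search(match, [0.0, 1.0, 0.0])  # a query about one specific action
broad = search(match, [2.0, 1.0, 1.0])   # a query about a turnover-heavy stretch
print(narrow.label)  # corner kick
print(broad.label)   # first half
```

The narrow query lands on a leaf and the broad one on a parent, even though both go through the same `search` call. Averaging is the simplest possible way to build a parent; it is used here only because it makes the toy numbers easy to follow.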
Q. What has been your most rewarding moment here?
I heard that an embedding model lead at a global big tech company tried Marengo firsthand and was stunned by its performance. It's not that they don't know how to build embedding models. However, at most companies, embeddings are treated as a byproduct of LLMs: they build a generation model and derive an embedding model from it. Almost no other company places embeddings at the absolute center of its focus.
Marengo is different. It is not a derivative of Pegasus. Marengo is the origin. I believe Twelve Labs is the only company globally pouring LLM-scale resources and serious commitment into embeddings. Because we are focused on video as our primary modality, we treat problems that other companies don't even think about—like how to segment video and how to extract meaning from real-time streams—as core research topics.
Q. How does research here differ from academic research?
Academia seeks the most generalizable answers. For a paper to be highly cited, a broad range of researchers must be able to apply it to their own work. Keeping the scope as broad as possible is academia's optimization strategy.
Industrial research works in reverse. There is a target market and concrete user needs. Narrowing the scope down to those elements directly translates to performance improvements. You can deliver far more precise results with the same amount of effort. However, if you submit this to academic conferences, reviews often come back stating it is "not general enough." In the mid-2020s, world-class AI companies with excellent productized models publish less at traditional academic conferences not because they lack interest, but because of this structural gap.
What is great about Twelve Labs is that we can engineer our research through a tight feedback loop with real customers. Real business challenges, like "decide whether to license petabytes of video based on this 10GB sample," drive our research directions. These problems do not even exist in academia. And the environment to solve them is right here—complete with petabytes of real-world video data, actual clients, and real feedback.
Q. What is the day-to-day working environment like?
Even as I do this interview, my agents are running in the background. Yesterday, while having dinner, I received a Slack alert that one of them crashed, and I was on edge. Whether that's good or bad, that represents how we work in this era.
At Twelve Labs, we have a policy called 'Tokens Never Sleep'. We place no cap on the use of AI tools. This is not simply a perk. It is designed to let us experience firsthand how work should be done in this AI era. Many companies cap usage or offer no support at all, and I believe that difference will create a significant gap down the road.
The absence of rigid corporate frameworks was slightly disorienting at first. When I asked, "What is the standard procedure for Process A?" the response was, "You can redefine it to whatever process you think is optimal, Dan." Now, I see this as a major strength. Because we do not treat legacy systems as sacred, we can design workflows from scratch using the best modern tools available.
Q. Who would thrive here, and who might find it challenging?
To be honest, if you find satisfaction in executing predefined tasks within a highly structured framework, this might not be the right fit. Here, you often have to define the problems yourself. Shifts in direction are also common. Even if you've poured effort into building something, if a better direction emerges, we pivot. If letting go of sunk costs weighs heavily on you, it can be demanding.
Conversely, if you excel at finding and solving problems that no one else is tackling, and you want to see those solutions directly integrated into a real product, there is no better environment. There are very few places in the world operating at the absolute frontier of video embeddings.
Q. Any final words for those considering joining Twelve Labs?
The traditional way humanity works will undergo massive shifts in the near future. Before this transition is complete, I urge you not to miss the opportunity to work at the absolute frontier of these new workflows. We are entering an era where anyone can own things end-to-end. The environment to prepare for that is right here, with the authority, the resources, and the real-world challenges.
If that sounds exciting, you belong here.
Dan is a Lead ML Scientist at Twelve Labs, leading the development of Marengo Embedding and Search systems. Twelve Labs is actively hiring. → twelvelabs.io/careers




