Why I Joined TwelveLabs


Paritosh Mohan


Oct 1, 2025
4 min
The next frontier in AI is helping machines understand the world as we do. That’s why I joined TwelveLabs as VP of Engineering.
The World’s Hidden Dataset
Video accounts for 90% of the world’s digitized data, yet most of it sits in storage, effectively invisible. Organizations collect endless footage, from media archives to autonomous driving to enterprise knowledge, but lack the tools to turn that footage into structured, searchable information.
We’ve learned to process text at scale, but video at scale is orders of magnitude more complex, owing to its multidimensional nature and low information density. Video combines visual, audio, and spatial context over time. Human intelligence evolved to handle this seamlessly (our earliest memories aren’t words but moving pictures), but for AI models that leap is far harder.
This is what drew me to TwelveLabs.
I’ve had the opportunity to build large-scale distributed systems that serve billions of requests per day. I’ve worked on infrastructure where milliseconds matter, and on platforms that demanded elasticity and mission-critical reliability under unpredictable workloads. I’ve seen firsthand how machine learning at scale can transform industries, but also how quickly complexity becomes a barrier when infrastructure isn’t designed for scale.
Those experiences gave me conviction that breakthroughs in AI require more than just better models. They demand engineering excellence, system-level thinking, and a focus on usability. That’s exactly what excites me about TwelveLabs: the chance to combine state-of-the-art multimodal and reasoning AI research with production-grade engineering, so that video understanding isn’t just a demo but an everyday tool for enterprises, governments, developers, and creators.
A Research Foundation for the Future
At TwelveLabs, we are rethinking the foundation of multimodal information:
Marengo, our multimodal video encoder, pioneered a multi-vector embedding architecture that captures different aspects of a video (visual details, motion, speech, even on-screen text), leading to significant breakthroughs in search and retrieval accuracy; the multi-vector idea is sketched in code below.
Pegasus, our industry-grade video language model, delivers state-of-the-art temporal reasoning over long videos, scaling to hours with low latency and high accuracy. This matters because real-world video understanding requires connecting cause and effect over time.
And now with Jockey, our agentic video intelligence framework, we’re bridging perception and reasoning, enabling AI systems to move beyond analysis and collaborate with humans on insights and creative outputs.
Together, these systems make video searchable, navigable, and actionable—a foundation for everything from content discovery and media workflows to safety, security, and scientific research.
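To make the multi-vector idea concrete, here is a minimal sketch, not TwelveLabs’ actual implementation or API: each video segment carries a separate embedding per aspect, and retrieval scores a query against whichever aspect it targets. The aspect names, dimensions, and helper functions are illustrative assumptions.

```python
# Illustrative sketch of multi-vector video retrieval (hypothetical, not Marengo).
# Each segment keeps one embedding per aspect instead of a single pooled vector.
from dataclasses import dataclass

import numpy as np


@dataclass
class SegmentEmbedding:
    video_id: str
    start_sec: float
    end_sec: float
    # Aspect name -> embedding; the aspects here are assumptions for illustration.
    vectors: dict[str, np.ndarray]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def search(query_vec: np.ndarray, index: list[SegmentEmbedding],
           aspect: str = "visual", top_k: int = 5):
    """Rank segments by similarity on a single aspect's vector."""
    scored = [(cosine(query_vec, seg.vectors[aspect]), seg)
              for seg in index if aspect in seg.vectors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    index = [SegmentEmbedding("vid-1", 10.0 * i, 10.0 * (i + 1),
                              {"visual": rng.normal(size=512),
                               "speech": rng.normal(size=512)})
             for i in range(100)]
    query = rng.normal(size=512)  # stand-in for an embedded text query
    for score, seg in search(query, index, aspect="speech", top_k=3):
        print(f"{seg.video_id} [{seg.start_sec:.0f}s] score={score:.3f}")
```

The design point the sketch illustrates: pooling everything into one vector loses which modality matched, while per-aspect vectors let retrieval distinguish, say, something spoken aloud from the same words appearing on screen.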
Engineering at Scale
The challenge now is to build the AI and distributed-systems infrastructure that scales this intelligence globally. My background in designing intelligent systems that balance latency, cost, and fault tolerance informs how I think about TwelveLabs’ next phase: architecting systems that can ingest, understand, and reason over petabytes of video in real time while remaining developer-friendly and enterprise-ready.
It’s a rare opportunity to shape both the science and the systems at the frontier of AI.
I’m honored to join Aiden Lee, Jae Lee, Soyoung Lee, Yoon Kim, and the rest of the TwelveLabs team in building what I believe will be a generational company. We are on the cusp of a new era: one where AI doesn’t just read, but truly sees and understands the visual world around us.
We’re Hiring!
TwelveLabs is at an inflection point.
If you’re excited by foundation model research and engineering, large-scale vision problems, or distributed systems engineering, I’d love to connect. Explore our open roles at twelvelabs.io/careers. If a role doesn’t fit but you think TwelveLabs is exactly what you’re looking for, shoot me a message on LinkedIn.