Twelve Labs

Turning Video into Structured Data: Building a Time-Based Metadata (TBM) Pipeline

케빈 리

To turn video archives into queryable assets, we detail how Twelve Labs designed a dual-evaluation system that decouples temporal and metadata aspects, alongside schema-conditional segmentation.

May 4, 2026

15 min read

Broadcasters, sports leagues, media companies, and enterprise platforms hold petabyte-scale video archives. Yet, most of this content remains in a state we call dark video. It exists, but it is unsearchable, unstructured, and cannot be utilized at a semantic level.

The reason is surprisingly simple: video is not text. You cannot run a grep on video files, nor can you execute SELECT * FROM video WHERE scene = 'scoring_play'. For video to have economic value, it must first be broken down into structured segments and machine-readable metadata. We call this process video assetization.

This article is the story of how we built Time-Based Metadata (TBM) and its evaluation framework. Along the way, one thing became crystal clear: designing and evaluating time-and-semantic-based video understanding requires a fundamentally different approach than text or image models.


1. The Video Assetization Gap: What Customers Actually Want

Consider a broadcaster managing a 200,000-hour news archive. Today, cataloging it is manual work: video archivists log each segment, tag stories, identify speakers, and mark topic boundaries. It costs about $15 per hour of video, and budgets are shrinking. This approach simply does not scale.

Let's look at another example. Imagine a consumer brand looking to track appearances of their products across thousands of influencer videos. They need to find every moment a specific product appears on screen while the creator speaks directly to the camera. Merely knowing that the product appeared is not enough. They need to know exactly when, how prominently, and in what context it appeared.

These use cases are not hypothetical. Broadcast archiving, automated sports highlights, brand intelligence, compliance auditing—nearly every enterprise video workload we encounter demands two independent capabilities simultaneously:

  1. Precise Temporal Boundaries: Where does each segment start and end?

  2. Schema-Conforming Structured Metadata: What happened inside that segment?

If you ask most video-language models today, "What is happening in this video?", they generate fluent paragraphs of text. But the moment you ask them to "provide start and end timestamps for all editorial narratives along with structured fields (topic, speaker, confidence)," you quickly realize that general-purpose video reasoning and production-grade segmentation are two entirely different problems.

Let's visualize the actual output a broadcaster expects. Each editorial segment requires a title, description, editorial topic, featured people, and confidence level. For a one-hour news program, the TBM output looks like this:


Each segment includes structured metadata such as editorial_subjects, visual_subjects, names, and confidence. This is the point at which a one-hour program becomes a set of queryable objects.
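To make "queryable objects" concrete, here is a minimal Python sketch. The segment values are invented purely for illustration, the field names follow those used in this article, and the LOW/MEDIUM/HIGH ordering of confidence is an assumption; the actual TBM response shape may differ.

```python
# Illustrative only: invented segments using field names from this article.
segments = [
    {"start_time": 0.0, "end_time": 95.2,
     "metadata": {"editorial_subjects": ["Election"], "names": ["Anchor A"],
                  "confidence": "HIGH"}},
    {"start_time": 95.2, "end_time": 240.0,
     "metadata": {"editorial_subjects": ["Weather"], "names": [],
                  "confidence": "MEDIUM"}},
    {"start_time": 240.0, "end_time": 410.5,
     "metadata": {"editorial_subjects": ["Election", "Economy"],
                  "names": ["Reporter B"], "confidence": "HIGH"}},
]

# Assumed ordering of confidence labels (only "HIGH" appears in the article).
CONFIDENCE_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def find_segments(segments, subject, min_confidence="HIGH"):
    """Return (start, end) pairs for segments tagged with `subject`."""
    threshold = CONFIDENCE_ORDER[min_confidence]
    return [
        (s["start_time"], s["end_time"])
        for s in segments
        if subject in s["metadata"]["editorial_subjects"]
        and CONFIDENCE_ORDER[s["metadata"]["confidence"]] >= threshold
    ]


print(find_segments(segments, "Election"))  # [(0.0, 95.2), (240.0, 410.5)]
```

Once segments carry typed fields, a question like "every high-confidence election segment" becomes an ordinary filter rather than a viewing session.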


2. Why Traditional Approaches Fail

The Boundary Prediction Problem

Large language models excel at span-level reasoning (summarization, Q&A, description). However, predicting where segment boundaries lie is a fundamentally different challenge.

Boundary prediction lies at the intersection of three technical challenges:

  • Weakly-supervised temporal localization: Ground truth is sparse and subjective.

  • Multimodal change-point detection: Boundaries are defined by simultaneous transitions in visual, audio, and semantic cues.

  • Token-level classification under autoregressive decoding: The model must output highly precise timestamps as tokens.

In our early experiments, even frontier models produced boundaries that were semantically reasonable but temporally imprecise—segment starts that should have been hard cuts were off by 5 to 15 seconds. The model understood what was happening, but struggled to pin down exactly when it happened.


Without a Schema, It's Not an Asset

Most video-language models produce free-form text.

"The video shows a news broadcast. Around the 2-minute mark, it transitions to a sports segment..."

This is useful for human consumption, but practically useless for downstream automation. What enterprise workflows actually require is this:

{
  "start_time": 120.45,
  "end_time": 245.80,
  "metadata": {
    "segment_title": "NFL Playoffs Recap",
    "editorial_subjects": ["NFL", "Playoffs", "Injuries"],
    "names": ["Patrick Mahomes", "Lamar Jackson"],
    "confidence": "HIGH"
  }
}

The gap between expressive text and machine-actionable structured output is more than just a formatting issue; it is a fundamental challenge tied to both modeling and evaluation.


3. Twelve Labs' Approach: Schema-Conditioned Temporal Extraction

Defining the Problem

When designing TBM, we faced a crucial fork in the road: Do we build a general-purpose model that takes a video and answers anything (with segmentation as just one of its features), or do we build a schema-conditioned extraction model that takes a video alongside a user-defined schema and fills out those segments and fields accordingly?

At first, the former seemed more natural. However, the failure patterns we saw in §2—drifted boundaries, free-form fields, and inconsistent outputs across runs—turned out to be structural limitations of the general-purpose approach. Without a schema, the model has no way of knowing what to find or how precisely to find it, and evaluation is reduced to checking if the output "looks plausible."

We chose the latter. TBM is not a general-purpose model trying to do everything; it is an extraction model optimized specifically for precise temporal boundaries and schema-conforming metadata.


Three Key Advantages of Schema-Conditioning

In TBM, users provide segment_definitions, which serve as a structured specification of what segments to find and what metadata fields to extract from each. This is not just a convenient API design. From a modeling perspective, schema conditioning provides three key advantages:

Reduction of the Search Space. Instead of generating open-ended natural language about any visual detail, the model operates within a constrained output space defined by the schema. For sports, specifying fields like down, scoring_play, and penalty_type is sufficient. This cleanly prunes away a massive space of possible but irrelevant observations.

Boundary Alignment Stabilization. Once the model knows exactly what it is looking for (e.g., a "play boundary" vs. an "ad transition"), it can leverage domain-specific temporal cues (whistles, formation changes, fade-to-black patterns) rather than relying on generic change-point heuristics.

Evaluation Anchoring. Each field in the schema becomes a concrete, measurable evaluation target. Instead of asking the vague question "Did the model understand the video?", we can ask "Did the model correctly identify the down field as 3 for this segment?"
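A minimal sketch of what evaluation anchoring could look like in practice, assuming a hypothetical football schema. The field names down, scoring_play, and penalty_type come from the text above; the "integer" and "boolean" type names, the enum values, and both helper functions are assumptions made for illustration, not the TBM API.

```python
# Hypothetical schema, shaped like the fields list shown elsewhere in this article.
PLAY_FIELDS = [
    {"name": "down", "type": "integer"},
    {"name": "scoring_play", "type": "boolean"},
    {"name": "penalty_type", "type": "string",
     "enum": ["NONE", "HOLDING", "OFFSIDE"]},
]

PY_TYPES = {"string": str, "integer": int, "boolean": bool}


def schema_violations(metadata, fields):
    """List problems found; an empty list means the metadata conforms."""
    problems = []
    for f in fields:
        name = f["name"]
        if name not in metadata:
            problems.append(f"missing field: {name}")
            continue
        value = metadata[name]
        expected = PY_TYPES[f["type"]]
        # bool is a subclass of int in Python, so reject True/False for integers.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool))
        if wrong_type:
            problems.append(f"{name}: expected {f['type']}")
        elif "enum" in f and value not in f["enum"]:
            problems.append(f"{name}: {value!r} not in enum")
    extra = set(metadata) - {f["name"] for f in fields}
    problems.extend(f"unexpected field: {n}" for n in sorted(extra))
    return problems


def field_accuracy(predicted, truth, fields):
    """Exact-match rate per field over already-aligned segment pairs."""
    if not truth:
        return {}
    hits = {f["name"]: 0 for f in fields}
    for p, t in zip(predicted, truth):
        for name in hits:
            hits[name] += p.get(name) == t.get(name)
    return {name: count / len(truth) for name, count in hits.items()}


predicted = [
    {"down": 3, "scoring_play": False, "penalty_type": "NONE"},
    {"down": 2, "scoring_play": True, "penalty_type": "HOLDING"},
]
truth = [
    {"down": 3, "scoring_play": False, "penalty_type": "NONE"},
    {"down": 1, "scoring_play": True, "penalty_type": "HOLDING"},
]
print(field_accuracy(predicted, truth, PLAY_FIELDS))
```

Because every field is scored on its own, a regression in down does not hide behind a healthy scoring_play score.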


4. Coverage Breadth: The 4-Layer Signal Framework

For schema conditioning to work, the schema must map to the right signals. Video segmentation is difficult in part because the signals that define a boundary span multiple modalities and abstraction levels: frame-level visual changes such as camera cuts, structural narrative transitions in news packages, acoustic cues such as a change of speaker, and composite events in sports that require visual, audio, and rules-based understanding at once. No single technique handles all of them.

We systematized this diversity into four distinct Cue Layers:

Cue Layer | Description | Example Boundary Signal
Low-Level Visual | Visual changes within frame transitions. Focuses on localized, pixel-level visual details. | Shot transitions, camera angle changes
High-Level Semantic | Global semantic or narrative changes. Spans both visual and audio streams. | Topic transitions, editorial package boundaries (Anchor ↔ Field report ↔ Studio)
Audio | Acoustic cues including speech, music, sound effects, and silence. | Speaker transitions, music/BGM transitions
Composite | Multimodal signals requiring joint visual, audio, and contextual understanding. | Play-by-play events in sports, commercial ad boundaries


What This Classification Means

Generic video models typically treat these four layers interchangeably. However, in production, a customer's segmentation requirement always maps directly to one or a combination of these layers:

  • Broadcast news "editorial narrative segmenting" → High-Level Semantic (topic transitions + editorial package boundaries)

  • Video production "shot boundary detection" → Low-Level Visual (camera cuts, angle changes)

  • Sports league "play-by-play segmentation" → Composite (visual formations + whistle cues + game state rules)

  • Podcast platform "speaker diarization by topic" → Audio (speaker transitions + ASR topic drift)

This distinction isn't just academic; it dictates modeling strategies, training data curation, and evaluation metrics. Low-Level Visual behaves like a frame-wise change detection problem, whereas Composite demands long-context temporal reasoning.


5. Multimodal Grounding: Moving Beyond Text

Enterprise video segmentation projects often hit a wall: text descriptions alone are insufficient to specify what needs to be found.


Why Text-Only Falls Short

Consider a travel content platform with a specific problem:

Find every moment where the N Seoul Tower appears across tens of thousands of travel vlogs. Classify whether it's a distant establishing shot over the Han River, a walking shot from Namsan Park, a close-up from the observation deck, or a night-time light-up scene.

Text descriptions quickly hit their limit. Just because a model understands the name 'N Seoul Tower' in text does not mean it can reliably anchor that to its visual appearance. In video-language models, semantic familiarity with an entity is distinct from visual grounding. Depending on the training distribution, the visual representation might be weak, or the model might confuse it with other structurally similar towers like Tokyo Tower or CN Tower. However, providing a reference image of the N Seoul Tower allows the model to bypass the complex translation of text to vision, aligning the image embedding directly with the video frames.


The Entity Reference System

TBM introduces support for media_sources. By placing reference tags like <reference_name> within the segment definitions, users can let the model point directly to user-provided reference assets.

Source - King Sejong Institute Foundation, https://www.kogl.or.kr/recommend/recommendDivView.do?oc=&recommendIdx=91796&division=img#

{
  "segment_definitions": [{
    "id": "namsan_tower_appearances",
    "media_sources": [
      { "name": "namsan_tower_img", "media_type": "image", "media_url": "https://cdn.../namsan_tower.jpg" }
    ],
    "description": "Moments where the N Seoul Tower, as identified by <namsan_tower_img>, appears on screen",
    "fields": [
      { "name": "screen_prominence", "type": "string",
        "enum": ["HERO_SHOT", "PARTIAL", "BACKGROUND"] },
      { "name": "shot_type", "type": "string",
        "enum": ["CITY_ESTABLISHING", "PARK_WALK", "OBSERVATION_DECK", "NIGHT_LIGHTUP"] }
    ]
  }]
}

From a modeling perspective, this reframes segmentation from an open-vocabulary detection task to a grounded retrieval + temporal localization task:

  1. Reference resolution: The model binds <namsan_tower_img> to specific visual feature patterns across the video.

  2. Visual embedding alignment: The reference image is encoded into the same representation space as the video frames.

  3. Conditional boundary detection: Segment boundaries are dictated by the arrival and departure of the grounded visual entity, rather than generic scene shifts.
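Before a request like the one above is sent, a client could check that every reference tag in the description points at a declared media source. The sketch below is a hypothetical pre-flight check, not part of TBM: the tag pattern is an assumption inferred from the <namsan_tower_img> example, the URL is a placeholder, and nothing here describes how the model resolves references internally.

```python
import re

# Assumed tag syntax: <identifier>, inferred from the example in this article.
TAG = re.compile(r"<([A-Za-z_][A-Za-z0-9_]*)>")


def unresolved_references(definition):
    """Return tags used in the description that no media_sources entry declares."""
    declared = {src["name"] for src in definition.get("media_sources", [])}
    used = TAG.findall(definition.get("description", ""))
    return [name for name in used if name not in declared]


definition = {
    "id": "namsan_tower_appearances",
    "media_sources": [
        # Placeholder URL for illustration only.
        {"name": "namsan_tower_img", "media_type": "image",
         "media_url": "https://example.com/namsan_tower.jpg"},
    ],
    "description": ("Moments where the N Seoul Tower, as identified by "
                    "<namsan_tower_img>, appears on screen"),
}

print(unresolved_references(definition))  # []

broken = dict(definition, description="Shots matching <tower_photo>")
print(unresolved_references(broken))  # ['tower_photo']
```

Catching a dangling tag on the client side is cheap, whereas a misnamed reference discovered after inference costs a full video pass.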

This is the difference between "find a city shot" and "find a shot containing this specific tower." It doesn't just change the technical difficulty; it fundamentally changes the utility of the product.

With entity grounding, segmentation can fail across three different vectors: temporal boundaries, semantic labels, and visual entity resolution. This complexity demands a more robust approach to evaluation.


6. Guaranteeing Quality: Dual-Track Evaluation

One of our most critical realizations while running TBM in production was that a single aggregate score cannot adequately capture segmentation quality.


The Confounding Problem

Consider these two failure scenarios:

Scenario A. The model generates a segment from 10.0s to 25.0s. The ground-truth segment runs from 12.0s to 48.0s. The model captured the anchor's intro but missed the final 23 seconds of the actual news story. The semantic tags (anchor name, topic) were perfect, but the temporal coverage was poor.

Scenario B. The model correctly captures the boundary from 12.0s to 48.0s, but tags a "political analysis" segment as a "weather report." The boundary is perfect, but the metadata is wrong.

An aggregate evaluation metric that blends time and semantics cannot distinguish between these two failures. Yet, the remedy for each is completely different: Scenario A requires tuning temporal modeling, while Scenario B requires fixing semantic grounding. If we do not untangle these failure modes, we cannot confidently iterate on the model.
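The temporal side of the two scenarios can be quantified with temporal intersection-over-union (IoU). This is a common choice for comparing intervals and is used here only as an illustration; the article does not state which temporal metric TBM evaluation uses.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


truth = (12.0, 48.0)
scenario_a = temporal_iou((10.0, 25.0), truth)  # 13 s overlap / 38 s union
scenario_b = temporal_iou((12.0, 48.0), truth)  # identical intervals

print(round(scenario_a, 3), scenario_b)  # 0.342 1.0
```

Scenario A scores about 0.34 on time with perfect tags, while Scenario B scores 1.0 on time with a wrong tag. Any single blended number has to average those opposite failures together.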


Our Solution: Two Independent Evaluation Tracks

Segment Track: "Did the model find the right window of time?"

We evaluate temporal accuracy across two dimensions: segment-level accuracy (how well individual segments are captured) and timeline-level coverage (how much of the timeline is correctly mapped). Both are essential; a model can score high on one and low on the other, and knowing where it fails dictates the path forward. Our final comparison uses a composite score as the primary diagnostic metric.
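To see how the two dimensions can diverge, consider the sketch below. Both formulas are assumptions chosen for illustration (mean best-IoU per ground-truth segment, and the fraction of ground-truth time covered); the article does not publish its exact definitions.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def iou(a, b):
    inter = overlap(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_level_score(preds, truths):
    """Mean over ground-truth segments of the best IoU any prediction achieves."""
    if not truths:
        return 0.0
    best = [max((iou(p, t) for p in preds), default=0.0) for t in truths]
    return sum(best) / len(truths)


def timeline_coverage(preds, truths):
    """Fraction of ground-truth time covered by predictions.

    Assumes segments within each list do not overlap one another.
    """
    total = sum(t[1] - t[0] for t in truths)
    covered = sum(overlap(p, t) for p in preds for t in truths)
    return covered / total if total > 0 else 0.0


truths = [(0.0, 30.0), (30.0, 60.0)]
preds = [(0.0, 60.0)]  # one over-merged segment spanning both stories

print(segment_level_score(preds, truths))  # 0.5
print(timeline_coverage(preds, truths))    # 1.0
```

The over-merged prediction covers the whole timeline yet matches neither story well, which is exactly the case where looking at only one of the two numbers would mislead.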

Metadata Track: "Within that window, did the model extract the correct structured values?"

Once segments are temporally matched, we evaluate the metadata fields within those matched pairs independently. We use an LLM-as-judge framework to score fields on a 0.0 to 5.0 scale, guided by field-specific schemas and criteria.

An important detail: We scale the metadata score by the strength of the temporal alignment. This prevents "lucky guesses" on poorly aligned segments from inflating the metadata accuracy metric.
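One simple way to apply such scaling is to multiply the mean judge score by an alignment weight between 0 and 1. The linear form, the use of temporal IoU as the weight, and the function name below are all assumptions; the article only says that the score is scaled by alignment strength.

```python
def scaled_metadata_score(judge_scores, alignment):
    """Average per-field judge scores (0.0 to 5.0) and scale by alignment in [0, 1].

    `judge_scores` maps field name -> score from an LLM judge.
    `alignment` is a temporal alignment weight, e.g. a temporal IoU (assumed).
    """
    if not 0.0 <= alignment <= 1.0:
        raise ValueError("alignment must be in [0, 1]")
    if not judge_scores:
        return 0.0
    mean = sum(judge_scores.values()) / len(judge_scores)
    return mean * alignment


scores = {"topic": 5.0, "speaker": 4.0}

print(scaled_metadata_score(scores, 1.0))                # 4.5
print(round(scaled_metadata_score(scores, 13 / 38), 2))  # 1.54
```

With identical judge scores, the well-aligned segment keeps its 4.5 while the poorly aligned one drops to roughly 1.5, so correct labels attached to the wrong window earn little credit.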

Operational benefits: Decoupling metadata evaluation from the video inference pipeline allows us to iterate rapidly on LLM-as-judge prompt templates and scoring rubrics without having to re-run expensive video model inference.

Why this separation matters: If temporal scores go up but metadata accuracy drops, it is a clear signal that the model is gaining temporal precision at the expense of semantic accuracy. This suggests we need to adjust the training data balance, not the model architecture. This level of diagnosis is only possible with split-track evaluation.


7. The Semantic Flywheel: The Compounding Value of Video Assets

Assetization is not a one-off run-and-done project. When built correctly, it forms a self-reinforcing loop where each run fuels the next:


We call this the Semantic Flywheel. Whether the domain is broadcast news, brand intelligence, compliance auditing, or sports highlights, the workloads share this exact same trajectory. As segment-metadata pairs accumulate, user corrections—adjusted boundaries, rejected tags, modified values—turn directly into high-fidelity training data for the next generation of models. The archive stops being a passive storage cost and becomes a dynamic system that grows more orderly over time.

The sole prerequisite for this flywheel to spin is observable improvement. If you cannot measure temporal precision independently of metadata precision, you cannot know which direction the model is moving. Without direction, systematic iteration is impossible, and the flywheel stalls. Without an evaluation framework that mirrors your asset structure, assetization remains just a temporary parlor trick.


8. What We Learned Along the Way

Building TBM and its evaluation framework brought several critical insights:

Segmentation is not Q&A. Treating temporal segmentation as "just another prompt-based task" for an LLM led us down unproductive paths for months. Boundary prediction has its own distinct failure modes, evaluation needs, and architecture sensitivities, and we should have recognized that sooner.

Structured output is far more valuable than fluent prose. In production, well-formed JSON with slightly imperfect boundaries is far more useful than an expressive, unstructured narrative paragraph. Prioritizing machine-readability over human-readability was the right decision.

Determinism is a feature, not a constraint. Enforcing non-overlapping segments, strict schema adherence, and a default temperature=0 initially faced resistance for "limiting model creativity." In practice, these constraints are precisely what make the outputs reliable enough to power production automation. In enterprise systems, reliability always wins over open-ended capabilities.
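A downstream consumer can verify the non-overlap guarantee cheaply. The helper below is a hypothetical client-side check, not how TBM enforces the constraint; it assumes each segment has start_time and end_time in seconds with start_time no later than end_time.

```python
def overlapping_pairs(segments, tolerance=0.0):
    """Return index pairs of start-ordered neighbours whose time ranges overlap.

    An empty list means the whole timeline is non-overlapping: once segments
    are sorted by start time, checking adjacent pairs is sufficient.
    """
    ordered = sorted(enumerate(segments), key=lambda item: item[1]["start_time"])
    clashes = []
    for (i, a), (j, b) in zip(ordered, ordered[1:]):
        if b["start_time"] < a["end_time"] - tolerance:
            clashes.append((i, j))
    return clashes


clean = [
    {"start_time": 0.0, "end_time": 10.0},
    {"start_time": 10.0, "end_time": 20.0},
]
clashing = [
    {"start_time": 0.0, "end_time": 10.0},
    {"start_time": 9.5, "end_time": 20.0},
]

print(overlapping_pairs(clean))     # []
print(overlapping_pairs(clashing))  # [(0, 1)]
```

The tolerance parameter lets a pipeline accept tiny rounding overlaps at shared boundaries while still rejecting real double coverage.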


What’s Next

We are still in the early innings of video assetization. Support for longer-context processing (3+ hours), richer multimodal conditioning (referencing multiple images and assets simultaneously), and extending this structured extraction beyond video to other containerized assets like audio and podcasts are top of mind for our engineering teams.

However, the grander challenge is ensuring our evaluation frameworks scale in tandem with model capabilities. As video-language models become more powerful, their failure modes become more subtle, demanding increasingly sophisticated evaluation methods.

The vast majority of the world's video archives remain dark. Unlocking their value requires more than just powerful models; it requires the frameworks to measure whether those models are pinpointing the right moments and describing them accurately.


To learn more about Pegasus 1.5 → Pegasus 1.5 Tech Blog

We are looking for builders to join us on this journey → Twelve Labs Careers

Broadcasters, sports leagues, media companies, and enterprise platforms hold petabyte-scale video archives. Yet, most of this content remains in a state we call dark video. It exists, but it is unsearchable, unstructured, and cannot be utilized at a semantic level.

The reason is surprisingly simple: video is not text. You cannot run a grep on video files, nor can you execute SELECT * FROM video WHERE scene = 'scoring_play'. For video to have economic value, it must first be broken down into structured segments and machine-readable metadata. We call this process video assetization.

This article is the story of how we built Time-Based Metadata (TBM) and its evaluation framework. Along the way, one thing became crystal clear: designing and evaluating time-and-semantic-based video understanding requires a fundamentally different approach than text or image models.


1. The Video Assetization Gap: What Customers Actually Want

Consider a broadcaster managing a 200,000-hour news archive. Currently, this work is done manually. Video archivists manually log each segment, tag stories, identify speakers, and mark topic boundaries. It costs about $15 per hour of video, and budgets are shrinking. This approach simply does not scale.

Let's look at another example. Imagine a consumer brand looking to track appearances of their products across thousands of influencer videos. They need to find every moment a specific product appears on screen while the creator speaks directly to the camera. Merely knowing that the product appeared is not enough. They need to know exactly when, how prominently, and in what context it appeared.

These use cases are not hypothetical. Broadcast archiving, automated sports highlights, brand intelligence, compliance auditing—nearly every enterprise video workload we encounter demands two independent capabilities simultaneously:

  1. Precise Temporal Boundaries: Where does each segment start and end?

  2. Schema-Conforming Structured Metadata: What happened inside that segment?

If you ask most video-language models today, "What is happening in this video?", they generate fluent paragraphs of text. But the moment you ask them to "provide start and end timestamps for all editorial narratives along with structured fields (topic, speaker, confidence)," you quickly realize that general-purpose video reasoning and production-grade segmentation are two entirely different problems.

Let's visualize the actual output a broadcaster expects. Each editorial segment requires a title, description, editorial topic, featured people, and confidence level. For a one-hour news program, the TBM output looks like this:


Each segment includes structured metadata like editorial_subjects, visual_subjects, names, and confidence. It is the moment a 1-hour program transforms into a set of queryable objects.


2. Why Traditional Approaches Fail

The Boundary Prediction Problem

Large language models excel at span-level reasoning (summarization, Q&A, description). However, predicting where segment boundaries lie is a fundamentally different challenge.

Boundary prediction lies at the intersection of three technical challenges:

  • Weakly-supervised temporal localization: Ground truth is sparse and subjective.

  • Multimodal change-point detection: Boundaries are defined by simultaneous transitions in visual, audio, and semantic cues.

  • Token-level classification under autoregressive decoding: The model must output highly precise timestamps as tokens.

In our early experiments, even frontier models produced boundaries that were semantically reasonable but temporally imprecise—segment starts that should have been hard cuts were off by 5 to 15 seconds. The model understood what was happening, but struggled to pin down exactly when it happened.


Without a Schema, It's Not an Asset

Most video-language models produce free-form text.

"The video shows a news broadcast. Around the 2-minute mark, it transitions to a sports segment..."

This is useful for human consumption, but practically useless for downstream automation. What enterprise workflows actually require is this:

{
  "start_time": 120.45,
  "end_time": 245.80,
  "metadata": {
    "segment_title": "NFL Playoffs Recap",
    "editorial_subjects": ["NFL", "Playoffs", "Injuries"],
    "names": ["Patrick Mahomes", "Lamar Jackson"],
    "confidence": "HIGH"
  }
}

The gap between expressive text and machine-actionable structured output is more than just a formatting issue; it is a fundamental challenge tied to both modeling and evaluation.


3. Twelve Labs' Approach: Schema-Conditioned Temporal Extraction

Defining the Problem

When designing TBM, we faced a crucial fork in the road: Do we build a general-purpose model that takes a video and answers anything (with segmentation as just one of its features), or do we build a schema-conditioned extraction model that takes a video alongside a user-defined schema and fills out those segments and fields accordingly?

At first, the former seemed more natural. However, the failure patterns we saw in §2—drifted boundaries, free-form fields, and inconsistent outputs across runs—turned out to be structural limitations of the general-purpose approach. Without a schema, the model has no way of knowing what to find or how precisely to find it, and evaluation is reduced to checking if the output "looks plausible."

We chose the latter. TBM is not a general-purpose model trying to do everything; it is an extraction model optimized specifically for precise temporal boundaries and schema-conforming metadata.


Three Key Advantages of Schema-Conditioning

In TBM, users provide segment_definitions, which serve as a structured specification of what segments to find and what metadata fields to extract from each. This is not just a convenient API design. From a modeling perspective, schema conditioning provides three key advantages:

Reduction of the Search Space. Instead of generating open-ended natural language about any visual detail, the model operates within a constrained output space defined by the schema. For sports, specifying fields like down, scoring_play, and penalty_type is sufficient. This cleanly prunes away a massive space of possible but irrelevant observations.

Boundary Alignment Stabilization. Once the model knows exactly what it is looking for (e.g., a "play boundary" vs. an "ad transition"), it can leverage domain-specific temporal cues (whistles, formation changes, fade-to-black patterns) rather than relying on generic change-point heuristics.

Evaluation Anchoring. Each field in the schema becomes a concrete, measurable evaluation target. Instead of asking the vague question "Did the model understand the video?", we can ask "Did the model correctly identify the down field as 3 for this segment?"


4. Coverage Breadth: The 4-Layer Signal Framework

For schema conditioning to work effectively, the schema must map to the right signals. One of the reasons video segmentation is notoriously difficult is that the signals defining a boundary span multiple modalities and abstraction levels. These range from frame-level visual changes like camera angle cuts, to structural narrative transitions in news packages, physical audio cues like voice changes, and composite events in sports requiring simultaneous visual, audio, and rules-based understanding. A single unified approach cannot solve this alone.

We systematized this diversity into four distinct Cue Layers:

Cue Layer

Description

Example Boundary Signal

Low-Level Visual

Visual changes within frame transitions. Focuses on localized, pixel-level visual details.

Shot transitions, camera angle changes

High-Level Semantic

Global semantic or narrative changes. Spans both visual and audio streams.

Topic transitions, editorial package boundaries (Anchor ↔ Field report ↔ Studio)

Audio

Acoustic cues including speech, music, sound effects, and silence.

Speaker transitions, music/BGM transitions

Composite

Multimodal signals requiring joint visual, audio, and contextual understanding.

Play-by-play events in sports, commercial ad boundaries


What This Classification Means

Generic video models typically treat these four layers interchangeably. However, in production, a customer's segmentation requirement always maps directly to one or a combination of these layers:

  • Broadcast news "editorial narrative segmenting" → High-Level (topic transition + editorial package boundaries)

  • Video production "shot boundary detection" → Low-Level Visual (camera cuts, angle changes)

  • Sports league "play-by-play segmentation" → Composite (visual formations + whistle cues + game state rules)

  • Podcast platform "speaker diarization by topic" → Audio (speaker transitions + ASR topic drift)

This distinction isn't just academic; it dictates modeling strategies, training data curation, and evaluation metrics. Low-Level Visual behaves like a frame-wise change detection problem, whereas Composite demands long-context temporal reasoning.


5. Multimodal Grounding: Moving Beyond Text

Enterprise video segmentation projects often hit a wall: text descriptions alone are insufficient to specify what needs to be found.


Why Text-Only Falls Short

Consider a travel content platform with a specific problem:

Find every moment where the N Seoul Tower appears across tens of thousands of travel vlogs. Classify whether it's a distant establishing shot over the Han River, a walking shot from Namsan Park, a close-up from the observation deck, or a night-time light-up scene.

Text descriptions quickly hit their limit. Just because a model understands the name 'N Seoul Tower' in text does not mean it can reliably anchor that to its visual appearance. In video-language models, semantic familiarity with an entity is distinct from visual grounding. Depending on the training distribution, the visual representation might be weak, or the model might confuse it with other structurally similar towers like Tokyo Tower or CN Tower. However, providing a reference image of the N Seoul Tower allows the model to bypass the complex translation of text to vision, aligning the image embedding directly with the video frames.


The Entity Reference System

TBM introduces support for media_sources. By placing reference tags like <reference_name> within the segment definitions, users can let the model point directly to user-provided reference assets.

Source - King Sejong Institute Foundation, https://www.kogl.or.kr/recommend/recommendDivView.do?oc=&recommendIdx=91796&division=img#

{
  "segment_definitions": [{
    "id": "namsan_tower_appearances",
    "media_sources": [
      { "name": "namsan_tower_img", "media_type": "image", "media_url": "<https://cdn>.../namsan_tower.jpg" }
    ],
    "description": "Moments where the N Seoul Tower, as identified by <namsan_tower_img>, appears on screen",
    "fields": [
      { "name": "screen_prominence", "type": "string",
        "enum": ["HERO_SHOT", "PARTIAL", "BACKGROUND"] },
      { "name": "shot_type", "type": "string",
        "enum": ["CITY_ESTABLISHING", "PARK_WALK", "OBSERVATION_DECK", "NIGHT_LIGHTUP"] }
    ]
  }]
}

From a modeling perspective, this reframes segmentation from an open-vocabulary detection task to a grounded retrieval + temporal localization task:

  1. Reference resolution: The model binds <namsan_tower_img> to specific visual feature patterns across the video.

  2. Visual embedding alignment: The reference image is encoded into the same representation space as the video frames.

  3. Conditional boundary detection: Segment boundaries are dictated by the arrival and departure of the grounded visual entity, rather than generic scene shifts.

This is the difference between "find a city shot" and "find a shot containing this specific tower." It doesn't just change the technical difficulty; it fundamentally changes the utility of the product.

With entity grounding, segmentation can fail across three different vectors: temporal boundaries, semantic labels, and visual entity resolution. This complexity demands a more robust approach to evaluation.


6. Guaranteeing Quality: Dual-Track Evaluation

One of our most critical realizations while running TBM in production was that a single aggregate score cannot adequately capture segmentation quality.


The Confounding Problem

Consider these two failure scenarios:

Scenario A. The model generates a segment from 10.0s to 25.0s. The true ground truth segment is 12.0s to 48.0s. It captured the anchor's intro but missed the actual 23-second news story that followed. The semantic tags (anchor name, topic) were perfect, but the temporal coverage was poor.

Scenario B. The model correctly captures the boundary from 12.0s to 48.0s, but tags a "political analysis" segment as a "weather report." The boundary is perfect, but the metadata is wrong.

An aggregate evaluation metric that blends time and semantics cannot distinguish between these two failures. Yet, the remedy for each is completely different: Scenario A requires tuning temporal modeling, while Scenario B requires fixing semantic grounding. If we do not untangle these failure modes, we cannot confidently iterate on the model.


Our Solution: Two Independent Evaluation Tracks

Segment Track: "Did the model find the right window of time?"

We evaluate temporal accuracy across two dimensions: segment-level accuracy (how well individual segments are captured) and timeline-level coverage (how much of the timeline is correctly mapped). Both are essential; a model can score high on one and low on the other, and knowing where it fails dictates the path forward. Our final comparison uses a composite of the two as the primary diagnostic metric.
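
A rough sketch shows how the two dimensions can disagree. The definitions below are assumptions chosen for illustration (segment-level accuracy as the share of ground-truth segments matched at IoU of at least 0.5, timeline-level coverage as the share of ground-truth time overlapped by any prediction); the production metrics may differ. The coverage function also assumes non-overlapping segments on both sides.

```python
def overlap(a, b):
    """Seconds shared by two (start_s, end_s) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def iou(a, b):
    inter = overlap(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_accuracy(preds, truths, min_iou=0.5):
    """Share of ground-truth segments matched by some prediction."""
    hits = sum(1 for t in truths if any(iou(p, t) >= min_iou for p in preds))
    return hits / len(truths) if truths else 0.0


def timeline_coverage(preds, truths):
    """Share of ground-truth time overlapped by predictions.

    Assumes neither list contains overlapping segments.
    """
    total = sum(t[1] - t[0] for t in truths)
    covered = sum(overlap(p, t) for p in preds for t in truths)
    return covered / total if total > 0 else 0.0


truths = [(0.0, 10.0), (10.0, 20.0), (20.0, 100.0)]
one_big_segment = [(0.0, 100.0)]  # covers everything, finds almost nothing

print(segment_accuracy(one_big_segment, truths))   # 1 of 3 segments matched
print(timeline_coverage(one_big_segment, truths))  # 1.0
```

A single prediction spanning the whole timeline gets full coverage but matches only one of three segments, which is exactly the kind of split a single number would hide.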

Metadata Track: "Within that window, did the model extract the correct structured values?"

Once segments are temporally matched, we evaluate the metadata fields within those matched pairs independently. We use an LLM-as-judge framework to score fields on a 0.0 to 5.0 scale, guided by field-specific schemas and criteria.

An important detail: we scale the metadata score by the strength of the temporal alignment. This prevents "lucky guesses" on poorly aligned boundaries from inflating the metadata accuracy metric.
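
As a sketch of what such scaling could look like, the snippet below multiplies a 0.0 to 5.0 judge score by a 0.0 to 1.0 alignment value. Multiplicative scaling and the function name are assumptions for illustration; the text above only says that the score is scaled by alignment strength.

```python
def scaled_metadata_score(judge_score, alignment):
    """Down-weight a judge score (0.0-5.0) by temporal alignment (0.0-1.0).

    Multiplication is an assumed scaling rule, used here for illustration.
    """
    if not 0.0 <= judge_score <= 5.0:
        raise ValueError("judge_score must be within [0.0, 5.0]")
    if not 0.0 <= alignment <= 1.0:
        raise ValueError("alignment must be within [0.0, 1.0]")
    return judge_score * alignment


# Perfect tags on a weakly aligned segment (13s overlap over a 38s union).
print(round(scaled_metadata_score(5.0, 13.0 / 38.0), 2))  # 1.71
# Slightly imperfect tags on an exactly aligned segment.
print(scaled_metadata_score(4.0, 1.0))                    # 4.0
```

Under this rule, a perfect judge score on a segment that barely overlaps the ground truth contributes less than a good score on a well-aligned one.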

Operational benefits: Decoupling metadata evaluation from the video inference pipeline allows us to iterate rapidly on LLM-as-judge prompt templates and scoring rubrics without having to re-run expensive video model inference.

Why this separation matters: If temporal scores go up but metadata accuracy drops, it is a clear signal that the model is gaining temporal precision at the cost of semantic accuracy. This suggests we need to adjust the training data balance, not the model architecture. This level of diagnosis is only possible with split-track evaluation.


7. The Semantic Flywheel: The Compounding Value of Video Assets

Assetization is not a one-off project. When built correctly, it forms a self-reinforcing loop where each run fuels the next.


We call this the Semantic Flywheel. Whether the domain is broadcast news, brand intelligence, compliance auditing, or sports highlights, the workloads share the same trajectory. As segment-metadata pairs accumulate, user corrections (adjusted boundaries, rejected tags, modified values) turn directly into high-fidelity training data for the next generation of models. The archive stops being a passive storage cost and becomes a dynamic system that grows more orderly over time.

The sole prerequisite for this flywheel to spin is observable improvement. If you cannot measure temporal precision independently of metadata precision, you cannot know which direction the model is moving. Without direction, systematic iteration is impossible, and the flywheel stalls. Without an evaluation framework that mirrors your asset structure, assetization remains a parlor trick.


8. What We Learned Along the Way

Building TBM and its evaluation framework brought several critical insights:

Segmentation is not Q&A. Treating temporal segmentation as "just another prompt-based task" for an LLM led us down unproductive paths for months. Boundary prediction has its own distinct failure modes, evaluation needs, and architecture sensitivities. Recognizing this earlier would have saved us that time.

Structured output is far more valuable than fluent prose. In production, a well-formed JSON object with slightly imperfect boundaries is far more useful than an expressive, unstructured narrative paragraph. Prioritizing machine-readability over human-readability was the right decision.

Determinism is a feature, not a constraint. Enforcing non-overlapping segments, strict schema adherence, and a default temperature=0 initially faced resistance for "limiting model creativity." In practice, these constraints are precisely what make the outputs reliable enough to power production automation. In enterprise systems, reliability always wins over open-ended capabilities.
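
Constraints like these are cheap to check downstream. Below is a minimal sketch of a validator for the segment shape used in this article (start_time, end_time, and a metadata object with a confidence field). The function name and the allowed values other than "HIGH" are assumptions for illustration.

```python
def validate_segments(segments, allowed_confidence=("HIGH", "MEDIUM", "LOW")):
    """Return a list of human-readable violations; an empty list means valid.

    Checks ordering within a segment, overlap between neighbours, and that
    the confidence field uses an allowed value (value set is assumed).
    """
    errors = []
    ordered = sorted(segments, key=lambda s: s["start_time"])
    for i, seg in enumerate(ordered):
        if seg["end_time"] <= seg["start_time"]:
            errors.append(f"segment {i}: end_time must be after start_time")
        if i > 0 and seg["start_time"] < ordered[i - 1]["end_time"]:
            errors.append(f"segment {i}: overlaps previous segment")
        if seg["metadata"].get("confidence") not in allowed_confidence:
            errors.append(f"segment {i}: unexpected confidence value")
    return errors


segments = [
    {"start_time": 0.0, "end_time": 12.0, "metadata": {"confidence": "HIGH"}},
    {"start_time": 10.0, "end_time": 25.0, "metadata": {"confidence": "MAYBE"}},
]
for problem in validate_segments(segments):
    print(problem)
```

A check like this can gate automation: output that fails validation is rejected or retried rather than passed on to downstream systems.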


What’s Next

We are still in the early innings of video assetization. Top of mind for our engineering teams are support for longer-context processing (3+ hours), richer multimodal conditioning (referencing multiple images and assets simultaneously), and extending this structured extraction beyond video to other time-based media such as audio and podcasts.

However, the bigger challenge is ensuring our evaluation frameworks scale in tandem with model capabilities. As video-language models become more powerful, their failure modes become more subtle, demanding increasingly sophisticated evaluation methods.

The vast majority of the world's video archives remain dark. Unlocking their value requires more than just powerful models; it requires the frameworks to measure whether those models are pinpointing the right moments and describing them accurately.


To learn more about Pegasus 1.5 → Pegasus 1.5 Tech Blog

We are looking for builders to join us on this journey → Twelve Labs Careers
