From Dark Video to Structured Assets: How We Built Segment Metadata Extraction

Kevin Lee

How TwelveLabs built SME (Segment Metadata Extraction) — a schema-conditioned temporal extraction system that turns unsearchable video archives into structured, queryable assets — and the dual-track evaluation framework that makes iterative improvement possible.


2026/05/11

15 mins


Broadcasters, sports leagues, media companies, and enterprise platforms sit on petabyte-scale video archives. Most of that content is what we call dark video — it exists, but you can't search it, it isn't structured, and you can't operate on it at the level of meaning.

The reason is almost embarrassingly simple: video isn't text. You can't grep a video file, and SELECT * FROM video WHERE scene = 'scoring_play' isn't a thing. For video to carry economic value, it first has to be broken down into structured segments and machine-readable metadata. We call this video assetization.

This post is the story of how we've built Time-Based Metadata (TBM) and the evaluation system around it. One thing has become clear along the way: designing and evaluating temporally-grounded video understanding requires a different approach from what works for text or image models.


1. The Assetization Gap: What Customers Actually Want

Picture a broadcaster managing a 200,000-hour news archive. Today, that work is done by hand. Archivists manually log each segment, tag stories, identify speakers, and mark topic boundaries. It costs roughly $15 per hour of video, and budgets are shrinking. This simply doesn't scale.

Or take a CPG brand trying to track its product placements across thousands of influencer videos. They need every moment a creator speaks to camera while the product is on screen. It's not enough to know the product appeared — they need to know exactly when, how prominently, and in what context.

These aren't hypotheticals. Broadcast archive management, sports highlight automation, brand intelligence, compliance auditing — nearly every enterprise video workload we encounter demands two independent things at once:

  1. Precise temporal boundaries: where does each segment start and end?

  2. Schema-conforming structured metadata: what happened inside that segment?

Ask most video-language models today "what's happening in this video?" and you'll get a fluent paragraph back. But the moment you ask for "the start/end timestamps of every editorial narrative, with structured fields for topic, speaker, and confidence," you quickly discover that general video reasoning and production-grade segmentation are very different problems.

Here's what a broadcaster actually needs as output. Each editorial segment requires a title, description, editorial subjects, named entities, and a confidence level. For a one-hour news program, TBM produces something like:
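An illustrative sketch (the values here are invented; the real field set is whatever the customer's schema defines):

[
  {
    "start_time": 0.0,
    "end_time": 94.2,
    "metadata": {
      "segment_title": "Severe Weather Warning for the Midwest",
      "description": "Anchor intro and field report on storm damage",
      "editorial_subjects": ["weather", "public safety"],
      "visual_subjects": ["storm footage", "field reporter"],
      "names": ["Jane Doe"],
      "confidence": "HIGH"
    }
  },
  {
    "start_time": 94.2,
    "end_time": 201.7,
    "metadata": {
      "segment_title": "City Council Budget Vote",
      "description": "Studio panel discussion of the vote",
      "editorial_subjects": ["local politics", "budget"],
      "visual_subjects": ["council chamber", "studio panel"],
      "names": ["John Smith"],
      "confidence": "MEDIUM"
    }
  }
]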


Each segment carries structured metadata — editorial_subjects, visual_subjects, names, confidence. That's the moment a one-hour program turns into a set of queryable objects.


2. Why Existing Approaches Fall Short

The Boundary Prediction Problem

Large language models excel at reasoning over a span — summarization, Q&A, content description. But predicting where the segment boundaries are is a different kind of task entirely.

Boundary prediction sits at the intersection of three hard problems:

  • Weakly-supervised temporal localization: ground truth is sparse and subjective.

  • Multimodal change-point detection: a boundary is defined by simultaneous shifts across visual, audio, and semantic signals.

  • Token-level classification under autoregressive decoding: the model has to emit precise timestamps as tokens.

In our early experiments, even frontier models produced boundaries that were semantically reasonable but temporally off — segments that should have started on a hard cut would drift by 5 to 15 seconds. The model understood what was happening; it just couldn't pinpoint when.

Without a Schema, It Isn't an Asset

Most video-language models generate free-form text:

"This video shows a news broadcast. Around the two-minute mark it transitions to a sports segment..."

That's useful for a human reader. It's nearly useless for downstream automation. What enterprise workflows actually need looks like this:

{
  "start_time": 120.45,
  "end_time": 245.80,
  "metadata": {
    "segment_title": "NFL Playoff Recap",
    "editorial_subjects": ["NFL", "playoffs", "injury report"],
    "names": ["Patrick Mahomes", "Lamar Jackson"],
    "confidence": "HIGH"
  }
}
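Once output lands in this shape, downstream automation becomes ordinary data filtering. A minimal sketch in Python, where segments is a hypothetical list shaped like the object above:

segments = [  # hypothetical TBM output, shaped like the JSON above
    {
        "start_time": 120.45,
        "end_time": 245.80,
        "metadata": {
            "segment_title": "NFL Playoff Recap",
            "names": ["Patrick Mahomes", "Lamar Jackson"],
            "confidence": "HIGH",
        },
    },
]

# Every high-confidence moment a given player appears is now a simple filter.
mahomes_clips = [
    (s["start_time"], s["end_time"])
    for s in segments
    if "Patrick Mahomes" in s["metadata"]["names"]
    and s["metadata"]["confidence"] == "HIGH"
]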

The gap between expressive text and machine-readable structured output isn't just a formatting issue. It's a modeling problem and an evaluation problem, intertwined.


3. Our Approach: Schema-Conditioned Temporal Extraction

Choosing Which Problem to Solve

When we set out to design TBM, we hit a fork in the road. Build a general-purpose model that takes a video and answers anything about it (with segmentation as one capability among many)? Or build a schema-conditioned extraction model that takes a video plus a user-defined schema and fills in segments and fields exactly as that schema specifies?

The first option felt natural at first. But the failure patterns from §2 — fuzzy boundaries, free-form fields, run-to-run output drift — turned out to be structural limitations of the general-purpose framing. Without a schema, the model has no way to know what to look for or how precisely to find it, and evaluation never gets past "did it sound plausible?"

We chose the second. TBM isn't a model that does everything; it's an extraction model optimized for precise temporal boundaries and schema-conforming metadata.

What Schema Conditioning Actually Buys You

In TBM, users provide segment_definitions — a structured spec of which segments to find and which metadata fields to extract from each. This isn't just a convenient API choice. From a modeling standpoint, schema conditioning does three things:

Shrinks the search space. Instead of generating free-form natural language about every aspect of the video, the model operates inside the bounded output space the schema defines. For a sports broadcast, specifying just down, scoring_play, and penalty_type is enough — the vast space of plausible-but-irrelevant observations gets pruned away.
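As a sketch, that sports schema might be expressed as a segment_definitions spec like the one below. The fields structure mirrors the entity-reference example in §5; the id, description, and enum values here are invented for illustration.

{
  "segment_definitions": [{
    "id": "football_plays",
    "description": "Individual plays, from snap to whistle",
    "fields": [
      { "name": "down", "type": "integer" },
      { "name": "scoring_play", "type": "boolean" },
      { "name": "penalty_type", "type": "string",
        "enum": ["NONE", "HOLDING", "OFFSIDE", "PASS_INTERFERENCE"] }
    ]
  }]
}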

Stabilizes boundary alignment. Once the model knows what it's looking for ("play boundary" vs. "ad transition"), it can lean on domain-specific temporal cues — whistles, formation changes, fade-to-black patterns — instead of falling back on generic change-point heuristics.

Anchors evaluation. Each schema field becomes a concrete, measurable evaluation target. Instead of the fuzzy "did the model understand the video?" question, you can ask "did the model correctly identify the down field as 3 for this segment?"


4. Coverage Breadth: A Four-Tier Cue System

For schema conditioning to actually work, you need to know what kinds of signals the schema maps onto. And one of the things that makes video segmentation hard is that the kinds of signals defining a boundary vary wildly — frame-level visual changes like camera-angle cuts, structural transitions in news editorial packages, audio cues like speaker changes, and composite events in sports that combine vision, audio, and game rules. No single approach handles all of them.

We classified this diversity into four cue tiers:

| Cue Tier | Description | Boundary Signal Examples |
| --- | --- | --- |
| Low-Level Visual | Visual changes inside the frame; narrow visual detail. | Shot transitions, camera-angle changes |
| High-Level Semantic | Semantic / narrative-level global shifts, across both visual and audio streams. | Topic transitions, editorial package boundaries (anchor ↔ field ↔ studio) |
| Audio | Auditory signals — speech, music, SFX, silence. | Speaker turns, music/BGM transitions |
| Composite | Multimodal signals combining visual, audio, and contextual information. | Per-play sports events, ad breaks |

Why This Taxonomy Matters

Generic video models don't distinguish these tiers. But real customer segmentation needs map cleanly onto specific tiers, or specific combinations of them:

  • A broadcaster's "separate editorial narratives" → High-Level (topic transitions + editorial package boundaries)

  • An editorial team's "shot boundary detection" → Low-Level Visual (camera cuts, angle changes)

  • A sports league's "per-play segmentation" → Composite (formation + whistle + game rules)

  • A podcast platform's "speaker-segmented transcription" → Audio (speaker turns + ASR topic changes)

This isn't taxonomy for taxonomy's sake. Each tier demands different modeling strategies, training data, and evaluation metrics. Low-Level Visual is closer to frame-wise change detection; Composite requires long-context reasoning.


5. Multimodal Grounding: Beyond Text

There's a recurring constraint we hit in enterprise video segmentation: text alone often can't specify what to look for.

Why Text Isn't Enough

Take a travel content platform. The problem they want to solve:

Across tens of thousands of vlogs by foreign visitors to Seoul, find every moment "N Seoul Tower" appears on screen, and classify whether it's a distant establishing shot from the Han River or city skyline, a cut while walking through Namsan Park, a close-up from the observation deck, or a night-time light-up.

Text alone doesn't cut it. Just because a model knows the name "N Seoul Tower" doesn't mean it can reliably summon the visual appearance. Linguistic recognition and visual identification of a specific entity are separable problems in video-language models. Depending on training distribution, the visual representation may be fuzzy, or it may get confused with similar-looking broadcasting towers like Tokyo Tower or the CN Tower. Hand the model a single reference image of N Seoul Tower, and now it has a visual-embedding anchor to compare directly against video frames. The "translate text into vision" burden disappears.

The Entity Reference System

TBM supports media_sources. You insert <reference_name> tags inside a segment definition, and the model can point to user-registered reference images directly from the prose.


Source: King Sejong Institute Foundation (KSIF), https://www.kogl.or.kr/recommend/recommendDivView.do?oc=&recommendIdx=91796&division=img#

{
  "segment_definitions": [{
    "id": "namsan_tower_appearances",
    "media_sources": [
      { "name": "namsan_tower_img", "media_type": "image", "media_url": "<https://cdn>.../namsan_tower.jpg" }
    ],
    "description": "Moments where N Seoul Tower, identified by <namsan_tower_img>, appears on screen",
    "fields": [
      { "name": "screen_prominence", "type": "string",
        "enum": ["HERO_SHOT", "PARTIAL", "BACKGROUND"] },
      { "name": "shot_type", "type": "string",
        "enum": ["CITY_ESTABLISHING", "PARK_WALK", "OBSERVATION_DECK", "NIGHT_LIGHTUP"] }
    ]
  }]
}

From a modeling standpoint, this reframes segmentation from an open-vocabulary detection problem into a grounded retrieval + temporal localization problem:

  1. Reference resolution: the model has to bind <namsan_tower_img> to a specific visual appearance pattern across the entire video.

  2. Visual embedding alignment: the reference image is encoded into the same representation space as the video frames.

  3. Conditional boundary detection: segment boundaries are determined by the co-occurrence of a specific visual entity, not by generic scene transitions.
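To make the grounded retrieval + temporal localization framing concrete, here is a generic sketch: score each frame embedding against the reference-image embedding with cosine similarity, then merge consecutive hits into intervals. This is a conceptual illustration only; the threshold, merging rule, and function names are assumptions, not TBM internals.

import numpy as np

def reference_hits(frame_embs: np.ndarray, ref_emb: np.ndarray,
                   timestamps: np.ndarray, thresh: float = 0.35):
    """Cosine-score every frame against the reference image, then merge
    consecutive above-threshold frames into candidate intervals."""
    sims = frame_embs @ ref_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(ref_emb) + 1e-9)
    intervals, start = [], None
    for t, hit in zip(timestamps, sims >= thresh):
        if hit and start is None:
            start = float(t)                      # interval opens on first hit
        elif not hit and start is not None:
            intervals.append((start, float(t)))   # interval closes on first miss
            start = None
    if start is not None:                         # entity still on screen at video end
        intervals.append((start, float(timestamps[-1])))
    return intervals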

That's the difference between "find cityscape shots" and "find shots where N Seoul Tower appears." Not a difference in technical difficulty — a difference in what problem the product is actually solving.

Once entity grounding enters the picture, segmentation can fail along three axes: temporal boundaries, semantic labels, and visual entity resolution. That adds another dimension to the evaluation problem.


6. Quality Assurance: Dual-Track Evaluation

The most decisive — and most counterintuitive — finding from operating TBM was that you can't judge segmentation quality with a single score.

The Hidden Coupling Problem

Compare these two failure scenarios:

Scenario A. The model produces a 10.0s–25.0s segment. The ground truth is 12.0s–48.0s. It caught the anchor's 15-second intro but completely missed the 23 seconds of main story that followed. Anchor name, topic, speaker tagging — all perfect. But the temporal boundary covered only a third of the actual content.

Scenario B. The model nailed 12.0s–48.0s exactly. But it tagged "political analysis" as "weather segment." Boundaries perfect, metadata wrong.

A single aggregate score can't distinguish these two cases. But the fixes pull in completely different directions. The first needs better temporal modeling; the second needs better semantic grounding. Without separating these failure modes, you can't even tell whether a model improvement actually helped.

Our Solution: Two Independent Evaluation Tracks

Segment Track: "Did the model find the right intervals?"

Measures temporal accuracy. We look at both the segment level (how well were individual intervals captured?) and the timeline level (which parts of the timeline were covered correctly?). Both matter — one can be strong while the other is weak, and which side fails tells you what to fix. For final comparisons we use a composite score combining both views as the headline metric.
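A minimal sketch of how such a dual-view temporal score can be computed. The greedy matching, sampling step, and blend weight are illustrative assumptions, not the production metric:

from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) in seconds

def iou(a: Interval, b: Interval) -> float:
    """Temporal intersection-over-union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_level(preds: List[Interval], gts: List[Interval]) -> float:
    """Segment view: mean best-match IoU per ground-truth interval."""
    if not gts:
        return 1.0 if not preds else 0.0
    return sum(max((iou(g, p) for p in preds), default=0.0) for g in gts) / len(gts)

def timeline_level(preds: List[Interval], gts: List[Interval],
                   step: float = 0.5) -> float:
    """Timeline view: fraction of sampled time points whose covered/uncovered
    status agrees between prediction and ground truth."""
    end = max((e for _, e in preds + gts), default=0.0)
    covered = lambda t, ivs: any(s <= t < e for s, e in ivs)
    ticks = [i * step for i in range(int(end / step) + 1)]
    return sum(covered(t, preds) == covered(t, gts) for t in ticks) / len(ticks)

def composite(preds: List[Interval], gts: List[Interval], w: float = 0.5) -> float:
    # Headline metric: a weighted blend of the two views (weight is illustrative).
    return w * segment_level(preds, gts) + (1 - w) * timeline_level(preds, gts)

On Scenario A above, segment_level gives iou((12.0, 48.0), (10.0, 25.0)) = 13/38 ≈ 0.34, which lines up with the "covered only a third of the actual content" observation.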

Metadata Track: "Within the right intervals, did the model structure things correctly?"

After segments are matched, we evaluate each field independently within matched segment pairs. We use LLM-as-judge with field-type-specific guidelines on a 0.0–5.0 scale.

One more wrinkle. Field scores are weighted higher when boundary alignment is tighter. This prevents "looks-right" metadata from inflating scores in regions where the boundary itself was sloppy.
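A sketch of that weighting, reusing iou() from the block above. Using IoU itself as the weight is an illustrative assumption; the production weighting curve may differ:

def weighted_metadata_score(matched_pairs) -> float:
    """Aggregate per-field judge scores (0.0-5.0 scale), weighting each matched
    segment pair by how tightly its boundaries align. Each element of
    matched_pairs is (pred_interval, gt_interval, field_scores)."""
    total, weight_sum = 0.0, 0.0
    for pred_iv, gt_iv, field_scores in matched_pairs:
        w = iou(pred_iv, gt_iv)  # tighter boundary alignment -> higher weight
        total += w * (sum(field_scores) / len(field_scores))
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0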

A side benefit. Because metadata evaluation runs as a post-processing step decoupled from inference, we can iterate on the LLM-as-judge criteria — prompts, scoring rubrics — quickly, without re-running expensive video inference.

Why this separation matters. If the segment score improves but the metadata score drops, that's a signal: temporal localization got better but semantic precision got worse. The fix isn't model architecture, it's training data balance. You can only diagnose this with the tracks separated.


7. The Semantic Flywheel: Compounding Returns from Assetization

Assetization isn't a one-shot job. Once it actually starts working, you enter a self-reinforcing loop where each turn fuels the next:


We call this the Semantic Flywheel. Broadcast archives, brand intelligence, compliance auditing, sports highlight automation — the workloads running on TBM span very different domains, but they all follow the same arc. As segment-metadata pairs accumulate, the usage trace itself — boundaries customers correct, tags they reject, fields they edit — becomes the training signal for the next generation of the model. Once it's spinning, the archive doesn't just get "organized"; it becomes a system that organizes itself more precisely with every turn.

The single precondition for the flywheel to spin is that improvement has to be observable. If you can't measure temporal accuracy and metadata accuracy independently, you can't tell which one is improving and which is regressing. Without direction, no systematic iteration; without systematic iteration, the flywheel stalls. Without an evaluation system that mirrors the structure of the asset, assetization stays a one-off trick.


8. What We Learned

A few takeaways from building TBM and its evaluation framework:

Segmentation isn't Q&A. The instinct to treat temporal segmentation as "yet another LLM task" sent us down unproductive paths for a while. Boundary prediction has different failure modes, different evaluation requirements, and different sensitivities to model architecture. Recognizing this earlier would have saved us months.

Structured output beats fluent output. In production, a perfect JSON segment with slightly imprecise boundaries is far more useful than beautiful prose with no structured fields. Prioritizing machine-readability over human-readability turned out to be the right call.

Determinism is a feature, not a constraint. Non-overlapping segments, strict schema conformance, temperature=0 defaults — every one of these constraints initially drew pushback as "limiting model capability." In practice, they're what made the output reliable enough for production automation. When customers build systems on top of an API, reliability beats capability.


What's Next

Video assetization is still in its early days. Longer contexts (3+ hours), richer multimodal conditioning (multiple entities, each with dozens of reference images), and extending the same structured-extraction framing beyond video to other containerized assets like audio and podcasts — these are what we're building next.

The bigger challenge, though, is building an evaluation framework that scales alongside the capabilities. As video-language models grow more capable, failure modes get more subtle, and evaluation has to get sharper to keep up.

Most enterprise video archives still go untouched. Extracting value from them isn't just about better models. It's about being able to measure whether the model is actually finding the right moments — and describing them correctly.
