Media and Entertainment

Every Video Revolution Evolved What Stories We Could Tell

Ryan Khurana

Oct 16, 2025

10 Minutes


Marshall McLuhan famously observed that "it is the framework which changes with each new technology and not just the picture within the frame." In video production, this insight has proven prophetic—each technological leap hasn't merely sped up production; it's revealed entirely new forms of storytelling that were hiding in plain sight, waiting for the right tools to make them possible.

Today, we're witnessing another fundamental shift. But unlike past revolutions that imposed new constraints, AI video understanding removes the oldest bottleneck of all: human bandwidth for comprehending footage. This isn't about faster workflows—it's about stories that were theoretically possible but economically impossible finally becoming viable.


The Pattern: New Tech → New Bottleneck → New Content Grammar


1950s-1960s: When Film Reels Dictated Story Structure

In the early days of television, Lucille Ball and Desi Arnaz pioneered with I Love Lucy what has since become the standard for sitcoms. The show's three-camera setup, live audiences, and real-time 35mm film shooting remained staples for decades. Its narrative structure was dictated by the fact that film reels held only about 11 minutes of footage, forcing writers to build scenes around reel changes. The high cost of film stock meant episodes were shot almost in real time—Desilu Productions could film a 22-minute episode in just 60 minutes of shooting.

Source: https://picryl.com/media/filming-a-television-program-at-frenckells-studio-in-tampere-121965-1b5285

The result? Tightly scripted, theater-like performances with minimal editing. Networks saw so little value in recordings that 60-70% of the BBC's output from the 1950s-1970s was simply erased. Content was ephemeral by design.

The three-camera format that I Love Lucy pioneered is still a mainstay of sitcoms, and the three-act structure that film reel and projector limitations imposed remains the conventional standard for narrative storytelling. Successive revolutions in technology did not do away with prior forms of creative expression; they made it possible for other forms to flourish alongside them.


1970s-1980s: The Tape Revolution Creates Live TV and 24-Hour Formats

Magnetic tape shattered television's theatrical constraints. Filming more footage got a lot cheaper, and the cameras themselves could become smaller. Sony's U-matic format in 1974 enabled portable video, shrinking the time from shooting to broadcast from hours to minutes. Suddenly, "breaking news" became a format. "Live from the scene" became a genre. Instant replays became a sports necessity.

The pipe became the program. Twenty-four-hour news wasn't just more news—it was a fundamentally different beast, built on the assumption that cameras could be anywhere, anytime.

Meanwhile, Betacam in 1982 fused camera and recorder into a single unit, making field production even more nimble. The Quantel Paintbox (1981) turned TV into "graphic space"—essential for MTV's aesthetic and modern sports broadcasting.

Source: https://commons.wikimedia.org/wiki/File:Sony_Betacam_SP_with_Fujinon_lense_20080830.jpg

And crucially, VCRs and the 1984 Betamax decision legalized time-shifting. Audiences could now own time itself, creating entirely new markets for home video and fundamentally altering rerun economics.


1990s-2000s: Digital Archives Unlock Self-Referential Storytelling

The introduction of Avid's non-linear editing system in 1989 completed the transition from physical to digital. Suddenly, footage could be stored infinitely, searched instantly, and remixed endlessly. Archives transformed from worthless to valuable—old episodes found new life in reruns, DVD sets, and streaming libraries. "Clip shows" proliferated in the '90s and 2000s, mining past episodes for best-of compilations.

This technological shift birthed new content forms. Shows like Lost (2004-2010) could weave together multiple timelines—flashbacks, flash-forwards, and parallel realities—with precision impossible in linear editing. 24 (2001-2010) pioneered real-time storytelling with split screens showing simultaneous action across locations. The Wire (2002-2008) built season-long narrative arcs that referenced minute details from episodes aired years earlier. These weren't just faster ways to tell stories—they were narrative architectures that couldn't exist without random-access editing. Avid gave storytellers the ability to treat time itself as a malleable narrative element.

Source: https://commons.wikimedia.org/wiki/File:المونتاج_التلفزيوني_الخطي.jpg

It's difficult to overstate the transformational impact that non-linear editing had on video content. Stanley Kubrick famously noted that editing is the unique contribution of the language of film, the element that distinguishes it from other arts. In the era of linear editing, however, the director could exert total control over the process. Non-linear editing created a technical craft that established and expanded editing into its own creative discipline. The editor, rather than merely executing the director's vision, could now bring his or her own style and creative flair to the process.


2000s-2010s: Cheap Cameras Create Reality TV's "Found" Narratives

When digital cameras became affordable, producers could deploy dozens of cameras rolling continuously. Reality TV exploded—but the real innovation wasn't in filming; it was in post-production. Shows like The Bachelor might shoot hundreds of hours for a single episode, but it’s the armies of loggers and story producers mining footage to "find" the narrative after filming wrapped that make it compelling.

The bottleneck shifted from camera costs to human review time. But this constraint created a new art form: stories crafted in hindsight from authentic moments, giving viewers "reality" that was actually carefully constructed editorial storytelling.


2010s-2020s: Smartphones and Personalization Enable a Content Explosion

When billions gained pocket cameras via smartphones, platforms like Instagram and TikTok emerged with mobile-native formats: vertical, ephemeral, ultra-short. The constraints of small screens and short attention spans birthed new grammars—jump cuts, duets, challenges—optimized for algorithmic distribution rather than linear programming.

The rise of internet media consumption also gave rise to more proactive content recommendations personalized to each user. Netflix began not only recommending movies and shows to each user but personalizing the artwork for each title to increase engagement. TikTok's FYP created a watch-time-driven loop that rapidly fit content to personal taste. These recommendation engines, which leveraged metadata, transcripts, and visual signals, provided the consumer validation for video understanding models, which now stand ready to transform content once again.


The Current Revolution: From Decomposition to Comprehension

Here's what makes AI video understanding fundamentally different: Previous approaches decomposed video into constituent parts—transcript, objects, frames—then attempted to reconstruct meaning. This is like trying to understand a symphony by analyzing each instrument in isolation.

Video-native models comprehend the interplay between visual, temporal, and contextual elements simultaneously. They capture emergent properties that only exist in the interaction—the comedy in juxtaposition, the tension in timing, the meaning in movement.

Take, for example, this sequence from Shaun of the Dead, a film that is equal parts horror and comedy. Shaun's routine actions in the foreground and the zombie outbreak in the background are both portrayed with seriousness, but that juxtaposition creates the comedy. The humor doesn't exist in the visuals alone (mundane actions), the audio alone (ordinary sounds), or the narrative alone (getting ready for work).

TwelveLabs' Pegasus model was able to identify that Shaun's obliviousness contrasts with the clear signs of a zombie outbreak around him. What traditional, component-segmenting approaches to video analysis lost by focusing only on the measurable, video understanding models retrieve.

This is demonstrated most clearly in the model's ability to explain the humor of the TV sequence in which Shaun rapidly flips channels while being warned of the zombie apocalypse. Pegasus understood that each channel carried important information that Shaun failed to acknowledge. While it missed some subtleties (like the way the warnings cohere across channel flips), it grasped something more fundamental: that something was "off" about the scene, and that this is what made it funny.

This isn't incremental improvement. It's the difference between a system that can identify "person" and "zombie" and one that understands their relationship creates comedy.

We are still in the early days of video understanding, and while the models continue to improve, the other key innovation lies in the creativity of early adopters who build narrative formats around previously non-existent capabilities.

Previous revolutions imposed constraints that forced innovation. AI understanding removes the human bandwidth constraint that made certain formats economically impossible. Here's what becomes viable:


Multi-Perspective Narratives:

When Netflix released Arrested Development Season 4 in 2013, each episode followed a single character through the same time period. The experiment was bold—viewers could theoretically watch in any order and piece together the larger story. But audiences didn’t react well to the new format.

Netflix spent millions re-editing the entire season into chronological order for "Season 4 Remix: Fateful Consequences." This required editors to manually track every scene's temporal position, identify narrative dependencies, and rebuild 15 episodes into 22. The cost? Hundreds of hours of editorial work plus Mitch Hurwitz's direct involvement to maintain coherence.

Source: https://www.reddit.com/r/arresteddevelopment/comments/1g5y0k/arrested_development_season_4_timeline_warning/ (The complexity of mapping the character specific timeline back to chronological order for Arrested Development S4)

With video understanding, the two formats could become interchangeable. During production, video understanding could analyze dailies in real time, flagging which character perspectives still need coverage for specific story beats. Storyboards could be drafted with multiple narrative paths in mind. Post-production could test dozens of arrangements—chronological, character-focused, thematic—without manual reconstruction. Most importantly, the speed of that reassembly enables audiences to choose their viewing experience, with the system ensuring narrative coherence regardless of the path chosen.
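
To make the mechanics concrete, here is a minimal sketch of what arrangement testing might look like once every scene carries machine-generated annotations: story-time position, character focus, story beat, and dependencies. The data structures and toy scenes below are hypothetical stand-ins for whatever a real video understanding pipeline would produce.

```python
# Hypothetical sketch: re-ordering annotated scenes into alternate cuts.
# The Scene fields stand in for annotations a video understanding model
# might emit (story-time position, characters on screen, story beat,
# and which earlier beats a scene depends on).
from dataclasses import dataclass, field


@dataclass
class Scene:
    clip_id: str
    story_time: float                 # position in the story's chronology (minutes)
    characters: set[str]              # who the scene focuses on
    beat: str                         # story beat this scene advances
    depends_on: set[str] = field(default_factory=set)  # beats that must come first


def chronological_cut(scenes: list[Scene]) -> list[Scene]:
    """Arrange every scene by in-story chronology."""
    return sorted(scenes, key=lambda s: s.story_time)


def character_cut(scenes: list[Scene], character: str) -> list[Scene]:
    """Follow a single character through the same time period."""
    return [s for s in chronological_cut(scenes) if character in s.characters]


def is_coherent(cut: list[Scene]) -> bool:
    """A cut holds together if every beat's dependencies appear before it."""
    seen: set[str] = set()
    for scene in cut:
        if not scene.depends_on <= seen:
            return False
        seen.add(scene.beat)
    return True


# Toy example data, purely for illustration.
scenes = [
    Scene("ep1_01", 5.0, {"Michael", "Gob"}, "family_meeting"),
    Scene("ep1_03", 12.0, {"Michael"}, "scheme_revealed"),
    Scene("ep2_07", 12.5, {"Gob"}, "gob_alibi", depends_on={"scheme_revealed"}),
]

print([s.clip_id for s in chronological_cut(scenes)])   # chronological remix
print(is_coherent(character_cut(scenes, "Gob")))        # Gob-only cut lacks a needed beat
```

The sorting logic is trivial; the expensive part has always been producing reliable scene-level annotations from raw footage, and that is exactly the work video understanding automates.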

The same story truly becomes multiple stories, each valid and complete.


Micro-Pattern Discovery:

Streaming services spend billions figuring out exactly what content to recommend, yet can't seem to crack why, instead of watching something new, millions go back to rewatch The Office for the fiftieth time. They know you completed nine seasons of a "workplace comedy" featuring "Steve Carell" with tags like "mockumentary" and "ensemble cast." What they can't capture: why you actually connected with it.

Instead of relying on titles, transcripts, and metadata tags alone to identify what's worth recommending, the video itself can be the source of truth. The meaning and relationships inside a video carry information that, until now, was difficult to extract. It may not be a certain actor that draws attention, for example, but that actor playing a particular type of character, or getting scenes that show off their comedic skills. These nuances, invisible to metadata, become retrievable with video understanding.

This deeper understanding can create entirely new paths for content recommendation: rather than relying on intrusive data collection to build a person-specific graph, it lets the relationship between content and viewing habits surface the best recommendations.

Marengo's embeddings capture what we call the "fingerprint" of content—not just objects and actions, but the interplay between visual elements, audio cues, and temporal dynamics. This granular understanding means content creators can finally understand WHY audiences connect with specific moments, empowering them to craft more resonant stories rather than chasing algorithmic trends.
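
As a rough illustration of how such a fingerprint could drive recommendations, the sketch below ranks candidate clips by cosine similarity against an average of the moments a viewer keeps rewatching. The random vectors are placeholders; in practice the embeddings would come from a video embedding model such as Marengo.

```python
# Illustrative sketch with made-up data: recommend clips whose embeddings sit
# closest to the embeddings of moments a viewer rewatches most often.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)

# Stand-ins for clip-level embeddings of the viewer's most rewatched moments.
rewatched_moments = rng.normal(size=(25, 512))
viewer_taste = rewatched_moments.mean(axis=0)          # a simple taste "fingerprint"

# Candidate clips from the catalogue, also represented as embeddings.
candidates = {f"clip_{i}": rng.normal(size=512) for i in range(1000)}

ranked = sorted(
    candidates.items(),
    key=lambda item: cosine_similarity(viewer_taste, item[1]),
    reverse=True,
)
print("Top recommendations:", [clip_id for clip_id, _ in ranked[:5]])
```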


Ambient Production:

Love Island films contestants 24/7 with 80 robotic cameras and multiple studio cameras, capturing approximately 168 hours of footage weekly. To transform this into six hour-long episodes per week, the show employs 400 crew members working in shifts around the clock. Thirty editors and twenty producers work across different departments—story teams compile individual moments into scenes, stitch teams assemble scenes into acts, and executive producers determine running orders in real-time. The entire operation runs on a 24-hour turnaround, with Monday's drama airing by Tuesday night.

Nathan Fielder's The Rehearsal pushed ambient capture into new territory by creating controlled environments that became uncontrollably rich with story. The Alligator Lounge replica was so meticulously detailed that it functioned less like a set and more like a story-generation engine. Fielder intended to control specific narratives, but the environments took on lives of their own—extras developed real relationships, background actors created their own dramas, the fake became real. The controlled world became less controlled, generating more authentic (actually who knows with that show…) narratives than planned scenarios.

With video understanding, this abundance becomes manageable. Instead of armies of loggers making binary decisions about what to mark, AI could generate "smart dailies" that understand narrative potential. Not just "argument at 2:47 AM" but "emerging tension between X and Y characters." The system could track relationship dynamics across weeks, identify patterns invisible to exhausted loggers, and surface the slow-burn stories that current production misses.
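
A simple sketch of that aggregation layer: assume each clip has already been analyzed into who is on screen, a tension estimate, and a one-line note (all hypothetical fields), and the "smart dailies" job is to track those signals across days and flag the slow burns.

```python
# Hypothetical sketch: turning per-clip analysis into relationship timelines.
# The clip records stand in for the output of a video understanding model;
# only the aggregation logic is shown.
from collections import defaultdict
from itertools import combinations

clip_analyses = [
    {"day": 1, "people": ["Ana", "Ben"], "tension": 0.2, "note": "friendly banter"},
    {"day": 3, "people": ["Ana", "Ben"], "tension": 0.4, "note": "disagreement over chores"},
    {"day": 5, "people": ["Ana", "Ben"], "tension": 0.7, "note": "argument at 2:47 AM"},
    {"day": 5, "people": ["Ben", "Cleo"], "tension": 0.1, "note": "cooking together"},
]

# Build a tension timeline for every pair that shares screen time.
timelines: dict[tuple[str, str], list[tuple[int, float]]] = defaultdict(list)
for clip in clip_analyses:
    for pair in combinations(sorted(clip["people"]), 2):
        timelines[pair].append((clip["day"], clip["tension"]))

# Flag slow-burn stories: pairs whose tension keeps rising across days.
for pair, points in timelines.items():
    points.sort()
    scores = [tension for _, tension in points]
    if len(scores) >= 3 and all(later > earlier for earlier, later in zip(scores, scores[1:])):
        print(f"Emerging tension between {pair[0]} and {pair[1]}: {scores}")
```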

Whole shows can be designed around this capability, expanding upon the successful format of Love Island for a fraction of the cost with a less intensive production process. Instead of pre-selecting "main characters" and hoping they generate drama, productions could cast wider nets and let stories emerge organically from the environment. The narrative becomes truly discovered rather than manufactured, found in the ambient capture rather than forced through producer intervention.


Semantic Editorial:

The Last Dance documentary required editors to review over 10,000 hours of archival footage. The production hired a team of assistant editors who spent months creating detailed logs with keywords, descriptions, and timecodes. Even with this massive investment, editors still missed relevant moments because human logging can't capture every nuance.

Current text-based editing tools like Adobe Premiere's transcription feature let editors search dialogue, but can't understand visual storytelling. An editor looking for "tension between Jordan and teammates" must rely on someone having logged those subjective moments or combing through the footage themselves.

Video understanding transforms this from keyword search to semantic query. An editor could ask: "Find moments where Jordan is physically isolated from teammates," or "Show me reaction shots where players avoid eye contact after Jordan speaks." The system understands not just objects and actions but relationships and emotions.
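
In workflow terms, that query becomes a single call against an indexed archive rather than weeks of manual logging. The endpoint, payload fields, and response shape below are illustrative placeholders, not any specific product's interface.

```python
# Illustrative sketch: a semantic query over an indexed footage archive.
# The URL, payload fields, and response keys are hypothetical placeholders.
import requests

API_URL = "https://api.example.com/v1/search"      # placeholder endpoint
API_KEY = "YOUR_API_KEY"                           # placeholder credential

payload = {
    "index_id": "last-dance-archive",              # hypothetical index of the archival footage
    "query": "moments where Jordan is physically isolated from teammates",
    "search_options": ["visual", "audio"],         # search beyond the transcript
    "page_limit": 10,
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

for moment in response.json().get("data", []):
    print(f'{moment["video_id"]}  {moment["start"]:.1f}s to {moment["end"]:.1f}s  '
          f'score={moment["score"]:.2f}')
```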

For The Last Dance, this could have revealed subtle dynamics no logger would catch—patterns of body language across seasons, evolving team dynamics visible in spacing and positioning, unspoken tensions that only become apparent when viewed systematically. The value of archives for narrative archeology explodes.


Infinite Versioning:

In a global, digital world there is rarely just one true version of a show or movie. Some content gets internationalized for different markets, reflecting local tastes in pacing, cultural context, or product presence. Other content gets adapted for safety: theatre cuts, airline edits, and kid-friendly versions of the merc with the mouth. Each version has required manual re-editing, costing hundreds of thousands of dollars per market.

Video understanding enables dynamic versioning at scale. Instead of creating fixed alternative edits, productions could define parameters—pacing preferences, content thresholds, cultural emphasis—and generate versions in real-time. A Japanese version of Squid Game could emphasize different character moments than the American version, based on actual viewing patterns rather than assumptions.
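
One way to picture "defining parameters" is a declarative version spec that a post pipeline applies to scene-level annotations. Everything in the sketch below (the spec fields, the scene tags, the trimming rule) is hypothetical, but it shows how a single annotated master could yield many compliant cuts.

```python
# Hypothetical sketch: deriving a market- or context-specific cut from
# scene-level annotations and a declarative version spec.
AIRLINE_SPEC = {
    "max_intensity": 0.5,              # content threshold (0 to 1) for intense material
    "target_runtime_min": 95,          # pacing constraint in minutes
    "emphasize": {"character_drama"},  # themes protected from trimming
}

scenes = [
    {"id": "s01", "minutes": 6, "intensity": 0.2, "themes": {"character_drama"}, "priority": 0.9},
    {"id": "s02", "minutes": 4, "intensity": 0.8, "themes": {"action"}, "priority": 0.6},
    {"id": "s03", "minutes": 5, "intensity": 0.3, "themes": {"comedy"}, "priority": 0.4},
]


def build_version(scenes: list[dict], spec: dict) -> list[dict]:
    # 1. Drop scenes that exceed the content threshold.
    keep = [s for s in scenes if s["intensity"] <= spec["max_intensity"]]

    # 2. Trim lowest-priority, non-emphasized scenes until the runtime fits.
    keep.sort(key=lambda s: (bool(s["themes"] & spec["emphasize"]), s["priority"]))
    while keep and sum(s["minutes"] for s in keep) > spec["target_runtime_min"]:
        keep.pop(0)

    # 3. Restore the original running order.
    return sorted(keep, key=lambda s: s["id"])


print([s["id"] for s in build_version(scenes, AIRLINE_SPEC)])   # ['s01', 's03']
```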

More radically, versions could adapt to individual viewers. Not through invasive data collection, but by understanding how viewing patterns correlate with content fingerprints. The same source material becomes thousands of potential experiences, each maintaining narrative integrity while optimizing for resonance.


The Bottom Line

As machines get better at understanding video, human creativity becomes more important, not less. When you can search thousands of hours for "moments of hope emerging from despair," the differentiator isn't finding footage—it's knowing what’s worth looking for.

Template-based AI tools force content into predetermined boxes. Video understanding does the opposite—it reveals what's already there, waiting to be discovered. It handles the overwhelming mechanical burden of comprehension, then hands clean, organized possibilities to human creators for final alchemy.

We're still in the early days. Current models struggle with abstract humor, cultural nuance, and conceptual metaphor. They understand that something is happening but not always why it matters. Yet even these imperfect capabilities cross a threshold that enables genuinely new creative forms.

History shows artists don't wait for perfect tools—they exploit imperfect ones in unexpected ways. The question isn't whether video understanding has arrived complete. It's whether current capabilities, with all their limitations, enable stories we couldn't tell before.

The answer is yes.

At TwelveLabs, we're not building tools that tell stories. We're building infrastructure that empowers storytellers to discover stories that were always there, waiting for the right technology to make them visible.

The framework is changing. What picture will you create within it?

Ready to explore what's possible when video understanding meets your content? Learn more about our video understanding APIs or see our models in action.
