How TwelveLabs Users Use Pegasus


Teo Kim, Aditya Sankhe
2025. 10. 26.
15 Minutes
Introduction
In February, we launched Pegasus. Unlike many academically oriented systems, Pegasus is designed to address the practical challenges of real-world video understanding and analysis, from fine-grained temporal reasoning to handling content that spans from seconds to hours.
Since launch, it has been adopted by a broad spectrum of users: from enterprises managing massive video datasets to individuals pursuing personal or creative projects. Their diverse use cases have gone beyond what we initially envisioned, expanding how we think about video intelligence in everyday workflows.
To explore this evolution, we analyzed user prompts, uncovering how Pegasus is reshaping human-AI collaboration in video understanding.
This report reveals:
An analysis of over 9,000 unique prompts reveals eleven task archetypes across four workflow intents that define how users work with Pegasus on video tasks.
These patterns show how users build complex, timeline-aware, editor-ready workflows—and point toward the system capabilities needed to support them.
How diverse are the tasks that users request from Pegasus?
Users leverage Pegasus across diverse domains, from films and commercials to sports, education, and safety checks. This wide range of applications demonstrates that Pegasus isn't confined to a single industry or workflow, but rather adapts to vastly different user needs.
To better understand this range, we conducted a mixed-methods analysis to identify patterns in how people interact with Pegasus.
Decoding Prompts: We used an LLM-based approach to distill the core intent and structure of complex user prompts.
Finding Similarities: Each prompt was transformed into a semantic embedding, creating a “fingerprint” that reveals how prompts relate in meaning.
Grouping Prompts: We applied clustering techniques to surface meaningful categories and iterated until the results were both clear and interpretable.
Human Review: Finally, our team manually refined these clusters to develop a taxonomy that reflects how Pegasus is actually used in the real world.
We analyzed Pegasus prompt logs from June 2025. Because user activity can be highly uneven and may skew aggregate counts, we worked at the prompt level rather than by request volume. After exact-match deduplication and near-duplicate removal at a 0.90 semantic-similarity threshold, we arrived at the set of over 9,000 unique prompts used for this analysis.
This dataset formed the basis for identifying recurring task patterns and user intents.

Data preparation and analysis pipeline for Pegasus prompts, from raw logs to clustered task categories and intents.
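For readers who want a concrete feel for this pipeline, here is a minimal sketch of the deduplication and clustering steps. This is an illustration rather than our production code: the embedding model, the clustering algorithm, and the cluster count below are stand-in assumptions.

```python
# Minimal sketch of the prompt-analysis pipeline described above.
# Not our production code: the embedding model, clustering algorithm,
# and cluster count here are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def dedup_and_cluster(prompts, sim_threshold=0.90, n_clusters=11):
    # 1. Exact-match deduplication (order-preserving).
    unique = list(dict.fromkeys(prompts))

    # 2. Embed each prompt into a semantic "fingerprint".
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    emb = model.encode(unique, normalize_embeddings=True)

    # 3. Near-duplicate removal: keep a prompt only if its cosine
    #    similarity to every already-kept prompt is below the threshold.
    kept = []
    for i in range(len(unique)):
        if all(float(np.dot(emb[i], emb[j])) < sim_threshold for j in kept):
            kept.append(i)

    # 4. Cluster the survivors; clusters are then refined by hand.
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(emb[kept])
    return [unique[i] for i in kept], labels
```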
11 Key Ways Users Leverage Pegasus
From this analysis, we identified 11 task categories that describe how users engage with Pegasus across the video workflow. These tasks cluster into four broad intents, showing how people use Pegasus to understand and organize content, ensure accuracy, transform it creatively, and measure its impact.

Four intent-based quadrants showing how Pegasus tasks are distributed across the video workflow. Users make sense of video through summarization, narrative building, and segmentation; keep it safe and accurate with content checks, transcription, and technical review; transform it with creativity through stylistic rewriting; and measure impact and performance through analytical tasks such as marketing and interpretive analysis.
1 - Video Summaries
One common use case is generating video summaries to grasp content quickly without watching the entire video. This saves users time while still conveying the key points, and the output can double as a written record of what happened. Even within this single category, we see a variety of ways people make use of summarization:
Many users simply ask for high-level summaries that capture the key scenes and main storyline, helping them quickly follow the flow of the video.
Others go beyond surface-level recaps, asking Pegasus to infer deeper meanings such as character intentions, recurring motifs, or ethical messages.
Analyze the content of the video. It’s about one of my characters.
Write me a 200-word summary with timestamps and reflections on morals.
What is the repeating man implying?
Some focus on presentation and structure, requesting outputs formatted into categories like Activity, Location, or Emotional Tone.
Summarize the video focusing on Activity, Location, Event Type, Main Content, and Emotional Tone.
Overall, Pegasus’s summarization feature is not just a time-saver but a new interface for video understanding. Users gain immediate insights and reinterpret the content in their own way through the diverse summaries Pegasus generates.
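To make this concrete, a structured-summary request might look like the sketch below. The endpoint path and response field are assumptions based on the public API documentation at the time of writing; check the current API reference before relying on them.

```python
# Hedged sketch of requesting a structured summary from Pegasus.
# The endpoint path, version, and field names are assumptions --
# verify them against the current TwelveLabs API reference.
import os
import requests

API_KEY = os.environ["TWELVE_LABS_API_KEY"]  # your API key
VIDEO_ID = "your-video-id"                   # an already-indexed video

prompt = (
    "Summarize the video focusing on Activity, Location, Event Type, "
    "Main Content, and Emotional Tone. Return one short section per category."
)

resp = requests.post(
    "https://api.twelvelabs.io/v1.3/analyze",  # assumed endpoint
    headers={"x-api-key": API_KEY},
    json={"video_id": VIDEO_ID, "prompt": prompt},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["data"])  # generated text (field name assumed)
```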
2 - Narrative Construction
Beyond summaries, users turn to Pegasus to construct and extend narratives from videos. These prompts focus on telling the story in a way that feels continuous and immersive, almost like reading a screenplay or a novelization of the footage.
Some requests ask Pegasus to take a portion of a video and expand it into a fuller narrative, adding pacing and descriptive richness so the events feel more like a story than a plain recap.
…
Please analyze the current part of the film and seamlessly extend the description. Ensure the result is a cohesive and continuous narrative. Focus on describing the story and events naturally.
3 - Segmentation and Highlights
In more structured workflows, Pegasus is used to break videos into meaningful parts and extract highlight moments. Instead of treating a video as one long, uninterrupted stream, users want to structure it into chapters, find the most engaging clips, or analyze specific types of shots. This makes it easier to navigate long videos, create highlight reels, or prepare content for sharing and editing.

Illustrative mock-up of the Segmentation and Highlights use case, showing how users break a video into chapters and summaries. This figure is for demonstration purposes and not an actual Pegasus interface.
Some prompts focus on chapterization, where long videos are split into self-contained sections that can stand on their own, whether as short clips for social media or as a recap table of key stages.
Breakdown this video into chapters. I want each chapter to have a title, timestamp range, and a summary that is broken down into bulleted points that are between 100-150 characters long.
Others ask for highlight reel extraction, pinpointing the most visually striking or emotionally engaging moments. These are often used for thumbnails, trailers, or marketing materials where a handful of seconds capture the entire video’s appeal.
You are a professional video analyst specializing in thumbnail optimization. Analyze this video to extract 3–4 thumbnail-worthy moments with timestamps and visual details.
Lastly, some prompts focus on shot or scene type identification, where users want a detailed log of cuts, transitions, zoom states, or on-screen elements. This level of precision is especially useful for professional editing or video quality checks.
Analyze this video with extreme precision. Track every cut, zoom, and transition. Output timestamps with frame type, zoom state, and visual elements.
Through segmentation and highlight extraction, Pegasus transforms raw footage into structured building blocks. Users no longer have to scrub through hours of content. Instead, they can jump directly to the moments that matter most.
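Because these outputs are anchored to timestamps, they are easy to post-process. As a hedged example, if a prompt asks Pegasus to emit one chapter per line as `HH:MM:SS - HH:MM:SS | Title | Summary` (a format we are assuming here for illustration, not a fixed output schema), a parser is only a few lines:

```python
# Illustrative parser for chaptered output, assuming the prompt asked
# for lines formatted as "HH:MM:SS - HH:MM:SS | Title | Summary".
import re

CHAPTER_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2})\s*-\s*(\d{2}:\d{2}:\d{2})\s*\|\s*(.+?)\s*\|\s*(.+)"
)

def to_seconds(ts: str) -> int:
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

def parse_chapters(text: str) -> list[dict]:
    chapters = []
    for line in text.splitlines():
        m = CHAPTER_RE.match(line.strip())
        if m:
            start, end, title, summary = m.groups()
            chapters.append({
                "start": to_seconds(start),
                "end": to_seconds(end),
                "title": title,
                "summary": summary,
            })
    return chapters
```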
4 - Content Safety Checks
In safety-critical workflows, Pegasus is used to detect sensitive or rule-violating content in videos. Instead of relying on manual review, users ask Pegasus to automatically identify moments that could pose safety, compliance, or brand risks. By flagging these issues early, Pegasus helps reduce liability, enforce policy, and protect audiences.
Some prompts request detection of dangerous or illicit content, such as the appearance of weapons, acts of violence, or unsafe behavior. These outputs often include timestamps and contextual descriptions for easier review.
Detect every instance where a firearm or knife appears on screen.
Provide timestamps and describe the surrounding scene.
Others focus on policy violations, especially in workplace, transportation, or public safety contexts. A frequent example is catching when people ignore protective measures such as wearing helmets or seatbelts.
Rider or pillion is not wearing a helmet while the two-wheeler is in motion. Give me all the timestamps where this violation occurs.
Finally, Pegasus is also used for nudity or explicit scene filtering. Depending on the task, this could mean either excluding those moments to produce a “safe” version of the video, or explicitly documenting them for regulatory or editorial purposes.
Identify all segments with nudity. Provide timestamps and note whether the scene is partial or full exposure.
Through content safety checks, Pegasus serves as a safeguard and compliance layer that ensures responsible video use across industries.
5 - Transcription and On-screen Text Extraction
Pegasus is widely used to turn everything in a video into written text, whether it’s spoken dialogue or words that appear on screen. By capturing this information precisely, users can search, reuse, or analyze video content as easily as a document.
Many prompts ask for full, verbatim transcripts, including timestamps, speaker labels, and even filler words or disfluencies. This level of detail is especially valuable for legal, research, or accessibility contexts where accuracy matters.
Please create an exact, verbatim transcription of this video.
Include clear timestamps every 30 seconds, mark speaker changes, and preserve all filler words.
Others focus on extracting on-screen text and graphics, such as signs, prices, URLs, social media handles, or disclaimers. This is often used to catalog promotional content, detect compliance details, or build structured datasets from visual materials.
Extract all on-screen text appearing in this ad.
Look carefully for:
Website URLs
Phone numbers
Prices and promo codes
Store addresses or locations
Hashtags and social media handles
List EVERYTHING exactly as shown, with the timestamp of appearance.
By converting both spoken and visual text into a searchable, structured asset, Pegasus enables a wide range of downstream applications: from creating subtitles and searchable archives, to powering compliance checks and automated content analysis.
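Once that text is extracted, bucketing it into fields is ordinary string processing. The sketch below is illustrative only; its patterns are simple heuristics, not exhaustive validators.

```python
# Simple post-processing of extracted on-screen text: bucket each line
# by field type. Patterns are illustrative heuristics, not validators.
import re

PATTERNS = {
    "url":    re.compile(r"https?://\S+|www\.\S+", re.I),
    "phone":  re.compile(r"\+?\d[\d\s\-().]{7,}\d"),
    "price":  re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?"),
    "handle": re.compile(r"[@#]\w{2,}"),
}

def bucket(lines: list[str]) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {k: [] for k in PATTERNS}
    out["other"] = []
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                out[name].append(line)
                break
        else:
            out["other"].append(line)
    return out
```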
6 - Creative or Stylistic Video Descriptions
In creative workflows, Pegasus is used to recast video content into text with a specific tone, style, or format. The goal here isn’t to alter how the video looks but to rewrite its substance in a way that fits the target audience or platform. This can mean making the same facts sound upbeat and promotional for a marketing campaign, factual and restrained for an incident report, or short and catchy for a social post.

Illustrative mock-up of the Creative or Stylistic Video Descriptions use case, showing how Pegasus rewrites a video into text with a specific tone or style. In this example, the same footage is transformed into an investigative, dramatic script to match the user’s prompt. This figure is for demonstration purposes and not an actual Pegasus interface.
Some prompts focus on tone control, asking Pegasus to deliver the same information but in a particular voice, such as formal investigative language or casual, energetic YouTube narration.
Summarize in 100 characters in an investigative report tone.
Sound like a YouTuber, include timestamps.
Others take a humorous or witty spin, where the goal is to entertain while conveying the essence of the video.
Summarize this like a stand-up joke in 30 words.
Roast this video like a comedian who’s had a bad day.
Many users request brief hooks, where the video needs to be distilled into a tweet-length line or an Instagram caption that grabs attention in seconds.
One to two lines for an Instagram card.
Generate a tweet-sized summary (under 280 characters).
Through text-first style conversion, Pegasus acts less like a summarizer and more like a creative writing assistant for video.
7 - Marketing and Ad Analysis
In marketing and advertising workflows, Pegasus is used to evaluate videos through a persuasive lens. Rather than just understanding what happens in a video, users here want to know how well it works as persuasive content. Pegasus helps marketers, creators, and strategists dissect videos to improve clarity, engagement, and conversion potential.
Some prompts focus on critique and improvement suggestions, asking Pegasus to identify weak spots in editing, scripting, or creative direction, often framed from a professional role perspective.
Please suggest, from an influencer marketing executive’s point of view, the following for this video: areas of improvement, suggestions on video editing, and scripting improvements.
Others request engagement and effectiveness analysis, looking closely at the hook, clarity of the message, and emotional impact. These insights help explain why a video works — or doesn’t — and how its techniques can be reused in future campaigns.
Assess clarity of the opening three seconds.
Suggest alternative call-to-action lines.
With marketing and ad analyses, Pegasus helps marketers understand how videos communicate messages and engage viewers, making it easier to improve content effectively.
8 - Fact Q&A, Entity and Event Identification
A frequent pattern in Pegasus usage is fact-oriented questioning, where the system is asked to confirm, deny, or extract specific details from a video. Instead of asking for summaries or narratives, these prompts are all about clarity: Is something there or not? Who is present? What exactly happened?
Many prompts ask for simple yes/no verification, cutting through uncertainty to establish the presence or absence of an object, person, or action. This reduces the time users spend double-checking long recordings.
Is there any moment when the wipers are turned on?
Others focus on entity identification, where users want to list who appears in the video, what brands or objects are visible, or which events occur. This provides structured, factual labeling that can be used downstream for indexing, search, or training datasets.
List every person who appears in the video, in order of first appearance.
Identify all branded products visible throughout the footage.
Some queries involve event spotting, asking Pegasus to pinpoint specific occurrences like a safety violation, a key gesture, or a moment of interaction, often with timestamps for precise reference.
At what times does the referee raise the red card?
By handling factual Q&A and entity or event identification, Pegasus serves as a reliable layer for factual verification in video. It helps transform uncertainty into clarity, improving precision for annotation, compliance, and large-scale analysis.
9 - Interpretive Q&A, Causes and Intent
Some of the prompts go beyond surface-level description and ask Pegasus to explain why something is happening in a video. Here, the user isn’t satisfied with knowing what is visible — they want reasoning, insights about cause and effect, or judgments about intent and quality. This makes Pegasus not just a summarizer or storyteller but an interpreter of events.
Many requests focus on cause-and-effect analysis, such as determining responsibility in a traffic incident or breaking down how an event unfolded from moment to moment.
What happened in this dashcam video incident and who is at fault?
Some prompts involve comparisons and evaluations, such as measuring athletic performance, judging interactions, or determining whether something meets certain standards. These questions require the model to connect subtle visual cues with reasoning.
Did the boxer in black have better head movement than the opponent?
In all of these cases, Pegasus is asked to infer intentions, causes, and outcomes rather than simply restate observations. This interpretive capability supports informed decision-making across contexts, from dispute resolution to performance analysis.
10 - Sports Analysis
In the sports domain, people use Pegasus to break down games into structured, analyzable moments. Rather than watching an entire match from start to finish, users can pull out the events, plays, and performances that matter most. This makes it invaluable for highlight creation, coaching reviews, scouting, and even fan engagement.
Some prompts focus on highlight identification, asking Pegasus to log every scoring event, foul, or pivotal moment with timestamps. This helps fans and analysts relive or review the game’s key sequences without scrubbing through hours of footage.
List every LA Lakers scoring play with timestamp and a brief description of the offensive sequence.
Others zoom in on player-specific analysis, whether that’s isolating clips of a single athlete, comparing performance across players, or tracking contributions over time. This creates a timeline of individual impact that is especially useful for scouts, commentators, and coaches.
Show me every clip where player #23 touches the ball and describe what happens.
Some prompts even request evaluative insights, such as assessing defensive rotations, judging shot selection, or identifying patterns in team strategy. By structuring these observations, Pegasus supports not only entertainment but also deeper tactical understanding.
In sports analysis, Pegasus functions as an on-demand assistant that captures structure, performance, and context from raw video. This enables players, professionals, and fans to engage with the game in smarter and more dynamic ways.
11 - Technical Review and Correction
Not all video tasks are about storytelling or highlights. Some focus on technical precision and quality control. In this category, users lean on Pegasus to ensure that videos meet rubrics, adhere to design standards, and pass objective checks. The goal is to catch errors early, measure learning or communication impact, and reduce costly rework.
Many prompts involve alignment with evaluation criteria, such as checking an instructional video against a rubric and adjusting the scoring guide so that it fairly reflects what is shown.
Evaluate this tutorial against the given rubric. Update the rubric where necessary so the scoring aligns with the video.
Others highlight visual accuracy or layout corrections, flagging issues like distorted scales on maps, overlapping text, or poorly formatted graphics. Pegasus is asked not only to spot the problems but also to propose fixes.
Assess whether shape proportions distort the real map. Mark overlapping text segments and propose corrections.
A third group of requests deals with quantitative analysis, where users want objective measures such as the frequency of certain terms, counts of visual occurrences, or vocabulary analysis in lectures. This turns video into structured data for evaluation.
Count how many times the instructor says “photosynthesis” and provide a frequency chart.
Through technical review and correction, Pegasus helps ensure that videos meet quality and design standards, reducing errors and maintaining consistency across projects.
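The counting half of this workflow is straightforward once a transcript exists. The sketch below assumes timestamped transcript segments of the form `{"start": seconds, "text": ...}`, which is an assumption about your transcript format rather than a fixed Pegasus output schema.

```python
# Count occurrences of a term in a timestamped transcript and report
# per-minute frequency. Segment format is an assumption:
# [{"start": seconds, "text": "..."}].
from collections import Counter

def term_frequency(segments: list[dict], term: str) -> Counter:
    per_minute: Counter = Counter()
    term = term.lower()
    for seg in segments:
        hits = seg["text"].lower().count(term)
        if hits:
            per_minute[int(seg["start"] // 60)] += hits
    return per_minute  # {minute index: count}, plot as a frequency chart
```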
Insights from Task Diversity: From Single Answers to a Working Partner
The 11 task categories highlight not only the breadth but also the depth of how people interact with Pegasus. What appear to be simple video tasks often unfold into layered, hybrid workflows that merge multiple intentions within a single prompt.
Across the dataset, users combine actions such as summarizing, extracting, counting, comparing, and rephrasing, sometimes within a single query. These compound prompts move between narrative language and structured output, showing that people are not merely asking for answers but co-creating processes of understanding, curation, and transformation.
Three characteristics we observe in video contexts
1 - Instructions are tied to the timeline

Formatting and evidence requirements are anchored to frame ranges. For example, “summarize every 5 minutes” implicitly requires chapter boundary detection and merge or split rules. Temporal constraints become central and are hard to express in plain text alone.
Design implication: Turn natural language into executable plans at the shot, scene, and sequence levels. Manage boundary detection, merge or split rules, and state tracking.
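As a concrete example of the "summarize every 5 minutes" case, the sketch below plans windows that snap to detected shot boundaries, with a simple merge rule for trailing slivers. The boundary list is assumed to come from a separate shot-detection step; the snapping and merge rules are illustrative.

```python
# Sketch: turn "summarize every 5 minutes" into window boundaries that
# snap to detected shot boundaries. The boundary list (in seconds) is
# assumed to come from a separate shot-detection step.
def plan_windows(duration: float, boundaries: list[float],
                 target: float = 300.0, min_len: float = 60.0) -> list[tuple[float, float]]:
    windows, start = [], 0.0
    while start < duration:
        ideal = min(start + target, duration)
        # Snap the cut to the shot boundary nearest the ideal cut point.
        candidates = [b for b in boundaries if start + min_len <= b <= ideal + min_len]
        end = min(candidates, key=lambda b: abs(b - ideal)) if candidates else ideal
        end = min(end, duration)
        windows.append((start, end))
        start = end
    # Merge rule: fold a trailing sliver into the previous window.
    if len(windows) > 1 and windows[-1][1] - windows[-1][0] < min_len:
        last = windows.pop()
        windows[-1] = (windows[-1][0], last[1])
    return windows
```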
2 - Hybrid combinations expect editor-ready structure

Prompts bundle summarization, chapterization, normalized timecodes, key event extraction, and highlight proposals in one flow. Outputs therefore need to be immediately ingestible by editing or content systems, for example EDLs, XML shot lists, CSV tables, or JSON playlists.
Design implication: Produce editor-ready outputs with normalized timecodes and standard fields such as start, end, label, evidence, and fps.
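A minimal version of such a record, using the fields named above, might look like the sketch below. The CSV target and constant-frame-rate timecode math are assumptions about the downstream pipeline; drop-frame timecode is out of scope here.

```python
# Normalize model output into editor-ready rows with the fields named
# above. Timecode math assumes a constant integer frame rate;
# drop-frame handling is out of scope for this sketch.
import csv
from dataclasses import dataclass, asdict

@dataclass
class ClipEvent:
    start: float      # seconds
    end: float        # seconds
    label: str        # e.g. "highlight", "chapter"
    evidence: str     # supporting description or quote
    fps: float = 25.0

    def timecode(self, seconds: float) -> str:
        fps = int(self.fps)
        frames = int(round(seconds * fps))
        f = frames % fps
        s = (frames // fps) % 60
        m = (frames // fps // 60) % 60
        h = frames // fps // 3600
        return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

def write_csv(events: list[ClipEvent], path: str) -> None:
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["start", "end", "label", "evidence", "fps"])
        writer.writeheader()
        for ev in events:
            row = asdict(ev)
            row["start"], row["end"] = ev.timecode(ev.start), ev.timecode(ev.end)
            writer.writerow(row)
```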
3 - Agentic reasoning is increasingly needed at the clip level

As user requests grow more compound and multi-step, they often demand precise frame spans, OCR snippets, speaker segments, and detection logs — requirements that can’t be satisfied in a single pass. Addressing these cases calls for iterative, agentic behavior, where the model plans, verifies, and refines across multiple operations anchored to time and evidence.
Design implication: Build self-verifying loops into the workflow. The system should autonomously plan and validate its outputs, attach timestamps, representative frames, OCR/ASR text, and confidence scores, and reprocess uncertain spans as needed.
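Structurally, such a loop can be as simple as plan, execute, verify, and retry. The skeleton below is deliberately abstract: `run_step` and `verify` are placeholders for whatever model calls and checks a given pipeline uses, and the confidence threshold is an arbitrary example.

```python
# Skeleton of a self-verifying loop: plan, execute, verify, and
# reprocess uncertain spans. `run_step` and `verify` are placeholders
# for a pipeline's own model calls and checks.
def agentic_pass(plan: list[dict], run_step, verify, max_retries: int = 2) -> list[dict]:
    results = []
    for step in plan:
        out = run_step(step)
        for _ in range(max_retries):
            check = verify(out)  # e.g. {"confidence": float, "reason": str}
            if check["confidence"] >= 0.7:  # arbitrary example threshold
                break
            # Reprocess the uncertain span with the verifier's feedback.
            out = run_step({**step, "hint": check["reason"]})
        results.append(out)  # each result should carry timestamps + evidence
    return results
```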
How TwelveLabs empowers complex video workflows
Our product suite is designed to support these emerging patterns. Marengo provides high-fidelity multimodal embeddings across video, audio, image, and text, enabling flexible retrieval, entity filtering, and fine-grained content search. Pegasus specializes in video-to-text tasks, delivering summaries, detailed descriptions, timestamped event extraction, and explanations grounded in multimodal evidence.
Together, these models form the foundation for the agentic video tasks users expect. When combined with supporting modules such as shot and scene detection, event segmentation, tracking and linking, temporal alignment, confidence calibration, evidence management, and iterative feedback, users can build goal-oriented, reliable, and scalable workflows that go far beyond a single answer.
To expose these capabilities to AI agents and LLM-based assistants, we also provide an MCP (Model Context Protocol) server. MCP is an open protocol that allows tools to connect to external data sources and models through a standardized interface. In our implementation, the MCP server acts as the bridge between your agents and TwelveLabs models, so assistants can search videos, generate summaries, extract events, or check safety policies without manual integration.
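As an illustration of the kind of chained call an agent might make through this bridge, here is a hedged two-step sketch: search with Marengo, then ask Pegasus about the top hit. Endpoint paths, request shapes, and response fields are all assumptions; consult the current API reference for the real contract.

```python
# Hedged sketch of the search-then-analyze chain an agent might run.
# Endpoint paths and request/response field names are assumptions;
# check the current TwelveLabs API reference before relying on them.
import os
import requests

BASE = "https://api.twelvelabs.io/v1.3"  # assumed base URL
HEADERS = {"x-api-key": os.environ["TWELVE_LABS_API_KEY"]}

# 1. Marengo: find candidate moments for a text query.
search = requests.post(
    f"{BASE}/search",
    headers=HEADERS,
    json={"index_id": "your-index-id",
          "query_text": "referee raises red card",
          "search_options": ["visual"]},
    timeout=60,
).json()
top = search["data"][0]  # response shape assumed

# 2. Pegasus: explain the top hit with timestamps and evidence.
analysis = requests.post(
    f"{BASE}/analyze",
    headers=HEADERS,
    json={"video_id": top["video_id"],
          "prompt": "Describe what happens around this moment, with timestamps."},
    timeout=120,
)
print(analysis.json())
```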
Conclusion and Key Takeaways
Across 11 task categories and four intents, Pegasus is not used as a single feature but as part of a broader workflow. Users blend summarization, segmentation, evidence gathering, and evaluation to reach a goal, which means video AI is judged by how well it collaborates across steps, not by any one step in isolation.
TwelveLabs is building for that reality. Marengo supplies retrieval and filtering, Pegasus provides reasoning and explanation tied to timestamps and evidence, and our MCP server exposes these capabilities to agents and tools so they can plan over timelines, produce editor-ready outputs, and verify results.
Introduction
In February, we launched Pegasus. Unlike many academically oriented systems, Pegasus is designed to address the practical challenges of real-world video understanding and analysis, from fine-grained temporal reasoning to handling content that spans from seconds to hours.
Since launch, it has been adopted by a broad spectrum of users: from enterprises managing massive video datasets to individuals pursuing personal or creative projects. Their diverse use cases have gone beyond what we initially envisioned, expanding how we think about video intelligence in everyday workflows.
To explore this evolution, we analyzed user prompts, uncovering how Pegasus is reshaping human-AI collaboration in video understanding.
This report reveals:
An analysis of over 9,000 unique prompts reveals eleven task archetypes across four workflow intents that define how users work with Pegasus on video tasks.
These patterns show how users build complex, timeline-aware, editor-ready workflows—and point toward the system capabilities needed to support them.
How diverse are the tasks that users request from Pegasus?
Users leverage Pegasus across diverse domains, from films and commercials to sports, education, and safety checks. This wide range of applications demonstrates that Pegasus isn't confined to a single industry or workflow, but rather adapts to vastly different user needs.
To better understand this range, we conducted a mixed-methods analysis to identify patterns in how people interact with Pegasus.
Decoding Prompts: We used an LLM-based approach to distill the core intent and structure of complex user prompts.
Finding Similarities: Each prompt was transformed into a semantic embedding, creating a “fingerprint” that reveals how prompts relate in meaning.
Grouping Prompts: We applied clustering techniques to surface meaningful categories and iterated until the results were both clear and interpretable.
Human Review: Finally, our team manually refined these clusters to develop a taxonomy that reflects how Pegasus is actually used in the real world.
We analyzed Pegasus prompt logs from June 2025. Because user activity can be highly uneven and may skew aggregate counts, we analyzed data at the prompt level rather than by request volume. After exact-match deduplication, and after applying a 0.90 semantic similarity threshold, we arrived at unique prompts used for analysis.
This dataset formed the basis for identifying recurring task patterns and user intents.

Data preparation and analysis pipeline for Pegasus prompts, from raw logs to clustered task categories and intents.
11 Key Ways Users Leverage Pegasus
From this analysis, we identified 11 task categories that describe how users engage with Pegasus across the video workflow. These tasks cluster into four broad intents, showing how people use Pegasus to understand and organize content, ensure accuracy, transform it creatively, and measure its impact.

Four intent-based quadrants showing how Pegasus tasks are distributed across the video workflow. Users make sense of video through summarization, narrative building, and segmentation; keep it safe and accurate with content checks, transcription, and technical review; transform it with creativity through stylistic rewriting; and measure impact and performance through analytical tasks such as marketing, and interpretive analysis.
1 - Video Summaries
One common use case is video summaries to quickly grasp content without watching the entire video. This helps users save time while still understanding the key points or creating written records of what happened in the video. And even within this single category, we see a variety of ways people make use of summarization:
Many users simply ask for high-level summaries that capture the key scenes and main storyline, helping them quickly follow the flow of the video.
Others go beyond surface-level recaps, asking Pegasus to infer deeper meanings such as character intentions, recurring motifs, or ethical messages.
Analyze the content of the video. It’s about one of my characters.
Write me a 200-word summary with timestamps and reflections on morals.
What is the repeating man implying?
Some focus on presentation and structure, requesting outputs formatted into categories like Activity, Location, or Emotional Tone.
Summarize the video focusing on Activity, Location, Event Type, Main Content, and Emotional Tone.
Overall, Pegasus’s summarization feature is not just a tool for saving time but a new interface for video understanding. They gain immediate insights and reinterpret the content in their own way through the diverse summaries Pegasus generates.
2 - Narrative Construction
Beyond summaries, users turn to Pegasus to construct and extend narratives from videos. These prompts focus on telling the story in a way that feels continuous and immersive, almost like reading a screenplay or a novelization of the footage.
Some requests ask Pegasus to take a portion of a video and expand it into a fuller narrative, adding pacing and descriptive richness so the events feel more like a story than a plain recap.
…
Please analyze the current part of the film and seamlessly extend the description. Ensure the result is a cohesive and continuous narrative. Focus on describing the story and events naturally.
3 - Segmentation and Highlights
In more structured workflows, Pegasus is used to break videos into meaningful parts and extract highlight moments. Instead of treating a video as one long, uninterrupted stream, users want to structure it into chapters, find the most engaging clips, or analyze specific types of shots. This makes it easier to navigate long videos, create highlight reels, or prepare content for sharing and editing.

Illustrative mock-up of the Segmentation and Highlights use case, showing how users break a video into chapters and summaries. This figure is for demonstration purposes and not an actual Pegasus interface.
Some prompts focus on chapterization, where long videos are split into self-contained sections that can stand on their own like short clips for social media, or a recap table of key stages.
Breakdown this video into chapters. I want each chapter to have a title, timestamp range, and a summary that is broken down into bulleted points that are between 100-150 characters long.
Others ask for highlight reel extraction, pinpointing the most visually striking or emotionally engaging moments. These are often used for thumbnails, trailers, or marketing materials where a handful of seconds capture the entire video’s appeal.
You are a professional video analyst specializing in thumbnail optimization. Analyze this video to extract 3–4 thumbnail-worthy moments with timestamps and visual details.
Lastly, some prompts focus on shot or scene type identification, where users want a detailed log of cuts, transitions, zoom states, or on-screen elements. This level of precision is especially useful for professional editing, or video quality checks.
Analyze this video with extreme precision. Track every cut, zoom, and transition. Output timestamps with frame type, zoom state, and visual elements.
Through segmentation and highlight extraction, Pegasus transforms raw footage into structured building blocks. Users no longer have to scrub through hours of content. Instead, they can jump directly to the moments that matter most.
4 - Content Safety Checks
In safety-critical workflows, Pegasus is used to detect sensitive or rule-violating content in videos. Instead of relying on manual review, users ask Pegasus to automatically identify moments that could pose safety, compliance, or brand risks. By flagging these issues early, Pegasus helps reduce liability, enforce policy, and protect audiences.
Some prompts request detection of dangerous or illicit content, such as the appearance of weapons, acts of violence, or unsafe behavior. These outputs often include timestamps and contextual descriptions for easier review.
Detect every instance where a firearm or knife appears on screen.
Provide timestamps and describe the surrounding scene.
Others focus on policy violations, especially in workplace, transportation, or public safety contexts. A frequent example is catching when people ignore protective measures such as wearing helmets or seatbelts.
Rider or pillion is not wearing a helmet while the two-wheeler is in motion. Give me all the timestamps where this violation occurs.
Finally, Pegasus is also used for nudity or explicit scene filtering. Depending on the task, this could mean either excluding those moments to produce a “safe” version of the video, or explicitly documenting them for regulatory or editorial purposes.
Identify all segments with nudity. Provide timestamps and note whether the scene is partial or full exposure.
Through content safety checks, Pegasus serves as a safeguard and compliance layer that ensures responsible video use across industries.
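Downstream, flagged moments usually need consolidation before human review: adjacent detections are merged and padded so a reviewer sees one continuous segment rather than a burst of near-duplicate hits. Here is a minimal sketch; the input format, a list of (start, end) pairs in seconds, is a hypothetical intermediate shape rather than a fixed Pegasus output schema.

```python
# Merge nearby flagged spans into padded review segments.
# Input format is hypothetical: (start, end) pairs in seconds.

def merge_flags(spans, gap=2.0, pad=1.0):
    """Merge spans closer than `gap` seconds and pad each side by `pad`."""
    merged = []
    for start, end in sorted(spans):
        start, end = max(0.0, start - pad), end + pad
        if merged and start <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous segment
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

# Example: three detections collapse into two reviewable segments.
print(merge_flags([(12.0, 13.5), (14.0, 15.0), (80.0, 82.0)]))
# -> [(11.0, 16.0), (79.0, 83.0)]
```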
5 - Transcription and On-screen Text Extraction
Pegasus is widely used to turn everything in a video into written text, whether it’s spoken dialogue or words that appear on screen. By capturing this information precisely, users can search, reuse, or analyze video content as easily as a document.
Many prompts ask for full, verbatim transcripts, including timestamps, speaker labels, and even filler words or disfluencies. This level of detail is especially valuable for legal, research, or accessibility contexts where accuracy matters.
Please create an exact, verbatim transcription of this video.
Include clear timestamps every 30 seconds, mark speaker changes, and preserve all filler words.
Others focus on extracting on-screen text and graphics, such as signs, prices, URLs, social media handles, or disclaimers. This is often used to catalog promotional content, detect compliance details, or build structured datasets from visual materials.
Extract all on-screen text appearing in this ad.
Look carefully for:
Website URLs
Phone numbers
Prices and promo codes
Store addresses or locations
Hashtags and social media handles
List EVERYTHING exactly as shown, with the timestamp of appearance.
By converting both spoken and visual text into a searchable, structured asset, Pegasus enables a wide range of downstream applications: from creating subtitles and searchable archives, to powering compliance checks and automated content analysis.
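Once a timestamped transcript exists, rendering it into standard subtitle formats is mechanical. The sketch below emits SRT from a list of {start, end, text} segments; that segment schema is an assumed intermediate format you would parse Pegasus's free-form output into first.

```python
# Render SRT subtitles from timestamped transcript segments.
# The segment dicts are a hypothetical intermediate format.

def to_srt_time(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments) -> str:
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)

print(to_srt([
    {"start": 0.0, "end": 2.5, "text": "Welcome back to the channel."},
    {"start": 2.5, "end": 5.0, "text": "Today we look at video AI."},
]))
```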
6 - Creative or Stylistic Video Descriptions
In creative workflows, Pegasus is used to recast video content into text with a specific tone, style, or format. The goal here isn’t to alter how the video looks but to rewrite its substance in a way that fits the target audience or platform. This can mean making the same facts sound upbeat and promotional for a marketing campaign, factual and restrained for an incident report, or short and catchy for a social post.

Illustrative mock-up of the Creative or Stylistic Video Descriptions use case, showing how Pegasus rewrites a video into text with a specific tone or style. In this example, the same footage is transformed into an investigative, dramatic script to match the user’s prompt. This figure is for demonstration purposes and not an actual Pegasus interface.
Some prompts focus on tone control, asking Pegasus to deliver the same information but in a particular voice like formal investigative language, or casual and energetic YouTube narration.
Summarize in 100 characters in an investigative report tone.
Sound like a YouTuber, include timestamps.
Others take a humorous or witty spin, where the goal is to entertain while conveying the essence of the video.
Summarize this like a stand-up joke in 30 words.
Roast this video like a comedian who’s had a bad day.
Many users request brief hooks, where the video needs to be distilled into a tweet-length line or an Instagram caption that grabs attention in seconds.
One to two lines for an Instagram card.
Generate a tweet-sized summary (under 280 characters).
Through text-first style conversion, Pegasus acts less like a summarizer and more like a creative writing assistant for video.
7 - Marketing and Ad Analysis
In marketing and advertising workflows, Pegasus is used to evaluate videos through a persuasive lens. Rather than just understanding what happens in a video, users here want to know how well it works as persuasive content. Pegasus helps marketers, creators, and strategists dissect videos to improve clarity, engagement, and conversion potential.
Some prompts focus on critique and improvement suggestions, asking Pegasus to identify weak spots in editing, scripting, or creative direction, often framed from a professional role perspective.
Please suggest, from an influencer marketing executive’s point of view, the following for this video: areas of improvement, suggestions on video editing, and scripting improvements.
Others request engagement and effectiveness analysis, looking closely at the hook, clarity of the message, and emotional impact. These insights help explain why a video works — or doesn’t — and how its techniques can be reused in future campaigns.
Assess clarity of the opening three seconds.
Suggest alternative call-to-action lines.
With marketing and ad analyses, Pegasus helps marketers understand how videos communicate messages and engage viewers, making it easier to improve content effectively.
8 - Fact Q&A, Entity and Event Identification
A frequent pattern in Pegasus usage is fact-oriented questioning, where the system is asked to confirm, deny, or extract specific details from a video. Instead of asking for summaries or narratives, these prompts are all about clarity: Is something there or not? Who is present? What exactly happened?
Many prompts ask for simple yes/no verification, cutting through uncertainty to establish the presence or absence of an object, person, or action. This reduces the time users spend double-checking long recordings.
Is there any moment when the wipers are turned on?
Others focus on entity identification, where users want to list who appears in the video, what brands or objects are visible, or which events occur. This provides structured, factual labeling that can be used downstream for indexing, search, or training datasets.
List every person who appears in the video, in order of first appearance.
Identify all branded products visible throughout the footage.
Some queries involve event spotting, asking Pegasus to pinpoint specific occurrences like a safety violation, a key gesture, or a moment of interaction, often with timestamps for precise reference.
At what times does the referee raise the red card?
By handling factual Q&A and entity or event identification, Pegasus serves as a reliable layer for factual verification in video. It helps transform uncertainty into clarity, improving precision for annotation, compliance, and large-scale analysis.
9 - Interpretive Q&A, Causes and Intent
Some of the prompts go beyond surface-level description and ask Pegasus to explain why something is happening in a video. Here, the user isn’t satisfied with knowing what is visible — they want reasoning, insights about cause and effect, or judgments about intent and quality. This makes Pegasus not just a summarizer or storyteller but an interpreter of events.
Many requests focus on cause-and-effect analysis, such as determining responsibility in a traffic incident or breaking down how an event unfolded from moment to moment.
What happened in this dashcam video incident and who is at fault?
Some prompts involve comparisons and evaluations, such as measuring athletic performance, judging interactions, or determining whether something meets certain standards. These questions require the model to connect subtle visual cues with reasoning.
Did the boxer in black have better head movement than the opponent?
In all of these cases, Pegasus is asked to infer intentions, causes, and outcomes rather than simply restate observations. This interpretive capability supports informed decision-making across contexts, from dispute resolution to performance analysis.
10 - Sports Analysis
In the sports domain, people use Pegasus for breaking down games into structured, analyzable moments. Rather than watching an entire match from start to finish, users can pull out the events, plays, and performances that matter most. This makes it invaluable for highlight creation, coaching reviews, scouting, and even fan engagement.
Some prompts focus on highlight identification, asking Pegasus to log every scoring event, foul, or pivotal moment with timestamps. This helps fans and analysts relive or review the game’s key sequences without scrubbing through hours of footage.
List every LA Lakers scoring play with timestamp and a brief description of the offensive sequence.
Others zoom in on player-specific analysis, whether that’s isolating clips of a single athlete, comparing performance across players, or tracking contributions over time. This creates a timeline of individual impact that is especially useful for scouts, commentators, and coaches.
Show me every clip where player #23 touches the ball and describe what happens.
Some prompts even request evaluative insights, such as assessing defensive rotations, judging shot selection, or identifying patterns in team strategy. By structuring these observations, Pegasus supports not only entertainment but also deeper tactical understanding.
In sports analysis, Pegasus functions as an on-demand assistant that captures structure, performance, and context from raw video. This enables players, professionals, and fans to engage with the game in smarter and more dynamic ways.
11 - Technical Review and Correction
Not all video tasks are about storytelling or highlights. Some focus on technical precision and quality control. In this category, users lean on Pegasus to ensure that videos meet rubrics, adhere to design standards, and pass objective checks. The goal is to catch errors early, measure learning or communication impact, and reduce costly rework.
Many prompts involve alignment with evaluation criteria, such as checking an instructional video against a rubric and adjusting the scoring guide so that it fairly reflects what is shown.
Evaluate this tutorial against the given rubric. Update the rubric where necessary so the scoring aligns with the video.
Others highlight visual accuracy or layout corrections, flagging issues like distorted scales on maps, overlapping text, or poorly formatted graphics. Pegasus is asked not only to spot the problems but also to propose fixes.
Assess whether shape proportions distort the real map. Mark overlapping text segments and propose corrections.
A third group of requests deals with quantitative analysis, where users want objective measures such as the frequency of certain terms, counts of visual occurrences, or vocabulary analysis in lectures. This turns video into structured data for evaluation.
Count how many times the instructor says “photosynthesis” and provide a frequency chart.
Through technical review and correction, Pegasus helps ensure that videos meet quality and design standards, reducing errors and maintaining consistency across projects.
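The quantitative requests in particular reduce to simple counting once a transcript is available. As an illustration, here is a minimal sketch using Python's collections.Counter over a plain transcript string; in practice the transcript would come from the verbatim transcription workflow in category 5.

```python
# Count occurrences of target terms in a transcript.
import re
from collections import Counter

def term_frequencies(transcript: str, terms: list[str]) -> Counter:
    # Tokenize on lowercase word characters, ignoring punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(words)
    return Counter({t: counts[t.lower()] for t in terms})

transcript = "Photosynthesis converts light. Photosynthesis needs chlorophyll."
print(term_frequencies(transcript, ["photosynthesis", "chlorophyll"]))
# Counter({'photosynthesis': 2, 'chlorophyll': 1})
```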
Insights from Task Diversity: From Single Answers to a Working Partner
The 11 task categories highlight not only the breadth but also the depth of how people interact with Pegasus. What appear to be simple video tasks often unfold into layered, hybrid workflows that merge multiple intentions within a single prompt.
Across the dataset, users combine actions such as summarizing, extracting, counting, comparing, and rephrasing, sometimes within a single query. These compound prompts move between narrative language and structured output, showing that people are not merely asking for answers but co-creating processes of understanding, curation, and transformation.
Three characteristics we observe in video contexts
1 - Instructions are tied to the timeline

Formatting and evidence requirements are anchored to frame ranges. For example, “summarize every 5 minutes” implicitly requires chapter boundary detection and merge or split rules. Temporal constraints become central and are hard to express in plain text alone.
Design implication: Turn natural language into executable plans at the shot, scene, and sequence levels. Manage boundary detection, merge or split rules, and state tracking.
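As a concrete illustration of such an executable plan, the sketch below turns a "summarize every 5 minutes" instruction into windows whose edges snap to the nearest detected scene boundary rather than to arbitrary wall-clock marks. The scene boundaries are assumed to come from an upstream shot/scene detector, and the snapping tolerance is an invented parameter.

```python
# Turn "summarize every N seconds" into windows snapped to scene cuts.
# `boundaries` would come from an upstream scene detector (assumed input).

def plan_windows(duration, every=300.0, boundaries=(), tolerance=20.0):
    edges = [0.0]
    t = every
    while t < duration:
        # Snap each nominal edge to the nearest scene cut within tolerance.
        near = [b for b in boundaries if abs(b - t) <= tolerance]
        edges.append(min(near, key=lambda b: abs(b - t)) if near else t)
        t += every
    edges.append(duration)
    return list(zip(edges, edges[1:]))

cuts = [295.0, 612.0]
print(plan_windows(900.0, boundaries=cuts))
# -> [(0.0, 295.0), (295.0, 612.0), (612.0, 900.0)]
```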
2 - Hybrid combinations expect editor-ready structure

Prompts bundle summarization, chapterization, normalized timecodes, key event extraction, and highlight proposals in one flow. Outputs therefore need to be immediately ingestible by editing or content systems, for example EDL, XML shotlists, CSV tables, or JSON playlists.
Design implication: Produce editor-ready outputs with normalized timecodes and standard fields such as start, end, label, evidence, and fps.
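For instance, "normalized timecodes" typically means converting second offsets into frame-accurate SMPTE strings at a known fps. Here is a minimal sketch that emits a CSV shotlist with the fields named above; the row contents are invented examples.

```python
# Emit an editor-ready CSV shotlist with normalized SMPTE timecodes.
import csv
import sys

def to_timecode(seconds: float, fps: int = 25) -> str:
    frames = round(seconds * fps)
    f = frames % fps
    s = (frames // fps) % 60
    m = (frames // (fps * 60)) % 60
    h = frames // (fps * 3600)
    return f"{h:02}:{m:02}:{s:02}:{f:02}"

rows = [  # invented example event
    {"start": 12.4, "end": 18.0, "label": "goal",
     "evidence": "crowd cheers; scoreboard changes", "fps": 25},
]

writer = csv.DictWriter(sys.stdout,
                        fieldnames=["start", "end", "label", "evidence", "fps"])
writer.writeheader()
for r in rows:
    r["start"] = to_timecode(r["start"], r["fps"])
    r["end"] = to_timecode(r["end"], r["fps"])
    writer.writerow(r)
```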
3 - Agentic reasoning is increasingly needed at the clip level

As user requests grow more compound and multi-step, they often demand precise frame spans, OCR snippets, speaker segments, and detection logs — requirements that can’t be satisfied in a single pass. Addressing these cases calls for iterative, agentic behavior, where the model plans, verifies, and refines across multiple operations anchored to time and evidence.
Design implication: Build self-verifying loops into the workflow. The system should autonomously plan and validate its outputs, attach timestamps, representative frames, OCR/ASR text, and confidence scores, and reprocess uncertain spans as needed.
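Sketched as code, a self-verifying loop looks roughly like the skeleton below. All three helpers (plan, run_step, verify) are hypothetical placeholders for whatever planner, model call, and checker a given pipeline uses; the point is the control flow, in which low-confidence spans go back through the loop with a narrower instruction.

```python
# Skeleton of a plan -> run -> verify -> refine loop over video spans.
# plan(), run_step(), verify(), and step.narrowed() are hypothetical
# placeholders, not a real Pegasus or TwelveLabs interface.

CONF_THRESHOLD = 0.8
MAX_RETRIES = 2

def agentic_pass(video_id, goal, plan, run_step, verify):
    results = []
    queue = [(step, 0) for step in plan(goal)]  # (step, retry count)
    while queue:
        step, retries = queue.pop(0)
        out = run_step(video_id, step)   # model call anchored to a span
        conf = verify(out)               # e.g. cross-check against OCR/ASR
        if conf >= CONF_THRESHOLD or retries >= MAX_RETRIES:
            results.append({"step": step, "output": out, "confidence": conf})
        else:
            # Reprocess the uncertain span with a narrower instruction.
            queue.append((step.narrowed(), retries + 1))
    return results
```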
How TwelveLabs empowers complex video workflows
Our product suite is designed to support these emerging patterns. Marengo provides high-fidelity multimodal embeddings across video, audio, image, and text, enabling flexible retrieval, entity filtering, and fine-grained content search. Pegasus specializes in video-to-text tasks, delivering summaries, detailed descriptions, timestamped event extraction, and explanations grounded in multimodal evidence.
Together, these models form the foundation for the agentic video tasks users expect. When combined with supporting modules such as shot and scene detection, event segmentation, tracking and linking, temporal alignment, confidence calibration, evidence management, and iterative feedback, users can build goal-oriented, reliable, and scalable workflows that go far beyond a single answer.
To expose these capabilities to AI agents and LLM-based assistants, we also provide an MCP (Model Context Protocol) server. MCP is an open protocol that allows tools to connect to external data sources and models through a standardized interface. In our implementation, the MCP server acts as the bridge between your agents and TwelveLabs models, so assistants can search videos, generate summaries, extract events, or check safety policies without manual integration.
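For orientation, connecting an agent to an MCP server generally follows the standard client flow from the MCP Python SDK: open a session, list the tools the server exposes, then call one. The sketch below follows that generic pattern; the server launch command and the tool name "search_videos" are assumptions for illustration, not the actual TwelveLabs MCP interface.

```python
# Generic MCP client flow (MCP Python SDK). The server launch command and
# the tool name "search_videos" are assumptions, not the real interface.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="twelvelabs-mcp")  # assumed launcher
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # discover exposed tools
            result = await session.call_tool(
                "search_videos", {"query": "goal celebrations"}
            )
            print(result)

asyncio.run(main())
```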
Conclusion and Key Takeaways
Across 11 task categories and four intents, Pegasus is not used as a single feature but as part of a broader workflow. Users blend summarization, segmentation, evidence gathering, and evaluation to reach a goal, which means video AI is judged by how well it collaborates across steps, not by any one step in isolation.
TwelveLabs is building for that reality. Marengo supplies retrieval and filtering, Pegasus provides reasoning and explanation tied to timestamps and evidence, and our MCP server exposes these capabilities to agents and tools so they can plan over timelines, produce editor-ready outputs, and verify results.