Author
Minjoon Seo, James Le
Date Published
March 12, 2024
Tags
Foundation models
Generate API
Generative AI
Multimodal AI
Video understanding
Check out the Pegasus-1 technical report on arXiv and HuggingFace!

1 - Introduction

At Twelve Labs, our goal is to advance video understanding through the creation of innovative multimodal AI models. In a previous post, "Introducing Video-to-Text and Pegasus-1 (80B)," we introduced the alpha version of Pegasus-1, a foundation model that generates descriptive text from video input. Today, we're thrilled to announce the launch of Pegasus-1's open beta.

Pegasus-1 is designed to understand and articulate complex video content, transforming how we interact with and analyze multimedia. With roughly 17 billion parameters, this model is a significant progression in multimodal AI. It can process and generate language from video input with exceptional accuracy and detail.

In this update, we'll explore the enhancements made to Pegasus-1 since its alpha release. These include improvements in data quality, video processing, and training methods. We will also share benchmarking results against leading commercial and open-source models, demonstrating Pegasus-1's superior performance in video summarization, question answering, and conversations. Beyond quantitative metrics, Pegasus-1 also shows qualitative improvements through its enhanced world knowledge and ability to capture detailed visual information.

2 - Model Overview

Figure 1: Pegasus-1 model architecture and the role played by our Marengo model

As a quick recap, Pegasus-1 is a multimodal foundation model designed to bridge the gap between video content and language, enabling machines to interpret and generate text based on video input. The architecture of Pegasus-1 is composed of three main components:

  1. The video encoder model processes the video input to generate rich embeddings from both video frames and automatic speech recognition (ASR) data. These embeddings are dense representations that capture the visual and auditory essence of the video content.
  2. The video-language alignment model maps the video embeddings to corresponding language embeddings, creating a shared space where video and text representations are aligned. This alignment is crucial for the model to understand the correspondence between what is seen in the video and the language that describes it.
  3. The large language model decoder takes the aligned embeddings and user prompts to generate coherent and contextually relevant text output. This output can range from descriptive summaries to answers to specific questions about the video content.

Compared to the 80B-parameter alpha version, the open-beta version of Pegasus-1 has approximately 17 billion parameters, making it a more compact yet powerful tool for interpreting and generating text based on video data.
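To make the data flow above concrete, here is a minimal sketch of how the three components compose at inference time. The function names, embedding sizes, and placeholder return values are purely illustrative assumptions for exposition; they are not the actual Pegasus-1 implementation or API.

```python
# Minimal sketch of the three-component pipeline described above.
# All names, dimensions, and return values are illustrative stand-ins,
# not the actual Pegasus-1 implementation or API.
from typing import List

def encode_video(frames: List[bytes], asr_text: str) -> List[List[float]]:
    """1. Video encoder: dense embeddings from video frames and ASR data."""
    return [[0.0] * 1024 for _ in frames]  # placeholder embeddings

def align_to_language_space(video_embeddings: List[List[float]]) -> List[List[float]]:
    """2. Video-language alignment: map video embeddings into the LLM's embedding space."""
    return video_embeddings  # identity stand-in for the learned mapping

def generate_text(aligned_embeddings: List[List[float]], prompt: str) -> str:
    """3. LLM decoder: generate text conditioned on aligned embeddings and the user prompt."""
    return f"(generated response to: {prompt})"  # placeholder generation

def video_to_text(frames: List[bytes], asr_text: str, prompt: str) -> str:
    embeddings = encode_video(frames, asr_text)
    aligned = align_to_language_space(embeddings)
    return generate_text(aligned, prompt)
```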

3 - Major Improvements

As we progress from the alpha to the open-beta version, we continue to refine and enhance the model to deliver even more accurate and detailed video-language understanding. These enhancements are driven by three key factors: high-quality data, optimized video processing, and refined training techniques.

3.1 - Data Improvement

In line with previous findings, we find that the quality and granularity of captions have a more significant impact on the model's performance than the sheer quantity of data. For instance, Pegasus-1 trained on 100,000 high-quality video-text pairs easily outperforms the same architecture trained on a much larger dataset (10M+ pairs) with lower-quality captions.

With this experimental evidence in mind, we designed an efficient data annotation pipeline to create high-quality captions for the aforementioned 10M+ videos. Trained on this large set of high-quality video-text pairs, Pegasus-1 attains foundational video understanding capabilities that are not observed in other models.

3.2 - Video Processing Improvement

We made substantial changes to the video processing pipeline to optimize both spatial and temporal resolution. We increased the number of patches per frame by 10x (spatial) and the number of frames by 1.5x (temporal), resulting in a 15x increase in the total number of patches per video. This enhancement allows Pegasus-1 to capture and convey more information per frame.
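To see how the two factors compound, here is a small worked example with hypothetical baseline numbers (the actual per-frame patch count and frame count are not disclosed in this post):

```python
# Hypothetical baseline numbers for illustration only; the actual
# alpha-version patch and frame counts are not disclosed.
alpha_patches_per_frame = 256
alpha_frames_per_video = 32

beta_patches_per_frame = alpha_patches_per_frame * 10          # 10x spatial
beta_frames_per_video = int(alpha_frames_per_video * 1.5)      # 1.5x temporal

alpha_total = alpha_patches_per_frame * alpha_frames_per_video  # 8,192 patches
beta_total = beta_patches_per_frame * beta_frames_per_video     # 122,880 patches

print(beta_total / alpha_total)  # 15.0 -> the 15x increase in total patches per video
```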

Pegasus-1 is now also able to better grasp the overall story and context of the video in a more coherent manner, as evidenced by both qualitative and quantitative analyses, particularly in question-answering datasets.

3.3 - Training Improvement

As a multimodal foundation model, Pegasus-1 is trained on massive multimodal datasets over multiple stages. However, multi-stage training often suffers from the phenomenon known as catastrophic forgetting. This occurs when a model, upon learning new information, rapidly forgets the old information it was previously trained on. This issue becomes even more pressing in multimodal models that undergo sequential training across modalities.

To address this, we employ a strategic training regimen that involves multiple stages - each meticulously designed to balance the acquisition of new knowledge with the preservation of previously learned information. The key to this approach lies in the selective updates (unfreezing) of model parameters and the careful adjustment of learning rates throughout the training process.
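The sketch below illustrates the general technique of selective unfreezing with per-stage learning rates, written in PyTorch. The stage names, module names, and learning-rate values are hypothetical assumptions for exposition, not the actual Pegasus-1 training recipe.

```python
# Illustrative PyTorch sketch of selective unfreezing with per-stage learning
# rates. Stage names, module names, and values are hypothetical, not the
# actual Pegasus-1 training configuration.
import torch

STAGES = {
    # Train only the alignment model first, keeping encoder and decoder frozen.
    "alignment_pretraining": {"alignment": 1e-4},
    # Later unfreeze the decoder too, with a smaller learning rate to limit
    # catastrophic forgetting of its pretrained language knowledge.
    "instruction_tuning": {"alignment": 2e-5, "decoder": 1e-5},
}

def configure_stage(model: torch.nn.Module, stage: str) -> torch.optim.Optimizer:
    lrs = STAGES[stage]
    param_groups = []
    for name, module in model.named_children():   # e.g. "encoder", "alignment", "decoder"
        trainable = name in lrs
        for p in module.parameters():
            p.requires_grad = trainable           # freeze or unfreeze this component
        if trainable:
            param_groups.append({"params": module.parameters(), "lr": lrs[name]})
    return torch.optim.AdamW(param_groups)
```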

The open-beta version of Pegasus-1, compared to the alpha version, boasts enhanced capabilities, such as a heightened ability to capture fine-grained temporal moments and a reduced propensity for hallucination, resulting in increased robustness across diverse video domains. It also demonstrates expanded world knowledge and an improved capacity to list various moments in temporal order, rather than concentrating on a singular scene.

4 - Quantitative Benchmark Results

Figure 2: Experimental Results on Video Question Answering, Conversation, and Summarization Benchmarks.

In our comprehensive benchmarking efforts, Pegasus-1 has been evaluated against a spectrum of both commercial and open-source models. This section will detail the performance of Pegasus-1 in comparison to these models across various video-language modeling tasks.

4.1 - Baseline Models

The baseline models against which Pegasus-1 was benchmarked include:

  • Gemini Pro 1.5: A commercial multimodal model from Google DeepMind known for its robust video-language capabilities. Gemini Pro was released in November 2023 and most recently updated in February 2024; we compare against its most recent version, Gemini Pro 1.5.
  • Whisper + ChatGPT-3.5 (OpenAI): This combination is one of the few widely adopted approaches to video summarization. It chains a state-of-the-art speech-to-text model with a large language model, so summaries are derived primarily from the video's auditory content. A significant drawback is that it overlooks the valuable visual information within the video.
  • Vendor A's Summary API: A widely adopted commercial product for audio and video summary generation. It appears to rely only on transcription data and language models (akin to ASR + ChatGPT-3.5) to produce a video summary.
  • Video-ChatGPT: Developed by Maaz et al. (June 2023), this is a video-language model with a chat interface. The model processes video frames to capture visual events within a video. Note that it does not utilize the conversational information within the video.
  • VideoChat2: Developed by Li et al. (November 2023), this is a state-of-the-art open-source multimodal LLM that employs progressive multimodal training with diverse instruction-tuning data.

We specifically excluded image-based vision-language models such as LLaVA and GPT-4V from our comparison because they lack native video processing capabilities, which is a critical requirement for the tasks we are evaluating. Specifically, their limitations are as follows:

  • Many of these models can only see a single image, and they therefore perform poorly on most video benchmark datasets.
  • Some of these models (e.g., GPT-4V) can see multiple images at once, but they can process only a few frames (10 or fewer) of a video at a time, which is not sufficient for most videos longer than one minute.
  • Image-based models especially struggle with the dynamic and fluid nature of video content, which cannot be captured when the input is treated as a series of independent images rather than a coherent video.
  • Moreover, the time these models take to process videos is impractically long for real-world applications, because they lack mechanisms to efficiently handle the temporal dimension of video, which is essential for understanding the unfolding narrative and actions within the content.

4.2 - Results on Video Question Answering
Table 2: Zero-Shot Evaluation on ActivityNet-QA and NExT-QA Test Split

In Video Question Answering tasks, Pegasus-1's zero-shot performance on both the ActivityNet-QA and NExT-QA datasets is particularly noteworthy. Pegasus-1 demonstrates a remarkable ability to generalize across diverse videos and accurately answer video-related questions without task-specific training.

4.3 - Results on Video Conversations
Table 3: Zero-shot evaluation on Video-ChatGPT Benchmark

The Video-ChatGPT Benchmark (also known as QEFVC) results highlight Pegasus-1's adeptness in handling video conversations. Pegasus-1 leads the pack with scores that reflect its proficiency in Correctness, Detail, Context, Temporal understanding, and Consistency. Notably, Pegasus-1 scored 3.79 in Correctness and 4.29 in Detail, showcasing its nuanced grasp of video conversations and the context in which they occur.
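For context on how these numbers are produced, the Video-ChatGPT benchmark scores free-form answers with an LLM judge on each axis. The sketch below is a simplified, hypothetical version of that style of scoring; the prompt wording is paraphrased and `judge_llm` stands in for whichever chat-completion client is used, so it is not the benchmark's official evaluation script.

```python
# Simplified sketch of LLM-as-a-judge scoring in the style of the
# Video-ChatGPT benchmark. Prompt wording is paraphrased; `judge_llm` is a
# placeholder for an actual chat-completion client.
from statistics import mean
from typing import Callable, Dict, List

AXES = ["Correctness", "Detail Orientation", "Contextual Understanding",
        "Temporal Understanding", "Consistency"]

def score_one(judge_llm: Callable[[str], str], axis: str,
              question: str, reference: str, prediction: str) -> float:
    prompt = (
        f"Rate the predicted answer for {axis} on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        f"Reply with a single number."
    )
    return float(judge_llm(prompt).strip())

def evaluate(judge_llm: Callable[[str], str],
             samples: List[Dict[str, str]]) -> Dict[str, float]:
    """Average the judge's score on each axis over all evaluation samples."""
    return {
        axis: mean(score_one(judge_llm, axis, s["question"],
                             s["reference"], s["prediction"])
                   for s in samples)
        for axis in AXES
    }
```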

4.4 - Results on Video Summarization
Table 4: Zero-shot evaluation on MSR-VTT Dataset (Following Video-ChatGPT Benchmark’s scoring scheme)

Lastly, Pegasus-1 has demonstrated superior performance in summarizing videos. We compare Pegasus-1 against its competitors on the MSR-VTT dataset, using the Video-ChatGPT Benchmark's scoring scheme ("Temporal Understanding" and "Consistency" are omitted due to the nature of summarization). As shown above, Pegasus-1 outperforms the baseline models on all metrics by a significant margin.

Through these benchmarks, Pegasus-1 has established itself as a formidable contender in the video-language modeling arena, setting new standards for zero-shot performance and generalization in video understanding tasks.

5 - Qualitative Examples

The following examples were randomly selected from diverse domains to illustrate the capabilities of Pegasus-1.

Generation Examples
E-Learning Video
Ad Video
Movie Trailer Video
Comparison against existing models

6 - Limitations

Safety & Biases: Pegasus-1 is designed with safety mechanisms; however, as with any AI model, there is a risk of generating content that could be considered harmful or inappropriate without proper oversight and regulation. Our understanding of ethical and safety measures for video foundation models is still evolving. As we continue testing and gathering feedback, a detailed evaluation and ethics report will be made available.

Video Duration: Our API supports videos ranging from 4 seconds to 20 minutes in length. This constraint is due to computational and memory considerations, which are common challenges in handling large-scale video data. As a result, users may need to segment longer videos into smaller parts to fully leverage the model's capabilities. We will work on natively supporting longer durations in future releases.
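As a practical workaround for the 20-minute limit, a long video can be split into chunks locally before upload, for example with ffmpeg's segment muxer. The sketch below is one way to do this; the file names and chunk length are examples, and because the streams are copied without re-encoding, chunk boundaries snap to keyframes, so chunk lengths are approximate.

```python
# Example workaround for the 20-minute limit: split a long video into
# roughly 20-minute chunks with ffmpeg before uploading each chunk.
# Requires ffmpeg on PATH; file names here are placeholders.
import subprocess

def split_video(path: str, chunk_seconds: int = 1200) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", path,
            "-c", "copy",                 # stream copy, no re-encoding
            "-map", "0",                  # keep all audio/video streams
            "-f", "segment",              # use the segment muxer
            "-segment_time", str(chunk_seconds),
            "-reset_timestamps", "1",     # each chunk starts at t=0
            "chunk_%03d.mp4",
        ],
        check=True,
    )

split_video("long_recording.mp4")
```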

Hallucinations: Pegasus-1 can occasionally produce inaccurate outputs. While we have improved it from our alpha version to reduce hallucination, users should be mindful of this limitation, especially for tasks where high precision is required and factual correctness is critical.

7 - Closing Remarks

The journey of Pegasus-1 from its alpha to the beta version has been marked by significant enhancements. The meticulous improvements in training data quality, video processing capabilities, and advanced training techniques have culminated in a model that not only understands video content more deeply but also interacts in a conversational context with a level of sophistication that was previously unattainable.

The benchmark results speak for themselves, placing Pegasus-1 at the forefront of the industry, outperforming established models like Google's Gemini Pro and setting new standards in video QA and video conversational frameworks. These quantitative measures, alongside the qualitative improvements in world knowledge and detail recognition, underscore the transformative potential of Pegasus-1.

While we recognize the limitations of Pegasus-1, including safety concerns, video length constraints, and occasional hallucinations, these are areas of active research and development. Our commitment to improving Pegasus-1 remains steadfast as we aim to push the boundaries of video understanding technology.

Twelve Labs Team

This is a joint team effort across multiple functional groups including model and data ("core" indicates Core Contributor), engineering, product and business development. (First-name alphabetical order)

Model: Aiden Lee, Cooper Han, Flynn Jang (core), Jae Lee, Jay Yi (core), Jeff Kim, Jeremy Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Ray Jung (core), William Go (core)

Data: Daniel Kim (core), Jay Suh (core)

Deployment: Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong

Product: Andrei Popescu, Esther Kim, EK Yoon, Genie Heo, Henry Choi, Jenna Kang, Kevin Han, Noah Seo, Sunny Nguyen, Ryan Won, Yeonhoo Park

Business & Operations: Anthony Giuliani, Dave Chung, Hans Yoon, James Le, Jenny Ahn, June Lee, Maninder Saini, Meredith Sanders, Soyoung Lee, Sue Kim, Travis Couture

Resources:
  1. Link to sign up and play with our API
  2. Link to the API documentation
  3. Link to our Discord community to connect with fellow users and developers

If you use this model in your work, please use the following BibTeX citation and cite the author as Twelve Labs:

@misc{pegasus-1-beta,
  author = {Twelve Labs Team},
  title = {Pegasus-1 Open Beta: Setting New Standards in Video-Language Modeling},
  url = {https://www.twelvelabs.io/blog/pegasus-1-beta},
  year = {2024}
}
