Partnerships

From Video to Vector: Building Smart Video Agents with TwelveLabs and Langflow

James Le

Jun 4, 2025

12 Min

We'd like to give a huge thanks to the team at Langflow/DataStax (Melissa Herrera, Eric Hare, Gokul Krishnaa, and Alejandro Cantarero) for the collaboration!



1 - Introduction and Overview

Welcome to the wild world of video AI, where your apps can finally “see” what’s happening in videos—just like you do! 🎬

This tutorial is your golden ticket to building next-level video-powered agents with TwelveLabs and Langflow. No need for a PhD in computer vision—just bring your curiosity and a few video clips to the party.


Why should you care?

TwelveLabs brings the brains: our Pegasus and Marengo models let your code understand, index, and chat with video content—answering questions, finding key moments, and even generating smart summaries. Langflow is your playground: a visual builder that turns your ideas into AI workflows and instantly serves them as APIs. Together, they make it easy to create, test, and deploy video AI solutions—whether you’re building a video search engine, a content moderation bot, or a multimodal assistant that can handle text, images, and videos.


What’s inside?

In this tutorial, you’ll go from zero to video hero. You’ll set up your environment, master the TwelveLabs components in Langflow, and build real-world flows—like chatting with videos, generating and storing video embeddings, and even crafting a full-blown RAG system that retrieves and answers questions from your video library. Along the way, you’ll pick up best practices, discover killer use cases, and get inspired to build your own video-powered apps.

So grab some popcorn, fire up your code editor, and let’s make your AI workflows truly video-smart! 🚀🍿



2 - Setting Up Your Development Environment

Let's prepare your environment to start building with TwelveLabs and Langflow.



Installing Langflow
  1. Install Langflow using pip: pip install langflow

  2. Start the Langflow server: langflow run

  3. Access the Langflow UI by navigating to http://127.0.0.1:7860 in your browser.


Obtaining TwelveLabs API Credentials
  1. Create an account on the TwelveLabs platform if you don't already have one.

  2. Navigate to your profile settings and create a new API key.

  3. Make note of your API key as you'll need it to authenticate requests to TwelveLabs services. A quick way to verify the key works is sketched below.
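
If you'd like to confirm the key before wiring it into Langflow, a minimal check like the following will do. This sketch assumes the v1.2 REST endpoint and the x-api-key header, and the environment variable name is just a suggestion; consult the TwelveLabs API reference if your account uses a newer API version.

```python
# Quick sanity check that the API key works before using it in Langflow.
# Assumes the v1.2 REST endpoint and the x-api-key header.
import os
import requests

API_KEY = os.environ["TWELVE_LABS_API_KEY"]  # export this before running

resp = requests.get(
    "https://api.twelvelabs.io/v1.2/indexes",
    headers={"x-api-key": API_KEY},
    timeout=30,
)
resp.raise_for_status()
indexes = resp.json().get("data", [])
print(f"API key OK - {len(indexes)} existing index(es) found")
```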



Verifying TwelveLabs Components
  1. In the Langflow interface, create a new flow by clicking the "+" button.

  2. Open the component sidebar and search for "TwelveLabs". You should see the following components:

    • Video File

    • Split Video

    • Twelve Labs Pegasus Index Video

    • Twelve Labs Pegasus

    • Twelve Labs Text Embeddings

    • Twelve Labs Video Embeddings

    • Convert AstraDB to Pegasus Input

  3. If you don't see these components, ensure Langflow is updated to the latest version.


Preparing Sample Videos
  1. Select 2-3 short video clips (1-3 minutes each) for testing during this tutorial. Common formats like MP4, MOV, and AVI are supported.

  2. For optimal performance during learning, use videos with clear visual content and distinct scenes. This helps demonstrate the video understanding capabilities more effectively.

  3. Create a dedicated folder for your sample videos so they're easy to locate during the tutorial.



Troubleshooting Tips

If you encounter issues with the TwelveLabs components:

  1. API Authentication: Verify your API key is correctly entered and has not expired.

  2. Video Processing: Ensure your videos are in supported formats and aren't too large. Start with clips under 100MB for faster processing.

  3. Component Loading: If components don't appear, try restarting Langflow or checking the console for error messages.

  4. Dependencies: The integration requires FFmpeg for video processing. Install it using your system's package manager if not already available; a quick check is sketched below.
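
Here is a minimal sketch of that FFmpeg check, assuming you only need to confirm the binary is on your PATH:

```python
# Check that FFmpeg is installed and on PATH before running the video components.
# Purely illustrative - install FFmpeg via your OS package manager if this fails.
import shutil
import subprocess

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("FFmpeg not found on PATH - install it first (e.g., apt/brew install ffmpeg)")

result = subprocess.run([ffmpeg_path, "-version"], capture_output=True, text=True, check=True)
print(result.stdout.splitlines()[0])  # e.g. "ffmpeg version 6.1 ..."
```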

With your environment set up, you're ready to start building powerful video-enabled AI workflows with TwelveLabs and Langflow.



3 - Understanding TwelveLabs Components in Langflow

Langflow now includes seven powerful TwelveLabs components that enable advanced video understanding capabilities in your AI workflows. These components work together to process videos, generate embeddings, index video content, and enable natural language interactions with visual content.



3.1 - Video File Component

The Video File Component serves as your entry point for video processing workflows in Langflow. It handles a wide range of common video formats including MP4, AVI, MOV, and MKV, making it versatile for different video sources. Implementation is straightforward - simply provide a path to your video file, and the component returns a Data object containing both the file path and essential metadata. For optimal performance during development, we recommend starting with videos under 100MB.



3.2 - Split Video Component

This powerful component intelligently segments longer videos into smaller, more manageable clips. You can fine-tune the splitting process through several parameters: set your desired clip duration in seconds, choose how to handle the final clip (truncate, overlap with previous content, or keep as-is), and decide whether to retain the original video alongside the clips. The component outputs a collection of clips as Data objects, complete with detailed metadata. For best results in retrieval and understanding tasks, we recommend creating clips between 6 and 30 seconds in length.
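
To make the clip-duration parameter concrete, here's a rough sketch of the same idea done by hand with FFmpeg's segment muxer. This is purely illustrative (it is not the component's implementation), and the file names are placeholders.

```python
# Illustration of the clip-duration idea, not the Split Video component itself:
# cut a source video into ~6-second segments with FFmpeg's segment muxer.
import subprocess
from pathlib import Path

def split_video(src: str, clip_seconds: int = 6, out_dir: str = "clips") -> list[Path]:
    Path(out_dir).mkdir(exist_ok=True)
    pattern = str(Path(out_dir) / "clip_%03d.mp4")
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c", "copy", "-map", "0",           # stream copy: fast, no re-encode
            "-f", "segment",
            "-segment_time", str(clip_seconds),  # target clip length in seconds
            "-reset_timestamps", "1",
            pattern,
        ],
        check=True,
    )
    return sorted(Path(out_dir).glob("clip_*.mp4"))

clips = split_video("big_buck_bunny_720.mp4", clip_seconds=6)
print(f"{len(clips)} clips written to ./clips")
```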



3.3 - Twelve Labs Pegasus Index Video Component

At the heart of video indexing lies the Pegasus Index Video Component. It interfaces with TwelveLabs' Pegasus API to create searchable indexes of your video content. After providing your API key and video data (typically from the Split Video component), it generates indexed video data complete with unique video_id and index_id identifiers. This indexing step is crucial before performing any queries with the Pegasus component.



3.4 - Twelve Labs Pegasus Component

The Pegasus Component enables sophisticated natural language interaction with your indexed video content. Configuration requires your API key, the query text, and the relevant video and index identifiers. Acting as an intelligent reasoning layer, it processes both the query context and video content to generate conversational responses. This component truly bridges the gap between natural language understanding and video content analysis.
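
Outside of Langflow, the same index-then-ask pattern looks roughly like the sketch below using the twelvelabs Python SDK. Method and parameter names have shifted between SDK versions (for example, how the Pegasus model and its options are specified when creating an index), so treat this as an assumption-laden sketch rather than a drop-in snippet.

```python
# Sketch only: index a clip with Pegasus, then ask a question about it.
# SDK method/parameter names vary by version - consult the twelvelabs-python docs.
import os
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])

# Create an index backed by a Pegasus model; option names vary by API version.
index = client.index.create(
    name="bunny-test-index",
    engines=[{"name": "pegasus1.2", "options": ["visual", "audio"]}],
)

# Upload a clip and wait for indexing to finish.
task = client.task.create(index_id=index.id, file="clips/clip_001.mp4")
task.wait_for_done(sleep_interval=5)

# Ask an open-ended question about the indexed video.
answer = client.generate.text(video_id=task.video_id, prompt="What animal is in the video?")
print(answer.data)
```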



3.5 - Twelve Labs Text Embeddings Component

Leveraging TwelveLabs' Marengo model, this component generates vector embeddings from text input. It requires your API key and currently supports the Marengo-retrieval-2.7 model. The output is fully compatible with vector databases, making it perfect for sophisticated retrieval systems. For consistency in your applications, we recommend using the same embedding models across both text and video components.



3.6 - Twelve Labs Video Embeddings Component

Similar to its text counterpart, the Video Embeddings Component creates dense vector representations of video content. It shares the same configuration pattern as the text embeddings component but specializes in processing video files. The resulting embeddings enable powerful semantic similarity searches between videos or between text and video content, opening up exciting possibilities for multimodal applications.
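
Because Marengo embeds text and video into the same 1024-dimensional space, the two can be compared directly. The snippet below uses random placeholder vectors in place of real component output, just to show the similarity math behind text-to-video search.

```python
# Text and video embeddings from Marengo share one 1024-dimensional space, so a
# plain cosine similarity ranks clips against a text query.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

text_vec = np.random.rand(1024)       # placeholder for a Text Embeddings output
clip_vecs = np.random.rand(5, 1024)   # placeholders for five Video Embeddings outputs

scores = [cosine_similarity(text_vec, v) for v in clip_vecs]
best = int(np.argmax(scores))
print(f"Most similar clip: #{best} (score {scores[best]:.3f})")
```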



3.7 - Convert AstraDB to Pegasus Input Component

This essential connector component bridges the gap between vector database operations and TwelveLabs components. It processes search results from AstraDB, extracting crucial video_id and index_id information, and formats them properly for the Pegasus component. This component becomes particularly valuable when implementing RAG (Retrieval Augmented Generation) patterns with video content, ensuring smooth data flow between your vector database and video processing pipeline.
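
Conceptually, the converter does something like the sketch below: take the top AstraDB hit and pull out the identifiers Pegasus needs. The document shape shown is hypothetical; your stored metadata keys may differ.

```python
# Conceptual sketch of the converter: take the top AstraDB search hit and pull out
# the identifiers Pegasus needs. The document/metadata shape here is hypothetical.
def to_pegasus_input(astra_hits: list[dict]) -> dict:
    top = astra_hits[0]                  # highest-similarity clip first
    meta = top.get("metadata", top)      # metadata may be nested or flat
    return {"video_id": meta["video_id"], "index_id": meta["index_id"]}

hits = [{"content": "clip_003.mp4",
         "metadata": {"video_id": "abc123", "index_id": "idx789"}}]
print(to_pegasus_input(hits))  # {'video_id': 'abc123', 'index_id': 'idx789'}
```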

These components are designed to work together seamlessly. For example, you can create a video search engine by:

  1. Loading videos with Video File

  2. Splitting them into digestible chunks with Split Video

  3. Indexing them with Pegasus Index Video

  4. Generating embeddings with Video Embeddings

  5. Storing in a vector database like AstraDB

  6. Retrieving relevant clips with Convert AstraDB to Pegasus Input

  7. Answering questions with Twelve Labs Pegasus

When building your workflows, think of these components as building blocks that handle different aspects of video understanding, from loading and processing to indexing, embedding, and retrieval.



4 - Pegasus Chat with Video: Your First Interactive Video Workflow

This section guides you through creating a fundamental Langflow workflow that allows you to "chat" with a video using the Twelve Labs Pegasus model. This flow demonstrates the direct integration of video processing, indexing, and natural language querying.

The "Pegasus Chat with Video" flow, as shown in the reference image, enables you to upload a video, have it automatically indexed by Twelve Labs, and then ask questions about its content through a chat interface.

Follow these steps to build this flow:

  1. Upload Your Video:

    • Drag a Video File component onto the Langflow canvas.

    • In the Video File component's properties, click to upload one of your sample video files (e.g., big_buck_bunny_720.mp4 as shown in the image). This component will load your video and make it available to other components.

  2. Connect Video Data to Pegasus:

    • Add a Twelve Labs Pegasus component to the canvas.

    • Connect the Data output (the red dot) from the Video File component to the Video Data input (the red dot) on the Twelve Labs Pegasus component. This tells the Pegasus component which video to process.

  3. Configure Pegasus Component:

    • In the Twelve Labs Pegasus component:

      • Enter your Twelve Labs API Key in the designated field.

      • Provide an Index Name. This can be any descriptive name you choose for the video index that will be created (e.g., "bunny-test-index").

      • The Model will default to a Pegasus model like pegasus1.2. You can typically leave this as is unless you have specific model requirements.

    • Developer Note: When you provide Video Data and an Index Name without an Index ID, the Pegasus component will automatically initiate the video indexing process with Twelve Labs for the new video. The video_id and index_id will be generated by Twelve Labs during this process.

  4. Enable User Input:

    • Add a Chat Input component to the canvas.

    • Connect the output of the Chat Input component to the Prompt input (the blue dot, labeled "Receiving input" in the image) of the Twelve Labs Pegasus component. This allows you to type questions that will be sent to the Pegasus model.

  5. Display Chat Responses:

    • Add a Chat Output component.

    • Connect the Message output (the blue dot) from the Twelve Labs Pegasus component to the input of this Chat Output component. This will display the answers from the Pegasus model.

  6. Optional: Output Video ID for Re-chatting:

    • To avoid re-indexing the same video for subsequent chat sessions, you can capture the Video ID and Index ID generated by Twelve Labs.

    • Add another Chat Output component.

    • Connect the Video ID output (the blue dot) from the Twelve Labs Pegasus component to this new Chat Output component.

    • Developer Tip: In future sessions with the same video, you can directly input the previously generated Video ID and Index ID into the Pegasus Video ID and Index ID fields of the Twelve Labs Pegasus component, respectively. This bypasses the re-indexing step, making interactions faster. If you only provide the Video Data and Index Name, it will re-index.

  7. Test in the Playground:

    • Open the Langflow playground (usually by clicking the chat bubble icon on the right sidebar).

    • Type a question about your video into the chat interface (e.g., "What animal is in the video?", "Describe the main character.").

    • The Pegasus component will process your video (if it's the first time), index it, and then answer your question based on the video's content. The response will appear in the connected Chat Output.

This simple yet powerful flow serves as a foundational example of how to integrate TwelveLabs' video understanding directly into an interactive Langflow agent. You can now ask questions and get answers about the content of your videos in near real-time.
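
Since Langflow also serves every saved flow as an API, you can drive this chat flow from code. The sketch below assumes Langflow's standard run endpoint and a placeholder flow ID; the nested response layout can vary by Langflow version, so inspect the JSON to find the Pegasus answer text.

```python
# Call the saved "Pegasus Chat with Video" flow as an API.
# The flow ID is a placeholder you copy from the flow's API pane/URL.
import requests

FLOW_ID = "<your-flow-id>"
url = f"http://127.0.0.1:7860/api/v1/run/{FLOW_ID}"

payload = {
    "input_value": "What animal is in the video?",
    "input_type": "chat",
    "output_type": "chat",
}

resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json())  # dig into the "outputs" entries to find the Pegasus answer text
```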



5 - Video Embeddings with Marengo and AstraDB: Building Your Semantic Video Index

This section details how to construct a Langflow workflow to generate video embeddings using TwelveLabs' Marengo model and subsequently store these embeddings in DataStax AstraDB. This is a crucial step for enabling semantic search over your video library or for building advanced Retrieval Augmented Generation (RAG) systems with video content.

The workflow, as illustrated in the provided reference image, ingests a video, generates its embeddings, and then stores these embeddings along with video metadata in an AstraDB collection.


5.1 - Prerequisites: Setting up AstraDB

Before building the Langflow flow, ensure your AstraDB environment is ready:

  1. Create an AstraDB Database:

    • If you don't have one, sign up and create a new serverless database on DataStax AstraDB.

    • Follow the AstraDB setup guides for initial database creation.

  2. Generate an Application Token:

    • Navigate to your Database Details section in the AstraDB console.

    • Generate an Application Token with appropriate permissions (e.g., Database Administrator or a role that allows read/write to your collections). Securely store this token, as you'll need it for the Langflow component.

  3. Create a video_embeddings Collection:

    • Within your AstraDB database, create a new collection. You might name it video_embeddings or similar.

    • Crucially, configure this collection to support vector search. This typically involves:

      • Specifying that the collection will store vectors.

      • Setting the vector dimension to 1024, which is the dimension for embeddings generated by TwelveLabs' Marengo models (like Marengo-retrieval-2.7 shown in the image).

    • Refer to the AstraDB documentation on collection management for detailed steps on enabling vector search and setting dimensions; a minimal Python sketch of creating such a collection follows this list.
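
If you prefer to create the collection programmatically, a sketch along these lines works with the astrapy Data API client. Signatures differ somewhat across astrapy releases, and the token and endpoint values are placeholders taken from your AstraDB console.

```python
# Sketch: create a vector-enabled collection sized for Marengo embeddings.
# Token and endpoint come from the AstraDB console; astrapy signatures can differ
# between releases, so treat this as a starting point.
import os
from astrapy import DataAPIClient

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database_by_api_endpoint(os.environ["ASTRA_DB_API_ENDPOINT"])

collection = db.create_collection(
    "video_embeddings",
    dimension=1024,    # matches Marengo-retrieval-2.7 embedding size
    metric="cosine",   # a common choice for semantic similarity
)
print("Created collection:", collection)
```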



5.2 - Building the Langflow Flow

Once AstraDB is set up, construct the following flow in Langflow:

  1. Upload Your Video:

    • Drag a Video File component onto the canvas.

    • Configure it by uploading your desired video file (e.g., big_buck_bunny_720.mp4 as seen in the image). This component makes the video data accessible to the workflow.

  2. Configure Video Embeddings Generation:

    • Add a Twelve Labs Video Embeddings component to the canvas.

    • Enter your Twelve Labs API Key in the API Key field.

    • Ensure the Model is set to Marengo-retrieval-2.7 (or another compatible Marengo model). This component will process the video and generate dense vector embeddings.

  3. Configure AstraDB for Ingestion:

    • Add a DS Astra DB component (or Astra DB as shown in the image) to the canvas.

    • Input your Astra DB Application Token into the corresponding field.

    • Specify the Database name where you created your vector-enabled collection.

    • Enter the name of your Collection (e.g., video_embeddings).

  4. Connect Embeddings to AstraDB:

    • Connect the Embeddings output (the green dot) from the Twelve Labs Video Embeddings component to the Embedding Model input (the green dot) on the DS Astra DB component.

    • Developer Note: This connection tells AstraDB how to interpret the incoming data as embeddings, but it doesn't provide the embeddings themselves. The actual embeddings come with the data to be ingested.

  5. Connect Video Data for Ingestion:

    • Connect the Data output (the red dot) from the Video File component to the Ingest Data input (the red dot) on the DS Astra DB component.

    • Developer Note: When data is passed to the Ingest Data input, the Astra DB component, if configured with an embedding model or connected to an embedding component as in this flow, will expect or generate embeddings for the incoming data. In this setup, the Twelve Labs Video Embeddings component handles the embedding generation, and its output (which includes the embedding along with original data) is passed to DS Astra DB.

  6. Run the Flow to Embed and Ingest:

    • To trigger the process, you typically "run" or activate the final component in this chain, which is the DS Astra DB component. In Langflow, this might involve clicking a play/run button on the component or initiating the flow if it's part of a larger executable graph.

    • Upon execution, the Video File component loads the video, and the Twelve Labs Video Embeddings component receives the video data (implicitly through the flow, or explicitly if its Video Data input is connected, which isn't shown for this component in the screenshot) and generates the embeddings. Finally, the DS Astra DB component takes the video data, now paired with those embeddings, and ingests it into the specified AstraDB collection.

After running this flow, your video's content will be represented as vector embeddings within your AstraDB collection, ready for semantic search, similarity comparisons, and as a knowledge base for RAG applications. You can verify this by querying your AstraDB collection directly.
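
To spot-check the result, you can query the collection directly with the astrapy client, as in this sketch. The zero vector stands in for a real query embedding, and the environment variables and field names are assumptions about your setup.

```python
# Spot-check the ingestion: count documents and run a vector-sorted search.
# Assumes the astrapy client and the default "$vector" field.
import os
from astrapy import DataAPIClient

client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
db = client.get_database_by_api_endpoint(os.environ["ASTRA_DB_API_ENDPOINT"])
collection = db.get_collection("video_embeddings")

print("documents stored:", collection.count_documents({}, upper_bound=1000))

query_vector = [0.0] * 1024  # placeholder - normally a Marengo text embedding
for doc in collection.find({}, sort={"$vector": query_vector}, limit=3):
    print(doc.get("metadata", doc.get("_id")))
```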



6 - Advanced RAG Implementation with Video Content

Building a robust Retrieval Augmented Generation (RAG) system for video involves several stages, from processing and indexing your video content to enabling semantic search and contextual understanding. This section is broken into two parts. First, we'll focus on preparing your video data by splitting it, indexing the clips with Twelve Labs Pegasus, generating embeddings with Marengo, and storing everything in AstraDB. Then, in Part 2, we'll build the retrieval and question-answering flow that searches those embeddings and chats with the matching clips through Pegasus.


Part 1: Split Video, Index Clips with Pegasus, and Save Embeddings to Astra DB

This initial flow, illustrated in the provided image, is the foundation of your video RAG system. It processes a source video into manageable clips, enriches these clips with indexing information from Pegasus, generates corresponding vector embeddings using Marengo, and finally stores this comprehensive dataset in AstraDB for efficient retrieval.

Here's how to construct this ingestion pipeline:

  1. Upload Source Video:

    • Begin by dragging a Video File component onto your Langflow canvas.

    • Configure it by uploading your video (e.g., big_buck_bunny_720... as shown in the image). This component outputs the raw video data.

  2. Split Video into Clips:

    • Add a Split Video component.

    • Connect the Data output from the Video File component to the Video Data input of the Split Video component.

    • Configure the Split Video component:

      • Clip Duration (seconds): Set this to define the length of each video segment. The image shows 6 seconds, which is a good starting point for granular analysis. You can adjust this based on your content; for instance, 30 seconds might be suitable for broader scene segmentation.

      • Last Clip Handling: Choose an option like Overlap Previous (as in the image) to ensure consistent clip lengths, or select Truncate or Keep Short based on your preference.

      • Include Original Video: Typically, this is turned off (as in the image) when processing clips for RAG, as you'll be working with the segments.

  3. Index Clips with Pegasus:

    • Introduce a Twelve Labs Pegasus Index Video component.

    • Connect the Video Clips output from the Split Video component to the Video Data input of the Twelve Labs Pegasus Index Video component. This component will process each clip.

    • In the component's settings:

      • Enter your Twelve Labs API Key.

      • Specify an Index Name (e.g., test index in the image). This helps organize your indexed content within Twelve Labs. The component will output Indexed Data, which includes the original clip data along with the video_id and index_id assigned by Pegasus.

      • The Model will default to a Pegasus model (e.g., pegasus1.2).

  4. Generate Video Embeddings with Marengo:

    • Add a Twelve Labs Video Embeddings component.

    • Enter your Twelve Labs API Key here as well.

    • Set the Model to Marengo-retrieval-2.7 (as shown in the image) or your preferred Marengo embedding model.

    • Crucially, connect the Indexed Data output from the Twelve Labs Pegasus Index Video component to the (implicit or explicit) video data input of the Twelve Labs Video Embeddings component. Developer Note: While the image doesn't show a direct named "Video Data" input on this component, the embeddings component needs to receive the video clips to process. The Indexed Data carries these clips.

  5. Configure AstraDB for Storage:

    • Place a DS Astra DB component onto the canvas.

    • Enter your Astra DB Application Token.

    • Select your target Database (e.g., video_embeddings) and Collection (e.g., video_embeddings) where the video data and embeddings will be stored. Ensure this collection is configured for vector search with a dimension matching your Marengo model (1024 for Marengo-retrieval-2.7).

  6. Connect Embeddings and Data to AstraDB:

    • Connect the Embeddings output from the Twelve Labs Video Embeddings component to the Embedding Model input on the DS Astra DB component. This tells AstraDB how to interpret the embedding vectors.

    • Connect the Indexed Data output from the Twelve Labs Pegasus Index Video component to the Ingest Data input on the DS Astra DB component. This sends the video clips (now enriched with Pegasus video_id and index_id) along with their generated embeddings to be stored.

  7. Run the Ingestion Flow:

    • Execute the flow. The Video File is loaded, split by Split Video. Each clip is then sent to Twelve Labs Pegasus Index Video for indexing. The resulting Indexed Data (clips + Pegasus IDs) is passed to Twelve Labs Video Embeddings to generate Marengo embeddings. Finally, the DS Astra DB component ingests this comprehensive data: the clips, their Pegasus index information, and their Marengo vector embeddings into your specified collection.

After this flow completes, your AstraDB collection will contain indexed video clips, each associated with its vector embedding and Pegasus identifiers, ready for the retrieval part of your RAG system.
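
For orientation, one stored record might look roughly like the hypothetical dictionary below: the clip reference, the Pegasus identifiers Part 2 will extract, and the Marengo vector. Actual field names depend on how the components write metadata.

```python
# Hypothetical shape of one ingested record after Part 1. Field names are
# illustrative only; inspect your collection to see what is actually stored.
ingested_clip = {
    "content": "clips/clip_004.mp4",       # produced by Split Video
    "metadata": {
        "video_id": "65f0...e2",           # assigned by Pegasus Index Video
        "index_id": "65ef...91",           # the Twelve Labs index the clip belongs to
        "clip_start": 18.0,                # illustrative timing metadata (seconds)
        "clip_end": 24.0,
    },
    "$vector": [0.0123, -0.0456],          # 1024-d Marengo embedding (truncated here)
}
print(ingested_clip["metadata"]["video_id"])
```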



Part 2: Query Video Embeddings and Chat with Video using Pegasus

With your video clips processed, indexed by Pegasus, and their Marengo embeddings stored in AstraDB (as detailed in Part 1), this second flow demonstrates the retrieval and generation aspects of RAG. As shown in the reference image, a user's text query is first embedded, then used to search AstraDB for relevant video clips. The information from these clips is then passed to the Twelve Labs Pegasus model to generate a contextual answer.

Here's how to build this retrieval and question-answering pipeline:

  1. User Query Input and Embedding:

    • Add a Chat Input component to the canvas. This will be where the user types their question.

    • Add a Twelve Labs Text Embeddings component.

      • Enter your Twelve Labs API Key.

      • Set the Model to Marengo-retrieval-2.7 (or the same Marengo model used in Part 1 for embedding the video clips).

    • Connect the output of the Chat Input component to the text input of the Twelve Labs Text Embeddings component. This ensures the user's query is converted into a vector embedding.

  2. Semantic Search in AstraDB:

    • Add a DS Astra DB component.

      • Enter your Astra DB Application Token.

      • Select the Database (e.g., video_embeddings) and Collection (e.g., video_embeddings) where your video clip embeddings are stored.

      • Connect the Embeddings output from the Twelve Labs Text Embeddings component to the Embedding Model input of the DS Astra DB component. This provides AstraDB with the query embedding.

      • Connect the output of the Chat Input component (the raw text query) to the Search Query input of the DS Astra DB component. Developer Note: While the embedding of the query is passed to Embedding Model, the Search Query input itself might be used by AstraDB for keyword filtering or other hybrid search strategies if configured, or simply to know what to perform the vector search against using the provided query embedding.

    • The DS Astra DB component will perform a similarity search in your collection, returning the Search Results which are the most relevant video clips (or their metadata, including video_id and index_id if stored from Part 1).

  3. Convert AstraDB Results for Pegasus:

    • Add a Convert AstraDB to Pegasus Input component.

    • Connect the Search Results output (the red dot, often carrying document metadata) from the DS Astra DB component to the AstraDB Results input of this converter component.

    • This utility component is crucial: it extracts the Index ID and Video ID from the AstraDB search results, which are necessary for the Pegasus component to identify which specific indexed video segment to "chat" with.

  4. Prepare Pegasus for Contextual Q&A:

    • Add a Twelve Labs Pegasus component.

      • Enter your Twelve Labs API Key.

      • Select the desired Pegasus Model (e.g., pegasus1.2 is the default).

    • Connect the Index ID output from the Convert AstraDB to Pegasus Input component to the Index ID input (labeled "Receiving input") on the Twelve Labs Pegasus component.

    • Connect the Video ID output from the Convert AstraDB to Pegasus Input component to the Pegasus Video ID input (labeled "Receiving input") on the Twelve Labs Pegasus component.

    • Developer Note: Unlike the "Pegasus Chat with Video" flow where video data might be directly provided for on-the-fly indexing, here we are providing specific Pegasus Video ID and Index ID for already indexed content retrieved from AstraDB. The Video Data input on the Pegasus component remains unconnected in this RAG retrieval flow.

  5. Pass Original Query to Pegasus:

    • Connect the output of the original Chat Input component (the one holding the user's question) directly to the Prompt input (labeled "Receiving input") of the Twelve Labs Pegasus component. This ensures Pegasus knows what question to answer using the context of the retrieved video segment.

  6. Display Pegasus's Response:

    • Add a Chat Output component.

    • Connect the Message output from the Twelve Labs Pegasus component to the input of this Chat Output component. This will display the AI-generated answer.

  7. Test Your Video RAG System:

    • Open the Langflow playground (chat interface).

    • Ask a question related to the content of the videos you processed in Part 1 (e.g., "What happens to the bunny at the beginning?", "Show me scenes with butterflies.").

    • The flow will execute as follows:

      1. Your question is captured by Chat Input.

      2. Twelve Labs Text Embeddings converts your question into a vector.

      3. DS Astra DB uses this vector to find the most semantically similar video clip(s) from your database.

      4. Convert AstraDB to Pegasus Input extracts the video_id and index_id of the top retrieved clip(s).

      5. Twelve Labs Pegasus receives these IDs (telling it which video segment to focus on) and your original question (telling it what to answer about that segment).

      6. Pegasus analyzes the specified video segment in relation to your question and generates an answer.

      7. The answer is displayed via Chat Output.

This two-part RAG architecture allows you to build sophisticated, scalable video understanding applications where users can conversationally retrieve and interact with relevant moments from a large video library.



7 - Specialized Use Cases & Performance Best Practices

TwelveLabs and Langflow’s combined video capabilities unlock a range of high-value business applications. You can build content moderation systems that automatically flag or summarize inappropriate video segments, video search engines that surface relevant moments from large archives, or even multimodal assistants that blend video, text, and image understanding for richer user experiences. These use cases are ideal for industries like media, education, and customer support, where quick access to video insights and automated analysis directly drive productivity, compliance, and user engagement.

To ensure your video-enabled workflows deliver reliable business results, focus on optimizing performance and cost. Manage API usage by batching video segments or using caching to avoid redundant processing, and choose clip durations that balance detail with computational efficiency. Monitor your AstraDB and embedding pipelines to maintain fast query responses as your video library grows, and consider hybrid search strategies that combine vector similarity with metadata filtering for more precise results.
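
As one concrete example of avoiding redundant processing, you could cache the Pegasus video_id and index_id per file hash so a video is never re-indexed, a trick that pairs well with the Developer Tip from Section 4. This is an illustrative pattern, not part of the integration itself, and the cache file name and structure are arbitrary choices.

```python
# Remember the Pegasus IDs per file hash so a video is only ever indexed once.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("indexed_videos.json")

def file_sha256(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def get_cached_ids(video_path: str) -> dict | None:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    return cache.get(file_sha256(video_path))

def remember_ids(video_path: str, video_id: str, index_id: str) -> None:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    cache[file_sha256(video_path)] = {"video_id": video_id, "index_id": index_id}
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```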

By following these best practices, you maximize the scalability, accuracy, and cost-effectiveness of your video AI solutions—helping your organization extract actionable insights from video at scale and maintain a competitive edge in the market.



8 - Conclusion and Next Steps

Congratulations, video wizard! 🎬✨

You’ve just unlocked the power to build AI agents that truly see and understand video, using the dynamic duo of TwelveLabs and Langflow. Whether you’re creating the next smart video search engine, moderating content at scale, or crafting multimodal assistants, you’re now equipped with the tools to turn raw footage into actionable insights—and maybe even a little magic.

So what’s next?

Try experimenting with your own creative flows, push the limits with different video types, or share your coolest projects with the community. Keep an eye on the TwelveLabs and Langflow roadmaps for exciting new features, and don’t forget to join the conversation—your ideas and feedback help shape the future of video AI!

Go forth and build something amazing. The video universe is yours to explore! 🚀🎥


Additional Resources

We'd like to give a huge thanks to the team at Langflow/DataStax (Melissa Herrera, Eric Hare, Gokul Krishnaa, and Alejandro Cantarero) for the collaboration!



1 - Introduction and Overview

Welcome to the wild world of video AI, where your apps can finally “see” what’s happening in videos—just like you do! 🎬

This tutorial is your golden ticket to building next-level video-powered agents with TwelveLabs and Langflow. No need for a PhD in computer vision—just bring your curiosity and a few video clips to the party.


Why should you care?

TwelveLabs brings the brains: our Pegasus and Marengo models let your code understand, index, and chat with video content—answering questions, finding key moments, and even generating smart summaries. Langflow is your playground: a visual builder that turns your ideas into AI workflows and instantly serves them as APIs. Together, they make it easy to create, test, and deploy video AI solutions—whether you’re building a video search engine, a content moderation bot, or a multimodal assistant that can handle text, images, and videos.


What’s inside?

In this tutorial, you’ll go from zero to video hero. You’ll set up your environment, master the TwelveLabs components in Langflow, and build real-world flows—like chatting with videos, generating and storing video embeddings, and even crafting a full-blown RAG system that retrieves and answers questions from your video library. Along the way, you’ll pick up best practices, discover killer use cases, and get inspired to build your own video-powered apps.

So grab some popcorn, fire up your code editor, and let’s make your AI workflows truly video-smart! 🚀🍿



2 - Setting Up Your Development Environment

Let's prepare your environment to start building with TwelveLabs and Langflow.



Installing Langflow
  1. Install Langflow using pip: pip install langflow

  2. Start the Langflow server: langflow run

  3. Access the Langflow UI by navigating to http://127.0.0.1:7860 in your browser.


Obtaining TwelveLabs API Credentials
  1. Create an account on the TwelveLabs platform if you don't already have one.

  2. Navigate to your profile settings and create a new API key.

  3. Make note of your API key as you'll need it to authenticate requests to TwelveLabs services.



Verifying TwelveLabs Components
  1. In the Langflow interface, create a new flow by clicking the "+" button.

  2. Open the component sidebar and search for "TwelveLabs". You should see the following components:

    • Video File

    • Split Video

    • Twelve Labs Pegasus Index Video

    • Twelve Labs Pegasus

    • Twelve Labs Text Embeddings

    • Twelve Labs Video Embeddings

    • Convert AstraDB to Pegasus Input

  3. If you don't see these components, ensure Langflow is updated to the latest version.


Preparing Sample Videos
  1. Select 2-3 short video clips (1-3 minutes each) for testing during this tutorial. Common formats like MP4, MOV, and AVI are supported.

  2. For optimal performance during learning, use videos with clear visual content and distinct scenes. This helps demonstrate the video understanding capabilities more effectively.

  3. Create a dedicated folder for your sample videos so they're easy to locate during the tutorial.



Troubleshooting Tips

If you encounter issues with the TwelveLabs components:

  1. API Authentication: Verify your API key is correctly entered and has not expired.

  2. Video Processing: Ensure your videos are in supported formats and aren't too large. Start with clips under 100MB for faster processing.

  3. Component Loading: If components don't appear, try restarting Langflow or checking the console for error messages.

  4. Dependencies: The integration requires FFmpeg for video processing. Install it using your system's package manager if not already available.

With your environment set up, you're ready to start building powerful video-enabled AI workflows with TwelveLabs and Langflow.



3 - Understanding TwelveLabs Components in Langflow

Langflow now includes seven powerful TwelveLabs components that enable advanced video understanding capabilities in your AI workflows. These components work together to process videos, generate embeddings, index video content, and enable natural language interactions with visual content.



3.1 - Video File Component

The Video File Component serves as your entry point for video processing workflows in Langflow. It handles a wide range of common video formats including MP4, AVI, MOV, and MKV, making it versatile for different video sources. Implementation is straightforward - simply provide a path to your video file, and the component returns a Data object containing both the file path and essential metadata. For optimal performance during development, we recommend starting with videos under 100MB.



3.2 - Split Video Component

This powerful component intelligently segments longer videos into smaller, more manageable clips. You can fine-tune the splitting process through several parameters: set your desired clip duration in seconds, choose how to handle the final clip (truncate, overlap with previous content, or keep as-is), and decide whether to retain the original video alongside the clips. The component outputs a collection of clips as Data objects, complete with detailed metadata. For best results in retrieval and understanding tasks, we recommend creating clips between 6 and 30 seconds in length.



3.3 - Twelve Labs Pegasus Index Video Component

At the heart of video indexing lies the Pegasus Index Video Component. It interfaces with TwelveLabs' Pegasus API to create searchable indexes of your video content. After providing your API key and video data (typically from the Split Video component), it generates indexed video data complete with unique video_id and index_id identifiers. This indexing step is crucial before performing any queries with the Pegasus component.



3.4 - Twelve Labs Pegasus Component

The Pegasus Component enables sophisticated natural language interaction with your indexed video content. Configuration requires your API key, the query text, and the relevant video and index identifiers. Acting as an intelligent reasoning layer, it processes both the query context and video content to generate conversational responses. This component truly bridges the gap between natural language understanding and video content analysis.



3.5 - Twelve Labs Text Embeddings Component

Leveraging TwelveLabs' Marengo model, this component generates vector embeddings from text input. It requires your API key and currently supports the Marengo-retrieval-2.7 model. The output is fully compatible with vector databases, making it perfect for sophisticated retrieval systems. For consistency in your applications, we recommend using the same embedding models across both text and video components.



3.6 - Twelve Labs Video Embeddings Component

Similar to its text counterpart, the Video Embeddings Component creates dense vector representations of video content. It shares the same configuration pattern as the text embeddings component but specializes in processing video files. The resulting embeddings enable powerful semantic similarity searches between videos or between text and video content, opening up exciting possibilities for multimodal applications.



3.7 - Convert AstraDB to Pegasus Input Component

This essential connector component bridges the gap between vector database operations and TwelveLabs components. It processes search results from AstraDB, extracting crucial video_id and index_id information, and formats them properly for the Pegasus component. This component becomes particularly valuable when implementing RAG (Retrieval Augmented Generation) patterns with video content, ensuring smooth data flow between your vector database and video processing pipeline.

These components are designed to work together seamlessly. For example, you can create a video search engine by:

  1. Loading videos with Video File

  2. Splitting them into digestible chunks with Split Video

  3. Indexing them with Pegasus Index Video

  4. Generating embeddings with Video Embeddings

  5. Storing in a vector database like AstraDB

  6. Retrieving relevant clips with Convert AstraDB to Pegasus Input

  7. Answering questions with Twelve Labs Pegasus

When building your workflows, think of these components as building blocks that handle different aspects of video understanding-from loading and processing to indexing, embedding, and retrieval.



4 - Pegasus Chat with Video: Your First Interactive Video Workflow

This section guides you through creating a fundamental Langflow workflow that allows you to "chat" with a video using the Twelve Labs Pegasus model. This flow demonstrates the direct integration of video processing, indexing, and natural language querying.

The "Pegasus Chat with Video" flow, as shown in the reference image, enables you to upload a video, have it automatically indexed by Twelve Labs, and then ask questions about its content through a chat interface.

Follow these steps to build this flow:

  1. Upload Your Video:

    • Drag a Video File component onto the Langflow canvas.

    • In the Video File component's properties, click to upload one of your sample video files (e.g., big_buck_bunny_720.mp4 as shown in the image). This component will load your video and make it available to other components.

  2. Connect Video Data to Pegasus:

    • Add a Twelve Labs Pegasus component to the canvas.

    • Connect the Data output (the red dot) from the Video File component to the Video Data input (the red dot) on the Twelve Labs Pegasus component. This tells the Pegasus component which video to process.

  3. Configure Pegasus Component:

    • In the Twelve Labs Pegasus component:

      • Enter your Twelve Labs API Key in the designated field.

      • Provide an Index Name. This can be any descriptive name you choose for the video index that will be created (e.g., "bunny-test-index").

      • The Model will default to a Pegasus model like pegasus1.2. You can typically leave this as is unless you have specific model requirements.

    • Developer Note: When you provide Video Data and an Index Name without an Index ID, the Pegasus component will automatically initiate the video indexing process with Twelve Labs for the new video. The video_id and index_id will be generated by Twelve Labs during this process.

  4. Enable User Input:

    • Add a Chat Input component to the canvas.

    • Connect the output of the Chat Input component to the Prompt input (the blue dot, labeled "Receiving input" in the image) of the Twelve Labs Pegasus component. This allows you to type questions that will be sent to the Pegasus model.

  5. Display Chat Responses:

    • Add a Chat Output component.

    • Connect the Message output (the blue dot) from the Twelve Labs Pegasus component to the input of this Chat Output component1. This will display the answers from the Pegasus model.

  6. Optional: Output Video ID for Re-chatting:

    • To avoid re-indexing the same video for subsequent chat sessions, you can capture the Video ID and Index ID generated by Twelve Labs.

    • Add another Chat Output component.

    • Connect the Video ID output (the blue dot) from the Twelve Labs Pegasus component to this new Chat Output component1.

    • Developer Tip: In future sessions with the same video, you can directly input the previously generated Video ID and Index ID into the Pegasus Video ID and Index ID fields of the Twelve Labs Pegasus component, respectively. This bypasses the re-indexing step, making interactions faster. If you only provide the Video Data and Index Name, it will re-index.

  7. Test in the Playground:

    • Open the Langflow playground (usually by clicking the chat bubble icon on the right sidebar).

    • Type a question about your video into the chat interface (e.g., "What animal is in the video?", "Describe the main character.").

    • The Pegasus component will process your video (if it's the first time), index it, and then answer your question based on the video's content. The response will appear in the connected Chat Output.

This simple yet powerful flow serves as a foundational example of how to integrate TwelveLabs' video understanding directly into an interactive Langflow agent. You can now ask questions and get answers about the content of your videos in near real-time.



5 - Video Embeddings with Marengo and AstraDB: Building Your Semantic Video Index

This section details how to construct a Langflow workflow to generate video embeddings using TwelveLabs' Marengo model and subsequently store these embeddings in DataStax AstraDB. This is a crucial step for enabling semantic search over your video library or for building advanced Retrieval Augmented Generation (RAG) systems with video content.

The workflow, as illustrated in the provided reference image, ingests a video, generates its embeddings, and then stores these embeddings along with video metadata in an AstraDB collection.


5.1 - Prerequisites: Setting up AstraDB

Before building the Langflow flow, ensure your AstraDB environment is ready:

  1. Create an AstraDB Database:

    • If you don't have one, sign up and create a new serverless database on DataStax AstraDB.

    • Follow the AstraDB setup guides for initial database creation.

  2. Generate an Application Token:

    • Navigate to your Database Details section in the AstraDB console.

    • Generate an Application Token with appropriate permissions (e.g., Database Administrator or a role that allows read/write to your collections). Securely store this token, as you'll need it for the Langflow component.

  3. Create a video_embeddings Collection:

    • Within your AstraDB database, create a new collection. You might name it video_embeddings or similar.

    • Crucially, configure this collection to support vector search. This typically involves:

      • Specifying that the collection will store vectors.

      • Setting the vector dimension to 1024, which is the dimension for embeddings generated by TwelveLabs' Marengo models (like Marengo-retrieval-2.7 shown in the image1).

    • Refer to the AstraDB documentation on collection management for detailed steps on enabling vector search and setting dimensions.



5.2 - Building the Langflow Flow

Once AstraDB is set up, construct the following flow in Langflow:

  1. Upload Your Video:

    • Drag a Video File component onto the canvas.

    • Configure it by uploading your desired video file (e.g., big_buck_bunny_720.mp4 as seen in the image1). This component makes the video data accessible to the workflow.

  2. Configure Video Embeddings Generation:

    • Add a Twelve Labs Video Embeddings component to the canvas.

    • Enter your Twelve Labs API Key in the API Key field.

    • Ensure the Model is set to Marengo-retrieval-2.7 (or another compatible Marengo model). This component will process the video and generate dense vector embeddings.

  3. Configure AstraDB for Ingestion:

    • Add a DS Astra DB component (or Astra DB as shown in the image1) to the canvas.

    • Input your Astra DB Application Token into the corresponding field.

    • Specify the Database name where you created your vector-enabled collection.

    • Enter the name of your Collection (e.g., video_embeddings).

  4. Connect Embeddings to AstraDB:

    • Connect the Embeddings output (the green dot) from the Twelve Labs Video Embeddings component to the Embedding Model input (the green dot) on the DS Astra DB component1.

    • Developer Note: This connection tells AstraDB how to interpret the incoming data as embeddings, but it doesn't provide the embeddings themselves. The actual embeddings come with the data to be ingested.

  5. Connect Video Data for Ingestion:

    • Connect the Data output (the red dot) from the Video File component to the Ingest Data input (the red dot) on the DS Astra DB component1.

    • Developer Note: When data is passed to the Ingest Data input, the Astra DB component, if configured with an embedding model or connected to an embedding component as in this flow, will expect or generate embeddings for the incoming data. In this setup, the Twelve Labs Video Embeddings component handles the embedding generation, and its output (which includes the embedding along with original data) is passed to DS Astra DB.

  6. Run the Flow to Embed and Ingest:

    • To trigger the process, you typically "run" or activate the final component in this chain, which is the DS Astra DB component. In Langflow, this might involve clicking a play/run button on the component or initiating the flow if it's part of a larger executable graph.

    • Upon execution, the Video File component loads the video. The Twelve Labs Video Embeddings component then receives this video data (often implicitly through the flow or explicitly if the Video Data input is connected, though not shown for this component in the screenshot), generates the embeddings. Finally, the DS Astra DB component takes the video data (now enriched with embeddings by the upstream component or by its own internal logic if the embedding component is connected to its Embedding Model input) and ingests it into the specified AstraDB collection.

After running this flow, your video's content will be represented as vector embeddings within your AstraDB collection, ready for semantic search, similarity comparisons, and as a knowledge base for RAG applications. You can verify this by querying your AstraDB collection directly.



6 - Advanced RAG Implementation with Video Content

Building a robust Retrieval Augmented Generation (RAG) system for video involves several stages, from processing and indexing your video content to enabling semantic search and contextual understanding. This section is broken into two parts. First, we'll focus on preparing your video data by splitting it, indexing the clips with Twelve Labs Pegasus, generating embeddings with Marengo, and storing everything in AstraDB.


Part 1: Split Video, Index Clips with Pegasus, and Save Embeddings to Astra DB

This initial flow, illustrated in the provided image, is the foundation of your video RAG system. It processes a source video into manageable clips, enriches these clips with indexing information from Pegasus, generates corresponding vector embeddings using Marengo, and finally stores this comprehensive dataset in AstraDB for efficient retrieval.

Here's how to construct this ingestion pipeline:

  1. Upload Source Video:

    • Begin by dragging a Video File component onto your Langflow canvas.

    • Configure it by uploading your video (e.g., big_buck_bunny_720... as shown in the image). This component outputs the raw video data.

  2. Split Video into Clips:

    • Add a Split Video component.

    • Connect the Data output from the Video File component to the Video Data input of the Split Video component.

    • Configure the Split Video component:

      • Clip Duration (seconds): Set this to define the length of each video segment. The image shows 6 seconds, which is a good starting point for granular analysis. You can adjust this based on your content; for instance, 30 seconds might be suitable for broader scene segmentation.

      • Last Clip Handling: Choose an option like Overlap Previous (as in the image) to ensure consistent clip lengths, or select Truncate or Keep Short based on your preference.

      • Include Original Video: Typically, this is turned off (as in the image) when processing clips for RAG, as you'll be working with the segments.

  3. Index Clips with Pegasus:

    • Introduce a Twelve Labs Pegasus Index Video component.

    • Connect the Video Clips output from the Split Video component to the Video Data input of the Twelve Labs Pegasus Index Video component. This component will process each clip.

    • In the component's settings:

      • Enter your Twelve Labs API Key.

      • Specify an Index Name (e.g., test index in the image). This helps organize your indexed content within Twelve Labs. The component will output Indexed Data, which includes the original clip data along with the video_id and index_id assigned by Pegasus.

      • The Model will default to a Pegasus model (e.g., pegasus1.2 ).

  4. Generate Video Embeddings with Marengo:

    • Add a Twelve Labs Video Embeddings component.

    • Enter your Twelve Labs API Key here as well.

    • Set the Model to Marengo-retrieval-2.7 (as shown in the image) or your preferred Marengo embedding model.

    • Crucially, connect the Indexed Data output from the Twelve Labs Pegasus Index Video component to the (implicit or explicit) video data input of the Twelve Labs Video Embeddings component. Developer Note: While the image doesn't show a direct named "Video Data" input on this component, the embeddings component needs to receive the video clips to process. The Indexed Data carries these clips.

  5. Configure AstraDB for Storage:

    • Place a DS Astra DB component onto the canvas.

    • Enter your Astra DB Application Token.

    • Select your target Database (e.g., video_embeddings ) and Collection (e.g., video_embeddings ) where the video data and embeddings will be stored. Ensure this collection is configured for vector search with a dimension matching your Marengo model (1024 for Marengo-retrieval-2.7).

  6. Connect Embeddings and Data to AstraDB:

    • Connect the Embeddings output from the Twelve Labs Video Embeddings component to the Embedding Model input on the DS Astra DB component. This tells AstraDB how to interpret the embedding vectors.

    • Connect the Indexed Data output from the Twelve Labs Pegasus Index Video component to the Ingest Data input on the DS Astra DB component. This sends the video clips (now enriched with Pegasus video_id and index_id) along with their generated embeddings to be stored.

  7. Run the Ingestion Flow:

    • Execute the flow. The Video File is loaded, split by Split Video. Each clip is then sent to Twelve Labs Pegasus Index Video for indexing. The resulting Indexed Data (clips + Pegasus IDs) is passed to Twelve Labs Video Embeddings to generate Marengo embeddings. Finally, the DS Astra DB component ingests this comprehensive data: the clips, their Pegasus index information, and their Marengo vector embeddings into your specified collection.

After this flow completes, your AstraDB collection will contain indexed video clips, each associated with its vector embedding and Pegasus identifiers, ready for the retrieval part of your RAG system.



Part 2: Query Video Embeddings and Chat with Video using Pegasus

With your video clips processed, indexed by Pegasus, and their Marengo embeddings stored in AstraDB (as detailed in Part 1), this second flow demonstrates the retrieval and generation aspects of RAG. As shown in the reference image, a user's text query is first embedded, then used to search AstraDB for relevant video clips. The information from these clips is then passed to the Twelve Labs Pegasus model to generate a contextual answer.

Here's how to build this retrieval and question-answering pipeline:

  1. User Query Input and Embedding:

    • Add a Chat Input component to the canvas. This will be where the user types their question.

    • Add a Twelve Labs Text Embeddings component.

      • Enter your Twelve Labs API Key.

      • Set the Model to Marengo-retrieval-2.7 (or the same Marengo model used in Part 1 for embedding the video clips).

    • Connect the output of the Chat Input component to the text input of the Twelve Labs Text Embeddings component. This ensures the user's query is converted into a vector embedding.

  2. Semantic Search in AstraDB:

    • Add a DS Astra DB component.

      • Enter your Astra DB Application Token.

      • Select the Database (e.g., video_embeddings) and Collection (e.g., video_embeddings) where your video clip embeddings are stored.

      • Connect the Embeddings output from the Twelve Labs Text Embeddings component to the Embedding Model input of the DS Astra DB component. This provides AstraDB with the query embedding.

      • Connect the output of the Chat Input component (the raw text query) to the Search Query input of the DS Astra DB component. Developer Note: While the embedding of the query is passed to Embedding Model, the Search Query input itself might be used by AstraDB for keyword filtering or other hybrid search strategies if configured, or simply to know what to perform the vector search against using the provided query embedding.

    • The DS Astra DB component will perform a similarity search in your collection, returning the Search Results which are the most relevant video clips (or their metadata, including video_id and index_id if stored from Part 1).

  3. Convert AstraDB Results for Pegasus:

    • Add a Convert AstraDB to Pegasus Input component.

    • Connect the Search Results output (the red dot, often carrying document metadata) from the DS Astra DB component to the AstraDB Results input of this converter component.

    • This utility component is crucial: it extracts the Index ID and Video ID from the AstraDB search results, which are necessary for the Pegasus component to identify which specific indexed video segment to "chat" with.

  4. Prepare Pegasus for Contextual Q&A:

    • Add a Twelve Labs Pegasus component.

      • Enter your Twelve Labs API Key.

      • Select the desired Pegasus Model (pegasus1.2 is the default).

    • Connect the Index ID output from the Convert AstraDB to Pegasus Input component to the Index ID input (labeled "Receiving input") on the Twelve Labs Pegasus component.

    • Connect the Video ID output from the Convert AstraDB to Pegasus Input component to the Pegasus Video ID input (labeled "Receiving input") on the Twelve Labs Pegasus component.

    • Developer Note: Unlike the "Pegasus Chat with Video" flow where video data might be directly provided for on-the-fly indexing, here we are providing specific Pegasus Video ID and Index ID for already indexed content retrieved from AstraDB. The Video Data input on the Pegasus component remains unconnected in this RAG retrieval flow.

  5. Pass Original Query to Pegasus:

    • Connect the output of the original Chat Input component (the one holding the user's question) directly to the Prompt input (labeled "Receiving input") of the Twelve Labs Pegasus component. This ensures Pegasus knows what question to answer using the context of the retrieved video segment.

  6. Display Pegasus's Response:

    • Add a Chat Output component.

    • Connect the Message output from the Twelve Labs Pegasus component to the input of this Chat Output component. This will display the AI-generated answer.

  7. Test Your Video RAG System:

    • Open the Langflow playground (chat interface).

    • Ask a question related to the content of the videos you processed in Part 1 (e.g., "What happens to the bunny at the beginning?", "Show me scenes with butterflies.").

    • The flow will execute as follows:

      1. Your question is captured by Chat Input.

      2. Twelve Labs Text Embeddings converts your question into a vector.

      3. DS Astra DB uses this vector to find the most semantically similar video clip(s) from your database.

      4. Convert AstraDB to Pegasus Input extracts the video_id and index_id of the top retrieved clip(s).

      5. Twelve Labs Pegasus receives these IDs (telling it which video segment to focus on) and your original question (telling it what to answer about that segment).

      6. Pegasus analyzes the specified video segment in relation to your question and generates an answer.

      7. The answer is displayed via Chat Output.

This two-part RAG architecture allows you to build sophisticated, scalable video understanding applications where users can conversationally retrieve and interact with relevant moments from a large video library.
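If you want to reproduce the same retrieval-and-answer loop in code, here is a minimal sketch using the TwelveLabs Python SDK and astrapy. As with the ingestion sketch, the call signatures (embed.create, find_one with a $vector sort, generate.text) and the stored field names (video_id, index_id) are assumptions to verify against your SDK versions and the schema you used in Part 1.

```python
# Rough sketch of the Part 2 retrieval flow in plain Python.
# Call signatures and field names are assumptions; verify them locally.
import os
from twelvelabs import TwelveLabs
from astrapy import DataAPIClient

tl = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])
db = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"]).get_database_by_api_endpoint(
    os.environ["ASTRA_DB_API_ENDPOINT"]
)
collection = db.get_collection("video_embeddings")

question = "What happens to the bunny at the beginning?"

# 1. Embed the question with Marengo (mirrors Twelve Labs Text Embeddings).
res = tl.embed.create(model_name="Marengo-retrieval-2.7", text=question)
query_vector = res.text_embedding.segments[0].embeddings_float

# 2. Vector search in AstraDB (mirrors the DS Astra DB search step).
hit = collection.find_one(sort={"$vector": query_vector})

# 3. Ask Pegasus about the retrieved clip (mirrors Convert AstraDB to Pegasus
#    Input plus the Twelve Labs Pegasus component; this SDK call needs only
#    the video_id).
answer = tl.generate.text(video_id=hit["video_id"], prompt=question)
print(answer.data)
```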



7 - Specialized Use Cases & Performance Best Practices

TwelveLabs and Langflow’s combined video capabilities unlock a range of high-value business applications. You can build content moderation systems that automatically flag or summarize inappropriate video segments, video search engines that surface relevant moments from large archives, or even multimodal assistants that blend video, text, and image understanding for richer user experiences. These use cases are ideal for industries like media, education, and customer support, where quick access to video insights and automated analysis directly drive productivity, compliance, and user engagement.

To ensure your video-enabled workflows deliver reliable business results, focus on optimizing performance and cost. Manage API usage by batching video segments or using caching to avoid redundant processing, and choose clip durations that balance detail with computational efficiency. Monitor your AstraDB and embedding pipelines to maintain fast query responses as your video library grows, and consider hybrid search strategies that combine vector similarity with metadata filtering for more precise results.
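To make the hybrid-search suggestion concrete: with astrapy's Data API you can pair a metadata filter with the vector sort, so the similarity search only considers clips that match the filter. The sketch below reuses collection and query_vector from the retrieval example above, and the field names index_id and clip_path are assumptions about what you stored during ingestion.

```python
# Hybrid retrieval sketch: metadata filter plus vector similarity.
# Reuses `collection` and `query_vector` from the earlier retrieval sketch;
# field names are assumptions about your ingestion schema.
results = collection.find(
    filter={"index_id": "my-index-id"},   # hypothetical: limit to one Pegasus index
    sort={"$vector": query_vector},       # rank the filtered clips by similarity
    limit=5,
    include_similarity=True,              # also return each match's similarity score
)
for doc in results:
    print(doc["clip_path"], doc["$similarity"])
```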

By following these best practices, you maximize the scalability, accuracy, and cost-effectiveness of your video AI solutions—helping your organization extract actionable insights from video at scale and maintain a competitive edge in the market.



8 - Conclusion and Next Steps

Congratulations, video wizard! 🎬✨

You’ve just unlocked the power to build AI agents that truly see and understand video, using the dynamic duo of TwelveLabs and Langflow. Whether you’re creating the next smart video search engine, moderating content at scale, or crafting multimodal assistants, you’re now equipped with the tools to turn raw footage into actionable insights—and maybe even a little magic.

So what’s next?

Try experimenting with your own creative flows, push the limits with different video types, or share your coolest projects with the community. Keep an eye on the TwelveLabs and Langflow roadmaps for exciting new features, and don’t forget to join the conversation—your ideas and feedback help shape the future of video AI!

Go forth and build something amazing. The video universe is yours to explore! 🚀🎥


Additional Resources
