Author
James Le
Date Published
October 11, 2024
Tags
Applications
Developers
Embeddings
Vector Database
Embed API
API Tutorial
Partnership
Semantic Search

Introduction

This tutorial demonstrates how to build a semantic video search engine by integrating the Twelve Labs Embed API with ApertureDB. The Twelve Labs Embed API, powered by the state-of-the-art Marengo-2.6 video foundation model, generates multimodal embeddings that capture the essence of video content, including visual expressions, spoken words, and contextual information. When combined with ApertureDB, a graph-vector database optimized for computer vision and machine learning applications, these embeddings enable highly accurate and efficient semantic search.

By following this workflow, developers and data scientists will learn how to use the Twelve Labs Embed API to create rich video embeddings and insert them, along with the source videos, into ApertureDB for semantic search. This workflow showcases the potential for building sophisticated video analysis and retrieval systems, opening up new possibilities for content discovery, recommendation engines, and video-based applications across various industries.


Setup and Installations

In this section, we will install the necessary libraries and dependencies required for integrating the Twelve Labs Embed API with ApertureDB. This includes the installation of the requests library for making API calls, the aperturedb Python SDK for interacting with the ApertureDB service, and the twelvelabs Python SDK for accessing the Twelve Labs Embed API.

# Install necessary libraries and dependencies
!pip install requests
!pip install aperturedb
!pip install twelvelabs

# Import required modules
import requests
import aperturedb
import twelvelabs

After running the above code, you will have all the necessary libraries installed and ready for use in the subsequent steps of the notebook.


Configure API Key

In this section, we'll set up the credentials needed for both Twelve Labs and ApertureDB. We'll use Google Colab's userdata feature to securely store and retrieve them.

from google.colab import userdata

# Configure Twelve Labs
TL_API_KEY = userdata.get('TL_API_KEY')

# Configure ApertureDB
ADB_PASSWORD = userdata.get('ADB_PASSWORD')

# Verify that the keys are properly set
if not TL_API_KEY:
    raise ValueError("Twelve Labs API key not found. Please set it in Colab's Secrets.")
if not ADB_PASSWORD:
    raise ValueError("ApertureDB password not found. Please set it in Colab's Secrets.")

print("API keys successfully configured.")

To use this code, you need to set up your API keys in Google Colab's Secrets:

  1. Go to the left sidebar in your Colab notebook.
  2. Click on the "Key" icon to open the "Secrets" panel.
  3. Add two secrets:
    • Key: TL_API_KEY, Value: Your Twelve Labs API key
    • Key: ADB_PASSWORD, Value: Your ApertureDB password

This approach ensures that your API keys are securely stored and not exposed in your notebook. The code checks if the keys are properly set and raises an error if they're missing, helping you troubleshoot any configuration issues.
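
If you're running this workflow outside of Google Colab, a common alternative is to read the credentials from environment variables instead of Colab's Secrets. The snippet below is a minimal sketch of that approach, assuming you have already exported TL_API_KEY and ADB_PASSWORD in your shell; it is not part of the original Colab notebook.

# Alternative for non-Colab environments: read credentials from environment variables.
# Assumes TL_API_KEY and ADB_PASSWORD were exported in the shell beforehand.
import os

TL_API_KEY = os.environ.get("TL_API_KEY")
ADB_PASSWORD = os.environ.get("ADB_PASSWORD")

if not TL_API_KEY or not ADB_PASSWORD:
    raise ValueError("Set the TL_API_KEY and ADB_PASSWORD environment variables before running.")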


Generate Video Embeddings with Twelve Labs Embed API

In this section, we'll demonstrate how to generate video embeddings using the Twelve Labs Embed API. This API is powered by Marengo-2.6, Twelve Labs' state-of-the-art video foundation model designed for any-to-any retrieval tasks. Marengo-2.6 enables the creation of multimodal embeddings that capture the essence of video content, including visual expressions, body language, spoken words, and overall context.

Let's set up the Twelve Labs client and define a function to generate embeddings:

from twelvelabs import TwelveLabs
from twelvelabs.models.embed import EmbeddingsTask

# Initialize the Twelve Labs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)

def generate_embedding(video_url):
    # Create an embedding task
    task = twelvelabs_client.embed.task.create(
        engine_name="Marengo-retrieval-2.6",
        video_url=video_url
    )
    print(f"Created task: id={task.id} engine_name={task.engine_name} status={task.status}")

    # Define a callback function to monitor task progress
    def on_task_update(task: EmbeddingsTask):
        print(f"  Status={task.status}")

    # Wait for the task to complete
    status = task.wait_for_done(
        sleep_interval=2,
        callback=on_task_update
    )
    print(f"Embedding done: {status}")

    # Retrieve the task result
    task_result = twelvelabs_client.embed.task.retrieve(task.id)

    # Extract and return the embeddings
    embeddings = []
    for v in task_result.video_embeddings:
        embeddings.append({
            'embedding': v.embedding.float,
            'start_offset_sec': v.start_offset_sec,
            'end_offset_sec': v.end_offset_sec,
            'embedding_scope': v.embedding_scope
        })

    return embeddings, task_result

# Example usage
video_url = "https://storage.googleapis.com/ad-demos-datasets/videos/Ecommerce%20v2.5.mp4"

# Generate embeddings for the video
embeddings, task_result = generate_embedding(video_url)

print(f"Generated {len(embeddings)} embeddings for the video")
for i, emb in enumerate(embeddings):
    print(f"Embedding {i+1}:")
    print(f"  Scope: {emb['embedding_scope']}")
    print(f"  Time range: {emb['start_offset_sec']} - {emb['end_offset_sec']} seconds")
    print(f"  Embedding vector (first 5 values): {emb['embedding'][:5]}")
    print()
    

This code demonstrates the process of generating video embeddings using the Twelve Labs Embed API. Here's a breakdown of what's happening:

  1. We initialize the Twelve Labs client with our API key.
  2. The generate_embedding function creates an embedding task using the Marengo-2.6 engine, which is optimized for video understanding and retrieval tasks.
  3. We monitor the task progress and wait for it to complete.
  4. Once the task is done, we retrieve the results and extract the embeddings.
  5. The embeddings are returned along with metadata such as the time range they correspond to and their scope.

The Marengo-2.6 model used by the Embed API offers several advantages:

  • Multimodal understanding: It captures the interplay between visual, audio, and textual elements in the video.
  • Temporal awareness: Unlike static image models, Marengo-2.6 accounts for motion, action, and temporal information in videos.
  • Flexible segmentation: The API can generate embeddings for different segments of a video or a single embedding for the entire video.
  • State-of-the-art performance: It provides more accurate and temporally coherent interpretations of video content compared to traditional approaches.

By using this API, developers can easily create contextual vector representations of their videos, enabling advanced search and analysis capabilities in their applications.
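
Before inserting the embeddings into a database, it can be useful to sanity-check that the vectors behave as expected. The short sketch below is illustrative only (it assumes NumPy is installed) and simply computes the cosine similarity between the first two clip embeddings returned above; semantically similar clips should score closer to 1.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if len(embeddings) >= 2:
    sim = cosine_similarity(embeddings[0]['embedding'], embeddings[1]['embedding'])
    print(f"Cosine similarity between clip 1 and clip 2: {sim:.4f}")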


Insert Embeddings into ApertureDB

ApertureDB allows us to store and manage multimodal data, including videos, clips, and their associated embeddings. In this section, we demonstrate how to insert the video embeddings generated by the Twelve Labs Embed API into ApertureDB. This step is crucial for enabling efficient semantic video search.

from typing import List
from aperturedb.DataModels import VideoDataModel, ClipDataModel, DescriptorDataModel, DescriptorSetDataModel
from aperturedb.CommonLibrary import create_connector, execute_query
from aperturedb.Query import generate_add_query
from aperturedb.Query import RangeType
from aperturedb.Connector import Connector
import json

# Define data models for the association of Video, Video Clips, and Embeddings
class ClipEmbeddingModel(ClipDataModel):
    embedding: DescriptorDataModel

class VideoClipsModel(VideoDataModel):
    title: str
    description: str
    clips: List[ClipEmbeddingModel] = []

def create_video_object_with_clips(URL: str, clips, collection):
    video = VideoClipsModel(url=URL, title="Ecommerce v2.5",
                            description="Ecommerce v2.5 video with clips by Marengo26")
    for clip in clips:
        video.clips.append(ClipEmbeddingModel(
            range_type=RangeType.TIME,
            start=clip['start_offset_sec'],
            stop=clip['end_offset_sec'],
            embedding=DescriptorDataModel(
                vector=clip['embedding'], set=collection)
        ))
    return video

video_url = "https://storage.googleapis.com/ad-demos-datasets/videos/Ecommerce%20v2.5.mp4"
clips = embeddings

# Instantiate an ApertureDB client
aperturedb_client = Connector(
    host="workshop.datasets.gcp.cloud.aperturedata.io",
    user="admin",
    password=ADB_PASSWORD
)

# Create a descriptor set (collection)
collection = DescriptorSetDataModel(
    name="marengo26", dimensions=len(clips[0]['embedding']))
q, blobs, c = generate_add_query(collection)
result, response, blobs = execute_query(query=q, blobs=blobs, client=aperturedb_client)
print(f"Descriptor set creation: {result=}, {response=}")

# Create and insert the video object with clips and embeddings
video = create_video_object_with_clips(video_url, clips, collection)
q, blobs, c = generate_add_query(video)
result, response, blobs = execute_query(query=q, blobs=blobs, client=aperturedb_client)
print(f"Video insertion: {result=}, {response=}")

This code demonstrates the process of inserting the video embeddings into ApertureDB:

  1. We define custom data models (ClipEmbeddingModel and VideoClipsModel) to represent the structure of our video data, including clips and their associated embeddings.
  2. The create_video_object_with_clips function creates a video object with its associated clips and embeddings, which, in ApertureDB, gets represented in a connected graph schema as shown in the figure below.
  3. We instantiate an ApertureDB client using the provided credentials.
  4. A descriptor set (collection) is created to store the embeddings. This set defines the search space for our embeddings.
  5. Finally, we create a video object that includes all the clips and their corresponding embeddings, and insert it into ApertureDB.

The graph schema created in ApertureDB after adding the data as described above


By structuring the data this way, we maintain the temporal information of each embedding within the video, allowing for more precise semantic search capabilities. Each clip is associated with its specific time range and embedding, enabling granular search and retrieval of video segments.
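
As an optional sanity check (not part of the original walkthrough), you can count the descriptors stored in the new set before moving on to search. The sketch below assumes ApertureDB's FindDescriptor query command and reuses the execute_query helper imported above, passing an empty blobs list.

# Optional sanity check: count the descriptors stored in the "marengo26" set
count_query = [{
    "FindDescriptor": {
        "set": "marengo26",
        "results": {"count": True}
    }
}]
result, response, _ = execute_query(query=count_query, blobs=[], client=aperturedb_client)
print(f"Descriptor count response: {response=}")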


Perform Semantic Video Search

In this section, we'll demonstrate how to perform a semantic video search using the embeddings stored in ApertureDB. We'll use a text query to find relevant video clips based on their semantic similarity.

import struct
from aperturedb.Descriptors import Descriptors
from aperturedb.Query import ObjectType
from aperturedb.NotebookHelpers import display_video_mp4
from IPython.display import display

# Generate a text embedding for our search query
text_embedding = twelvelabs_client.embed.create(
  engine_name="Marengo-retrieval-2.6",
  text="Show me the part which has lot of outfits being displayed",
  text_truncate="none"
)

print("Created a text embedding")
print(f" Engine: {text_embedding.engine_name}")
print(f" Embedding: {text_embedding.text_embedding.float[:5]}...")  # Display first 5 values

# Define the descriptor set we'll search in
descriptorset = "marengo26"

# Find similar descriptors to the text embedding
descriptors = Descriptors(aperturedb_client)
descriptors.find_similar(
  descriptorset,
  text_embedding.text_embedding.float,
  k_neighbors=3,
  distances=True
)

# Find connected clips to the descriptors
clip_descriptors = descriptors.get_connected_entities(ObjectType.CLIP)

print(f"Found {len(clip_descriptors)} relevant clips")

This code performs the following steps:

  1. Generates a text embedding for our search query using the Twelve Labs Embed API.
  2. Uses ApertureDB's Descriptors class to find similar embeddings in the "marengo26" descriptor set.
  3. Retrieves the video clips connected to the most similar embeddings.
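
For repeated queries, it can be convenient to wrap these steps in a small helper. The function below is a sketch of our own (its name and parameters are not from the original notebook) that simply reuses the calls shown above: embed.create, find_similar, and get_connected_entities.

def search_video_clips(query_text, k_neighbors=3, descriptorset="marengo26"):
    # 1. Embed the text query with the same Marengo engine used for the video
    text_embedding = twelvelabs_client.embed.create(
        engine_name="Marengo-retrieval-2.6",
        text=query_text,
        text_truncate="none"
    )

    # 2. Find the k most similar clip embeddings in ApertureDB
    descriptors = Descriptors(aperturedb_client)
    descriptors.find_similar(
        descriptorset,
        text_embedding.text_embedding.float,
        k_neighbors=k_neighbors,
        distances=True
    )

    # 3. Return the clips connected to those embeddings
    return descriptors.get_connected_entities(ObjectType.CLIP)

# Example usage
results = search_video_clips("Show me the part which has a lot of outfits being displayed")
print(f"Found {len(results)} relevant clips")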


Display Results

Now, we'll display the search results, showing the metadata of the clips and the corresponding video segments.


# Show the metadata of the clips and the corresponding video segments
for i, clips in enumerate(clip_descriptors, 1):
    print(f"\nResult {i}:")
    for clip in clips:
        print(f"Clip metadata:")
        print(f"  Start time: {clip.start} seconds")
        print(f"  End time: {clip.stop} seconds")
        print(f"  Video URL: {clip.url}")
        
        # Display the video clip
        print("Displaying video clip:")
        display_video_mp4(clips.get_blob(clip))
        print("\n" + "-"*50 + "\n")

This code:

  1. Iterates through the found clips.
  2. Displays metadata for each clip, including start and end times, and the video URL.
  3. Uses the display_video_mp4 function to show the actual video segment corresponding to each clip.
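
If you would rather hand the results to another component (for example, a web frontend) than render them inline, you can collect the same metadata into plain dictionaries. The snippet below is a small illustrative sketch that uses only the clip fields shown above.

import json

# Collect the search results as plain dictionaries for downstream use
search_results = []
for clips in clip_descriptors:
    for clip in clips:
        search_results.append({
            "start_sec": clip.start,
            "stop_sec": clip.stop,
            "video_url": clip.url
        })

print(json.dumps(search_results, indent=2))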


Conclusion

This tutorial has demonstrated the powerful integration of Twelve Labs' Embed API with ApertureDB for advanced semantic video search. By leveraging Twelve Labs' state-of-the-art Marengo-2.6 video foundation model, we've shown how to generate rich, multimodal embeddings that capture the essence of video content. These embeddings, when combined with ApertureDB's efficient multimodal data storage and vector search capabilities, enable highly accurate and context-aware video search functionalities.

The workflow presented here opens up new possibilities for content discovery, recommendation engines, and video-based applications across various industries. By understanding the semantic content of videos, developers can create more intuitive and powerful search experiences that go beyond simple keyword matching.


Next Steps

To further enhance this semantic video search system, consider exploring the following avenues:

  1. Optimize embedding generation: Experiment with different video segmentation strategies to balance granularity and performance.
  2. Enhance search capabilities: Implement more advanced query techniques, such as multimodal queries combining text, image, and audio inputs. Twelve Labs Embed API will soon support image and audio embedding generation. ApertureDB already supports representation of images and multimodal embeddings.
  3. Scale the system: Test the performance with larger video datasets and optimize for high-volume queries.
  4. Implement user feedback: Incorporate user interactions to refine search results over time.
  5. Explore additional use cases: Adapt this workflow for specific applications like content moderation, video summarization, or personalized recommendations.

By building upon this foundation, developers can create sophisticated video analysis and retrieval systems that push the boundaries of what's possible in video understanding and search technology.


Appendix

For your reference and further exploration:

  1. Complete Colab Notebook
  2. Twelve Labs API documentation
  3. ApertureDB documentation

We'd love to see what you build! Share your projects and experiences with the Twelve Labs and ApertureDB communities. Happy coding!
