Partnership


Building Cross-Modal Video Search with TwelveLabs and Pixeltable

James Le


This tutorial demonstrates how to build a production-ready cross-modal video search system by combining TwelveLabs' Marengo 3.0 multimodal embeddings with Pixeltable's declarative data infrastructure. You'll learn to create a unified semantic space where text, images, audio, and video can all be used interchangeably as search queries, enabling powerful capabilities like finding video clips from text descriptions, locating visually similar content from reference images, or discovering segments with matching audio characteristics. By the end, you'll have a working system that handles embedding computation, indexing, and incremental updates automatically with minimal code.



February 4, 2026


5 Minutes



Huge thanks to the Pixeltable team (Alison Hill, Pierre Brunelle, Marcel Kornacker, and Aaron Siegel) for collaborating with me on this integration!


Introduction

Building intelligent video applications presents a fundamental challenge: videos contain multiple modalities—visual content, audio, and text—that traditional search systems struggle to query seamlessly. Developers typically need separate models for each modality, complex bridging logic, and significant engineering effort to make cross-modal search work.

TwelveLabs' multimodal embeddings solve this by projecting text, images, audio, and video into a unified semantic space. This means you can search your video library using any modality: find clips using text descriptions ("person giving a speech"), locate videos visually similar to a reference image, discover content with matching audio characteristics, or identify similar video segments—all using the same embedding model.

In this tutorial, you'll build a complete cross-modal video search system combining TwelveLabs' Marengo 3.0 model with Pixeltable's declarative data infrastructure for multimodal AI. Pixeltable handles the complexity of embedding computation, indexing, and incremental updates automatically, letting you focus on building features rather than managing infrastructure. By the end, you'll have a working system that demonstrates true multimodal video understanding with minimal code.


Prerequisites and Setup

Before building your cross-modal search system, ensure you have Python 3.8 or later installed. You'll need a TwelveLabs API key, which you can obtain by signing up at playground.twelvelabs.io.

Start by installing the required packages:

pip install -qU pixeltable twelvelabs

Next, configure your TwelveLabs API key securely. You can set it as an environment variable or provide it interactively:

import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')

Initialize Pixeltable and create a dedicated directory for this project:

import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our video search system
pxt.drop_dir('video_search', force=True)
pxt.create_dir('video_search')

Pixeltable automatically handles data persistence, versioning, and metadata tracking in this directory.

Important: Twelve Labs requires audio and video content to be at least 4 seconds long for optimal embedding quality. Keep this constraint in mind when preparing your video content.


Core Concept: Multimodal Embeddings

Traditional video search requires separate models for each modality—one for visual content, another for audio, yet another for text—plus complex logic to bridge between them. TwelveLabs' Marengo model takes a fundamentally different approach: it creates a unified semantic space where all modalities coexist.

In this shared space, a video clip of someone speaking, the transcript of that speech, a still frame from the video, and a text description all map to nearby points. This enables true cross-modal search: query with any modality and retrieve relevant content from any modality.

Here are the cross-modal search capabilities you'll build:

Query Type    | Use Case
Text → Video  | Find clips matching "person giving a speech"
Image → Video | Locate videos visually similar to a reference photo
Audio → Video | Discover content with similar audio characteristics
Video → Video | Identify similar clips or alternative takes

This unified embedding space is what makes TwelveLabs' integration with Pixeltable so powerful—you build once and search across all modalities seamlessly.


Building the Video Search System

Creating a searchable video index with Pixeltable and TwelveLabs requires just a few declarative steps. Start by defining a table to store your videos:

from pixeltable.functions.video import splitter

# Create a table for videos
videos = pxt.create_table('video_search.videos', {'video': pxt.Video})

# Insert a sample video
video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4'
videos.insert([{'video': video_url}])

Next, create a view that segments videos into searchable chunks. This is where Pixeltable's iterator functionality shines:
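The exact splitter signature may vary between Pixeltable releases; the sketch below assumes it accepts the video column along with target and minimum segment durations (the parameter names duration and min_duration are illustrative):

# Create a view of ~5-second segments (argument names are assumptions;
# check the Pixeltable docs for the exact splitter signature)
video_chunks = pxt.create_view(
    'video_search.video_chunks',
    videos,
    iterator=splitter(
        video=videos.video,
        duration=5.0,
        min_duration=4.0
    )
)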

The splitter iterator automatically breaks each video into 5-second segments while ensuring no segment falls below TwelveLabs' 4-second minimum requirement. Each row in this view represents one video segment.

Now comes the critical step—adding the embedding index:

# Add embedding index for cross-modal search
video_chunks.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

This single line triggers powerful automation: Pixeltable computes TwelveLabs embeddings for every video segment and maintains the index incrementally. When you add new videos, embeddings are computed automatically—no manual orchestration required.

The beauty of this declarative approach is that you specify what to compute (embeddings of video segments), not how to process them. Pixeltable handles batching, error recovery, and caching behind the scenes.
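As a quick illustration of those incremental updates (the URL below is a hypothetical placeholder):

# Newly inserted videos are segmented, embedded, and indexed automatically
videos.insert([{'video': 'https://example.com/new_clip.mp4'}])  # hypothetical URL
# The new segments are immediately searchable through video_chunks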


Cross-Modal Search Examples

With your video index built, you can now search using any modality. The key API to understand is the similarity() method, which you call on the video_segment column to produce a reusable similarity expression.


Text-to-Video Search

First, define a similarity expression against a text query:

# Define similarity to a text query
sim = video_chunks.video_segment.similarity(string='person giving a speech')

# Get the top 3 matching segments
results = video_chunks.order_by(sim, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim
)

results.collect()

Here sim is an expression representing “similarity between each segment and this text string,” which you can use in order_by, select, or additional filters.
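For example, here is a minimal sketch that reuses the same expression as a threshold filter (the 0.6 cutoff is illustrative; see the best-practices section for tuning guidance):

# Keep only segments above an illustrative similarity threshold
filtered = (
    video_chunks
    .where(sim > 0.6)
    .order_by(sim, asc=False)
    .select(video_chunks.video_segment, score=sim)
)

filtered.collect()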


Image-to-Video Search

Use the same pattern for an image query:

# Load a reference image
reference_image = pxt.Image('path/to/reference_image.jpg')

# Define similarity to the image
sim_image = video_chunks.video_segment.similarity(image=reference_image)

# Top 3 visually similar segments
results = video_chunks.order_by(sim_image, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim_image
)

results.collect()

This finds video segments whose visual content is closest to the reference image.


Audio-to-Video Search

You can also search using an audio clip:

# Load a reference audio sample
reference_audio = pxt.Audio('path/to/reference_audio.mp3')

# Define similarity to the audio
sim_audio = video_chunks.video_segment.similarity(audio=reference_audio)

# Top 3 segments with similar audio characteristics
results = video_chunks.order_by(sim_audio, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim_audio
)

results.collect()

This is useful for matching background music, speaking style, or other acoustic patterns.


Video-to-Video Search

Finally, use a video (or segment) as the query to find similar clips:

# Use an existing segment as the query
query_segment = (
    video_chunks
        .select(video_chunks.video_segment)
        .limit(1)
        .collect()[0]['video_segment']
)

# Define similarity to the query segment
sim_video = video_chunks.video_segment.similarity(video=query_segment)

# Top 5 similar segments across the library
results = video_chunks.order_by(sim_video, asc=False).limit(5).select(
    video_chunks.video_segment,
    score=sim_video
)

results.collect()

This enables content recommendation systems, duplicate detection, or finding alternative takes of similar scenes.

In all four cases you follow the same pattern: define a similarity() expression on the video_segment column, then reuse that expression in order_by and select to retrieve the best matches for your text, image, audio, or video query.
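If you find yourself repeating this pattern, you can wrap it in a small helper. This is just a convenience sketch over the calls shown above, not part of the Pixeltable API:

def top_matches(sim_expr, k=3):
    # Return the top-k segments for any similarity expression,
    # regardless of the query modality it was built from.
    return (
        video_chunks
        .order_by(sim_expr, asc=False)
        .limit(k)
        .select(video_chunks.video_segment, score=sim_expr)
        .collect()
    )

# Works the same way for text, image, audio, or video queries:
top_matches(video_chunks.video_segment.similarity(string='person giving a speech'))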


Beyond Video: Multimodal Data Types

Twelve Labs embeddings extend beyond video to create truly multimodal applications. You can embed text documents, standalone images, and audio files using the same Marengo model, enabling search across heterogeneous data types.

Create a product catalog table with multiple modalities:

# Create a product catalog with text and images
products = pxt.create_table(
    'video_search.products',
    {
        'title': pxt.String,
        'description': pxt.String,
        'thumbnail': pxt.Image
    }
)

# Add computed column combining title and description
products.add_computed_column(
    text_content=products.title + '. ' + products.description,
    if_exists='replace'
)

# Add embedding indices for both text and images
products.add_embedding_index(
    'text_content',
    embedding=embed.using(model_name='marengo3.0')
)

products.add_embedding_index(
    'thumbnail',
    embedding=embed.using(model_name='marengo3.0')
)

Now perform cross-modal catalog searches:

# Search products using text
sim_text = products.text_content.similarity(string='outdoor hiking gear')
text_results = (
    products
    .where(sim_text > 0.6)
    .order_by(sim_text, asc=False)
    .limit(5)
    .select(
        products.title,
        products.thumbnail,
        score=sim_text
    )
)
text_results.collect()

# Search products using an image
query_image = pxt.Image('path/to/query.jpg')
sim_thumb = products.thumbnail.similarity(image=query_image)
image_results = (
    products
    .where(sim_thumb > 0.7)
    .order_by(sim_thumb, asc=False)
    .limit(5)
    .select(
        products.title,
        products.description,
        score=sim_thumb
    )
)
image_results.collect()

You can even search your product catalog using video clips, or find products related to audio samples. This unified approach means a single embedding model powers search across your entire application—text documents, images, audio files, videos, and any combination thereof.
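For instance, assuming similarity() accepts a video argument on an image-indexed column the same way it does on video_segment above, a video-clip query might look like this (the clip path is hypothetical):

# Find products whose thumbnails best match a short video clip (path is hypothetical)
query_clip = pxt.Video('path/to/query_clip.mp4')
sim_clip = products.thumbnail.similarity(video=query_clip)

products.order_by(sim_clip, asc=False).limit(5).select(
    products.title,
    score=sim_clip
).collect()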

The same pattern applies to other use cases: search documentation using screenshots, find audio clips matching text queries, or locate images similar to video frames. Twelve Labs' multimodal embeddings eliminate the traditional boundaries between data types.


Performance and Best Practices

Pixeltable optimizes embedding computation automatically, but following these best practices ensures optimal performance in production environments.

Batch Operations: Insert multiple videos at once rather than one at a time. Pixeltable processes embeddings more efficiently in batches:

# Efficient: batch insert
videos.insert([
    {'video': 'url1.mp4'},
    {'video': 'url2.mp4'},
    {'video': 'url3.mp4'}
])

Similarity Thresholds: Tune your similarity thresholds based on precision/recall requirements. Higher thresholds (0.7-0.9) return fewer, more relevant results; lower thresholds (0.4-0.6) cast a wider net. Test with your specific content to find the sweet spot.

Incremental Updates: Pixeltable's computed columns update automatically when you add new data. The embedding index stays current without manual reprocessing—insert new videos anytime and they become immediately searchable.

Caching: Pixeltable manages embedding indices: once data is indexed, similarity searches don’t require additional API calls as long as the underlying data hasn’t changed.

Error Handling: Pixeltable retries failed embedding computations automatically (see the configuration documentation for setting rate limits per provider). Additionally, monitor your TwelveLabs API quota to avoid rate limiting during large batch operations.

These practices help you build production-ready video search systems that scale efficiently while maintaining low latency for end users.


Additional Resources
