Partnerships

Building Semantic Video Recommendations with TwelveLabs and LanceDB

James Le

By combining TwelveLabs, LanceDB, and Geneva you can build a recommendation system that understands video content directly. TwelveLabs provides embeddings and summaries that capture meaning beyond keywords. LanceDB is simple to use, runs as an embedded database, and stores multimodal data like video, images, text, and vectors. Geneva with Ray lets you scale from local development to distributed clusters with the same code.

Sep 1, 2025

5 Minutes

Note: The examples in this post use trimmed sample code for readability. If you want the complete code in a runnable notebook, you can find it here.

Most recommendation systems rely on metadata like titles, tags, or transcripts. That approach works, but it misses what is actually happening inside a video. What if your system could understand the visual and audio content itself?

In this post we will show how to build a semantic recommendation engine using TwelveLabs, LanceDB, and Geneva, LanceDB’s feature engineering package. In this scenario, TwelveLabs provides powerful multimodal embeddings that represent the meaning of a video. LanceDB stores those embeddings with metadata and gives you fast vector search through a simple Python API. Geneva builds on LanceDB to scale the pipeline across clusters using Ray, which means the exact same code can run on your laptop or on hundreds of machines.


Why TwelveLabs, LanceDB and Geneva?

TwelveLabs lets you embed video in a way that captures narrative flow, mood, and action. Queries like “a surfer riding a wave at sunset” can return matches even if no one tagged the clip that way.

LanceDB is a vector database built on Apache Arrow. It has three key strengths:

  • A simple Python API that feels natural for developers.

  • An embedded database that runs locally with no external services required.

  • Native multimodal support, so it can store video, images, text, and vectors with equal ease.

Geneva is built on LanceDB and adds distributed data processing. With Ray underneath, it scales embedding generation and queries across many workers.

This combination covers the full pipeline: ingest, embed, store, search, and scale.


Loading and Materializing Videos

We start with a sample dataset from HuggingFace called FineVideo. The loader creates a RecordBatch with raw video bytes plus captions, titles, IDs, and metadata.

import pyarrow as pa
from datasets import load_dataset


def load_videos():
    # Stream FineVideo so we never download the full dataset locally
    dataset = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
    batch = []
    processed = 0

    for row in dataset:
        if processed >= 10:  # small sample for the demo
            break

        video_bytes = row['mp4']
        json_metadata = row['json']

        batch.append({
            "video": video_bytes,
            "caption": json_metadata.get("youtube_title", "No description"),
            "youtube_title": json_metadata.get("youtube_title", ""),
            "video_id": f"video_{processed}",
            "duration": json_metadata.get("duration_seconds", 0),
            "resolution": json_metadata.get("resolution", "")
        })
        processed += 1

    return pa.RecordBatch.from_pylist(batch)

This gives us a batch that holds both raw video bytes and human-readable metadata in the same structure. The benefit is that you do not need to separate video data from structured data; LanceDB handles both in the same table.
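If you want to sanity-check what the loader produces before materializing it, standard PyArrow introspection is enough. This is not part of the pipeline, just a quick look at the mixed binary and metadata columns:

batch = load_videos()
print(batch.num_rows)   # 10 sample videos
print(batch.schema)     # binary video bytes alongside string and numeric metadata columns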

With Geneva, we materialize this dataset into a table backed by LanceDB.

import geneva

db = geneva.connect("/content/quickstart/")
tbl = db.create_table("videos", load_videos(), mode="overwrite")

At this point we have an embedded database of videos. Even before embeddings are added, this structure makes it easy to run queries, transformations, or visualizations.
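Because the table is a Lance dataset on disk, you can already open it with the plain LanceDB client, just as the search section does later. Here is a minimal sketch, assuming the same /content/quickstart/ path, that pulls the metadata into pandas:

import lancedb

db = lancedb.connect("/content/quickstart/")
videos = db.open_table("videos")

# Pull everything into pandas (fine for a 10-row demo) and inspect the metadata
print(videos.to_pandas()[["video_id", "youtube_title", "duration", "resolution"]])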


Embedding Videos with TwelveLabs

The next step is generating embeddings with TwelveLabs’ Marengo 2.7 model, which outputs 1024-dimensional vectors representing video meaning. We request both “clip” and “video” scopes, then keep the video-scope segment as the whole-video embedding.

import os

import numpy as np
from twelvelabs import TwelveLabs

# The client reads the API key generated in the TwelveLabs dashboard
client = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])

# video_file is an open file handle for the clip being embedded
task = client.embed.tasks.create(
    model_name="Marengo-retrieval-2.7",
    video_file=video_file,
    video_embedding_scope=["clip", "video"]
)

status = client.embed.tasks.wait_for_done(task.id)
result = client.embed.tasks.retrieve(task.id)

# Keep only the video-scope segment: one vector for the whole video
video_segments = [seg for seg in result.video_embedding.segments
                  if seg.embedding_scope == "video"]

embedding_array = np.array(video_segments[0].float_, dtype=np.float32)

Requesting both scopes captures the whole context of the video: the video-scope vector summarizes the entire clip, while clip-scope vectors cover shorter windows. The vectors capture patterns in visuals and sound, so similar activities cluster together even when the metadata is sparse.
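The same response also contains the clip-scope segments, so if you later want finer-grained matching (for example, jumping to the most relevant moment) you can keep those vectors as well. Continuing from the block above:

# Clip-scope segments cover short windows of the video
clip_segments = [seg for seg in result.video_embedding.segments
                 if seg.embedding_scope == "clip"]
clip_vectors = np.array([seg.float_ for seg in clip_segments], dtype=np.float32)
print(f"{len(clip_segments)} clip vectors, dim {clip_vectors.shape[-1]}")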

With Geneva, embeddings become another column in the table:

# GenVideoEmbeddings is a Geneva UDF that wraps the TwelveLabs embedding call
tbl.add_columns({"embedding": GenVideoEmbeddings(
    twelve_labs_api_key=os.environ['TWELVE_LABS_API_KEY']
)})
tbl.backfill("embedding", concurrency=1)

The backfill call processes all rows and computes embeddings; a sketch of what the UDF does per row follows below. In development we set concurrency to 1, but in production Geneva can run with high concurrency and let Ray parallelize across workers. That is how you scale from a dozen videos to millions.
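GenVideoEmbeddings is the stateful UDF registered with Geneva; the runnable notebook has the full definition. Its core is roughly the sketch below, a callable that holds a TwelveLabs client and turns one row’s video bytes into a 1024-dimensional vector. The class body and the way it is registered follow Geneva’s UDF API, so treat this as an outline rather than the exact implementation:

import tempfile

import numpy as np
from twelvelabs import TwelveLabs


class GenVideoEmbeddings:
    """Per-row callable: video bytes in, 1024-dim Marengo embedding out."""

    def __init__(self, twelve_labs_api_key: str):
        self.api_key = twelve_labs_api_key
        self._client = None  # created lazily so the object ships cleanly to Ray workers

    @property
    def client(self):
        if self._client is None:
            self._client = TwelveLabs(api_key=self.api_key)
        return self._client

    def __call__(self, video: bytes) -> np.ndarray:
        # Marengo takes a file, so spill the bytes to a temporary .mp4 first
        with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
            f.write(video)
            f.flush()
            with open(f.name, "rb") as video_file:
                task = self.client.embed.tasks.create(
                    model_name="Marengo-retrieval-2.7",
                    video_file=video_file,
                    video_embedding_scope=["clip", "video"],
                )
        self.client.embed.tasks.wait_for_done(task.id)
        result = self.client.embed.tasks.retrieve(task.id)
        video_segments = [seg for seg in result.video_embedding.segments
                          if seg.embedding_scope == "video"]
        return np.array(video_segments[0].float_, dtype=np.float32)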


Searching with LanceDB

Once embeddings are stored, LanceDB gives you a clean API for vector search. A query can be plain text: TwelveLabs embeds it into the same vector space, and LanceDB compares it against the stored video vectors.

import lancedb
import numpy as np

# Embed the text query into the same 1024-dim space as the videos
query = "educational tutorial"
query_result = client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=query
)
qvec = np.array(query_result.text_embedding.segments[0].float_)

# Open the table Geneva materialized and run a cosine-similarity search
lance_db = lancedb.connect("/content/quickstart/")
lance_tbl = lance_db.open_table("videos")

results = (lance_tbl
           .search(qvec)
           .metric("cosine")
           .limit(3)
           .to_pandas())

Because LanceDB is Arrow-native, results come back as a pandas DataFrame. This makes it simple to integrate with analysis or a web application.
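From here it is ordinary pandas, so wiring the results into a UI is mostly formatting. Something like this prints the top matches with their similarity distance:

# Lower _distance means closer in cosine space
for _, row in results.iterrows():
    print(f"{row['video_id']}: {row['youtube_title']} (distance={row['_distance']:.3f})")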


Summarizing with Pegasus

Sometimes a vector match is not enough. TwelveLabs also offers Pegasus, which generates summaries of videos. You can attach these summaries as another column in LanceDB, making search results more understandable.

import time

# Pegasus needs its own index; a unique name keeps reruns from colliding
index = client.indexes.create(
    index_name=f"lancedb_demo_{int(time.time())}",
    models=[{"model_name": "pegasus1.2", "model_options": ["visual", "audio"]}]
)

This step improves the user experience by letting you display a short, human-readable summary along with each recommendation.
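After the index exists, the remaining steps are to index each video with Pegasus and request a summary. A rough sketch looks like the following; the method names follow the TwelveLabs Python SDK and may differ slightly across SDK versions, so check the docs for the version you have installed:

# Index the video with Pegasus, then ask for a summary
# (verify these method names against your SDK version)
task = client.tasks.create(index_id=index.id, video_file=video_file)
client.tasks.wait_for_done(task_id=task.id)

summary = client.summarize(video_id=task.video_id, type="summary")
print(summary.summary)

# The text can then be written back as a "summary" column next to the embedding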


Scaling with Geneva and Ray

Without Geneva you would need to manage ingestion and embedding jobs manually. That might be fine for a dozen videos, but it quickly breaks down at scale. Geneva brings declarative pipelines and Ray executes them in parallel.

| Concern          | LanceDB only   | With Geneva and Ray          |
|------------------|----------------|------------------------------|
| Ingestion        | Manual loaders | Declarative pipelines        |
| Embeddings       | Sequential     | Parallel across many workers |
| Storage          | Local tables   | Distributed LanceDB tables   |
| ML and analytics | Custom scripts | Built-in distributed UDFs    |

This means you can prototype locally and move to production on a cluster without rewriting the workflow.
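In practice the switch is mostly a configuration change: point the job at a Ray cluster and raise the backfill concurrency, while the table definition and UDF stay the same. How Geneva reaches Ray depends on your deployment, so take this as a sketch:

# Same table, same UDF; only the execution changes.
# Connect to an existing Ray cluster, then let Geneva fan the backfill out across workers.
import ray

ray.init(address="auto")  # or a ray:// address for a remote cluster
tbl.backfill("embedding", concurrency=32)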


Conclusion

By combining TwelveLabs, LanceDB, and Geneva you can build a recommendation system that understands video content directly.

  • TwelveLabs provides embeddings and summaries that capture meaning beyond keywords.

  • LanceDB is simple to use, runs as an embedded database, and stores multimodal data like video, images, text, and vectors.

  • Geneva with Ray lets you scale from local development to distributed clusters with the same code.

This stack is a practical foundation for media platforms, education apps, or analytics tools that need semantic video recommendations at scale.


Try it out
