From Opaque Video to Monetizable Moments: Building a Contextual Ad Engine with TwelveLabs

Nathan Che

Most CTV and FAST platforms make ad decisions without analyzing what's actually happening on screen. This tutorial walks through building a contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for semantic embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.

May 19, 2026

14 Minutes

TLDR

Most CTV/FAST platforms still make ad decisions without looking at what's actually happening on screen. This tutorial walks through building a production-grade contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for multimodal embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding rather than stale metadata, with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.

What you'll build: A complete pipeline that transforms video content into queryable context, matches ads to scenes based on semantic similarity and brand safety rules, identifies optimal break points, and exports decisioning data to Databricks for downstream analytics.


Introduction

Most ad decision stacks treat video as an opaque blob. They rely on metadata, content labels, or historical audience segments to make placement decisions. Everything except the video itself.

This approach works for broad targeting. Keyword matching can get you in the right ballpark. But it leaves significant revenue on the table because it fails to account for three things:

  1. Timing: Ads placed without awareness of scene transitions interrupt the viewing experience and drive abandonment.

  2. Context: Brand safety violations happen when systems can't see what's actually happening on screen. An alcohol ad shouldn't run during a scene depicting addiction recovery.

  3. Depth: Surface-level demographic targeting misses the nuance of household income, viewing device, and real-time engagement signals.

This tutorial addresses all three by building a contextual ad engine that treats video as queryable, structured data rather than a black box. The engine combines:

  • TwelveLabs Pegasus 1.5 for fine-grained scene understanding: sentiment, tone, cast, environment, and GARM-aligned safety signals

  • TwelveLabs Marengo 3.0 for multimodal semantic embeddings that enable scene-to-ad similarity scoring

  • Databricks Delta Lake + Mosaic AI Vector Search for enterprise-grade storage and retrieval

  • FreeWheel/OpenRTB-compatible payload generation for direct integration with existing ad servers

Figure 1: Intelligence Scene Extraction in Video Inventory

The goal: answer the question "Which ad should run at this break, for this audience, in this specific scene, while respecting brand safety and campaign constraints?" with data grounded in actual video content.

Here's a walkthrough of the finished application:


Prerequisites

Before starting, you'll need:

  1. Node.js 18+ and npm/yarn/pnpm

  2. TwelveLabs API Key with two indexes:

    • TL_INDEX_ID for content videos

    • TL_AD_INDEX_ID for ad creatives

  3. Vercel Blob Token (BLOB_READ_WRITE_TOKEN) for handling large video file transfers to TwelveLabs

  4. OpenAI API Key (optional) for low-latency text embedding during IAB 3.1 taxonomy mapping

  5. Databricks Workspace (optional) with DATABRICKS_TOKEN, DATABRICKS_HOST, DATABRICKS_HTTP_PATH, and optionally DATABRICKS_CATALOG and DATABRICKS_SCHEMA

Clone and run:

>> git clone https://github.com/nathanchess/twelvelabs-context-ad-engine.git
>> cd twelvelabs-context-ad-engine
>> npm install
>> cp .env.example .env.local
>> npm


Architecture Overview

Figure 2: Contextual Ad Engine Backend Architecture (LucidChart)

The architecture leverages two TwelveLabs models that serve complementary roles:

Marengo 3.0 is the encoder. It transforms video into searchable 512-dimensional vector embeddings, making products, emotions, environments, and moments queryable. This enables semantic matching between ad creatives and content scenes.

Pegasus 1.5 is the reasoning model. It generates structured metadata about each scene: demographics, brand safety flags, sentiment, and targeting recommendations. It supports structured outputs, producing consistent JSON that downstream systems can parse deterministically.

The outputs of both models feed a single deterministic calculation, shown on the right-hand side of the architecture diagram: (User-Ad Match Score) x (Scene-Ad Match Score). This lets the engine recommend ads through scene-level decisions grounded in real video understanding rather than pre-written text metadata.

This allows the ad engine to treat each segment as a living context signal, considering:

  • Tone

  • Sentiment

  • Environment

  • Brand Safety

This approach makes ad decisions based on what's actually happening in the video, not on content metadata that was labeled weeks ago. For deeper background on the underlying technology, see the TwelveLabs Platform Overview and TwelveLabs Research.


Core Ad Decision | Placement Logic

The core decision logic combines both signals into a single score:

totalScore = adAffinity * sceneFit

Here, adAffinity measures how well an ad fits the viewer profile (demographics, interests, policy constraints), and sceneFit measures how well the ad creative fits the current scene (semantic similarity, suitable-category overlap, tone, and environment).

The scoring pipeline combines four weighted signals into the sceneFit calculation:

sceneFit =
  suitableMatch  * 0.15 +   // Pegasus suitable_categories overlap
  environmentFit * 0.15 +   // environment-category affinity
  toneCompat     * 0.10 +   // emotional tone compatibility
  contextMatch   * 0.60     // Marengo semantic cosine similarity

The weighting is intentional. In CTV/OTT monetization, the largest CPM lift typically comes from semantic context quality, so Marengo 3.0 drives most of the score. The remaining signals preserve rule-based controllability for policy and content safety teams who need deterministic guardrails.
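The two formulas above can be sketched as a small scoring function. This is a minimal sketch, assuming every signal has already been normalized to the range 0 to 1; the `SceneSignals` type and the `clamp01` helper are illustrative names, not taken from the repository.

```typescript
// Weights copied from the sceneFit formula above; they sum to 1.0.
const SCENE_FIT_WEIGHTS = {
  suitableMatch: 0.15,
  environmentFit: 0.15,
  toneCompat: 0.10,
  contextMatch: 0.60,
} as const;

// Hypothetical container for the four scene-level signals (each assumed in [0, 1]).
interface SceneSignals {
  suitableMatch: number;
  environmentFit: number;
  toneCompat: number;
  contextMatch: number;
}

// Illustrative helper: keep a value inside [0, 1].
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Weighted sum of the four scene signals.
function computeSceneFit(signals: SceneSignals): number {
  return (
    clamp01(signals.suitableMatch) * SCENE_FIT_WEIGHTS.suitableMatch +
    clamp01(signals.environmentFit) * SCENE_FIT_WEIGHTS.environmentFit +
    clamp01(signals.toneCompat) * SCENE_FIT_WEIGHTS.toneCompat +
    clamp01(signals.contextMatch) * SCENE_FIT_WEIGHTS.contextMatch
  );
}

// totalScore = adAffinity * sceneFit
function computeTotalScore(adAffinity: number, signals: SceneSignals): number {
  return clamp01(adAffinity) * computeSceneFit(signals);
}
```

Because the two factors are multiplied, an ad with zero viewer affinity scores zero no matter how well it fits the scene, and vice versa.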


Step 1: Generate Structured Ad Metadata (Pegasus + IAB + FreeWheel)

This step extracts structured scene intelligence from video content using Pegasus 1.5, normalizes it to IAB 3.1 taxonomy, and generates FreeWheel-compatible key-value pairs for ad server integration.


1.1 - Run Pegasus 1.5 with Structured Output

The /api/analyze endpoint handles three tasks:

  1. Accepts a prompt from the frontend (from the /videos or /ads page)

  2. Checks Vercel Blob cache to avoid redundant processing

  3. Calls Pegasus 1.5 with structured output and stores the result

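import { TwelveLabs } from "twelvelabs-js";

// Create the SDK client once; the API key is read from the environment.
const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });

// Analyze one indexed video. A low temperature keeps the structured output
// stable across runs; response_format carries the JSON schema for the result.
const result = await tl_client.analyze({
  videoId,
  prompt,
  temperature: 0.2,
  response_format
}, { timeoutInSeconds: 90 });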

The output is time-aligned metadata that downstream systems can reason over: scene boundaries, sentiment, environment, cast, and safety flags. This replaces brittle keyword-based targeting with grounded video understanding.
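Because downstream code depends on the shape of this JSON, it can help to validate the response before caching it. The sketch below assumes a hypothetical `SceneMetadata` shape built from the fields listed above; the real schema is whatever you pass as `response_format`, so every field name here is a placeholder.

```typescript
// Hypothetical shape for one analyzed scene; all field names are assumptions.
interface SceneMetadata {
  startSec: number;
  endSec: number;
  sentiment: string;
  environment: string;
  cast: string[];
  safetyFlags: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Type guard: true only if every expected field is present with the right type.
function isSceneMetadata(value: unknown): value is SceneMetadata {
  if (typeof value !== "object" || value === null) return false;
  const { startSec, endSec, sentiment, environment, cast, safetyFlags } =
    value as Record<string, unknown>;
  return (
    typeof startSec === "number" &&
    typeof endSec === "number" &&
    endSec >= startSec &&
    typeof sentiment === "string" &&
    typeof environment === "string" &&
    isStringArray(cast) &&
    isStringArray(safetyFlags)
  );
}

// Parse a raw JSON string and keep only well-formed scenes.
// Invalid JSON or a non-array payload yields an empty list.
function parseScenes(rawJson: string): SceneMetadata[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return [];
  }
  return Array.isArray(parsed) ? parsed.filter(isSceneMetadata) : [];
}
```

Dropping malformed scenes at this boundary keeps the scoring and break-selection steps free of defensive checks.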


1.2 - Normalize Model Output to IAB 3.1 via Embedding KNN ID Matching

The analysis output from Pegasus needs to map to the IAB Content Taxonomy 3.1 for ad server compatibility. The pipeline uses text embeddings and k-nearest-neighbor matching against canonical IAB IDs.

The approach maintains a closed reference table of approved IAB 3.1 rows:

export const IAB_ALLOWED_ROWS = [
  { tier1: "Alcohol", tier2: "Spirits", code: "1005" },
  { tier1: "Alcohol", tier2: "Beer", code: "1003" },
  { tier1: "Consumer Packaged Goods", tier2: "General Food", tier3: "Snacks", code: "1169" },
  { tier1: "Finance and Insurance", tier2: "Stocks and Investments", code: "1338" },
  { tier1: "Vehicles", tier2: "Automotive Ownership", tier3: "New Vehicle Ownership", code: "1536" },
  // ...
] as const;

Each row is embedded once at index time. At runtime, candidate labels from the model are embedded and matched via KNN to the nearest canonical IAB rows, then thresholded and deduplicated:

export function normalizeIabWithKnnPolicy(
  rawInput: unknown,
  categoryKey?: string
): IabPolicyResult {
  const rawItems = Array.isArray(rawInput) ? rawInput : [];
  // 1) Embed candidate text from model output
  const embeddedCandidates = embedCandidateLabels(rawItems);
  // 2) KNN against canonical IAB 3.1 embedding index
  const knnMatches = queryIabKnnIndex(embeddedCandidates, { k: 5 });
  // 3) Keep only policy-compliant matches above similarity threshold
  const normalizedItems = dedupeAndSort(
    applyIabMatchPolicy(knnMatches).filter(
      (item): item is IabTaxonomyItem => Boolean(item)
    )
  );
  const high = normalizedItems.filter((item) => item.confidence >= IAB_HIGH_CONFIDENCE);
  const medium = normalizedItems.filter((item) => item.confidence >= IAB_MEDIUM_CONFIDENCE);
  let effectiveItems: IabTaxonomyItem[] = [];
  let fallbackApplied = false;
  let fallbackReason: string | null = null;
  if (high.length > 0) {
    effectiveItems = high;
  } else if (medium.length > 0) {
    effectiveItems = medium;
    fallbackReason = "No high-confidence Tier-2/3 matches; using medium-confidence Tier-1 band.";
  } else {
    const fallback = (categoryKey && FALLBACK_BY_CATEGORY_KEY[categoryKey]) || [];
    effectiveItems = fallback;
    fallbackApplied = true;
    fallbackReason = fallback.length
      ? "No medium-confidence KNN matches; applied deterministic vertical fallback."
      : "No medium-confidence KNN matches and no category fallback mapping found.";
  }
  const effectiveTier1 = [...new Set(effectiveItems.map((item) => item.tier1))];
  const effectiveTier2 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier2))]
    : [];
  const effectiveTier3 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier3).filter((tier3): tier3 is string => Boolean(tier3)))]
    : [];
  const effectiveIabIds = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.iabId).filter(Boolean))]
    : [];
  const averageConfidence = normalizedItems.length > 0
    ? normalizedItems.reduce((sum, item) => sum + item.confidence, 0) / normalizedItems.length
    : 0;
  return {
    normalizedItems,
    effectiveTier1,
    effectiveTier2,
    effectiveTier3,
    effectiveIabIds,
    averageConfidence,
    fallbackApplied,
    fallbackReason,
  };
}

This pipeline:

  • Embeds model-generated category phrases

  • Runs KNN similarity search against canonical IAB 3.1 row embeddings

  • Snaps candidates to valid IAB taxonomy rows/IDs only

  • Deduplicates and ranks matches by confidence

  • Promotes high-confidence rows as effective targeting fields

  • Applies deterministic vertical fallback when confidence is too low

This is critical for production ad tech: it prevents taxonomy hallucination, enforces valid IAB 3.1 IDs, and still captures semantic nuance through embedding-based matching.
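The helpers called inside `normalizeIabWithKnnPolicy` are not shown in this article. To illustrate the KNN step on its own, here is a minimal brute-force sketch, assuming each canonical row carries a precomputed embedding. `EmbeddedIabRow`, `nearestIabRows`, and the default `minSimilarity` of 0.5 are hypothetical names and values, not the repository's implementation.

```typescript
// Hypothetical: one canonical IAB row plus its precomputed text embedding.
interface EmbeddedIabRow {
  code: string;
  tier1: string;
  embedding: number[];
}

interface IabMatch {
  code: string;
  tier1: string;
  confidence: number; // cosine similarity to the candidate label
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score the candidate against every canonical row, drop weak matches,
// and return the k most similar rows in descending order of confidence.
function nearestIabRows(
  candidate: number[],
  rows: EmbeddedIabRow[],
  k = 5,
  minSimilarity = 0.5 // assumed threshold, tune per embedding model
): IabMatch[] {
  return rows
    .map((row) => ({
      code: row.code,
      tier1: row.tier1,
      confidence: cosine(candidate, row.embedding),
    }))
    .filter((match) => match.confidence >= minSimilarity)
    .sort((x, y) => y.confidence - x.confidence)
    .slice(0, k);
}
```

Since every result is drawn from the reference table, the output can only contain codes that exist in that table, which is the property the closed-table design relies on.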


1.3 - Build FreeWheel KVP Payload from Normalized Metadata

Once IAB and context signals are normalized, the engine generates FreeWheel key-value pairs for downstream ad serving:

const freewheelPayload = {
  ad_server: "Freewheel",
  endpoint: "https://ads.freewheel.tv/ad/p/1",
  generated_kvps: {
    vw_brand: toBrand(parsed.company),
    vw_ctx_inc: includeContexts.join(","),
    vw_ctx_exc: excludeContexts.join(","),
    vw_garm_floor: "strict",
    vw_duration: String(duration),
    vw_ad_title: parsed.proposedTitle || "untitled",
    vw_iab_t1: policy.effectiveTier1.join(","),
    vw_iab_t2: policy.effectiveTier2.join(","),
    vw_iab_t3: policy.effectiveTier3.join(","),
    vw_iab_codes: policy.effectiveIabIds.join(","),
    vw_iab_conf: policy.averageConfidence.toFixed(3),
  },
};

The key fields:

  • vw_ctx_inc combines target contexts and Pegasus-recommended contexts

  • vw_ctx_exc combines campaign exclusions, Pegasus negatives, and GARM flags

  • vw_iab_* fields are populated only from normalized/effective classes

This step is what connects AI-generated understanding to existing ad ops workflows. TwelveLabs provides semantic intelligence. Policy normalization ensures deterministic, auditable taxonomy behavior. FreeWheel/OpenRTB mapping makes the outputs deployable in production ad servers.


Step 2: Build Multimodal Embeddings with Marengo

Both content scenes and ad creatives are vectorized into the same 512-dimensional embedding space using Marengo 3.0. This enables true semantic matching between scenes and ads, not just keyword overlap.

export function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(vecA.length, vecB.length);
  for (let i = 0; i < len; i++) {
    dot += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

A visualization of these embeddings is available in the deployed application under the Metadata View, showing how semantically similar scenes cluster together.

To improve ranking spread, the engine normalizes expected cosine values and applies a non-linear boost (power transform). This separates high-quality matches more clearly in candidate rankings, making the difference between a 0.7 and 0.8 similarity score more meaningful for ad selection.
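The exact normalization range and exponent are not given here. A minimal sketch of the idea follows, assuming raw cosine values mostly fall between a floor of 0.2 and a ceiling of 0.9 and using an exponent of 2; all three defaults are illustrative assumptions, and `boostSimilarity` is a hypothetical name.

```typescript
// Rescale a raw cosine score from an expected [floor, ceiling] band to [0, 1],
// then raise it to a power greater than 1 so strong matches pull away from weak ones.
function boostSimilarity(
  rawCosine: number,
  floor = 0.2,    // assumed lower bound of typical scores
  ceiling = 0.9,  // assumed upper bound of typical scores
  exponent = 2    // assumed boost strength
): number {
  if (ceiling <= floor) {
    throw new RangeError("ceiling must be greater than floor");
  }
  const rescaled = (rawCosine - floor) / (ceiling - floor);
  const normalized = Math.min(1, Math.max(0, rescaled));
  return Math.pow(normalized, exponent);
}
```

With these defaults, raw scores of 0.7 and 0.8 map to roughly 0.51 and 0.73, so a raw gap of 0.1 becomes a gap of about 0.22 in the ranking.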


Step 3: Identify Optimal Ad Breaks

Before recommending ads, the engine identifies optimal monetization points within the content. Pegasus 1.5 analyzes each scene for:

  • Post-segment break quality

  • Interruption risk

  • Emotional valley detection

  • Transition type bonus

  • Mode-aware safety multiplier (strict, balanced, revenue_max)

The engine then applies spacing constraints and selects top breaks greedily with chronological ordering.
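One way to read "greedy selection with spacing constraints" is sketched below. It assumes each candidate break already has a timestamp and a combined score; `BreakCandidate`, `minGapSec`, and `maxBreaks` are hypothetical names, and how the five signals above fold into `score` is left out.

```typescript
// Hypothetical candidate break: a timestamp plus a precomputed quality score.
interface BreakCandidate {
  timeSec: number;
  score: number;
}

// Greedy selection: take candidates from best to worst, skip any that sit
// too close to an already chosen break, then return the picks in time order.
function selectBreaks(
  candidates: BreakCandidate[],
  minGapSec: number,
  maxBreaks: number
): BreakCandidate[] {
  const byScore = [...candidates].sort((a, b) => b.score - a.score);
  const chosen: BreakCandidate[] = [];
  for (const candidate of byScore) {
    if (chosen.length >= maxBreaks) break;
    const farEnough = chosen.every(
      (picked) => Math.abs(picked.timeSec - candidate.timeSec) >= minGapSec
    );
    if (farEnough) chosen.push(candidate);
  }
  return chosen.sort((a, b) => a.timeSec - b.timeSec);
}
```

Greedy selection is not guaranteed to maximize the total score, but it is simple to audit and always keeps the single best break.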

This matters in production because ad relevance is only useful if insertion timing is viewer-safe and UX-aware. A perfectly matched ad placed mid-sentence or during an emotional climax will still drive abandonment.


Step 4: Rank Ads with Safety Gates + Diversity Constraints

With optimal break points identified, embeddings computed, and metadata extracted, the engine can rank ads. But raw scoring isn't enough. Production ad decisioning requires two additional layers:

1. Hard Gates for Brand Safety

Before scoring, ads are filtered through:

  • User/category eligibility checks

  • Negative campaign context overlap detection

  • GARM-sensitive exclusions (alcohol, gambling, violence)

  • Safety mode gate policies
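The first three gates can be pictured as a boolean filter that runs before any scoring. The sketch below is an assumption about how such a filter might look, not the repository's code: the `AdCandidate`, `SceneContext`, and `ViewerPolicy` shapes and the contents of `GARM_SENSITIVE_CATEGORIES` are hypothetical, and the safety mode gate is omitted.

```typescript
// Hypothetical shapes for the inputs to the gate.
interface AdCandidate {
  id: string;
  category: string;
  negativeContexts: string[]; // contexts the campaign must never run against
}

interface SceneContext {
  contexts: string[];  // contexts detected in the scene
  garmFlags: string[]; // sensitive-content flags raised for the scene
}

interface ViewerPolicy {
  blockedCategories: string[]; // categories this viewer is not eligible for
}

// Assumed list of ad categories treated as sensitive.
const GARM_SENSITIVE_CATEGORIES = new Set(["alcohol", "gambling"]);

function passesHardGates(
  ad: AdCandidate,
  scene: SceneContext,
  viewer: ViewerPolicy
): boolean {
  // 1) User/category eligibility
  if (viewer.blockedCategories.includes(ad.category)) return false;
  // 2) Negative campaign context overlap
  if (ad.negativeContexts.some((ctx) => scene.contexts.includes(ctx))) return false;
  // 3) Sensitive ad categories never run in a flagged scene
  if (scene.garmFlags.length > 0 && GARM_SENSITIVE_CATEGORIES.has(ad.category)) {
    return false;
  }
  return true;
}

function filterEligibleAds(
  ads: AdCandidate[],
  scene: SceneContext,
  viewer: ViewerPolicy
): AdCandidate[] {
  return ads.filter((ad) => passesHardGates(ad, scene, viewer));
}
```

Keeping the gates as a plain filter means a rejected ad never receives a score, so a high semantic match cannot override a safety rule.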

2. Cross-Break Diversity

No viewer, even one who loves cars, wants to see a car ad at every break. The engine enforces:

  • Same-ad repetition caps across breaks

  • Category frequency limits

  • Fallback logic when diversity constraints suppress top candidates

The result is an ad plan that is both high-scoring and broadcast-realistic, tailored to each viewer, scene, and available inventory.
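The diversity rules can be sketched as a pass over the breaks in order, counting what has already been scheduled. This is a minimal sketch under assumed defaults (each ad at most once, each category at most twice); `RankedAd` and `planBreaks` are hypothetical names, and the fallback shown is simply the next-best candidate that still satisfies the caps.

```typescript
// Hypothetical scored candidate for a single break.
interface RankedAd {
  id: string;
  category: string;
  score: number;
}

// For each break, pick the highest-scoring ad that has not hit its repetition
// or category cap. A break with no admissible ad is left empty (null).
function planBreaks(
  rankedPerBreak: RankedAd[][],
  maxPerAd = 1,       // assumed same-ad repetition cap
  maxPerCategory = 2  // assumed category frequency limit
): (RankedAd | null)[] {
  const adCounts = new Map<string, number>();
  const categoryCounts = new Map<string, number>();

  return rankedPerBreak.map((ranked) => {
    const sorted = [...ranked].sort((a, b) => b.score - a.score);
    const pick =
      sorted.find(
        (ad) =>
          (adCounts.get(ad.id) ?? 0) < maxPerAd &&
          (categoryCounts.get(ad.category) ?? 0) < maxPerCategory
      ) ?? null;
    if (pick) {
      adCounts.set(pick.id, (adCounts.get(pick.id) ?? 0) + 1);
      categoryCounts.set(pick.category, (categoryCounts.get(pick.category) ?? 0) + 1);
    }
    return pick;
  });
}
```

With three breaks that all rank two car ads above a snack ad, this plan schedules the first car ad, then the second, then the snack ad, rather than repeating the top scorer.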


Step 5: Export to Databricks for Enterprise Retrieval and Analytics

The metadata, embeddings, and ad decisioning data generated by this engine are only valuable if they flow into enterprise workflows. The engine exports all signals to Databricks Delta tables for downstream analytics and ML pipelines.

Queries are generated on-demand to match each user's Databricks workspace:

CREATE OR REPLACE VIEW ad_metadata_premium_spirits_vec AS
SELECT
  creative_id,
  campaign_name,
  from_json(marengo_embedding_json, 'array<double>') AS embedding
FROM main.default.ad_metadata_premium_spirits
WHERE vector_sync_status = 'embedded_marengo_clip_avg'
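Since the catalog and schema vary per workspace, the statement above can be assembled from those values at request time. The sketch below shows one possible way to do that; `buildVectorViewSql` and its identifier check are hypothetical, and the defaults of `main` and `default` simply mirror the example query.

```typescript
// Reject anything that is not a plain SQL identifier, since identifiers
// cannot be passed as bound parameters and are interpolated into the text.
function assertIdentifier(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
    throw new Error(`Unsafe SQL identifier: ${name}`);
  }
  return name;
}

// Build the CREATE VIEW statement for one ad metadata table.
function buildVectorViewSql(
  table: string,
  catalog = "main",
  schema = "default"
): string {
  const safeTable = assertIdentifier(table);
  const source = `${assertIdentifier(catalog)}.${assertIdentifier(schema)}.${safeTable}`;
  return [
    `CREATE OR REPLACE VIEW ${safeTable}_vec AS`,
    "SELECT",
    "  creative_id,",
    "  campaign_name,",
    "  from_json(marengo_embedding_json, 'array<double>') AS embedding",
    `FROM ${source}`,
    "WHERE vector_sync_status = 'embedded_marengo_clip_avg'",
  ].join("\n");
}
```

In the application, the `catalog` and `schema` arguments would typically come from `DATABRICKS_CATALOG` and `DATABRICKS_SCHEMA` when those variables are set.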

This data lift into Databricks enables:

  1. Mosaic AI Vector Search Indexing for semantic retrieval at scale

  2. Campaign QA with full audit trails on every decision

  3. Similarity Retrieval for creative ops and competitive analysis

  4. Model-assisted planning in BI and ML pipelines

For teams evaluating enterprise rollout, this is where the TwelveLabs + Databricks combination becomes compelling: model-native video intelligence meets production data governance and retrieval infrastructure.


Why TwelveLabs for Contextual Advertising

You've now built (or walked through) a contextual ad engine that makes placement decisions based on actual video content rather than stale metadata. Few systems outside of purpose-built video AI can support this depth of decisioning across timing, sentiment, semantics, and policy in a production-ready architecture.

TwelveLabs provides the foundation:

  • Pegasus 1.5 for fine-grained, structured scene intelligence

  • Marengo 3.0 for multimodal semantic retrieval and matching

  • An API-first architecture that integrates cleanly into existing ad tech stacks

The combination transforms video from an opaque storage cost into a queryable, monetizable asset.


Start Building

TLDR

Most CTV/FAST platforms still make ad decisions without looking at what's actually happening on screen. This tutorial walks through building a production-grade contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for multimodal embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding rather than stale metadata, with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.

What you'll build: A complete pipeline that transforms video content into queryable context, matches ads to scenes based on semantic similarity and brand safety rules, identifies optimal break points, and exports decisioning data to Databricks for downstream analytics.


Introduction

Most ad decision stacks treat video as an opaque blob. They rely on metadata, content labels, or historical audience segments to make placement decisions. Everything except the video itself.

This approach works for broad targeting. Keyword matching can get you in the right ballpark. But it leaves significant revenue on the table because it fails to account for three things:

  1. Timing: Ads placed without awareness of scene transitions interrupt the viewing experience and drive abandonment.

  2. Context: Brand safety violations happen when systems can't see what's actually happening on screen. An alcohol ad shouldn't run during a scene depicting addiction recovery.

  3. Depth: Surface-level demographic targeting misses the nuance of household income, viewing device, and real-time engagement signals.

This tutorial addresses all three by building a contextual ad engine that treats video as queryable, structured data rather than a black box. The engine combines:

  • TwelveLabs Pegasus 1.5 for fine-grained scene understanding: sentiment, tone, cast, environment, and GARM-aligned safety signals

  • TwelveLabs Marengo 3.0 for multimodal semantic embeddings that enable scene-to-ad similarity scoring

  • Databricks Delta Lake + Mosaic AI Vector Search for enterprise-grade storage and retrieval

  • FreeWheel/OpenRTB-compatible payload generation for direct integration with existing ad servers

Figure 1: Intelligence Scene Extraction in Video Inventory

The goal: answer the question "Which ad should run at this break, for this audience, in this specific scene, while respecting brand safety and campaign constraints?" with data grounded in actual video content.

Here's a walkthrough of the finished application:


Prerequisites

Before starting, you'll need:

  1. Node.js 18+ and npm/yarn/pnpm

  2. TwelveLabs API Key with two indexes:

    • TL_INDEX_ID for content videos

    • TL_AD_INDEX_ID for ad creatives

  3. Vercel Blob Token (BLOB_READ_WRITE_TOKEN) for handling large video file transfers to TwelveLabs

  4. OpenAI API Key (optional) for low-latency text embedding during IAB 3.1 taxonomy mapping

  5. Databricks Workspace (optional) with DATABRICKS_TOKEN, DATABRICKS_HOST, DATABRICKS_HTTP_PATH, and optionally DATABRICKS_CATALOG and DATABRICKS_SCHEMA

Clone and run:

>> git clone https://github.com/nathanchess/twelvelabs-context-ad-engine.git
>> cd contextual-ad-engine
>> npm install
>> cp .env.example .env.local
>> npm


Architecture Overview

Figure 2: Contextual Ad Engine Backend Architecture (LucidChart)

The architecture leverages two TwelveLabs models that serve complementary roles:

Marengo 3.0 is the encoder. It transforms video into searchable 512-dimensional vector embeddings, making products, emotions, environments, and moments queryable. This enables semantic matching between ad creatives and content scenes.

Pegasus 1.5 is the reasoning model. It generates structured metadata about each scene: demographics, brand safety flags, sentiment, and targeting recommendations. It supports structured outputs, producing consistent JSON that downstream systems can parse deterministically.

By leveraging their unique capabilities and metadata generated into a single deterministic calculation, shown on the right hand side of the technical architecture diagram, of (User-Ad Match Score) x (Scene-Ad Match Score) we are able to recommend ads not based off of pre-written text metadata, but making scene-level decisions grounded in real video understanding.

This allows the ad engine to treat each segment as a living context signal, considering:

  • Tone

  • Sentiment

  • Environment

  • Brand Safety

This approach makes ad decisions based on what's actually happening in the video, not on content metadata that was labeled weeks ago. For deeper background on the underlying technology, see the TwelveLabs Platform Overview and TwelveLabs Research.


Core Ad Decision | Placement Logic

The core decision logic combines both signals into a single score:

totalScore = adAffinity * sceneFit

Where adAffinity measures how well an ad fits the viewer profile (demographics, interests, policy constraints) and sceneFit measures how well the ad creative fits the current scene (semantic similarity + safety + tone + environment).

The scoring pipeline combines four weighted signals into the sceneFit calculation:

sceneFit =
  suitableMatch  * 0.15 +   // Pegasus suitable_categories overlap
  environmentFit * 0.15 +   // environment-category affinity
  toneCompat     * 0.10 +   // emotional tone compatibility
  contextMatch   * 0.60     // Marengo semantic cosine similarity

The weighting is intentional. In CTV/OTT monetization, the largest CPM lift typically comes from semantic context quality, so Marengo 3.0 drives most of the score. The remaining signals preserve rule-based controllability for policy and content safety teams who need deterministic guardrails.


Step 1: Generate Structured Ad Metadata (Pegasus + IAB + FreeWheel)

This step extracts structured scene intelligence from video content using Pegasus 1.5, normalizes it to IAB 3.1 taxonomy, and generates FreeWheel-compatible key-value pairs for ad server integration.


1.1  - Run Pegasus 1.5 with Structured Output

The /api/analyze endpoint handles three tasks:

  1. Accepts a prompt from the frontend (from the /videos or /ads page)

  2. Checks Vercel Blob cache to avoid redundant processing

  3. Calls Pegasus 1.5 with structured output and stores the result

const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });
const result = await tl_client.analyze({
  videoId,
  prompt,
  temperature: 0.2,
  response_format
}, { timeoutInSeconds: 90 });

The output is time-aligned metadata that downstream systems can reason over: scene boundaries, sentiment, environment, cast, and safety flags. This replaces brittle keyword-based targeting with grounded video understanding.


1.2 - Normalize Model Output to IAB 3.1 via Embedding KNN ID Matching

The analysis output from Pegasus needs to map to the IAB Content Taxonomy 3.1 for ad server compatibility. The pipeline uses text embeddings and k-nearest-neighbor matching against canonical IAB IDs.

The approach maintains a closed reference table of approved IAB 3.1 rows:

export const IAB_ALLOWED_ROWS = [
  { tier1: "Alcohol", tier2: "Spirits", code: "1005" },
  { tier1: "Alcohol", tier2: "Beer", code: "1003" },
  { tier1: "Consumer Packaged Goods", tier2: "General Food", tier3: "Snacks", code: "1169" },
  { tier1: "Finance and Insurance", tier2: "Stocks and Investments", code: "1338" },
  { tier1: "Vehicles", tier2: "Automotive Ownership", tier3: "New Vehicle Ownership", code: "1536" },
  // ...
] as const;

Each row is embedded once at index time. At runtime, candidate labels from the model are embedded and matched via KNN to the nearest canonical IAB rows, then thresholded and deduplicated:

export function normalizeIabWithKnnPolicy(
  rawInput: unknown,
  categoryKey?: string
): IabPolicyResult {
  const rawItems = Array.isArray(rawInput) ? rawInput : [];
  // 1) Embed candidate text from model output
  const embeddedCandidates = embedCandidateLabels(rawItems);
  // 2) KNN against canonical IAB 3.1 embedding index
  const knnMatches = queryIabKnnIndex(embeddedCandidates, { k: 5 });
  // 3) Keep only policy-compliant matches above similarity threshold
  const normalizedItems = dedupeAndSort(
    applyIabMatchPolicy(knnMatches).filter(
      (item): item is IabTaxonomyItem => Boolean(item)
    )
  );
  const high = normalizedItems.filter((item) => item.confidence >= IAB_HIGH_CONFIDENCE);
  const medium = normalizedItems.filter((item) => item.confidence >= IAB_MEDIUM_CONFIDENCE);
  let effectiveItems: IabTaxonomyItem[] = [];
  let fallbackApplied = false;
  let fallbackReason: string | null = null;
  if (high.length > 0) {
    effectiveItems = high;
  } else if (medium.length > 0) {
    effectiveItems = medium;
    fallbackReason = "No high-confidence Tier-2/3 matches; using medium-confidence Tier-1 band.";
  } else {
    const fallback = (categoryKey && FALLBACK_BY_CATEGORY_KEY[categoryKey]) || [];
    effectiveItems = fallback;
    fallbackApplied = true;
    fallbackReason = fallback.length
      ? "No medium-confidence KNN matches; applied deterministic vertical fallback."
      : "No medium-confidence KNN matches and no category fallback mapping found.";
  }
  const effectiveTier1 = [...new Set(effectiveItems.map((item) => item.tier1))];
  const effectiveTier2 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier2))]
    : [];
  const effectiveTier3 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier3).filter((tier3): tier3 is string => Boolean(tier3)))]
    : [];
  const effectiveIabIds = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.iabId).filter(Boolean))]
    : [];
  const averageConfidence = normalizedItems.length > 0
    ? normalizedItems.reduce((sum, item) => sum + item.confidence, 0) / normalizedItems.length
    : 0;
  return {
    normalizedItems,
    effectiveTier1,
    effectiveTier2,
    effectiveTier3,
    effectiveIabIds,
    averageConfidence,
    fallbackApplied,
    fallbackReason,
  };
}

This pipeline:

  • Embeds model-generated category phrases

  • Runs KNN similarity search against canonical IAB 3.1 row embeddings

  • Snaps candidates to valid IAB taxonomy rows/IDs only

  • Deduplicates and ranks matches by confidence

  • Promotes high-confidence rows as effective targeting fields

  • Applies deterministic vertical fallback when confidence is too low

This is critical for production ad tech: it prevents taxonomy hallucination, enforces valid IAB 3.1 IDs, and still captures semantic nuance through embedding-based matching.


1.3 - Build FreeWheel KVP Payload from Normalized Metadata

Once IAB and context signals are normalized, the engine generates FreeWheel key-value pairs for downstream ad serving:

const freewheelPayload = {
  ad_server: "Freewheel",
  endpoint: "https://ads.freewheel.tv/ad/p/1",
  generated_kvps: {
    vw_brand: toBrand(parsed.company),
    vw_ctx_inc: includeContexts.join(","),
    vw_ctx_exc: excludeContexts.join(","),
    vw_garm_floor: "strict",
    vw_duration: String(duration),
    vw_ad_title: parsed.proposedTitle || "untitled",
    vw_iab_t1: policy.effectiveTier1.join(","),
    vw_iab_t2: policy.effectiveTier2.join(","),
    vw_iab_t3: policy.effectiveTier3.join(","),
    vw_iab_codes: policy.effectiveCodes.join(","),
    vw_iab_conf: policy.averageConfidence.toFixed(3),
  },
};

The key fields:

  • vw_ctx_inc combines target contexts and Pegasus-recommended contexts

  • vw_ctx_exc combines campaign exclusions, Pegasus negatives, and GARM flags

  • vw_iab_* fields are populated only from normalized/effective classes

This step is what connects AI-generated understanding to existing ad ops workflows. TwelveLabs provides semantic intelligence. Policy normalization ensures deterministic, auditable taxonomy behavior. FreeWheel/OpenRTB mapping makes the outputs deployable in production ad servers.


Step 2: Build Multimodal Embeddings with Marengo

Both content scenes and ad creatives are vectorized into the same 512-dimensional embedding space using Marengo 3.0. This enables true semantic matching between scenes and ads, not just keyword overlap.

export function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(vecA.length, vecB.length);
  for (let i = 0; i < len; i++) {
    dot += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

A visualization of these embeddings is available in the deployed application under the Metadata View, showing how semantically similar scenes cluster together.

To improve ranking spread, the engine normalizes expected cosine values and applies a non-linear boost (power transform). This separates high-quality matches more clearly in candidate rankings, making the difference between a 0.7 and 0.8 similarity score more meaningful for ad selection.


Step 3: Identify Optimal Ad Breaks

Before recommending ads, the engine identifies optimal monetization points within the content. Pegasus 1.5 analyzes each scene for:

  • Post-segment break quality

  • Interruption risk

  • Emotional valley detection

  • Transition type bonus

  • Mode-aware safety multiplier (strict, balanced, revenue_max)

The engine then applies spacing constraints and selects top breaks greedily with chronological ordering.

This matters in production because ad relevance is only useful if insertion timing is viewer-safe and UX-aware. A perfectly matched ad placed mid-sentence or during an emotional climax will still drive abandonment.


Step 4: Rank Ads with Safety Gates + Diversity Constraints

With optimal break points identified, embeddings computed, and metadata extracted, the engine can rank ads. But raw scoring isn't enough. Production ad decisioning requires two additional layers:

1. Hard Gates for Brand Safety

Before scoring, ads are filtered through:

  • User/category eligibility checks

  • Negative campaign context overlap detection

  • GARM-sensitive exclusions (alcohol, gambling, violence)

  • Safety mode gate policies
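The gates above can be sketched as a single predicate that runs before any scoring. The types, the `GARM_SENSITIVE` set, and the per-mode thresholds in `MAX_SCENE_FLAGS` are assumptions made for this example; the project's actual policy rules are not shown in the article.

```typescript
type SafetyMode = "strict" | "balanced" | "revenue_max";

interface AdCandidate {
  id: string;
  category: string;           // e.g. "alcohol", "automotive"
  excludedContexts: string[]; // campaign-level negative contexts
}

interface SceneContext {
  contexts: string[];  // context labels extracted for the scene
  garmFlags: string[]; // sensitive categories flagged in the scene
}

interface ViewerProfile {
  blockedCategories: string[]; // categories this viewer is not eligible for
}

const GARM_SENSITIVE = new Set(["alcohol", "gambling", "violence"]);

// Assumed policy: how many GARM flags a scene may carry per safety mode.
const MAX_SCENE_FLAGS: Record<SafetyMode, number> = {
  strict: 0,
  balanced: 1,
  revenue_max: 2,
};

function passesHardGates(
  ad: AdCandidate,
  scene: SceneContext,
  viewer: ViewerProfile,
  mode: SafetyMode
): boolean {
  // 1) User/category eligibility.
  if (viewer.blockedCategories.includes(ad.category)) return false;
  // 2) Negative campaign context overlap.
  if (ad.excludedContexts.some((c) => scene.contexts.includes(c))) return false;
  // 3) GARM-sensitive exclusion: a sensitive ad never runs in a scene
  //    flagged for the same category.
  if (GARM_SENSITIVE.has(ad.category) && scene.garmFlags.includes(ad.category)) {
    return false;
  }
  // 4) Safety mode gate.
  return scene.garmFlags.length <= MAX_SCENE_FLAGS[mode];
}

function filterEligibleAds(
  ads: AdCandidate[],
  scene: SceneContext,
  viewer: ViewerProfile,
  mode: SafetyMode
): AdCandidate[] {
  return ads.filter((ad) => passesHardGates(ad, scene, viewer, mode));
}
```

Because the gates are boolean filters rather than score penalties, a rejected ad cannot be rescued by a high similarity score.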

2. Cross-Break Diversity

No viewer, even one who loves cars, wants to see a car ad at every break. The engine enforces:

  • Same-ad repetition caps across breaks

  • Category frequency limits

  • Fallback logic when diversity constraints suppress top candidates

The result is an ad plan that is both high-scoring and broadcast-realistic, tailored to each viewer, scene, and available inventory.
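One way to picture the diversity pass is a walk over the breaks in order while counting what has already been placed. The sketch below is hypothetical: `RankedAd`, `DiversityLimits`, and the fallback order (relax the category limit first, otherwise leave the break unfilled) are assumptions, not the project's documented behavior.

```typescript
interface RankedAd {
  id: string;
  category: string;
  score: number; // final score for this ad at this break
}

interface DiversityLimits {
  maxPerAd: number;       // same-ad repetition cap across breaks
  maxPerCategory: number; // category frequency limit across breaks
}

// rankedPerBreak[i] holds the gated, scored candidates for break i.
function planBreaks(
  rankedPerBreak: RankedAd[][],
  limits: DiversityLimits
): (RankedAd | null)[] {
  const adCount = new Map<string, number>();
  const categoryCount = new Map<string, number>();

  return rankedPerBreak.map((candidates) => {
    const sorted = [...candidates].sort((a, b) => b.score - a.score);
    const underAdCap = (ad: RankedAd) =>
      (adCount.get(ad.id) ?? 0) < limits.maxPerAd;
    const underCategoryCap = (ad: RankedAd) =>
      (categoryCount.get(ad.category) ?? 0) < limits.maxPerCategory;

    // Preferred: best-scoring ad that respects both caps.
    const preferred = sorted.find((ad) => underAdCap(ad) && underCategoryCap(ad));
    // Assumed fallback: relax the category limit but keep the same-ad cap.
    const fallback = sorted.find(underAdCap);
    const pick = preferred ?? fallback ?? null;

    if (pick) {
      adCount.set(pick.id, (adCount.get(pick.id) ?? 0) + 1);
      categoryCount.set(pick.category, (categoryCount.get(pick.category) ?? 0) + 1);
    }
    return pick; // null means the break stays unfilled
  });
}
```

Processing breaks in playback order keeps the plan deterministic: the same candidates and limits always yield the same sequence of ads.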


Step 5: Export to Databricks for Enterprise Retrieval and Analytics

The metadata, embeddings, and ad decisioning data generated by this engine are only valuable if they flow into enterprise workflows. The engine exports all signals to Databricks Delta tables for downstream analytics and ML pipelines.

Queries are generated on demand to match each user's Databricks workspace:

CREATE OR REPLACE VIEW ad_metadata_premium_spirits_vec AS
SELECT
  creative_id,
  campaign_name,
  from_json(marengo_embedding_json, 'array<double>') AS embedding
FROM main.default.ad_metadata_premium_spirits
WHERE vector_sync_status = 'embedded_marengo_clip_avg'
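As a rough illustration of on-demand generation, the sketch below builds the same statement from a workspace's catalog, schema, and table names. `WorkspaceTarget`, `quoteIdent`, and `buildVectorViewSql` are hypothetical names, and the strict identifier check is a design choice for this example, not necessarily what the project does.

```typescript
interface WorkspaceTarget {
  catalog: string; // e.g. "main"
  schema: string;  // e.g. "default"
  table: string;   // e.g. "ad_metadata_premium_spirits"
}

const PLAIN_NAME = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Backtick-quote an identifier, rejecting anything that is not a plain name
// so user-supplied workspace settings cannot inject SQL.
function quoteIdent(name: string): string {
  if (!PLAIN_NAME.test(name)) {
    throw new Error("Unsafe identifier: " + name);
  }
  return "`" + name + "`";
}

function buildVectorViewSql(target: WorkspaceTarget, syncStatus: string): string {
  // The status is embedded as a string literal, so restrict it the same way.
  if (!PLAIN_NAME.test(syncStatus)) {
    throw new Error("Unsafe status value: " + syncStatus);
  }
  const source = [target.catalog, target.schema, target.table]
    .map(quoteIdent)
    .join(".");
  const view = quoteIdent(target.table + "_vec");
  return [
    "CREATE OR REPLACE VIEW " + view + " AS",
    "SELECT",
    "  creative_id,",
    "  campaign_name,",
    "  from_json(marengo_embedding_json, 'array<double>') AS embedding",
    "FROM " + source,
    "WHERE vector_sync_status = '" + syncStatus + "'",
  ].join("\n");
}
```

Calling `buildVectorViewSql({ catalog: "main", schema: "default", table: "ad_metadata_premium_spirits" }, "embedded_marengo_clip_avg")` reproduces the statement shown above, with identifiers backtick-quoted.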

Landing this data in Databricks enables:

  1. Mosaic AI Vector Search Indexing for semantic retrieval at scale

  2. Campaign QA with full audit trails on every decision

  3. Similarity Retrieval for creative ops and competitive analysis

  4. Model-assisted planning in BI and ML pipelines

For teams evaluating enterprise rollout, this is where the TwelveLabs + Databricks combination becomes compelling: model-native video intelligence meets production data governance and retrieval infrastructure.


Why TwelveLabs for Contextual Advertising

You've now built (or walked through) a contextual ad engine that makes placement decisions based on actual video content rather than stale metadata. Few systems outside of purpose-built video AI can support this depth of decisioning across timing, sentiment, semantics, and policy in a production-ready architecture.

TwelveLabs provides the foundation:

  • Pegasus 1.5 for fine-grained, structured scene intelligence

  • Marengo 3.0 for multimodal semantic retrieval and matching

  • An API-first architecture that integrates cleanly into existing ad tech stacks

The combination transforms video from an opaque storage cost into a queryable, monetizable asset.


Start Building




