TwelveLabsがビデオの手動アノテーション時間を大幅に削減する方法

🎉 TwelveLabs Raises $100M Series B to build the future of video superintelligence. Read more.

プラットフォーム

価格

ソリューション

構築

資料

会社情報

Select Language

Playgroundへ移動

営業担当に相談する

🎉 TwelveLabs Raises $100M Series B to build the future of video superintelligence. Read more.

チュートリアル

TwelveLabsがわずか数回のAPI呼び出しで、何時間もの手動ビデオ注釈作業を削減する方法

ネイサン・チェ

開発者は、Twelve LabsのMarengo埋め込みとPegasusの構造化JSON出力を活用して自動動画アノテーションパイプラインを構築できます。これにより、カスタムタクソノミー（分類法）に基づくタイムスタンプ付きのトレーニングラベルを生成し、それを2次元の埋め込み空間で可視化して、JSON、CSV、またはCOCO形式でエクスポートし、PyTorchに直接インポートすることが可能になります。

この記事の内容

No headings found on page

Join our newsletter

Receive the latest advancements, tutorials, and industry insights in video understanding

AIを活用してビデオを検索、分析、探索します。

Try the Playground

2026/03/09

6分

記事へのリンクをコピー

Introduction

Labeling video data is one of the most expensive bottlenecks in machine learning. A single hour of footage requires 4 hours of human annotation time, and at $100 per hour for domain specialists (radiologists, safety auditors, sports analysts) the math stops working long before you reach your back catalog.

The footage exists. Dashcams, retail cameras, warehouse feeds, surgical recordings, sports broadcasts: enterprises sit on thousands of hours of video containing precisely the training signals their models need. But the path from raw footage to structured labels still runs through a labeling platform where human reviewers scrub frame by frame.

This tutorial builds a different pipeline. Instead of sending video to annotators, we send it to TwelveLabs' multimodal foundation models. Marengo encodes each video into dense, temporal embeddings that capture visual, audio, and contextual information in 512 dimensions. Pegasus then reasons over those embeddings and returns structured, timestamped annotations (complete with scene classifications, object detections, and confidence scores) all enforced by a strict JSON schema.

By the end of this walkthrough, you will have a working Next.js application that indexes video, auto-generates training labels against a custom taxonomy, visualizes your dataset in 2D embedding space, and exports to COCO, JSON, or CSV for direct ingestion into PyTorch or any standard ML pipeline.

Here is the finished application in action:

Architecture Overview

Before writing any code, here is the full pipeline. Each layer was chosen to solve a specific constraint of production video annotation.

Next.js 16 + React 19 handles both the frontend dashboard and the backend API routes. Serverless functions let us orchestrate the entire annotation pipeline without running a persistent server, which matters when you want ISV partners or internal teams to deploy this with a single vercel deploy.

Vercel Blob provides temporary video storage. Serverless environments impose strict payload limits (4.5 MB on Vercel), so the client uploads video files directly to Blob storage and passes only the resulting public URL to our API routes. This decouples upload size from function constraints.

TwelveLabs Marengo extracts 512-dimensional vector embeddings from each video. These are not frame-level features: Marengo processes video as a continuous multimodal signal across visual, audio, and temporal dimensions, producing a sequence of segment-level embeddings that capture semantic meaning rather than pixel data. This is what makes downstream clustering and visualization possible without any manual feature engineering.

TwelveLabs Pegasus (Analyze API) acts as the reasoning engine. Given the embeddings and a user-defined label taxonomy, Pegasus generates structured, machine-readable annotations with timestamps, scene classifications, and detected objects; all returned as JSON that maps directly to ML training formats.

For a deeper look at the data flow, the full architecture diagram is available on LucidChart: [TwelveLabs] - Automated Video Data Labeling Solution

Step 1: Prerequisite & Setup

You will need three things to run this project locally:

Node.js 18+ (20+ recommended)
TwelveLabs API Key from the TwelveLabs Dashboard
Vercel Blob Token for handling temporary video uploads

Clone the repository and install dependencies:

>> git clone https://github.com/nathanchess/tl-automated-video-data-labeler.git
>> cd tl-automated-video-data-labeler
>> npm

>> git clone https://github.com/nathanchess/tl-automated-video-data-labeler.git
>> cd tl-automated-video-data-labeler
>> npm

Set up your .env.local file:

TL_API_KEY=your_twelvelabs_api_key_here
TL_INDEX_NAME=your_default_index_name_here
BLOB_READ_WRITE_TOKEN

TL_API_KEY=your_twelvelabs_api_key_here
TL_INDEX_NAME=your_default_index_name_here
BLOB_READ_WRITE_TOKEN

Start the development server with npm run dev. The dashboard will be available at http://localhost:3000/indexes.

Step 2: Video Ingestion and Indexing

The ingestion flow begins when a user drops video files into the dashboard UI. The client uploads each file to Vercel Blob, then sends the resulting public URLs to a Next.js API route that triggers TwelveLabs indexing.

Here is the server-side logic in src/app/api/videos/route.js:

import { TwelveLabs } from 'twelvelabs-js';
import { NextResponse } from 'next/server';

const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });

export async function POST(request) {
  const { videoURLs, metadata } = await request.json();

  // 1. Locate the target index
  const indexPager = await tl_client.indexes.list();
  let indexId = null;
  for await (const index of indexPager) {
    if (index.indexName === process.env.TL_INDEX_NAME) indexId = index.id;
  }

  if (!indexId) return NextResponse.json({ error: 'Index not found' }, { status: 404 });

  // 2. Create tasks and wait for completion
  for (const videoURL of videoURLs) {
    const task = await tl_client.tasks.create({
      indexId,
      videoUrl: videoURL,
      userMetadata: JSON.stringify(metadata),
    });

    const completedTask = await tl_client.tasks.waitForDone(task.id, {
      sleepInterval: 5,
    });

    if (completedTask.status !== 'ready') {
      throw new Error(`Task ${completedTask.id} failed`);
    }
  }

  return new Response('ok');
}

import { TwelveLabs } from 'twelvelabs-js';
import { NextResponse } from 'next/server';

const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });

export async function POST(request) {
  const { videoURLs, metadata } = await request.json();

  // 1. Locate the target index
  const indexPager = await tl_client.indexes.list();
  let indexId = null;
  for await (const index of indexPager) {
    if (index.indexName === process.env.TL_INDEX_NAME) indexId = index.id;
  }

  if (!indexId) return NextResponse.json({ error: 'Index not found' }, { status: 404 });

  // 2. Create tasks and wait for completion
  for (const videoURL of videoURLs) {
    const task = await tl_client.tasks.create({
      indexId,
      videoUrl: videoURL,
      userMetadata: JSON.stringify(metadata),
    });

    const completedTask = await tl_client.tasks.waitForDone(task.id, {
      sleepInterval: 5,
    });

    if (completedTask.status !== 'ready') {
      throw new Error(`Task ${completedTask.id} failed`);
    }
  }

  return new Response('ok');
}

Production note: Video processing is fundamentally asynchronous. The waitForDone call here blocks the HTTP request until indexing completes, which works for a reference application processing a handful of clips. For production pipelines handling hours of footage, offload indexing to a background queue (Inngest, Upstash, or similar) and use TwelveLabs webhooks to notify your application when processing is complete. The SDK fully supports this pattern.

Step 3: Extracting and Pooling Embeddings

Once videos are indexed, we retrieve them along with their Marengo embeddings. These vectors are what power semantic grouping and the 2D visualization we build later.

const videoPager = await tl_client.indexes.videos.list(indexId);
  const videos = [];

  for await (const video of videoPager) {
    let embeddings = [];
    try {
      const videoData = await tl_client.indexes.videos.retrieve(indexId, video.id, {
        embeddingOption: ['visual'],
      });

      const segments = videoData.embedding?.videoEmbedding?.segments || [];
      if (segments.length > 0) {
        const dim = segments[0].float?.length;
        if (dim) {
          const sum = new Array(dim).fill(0);
          let count = 0;

          // Average the embeddings across all temporal segments
          for (const seg of segments) {
            if (seg.float && seg.float.length === dim) {
              for (let i = 0; i < dim; i++) sum[i] += seg.float[i];
              count++;
            }
          }
          if (count > 0) embeddings = sum.map((val) => val / count);
        }
      }
    } catch (embErr) {
      console.warn(`Failed to get embeddings for ${video.id}`);
    }
    videos.push({ ...video, embeddings });
  }

const videoPager = await tl_client.indexes.videos.list(indexId);
  const videos = [];

  for await (const video of videoPager) {
    let embeddings = [];
    try {
      const videoData = await tl_client.indexes.videos.retrieve(indexId, video.id, {
        embeddingOption: ['visual'],
      });

      const segments = videoData.embedding?.videoEmbedding?.segments || [];
      if (segments.length > 0) {
        const dim = segments[0].float?.length;
        if (dim) {
          const sum = new Array(dim).fill(0);
          let count = 0;

          // Average the embeddings across all temporal segments
          for (const seg of segments) {
            if (seg.float && seg.float.length === dim) {
              for (let i = 0; i < dim; i++) sum[i] += seg.float[i];
              count++;
            }
          }
          if (count > 0) embeddings = sum.map((val) => val / count);
        }
      }
    } catch (embErr) {
      console.warn(`Failed to get embeddings for ${video.id}`);
    }
    videos.push({ ...video, embeddings });
  }

Why mean pooling? Marengo outputs a sequence of embeddings mapped to temporal segments; an embedding for seconds 0–5, another for 5–10, and so on. That temporal granularity is valuable for search (finding the exact moment a forklift enters a restricted zone), but to visualize an entire video as a single point in a 2D scatterplot, we need a single vector per video.

Mean pooling (averaging the float values across all segments) produces a composite vector that represents the overall semantic content of the video. If the footage is primarily a forklift operating in a warehouse, the averaged vector will land in the "forklift operation" region of embedding space, regardless of which specific 5-second segment you examine.

Step 4: Zero-Shot Auto-Labeling with Pegasus

The core problem with using general-purpose LLMs for annotation is output structure. Language models return free-form text, which means developers end up writing fragile regular expressions to extract timestamps, labels, and confidence scores from prose paragraphs. One unexpected formatting change and the parser breaks.

The TwelveLabs Analyze API eliminates this by enforcing a strict JSON schema on Pegasus's output. You define exactly the structure your ML pipeline expects, inject the user's custom label taxonomy (e.g., shoplifting, forklift_violation, surgical_clamp) directly into the prompt, and receive machine-parseable annotations with zero post-processing.

const response_format = {
    type: 'json_schema',
    json_schema: {
      type: 'object',
      properties: {
        annotations: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              start_timestamp: { type: 'string' },
              end_timestamp: { type: 'string' },
              description: { type: 'string' },
              scene_classification: { type: 'string' },
              detected_objects: {
                type: 'array',
                items: {
                  type: 'object',
                  properties: {
                    label: { type: 'string' },
                    confidence_score: { type: 'number' },
                    start_timestamp: { type: 'string' },
                    end_timestamp: { type: 'string' },
                  },
                },
              },
            },
            required: ['start_timestamp', 'end_timestamp', 'description', 'scene_classification', 'detected_objects'],
          },
        },
      },
    },
  };

const response_format = {
    type: 'json_schema',
    json_schema: {
      type: 'object',
      properties: {
        annotations: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              start_timestamp: { type: 'string' },
              end_timestamp: { type: 'string' },
              description: { type: 'string' },
              scene_classification: { type: 'string' },
              detected_objects: {
                type: 'array',
                items: {
                  type: 'object',
                  properties: {
                    label: { type: 'string' },
                    confidence_score: { type: 'number' },
                    start_timestamp: { type: 'string' },
                    end_timestamp: { type: 'string' },
                  },
                },
              },
            },
            required: ['start_timestamp', 'end_timestamp', 'description', 'scene_classification', 'detected_objects'],
          },
        },
      },
    },
  };

This schema guarantees that every annotation includes start and end timestamps, a natural-language description, a scene classification label, and an array of detected objects (each with its own confidence score and temporal bounds). Because Pegasus is a zero-shot generative model, no fine-tuning is required. You define the taxonomy, pass the schema, and receive structured training data in seconds.

Step 5: Interactive Embedding Visualization with PCA

Numbers and labels are necessary for ML pipelines, but for data curation work, you need to see your dataset. The application includes an in-browser 2D scatter plot that projects Marengo's 512-dimensional embeddings into two dimensions using Principal Component Analysis (PCA).

Optimizing PCA for Browser-Based Visualization

Computing PCA on high-dimensional data is expensive. The standard approach (constructing a 512 × 512 covariance matrix and finding its eigenvalues via singular value decomposition) will freeze a browser tab. When the user is curating a dataset of 50 videos, they need immediate visual feedback, not a loading spinner.

The solution is a Gram matrix approach. Instead of computing the full D × D covariance matrix (where D = 512), we compute an N × N matrix of inner products between videos (where N = the number of videos in the dataset). For a 50-video dataset, finding the eigenvectors of a 50 × 50 matrix is orders of magnitude faster than operating on 512 dimensions; reducing complexity from O(D³) to O(N³) and keeping the visualization fluid in the main browser thread.

What the visualization reveals for annotation quality: Tight, well-separated clusters indicate that your labeled categories correspond to genuinely distinct semantic content. Outliers (points that sit far from any cluster) flag either rare events worth separate annotation attention (a safety incident in routine warehouse footage) or noisy data that may be misclassified. This gives data PMs and annotators immediate, actionable feedback on dataset health before they commit to expensive model training.

Step 6: Exporting for ML Pipelines

A curated dataset is only as valuable as its compatibility with downstream training tools. The application natively supports three export formats:

JSON provides the raw structured output for programmatic parsing in custom data loaders. Every field from the Pegasus annotation schema is preserved, including nested detected objects with confidence scores.
CSV flattens the annotations into spreadsheet-friendly rows with timestamps, labels, and descriptions; useful for quick manual audits or feeding into pandas-based preprocessing scripts.
COCO (Common Objects in Context) structures annotations into the videos, annotations, and categories arrays expected by major training frameworks. This is the format PyTorch DataLoaders consume natively, and it imports directly into workforce platforms like Label Studio and CVAT for a final human-in-the-loop review.

The COCO export is particularly important because it eliminates the reformatting step that typically sits between annotation and training. You go from raw video to indexed embeddings to structured labels to training-ready dataset (all within a single pipeline).

Business Impact: What This Changes

For ML and Computer Vision Teams

The economics of this pipeline shift the annotation bottleneck from human labor to compute:

Traditional manual workflow: 2–4 hours of human effort per 1 hour of video, at $50–100/hour for domain expertise. A 500-hour back catalog requires 1,000–2,000 hours of annotation labor; roughly 6–12 months of full-time work.
TwelveLabs automated workflow: Approximately 1 minute of compute per video for indexing and structured annotation, plus targeted human review of the auto-generated labels. The same 500-hour back catalog processes in days, not months.

Humans stay in the loop for quality assurance, but the expensive first pass (finding the 4 seconds of relevant action in a 2-hour video) is handled by the models. This makes it economically feasible to label historical footage that was previously too costly to touch.

For Data Labeling Platform ISVs

This application is a product blueprint. Integrating AI-first video annotation into your platform reduces per-project annotation effort by 80–90% based on the time savings demonstrated in this pipeline (minutes of compute replacing hours of manual review). That shifts your business model from selling raw human hours to selling high-margin, scalable intelligence; a differentiation that matters as annotation becomes increasingly commoditized.

Conclusion

This reference application demonstrates a complete pipeline for converting raw video into structured, training-ready datasets using TwelveLabs' video understanding APIs. Marengo encodes footage into dense temporal embeddings. Pegasus converts those embeddings into timestamped, schema-enforced training labels. The browser-based dashboard handles ingestion, visualization, and export (all without a persistent backend).

What you have after running this tutorial: a working application that you can point at your own footage, define your own label taxonomy, and generate ML-ready annotations in the time it takes to index the video.

The full source code is available on GitHub. Clone it, configure your TwelveLabs index, and start building.

Get the source code →

Try the live demo →

To explore what else you can build with TwelveLabs' video understanding APIs, visit the documentation or reach out to our team.

Resources