🎉 TwelveLabs Raises $100M Series B to build the future of video superintelligence. Read more.

プラットフォーム

価格

ソリューション

構築

資料

会社情報

Select Language

Playgroundへ移動

営業担当に相談する

🎉 TwelveLabs Raises $100M Series B to build the future of video superintelligence. Read more.

チュートリアル

手動レビューから自動化されたインテリジェンスへ：YOLOとTwelve Labsを活用した手術動画分析プラットフォームの開発

ミラン・キム

このチュートリアルでは、YOLOv8による術具検出とTwelve LabsのMarengoおよびPegasusを組み合わせた、Surgical Video Intelligence（外科手術ビデオ・インテリジェンス）の構築方法を詳しく解説します。このプラットフォームは、SOAP手術記録の自動生成、手術プロセスの段階（フェーズ）ごとのセグメンテーション、ビデオライブラリ全体のセマンティック検索、そして対話型の臨床Q&Aアシスタントである「Dr. Sage」の駆動を実現します。

この記事の内容

No headings found on page

Join our newsletter

Receive the latest advancements, tutorials, and industry insights in video understanding

AIを活用してビデオを検索、分析、探索します。

Try the Playground

2026/02/20

9分

記事へのリンクをコピー

Surgeons and residents routinely spend 3–5 hours per procedure reviewing laparoscopic footage frame-by-frame — logging instruments, identifying surgical phases, and writing operative reports by hand. Across a teaching hospital processing hundreds of procedures monthly, that manual review adds up to thousands of hours of clinical time diverted from patient care.

In this tutorial, we build Surgical Video Intelligence, a platform that treats surgical video as a rich, queryable dataset. The core architectural insight: pair YOLO (You Only Look Once) for precise object detection with Twelve Labs' multimodal video understanding API to create a system that can both see which tools are on screen and understand the surgical context around them.

⭐️ What We're Building: A Next.js application that automatically:

Detects surgical tools in real-time with bounding boxes and timestamps
Generates SOAP operative notes (Subjective, Objective, Assessment, Plan) from video evidence
Segments surgeries into chapters (e.g., "Dissection," "Closure")
Enables semantic search across video libraries (e.g., "find all cauterization moments")
Provides Dr. Sage, a conversational AI assistant for follow-up clinical Q&A

📌 Try the live demo | View the GitHub repository

System Architecture

Surgical video analysis requires two fundamentally different capabilities: pixel-precise instrument identification and contextual understanding of what those instruments are doing within a procedure. No single model handles both well.

Our architecture separates these concerns into two parallel pipelines that converge at inference time.

Video Ingestion: The user uploads a laparoscopic surgery video.
Tool Detection: A YOLOv8 model (fine-tuned on 7 distinct surgical instrument classes) scans the video and outputs a structured JSON file of detections with timestamps and bounding boxes.
Video Indexing: The video is indexed by Twelve Labs to enable semantic search (Marengo embeddings) and contextual analysis (Pegasus reasoning).
Context Fusion: When the system generates notes or answers questions, YOLO's detection data is injected into the prompt alongside Twelve Labs' video understanding — grounding the AI's reasoning in verified visual evidence.

This hybrid approach matters because each technology compensates for the other's blind spots. YOLO can identify a grasper with 95% confidence at frame 120, but it cannot tell you whether that grasper is being used for retraction or dissection. Twelve Labs can reason about surgical technique and phase progression, but benefits from hard detection data to prevent hallucinating instrument usage that never occurred.

Application Demo

Watch the full demo on Loom

1. Real-Time Tool Detection

The main analysis view provides a synchronized experience. On the left, the video player overlays real-time tool bounding boxes detected by YOLO. Below, a "swimlane" timeline visualizes exactly when each instrument appears — giving educators an instant, scannable map of instrument usage across the entire procedure.

2. Automatic SOAP Note Generation

Upon video upload and indexing, the system generates a first-person operative note using the Twelve Labs Analyze API, enriched with YOLO tool detection data. No manual input required. The resulting SOAP note includes accurate instrument references anchored to specific timestamps, reducing documentation time from hours to minutes.

3. Surgical Phase Segmentation

The Twelve Labs Analyze API automatically breaks the surgery into distinct phases — "Preparation," "Dissection," "Closure" — using structured JSON output. These chapters appear on an interactive timeline, enabling quick navigation to specific surgical stages. For training programs, this transforms a 90-minute recording into a navigable curriculum.

4. Semantic Search

Natural language search across surgical video libraries, powered by Twelve Labs Marengo's multimodal embeddings. Unlike keyword matching, this understands surgical context semantically.

Example query: "Show me when the surgeon used electrocautery" Result: Clips at 2:34–2:58 and 5:12–5:45 with 0.92 confidence

For a surgical educator searching "use of monopolar coagulation" across 50+ training videos, Marengo returns all instances with timestamps — enabling rapid compilation of technique montages without watching a single full recording.

5. Dr. Sage: Conversational Clinical Q&A

After automated analysis completes, Dr. Sage provides a conversational interface for follow-up questions. It has access to both YOLO detections and Twelve Labs video understanding, enabling queries like "What tools were used during dissection?" or "Explain the technique used at 3:45." This bridges the gap between automated reports and the ad-hoc questions that arise during surgical review and training sessions.

Deployment Architecture

The application is designed as a distributed system with clear separation between compute-intensive inference and user-facing operations.

Infrastructure Overview

Frontend (Vercel): Next.js application serving the UI and API routes
Backend (Railway): FastAPI server running YOLO inference
Storage (Vercel Blob): Centralized JSON storage for all analysis results
AI Processing (Twelve Labs): Cloud-based video understanding

Data Flow

When a user uploads a surgical video, two parallel pipelines activate:

Computer Vision Pipeline (Railway)

User uploads video → Frontend sends to Railway backend
YOLO processes the video frame-by-frame (every 120th frame for efficiency)
Backend generates detections.json:

{
  "detections": [
    {
      "frame": 120,
      "timestamp": 5.0,
      "tools": [
        {"class_name": "Grasper", "confidence": 0.95, "bbox": {...}}
      ]
    }
  ]
}

{
  "detections": [
    {
      "frame": 120,
      "timestamp": 5.0,
      "tools": [
        {"class_name": "Grasper", "confidence": 0.95, "bbox": {...}}
      ]
    }
  ]
}

Results upload to Vercel Blob at path: detections/{videoId}.json

Video Intelligence Pipeline (Twelve Labs + Vercel)

Video is indexed by Twelve Labs (Marengo for embeddings, Pegasus for understanding)
Frontend API routes trigger AI generation:
- SOAP Note: POST /api/analysis/{videoId}/soap
- Chapters/Timeline: POST /api/timeline
Results are saved to Vercel Blob at: analysis/{videoId}.json

Vercel Blob: Centralized State

Both pipelines converge at Vercel Blob, a key-value store for JSON files. This eliminates database complexity while maintaining fast lookups.

Storage schema:

How Twelve Labs Powers Intelligence

YOLO gives us the what and when — specific tools, exact timestamps. Twelve Labs gives us the why and how — phase recognition, technique analysis, contextual reasoning. Here is how we use three primary API capabilities: Analyze (for SOAP generation and chapter segmentation) and Search (for semantic retrieval).

1. Analyze API: SOAP Note Generation

The Analyze API is Twelve Labs' multimodal generation endpoint. We use it to create first-person operative notes grounded in YOLO detection data.

Source: frontend/src/app/api/analysis/[videoId]/soap/route.js

// 1. Fetch YOLO detection data from Vercel Blob
const toolData = await fetchToolDetection(videoId);

// 2. Format detections as readable text for the prompt
const toolContext = formatToolDetectionForPrompt(toolData);
// Example output: "DETECTED TOOLS: Grasper at 0:45, 1:20, 2:30..."

// 3. Enrich the SOAP generation prompt with hard detection data
const enrichedPrompt = `${soapPrompt}

## REFERENCE DATA
${toolContext}
Use this tool detection data to provide accurate tool names and usage times.`;

// 4. Call Twelve Labs Analyze API
const response = await getTwelveLabsClient().analyze({
    videoId: videoId,
    prompt: enrichedPrompt,
    temperature: 0.2 // Low temperature for factual, not creative, output
});

// 5. Parse JSON response and save to Vercel Blob
const parsedData = JSON.parse(response.data);
await saveToBlob(videoId, { operative_note: parsedData.operative_note });

// 1. Fetch YOLO detection data from Vercel Blob
const toolData = await fetchToolDetection(videoId);

// 2. Format detections as readable text for the prompt
const toolContext = formatToolDetectionForPrompt(toolData);
// Example output: "DETECTED TOOLS: Grasper at 0:45, 1:20, 2:30..."

// 3. Enrich the SOAP generation prompt with hard detection data
const enrichedPrompt = `${soapPrompt}

## REFERENCE DATA
${toolContext}
Use this tool detection data to provide accurate tool names and usage times.`;

// 4. Call Twelve Labs Analyze API
const response = await getTwelveLabsClient().analyze({
    videoId: videoId,
    prompt: enrichedPrompt,
    temperature: 0.2 // Low temperature for factual, not creative, output
});

// 5. Parse JSON response and save to Vercel Blob
const parsedData = JSON.parse(response.data);
await saveToBlob(videoId, { operative_note: parsedData.operative_note });

Why this works: By injecting YOLO detections into the prompt, we ground the model's reasoning in verified visual evidence. The AI cannot claim "I used a clipper at 5:30" if YOLO never detected one. This is the key architectural pattern — using computer vision outputs as factual constraints on multimodal generation.

2. Analyze API: Timeline Generation (Structured Output)

We also use the Analyze API for chapter generation by leveraging the responseFormat parameter to request structured JSON output. This provides schema-validated responses, eliminating fragile string parsing.

Source: frontend/src/app/api/timeline/route.js

const response = await getTwelveLabsClient().analyze({
    videoId: videoId,
    prompt: "Divide this surgery into distinct phases with medical terminology. For each phase, provide a title, summary, start time, and end time.",
    responseFormat: {
        type: "json_schema",
        jsonSchema: {
            type: "object",
            properties: {
                chapters: {
                    type: "array",
                    items: {
                        type: "object",
                        properties: {
                            chapterNumber: { type: "number" },
                            chapterTitle: { type: "string" },
                            chapterSummary: { type: "string" },
                            startSec: { type: "number" },
                            endSec: { type: "number" }
                        },
                        required: ["chapterNumber", "chapterTitle", "startSec", "endSec"]
                    }
                }
            }
        }
    }
});

// Parse the structured JSON response
const parsedData = JSON.parse(response.data);
// parsedData.chapters contains the chapter array

const response = await getTwelveLabsClient().analyze({
    videoId: videoId,
    prompt: "Divide this surgery into distinct phases with medical terminology. For each phase, provide a title, summary, start time, and end time.",
    responseFormat: {
        type: "json_schema",
        jsonSchema: {
            type: "object",
            properties: {
                chapters: {
                    type: "array",
                    items: {
                        type: "object",
                        properties: {
                            chapterNumber: { type: "number" },
                            chapterTitle: { type: "string" },
                            chapterSummary: { type: "string" },
                            startSec: { type: "number" },
                            endSec: { type: "number" }
                        },
                        required: ["chapterNumber", "chapterTitle", "startSec", "endSec"]
                    }
                }
            }
        }
    }
});

// Parse the structured JSON response
const parsedData = JSON.parse(response.data);
// parsedData.chapters contains the chapter array

Example output:

{
  "chapters": [
    {
      "chapterNumber": 1,
      "chapterTitle": "Preoperative Setup and Positioning",
      "chapterSummary": "Patient positioned and prepped for procedure",
      "startSec": 0,
      "endSec": 163
    },
    {
      "chapterNumber": 2,
      "chapterTitle": "Dissection and Control of Carotid Arteries",
      "startSec": 163,
      "endSec": 326
    }
  ]
}

{
  "chapters": [
    {
      "chapterNumber": 1,
      "chapterTitle": "Preoperative Setup and Positioning",
      "chapterSummary": "Patient positioned and prepped for procedure",
      "startSec": 0,
      "endSec": 163
    },
    {
      "chapterNumber": 2,
      "chapterTitle": "Dissection and Control of Carotid Arteries",
      "startSec": 163,
      "endSec": 326
    }
  ]
}

Why this works: The responseFormat parameter ensures the model returns valid, parseable JSON matching an exact schema. No regex, no post-processing heuristics — the structured output is ready for direct use in the timeline UI.

3. Search API: Semantic Retrieval with Marengo

The Search API uses Marengo's multimodal embeddings to find specific moments in video based on natural language queries.

Source: frontend/src/app/api/search/route.js

const response = await getTwelveLabsClient().search.query({
    indexId: process.env.NEXT_PUBLIC_TWELVELABS_MARENGO_INDEX_ID,
    searchOptions: ['visual', 'audio'], // Search both modalities
    queryText: query, // e.g., "clipping of cystic artery"
    groupBy: "clip",
    threshold: "low" // Balance precision vs. recall
});

// Iterate through results
for await (const clip of response) {
    console.log(`Found at ${clip.start}s - ${clip.end}s (confidence: ${clip.confidence})`);
    console.log(`Thumbnail: ${clip.thumbnailUrl}`);
}

const response = await getTwelveLabsClient().search.query({
    indexId: process.env.NEXT_PUBLIC_TWELVELABS_MARENGO_INDEX_ID,
    searchOptions: ['visual', 'audio'], // Search both modalities
    queryText: query, // e.g., "clipping of cystic artery"
    groupBy: "clip",
    threshold: "low" // Balance precision vs. recall
});

// Iterate through results
for await (const clip of response) {
    console.log(`Found at ${clip.start}s - ${clip.end}s (confidence: ${clip.confidence})`);
    console.log(`Thumbnail: ${clip.thumbnailUrl}`);
}

Why this works: Unlike keyword matching, Marengo understands surgical context. A search for "clipping of cystic artery" returns results even when no one says those exact words — the model recognizes the visual and procedural pattern. This enables building searchable surgical video archives without manual tagging.

Step-by-Step Implementation

Prerequisites

Node.js 18+ and Python 3.10+
Twelve Labs API Key: Get one from the Playground
Computer Vision Model: We use a YOLOv8 model fine-tuned on the Cholec80 dataset, which contains 80 cholecystectomy surgeries annotated with 7 instrument classes

Step 1: Setting Up the Backend (Railway)

The YOLO detection backend is a FastAPI server that accepts video uploads and returns JSON detections.

Source: backend/main.py

# backend/main.py
@app.post("/detect/upload")
async def detect_tools_upload(video_id: str, video: UploadFile):
    # Process video with YOLO
    results_data = run_inference(temp_video_path, video_id)

    # Upload results to Vercel Blob via Frontend API
    await upload_results_to_blob(video_id, results_data, blob_token)

    return {"status": "completed", "data": results_data}

# backend/main.py
@app.post("/detect/upload")
async def detect_tools_upload(video_id: str, video: UploadFile):
    # Process video with YOLO
    results_data = run_inference(temp_video_path, video_id)

    # Upload results to Vercel Blob via Frontend API
    await upload_results_to_blob(video_id, results_data, blob_token)

    return {"status": "completed", "data": results_data}

The backend processes every 120th frame (~1 frame per 5 seconds at 24fps), balancing detection accuracy against processing speed. For a 30-minute surgical video, this produces approximately 360 analyzed frames — enough to build a comprehensive instrument usage timeline without the cost of full frame-by-frame inference.

Step 2: Frontend API Routes (Vercel)

The Next.js API routes orchestrate the intelligence layer. Three key endpoints:

GET /api/detect-tools/{videoId}: Retrieves YOLO detections from Vercel Blob
POST /api/analysis/{videoId}/soap: Generates SOAP notes using Twelve Labs Analyze API
POST /api/timeline: Creates chapter segmentation using Twelve Labs Analyze API

Each route follows a consistent three-step pattern: fetch YOLO data from Blob (if available), call the Twelve Labs API with enriched prompts, and save results back to Blob for caching.

Step 3: Dr. Sage Chat Integration

The conversational interface enriches user queries with visual evidence before sending them to Twelve Labs.

Source: frontend/src/app/api/analysis/route.js

import { TwelveLabs } from 'twelvelabs-js';

// Helper: Convert raw JSON detections to a chat-friendly summary
function formatToolDetectionForChat(toolData) {
    // ... logic to aggregate frames into time ranges ...
    // Output example:
    // "- Grasper: 450 detections (0:05 - 12:30)"
    // "- Hook: 120 detections (4:15 - 8:00)"
    return summary;
}

export async function POST(request) {
    const { userQuery, videoId } = await request.json();

    // 1. Fetch the JSON data produced by our YOLO backend
    const toolData = await fetchToolDetection(videoId);

    // 2. Format it into natural language context
    // This turns thousands of data points into a readable summary for the AI
    const toolContext = formatToolDetectionForChat(toolData);

    // 3. Enrich the user's query
    // We tag this as system-injected data so the model treats it as ground truth
    const enrichedQuery = toolContext
        ? `${userQuery}\n\n[SYSTEM INJECTED DATA - DO NOT IGNORE]\n${toolContext}`
        : userQuery;

    // 4. Call Twelve Labs Pegasus
    const client = new TwelveLabs({ apiKey: process.env.TWELVELABS_API_KEY });
    const response = await client.analyze({
        videoId: videoId,
        prompt: enrichedQuery,
        temperature: 0.2 // Low temperature for factual accuracy
    });

    return new Response(JSON.stringify(response));
}

import { TwelveLabs } from 'twelvelabs-js';

// Helper: Convert raw JSON detections to a chat-friendly summary
function formatToolDetectionForChat(toolData) {
    // ... logic to aggregate frames into time ranges ...
    // Output example:
    // "- Grasper: 450 detections (0:05 - 12:30)"
    // "- Hook: 120 detections (4:15 - 8:00)"
    return summary;
}

export async function POST(request) {
    const { userQuery, videoId } = await request.json();

    // 1. Fetch the JSON data produced by our YOLO backend
    const toolData = await fetchToolDetection(videoId);

    // 2. Format it into natural language context
    // This turns thousands of data points into a readable summary for the AI
    const toolContext = formatToolDetectionForChat(toolData);

    // 3. Enrich the user's query
    // We tag this as system-injected data so the model treats it as ground truth
    const enrichedQuery = toolContext
        ? `${userQuery}\n\n[SYSTEM INJECTED DATA - DO NOT IGNORE]\n${toolContext}`
        : userQuery;

    // 4. Call Twelve Labs Pegasus
    const client = new TwelveLabs({ apiKey: process.env.TWELVELABS_API_KEY });
    const response = await client.analyze({
        videoId: videoId,
        prompt: enrichedQuery,
        temperature: 0.2 // Low temperature for factual accuracy
    });

    return new Response(JSON.stringify(response));
}

Step 4: Visualizing the Timeline

Raw detection data is useful for the AI, but clinicians need visuals. We map detection timestamps to a React timeline component where each "swimlane" represents a surgical instrument class:

Red: Bipolar (Cautery)
Green: Hook (Dissection)
Yellow: Grasper (Manipulation)

A surgical educator reviewing a trainee's cholecystectomy can instantly spot patterns — prolonged cautery before adequate dissection, or unusual instrument sequencing — and click any moment to verify in the video. The frontend maps the JSON detections array to CSS-positioned elements on the timeline bar.

Conclusion

By pairing precise computer vision with contextual multimodal reasoning, this platform converts surgical video from passive recordings into structured, searchable clinical data.

YOLO delivers the who and when: specific instruments identified with pixel-level confidence at exact timestamps.
Twelve Labs delivers the what and why: phase recognition, technique analysis, and operative note generation grounded in full-procedure context.

The critical design pattern — injecting verified CV detections as factual constraints on multimodal generation — is applicable well beyond surgery. Any domain where you need both precise object identification and contextual reasoning (manufacturing QA, sports analytics, security review) can benefit from this hybrid architecture.

Next steps to explore:

Multi-procedure analytics across a department's full surgical library
Real-time intraoperative assistance with streaming YOLO detection
Automated training curriculum generation from annotated case libraries

Get started: