Multi-Source Legal Evidence Reporting: Building an Investigation Platform with TwelveLabs via AWS Bedrock and NeMo Retriever

Hrishikesh Yadav

Legal investigators reviewing 40+ hours of multi-source video evidence (dashcams, bodycams, CCTV, doorbell cameras) spend $200-500/hour in paralegal time manually searching for critical moments. This tutorial shows how to build a cross-source evidence search platform using TwelveLabs through AWS Bedrock for video intelligence and NeMo Retriever for document search, enabling natural language queries across 12+ disparate video sources with sub-3-second response times. The result: 40 hours of evidence review becomes 4 hours of targeted investigation, with automated timeline reconstruction, entity tracking, and structured compliance analysis.

Apr 25, 2026

16 Minutes

Introduction

Legal teams processing video evidence face a compounding problem. A single case can involve 40+ hours of footage from dashcams, bodycams, CCTV systems, doorbell cameras, and insurance submissions, all in different formats, resolutions, and timestamps. Investigators incur $200-500 per hour in paralegal time manually reviewing this content. Miss a critical 10-second clip buried in hour 23 of camera feed 7, and the case outcome changes.

The traditional approach (manual review, basic metadata tagging, frame-by-frame analysis) doesn't scale. Legal teams need to search across disparate sources simultaneously, reconstruct timelines from fragmented footage, and identify critical moments without watching every second of video.

This tutorial demonstrates how to build a legal evidence investigation platform using TwelveLabs through AWS Bedrock for video intelligence and NVIDIA NeMo Retriever for document search. The result: investigators search 12 video sources with natural language queries ("find the red sedan" or "show me when the person entered the building"), get ranked results with exact timestamps, and generate structured compliance reports in minutes instead of days.

What you'll build: A multi-source evidence investigator that ingests mixed-format surveillance footage, enables cross-source semantic search, performs automated compliance analysis, and reconstructs chronological timelines from disparate video sources.

Time investment: 40 hours of evidence review becomes 4 hours of targeted investigation.

You can explore the demo of the application here: Legal Evidence Investigator Application

You can check out the source code here: GitHub Repository


Demo Application

This demonstration shows how the platform handles real investigative workflows: searching across multiple video sources, identifying critical moments with precise timestamps, generating structured compliance analysis, and providing an interactive Q&A interface for deeper investigation.

Key capabilities demonstrated:

  • Cross-source search across 12+ disparate video formats

  • Entity tracking (find specific person/vehicle across all sources)

  • Automated compliance analysis with risk categorization

  • Timeline reconstruction showing sequential evidence

  • Conversational video Q&A with clickable timestamps


System Architecture: Why Single-Index Multi-Source Design Matters

The core architectural decision in this application addresses a fundamental constraint: TwelveLabs search operates at the index level. You search one index per request, not individual videos. This shapes everything.

The naive approach would create one index per video source. This fails immediately: you'd need 12 separate search requests to find a person across 12 cameras, then manually merge and sort results in application logic. Slow, complex, brittle.

The correct approach uses a single-index multi-source strategy: all evidence videos live in one TwelveLabs index, differentiated by rich metadata tagging. One search query hits all sources simultaneously. Results come back pre-ranked by relevance, grouped by source video, with metadata filters enabling scoped searches when needed ("only bodycam footage" or "only footage from Main Street location").

System components:

  1. Video ingestion pipeline: Upload mixed-format footage to S3 → Index via twelvelabs.marengo-embed-3-0-v1:0 through Bedrock → Store multimodal embeddings with source metadata

  2. Document ingestion pipeline: Extract text from PDFs → Chunk intelligently → Embed via nvidia/llama-nemotron-embed-vl-1b-v2 → Store in document index

  3. Hybrid retrieval layer: Parallel search across video embeddings (Marengo) and document chunks (NeMo) → Merge results → Return unified response

  4. Analysis engine: Generate structured compliance reports via twelvelabs.pegasus-1-2-v1:0 including title, risk categories, detected objects, face timelines, and transcript segments

  5. Conversational interface: Video Q&A with clickable timestamp citations

This architecture delivers cross-source search without sacrificing performance. Single-request latency stays under 3 seconds even with 12 indexed videos because the index-level search pattern is optimized for exactly this scenario.
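The metadata-scoped search pattern behind this design can be sketched in plain Python. The record shape and tag names (source_type, location) below are illustrative assumptions, not the repository's actual schema:

```python
def filter_records(records: list[dict], **filters) -> list[dict]:
    """Return only records whose metadata matches every given filter."""
    return [
        r for r in records
        if all(r.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]

# One index holds every source, differentiated by metadata tags.
index = [
    {"id": "vid-1", "metadata": {"source_type": "bodycam", "location": "Main St"}},
    {"id": "vid-2", "metadata": {"source_type": "cctv", "location": "Main St"}},
    {"id": "vid-3", "metadata": {"source_type": "dashcam", "location": "5th Ave"}},
]

# Unscoped search hits all sources; scoped search narrows by tag.
assert len(filter_records(index)) == 3
assert [r["id"] for r in filter_records(index, source_type="bodycam")] == ["vid-1"]
assert [r["id"] for r in filter_records(index, location="Main St")] == ["vid-1", "vid-2"]
```

Because scoping is a metadata filter rather than a separate index, "only bodycam footage" and "all sources" are the same query path with different filters.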


Preparation: What You Need Before Building


1 - AWS Bedrock Access for TwelveLabs Models

Set up AWS credentials with permissions for:

  • Amazon Bedrock runtime and model access

  • S3 for video storage and Bedrock async output

  • Access to these TwelveLabs models through Bedrock:

    • twelvelabs.marengo-embed-3-0-v1:0 (multimodal video embeddings)

    • twelvelabs.pegasus-1-2-v1:0 (video analysis and reasoning)

Why these models: Marengo generates the searchable representations of video content (the "encoder"), while Pegasus performs reasoning and structured analysis (the "interpreter"). You need both: Marengo makes video searchable, Pegasus makes it understandable.


2 - S3 Bucket Configuration

Create one S3 bucket structured for:

  • Video uploads and storage

  • Bedrock async embedding output location (required for batch jobs)

  • Generated thumbnails and analysis artifacts

  • Document storage

Why S3-based: Bedrock's async embedding API requires S3 input/output locations. This also enables scalable storage without hitting local disk limits.


3 - NVIDIA API Key

Get an NVIDIA API key to access:

  • nvidia/llama-nemotron-embed-vl-1b-v2 for document chunk embeddings

Why NeMo Retriever: Legal evidence isn't just video; it's police reports, witness statements, insurance forms. NeMo handles text retrieval while TwelveLabs handles video, giving you multimodal search across all evidence types.


4 - Clone and Configure
git clone https://github.com/Hrishikesh332/tl-compliance-intelligence
cd tl-compliance-intelligence

Follow backend environment setup in .env.example:

  1. AWS credentials (Access Key ID, Secret Access Key, region)

  2. S3 bucket name and Bedrock output path

  3. NVIDIA API key

  4. Application-specific settings (index configuration, match thresholds)


Implementation Deep Dive


Part 1: Video Ingestion and Embedding Generation

The ingestion pipeline transforms uploaded surveillance footage into searchable multimodal embeddings. This happens asynchronously because Bedrock's embedding jobs can take several minutes for hour-long videos.


1.1 - Starting the Marengo Embedding Job

When a video uploads, the system immediately stores it in S3 and kicks off a Bedrock async embedding job:

Source: backend/app/services/bedrock_marengo.py (Line 111)

def start_video_embedding(
    s3_uri: str,
    output_s3_uri: str,
    bucket_owner: str | None = None,
) -> dict:
    client = get_bedrock_client()
    owner = bucket_owner
    body = {
        "inputType": "video",
        "video": {
            "mediaSource": media_source_s3(s3_uri, owner),
            "embeddingOption": ["visual", "audio"],
            "embeddingScope": ["clip", "asset"],
        },
    }
    resp = client.start_async_invoke(
        modelId=MARENGO_MODEL_ID,
        modelInput=body,
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return {
        "invocation_arn": resp.get("invocationArn", ""),
        "status": "pending",
    }

Why embeddingScope: ["clip", "asset"] matters: This generates embeddings at two levels:

  • Clip embeddings (6-second segments): Enable granular search, pinpointing the exact 6-second window where a person appears

  • Asset embeddings (whole video): Enable video-level similarity and grouping

Both are necessary: clip embeddings power precise temporal search, asset embeddings enable "find videos similar to this one" workflows.

Why async processing is required: A 2-hour video generates 1,200 clip embeddings (7,200 seconds ÷ 6 seconds per clip). This takes time. Async processing lets users upload multiple videos simultaneously without blocking the UI while embeddings are generated in the background.


1.2 - Background Job Queue and Completion Polling

The system maintains an in-memory queue of pending embedding jobs and polls Bedrock for completion:

Source: backend/app/utils/video_helpers.py (Line 104)

while True:
    job = bedrock_queue.get()
    task_id = job["task_id"]
    s3_uri = job["s3_uri"]
    output_uri = job["output_uri"]
    filename = job["filename"]
    meta = job["meta"]
    log.info("[QUEUE] Processing Bedrock start for %s (%s)", filename, task_id)
    success = False
    for attempt in range(1, max_retries + 1):
        try:
            result = start_video_embedding(s3_uri, output_uri)
            arn = result.get("invocation_arn", "")
            log.info("[QUEUE] Bedrock started for task_id=%s", task_id)
            video_tasks[task_id]["status"] = "indexing"
            video_tasks[task_id]["invocation_arn"] = arn
            video_tasks[task_id]["output_s3_uri"] = output_uri
            for rec in vs_index():
                if rec.get("id") == task_id:
                    rec.setdefault("metadata", {})["status"] = "indexing"
                    break
            vs_save()
            with bedrock_poller_lock:
                bedrock_poller_jobs.append({
                    "task_id": task_id,
                    "invocation_arn": arn,
                    "output_s3_uri": output_uri,
                    "started_at": time.monotonic(),
                })
            success = True
            break

What this accomplishes: The queue processor updates task status from "pending" to "indexing," stores the Bedrock invocation ARN for tracking, and hands off to a separate polling thread. That poller checks job completion every 30 seconds, loads finished embeddings from S3, and marks videos as "ready" when processing completes.

Why retry logic matters: Bedrock has rate limits. The retry mechanism with exponential backoff ensures jobs eventually succeed even during high-volume uploads, preventing silent failures that would leave videos in permanent "pending" state.
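A minimal sketch of that retry pattern, with a generic callable and an injectable sleep function so the backoff can be exercised without waiting (the names here are illustrative, not the repo's helpers):

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying with exponential backoff (1x, 2x, 4x base_delay)."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure instead of hiding it
            sleep(base_delay * 2 ** (attempt - 1))

# Usage: a job that fails twice (e.g. throttled) before succeeding.
calls = {"n": 0}

def flaky_start():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("ThrottlingException")
    return {"invocation_arn": "arn:demo", "status": "pending"}

result = with_retries(flaky_start, max_retries=5, sleep=lambda _: None)
assert result["status"] == "pending" and calls["n"] == 3
```

Surfacing the final exception (rather than swallowing it) is what prevents the silent "permanent pending" failure mode described above.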

Design pattern: Videos and entities share the same vector index, differentiated by type metadata. Documents use a separate chunk-based index. This gives unified video+entity search while keeping document retrieval independent.


1.3 - Document Ingestion with Semantic Chunking

Documents require different processing than video. PDFs are extracted, split into semantic chunks (by section when possible), embedded via NeMo, and stored as searchable records:

Source: backend/app/services/nemo_retriever.py (Line 688)

def ingest_document(file_path: str, doc_id: str, filename: str) -> dict:
    extra: dict = {}
    try:
        pdf_info = store_pdf_document(file_path, doc_id)
        extra.update(pdf_info)
    except Exception as exc:
        log.warning("Could not persist PDF for %s (%s)", doc_id, type(exc).__name__)
    ext = os.path.splitext(file_path)[1].lower()
    chunks: list[str] = []
    sections: list[str] = []
    if ext == ".pdf":
        try:
            pairs = split_into_semantic_chunks(file_path)
            sections = [s for s, _ in pairs]
            chunks = [t for _, t in pairs]
            log.info("Smart PDF chunking produced %d chunks for doc %s", len(chunks), doc_id)
        except Exception as exc:
            log.warning("Smart PDF extraction failed for doc %s, falling back to NeMo (%s)", doc_id, type(exc).__name__)
    if not chunks:
        chunks = extract_document(file_path)
    if not chunks:
        log.warning("No content extracted for doc %s", doc_id)
        return {"doc_id": doc_id, "chunks": 0, "status": "empty"}
    embeddings = embed_texts(chunks)
    add_chunks(
        doc_id, filename, chunks, embeddings,
        sections=sections or None,
        extra_metadata=extra or None,
    )
    return {"doc_id": doc_id, "chunks": len(chunks), "status": "ready"}

Smart chunking strategy: The system attempts section-based splitting first (preserving document structure), then falls back to paragraph-based chunking if structure detection fails. This matters for legal documents where section headers ("Incident Timeline," "Witness Statements") provide important retrieval context.

Embedding model selection: nvidia/llama-nemotron-embed-vl-1b-v2 generates embeddings optimized for passage retrieval. Each chunk gets its own vector, enabling precise document search at the paragraph level rather than whole-document matching.

Metadata preservation: The system stores original section headers and document links alongside chunk text, so retrieval results can point investigators back to the source paragraph in the original PDF.
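A minimal sketch of the section-first, paragraph-fallback strategy described above. The header list and helper name are illustrative, not the repo's split_into_semantic_chunks:

```python
import re

def chunk_by_sections(text: str, headers: list[str]) -> list[str]:
    """Split on known section headers; fall back to paragraph chunks."""
    pattern = "|".join(re.escape(h) for h in headers)
    # Zero-width split keeps each header attached to its own chunk.
    parts = [p.strip() for p in re.split(f"(?=(?:{pattern}))", text) if p.strip()]
    if len(parts) > 1:
        return parts
    # No headers detected: fall back to paragraph-based chunking.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

report = (
    "Overview of the case.\n\n"
    "Incident Timeline\nAt 14:02 the vehicle entered the lot.\n\n"
    "Witness Statements\nThe driver reported low visibility."
)
chunks = chunk_by_sections(report, ["Incident Timeline", "Witness Statements"])
assert len(chunks) == 3 and chunks[1].startswith("Incident Timeline")
```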

Embedding generation via NVIDIA API:

Source: backend/app/services/nemo_retriever.py (Line 515)

def embed_via_requests(texts: list[str], input_type: str) -> list[list[float]]:
    """Call NVIDIA embeddings API directly with requests"""
    import requests
    t0 = time.perf_counter()

    resp = requests.post(
        "https://integrate.api.nvidia.com/v1/embeddings",
        headers={
            "Authorization": f"Bearer {NVIDIA_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "input": texts,
            "model": EMBED_MODEL,
            "encoding_format": "float",
            "input_type": input_type,
        },
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    vectors = [d["embedding"] for d in data["data"]]
    first_dim = len(vectors[0]) if vectors else 0
    return vectors

The input_type parameter: NeMo distinguishes between "passage" (document chunks being indexed) and "query" (search requests). This improves retrieval accuracy by optimizing embeddings for their intended use: passages are embedded for storage, queries are embedded for matching.


1.4 - Creating Searchable Entity Records from Face Images

Legal investigations often require tracking specific individuals across multiple camera sources. The entity system enables this by converting face images into searchable embeddings:

Image embedding generation:

def embed_image(media_source: dict) -> list[float]:
    return invoke_embedding_model({
        "inputType": "image",
        "image": {"mediaSource": media_source},
    })

This uses Marengo's image embedding capability to generate a visual representation of a face. Unlike text embeddings, image embeddings capture visual features directly (facial structure, appearance, clothing) making them suitable for visual matching across video frames.


Entity creation from uploaded face image:

@entities_bp.route("/entities/from-image", methods=["POST"])
def api_entities_from_image():
    if "image" not in request.files:
        return jsonify({"error": "No 'image' file provided"}), 400

    file = request.files["image"]
    if file.filename == "":
        return jsonify({"error": "Empty filename"}), 400

    data = request.form or {}
    name = data.get("name") or (request.get_json(silent=True) or {}).get("name") or ""
    if not name.strip():
        return jsonify({"error": "Missing 'name'"}), 400

    image_bytes = file.read()
    faces = detect_and_crop_faces(image_bytes, min_confidence=ENTITY_FACE_MIN_CONFIDENCE)
    if not faces:
        return jsonify(
            {"error": "No face detected in image. Use a clear, front-facing photo with good lighting."}
        ), 404

    best = faces[0]
    face_b64 = best["image_base64"]
    embed_b64 = best.get("embedding_crop_base64") or face_b64

    import base64

    face_bytes = base64.b64decode(embed_b64)
    media = media_source_base64(face_bytes)

    try:
        embedding = embed_image(media)
    except Exception:
        return jsonify({"error": "Internal server error"}), 500

    entity_id = name.strip().lower().replace(" ", "-")
    rec = index_add(
        id=entity_id,
        embedding=embedding,
        metadata={"name": name.strip(), "face_snap_base64": face_b64},
        type="entity",
    )

    return jsonify(
        {
            "indexId": FIXED_INDEX_ID,
            "entity": {"id": rec["id"], "name": name.strip()},
            "face_snap_base64": face_b64,
        }
    )

The face detection step: Uses OpenCV's ResNet10 SSD detector to isolate faces before embedding. This preprocessing ensures Marengo receives a clean face crop rather than a full scene, improving match accuracy during search. The detector returns confidence scores; only faces above the minimum threshold are processed.

Why entity records live in the video index: Storing entity embeddings alongside video embeddings enables direct similarity search. When an investigator searches for "person-of-interest-x," the system compares that entity's embedding against all video clip embeddings in one retrieval operation.


Part 2: Cross-Source Search and Retrieval

Once videos and documents are indexed, the platform enables unified search across all sources. This is where the single-index architecture delivers its value: one query, all sources, ranked results.


2.1 - Video Content Search with Multimodal Embeddings

The search flow supports three query types through a single interface:

  1. Text queries: Natural language descriptions ("person in red jacket")

  2. Image queries: Upload a screenshot, find matching footage

  3. Entity queries: Search for a previously registered person/object

Source: backend/app/routes/search.py (Line 30)

def search_video_index(
    data: dict,
    *,
    request_query: str = "",
    request_top_k: int | None = None,
    image_bytes: bytes | None = None,
) -> tuple[list[dict], str, str | None]:

    query_emb, display_query, is_entity_search, err = get_search_embedding_from_request(
        data,
        request_query=request_query,
        image_bytes=image_bytes,
    )

Input normalization: get_search_embedding_from_request() handles all query types and returns a single embedding vector. Text queries get embedded via Marengo's text encoder, image queries via image encoder, entity queries retrieve the stored embedding directly. The search logic downstream doesn't care about input type; it just sees a vector to match.

Entity search optimization:

for r in results:
    meta = r.get("metadata", {})
    clips = []
    output_uri = meta.get("output_s3_uri") or f"{S3_EMBEDDINGS_OUTPUT}/{r['id']}"

    if is_entity_search:
        clips = clips_above_threshold(
            query_emb,
            output_uri,
            min_score=ENTITY_CLIP_MIN_SCORE,
            visual_only=True,
            max_clips=clips_per_video,
        )

    if not clips:
        clips = clip_search(
            query_emb,
            output_uri,
            top_n=clips_per_video,
            min_score=clip_min_score,
            visual_only=is_entity_search,
        )

The scoring difference: Entity searches use visual_only=True and higher match thresholds because face matching requires stronger visual similarity than general content search. Text queries can match via audio transcription or visual content (either modality counts). Entity queries must match visually.

Clip-level grounding:

out.append({
    "id": r["id"],
    "score": r["score"],
    "metadata": meta,
    "clips": clips,
})

return out, display_query, None

Results include not just which videos match, but exactly where in each video the match occurs. An investigator searching for "person exiting vehicle" gets results like:

  • Video: "Dashcam - Main St" → Clips at 00:03:42-00:03:48, 00:07:15-00:07:21

  • Video: "CCTV - Parking Lot" → Clip at 00:12:03-00:12:09

This clip-level precision eliminates "search for needle, still get entire haystack" problems common in video search systems.


2.2 - Entity-Aware Video Search: Finding People Across Sources

Entity search enables investigators to track specific individuals across multiple unrelated camera sources. Upload a face image once, search all footage:

How it works:

  1. Entity embedding (generated during entity creation) is loaded from index

  2. System retrieves clip embeddings for each indexed video from S3 Bedrock output

  3. Similarity scoring identifies clips where the entity likely appears

  4. Videos are ranked by match strength and consistency (how many clips matched, how strong the scores)

Why this is faster than runtime face detection: Pre-computing clip embeddings during ingestion means search only performs similarity comparison, not face detection. Comparing 10,000 clip embeddings against an entity embedding takes milliseconds. Running face detection on 10,000 clips would take hours.

Match threshold tuning: The system uses ENTITY_CLIP_MIN_SCORE to filter weak matches. Setting this too low produces false positives (similar-looking people). Setting it too high misses valid matches (same person in different lighting/angles). The demo uses 0.75 as a balance point; your production system should make this user-configurable.


2.3 - Hybrid Search: Video + Document Retrieval

Legal evidence isn't just video. Investigators need to cross-reference footage with police reports, witness statements, and insurance forms. Hybrid search runs video and document retrieval in parallel:


Document search implementation:

def search_document_index(text_query: str, doc_top_k: int) -> list[dict]:
    from app.services.nemo_retriever import embed_query, search_docs
    t0 = time.perf_counter()
    log.info("[DOC_SEARCH] Started doc search top_k=%d", doc_top_k)
    query_emb = embed_query(text_query)
    docs = search_docs(query_emb, top_k=doc_top_k)
    return docs

Parallel execution pattern: The hybrid search handler starts video and document searches simultaneously using threading, then merges results once both complete. This keeps total latency close to the slower of the two searches rather than their sum.

Result merging strategy: Video results and document results are kept separate in the response (not interleaved by score) because they serve different investigative purposes. Videos provide visual evidence, documents provide corroborating narrative. Investigators review them differently.
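A minimal sketch of that parallel pattern using concurrent.futures (the function names and stub results are illustrative, not the repo's handler):

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query, search_videos, search_documents):
    """Run video and document retrieval concurrently; keep results separate
    so visual and narrative evidence can be reviewed side by side."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(search_videos, query)
        doc_future = pool.submit(search_documents, query)
        # Total latency is close to the slower of the two searches.
        return {
            "videos": video_future.result(),
            "documents": doc_future.result(),
        }

# Usage with stubbed search functions:
result = hybrid_search(
    "red sedan",
    lambda q: [{"id": "vid-1", "score": 0.91}],
    lambda q: [{"doc_id": "police-report", "score": 0.84}],
)
assert result["videos"][0]["id"] == "vid-1"
assert result["documents"][0]["doc_id"] == "police-report"
```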

For complete hybrid search implementation: View source


Part 3: Structured Compliance Analysis and Reporting

Search finds evidence. Analysis interprets it. The platform uses Pegasus through Bedrock to generate structured compliance reports from raw footage:


3.1 - Generating Structured Video Analysis

Pegasus transforms unstructured video into structured legal documentation: categorized risk levels, detected persons, timestamped transcripts, and compliance summaries.

Source: backend/app/services/bedrock_pegasus.py (Line 72)

body: dict = {
    "inputPrompt": prompt[:4000],
    "mediaSource": {
        "s3Location": {
            "uri": s3_uri,
            "bucketOwner": owner,
        }
    },
}
if temperature is not None:
    body["temperature"] = temperature
if response_schema is not None:
    body["responseFormat"] = {"jsonSchema": response_schema}

Temperature setting: Using temperature: 0 for compliance analysis ensures deterministic, repeatable output. The same video always produces the same analysis structure, critical for legal documentation where consistency matters.

Response schema enforcement: The jsonSchema parameter instructs Pegasus to return structured JSON rather than free-form text. This guarantees parse-able output and eliminates post-processing fragility.
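To make the jsonSchema parameter concrete, here is a minimal sketch of a compliance-analysis schema. The field names are illustrative assumptions, not the repository's exact response format:

```python
# Hypothetical response schema passed as response_schema above; the
# property names here are examples only.
analysis_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "risk_factors": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["title", "risk_level", "summary"],
}

assert analysis_schema["required"] == ["title", "risk_level", "summary"]
```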

Bedrock model invocation:

response = client.invoke_model(
    modelId=model_id,
    body=payload,
    contentType="application/json",
    accept="application/json",
)

The response contains generated text extracted from Bedrock's JSON response format. This text represents Pegasus's interpretation of the video content based on the provided prompt instructions.

Analysis prompt and parsing:

Source: backend/app/routes/videos.py (Line 335)

raw_text = pegasus_analyze_video(
    s3_uri,
    get_video_analysis_prompt(),
    temperature=0,
)
log.info("[ANALYSIS] Pegasus response received in %.1fs (len=%d)", time.perf_counter() - t0, len(raw_text or ""))
analysis_dict = parse_video_analysis_response(raw_text)

The analysis prompt (view source) instructs Pegasus to:

  • Categorize the video content (traffic incident, workplace safety, criminal activity, etc.)

  • Identify risk levels and specific risk factors

  • Extract timestamped transcript segments

  • Detect persons of interest with descriptions

  • Generate a compliance-focused summary

Parsing strategy: parse_video_analysis_response() handles malformed JSON gracefully (recovers triple-backtick wrapping, trailing commas, incomplete responses) because LLM outputs aren't always perfectly formatted. Robust parsing prevents analysis failures from minor formatting issues.
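A minimal sketch of that lenient parsing approach (the exact recovery rules in the repo's parse_video_analysis_response may differ):

```python
import json
import re

def parse_lenient_json(raw: str) -> dict:
    """Recover JSON from LLM output: strip code-fence wrapping and
    trailing commas before parsing."""
    text = raw.strip()
    fence = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fence:
        text = fence.group(1)  # unwrap ```json ... ``` blocks
    text = re.sub(r",\s*([}\]])", r"\1", text)  # drop trailing commas
    return json.loads(text)

assert parse_lenient_json('```json\n{"risk_level": "high",}\n```') == {
    "risk_level": "high"
}
```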

Transcript generation: The system runs a separate Pegasus call optimized for transcription, requesting timestamped segments in structured JSON. This produces ordered transcript entries with start/end times, enabling investigators to jump directly to spoken content.


3.2 - Object Detection and Face Keyframe Extraction

Beyond basic transcription, the platform identifies detected objects and useful face keyframes with precise timestamps (the specific moments where faces appear clearly):

Source: backend/app/routes/videos.py (Line 707)

raw_response = pegasus_analyze_video(s3_uri, get_detect_prompt())
log.info("[INSIGHTS] Pegasus response received in %.1fs (%d chars)", time.perf_counter() - t0, len(raw_response or ""))

detect_data = parse_detect_response(raw_response)
objects_raw = detect_data["objects"]
face_keyframes = detect_data["face_keyframes"]

The detection prompt (view source) asks Pegasus to identify:

  • Objects: Vehicles, weapons, physical evidence, environmental details

  • Face keyframes: Timestamps where faces appear with sufficient clarity for identification

Why keyframes matter: Not every frame containing a face is useful. Faces seen from behind, in motion blur, or poorly lit don't help identification. Pegasus selects keyframes where faces are frontal, well-lit, and clear; the frames an investigator would actually screenshot for evidence.

Post-processing: Once Pegasus returns keyframes and objects, the system extracts those specific frames, generates thumbnails, and stores them for quick review. This eliminates re-processing video every time an investigator needs to see a face.


3.3 - Face Presence Timeline: Where Does This Person Appear?

After detecting faces, the system builds a presence timeline showing when each detected person appears throughout the video:

Source: backend/app/routes/videos.py (Line 1027)

if use_marengo:
    # Marengo-based presence, match each face embedding to clip embeddings
    for j, emb in enumerate(face_embeddings):
        if not emb:
            continue
        clips = clips_above_threshold(
            emb,
            output_uri,
            min_score=FACE_PRESENCE_MATCH_THRESHOLD,
            visual_only=True,
            max_clips=50,
        )
        for clip in clips:
            c_start = float(clip.get("start", 0.0))
            c_end = float(clip.get("end", c_start + 0.5))
            for i in range(n_segments):
                s0 = i * seg_dur
                s1 = (i + 1) * seg_dur
                if c_end > s0 and c_start < s1:
                    presence_by_face[j]["segment_presence"][i] = 1

How the timeline is constructed:

  1. Video is divided into fixed-duration segments (e.g., 30-second windows)

  2. Each detected face gets its embedding compared against all clip embeddings

  3. Clips scoring above threshold are marked as "person present"

  4. Matched clips are mapped onto the segment timeline

  5. Result: A binary presence map showing which segments contain each person

Why this matters for investigations: An investigator can see at a glance: "Person A appears in segments 3, 7, 12, and 18" without watching the entire video. Click segment 7, jump directly to that 30-second window, confirm visual match.

The threshold trade-off: FACE_PRESENCE_MATCH_THRESHOLD controls sensitivity. Higher values reduce false positives but may miss valid appearances (different angles, lighting changes). Lower values catch more instances but produce more false matches requiring manual review. The demo uses 0.70; production systems should let investigators adjust this per-case.
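The clip-to-segment overlap test in steps 3-4 above can be sketched as a self-contained presence map (the segment duration and clip times are illustrative):

```python
import math

def presence_map(clips, video_dur_s, seg_dur_s=30.0):
    """Mark each fixed-duration segment where any matched clip overlaps."""
    n = math.ceil(video_dur_s / seg_dur_s)
    segments = [0] * n
    for start, end in clips:
        for i in range(n):
            s0, s1 = i * seg_dur_s, (i + 1) * seg_dur_s
            if end > s0 and start < s1:  # any overlap marks the segment
                segments[i] = 1
    return segments

# Matched clips at 12-18s and 95-101s in a 2-minute video, 30s segments:
assert presence_map([(12, 18), (95, 101)], 120) == [1, 0, 0, 1]
```

The binary map is what lets an investigator jump straight to the segments where a person appears instead of scrubbing the full video.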


Operational Considerations for Production Deployment

This demo proves the concept. Deploying to production requires addressing several operational concerns:

  1. Scalability: The single-index approach works for demos (12 videos) but needs architectural adjustments for production loads (1,000+ videos per case). Consider index partitioning strategies or migrating to a dedicated vector database for clip-level embeddings.

  2. Cost management: Bedrock charges per embedding generation. A 2-hour video generates ~1,200 clip embeddings. Processing 100 hours of footage per case means 60,000+ embeddings. Plan for batch processing, caching strategies, and reuse of embeddings across related cases.

  3. Security and compliance: Legal evidence requires chain-of-custody tracking, access controls, and audit logs. The demo stores videos in S3; production needs encryption at rest, role-based access, and immutable audit trails.

  4. Accuracy validation: Entity matching and timeline reconstruction should include confidence scores and require human review before being submitted as legal evidence. AI-generated analysis supports investigators but doesn't replace human judgment.

  5. Format handling: The demo assumes standard video codecs. Production systems must handle corrupted files, non-standard formats, encrypted footage, and low-quality sources gracefully.
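As a quick sanity check on the volume figures in point 2, the embedding count follows directly from the 6-second clip granularity:

```python
import math

CLIP_SECONDS = 6  # Marengo clip embedding granularity

def clip_embeddings_for(hours: float) -> int:
    """Clip embeddings generated for a given duration of footage."""
    return math.ceil(hours * 3600 / CLIP_SECONDS)

assert clip_embeddings_for(2) == 1200      # one 2-hour video
assert clip_embeddings_for(100) == 60000   # 100 hours of case footage
```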


Conclusion

This application demonstrates how video intelligence transforms legal evidence review from a time-intensive manual process into a targeted investigation workflow. By combining TwelveLabs through AWS Bedrock for video understanding with NeMo Retriever for document search, investigators can:

  • Search 12+ disparate video sources simultaneously with natural language queries

  • Track specific individuals or vehicles across unrelated camera feeds

  • Reconstruct chronological timelines from fragmented footage

  • Generate structured compliance analysis with risk categorization

  • Access transcript segments and detected objects with precise timestamps

The efficiency gain: Manual review of 40 hours of multi-source footage takes an investigator 40-60 hours. This platform reduces that to 4-6 hours of targeted review so that investigators spend time validating findings instead of hunting for them.

For legal tech ISVs building evidence management platforms, this architecture provides a reference implementation showing how TwelveLabs integrates into existing workflows without requiring wholesale platform rewrites. The single-index multi-source pattern, hybrid video+document retrieval, and structured analysis capabilities translate directly to production legal tech products.


Additional Resources

  1. TwelveLabs on AWS Bedrock: Learn more about model access

  2. NeMo Retriever documentation: Explore document retrieval capabilities

  3. TwelveLabs sample applications: Browse additional use cases

  4. Join the TwelveLabs community: Discord


Next steps:

  1. Clone the reference implementation

  2. Configure AWS Bedrock access and test with your video sources

  3. Adapt the single-index architecture to your specific legal workflows

  4. Integrate with your existing evidence management systems

Introduction

Legal teams processing video evidence face a compounding problem. A single case can involve 40+ hours of footage from dashcams, bodycams, CCTV systems, doorbell cameras, and insurance submissions, all in different formats, resolutions, and timestamps. Investigators spend $200-500/hour in paralegal time manually reviewing this content. Miss a critical 10-second clip buried in hour 23 of camera feed 7, and the case outcome changes.

The traditional approach (manual review, basic metadata tagging, frame-by-frame analysis) doesn't scale. Legal teams need to search across disparate sources simultaneously, reconstruct timelines from fragmented footage, and identify critical moments without watching every second of video.

This tutorial demonstrates how to build a legal evidence investigation platform using TwelveLabs through AWS Bedrock for video intelligence and NVIDIA NeMo Retriever for document search. The result: investigators search 12 video sources with natural language queries ("find the red sedan" or "show me when the person entered the building"), get ranked results with exact timestamps, and generate structured compliance reports in minutes instead of days.

What you'll build: A multi-source evidence investigator that ingests mixed-format surveillance footage, enables cross-source semantic search, performs automated compliance analysis, and reconstructs chronological timelines from disparate video sources.

Time investment: 40 hours of evidence review becomes 4 hours of targeted investigation.

You can explore the demo of the application here: Legal Evidence Investigator Application

You can check out the source code here: GitHub Repository


Demo Application

This demonstration shows how the platform handles real investigative workflows: searching across multiple video sources, identifying critical moments with precise timestamps, generating structured compliance analysis, and providing an interactive Q&A interface for deeper investigation.

Key capabilities demonstrated:

  • Cross-source search across 12+ disparate video formats

  • Entity tracking (find specific person/vehicle across all sources)

  • Automated compliance analysis with risk categorization

  • Timeline reconstruction showing sequential evidence

  • Conversational video Q&A with clickable timestamps


System Architecture: Why Single-Index Multi-Source Design Matters

The core architectural decision in this application addresses a fundamental constraint: TwelveLabs search operates at the index level. You search one index per request, not individual videos. This shapes everything.

The naive approach would create one index per video source. This fails immediately: you'd need 12 separate search requests to find a person across 12 cameras, then manually merge and sort results in application logic. Slow, complex, brittle.

The correct approach uses a single-index multi-source strategy: all evidence videos live in one TwelveLabs index, differentiated by rich metadata tagging. One search query hits all sources simultaneously. Results come back pre-ranked by relevance, grouped by source video, with metadata filters enabling scoped searches when needed ("only bodycam footage" or "only footage from Main Street location").
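The metadata-scoping pattern can be sketched in a few lines. This is an illustrative toy, not the application's actual schema: the record shape, the source_type and location fields, and the pre-computed scores are assumptions standing in for real Marengo similarity results.

```python
# Toy sketch: one index, many sources, scoped by metadata filters.
def scoped_search(records, query_scores, source_type=None, location=None, top_k=5):
    """Rank all records by pre-computed similarity, optionally filtered by metadata."""
    hits = []
    for rec in records:
        meta = rec.get("metadata", {})
        if source_type and meta.get("source_type") != source_type:
            continue
        if location and meta.get("location") != location:
            continue
        hits.append({"id": rec["id"], "score": query_scores[rec["id"]], "metadata": meta})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

records = [
    {"id": "v1", "metadata": {"source_type": "bodycam", "location": "Main St"}},
    {"id": "v2", "metadata": {"source_type": "cctv", "location": "Main St"}},
    {"id": "v3", "metadata": {"source_type": "dashcam", "location": "5th Ave"}},
]
scores = {"v1": 0.82, "v2": 0.91, "v3": 0.40}

print([h["id"] for h in scoped_search(records, scores)])                      # all sources, one query
print([h["id"] for h in scoped_search(records, scores, source_type="cctv")])  # scoped search
```

One query ranks every source; adding a filter narrows the same ranked set rather than issuing per-source requests.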

System components:

  1. Video ingestion pipeline: Upload mixed-format footage to S3 → Index via twelvelabs.marengo-embed-3-0-v1:0 through Bedrock → Store multimodal embeddings with source metadata

  2. Document ingestion pipeline: Extract text from PDFs → Chunk intelligently → Embed via nvidia/llama-nemotron-embed-vl-1b-v2 → Store in document index

  3. Hybrid retrieval layer: Parallel search across video embeddings (Marengo) and document chunks (NeMo) → Merge results → Return unified response

  4. Analysis engine: Generate structured compliance reports via twelvelabs.pegasus-1-2-v1:0 including title, risk categories, detected objects, face timelines, and transcript segments

  5. Conversational interface: Video Q&A with clickable timestamp citations

This architecture delivers cross-source search without sacrificing performance. Single-request latency stays under 3 seconds even with 12 indexed videos because the index-level search pattern is optimized for exactly this scenario.


Preparation: What You Need Before Building


1 - AWS Bedrock Access for TwelveLabs Models

Set up AWS credentials with permissions for:

  • Amazon Bedrock runtime and model access

  • S3 for video storage and Bedrock async output

  • Access to these TwelveLabs models through Bedrock:

    • twelvelabs.marengo-embed-3-0-v1:0 (multimodal video embeddings)

    • twelvelabs.pegasus-1-2-v1:0 (video analysis and reasoning)

Why these models: Marengo generates the searchable representations of video content (the "encoder"), while Pegasus performs reasoning and structured analysis (the "interpreter"). You need both: Marengo makes video searchable, Pegasus makes it understandable.


2 - S3 Bucket Configuration

Create one S3 bucket structured for:

  • Video uploads and storage

  • Bedrock async embedding output location (required for batch jobs)

  • Generated thumbnails and analysis artifacts

  • Document storage

Why S3-based: Bedrock's async embedding API requires S3 input/output locations. This also enables scalable storage without hitting local disk limits.


3 - NVIDIA API Key

Get an NVIDIA API key to access:

  • nvidia/llama-nemotron-embed-vl-1b-v2 for document chunk embeddings

Why NeMo Retriever: Legal evidence isn't just video; it's police reports, witness statements, insurance forms. NeMo handles text retrieval while TwelveLabs handles video, giving you multimodal search across all evidence types.


4 - Clone and Configure
git clone https://github.com/Hrishikesh332/tl-compliance-intelligence
cd tl-compliance-intelligence

Follow backend environment setup in .env.example:

  1. AWS credentials (Access Key ID, Secret Access Key, region)

  2. S3 bucket name and Bedrock output path

  3. NVIDIA API key

  4. Application-specific settings (index configuration, match thresholds)


Implementation Deep Dive


Part 1: Video Ingestion and Embedding Generation

The ingestion pipeline transforms uploaded surveillance footage into searchable multimodal embeddings. This happens asynchronously because Bedrock's embedding jobs can take several minutes for hour-long videos.


1.1 - Starting the Marengo Embedding Job

When a video uploads, the system immediately stores it in S3 and kicks off a Bedrock async embedding job:

Source: backend/app/services/bedrock_marengo.py (Line 111)

def start_video_embedding(
    s3_uri: str,
    output_s3_uri: str,
    bucket_owner: str | None = None,
) -> dict:
    client = get_bedrock_client()
    owner = bucket_owner
    body = {
        "inputType": "video",
        "video": {
            "mediaSource": media_source_s3(s3_uri, owner),
            "embeddingOption": ["visual", "audio"],
            "embeddingScope": ["clip", "asset"],
        },
    }
    resp = client.start_async_invoke(
        modelId=MARENGO_MODEL_ID,
        modelInput=body,
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return {
        "invocation_arn": resp.get("invocationArn", ""),
        "status": "pending",
    }

Why embeddingScope: ["clip", "asset"] matters: This generates embeddings at two levels:

  • Clip embeddings (6-second segments): Enable granular search, locating the exact 8-second window where a person appears

  • Asset embeddings (whole video): Enable video-level similarity and grouping

Both are necessary: clip embeddings power precise temporal search, asset embeddings enable "find videos similar to this one" workflows.

Why async processing is required: A 2-hour video generates 1,200 clip embeddings (2 hours ÷ 6 seconds per clip). This takes time. Async processing lets users upload multiple videos simultaneously without blocking the UI while embeddings are generated in the background.
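The clip arithmetic above generalizes. A small helper makes cost planning concrete, assuming 6-second non-overlapping clips:

```python
import math

def clip_count(duration_seconds: float, clip_seconds: float = 6.0) -> int:
    """Clip embeddings produced at a fixed clip length (assumed 6 s, no overlap)."""
    return math.ceil(duration_seconds / clip_seconds)

print(clip_count(2 * 3600))    # 2-hour video
print(clip_count(100 * 3600))  # 100 hours of case footage
```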


1.2 - Background Job Queue and Completion Polling

The system maintains an in-memory queue of pending embedding jobs and polls Bedrock for completion:

Source: backend/app/utils/video_helpers.py (Line 104)

while True:
    job = bedrock_queue.get()
    task_id = job["task_id"]
    s3_uri = job["s3_uri"]
    output_uri = job["output_uri"]
    filename = job["filename"]
    meta = job["meta"]
    log.info("[QUEUE] Processing Bedrock start for %s (%s)", filename, task_id)
    success = False
    for attempt in range(1, max_retries + 1):
        try:
            result = start_video_embedding(s3_uri, output_uri)
            arn = result.get("invocation_arn", "")
            log.info("[QUEUE] Bedrock started for task_id=%s", task_id)
            video_tasks[task_id]["status"] = "indexing"
            video_tasks[task_id]["invocation_arn"] = arn
            video_tasks[task_id]["output_s3_uri"] = output_uri
            for rec in vs_index():
                if rec.get("id") == task_id:
                    rec.setdefault("metadata", {})["status"] = "indexing"
                    break
            vs_save()
            with bedrock_poller_lock:
                bedrock_poller_jobs.append({
                    "task_id": task_id,
                    "invocation_arn": arn,
                    "output_s3_uri": output_uri,
                    "started_at": time.monotonic(),
                })
            success = True
            break

What this accomplishes: The queue processor updates task status from "pending" to "indexing," stores the Bedrock invocation ARN for tracking, and hands off to a separate polling thread. That poller checks job completion every 30 seconds, loads finished embeddings from S3, and marks videos as "ready" when processing completes.

Why retry logic matters: Bedrock has rate limits. The retry mechanism with exponential backoff ensures jobs eventually succeed even during high-volume uploads, preventing silent failures that would leave videos in a permanent "pending" state.
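The backoff pattern the worker relies on can be sketched independently of Bedrock. The delay schedule and exception type here are illustrative assumptions, and sleep is injectable so the behavior is observable:

```python
import time

def with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn with exponentially growing delays: 1s, 2s, 4s, ..."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except RuntimeError:  # stand-in for a Bedrock throttling error
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky():
    # Fails twice (simulating throttling), then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "started"

delays = []
print(with_backoff(flaky, sleep=delays.append))  # succeeds on the third attempt
print(delays)  # recorded waits: [1.0, 2.0]
```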

Design pattern: Videos and entities share the same vector index, differentiated by type metadata. Documents use a separate chunk-based index. This gives unified video+entity search while keeping document retrieval independent.


1.3 - Document Ingestion with Semantic Chunking

Documents require different processing than video. PDFs are extracted, split into semantic chunks (by section when possible), embedded via NeMo, and stored as searchable records:

Source: backend/app/services/nemo_retriever.py (Line 688)

def ingest_document(file_path: str, doc_id: str, filename: str) -> dict:
    extra: dict = {}
    try:
        pdf_info = store_pdf_document(file_path, doc_id)
        extra.update(pdf_info)
    except Exception as exc:
        log.warning("Could not persist PDF for %s (%s)", doc_id, type(exc).__name__)
    ext = os.path.splitext(file_path)[1].lower()
    chunks: list[str] = []
    sections: list[str] = []
    if ext == ".pdf":
        try:
            pairs = split_into_semantic_chunks(file_path)
            sections = [s for s, _ in pairs]
            chunks = [t for _, t in pairs]
            log.info("Smart PDF chunking produced %d chunks for doc %s", len(chunks), doc_id)
        except Exception as exc:
            log.warning("Smart PDF extraction failed for doc %s, falling back to NeMo (%s)", doc_id, type(exc).__name__)
    if not chunks:
        chunks = extract_document(file_path)
    if not chunks:
        log.warning("No content extracted for doc %s", doc_id)
        return {"doc_id": doc_id, "chunks": 0, "status": "empty"}
    embeddings = embed_texts(chunks)
    add_chunks(
        doc_id, filename, chunks, embeddings,
        sections=sections or None,
        extra_metadata=extra or None,
    )
    return {"doc_id": doc_id, "chunks": len(chunks), "status": "ready"}

Smart chunking strategy: The system attempts section-based splitting first (preserving document structure), then falls back to paragraph-based chunking if structure detection fails. This matters for legal documents where section headers ("Incident Timeline," "Witness Statements") provide important retrieval context.

Embedding model selection: nvidia/llama-nemotron-embed-vl-1b-v2 generates embeddings optimized for passage retrieval. Each chunk gets its own vector, enabling precise document search at the paragraph level rather than whole-document matching.

Metadata preservation: The system stores original section headers and document links alongside chunk text, so retrieval results can point investigators back to the source paragraph in the original PDF.
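To make the fallback behavior concrete, here is a hedged sketch of section-first chunking with a paragraph fallback. The header heuristic (short Title-Case lines) is an assumption for illustration; the real split_into_semantic_chunks is more sophisticated.

```python
import re

def chunk_by_sections(text: str) -> list[tuple[str, str]]:
    """Return (section_header, chunk_text) pairs; fall back to paragraph chunks."""
    header_re = re.compile(r"^[A-Z][A-Za-z ]{2,60}$")
    pairs, header, buf = [], "Document", []
    for line in text.splitlines():
        # Treat short Title-Case lines as section headers
        if header_re.match(line.strip()) and len(line.split()) <= 6:
            if buf:
                pairs.append((header, "\n".join(buf).strip()))
            header, buf = line.strip(), []
        elif line.strip():
            buf.append(line)
    if buf:
        pairs.append((header, "\n".join(buf).strip()))
    # Fallback: no usable headers found -> split on blank lines instead
    if len(pairs) <= 1 and "\n\n" in text:
        pairs = [("Document", p.strip()) for p in text.split("\n\n") if p.strip()]
    return pairs

report = "Incident Timeline\nAt 14:02 the vehicle entered.\nWitness Statements\nWitness A saw a red sedan."
for section, chunk in chunk_by_sections(report):
    print(section, "->", chunk)
```

Keeping the header with each chunk is what lets retrieval results point back to "Witness Statements" rather than an anonymous paragraph.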

Embedding generation via NVIDIA API:

Source: backend/app/services/nemo_retriever.py (Line 515)

def embed_via_requests(texts: list[str], input_type: str) -> list[list[float]]:
    """Call NVIDIA embeddings API directly with requests"""
    import requests
    import time

    t0 = time.perf_counter()

    resp = requests.post(
        "https://integrate.api.nvidia.com/v1/embeddings",
        headers={
            "Authorization": f"Bearer {NVIDIA_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "input": texts,
            "model": EMBED_MODEL,
            "encoding_format": "float",
            "input_type": input_type,
        },
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    vectors = [d["embedding"] for d in data["data"]]
    first_dim = len(vectors[0]) if vectors else 0
    return vectors

The input_type parameter: NeMo distinguishes between "passage" (document chunks being indexed) and "query" (search requests). This improves retrieval accuracy by optimizing embeddings for their intended use: passages are embedded for storage, queries are embedded for matching.
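Once passages and queries are embedded with their respective input_type values, retrieval reduces to a cosine-similarity ranking. A toy sketch with stand-in vectors (real NeMo embeddings are far higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for embedded document chunks (input_type="passage")
chunks = {
    "report-p1": [0.9, 0.1, 0.0],
    "report-p2": [0.1, 0.8, 0.2],
}
# Toy stand-in for an embedded search request (input_type="query")
query_emb = [0.85, 0.15, 0.05]

ranked = sorted(chunks, key=lambda cid: cosine(query_emb, chunks[cid]), reverse=True)
print(ranked[0])  # the chunk nearest the query
```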


1.4 - Creating Searchable Entity Records from Face Images

Legal investigations often require tracking specific individuals across multiple camera sources. The entity system enables this by converting face images into searchable embeddings:

Image embedding generation:

def embed_image(media_source: dict) -> list[float]:
    return invoke_embedding_model({
        "inputType": "image",
        "image": {"mediaSource": media_source},
    })

This uses Marengo's image embedding capability to generate a visual representation of a face. Unlike text embeddings, image embeddings capture visual features directly (facial structure, appearance, clothing), making them suitable for visual matching across video frames.


Entity creation from uploaded face image:

@entities_bp.route("/entities/from-image", methods=["POST"])
def api_entities_from_image():
    if "image" not in request.files:
        return jsonify({"error": "No 'image' file provided"}), 400

    file = request.files["image"]
    if file.filename == "":
        return jsonify({"error": "Empty filename"}), 400

    data = request.form or {}
    name = data.get("name") or (request.get_json(silent=True) or {}).get("name") or ""
    if not name.strip():
        return jsonify({"error": "Missing 'name'"}), 400

    image_bytes = file.read()
    faces = detect_and_crop_faces(image_bytes, min_confidence=ENTITY_FACE_MIN_CONFIDENCE)
    if not faces:
        return jsonify(
            {"error": "No face detected in image. Use a clear, front-facing photo with good lighting."}
        ), 404

    best = faces[0]
    face_b64 = best["image_base64"]
    embed_b64 = best.get("embedding_crop_base64") or face_b64

    import base64

    face_bytes = base64.b64decode(embed_b64)
    media = media_source_base64(face_bytes)

    try:
        embedding = embed_image(media)
    except Exception:
        return jsonify({"error": "Internal server error"}), 500

    entity_id = name.strip().lower().replace(" ", "-")
    rec = index_add(
        id=entity_id,
        embedding=embedding,
        metadata={"name": name.strip(), "face_snap_base64": face_b64},
        type="entity",
    )

    return jsonify(
        {
            "indexId": FIXED_INDEX_ID,
            "entity": {"id": rec["id"], "name": name.strip()},
            "face_snap_base64": face_b64,
        }
    )

The face detection step: Uses OpenCV's ResNet10 SSD detector to isolate faces before embedding. This preprocessing ensures Marengo receives a clean face crop rather than a full scene, improving match accuracy during search. The detector returns confidence scores; only faces above the minimum threshold are processed.

Why entity records live in the video index: Storing entity embeddings alongside video embeddings enables direct similarity search. When an investigator searches for "person-of-interest-x," the system compares that entity's embedding against all video clip embeddings in one retrieval operation.


Part 2: Cross-Source Search and Retrieval

Once videos and documents are indexed, the platform enables unified search across all sources. This is where the single-index architecture delivers its value: one query, all sources, ranked results.


2.1 - Video Content Search with Multimodal Embeddings

The search flow supports three query types through a single interface:

  1. Text queries: Natural language descriptions ("person in red jacket")

  2. Image queries: Upload a screenshot, find matching footage

  3. Entity queries: Search for a previously registered person/object

Source: backend/app/routes/search.py (Line 30)

def search_video_index(
    data: dict,
    *,
    request_query: str = "",
    request_top_k: int | None = None,
    image_bytes: bytes | None = None,
) -> tuple[list[dict], str, str | None]:

    query_emb, display_query, is_entity_search, err = get_search_embedding_from_request(
        data,
        request_query=request_query,
        image_bytes=image_bytes,
    )

Input normalization: get_search_embedding_from_request() handles all query types and returns a single embedding vector. Text queries get embedded via Marengo's text encoder, image queries via the image encoder, and entity queries retrieve the stored embedding directly. The search logic downstream doesn't care about the input type; it just sees a vector to match.

Entity search optimization:

for r in results:
    meta = r.get("metadata", {})
    clips = []
    output_uri = meta.get("output_s3_uri") or f"{S3_EMBEDDINGS_OUTPUT}/{r['id']}"

    if is_entity_search:
        clips = clips_above_threshold(
            query_emb,
            output_uri,
            min_score=ENTITY_CLIP_MIN_SCORE,
            visual_only=True,
            max_clips=clips_per_video,
        )

    if not clips:
        clips = clip_search(
            query_emb,
            output_uri,
            top_n=clips_per_video,
            min_score=clip_min_score,
            visual_only=is_entity_search,
        )

The scoring difference: Entity searches use visual_only=True and higher match thresholds because face matching requires stronger visual similarity than general content search. Text queries can match via audio transcription or visual content (either modality counts). Entity queries must match visually.

Clip-level grounding:

out.append({
    "id": r["id"],
    "score": r["score"],
    "metadata": meta,
    "clips": clips,
})

return out, display_query, None

Results include not just which videos match, but exactly where in each video the match occurs. An investigator searching for "person exiting vehicle" gets results like:

  • Video: "Dashcam - Main St" → Clips at 00:03:42-00:03:48, 00:07:15-00:07:21

  • Video: "CCTV - Parking Lot" → Clip at 00:12:03-00:12:09

This clip-level precision eliminates the "search for a needle, still get the entire haystack" problem common in video search systems.


2.2 - Entity-Aware Video Search: Finding People Across Sources

Entity search enables investigators to track specific individuals across multiple unrelated camera sources. Upload a face image once, search all footage:

How it works:

  1. Entity embedding (generated during entity creation) is loaded from index

  2. System retrieves clip embeddings for each indexed video from S3 Bedrock output

  3. Similarity scoring identifies clips where the entity likely appears

  4. Videos are ranked by match strength and consistency (how many clips matched, how strong the scores)

Why this is faster than runtime face detection: Pre-computing clip embeddings during ingestion means search only performs similarity comparison, not face detection. Comparing 10,000 clip embeddings against an entity embedding takes milliseconds. Running face detection on 10,000 clips would take hours.

Match threshold tuning: The system uses ENTITY_CLIP_MIN_SCORE to filter weak matches. Setting this too low produces false positives (similar-looking people). Setting it too high misses valid matches (same person in different lighting/angles). The demo uses 0.75 as a balance point; your production system should make this user-configurable.
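The threshold filter itself is simple. A sketch of the idea, with illustrative clip scores and the demo's 0.75 default:

```python
def clips_above_threshold(scored_clips, min_score=0.75, max_clips=3):
    """Keep only clips whose similarity to the entity embedding clears min_score."""
    keep = [c for c in scored_clips if c["score"] >= min_score]
    keep.sort(key=lambda c: c["score"], reverse=True)
    return keep[:max_clips]

clips = [
    {"start": 222.0, "end": 228.0, "score": 0.91},
    {"start": 435.0, "end": 441.0, "score": 0.77},
    {"start": 723.0, "end": 729.0, "score": 0.62},  # weak match, filtered out
]
print([c["start"] for c in clips_above_threshold(clips)])
```

Raising min_score to 0.9 in this example would drop the 0.77 match too, which is exactly the false-negative risk described above.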


2.3 - Hybrid Search: Video + Document Retrieval

Legal evidence isn't just video. Investigators need to cross-reference footage with police reports, witness statements, and insurance forms. Hybrid search runs video and document retrieval in parallel:


Document search implementation:

def search_document_index(text_query: str, doc_top_k: int) -> list[dict]:
    from app.services.nemo_retriever import embed_query, search_docs
    t0 = time.perf_counter()
    log.info("[DOC_SEARCH] Started doc search top_k=%d", doc_top_k)
    query_emb = embed_query(text_query)
    docs = search_docs(query_emb, top_k=doc_top_k)
    return docs

Parallel execution pattern: The hybrid search handler starts video and document searches simultaneously using threading, then merges results once both complete. This keeps total latency close to the slower of the two searches rather than their sum.
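A minimal sketch of that fan-out pattern, with sleep calls standing in for real retrieval latency; the two search functions are illustrative stand-ins, not the application's actual handlers:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def search_videos(query):
    time.sleep(0.05)  # simulates Marengo similarity search
    return [{"type": "video", "id": "dashcam-1"}]

def search_documents(query):
    time.sleep(0.05)  # simulates NeMo chunk retrieval
    return [{"type": "document", "id": "police-report-p3"}]

def hybrid_search(query):
    # Run both searches concurrently; total latency tracks the slower one
    with ThreadPoolExecutor(max_workers=2) as pool:
        vid = pool.submit(search_videos, query)
        doc = pool.submit(search_documents, query)
        # Keep the result sets separate, as the application does
        return {"videos": vid.result(), "documents": doc.result()}

print(hybrid_search("red sedan"))
```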

Result merging strategy: Video results and document results are kept separate in the response (not interleaved by score) because they serve different investigative purposes. Videos provide visual evidence, documents provide corroborating narrative. Investigators review them differently.

For complete hybrid search implementation: View source


Part 3: Structured Compliance Analysis and Reporting

Search finds evidence. Analysis interprets it. The platform uses Pegasus through Bedrock to generate structured compliance reports from raw footage:


3.1 - Generating Structured Video Analysis

Pegasus transforms unstructured video into structured legal documentation: categorized risk levels, detected persons, timestamped transcripts, and compliance summaries.

Source: backend/app/services/bedrock_pegasus.py (Line 72)

body: dict = {
    "inputPrompt": prompt[:4000],
    "mediaSource": {
        "s3Location": {
            "uri": s3_uri,
            "bucketOwner": owner,
        }
    },
}
if temperature is not None:
    body["temperature"] = temperature
if response_schema is not None:
    body["responseFormat"] = {"jsonSchema": response_schema}

Temperature setting: Using temperature: 0 for compliance analysis ensures deterministic, repeatable output. The same video always produces the same analysis structure, critical for legal documentation where consistency matters.

Response schema enforcement: The jsonSchema parameter instructs Pegasus to return structured JSON rather than free-form text. This guarantees parse-able output and eliminates post-processing fragility.
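A hypothetical schema illustrating the shape Pegasus would be asked to honor; the field names here are invented for illustration, not the application's actual compliance schema:

```python
# Hypothetical JSON schema for structured compliance output.
compliance_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["category", "risk_level", "summary"],
}

# Attached to the request body as:
#   body["responseFormat"] = {"jsonSchema": compliance_schema}
print(compliance_schema["required"])
```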

Bedrock model invocation:

response = client.invoke_model(
    modelId=model_id,
    body=payload,
    contentType="application/json",
    accept="application/json",
)

The response contains generated text extracted from Bedrock's JSON response format. This text represents Pegasus's interpretation of the video content based on the provided prompt instructions.

Analysis prompt and parsing:

Source: backend/app/routes/videos.py (Line 335)

raw_text = pegasus_analyze_video(
    s3_uri,
    get_video_analysis_prompt(),
    temperature=0,
)
log.info("[ANALYSIS] Pegasus response received in %.1fs (len=%d)", time.perf_counter() - t0, len(raw_text or ""))
analysis_dict = parse_video_analysis_response(raw_text)

The analysis prompt (view source) instructs Pegasus to:

  • Categorize the video content (traffic incident, workplace safety, criminal activity, etc.)

  • Identify risk levels and specific risk factors

  • Extract timestamped transcript segments

  • Detect persons of interest with descriptions

  • Generate a compliance-focused summary

Parsing strategy: parse_video_analysis_response() handles malformed JSON gracefully (stripping triple-backtick wrapping, fixing trailing commas, and recovering from incomplete responses) because LLM outputs aren't always perfectly formatted. Robust parsing prevents analysis failures from minor formatting issues.
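A hedged sketch of this tolerant-parsing idea; the real parser handles more failure modes:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    text = raw.strip()
    # Strip a code-fence wrapper (three backticks, optional "json" tag) if present
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Drop trailing commas before } or ]
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)

fence = "`" * 3  # triple backticks, built programmatically for this example
raw = fence + 'json\n{"category": "traffic incident", "risks": ["collision",],}\n' + fence
print(parse_llm_json(raw))
```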

Transcript generation: The system runs a separate Pegasus call optimized for transcription, requesting timestamped segments in structured JSON. This produces ordered transcript entries with start/end times, enabling investigators to jump directly to spoken content.


3.2 - Object Detection and Face Keyframe Extraction

Beyond basic transcription, the platform identifies detected objects and useful face keyframes with precise timestamps (the specific moments where faces appear clearly):

Source: backend/app/routes/videos.py (Line 707)

raw_response = pegasus_analyze_video(s3_uri, get_detect_prompt())
log.info("[INSIGHTS] Pegasus response received in %.1fs (%d chars)", time.perf_counter() - t0, len(raw_response or ""))

detect_data = parse_detect_response(raw_response)
objects_raw = detect_data["objects"]
face_keyframes = detect_data["face_keyframes"]

The detection prompt (view source) asks Pegasus to identify:

  • Objects: Vehicles, weapons, physical evidence, environmental details

  • Face keyframes: Timestamps where faces appear with sufficient clarity for identification

Why keyframes matter: Not every frame containing a face is useful. Faces seen from behind, in motion blur, or poorly lit don't help identification. Pegasus selects keyframes where faces are frontal, well-lit, and clear; the frames an investigator would actually screenshot for evidence.

Post-processing: Once Pegasus returns keyframes and objects, the system extracts those specific frames, generates thumbnails, and stores them for quick review. This eliminates re-processing video every time an investigator needs to see a face.


3.3 - Face Presence Timeline: Where Does This Person Appear?

After detecting faces, the system builds a presence timeline showing when each detected person appears throughout the video:

Source: backend/app/routes/videos.py (Line 1027)

if use_marengo:
    # Marengo-based presence, match each face embedding to clip embeddings
    for j, emb in enumerate(face_embeddings):
        if not emb:
            continue
        clips = clips_above_threshold(
            emb,
            output_uri,
            min_score=FACE_PRESENCE_MATCH_THRESHOLD,
            visual_only=True,
            max_clips=50,
        )
        for clip in clips:
            c_start = float(clip.get("start", 0.0))
            c_end = float(clip.get("end", c_start + 0.5))
            for i in range(n_segments):
                s0 = i * seg_dur
                s1 = (i + 1) * seg_dur
                if c_end > s0 and c_start < s1:
                    presence_by_face[j]["segment_presence"][i] = 1

How the timeline is constructed:

  1. Video is divided into fixed-duration segments (e.g., 30-second windows)

  2. Each detected face gets its embedding compared against all clip embeddings

  3. Clips scoring above threshold are marked as "person present"

  4. Matched clips are mapped onto the segment timeline

  5. Result: A binary presence map showing which segments contain each person

Why this matters for investigations: An investigator can see at a glance: "Person A appears in segments 3, 7, 12, and 18" without watching the entire video. Click segment 7, jump directly to that 30-second window, confirm visual match.
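The clip-to-segment overlap mapping can be sketched as follows; segment length and clip times are illustrative:

```python
def presence_map(clips, video_seconds, seg_dur=30.0):
    """Mark every fixed-length segment that a matched clip overlaps."""
    n_segments = int(video_seconds // seg_dur) + (1 if video_seconds % seg_dur else 0)
    presence = [0] * n_segments
    for start, end in clips:
        for i in range(n_segments):
            s0, s1 = i * seg_dur, (i + 1) * seg_dur
            if end > s0 and start < s1:  # interval-overlap test, as in the source
                presence[i] = 1
    return presence

# A 3-minute video; the person matched clips at 0:42-0:48 and 1:35-1:41
print(presence_map([(42.0, 48.0), (95.0, 101.0)], video_seconds=180.0))
```

The resulting binary map is what the UI renders as a timeline strip: a 1 means "click here to verify this person's presence."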

The threshold trade-off: FACE_PRESENCE_MATCH_THRESHOLD controls sensitivity. Higher values reduce false positives but may miss valid appearances (different angles, lighting changes). Lower values catch more instances but produce more false matches requiring manual review. The demo uses 0.70; production systems should let investigators adjust this per case.


Operational Considerations for Production Deployment

This demo proves the concept. Deploying to production requires addressing several operational concerns:

  1. Scalability: The single-index approach works for demos (12 videos) but needs architectural adjustments for production loads (1,000+ videos per case). Consider index partitioning strategies or migrating to a dedicated vector database for clip-level embeddings.

Introduction

Legal teams processing video evidence face a compounding problem. A single case can involve 40+ hours of footage from dashcams, bodycams, CCTV systems, doorbell cameras, and insurance submissions, all in different formats, resolutions, and timestamps. Investigators spend $200-500/hour in paralegal time manually reviewing this content. Miss a critical 10-second clip buried in hour 23 of camera feed 7, and the case outcome changes.

The traditional approach (manual review, basic metadata tagging, frame-by-frame analysis) doesn't scale. Legal teams need to search across disparate sources simultaneously, reconstruct timelines from fragmented footage, and identify critical moments without watching every second of video.

This tutorial demonstrates how to build a legal evidence investigation platform using TwelveLabs through AWS Bedrock for video intelligence and NVIDIA NeMo Retriever for document search. The result: investigators search 12 video sources with natural language queries ("find the red sedan" or "show me when the person entered the building"), get ranked results with exact timestamps, and generate structured compliance reports in minutes instead of days.

What you'll build: A multi-source evidence investigator that ingests mixed-format surveillance footage, enables cross-source semantic search, performs automated compliance analysis, and reconstructs chronological timelines from disparate video sources.

Time investment: 40 hours of evidence review becomes 4 hours of targeted investigation.

You can explore the demo of the application here: Legal Evidence Investigator Application

You can check out the source code here: GitHub Repository


Demo Application

This demonstration shows how the platform handles real investigative workflows: searching across multiple video sources, identifying critical moments with precise timestamps, generating structured compliance analysis, and providing an interactive Q&A interface for deeper investigation.

Key capabilities demonstrated:

  • Cross-source search across 12+ disparate video formats

  • Entity tracking (find specific person/vehicle across all sources)

  • Automated compliance analysis with risk categorization

  • Timeline reconstruction showing sequential evidence

  • Conversational video Q&A with clickable timestamps


System Architecture: Why Single-Index Multi-Source Design Matters

The core architectural decision in this application addresses a fundamental constraint: TwelveLabs search operates at the index level, meaning you search one index per request, not individual videos. This shapes everything.

The naive approach would create one index per video source. This fails immediately: you'd need 12 separate search requests to find a person across 12 cameras, then manually merge and sort results in application logic. Slow, complex, brittle.

The correct approach uses a single-index multi-source strategy: all evidence videos live in one TwelveLabs index, differentiated by rich metadata tagging. One search query hits all sources simultaneously. Results come back pre-ranked by relevance, grouped by source video, with metadata filters enabling scoped searches when needed ("only bodycam footage" or "only footage from Main Street location").

System components:

  1. Video ingestion pipeline: Upload mixed-format footage to S3 → Index via twelvelabs.marengo-embed-3-0-v1:0 through Bedrock → Store multimodal embeddings with source metadata

  2. Document ingestion pipeline: Extract text from PDFs → Chunk intelligently → Embed via nvidia/llama-nemotron-embed-vl-1b-v2 → Store in document index

  3. Hybrid retrieval layer: Parallel search across video embeddings (Marengo) and document chunks (NeMo) → Merge results → Return unified response

  4. Analysis engine: Generate structured compliance reports via twelvelabs.pegasus-1-2-v1:0 including title, risk categories, detected objects, face timelines, and transcript segments

  5. Conversational interface: Video Q&A with clickable timestamp citations

This architecture delivers cross-source search without sacrificing performance. Single-request latency stays under 3 seconds even with 12 indexed videos because the index-level search pattern is optimized for exactly this scenario.


Preparation: What You Need Before Building


1 - AWS Bedrock Access for TwelveLabs Models

Set up AWS credentials with permissions for:

  • Amazon Bedrock runtime and model access

  • S3 for video storage and Bedrock async output

  • Access to these TwelveLabs models through Bedrock:

    • twelvelabs.marengo-embed-3-0-v1:0 (multimodal video embeddings)

    • twelvelabs.pegasus-1-2-v1:0 (video analysis and reasoning)

Why these models: Marengo generates the searchable representations of video content (the "encoder"), while Pegasus performs reasoning and structured analysis (the "interpreter"). You need both: Marengo makes video searchable, Pegasus makes it understandable.


2 - S3 Bucket Configuration

Create one S3 bucket structured for:

  • Video uploads and storage

  • Bedrock async embedding output location (required for batch jobs)

  • Generated thumbnails and analysis artifacts

  • Document storage

Why S3-based: Bedrock's async embedding API requires S3 input/output locations. This also enables scalable storage without hitting local disk limits.


3 - NVIDIA API Key

Get an NVIDIA API key to access:

  • nvidia/llama-nemotron-embed-vl-1b-v2 for document chunk embeddings

Why NeMo Retriever: Legal evidence isn't just video; it's police reports, witness statements, insurance forms. NeMo handles text retrieval while TwelveLabs handles video, giving you multimodal search across all evidence types.


4 - Clone and Configure
git clone https://github.com/Hrishikesh332/tl-compliance-intelligence
cd tl-compliance-intelligence

Follow backend environment setup in .env.example:

  1. AWS credentials (Access Key ID, Secret Access Key, region)

  2. S3 bucket name and Bedrock output path

  3. NVIDIA API key

  4. Application-specific settings (index configuration, match thresholds)
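A .env along these lines is what the setup expects. The threshold names and default values match those used later in this tutorial; the AWS and bucket variable names are illustrative, so use the exact names from .env.example:

```
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1
S3_BUCKET=...
S3_EMBEDDINGS_OUTPUT=s3://your-bucket/bedrock-output
NVIDIA_API_KEY=...
ENTITY_CLIP_MIN_SCORE=0.75
FACE_PRESENCE_MATCH_THRESHOLD=0.70
```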


Implementation Deep Dive


Part 1: Video Ingestion and Embedding Generation

The ingestion pipeline transforms uploaded surveillance footage into searchable multimodal embeddings. This happens asynchronously because Bedrock's embedding jobs can take several minutes for hour-long videos.


1.1 - Starting the Marengo Embedding Job

When a video uploads, the system immediately stores it in S3 and kicks off a Bedrock async embedding job:

Source: backend/app/services/bedrock_marengo.py (Line 111)

def start_video_embedding(
    s3_uri: str,
    output_s3_uri: str,
    bucket_owner: str | None = None,
) -> dict:
    client = get_bedrock_client()
    owner = bucket_owner
    body = {
        "inputType": "video",
        "video": {
            "mediaSource": media_source_s3(s3_uri, owner),
            "embeddingOption": ["visual", "audio"],
            "embeddingScope": ["clip", "asset"],
        },
    }
    resp = client.start_async_invoke(
        modelId=MARENGO_MODEL_ID,
        modelInput=body,
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return {
        "invocation_arn": resp.get("invocationArn", ""),
        "status": "pending",
    }

Why embeddingScope: ["clip", "asset"] matters: This generates embeddings at two levels:

  • Clip embeddings (6-second segments): Enable granular search → find the exact 8-second window where a person appears

  • Asset embeddings (whole video): Enable video-level similarity and grouping

Both are necessary: clip embeddings power precise temporal search, asset embeddings enable "find videos similar to this one" workflows.

Why async processing is required: A 2-hour video generates 1,200 clip embeddings (2 hours ÷ 6 seconds per clip). This takes time. Async processing lets users upload multiple videos simultaneously without blocking the UI while embeddings are generated in the background.
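The arithmetic behind that estimate, assuming the 6-second clip length discussed above, is simple enough to sketch:

```python
import math

def estimate_clip_embeddings(duration_seconds: float, clip_seconds: float = 6.0) -> int:
    """Rough count of clip-scope embeddings an async embedding job will produce."""
    return math.ceil(duration_seconds / clip_seconds)

print(estimate_clip_embeddings(2 * 3600))    # -> 1200 for a 2-hour video
print(estimate_clip_embeddings(100 * 3600))  # -> 60000 for 100 hours of footage
```

This is also the number that drives the cost-management discussion later: embedding volume scales linearly with footage duration.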


1.2 - Background Job Queue and Completion Polling

The system maintains an in-memory queue of pending embedding jobs and polls Bedrock for completion:

Source: backend/app/utils/video_helpers.py (Line 104)

while True:
    job = bedrock_queue.get()
    task_id = job["task_id"]
    s3_uri = job["s3_uri"]
    output_uri = job["output_uri"]
    filename = job["filename"]
    meta = job["meta"]
    log.info("[QUEUE] Processing Bedrock start for %s (%s)", filename, task_id)
    success = False
    for attempt in range(1, max_retries + 1):
        try:
            result = start_video_embedding(s3_uri, output_uri)
            arn = result.get("invocation_arn", "")
            log.info("[QUEUE] Bedrock started for task_id=%s", task_id)
            video_tasks[task_id]["status"] = "indexing"
            video_tasks[task_id]["invocation_arn"] = arn
            video_tasks[task_id]["output_s3_uri"] = output_uri
            for rec in vs_index():
                if rec.get("id") == task_id:
                    rec.setdefault("metadata", {})["status"] = "indexing"
                    break
            vs_save()
            with bedrock_poller_lock:
                bedrock_poller_jobs.append({
                    "task_id": task_id,
                    "invocation_arn": arn,
                    "output_s3_uri": output_uri,
                    "started_at": time.monotonic(),
                })
            success = True
            break

What this accomplishes: The queue processor updates task status from "pending" to "indexing," stores the Bedrock invocation ARN for tracking, and hands off to a separate polling thread. That poller checks job completion every 30 seconds, loads finished embeddings from S3, and marks videos as "ready" when processing completes.

Why retry logic matters: Bedrock has rate limits. The retry mechanism with exponential backoff ensures jobs eventually succeed even during high-volume uploads, preventing silent failures that would leave videos in permanent "pending" state.
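The retry-with-backoff pattern can be sketched in isolation. This is a generic helper, not the application's exact implementation, and the delay parameters are illustrative:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn with exponential backoff; re-raise after the final attempt."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return "started"

# Pass a no-op sleep so the demo runs instantly
print(with_retries(flaky, sleep=lambda s: None))  # -> started (after 2 failures)
```

In production you would catch only throttling-class exceptions rather than bare Exception, so genuine configuration errors fail fast instead of burning retries.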

Design pattern: Videos and entities share the same vector index, differentiated by type metadata. Documents use a separate chunk-based index. This gives unified video+entity search while keeping document retrieval independent.


1.3 - Document Ingestion with Semantic Chunking

Documents require different processing than video. PDFs are extracted, split into semantic chunks (by section when possible), embedded via NeMo, and stored as searchable records:

Source: backend/app/services/nemo_retriever.py (Line 688)

def ingest_document(file_path: str, doc_id: str, filename: str) -> dict:
    extra: dict = {}
    try:
        pdf_info = store_pdf_document(file_path, doc_id)
        extra.update(pdf_info)
    except Exception as exc:
        log.warning("Could not persist PDF for %s (%s)", doc_id, type(exc).__name__)
    ext = os.path.splitext(file_path)[1].lower()
    chunks: list[str] = []
    sections: list[str] = []
    if ext == ".pdf":
        try:
            pairs = split_into_semantic_chunks(file_path)
            sections = [s for s, _ in pairs]
            chunks = [t for _, t in pairs]
            log.info("Smart PDF chunking produced %d chunks for doc %s", len(chunks), doc_id)
        except Exception as exc:
            log.warning("Smart PDF extraction failed for doc %s, falling back to NeMo (%s)", doc_id, type(exc).__name__)
    if not chunks:
        chunks = extract_document(file_path)
    if not chunks:
        log.warning("No content extracted for doc %s", doc_id)
        return {"doc_id": doc_id, "chunks": 0, "status": "empty"}
    embeddings = embed_texts(chunks)
    add_chunks(
        doc_id, filename, chunks, embeddings,
        sections=sections or None,
        extra_metadata=extra or None,
    )
    return {"doc_id": doc_id, "chunks": len(chunks), "status": "ready"}

Smart chunking strategy: The system attempts section-based splitting first (preserving document structure), then falls back to paragraph-based chunking if structure detection fails. This matters for legal documents where section headers ("Incident Timeline," "Witness Statements") provide important retrieval context.

Embedding model selection: nvidia/llama-nemotron-embed-vl-1b-v2 generates embeddings optimized for passage retrieval. Each chunk gets its own vector, enabling precise document search at the paragraph level rather than whole-document matching.

Metadata preservation: The system stores original section headers and document links alongside chunk text, so retrieval results can point investigators back to the source paragraph in the original PDF.
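The section-first, paragraph-fallback strategy can be sketched like this. The header regex and return shape are illustrative, not the application's split_into_semantic_chunks:

```python
import re

# Heuristic: a short line of capitalized words, e.g. "Incident Timeline"
HEADER = re.compile(r"^[A-Z][A-Za-z ]{2,60}$")

def chunk_document(text: str) -> list[tuple[str, str]]:
    """Return (section, chunk) pairs; fall back to paragraphs if no headers found."""
    sections: list[tuple[str, str]] = []
    current: list[str] = []
    title = "Document"
    for line in text.splitlines():
        if HEADER.match(line.strip()):
            if current:
                sections.append((title, "\n".join(current).strip()))
            title, current = line.strip(), []
        else:
            current.append(line)
    if current:
        sections.append((title, "\n".join(current).strip()))
    if len(sections) > 1:
        return [(t, c) for t, c in sections if c]
    # Fallback: no structure detected, split on blank lines
    return [("Document", p.strip()) for p in text.split("\n\n") if p.strip()]

doc = "Incident Timeline\nAt 14:02 the vehicle entered.\n\nWitness Statements\nA bystander reported a red sedan."
for section, chunk in chunk_document(doc):
    print(section, "->", chunk)
```

Each chunk keeps its section title, which is exactly the metadata the retrieval layer needs to point investigators back to "Witness Statements" rather than just "page 3."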

Embedding generation via NVIDIA API:

Source: backend/app/services/nemo_retriever.py (Line 515)

def embed_via_requests(texts: list[str], input_type: str) -> list[list[float]]:
    """Call NVIDIA embeddings API directly with requests"""
    import requests
    t0 = time.perf_counter()

    resp = requests.post(
        "https://integrate.api.nvidia.com/v1/embeddings",
        headers={
            "Authorization": f"Bearer {NVIDIA_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "input": texts,
            "model": EMBED_MODEL,
            "encoding_format": "float",
            "input_type": input_type,
        },
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    vectors = [d["embedding"] for d in data["data"]]
    first_dim = len(vectors[0]) if vectors else 0
    return vectors

The input_type parameter: NeMo distinguishes between "passage" (document chunks being indexed) and "query" (search requests). This improves retrieval accuracy by optimizing embeddings for their intended use: passages are embedded for storage, queries are embedded for matching.


1.4 - Creating Searchable Entity Records from Face Images

Legal investigations often require tracking specific individuals across multiple camera sources. The entity system enables this by converting face images into searchable embeddings:

Image embedding generation:

def embed_image(media_source: dict) -> list[float]:
    return invoke_embedding_model({
        "inputType": "image",
        "image": {"mediaSource": media_source},
    })

This uses Marengo's image embedding capability to generate a visual representation of a face. Unlike text embeddings, image embeddings capture visual features directly (facial structure, appearance, clothing) making them suitable for visual matching across video frames.


Entity creation from uploaded face image:

@entities_bp.route("/entities/from-image", methods=["POST"])
def api_entities_from_image():
    if "image" not in request.files:
        return jsonify({"error": "No 'image' file provided"}), 400

    file = request.files["image"]
    if file.filename == "":
        return jsonify({"error": "Empty filename"}), 400

    data = request.form or {}
    name = data.get("name") or (request.get_json(silent=True) or {}).get("name") or ""
    if not name.strip():
        return jsonify({"error": "Missing 'name'"}), 400

    image_bytes = file.read()
    faces = detect_and_crop_faces(image_bytes, min_confidence=ENTITY_FACE_MIN_CONFIDENCE)
    if not faces:
        return jsonify(
            {"error": "No face detected in image. Use a clear, front-facing photo with good lighting."}
        ), 404

    best = faces[0]
    face_b64 = best["image_base64"]
    embed_b64 = best.get("embedding_crop_base64") or face_b64

    import base64

    face_bytes = base64.b64decode(embed_b64)
    media = media_source_base64(face_bytes)

    try:
        embedding = embed_image(media)
    except Exception:
        return jsonify({"error": "Internal server error"}), 500

    entity_id = name.strip().lower().replace(" ", "-")
    rec = index_add(
        id=entity_id,
        embedding=embedding,
        metadata={"name": name.strip(), "face_snap_base64": face_b64},
        type="entity",
    )

    return jsonify(
        {
            "indexId": FIXED_INDEX_ID,
            "entity": {"id": rec["id"], "name": name.strip()},
            "face_snap_base64": face_b64,
        }
    )

The face detection step: Uses OpenCV's ResNet10 SSD detector to isolate faces before embedding. This preprocessing ensures Marengo receives a clean face crop rather than a full scene, improving match accuracy during search. The detector returns confidence scores; only faces above the minimum threshold are processed.

Why entity records live in the video index: Storing entity embeddings alongside video embeddings enables direct similarity search. When an investigator searches for "person-of-interest-x," the system compares that entity's embedding against all video clip embeddings in one retrieval operation.


Part 2: Cross-Source Search and Retrieval

Once videos and documents are indexed, the platform enables unified search across all sources. This is where the single-index architecture delivers its value: one query, all sources, ranked results.


2.1 - Video Content Search with Multimodal Embeddings

The search flow supports three query types through a single interface:

  1. Text queries: Natural language descriptions ("person in red jacket")

  2. Image queries: Upload a screenshot, find matching footage

  3. Entity queries: Search for a previously registered person/object

Source: backend/app/routes/search.py (Line 30)

def search_video_index(
    data: dict,
    *,
    request_query: str = "",
    request_top_k: int | None = None,
    image_bytes: bytes | None = None,
) -> tuple[list[dict], str, str | None]:

    query_emb, display_query, is_entity_search, err = get_search_embedding_from_request(
        data,
        request_query=request_query,
        image_bytes=image_bytes,
    )

Input normalization: get_search_embedding_from_request() handles all query types and returns a single embedding vector. Text queries get embedded via Marengo's text encoder, image queries via image encoder, entity queries retrieve the stored embedding directly. The search logic downstream doesn't care about input type—it just sees a vector to match.

Entity search optimization:

for r in results:
    meta = r.get("metadata", {})
    clips = []
    output_uri = meta.get("output_s3_uri") or f"{S3_EMBEDDINGS_OUTPUT}/{r['id']}"

    if is_entity_search:
        clips = clips_above_threshold(
            query_emb,
            output_uri,
            min_score=ENTITY_CLIP_MIN_SCORE,
            visual_only=True,
            max_clips=clips_per_video,
        )

    if not clips:
        clips = clip_search(
            query_emb,
            output_uri,
            top_n=clips_per_video,
            min_score=clip_min_score,
            visual_only=is_entity_search,
        )

The scoring difference: Entity searches use visual_only=True and higher match thresholds because face matching requires stronger visual similarity than general content search. Text queries can match via audio transcription or visual content (either modality counts). Entity queries must match visually.

Clip-level grounding:

out.append({
        "id": r["id"],
        "score": r["score"],
        "metadata": meta,
        "clips": clips,
    })

return out, display_query, None

Results include not just which videos match, but exactly where in each video the match occurs. An investigator searching for "person exiting vehicle" gets results like:

  • Video: "Dashcam - Main St" → Clips at 00:03:42-00:03:48, 00:07:15-00:07:21

  • Video: "CCTV - Parking Lot" → Clip at 00:12:03-00:12:09

This clip-level precision eliminates "search for needle, still get entire haystack" problems common in video search systems.
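Formatting clip offsets into the HH:MM:SS ranges shown above is a small helper. This is a sketch; it assumes the clip dicts carry start/end offsets in seconds:

```python
def fmt_ts(seconds: float) -> str:
    """Render a second offset as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

def clip_range(clip: dict) -> str:
    """Render a matched clip as a start-end timestamp range."""
    return f"{fmt_ts(clip['start'])}-{fmt_ts(clip['end'])}"

print(clip_range({"start": 222, "end": 228}))  # -> 00:03:42-00:03:48
```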


2.2 - Entity-Aware Video Search: Finding People Across Sources

Entity search enables investigators to track specific individuals across multiple unrelated camera sources. Upload a face image once, search all footage:

How it works:

  1. Entity embedding (generated during entity creation) is loaded from index

  2. System retrieves clip embeddings for each indexed video from S3 Bedrock output

  3. Similarity scoring identifies clips where the entity likely appears

  4. Videos are ranked by match strength and consistency (how many clips matched, how strong the scores)

Why this is faster than runtime face detection: Pre-computing clip embeddings during ingestion means search only performs similarity comparison, not face detection. Comparing 10,000 clip embeddings against an entity embedding takes milliseconds. Running face detection on 10,000 clips would take hours.

Match threshold tuning: The system uses ENTITY_CLIP_MIN_SCORE to filter weak matches. Setting this too low produces false positives (similar-looking people). Setting it too high misses valid matches (same person in different lighting/angles). The demo uses 0.75 as a balance point; your production system should make this user-configurable.


2.3 - Hybrid Search: Video + Document Retrieval

Legal evidence isn't just video. Investigators need to cross-reference footage with police reports, witness statements, and insurance forms. Hybrid search runs video and document retrieval in parallel:


Document search implementation:

def search_document_index(text_query: str, doc_top_k: int) -> list[dict]:
    from app.services.nemo_retriever import embed_query, search_docs
    t0 = time.perf_counter()
    log.info("[DOC_SEARCH] Started doc search top_k=%d", doc_top_k)
    query_emb = embed_query(text_query)
    docs = search_docs(query_emb, top_k=doc_top_k)
    return docs

Parallel execution pattern: The hybrid search handler starts video and document searches simultaneously using threading, then merges results once both complete. This keeps total latency close to the slower of the two searches rather than their sum.

Result merging strategy: Video results and document results are kept separate in the response (not interleaved by score) because they serve different investigative purposes. Videos provide visual evidence, documents provide corroborating narrative. Investigators review them differently.
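The parallel pattern can be sketched with a thread pool; the stand-in search functions below are dummies for illustration, not the application's retrievers:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query: str, video_search, doc_search) -> dict:
    """Run video and document retrieval concurrently; keep results separate."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(video_search, query)
        doc_future = pool.submit(doc_search, query)
        # Total latency ~ max(video, doc), not their sum
        return {"videos": video_future.result(), "documents": doc_future.result()}

def fake_video_search(query):
    return [{"id": "dashcam-01", "score": 0.91}]

def fake_doc_search(query):
    return [{"doc_id": "police-report", "score": 0.84}]

print(hybrid_search("red sedan", fake_video_search, fake_doc_search))
```

Keeping the two result lists in separate keys (rather than interleaving by score) mirrors the merging strategy above: visual evidence and corroborating documents are reviewed differently.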

For complete hybrid search implementation: View source


Part 3: Structured Compliance Analysis and Reporting

Search finds evidence. Analysis interprets it. The platform uses Pegasus through Bedrock to generate structured compliance reports from raw footage:


3.1 - Generating Structured Video Analysis

Pegasus transforms unstructured video into structured legal documentation: categorized risk levels, detected persons, timestamped transcripts, and compliance summaries.

Source: backend/app/services/bedrock_pegasus.py (Line 72)

body: dict = {
    "inputPrompt": prompt[:4000],
    "mediaSource": {
        "s3Location": {
            "uri": s3_uri,
            "bucketOwner": owner,
        }
    },
}
if temperature is not None:
    body["temperature"] = temperature
if response_schema is not None:
    body["responseFormat"] = {"jsonSchema": response_schema}

Temperature setting: Using temperature: 0 for compliance analysis ensures deterministic, repeatable output. The same video always produces the same analysis structure, critical for legal documentation where consistency matters.

Response schema enforcement: The jsonSchema parameter instructs Pegasus to return structured JSON rather than free-form text. This guarantees parse-able output and eliminates post-processing fragility.
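A response schema along these lines would enforce the report structure; the field names here are illustrative, not the application's exact schema:

```python
# Hypothetical JSON schema for the structured compliance report
COMPLIANCE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "category": {"type": "string"},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "risk_factors": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["title", "category", "risk_level", "summary"],
}
```

Constraining risk_level to an enum is the key move: downstream code can branch on three known values instead of parsing free-form prose like "moderately concerning."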

Bedrock model invocation:

response = client.invoke_model(
    modelId=model_id,
    body=payload,
    contentType="application/json",
    accept="application/json",
)

The response contains generated text extracted from Bedrock's JSON response format. This text represents Pegasus's interpretation of the video content based on the provided prompt instructions.

Analysis prompt and parsing:

Source: backend/app/routes/videos.py (Line 335)

raw_text = pegasus_analyze_video(
    s3_uri,
    get_video_analysis_prompt(),
    temperature=0,
)
log.info("[ANALYSIS] Pegasus response received in %.1fs (len=%d)", time.perf_counter() - t0, len(raw_text or ""))
analysis_dict = parse_video_analysis_response(raw_text)

The analysis prompt (view source) instructs Pegasus to:

  • Categorize the video content (traffic incident, workplace safety, criminal activity, etc.)

  • Identify risk levels and specific risk factors

  • Extract timestamped transcript segments

  • Detect persons of interest with descriptions

  • Generate a compliance-focused summary

Parsing strategy: parse_video_analysis_response() handles malformed JSON gracefully (recovers triple-backtick wrapping, trailing commas, incomplete responses) because LLM outputs aren't always perfectly formatted. Robust parsing prevents analysis failures from minor formatting issues.
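That kind of tolerant parsing can be sketched like this. This is a generic helper handling the two failure modes mentioned above, not the application's parse_video_analysis_response:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse model output that may be fence-wrapped or contain trailing commas."""
    text = raw.strip()
    # Strip ```json ... ``` wrapping if present
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Remove trailing commas before } or ] and retry
        repaired = re.sub(r",\s*([}\]])", r"\1", text)
        return json.loads(repaired)

raw = '```json\n{"risk_level": "high", "objects": ["sedan",],}\n```'
print(parse_llm_json(raw))  # -> {'risk_level': 'high', 'objects': ['sedan']}
```

A production parser would add more recovery passes (truncated output, stray prose before the JSON), but the structure is the same: try strict parsing first, repair only on failure.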

Transcript generation: The system runs a separate Pegasus call optimized for transcription, requesting timestamped segments in structured JSON. This produces ordered transcript entries with start/end times, enabling investigators to jump directly to spoken content.


3.2 - Object Detection and Face Keyframe Extraction

Beyond basic transcription, the platform identifies detected objects and useful face keyframes with precise timestamps (the specific moments where faces appear clearly):

Source: backend/app/routes/videos.py (Line 707)

raw_response = pegasus_analyze_video(s3_uri, get_detect_prompt())
log.info("[INSIGHTS] Pegasus response received in %.1fs (%d chars)", time.perf_counter() - t0, len(raw_response or ""))

detect_data = parse_detect_response(raw_response)
objects_raw = detect_data["objects"]
face_keyframes = detect_data["face_keyframes"]

The detection prompt (view source) asks Pegasus to identify:

  • Objects: Vehicles, weapons, physical evidence, environmental details

  • Face keyframes: Timestamps where faces appear with sufficient clarity for identification

Why keyframes matter: Not every frame containing a face is useful. Faces seen from behind, in motion blur, or poorly lit don't help identification. Pegasus selects keyframes where faces are frontal, well-lit, and clear; the frames an investigator would actually screenshot for evidence.

Post-processing: Once Pegasus returns keyframes and objects, the system extracts those specific frames, generates thumbnails, and stores them for quick review. This eliminates re-processing video every time an investigator needs to see a face.


3.3 - Face Presence Timeline: Where Does This Person Appear?

After detecting faces, the system builds a presence timeline showing when each detected person appears throughout the video:

Source: backend/app/routes/videos.py (Line 1027)

if use_marengo:
    # Marengo-based presence: match each face embedding to clip embeddings
    for j, emb in enumerate(face_embeddings):
        if not emb:
            continue
        clips = clips_above_threshold(
            emb,
            output_uri,
            min_score=FACE_PRESENCE_MATCH_THRESHOLD,
            visual_only=True,
            max_clips=50,
        )
        for clip in clips:
            c_start = float(clip.get("start", 0.0))
            c_end = float(clip.get("end", c_start + 0.5))
            for i in range(n_segments):
                s0 = i * seg_dur
                s1 = (i + 1) * seg_dur
                if c_end > s0 and c_start < s1:
                    presence_by_face[j]["segment_presence"][i] = 1

How the timeline is constructed:

  1. Video is divided into fixed-duration segments (e.g., 30-second windows)

  2. Each detected face gets its embedding compared against all clip embeddings

  3. Clips scoring above threshold are marked as "person present"

  4. Matched clips are mapped onto the segment timeline

  5. Result: A binary presence map showing which segments contain each person

Why this matters for investigations: An investigator can see at a glance: "Person A appears in segments 3, 7, 12, and 18" without watching the entire video. Click segment 7, jump directly to that 30-second window, confirm visual match.

The threshold trade-off: FACE_PRESENCE_MATCH_THRESHOLD controls sensitivity. Higher values reduce false positives but may miss valid appearances (different angles, lighting changes). Lower values catch more instances but produce more false matches requiring manual review. The demo uses 0.70; production systems should let investigators adjust this per-case.
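The interval-to-segment mapping (step 4 above, and the overlap test in the excerpt) can be sketched in isolation:

```python
import math

def presence_map(clips: list[dict], video_duration: float, seg_dur: float = 30.0) -> list[int]:
    """Mark each fixed-duration segment 1 if any matched clip overlaps it."""
    n_segments = math.ceil(video_duration / seg_dur)
    presence = [0] * n_segments
    for clip in clips:
        c_start, c_end = clip["start"], clip["end"]
        for i in range(n_segments):
            s0, s1 = i * seg_dur, (i + 1) * seg_dur
            if c_end > s0 and c_start < s1:  # interval overlap test
                presence[i] = 1
    return presence

# Person matched in two clips of a 3-minute video
clips = [{"start": 45.0, "end": 52.0}, {"start": 130.0, "end": 136.0}]
print(presence_map(clips, video_duration=180.0))  # -> [0, 1, 0, 0, 1, 0]
```

A clip straddling a segment boundary correctly marks both segments, since the overlap test is symmetric rather than bucketing by the clip's start time alone.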


Operational Considerations for Production Deployment

This demo proves the concept. Deploying to production requires addressing several operational concerns:

  1. Scalability: The single-index approach works for demos (12 videos) but needs architectural adjustments for production loads (1,000+ videos per case). Consider index partitioning strategies or migrating to a dedicated vector database for clip-level embeddings.

  2. Cost management: Bedrock charges per embedding generation. A 2-hour video generates ~1,200 clip embeddings. Processing 100 hours of footage per case means 60,000+ embeddings. Plan for batch processing, caching strategies, and reuse of embeddings across related cases.

  3. Security and compliance: Legal evidence requires chain-of-custody tracking, access controls, and audit logs. The demo stores videos in S3; production needs encryption at rest, role-based access, and immutable audit trails.

  4. Accuracy validation: Entity matching and timeline reconstruction should include confidence scores and require human review before being submitted as legal evidence. AI-generated analysis supports investigators but doesn't replace human judgment.

  5. Format handling: The demo assumes standard video codecs. Production systems must handle corrupted files, non-standard formats, encrypted footage, and low-quality sources gracefully.


Conclusion

This application demonstrates how video intelligence transforms legal evidence review from a time-intensive manual process into a targeted investigation workflow. By combining TwelveLabs through AWS Bedrock for video understanding with NeMo Retriever for document search, investigators can:

  • Search 12+ disparate video sources simultaneously with natural language queries

  • Track specific individuals or vehicles across unrelated camera feeds

  • Reconstruct chronological timelines from fragmented footage

  • Generate structured compliance analysis with risk categorization

  • Access transcript segments and detected objects with precise timestamps

The efficiency gain: Manual review of 40 hours of multi-source footage takes an investigator 40-60 hours. This platform reduces that to 4-6 hours of targeted review so that investigators spend time validating findings instead of hunting for them.

For legal tech ISVs building evidence management platforms, this architecture provides a reference implementation showing how TwelveLabs integrates into existing workflows without requiring wholesale platform rewrites. The single-index multi-source pattern, hybrid video+document retrieval, and structured analysis capabilities translate directly to production legal tech products.


Additional Resources

  1. TwelveLabs on AWS Bedrock: Learn more about model access

  2. NeMo Retriever documentation: Explore document retrieval capabilities

  3. TwelveLabs sample applications: Browse additional use cases

  4. Join the TwelveLabs community: Discord


Next steps:

  1. Clone the reference implementation

  2. Configure AWS Bedrock access and test with your video sources

  3. Adapt the single-index architecture to your specific legal workflows

  4. Integrate with your existing evidence management systems