プラットフォーム

価格

ソリューション

構築

資料

会社

Select Language

Playground

営業担当に相談する

Tutorials

Building A Workplace Safety Compliance Application with TwelveLabs and NVIDIA VSS

Nathan Che

This tutorial walks through building a Workplace Safety Compliance Application that uses a fine-tuned YOLO model to chunk live CCTV footage, NVIDIA VSS with Twelve Labs Marengo and Pegasus to generate OSHA compliance reports and efficiency recommendations, and an AI chatbot for real-time workplace safety queries.

In this article

No headings found on page

Join our newsletter

Receive the latest advancements, tutorials, and industry insights in video understanding

Search, analyze, and explore your videos with AI.

Try the Playground

2025/10/31

15 Minutes

Copy link to article

In this tutorial, you will learn how to build a video intelligence platform that monitors live CCTV cameras in varying workplaces, including factories, construction sites, and more, to find risky employee behaviour or machinery, OSHA compliance violations, and efficiency gaps in near real-time. More importantly, you will learn about the latest TwelveLab’s integration into NVIDIA VSS, allowing you to handle analyzing and summarizing large volumes of video data built for hardware and solutions built on NVIDIA.

Introduction

What if your on-site CCTV cameras transformed to become completely autonomous, with 24/7 surveillance that not only reported back security risks, but detailed reports on specific OSHA compliance issues, efficiency gaps, and risk assessments? 📃

This might sound like magic, but it’s what we’ve built to showcase how TwelveLab’s video intelligence models are changing the way computers understand unstructured data for the recent NVIDIA GTC DC 2025 conference! Best of yet, our models directly integrate with pre-existing NVIDIA frameworks and hardware like NVIDIA Video Search and Summarization (VSS).

Today you’ll learn how this is all possible by not only deploying the application in this step-by-step guide, but also learning the in-depth technical architecture used to build this real-time video intelligence platform. Specifically you will build the platform that transforms live streams into:

OSHA Compliance Reports: Automatically generated with specific regulation and fine references.

Muda (Waste) Recommendations: Designed to support lean management and optimize workspace efficiency.

An Interactive Chatbot: Allows managers to ask detailed questions about their workplace and receive instant, web-sourced feedback and recommendations.

A Dynamic Event Timeline: Enables users to quickly identify what incidents occurred and when.

Contextual AI Actions: AI-generated buttons placed at specific video timestamps and coordinates to highlight inefficiencies, compliance issues, and subtle incidents.

* Note: The concepts and technology here stretch far beyond the workplace and if you’re interested in learning how TwelveLabs can make a difference in your industry, I highly recommend you to check out the Beyond The Workplace section of this blog!

Application Demo

Before we begin coding, please check out the video and deployed application below to get familiarized with what we’ll be building.

Test it out yourself: NVIDIA VSS + Twelve Labs Manufacturing Automation!

GitHub: nathanchess/twelvelabs-nvidia-vss-sample

With that in mind, let’s get started! 😊

Learning Objectives

In this tutorial you will:

Fine tune your own computer vision model on the You Only Look Once (YOLO) object detection algorithm with 15,000+ images for personal protective equipment classification.
Use FFmpeg to convert your MP4 files into live Real Time Streaming Protocol (RTSP) streams to simulate real CCTV cameras.
Learn how TwelveLabs integrates directly into NVIDIA VSS and AWS.
Understand advanced prompt engineering techniques such as chain-of-thought.
Build and run Docker containers to handle asynchronous API operations for video chunk uploading and live stream handling.

Prerequisites

Node.JS 20+: Node.js — Download Node.js ®
Python 3.8+: Download Python | Python.org
TwelveLabs API Key: Authentication | TwelveLabs
TwelveLabs Index: Python SDK | TwelveLabs
AWS Access Key: Credentials - Boto3 1.40.12 documentation
Docker Installation: Install | Docker Docs
Intermediate understanding of Python, APIs, and JavaScript.

Local Environment Setup

1 - Clone the repository into your local environment

>> git

2 - Clone NVIDIA VSS framework (with TwelveLab integration) into your local environment.

>> git

3 - Navigate into your AWS console and create a new S3 bucket named “nvidia-vss-source”

Full tutorial here: Creating a general purpose bucket - Amazon Simple Storage Service
This will act as storage for our fake CCTV camera footage!

4 - Add environment variables to frontend and rtsp-stream-worker folder.

.env.local (/frontend/)

TWELVE_LABS_API_KEY=...
NEXT_PUBLIC_TWELVELABS_MARENGO_INDEX_ID=...
NEXT_PUBLIC_TWELVELABS_PEGASUS_INDEX_ID=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_RTSP_STREAM_WORKER_URL="http://localhost:8000"
NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

.env (/rtsp-stream-worker/)

TWELVE_LABS_API_KEY=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

5 - Build and run Docker containers for both rtsp-stream-worker and NVIDIA VSS

RTSP Stream Worker Instructions: twelvelabs-nvidia-vss-sample/rtsp-stream-worker at main · nathanchess/twelvelabs-nvidia-vss-sample

NVIDIA VSS Instructions: nvidia-vss/src/vss-engine/src/models/twelve_labs at main · james-le-twelve-labs/nvidia-vss

6 - Start frontend sample application using Node Package Manager (NPM).

Inside your GitHub repository, navigate to your frontend folder in console and type the following:

>> npm

Then navigate to localhost:3000 to access the site.

* Ensure that NPM is installed: Downloading and installing Node.js and npm | npm Docs

Building the CV Pipeline

So how did we achieve near real-time on an infinitely streaming live video?

💡Learning Opportunity: Not only is live stream handling difficult, with limited bandwidth and unstable connection, but also costly!

Let’s prove it with some simple math. Take TwelveLab’s multimodal model Marengo. How much would it cost if we were to process a full 24 hour worth of video content into an index?

Marengo: ($.042/min + %.0015/min) * 1440 minutes = $60.48

Video Indexing Costs: $0.042 / minute.
Infrastructure Costs: $0.0015 / minute.

$60.48 per day would definitely not be sustainable for a fleet of cameras, something very common in the public and security sector to cover every inch of a property.

Note: Feel free to run your own calculations at the TwelveLabs pricing calculator: https://www.twelvelabs.io/pricing-calculator

So how do we solve this pricing issue? Well, the answer lies in chunking and pre-processing! Of course the constraints and environment you are dealing with may be different, but I ended up fine–tuning a YOLO model on personal protective equipment (PPE) detection. Here are the results:

Above we can see the model in action, identifying helmets in a variety of different environments, angles, and lighting. By having a diverse training dataset of over 15,000+ images and augmenting the images to have different colors, angles, etc. we ended up getting >90% accuracy on common PPE items like vests, helmets, gloves, boots, and more!

💡Learning Opportunity: Though out of the scope of this blog, if you would like to view the training scripts and learn how to build one from scratch feel free to check out the /cv_model's README in the repository: https://github.com/nathanchess/twelvelabs-nvidia-vss-sample/tree/main/cv_model

So here’s how it’s used for cost-saving and video chunking:

/rtsp-stream-worker/main.py (Lines 357 - 391)

def analyze_video(self, video_source: str):
       
    people_count, ppe_count = 0, 0    


    results = self.model.predict(frame, conf=0.25, iou=0.45, max_det=1000)
    processed_frame = self._draw_boxes(frame, results)


    for box in results[0].boxes:
        class_id = int(box.cls[0])
        class_name = self.model.model.names[class_id]
        if class_name == "Person":
             people_count += 1
        else:
             ppe_count += 1


        # Write frame to output video
        video_writer.write(processed_frame)


        video_capture.release()
        video_writer.release()
        return new_video_source
 else:
        raise FileNotFoundError(f"Video file not found: {video_source}")

The code block simply above simply aggregates the number of people with protective gear seen in the frame. This data can then be used to create a custom chunking algorithm, tailored to a variety of workplace needs. Specifically, only video chunks of interest (missing PPE) will be fed into our multimodal model, reducing the 24 hour video content into minutes or even seconds of computation power.

Think about the type of PPE / compliance issues that may arise depending on the setting:

Food Processing Manufacturer: May require all employees to have a pair of gloves.
Medical Facilities: Require certain people to have face masks and gloves.
Construction Sites: Require a full set of PPE.

By having this customizability, individual factories can gain significant savings cost with this platform, while still maintaining high accuracy!

* Note: This does seem like a lot of upfront work, hence in the “Connecting with NVIDIA VSS” section, we will talk about how you can use pre-built video chunking algorithms in the NVIDIA VSS framework!

Creating a Fake Camera with Real-Time Streaming Protocol

Great, so now we have the chunking algorithm down ready to help with the pre-processing and cost saving for our live video streams. But how about the streams themselves?

Well upon a quick Google search you’ll notice it’s quite difficult to find open-source cameras linked to real factories, workplaces, etc. Which is why we’ll have to build our own in Python!

💡Learning Opportunity: What are the required components to simulate a fake camera?

To build a robust and realistic "fake" camera stream that mimics a real-world IP camera, we need to create a pipeline that handles content, broadcasting, and distribution. For our project, this pipeline consists of four main components:

The Content (A Video File): You need some "footage." This is just any standard video file that will play on a continuous loop, pretending to be a real, live scene.
The "Camera" (A Broadcaster): You need a process that acts like the camera itself. Its job is to take that video file and "broadcast" it onto the network, just like a real IP camera would.
The Camera's Signal (A Streaming Protocol): To broadcast, the "camera" needs to speak the right language. It sends its feed out using a common camera protocol like RTSP (Real-Time Streaming Protocol). This is the "raw" feed, designed for sending video between devices.
The Hub (A Media Server): You need a central server to "listen" for that stream. This server opens a specific network port and waits for the RTSP feed to arrive. Think of it as the central security hub where all the camera feeds come in.
The "For-Everyone" Format (HLS): That raw RTSP feed is great for servers, but terrible for web browsers or phones. So, the server's main job is to "re-package" the stream into a web-friendly format like HLS (HTTP Live Streaming). This format breaks the video into small, 10-second chunks that can be sent over the normal web.

Let’s see these components in action below in the technical architecture:

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

This backend service is deployed on a monolithic architecture, meaning all our code is centralized into one codebase, and in our case a singular Deep Learning AMI EC2 instance! Let me briefly explain each piece of technology.

AWS S3 Buckets: This serves as the source video content for our simulated content. Put simply, just hold our MP4 files.

FFmpeg: This is the engine of our virtual camera. FFmpeg is a command-line powerhouse that does the heavy lifting:

It pulls the .mp4 file from S3.
It puts that file on an infinite loop (-stream_loop -1).
Most importantly, it transcodes this file in real-time into an RTSP (Real-Time Streaming Protocol) feed.

💡Learning Opportunity: RTSP is the native language of most IP security cameras. By creating an RTSP stream, FFmpeg is perfectly simulating a high-end camera broadcasting 24/7 onto our server's local network. Read more here: What is an RTSP Camera? – Real Time Streaming Protocol Explained Cloud based and Central Management

MediaMTX: This is our universal translator and distribution hub. The problem? Web browsers cannot play the RTSP stream that FFmpeg creates. MediaMTX is a brilliant, zero-dependency server that:

Ingests the single RTSP feed from FFmpeg.
Repackages it instantly into multiple web-friendly formats.

For our project, we use HLS (HTTP Live Streaming). MediaMTX automatically chops the live video into small, 10-second video files (.ts) and creates a "playlist" file (.m3u8) that tells the video player where to find the next chunk. This is the modern standard that allows any web browser or mobile app to play our stream smoothly.

Finetuned YOLO Model: This our previous chunking algorithm at play! As FFmpeg loops and encodes the video into RTSP format, we will use this fine-tuned model to detect objects inside the video.

CloudFlare SSH Tunnels: This is our secure gateway to the public internet. Our EC2 instance is completely locked down—it has zero open inbound ports, making it invisible to scanners and attackers.

💡Learning Opportunity: So how does the outside world see our HLS stream? Cloudflare's lightweight cloudflared agent creates a secure, outbound-only tunnel from our EC2 instance to the Cloudflare network. Cloudflare then acts as the public-facing entry point, giving us a stable URL (e.g., live.myproject.com), free SSL, DDoS protection, and caching, all without us ever needing a static IP or configuring a single firewall rule.

The end product? Highly secure HTTP Live Streaming protocols that can be easily connected to any common video player or website.

Notice how the temporary cloudflare URL generated works on public sites like hlsplayer.net ☺️! This means that our HLS URL is not only public, but formatted into a proper .m3u8 file, allowing any video player to find the next chunk properly.

Connecting with NVIDIA VSS

Great so now we have our fake cameras up and running and the pre-processing algorithm deployed onto our AWS EC2. Time to wrap things up by integrating arguably the most important part, a highly intelligent and context-aware video intelligence model to process chunks of interest. This is where NVIDIA VSS’s TwelveLabs integration steps in.

NVIDIA VSS Repo: james-le-twelve-labs/nvidia-vss: Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

💡Learning Opportunity: Before jumping into the code and technical architecture, it’s important to understand what NVIDIA VSS even is and the tools it provides to those that build software on top of NVIDIA hardware.

VSS stands for Video Search and Summarization. Essentially this is not a single NVIDIA software like CUDA, rather it is a NVIDIA AI Blueprint, giving developers a quick way to deploy powerful AI agents that can understand, search, and summarize video content. It does so with:

Vision Language Models (VLMs): This is the "seeing" and "understanding" part. VSS provides a pipeline to feed video frames into VLMs, which then generate rich, text-based descriptions (dense captions) of what is happening in each video chunk.

Large Language Models (LLMs): This is the "reasoning" and "communication" part. The text descriptions from the VLM are fed to an LLM, which is what allows you to perform summarization and have a natural language Q&A session.

Retrieval-Augmented Generation (RAG): VSS doesn't just pass data to an LLM; it uses a technique called RAG. It stores the generated video descriptions in a specialized database (a vector or graph database). When you ask a question, it first retrieves the most relevant video chunks from the database and then augments the LLM's prompt with this specific context, leading to highly accurate, data-grounded answers.

GPU-Accelerated Ingestion: It provides a high-performance pipeline for pulling in video (from files or live RTSP streams), decoding it, and preparing it for the AI models, all using the power of NVIDIA GPUs.

Computer Vision (CV) Pipeline Integration: VSS is designed to work with (not just replace) traditional CV. Developers can integrate object detection and tracking models (like YOLO or those in the NVIDIA DeepStream SDK). This adds critical metadata (e.g., "person_1," "box_5") that the VLM and LLM can use to provide even more specific and accurate answers.

Audio Transcription: The blueprint also includes tools to process the audio track from videos, converting speech into text. This adds another layer of searchable data, allowing you to query what was said in a video, not just what was seen.

NVIDIA NIMs (NVIDIA Inference Microservices): Instead of forcing developers to build and optimize their own AI model servers, VSS is often powered by NIMs. These are pre-packaged, optimized, and containerized microservices that make deploying the VLMs and LLMs as simple as running a container.

This NVIDIA VSS blueprint is incredibly powerful, but as you can see, it's also a lot to build, deploy, and manage. You're responsible for the VLM, the LLM, the audio transcription pipeline, the CV pipeline, and the RAG database—all running on your own infrastructure.

This is precisely where TwelveLabs provides a massive accelerator.

TwelveLabs Playground Index Search Capability (TwelveLabs | Home)

Imagine abstracting away all that complexity. Instead of managing a half-dozen different models and services, you get a complete remote deployment that handles everything for you. The intelligent chunking, the computer vision, the VLM analysis, the audio transcription, and the large-language-model reasoning are all unified into a single, powerful platform, accessible via a simple API. You send the video stream, and TwelveLabs returns the insights.

But here’s the most powerful part: this isn't an all-or-nothing replacement.

The true value is in modularity. Our architecture is designed to be highly configurable, allowing you to create hybrid NVIDIA-based AI workflows.

With this knowledge, let’s go ahead and look at how our technical architecture leverages the complete remote deployment in NVIDIA VSS.

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

Simply put, the uploaded video chunks are passed in two steps.

AWS S3 Bucket (nvidia-vss-streams): Storing it in an external container, outside of the multimodal model, allows to have long-term forensics data for incidents, reporting, etc.

NVIDIA VSS TwelveLabs Integration: The uploaded video content is then further passed to NVIDIA VSS, where it will undergo a series of processes:

Index Creation: Create new dedicated index(es) for both the TwelveLabs Marengo and Pegasus model.
Video Storage: Store videos into indexes, allowing it to be ready to be searched, embedded, and summarized. Feel free to view all video capabilities offered by TwelveLabs here: TwelveLabs | Product Overview

The various features of the page then simply search and summarize the video content uploaded via. NVIDIA VSS API Endpoints! Let’s see how it works in action.

Example Feature 1: AI Compliance Chatbot — Jade.

This is the right panel on the screenshot below and essentially allows factory owners to instantly investigate video chunks deeper with unstructured queries. Whether that be questions about an incident, possible improvements, or local OSHA compliances based on geolocation, they can get answers immediately specific to their factory.

This was built on TwelveLabs' powerful Pegasus conversational model, with extra chat history prompt engineering!

/frontend/src/app/components/ClipChat.js (Lines 49-72)

const typingId = Date.now();
        setChatHistory(prev => [...prev, { role: 'assistant', text: '', date: Date.now(), typing: true, _id: typingId }]);


        try {
            const prompt = `You are Jade, an expert safety and compliance officer.
       
            Here is the chat history: ${chatHistory.map(m => `${m.role}: ${m.text}`).join('\n')};


            The user asks: ${message};
           
            The user's geolocation is unknown, please reference general safety and compliance standards.


            If the user asks about safety, compliance, or improvements, you should always reference the user's geolocation and the laws in that area when providing your response.
            Do not mention the coordinates, just the location and city.
           
            Be highly detailed and specific, by referencing specific machines, processes, and equipment you see in the video and the second.


            `;


            const resp = await fetch('/api/analysis', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ videoId, userQuery: prompt })
            });

* Notice: The remote deployment allowed your chunked video to effortlessly be prompted in realistically under 4 lines (excluding the prompt itself, which can at times get huge).

Example Feature 2: On-Demand Compliance Report & Video Metadata.

Factory owners can instantly generate full business-ready compliance reports with details like OSHA compliance (specific violations), potential fines, corrective actions, and efficiency recommendations.

/frontend/src/app/clips/[id]/page.js (Lines 328 - 331)

const response = await fetch(`/api/analysis/${clipData['pegasusId']}`, {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' }
})

This was built on TwelveLabs' powerful Marengo search and summarization model, which allowed us to instantly generate detailed summaries of our video content with preset prompts!

And, that’s it! You have just built a platform that transforms live stream video content into searchable, summarized, and embedded structured content with the help of the NVIDIA VSS TwelveLabs integration with no additional hardware, GPU, or training costs 🥳.

Conclusion

Great thanks for reading along! You not only built an end-to-end real-time surveillance platform that could benefit millions of companies in the public sector, but learned about NVIDIA VSS and how the recent TwelveLabs integration can help bring your video-intensive projects from ideation to code in hours.

Check out some more in-depth resources regarding this project here:

Technical Architecture Diagram: LucidApp
Technical Design Document: [NVIDIA GTC] - Manufacturing Automation Technical Design
NVIDIA VSS TwelveLabs Integration Repo: james-le-twelve-labs/nvidia-vss: Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A
Project Repo: nathanchess/twelvelabs-nvidia-vss-sample
Demo Video: NVIDIA VSS TwelveLabs Integration: Manufacturing Automation

Introduction

OSHA Compliance Reports: Automatically generated with specific regulation and fine references.

Muda (Waste) Recommendations: Designed to support lean management and optimize workspace efficiency.

An Interactive Chatbot: Allows managers to ask detailed questions about their workplace and receive instant, web-sourced feedback and recommendations.

A Dynamic Event Timeline: Enables users to quickly identify what incidents occurred and when.

Contextual AI Actions: AI-generated buttons placed at specific video timestamps and coordinates to highlight inefficiencies, compliance issues, and subtle incidents.

Application Demo

Before we begin coding, please check out the video and deployed application below to get familiarized with what we’ll be building.

Test it out yourself: NVIDIA VSS + Twelve Labs Manufacturing Automation!

GitHub: nathanchess/twelvelabs-nvidia-vss-sample

With that in mind, let’s get started! 😊

Learning Objectives

In this tutorial you will:

Fine tune your own computer vision model on the You Only Look Once (YOLO) object detection algorithm with 15,000+ images for personal protective equipment classification.
Use FFmpeg to convert your MP4 files into live Real Time Streaming Protocol (RTSP) streams to simulate real CCTV cameras.
Learn how TwelveLabs integrates directly into NVIDIA VSS and AWS.
Understand advanced prompt engineering techniques such as chain-of-thought.
Build and run Docker containers to handle asynchronous API operations for video chunk uploading and live stream handling.

Prerequisites

Node.JS 20+: Node.js — Download Node.js ®
Python 3.8+: Download Python | Python.org
TwelveLabs API Key: Authentication | TwelveLabs
TwelveLabs Index: Python SDK | TwelveLabs
AWS Access Key: Credentials - Boto3 1.40.12 documentation
Docker Installation: Install | Docker Docs
Intermediate understanding of Python, APIs, and JavaScript.

Local Environment Setup

1 - Clone the repository into your local environment

>> git

2 - Clone NVIDIA VSS framework (with TwelveLab integration) into your local environment.

>> git

3 - Navigate into your AWS console and create a new S3 bucket named “nvidia-vss-source”

Full tutorial here: Creating a general purpose bucket - Amazon Simple Storage Service
This will act as storage for our fake CCTV camera footage!

4 - Add environment variables to frontend and rtsp-stream-worker folder.

.env.local (/frontend/)

TWELVE_LABS_API_KEY=...
NEXT_PUBLIC_TWELVELABS_MARENGO_INDEX_ID=...
NEXT_PUBLIC_TWELVELABS_PEGASUS_INDEX_ID=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_RTSP_STREAM_WORKER_URL="http://localhost:8000"
NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

.env (/rtsp-stream-worker/)

TWELVE_LABS_API_KEY=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

5 - Build and run Docker containers for both rtsp-stream-worker and NVIDIA VSS

RTSP Stream Worker Instructions: twelvelabs-nvidia-vss-sample/rtsp-stream-worker at main · nathanchess/twelvelabs-nvidia-vss-sample

NVIDIA VSS Instructions: nvidia-vss/src/vss-engine/src/models/twelve_labs at main · james-le-twelve-labs/nvidia-vss

6 - Start frontend sample application using Node Package Manager (NPM).

Inside your GitHub repository, navigate to your frontend folder in console and type the following:

>> npm

Then navigate to localhost:3000 to access the site.

* Ensure that NPM is installed: Downloading and installing Node.js and npm | npm Docs

Building the CV Pipeline

So how did we achieve near real-time on an infinitely streaming live video?

💡Learning Opportunity: Not only is live stream handling difficult, with limited bandwidth and unstable connection, but also costly!

Let’s prove it with some simple math. Take TwelveLab’s multimodal model Marengo. How much would it cost if we were to process a full 24 hour worth of video content into an index?

Marengo: ($.042/min + %.0015/min) * 1440 minutes = $60.48

Video Indexing Costs: $0.042 / minute.
Infrastructure Costs: $0.0015 / minute.

$60.48 per day would definitely not be sustainable for a fleet of cameras, something very common in the public and security sector to cover every inch of a property.

Note: Feel free to run your own calculations at the TwelveLabs pricing calculator: https://www.twelvelabs.io/pricing-calculator

So here’s how it’s used for cost-saving and video chunking:

/rtsp-stream-worker/main.py (Lines 357 - 391)

def analyze_video(self, video_source: str):
       
    people_count, ppe_count = 0, 0    


    results = self.model.predict(frame, conf=0.25, iou=0.45, max_det=1000)
    processed_frame = self._draw_boxes(frame, results)


    for box in results[0].boxes:
        class_id = int(box.cls[0])
        class_name = self.model.model.names[class_id]
        if class_name == "Person":
             people_count += 1
        else:
             ppe_count += 1


        # Write frame to output video
        video_writer.write(processed_frame)


        video_capture.release()
        video_writer.release()
        return new_video_source
 else:
        raise FileNotFoundError(f"Video file not found: {video_source}")

Think about the type of PPE / compliance issues that may arise depending on the setting:

Food Processing Manufacturer: May require all employees to have a pair of gloves.
Medical Facilities: Require certain people to have face masks and gloves.
Construction Sites: Require a full set of PPE.

By having this customizability, individual factories can gain significant savings cost with this platform, while still maintaining high accuracy!

Creating a Fake Camera with Real-Time Streaming Protocol

Great, so now we have the chunking algorithm down ready to help with the pre-processing and cost saving for our live video streams. But how about the streams themselves?

Well upon a quick Google search you’ll notice it’s quite difficult to find open-source cameras linked to real factories, workplaces, etc. Which is why we’ll have to build our own in Python!

💡Learning Opportunity: What are the required components to simulate a fake camera?

The Content (A Video File): You need some "footage." This is just any standard video file that will play on a continuous loop, pretending to be a real, live scene.
The "Camera" (A Broadcaster): You need a process that acts like the camera itself. Its job is to take that video file and "broadcast" it onto the network, just like a real IP camera would.
The Camera's Signal (A Streaming Protocol): To broadcast, the "camera" needs to speak the right language. It sends its feed out using a common camera protocol like RTSP (Real-Time Streaming Protocol). This is the "raw" feed, designed for sending video between devices.
The Hub (A Media Server): You need a central server to "listen" for that stream. This server opens a specific network port and waits for the RTSP feed to arrive. Think of it as the central security hub where all the camera feeds come in.
The "For-Everyone" Format (HLS): That raw RTSP feed is great for servers, but terrible for web browsers or phones. So, the server's main job is to "re-package" the stream into a web-friendly format like HLS (HTTP Live Streaming). This format breaks the video into small, 10-second chunks that can be sent over the normal web.

Let’s see these components in action below in the technical architecture:

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

AWS S3 Buckets: This serves as the source video content for our simulated content. Put simply, just hold our MP4 files.

FFmpeg: This is the engine of our virtual camera. FFmpeg is a command-line powerhouse that does the heavy lifting:

It pulls the .mp4 file from S3.
It puts that file on an infinite loop (-stream_loop -1).
Most importantly, it transcodes this file in real-time into an RTSP (Real-Time Streaming Protocol) feed.

Ingests the single RTSP feed from FFmpeg.
Repackages it instantly into multiple web-friendly formats.

Finetuned YOLO Model: This our previous chunking algorithm at play! As FFmpeg loops and encodes the video into RTSP format, we will use this fine-tuned model to detect objects inside the video.

The end product? Highly secure HTTP Live Streaming protocols that can be easily connected to any common video player or website.

Connecting with NVIDIA VSS

NVIDIA VSS Repo: james-le-twelve-labs/nvidia-vss: Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

Vision Language Models (VLMs): This is the "seeing" and "understanding" part. VSS provides a pipeline to feed video frames into VLMs, which then generate rich, text-based descriptions (dense captions) of what is happening in each video chunk.

Large Language Models (LLMs): This is the "reasoning" and "communication" part. The text descriptions from the VLM are fed to an LLM, which is what allows you to perform summarization and have a natural language Q&A session.

Retrieval-Augmented Generation (RAG): VSS doesn't just pass data to an LLM; it uses a technique called RAG. It stores the generated video descriptions in a specialized database (a vector or graph database). When you ask a question, it first retrieves the most relevant video chunks from the database and then augments the LLM's prompt with this specific context, leading to highly accurate, data-grounded answers.

GPU-Accelerated Ingestion: It provides a high-performance pipeline for pulling in video (from files or live RTSP streams), decoding it, and preparing it for the AI models, all using the power of NVIDIA GPUs.

Computer Vision (CV) Pipeline Integration: VSS is designed to work with (not just replace) traditional CV. Developers can integrate object detection and tracking models (like YOLO or those in the NVIDIA DeepStream SDK). This adds critical metadata (e.g., "person_1," "box_5") that the VLM and LLM can use to provide even more specific and accurate answers.

Audio Transcription: The blueprint also includes tools to process the audio track from videos, converting speech into text. This adds another layer of searchable data, allowing you to query what was said in a video, not just what was seen.

NVIDIA NIMs (NVIDIA Inference Microservices): Instead of forcing developers to build and optimize their own AI model servers, VSS is often powered by NIMs. These are pre-packaged, optimized, and containerized microservices that make deploying the VLMs and LLMs as simple as running a container.

This is precisely where TwelveLabs provides a massive accelerator.

TwelveLabs Playground Index Search Capability (TwelveLabs | Home)

But here’s the most powerful part: this isn't an all-or-nothing replacement.

The true value is in modularity. Our architecture is designed to be highly configurable, allowing you to create hybrid NVIDIA-based AI workflows.

With this knowledge, let’s go ahead and look at how our technical architecture leverages the complete remote deployment in NVIDIA VSS.

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

Simply put, the uploaded video chunks are passed in two steps.

AWS S3 Bucket (nvidia-vss-streams): Storing it in an external container, outside of the multimodal model, allows to have long-term forensics data for incidents, reporting, etc.

NVIDIA VSS TwelveLabs Integration: The uploaded video content is then further passed to NVIDIA VSS, where it will undergo a series of processes:

Index Creation: Create new dedicated index(es) for both the TwelveLabs Marengo and Pegasus model.
Video Storage: Store videos into indexes, allowing it to be ready to be searched, embedded, and summarized. Feel free to view all video capabilities offered by TwelveLabs here: TwelveLabs | Product Overview

The various features of the page then simply search and summarize the video content uploaded via. NVIDIA VSS API Endpoints! Let’s see how it works in action.

Example Feature 1: AI Compliance Chatbot — Jade.

This was built on TwelveLabs' powerful Pegasus conversational model, with extra chat history prompt engineering!

/frontend/src/app/components/ClipChat.js (Lines 49-72)

const typingId = Date.now();
        setChatHistory(prev => [...prev, { role: 'assistant', text: '', date: Date.now(), typing: true, _id: typingId }]);


        try {
            const prompt = `You are Jade, an expert safety and compliance officer.
       
            Here is the chat history: ${chatHistory.map(m => `${m.role}: ${m.text}`).join('\n')};


            The user asks: ${message};
           
            The user's geolocation is unknown, please reference general safety and compliance standards.


            If the user asks about safety, compliance, or improvements, you should always reference the user's geolocation and the laws in that area when providing your response.
            Do not mention the coordinates, just the location and city.
           
            Be highly detailed and specific, by referencing specific machines, processes, and equipment you see in the video and the second.


            `;


            const resp = await fetch('/api/analysis', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ videoId, userQuery: prompt })
            });

* Notice: The remote deployment allowed your chunked video to effortlessly be prompted in realistically under 4 lines (excluding the prompt itself, which can at times get huge).

Example Feature 2: On-Demand Compliance Report & Video Metadata.

/frontend/src/app/clips/[id]/page.js (Lines 328 - 331)

const response = await fetch(`/api/analysis/${clipData['pegasusId']}`, {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' }
})

This was built on TwelveLabs' powerful Marengo search and summarization model, which allowed us to instantly generate detailed summaries of our video content with preset prompts!

Conclusion

Check out some more in-depth resources regarding this project here:

Technical Architecture Diagram: LucidApp
Technical Design Document: [NVIDIA GTC] - Manufacturing Automation Technical Design
NVIDIA VSS TwelveLabs Integration Repo: james-le-twelve-labs/nvidia-vss: Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A
Project Repo: nathanchess/twelvelabs-nvidia-vss-sample
Demo Video: NVIDIA VSS TwelveLabs Integration: Manufacturing Automation

Introduction

OSHA Compliance Reports: Automatically generated with specific regulation and fine references.

Muda (Waste) Recommendations: Designed to support lean management and optimize workspace efficiency.

An Interactive Chatbot: Allows managers to ask detailed questions about their workplace and receive instant, web-sourced feedback and recommendations.

A Dynamic Event Timeline: Enables users to quickly identify what incidents occurred and when.

Contextual AI Actions: AI-generated buttons placed at specific video timestamps and coordinates to highlight inefficiencies, compliance issues, and subtle incidents.

Application Demo

Before we begin coding, please check out the video and deployed application below to get familiarized with what we’ll be building.

Test it out yourself: NVIDIA VSS + Twelve Labs Manufacturing Automation!

GitHub: nathanchess/twelvelabs-nvidia-vss-sample

With that in mind, let’s get started! 😊

Learning Objectives

In this tutorial you will:

Fine tune your own computer vision model on the You Only Look Once (YOLO) object detection algorithm with 15,000+ images for personal protective equipment classification.
Use FFmpeg to convert your MP4 files into live Real Time Streaming Protocol (RTSP) streams to simulate real CCTV cameras.
Learn how TwelveLabs integrates directly into NVIDIA VSS and AWS.
Understand advanced prompt engineering techniques such as chain-of-thought.
Build and run Docker containers to handle asynchronous API operations for video chunk uploading and live stream handling.

Prerequisites

Node.JS 20+: Node.js — Download Node.js ®
Python 3.8+: Download Python | Python.org
TwelveLabs API Key: Authentication | TwelveLabs
TwelveLabs Index: Python SDK | TwelveLabs
AWS Access Key: Credentials - Boto3 1.40.12 documentation
Docker Installation: Install | Docker Docs
Intermediate understanding of Python, APIs, and JavaScript.

Local Environment Setup

1 - Clone the repository into your local environment

>> git

2 - Clone NVIDIA VSS framework (with TwelveLab integration) into your local environment.

>> git

3 - Navigate into your AWS console and create a new S3 bucket named “nvidia-vss-source”

Full tutorial here: Creating a general purpose bucket - Amazon Simple Storage Service
This will act as storage for our fake CCTV camera footage!

4 - Add environment variables to frontend and rtsp-stream-worker folder.

.env.local (/frontend/)

TWELVE_LABS_API_KEY=...
NEXT_PUBLIC_TWELVELABS_MARENGO_INDEX_ID=...
NEXT_PUBLIC_TWELVELABS_PEGASUS_INDEX_ID=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_RTSP_STREAM_WORKER_URL="http://localhost:8000"
NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

.env (/rtsp-stream-worker/)

TWELVE_LABS_API_KEY=...


AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=nvidia-vss-source
AWS_REGION=us-east-1


NEXT_PUBLIC_VSS_BASE_URL="http://127.0.0.1:8080"

5 - Build and run Docker containers for both rtsp-stream-worker and NVIDIA VSS

RTSP Stream Worker Instructions: twelvelabs-nvidia-vss-sample/rtsp-stream-worker at main · nathanchess/twelvelabs-nvidia-vss-sample

NVIDIA VSS Instructions: nvidia-vss/src/vss-engine/src/models/twelve_labs at main · james-le-twelve-labs/nvidia-vss

6 - Start frontend sample application using Node Package Manager (NPM).

Inside your GitHub repository, navigate to your frontend folder in console and type the following:

>> npm

Then navigate to localhost:3000 to access the site.

* Ensure that NPM is installed: Downloading and installing Node.js and npm | npm Docs

Building the CV Pipeline

So how did we achieve near real-time on an infinitely streaming live video?

💡Learning Opportunity: Not only is live stream handling difficult, with limited bandwidth and unstable connection, but also costly!

Let’s prove it with some simple math. Take TwelveLab’s multimodal model Marengo. How much would it cost if we were to process a full 24 hour worth of video content into an index?

Marengo: ($.042/min + %.0015/min) * 1440 minutes = $60.48

Video Indexing Costs: $0.042 / minute.
Infrastructure Costs: $0.0015 / minute.

$60.48 per day would definitely not be sustainable for a fleet of cameras, something very common in the public and security sector to cover every inch of a property.

Note: Feel free to run your own calculations at the TwelveLabs pricing calculator: https://www.twelvelabs.io/pricing-calculator

So here’s how it’s used for cost-saving and video chunking:

/rtsp-stream-worker/main.py (Lines 357 - 391)

def analyze_video(self, video_source: str):
       
    people_count, ppe_count = 0, 0    


    results = self.model.predict(frame, conf=0.25, iou=0.45, max_det=1000)
    processed_frame = self._draw_boxes(frame, results)


    for box in results[0].boxes:
        class_id = int(box.cls[0])
        class_name = self.model.model.names[class_id]
        if class_name == "Person":
             people_count += 1
        else:
             ppe_count += 1


        # Write frame to output video
        video_writer.write(processed_frame)


        video_capture.release()
        video_writer.release()
        return new_video_source
 else:
        raise FileNotFoundError(f"Video file not found: {video_source}")

Think about the type of PPE / compliance issues that may arise depending on the setting:

Food Processing Manufacturer: May require all employees to have a pair of gloves.
Medical Facilities: Require certain people to have face masks and gloves.
Construction Sites: Require a full set of PPE.

By having this customizability, individual factories can gain significant savings cost with this platform, while still maintaining high accuracy!

Creating a Fake Camera with Real-Time Streaming Protocol

Great, so now we have the chunking algorithm down ready to help with the pre-processing and cost saving for our live video streams. But how about the streams themselves?

Well upon a quick Google search you’ll notice it’s quite difficult to find open-source cameras linked to real factories, workplaces, etc. Which is why we’ll have to build our own in Python!

💡Learning Opportunity: What are the required components to simulate a fake camera?

The Content (A Video File): You need some "footage." This is just any standard video file that will play on a continuous loop, pretending to be a real, live scene.
The "Camera" (A Broadcaster): You need a process that acts like the camera itself. Its job is to take that video file and "broadcast" it onto the network, just like a real IP camera would.
The Camera's Signal (A Streaming Protocol): To broadcast, the "camera" needs to speak the right language. It sends its feed out using a common camera protocol like RTSP (Real-Time Streaming Protocol). This is the "raw" feed, designed for sending video between devices.
The Hub (A Media Server): You need a central server to "listen" for that stream. This server opens a specific network port and waits for the RTSP feed to arrive. Think of it as the central security hub where all the camera feeds come in.
The "For-Everyone" Format (HLS): That raw RTSP feed is great for servers, but terrible for web browsers or phones. So, the server's main job is to "re-package" the stream into a web-friendly format like HLS (HTTP Live Streaming). This format breaks the video into small, 10-second chunks that can be sent over the normal web.

Let’s see these components in action below in the technical architecture:

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

AWS S3 Buckets: This serves as the source video content for our simulated content. Put simply, just hold our MP4 files.

FFmpeg: This is the engine of our virtual camera. FFmpeg is a command-line powerhouse that does the heavy lifting:

It pulls the .mp4 file from S3.
It puts that file on an infinite loop (-stream_loop -1).
Most importantly, it transcodes this file in real-time into an RTSP (Real-Time Streaming Protocol) feed.

Ingests the single RTSP feed from FFmpeg.
Repackages it instantly into multiple web-friendly formats.

Finetuned YOLO Model: This our previous chunking algorithm at play! As FFmpeg loops and encodes the video into RTSP format, we will use this fine-tuned model to detect objects inside the video.

The end product? Highly secure HTTP Live Streaming protocols that can be easily connected to any common video player or website.

Connecting with NVIDIA VSS

NVIDIA VSS Repo: james-le-twelve-labs/nvidia-vss: Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

Vision Language Models (VLMs): This is the "seeing" and "understanding" part. VSS provides a pipeline to feed video frames into VLMs, which then generate rich, text-based descriptions (dense captions) of what is happening in each video chunk.

Large Language Models (LLMs): This is the "reasoning" and "communication" part. The text descriptions from the VLM are fed to an LLM, which is what allows you to perform summarization and have a natural language Q&A session.

Retrieval-Augmented Generation (RAG): VSS doesn't just pass data to an LLM; it uses a technique called RAG. It stores the generated video descriptions in a specialized database (a vector or graph database). When you ask a question, it first retrieves the most relevant video chunks from the database and then augments the LLM's prompt with this specific context, leading to highly accurate, data-grounded answers.

GPU-Accelerated Ingestion: It provides a high-performance pipeline for pulling in video (from files or live RTSP streams), decoding it, and preparing it for the AI models, all using the power of NVIDIA GPUs.

Computer Vision (CV) Pipeline Integration: VSS is designed to work with (not just replace) traditional CV. Developers can integrate object detection and tracking models (like YOLO or those in the NVIDIA DeepStream SDK). This adds critical metadata (e.g., "person_1," "box_5") that the VLM and LLM can use to provide even more specific and accurate answers.

Audio Transcription: The blueprint also includes tools to process the audio track from videos, converting speech into text. This adds another layer of searchable data, allowing you to query what was said in a video, not just what was seen.

NVIDIA NIMs (NVIDIA Inference Microservices): Instead of forcing developers to build and optimize their own AI model servers, VSS is often powered by NIMs. These are pre-packaged, optimized, and containerized microservices that make deploying the VLMs and LLMs as simple as running a container.

This is precisely where TwelveLabs provides a massive accelerator.

TwelveLabs Playground Index Search Capability (TwelveLabs | Home)

But here’s the most powerful part: this isn't an all-or-nothing replacement.

The true value is in modularity. Our architecture is designed to be highly configurable, allowing you to create hybrid NVIDIA-based AI workflows.

With this knowledge, let’s go ahead and look at how our technical architecture leverages the complete remote deployment in NVIDIA VSS.

Technical Architecture Diagram: LucidChart (Click to view in full-screen)

Simply put, the uploaded video chunks are passed in two steps.

AWS S3 Bucket (nvidia-vss-streams): Storing it in an external container, outside of the multimodal model, allows to have long-term forensics data for incidents, reporting, etc.

NVIDIA VSS TwelveLabs Integration: The uploaded video content is then further passed to NVIDIA VSS, where it will undergo a series of processes:

Index Creation: Create new dedicated index(es) for both the TwelveLabs Marengo and Pegasus model.
Video Storage: Store videos into indexes, allowing it to be ready to be searched, embedded, and summarized. Feel free to view all video capabilities offered by TwelveLabs here: TwelveLabs | Product Overview

The various features of the page then simply search and summarize the video content uploaded via. NVIDIA VSS API Endpoints! Let’s see how it works in action.

Example Feature 1: AI Compliance Chatbot — Jade.

This was built on TwelveLabs' powerful Pegasus conversational model, with extra chat history prompt engineering!

/frontend/src/app/components/ClipChat.js (Lines 49-72)

const typingId = Date.now();
        setChatHistory(prev => [...prev, { role: 'assistant', text: '', date: Date.now(), typing: true, _id: typingId }]);


        try {
            const prompt = `You are Jade, an expert safety and compliance officer.
       
            Here is the chat history: ${chatHistory.map(m => `${m.role}: ${m.text}`).join('\n')};


            The user asks: ${message};
           
            The user's geolocation is unknown, please reference general safety and compliance standards.


            If the user asks about safety, compliance, or improvements, you should always reference the user's geolocation and the laws in that area when providing your response.
            Do not mention the coordinates, just the location and city.
           
            Be highly detailed and specific, by referencing specific machines, processes, and equipment you see in the video and the second.


            `;


            const resp = await fetch('/api/analysis', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ videoId, userQuery: prompt })
            });

* Notice: The remote deployment allowed your chunked video to effortlessly be prompted in realistically under 4 lines (excluding the prompt itself, which can at times get huge).

Example Feature 2: On-Demand Compliance Report & Video Metadata.

/frontend/src/app/clips/[id]/page.js (Lines 328 - 331)

const response = await fetch(`/api/analysis/${clipData['pegasusId']}`, {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' }
})

This was built on TwelveLabs' powerful Marengo search and summarization model, which allowed us to instantly generate detailed summaries of our video content with preset prompts!