How to perform Video OCR using Twelve Labs API?
Ankit Khare
Date Published
May 19, 2023
API Tutorial
Video OCR
Video understanding
Text-in-video search

Video Optical Character Recognition (OCR) involves detecting and extracting text from video frames using computer vision and machine learning algorithms. With video OCR, you can easily sift through your video content, pinpointing the exact moments where certain words, phrases or even entire sentences make their appearance on the screen. Imagine the applications - from streamlining content search and navigation, to diving deep into content analysis, optimizing advertisement placement, summing up content, turbocharging SEO, and ensuring compliance and monitoring.

Examples of elements that can be recognized by video OCR include:

  • Slide content during presentations or meetings
  • Product names as they're showcased on screen, such as in advertisements, films, or TV shows
  • Athlete or team names as they're displayed on jerseys during sports broadcasts
  • Nametags and names visible during meetings or conferences
  • Scribbles on whiteboards within lecture videos
  • Documents captured within video footage
  • Handwritten texts appearing on screen
  • License plate numbers and building names
  • Subtitles, captions, and ending credits within films and interviews

In this tutorial we will explore how the Twelve Labs platform enables video OCR at two distinct levels. At the video level, we take on the entire video in one fell swoop and extract every piece of text it holds. The index-level approach sharpens our focus, homing in on a specific keyword or a cluster of keywords, which we'll input as natural language queries to search across a library of videos indexed on the Twelve Labs platform.

The cherry on top? With Twelve Labs API at your disposal, you can accomplish all of this without worrying about the nitty-gritty of implementing and maintaining the OCR process. We've got your back from development to infrastructure, and even ongoing support. So gear up, and let's embark on this exciting expedition into the realm of video OCR together.


The Twelve Labs platform is presently in its open beta phase, and we are offering free video indexing credits for up to 10 hours upon sign-up. It'll be advantageous for you to sign up and get acquainted with the foundational aspects of the Twelve Labs platform before diving into this tutorial. Understanding video indexing, indexing options, the Task API, and search options is vital to following this tutorial smoothly, and I've covered all of these extensively in my first tutorial. However, if you hit a roadblock or find yourself lost at any juncture, don't hesitate to reach out. By the way, our response times on our Discord server are lightning fast 🚅🏎️⚡️ if Discord is your preferred platform.

Quick tour of the tutorial

Following our previous discourse, we will explore video OCR tackling it from two distinct angles and levels. Accordingly, I've divided this tutorial into two pivotal sections, followed by a finale where we bring everything together in a working demo web-app:

Video OCR - A three step process

The process of extracting all recognized text from a specific video entails these three steps:

  • Video Indexing - No surprises at this step; if you've been following along with my past tutorials, this step should feel like a familiar friend.
  • Retrieve the unique identifier of the video - Once the Twelve Labs platform finishes indexing our video, we will retrieve the unique identifier of the video we need OCR for.
  • Extract the text that appears on the screen - We'll pinpoint the video by using the specific index we created and the video ID associated with the video we need OCR for. The API will do the heavy lifting, serving up the results we're after.

Text-in-video Search - searching for specific text within all indexed videos

Video OCR enabled us to scrutinize an entire video and distill all instances of text. Now, the text-in-video search feature empowers us to zero in on precise moments or video snippets where the input or searched text materializes. This greatly diminishes the time spent perusing a sizable catalogue of videos, yielding accurate search results predicated on alignment of search terms with the text that becomes visible on screen during video playbacks.

In our initial tutorials, we delved into content search within indexed videos, using natural language queries and various search options like visual (audio-visual search), conversation (dialogue search), and text-in-video (OCR). In this tutorial, we're going to repurpose our approach, harnessing only OCR technology to search for text within videos. To optimize processing time and costs, we'll create an index using solely the text_in_video indexing option. Then, we'll fire off our search query with the text_in_video search option, enabling us to discover relevant text matches within the indexed videos.

Building the Demo App

To bring it all home, we'll take the data yielded by the API endpoints and showcase them on a webpage, spinning up a Flask-based demo app that serves up a simple HTML page. The result of the video OCR will be neatly tabulated, displaying timestamps and associated text, while the text search will show the query we used and the corresponding video segments we found in response.

Video OCR - A three step process

For the sake of simplicity, I've uploaded just two videos to an index using a pre-existing account. Feel free to sign up; given we're currently in open beta, you'll receive complimentary credits allowing you to index up to 10 hours of video content. If your needs extend beyond that, check out our pricing page for upgrading to the Developer plan.

Video Indexing

Here, we’re going to delve into the essential elements that we'll need to include in our Jupyter notebook. This includes the necessary imports, defining API URLs, creating the index, and uploading videos from our local file system to kick off the indexing process:

%env API_URL =

!pip install requests

import os
import requests
import glob
from pprint import pprint

# Retrieve the URL of the API and my API key
API_URL = os.getenv("API_URL")
assert API_URL

API_KEY = os.getenv("API_KEY")
assert API_KEY
# Construct the URL of the `/indexes` endpoint
INDEXES_URL = f"{API_URL}/indexes"

# Set the header of the request
default_header = {
    "x-api-key": API_KEY
}

# Define a function to create an index with a given name
def create_index(index_name, index_options, engine):
    # Declare a dictionary named data
    data = {
        "engine_id": engine,
        "index_options": index_options,
        "index_name": index_name,
    }

    # Create an index
    response = requests.post(INDEXES_URL, headers=default_header, json=data)

    # Store the unique identifier of your index
    INDEX_ID = response.json().get('_id')

    # Check if the status code is 201 and print success
    if response.status_code == 201:
        print(f"Status code: {response.status_code} - The request was successful and a new index was created.")
    else:
        print(f"Status code: {response.status_code}")
    return INDEX_ID

# Create the indexes
INDEX_ID = create_index(index_name = "extract_text", index_options=["text_in_video"], engine = "marengo2.5")

# Print the created index IDs
print(f"Created index IDs: {INDEX_ID}")

Next, we upload two videos to the index we've just created. The videos are titled "A Brief History of Film" (courtesy of Film Thought Project) and "GPT - Explained!" (courtesy of CodeEmporium). I have downloaded these videos from their respective YouTube channels and saved them in a folder named 'static' on my local hard drive. We'll use these local files to index the videos onto the Twelve Labs platform:

import os
import requests
from concurrent.futures import ThreadPoolExecutor

TASKS_URL = f"{API_URL}/tasks"
video_folder = 'static'  # folder containing the video files

def upload_video(file_name):
    # Validate if a video already exists in the index
    task_list_response = requests.get(
        TASKS_URL,
        headers=default_header,
        params={"index_id": INDEX_ID, "filename": file_name},
    )
    if "data" in task_list_response.json():
        task_list = task_list_response.json()["data"]
        if len(task_list) > 0:
            if task_list[0]['status'] == 'ready':
                print(f"Video '{file_name}' already exists in index {INDEX_ID}")
            else:
                print("task pending or validating")
            # A task for this file already exists, so don't create another one
            return

    # Proceed further to create a new task to index the current video if the video didn't exist in the index already
    print("Entering task creation code for the file: ", file_name)
    if file_name.endswith('.mp4'):  # Make sure the file is an MP4 video
        file_path = os.path.join(video_folder, file_name)  # Get the full path of the video file
        with open(file_path, "rb") as file_stream:
            data = {
                "index_id": INDEX_ID,
                "language": "en"
            }
            # The video will be indexed on the platform using the same name as the video file itself.
            file_param = [
                ("video_file", (file_name, file_stream, "application/octet-stream")),
            ]
            response = requests.post(TASKS_URL, headers=default_header, data=data, files=file_param)
            TASK_ID = response.json().get("_id")
            # Check if the status code is 201 and print success
            if response.status_code == 201:
                print(f"Status code: {response.status_code} - The request was successful and a new resource was created.")
            else:
                print(f"Status code: {response.status_code}")
            print(f"File name: {file_name}")

# Get list of video files
video_files = [f for f in os.listdir(video_folder) if f.endswith('.mp4')]

# Create a ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    # Use executor to run upload_video in parallel for all video files
    executor.map(upload_video, video_files)
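
The upload code above only creates indexing tasks; a video becomes usable once its task reports the 'ready' status that upload_video checks for. Here is a minimal polling sketch for waiting on that status. The helper wait_until_ready and its fetch_status argument are hypothetical names of my own: fetch_status stands for any zero-argument function you write that returns the current status string of one task, and how you obtain that status from the API is left open here.

```python
import time

def wait_until_ready(fetch_status, poll_interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until it returns 'ready' or the timeout elapses.

    fetch_status is a hypothetical zero-argument callable that returns the
    current status string of one indexing task (for example 'pending',
    'validating' or 'ready').
    Returns True if 'ready' was seen, False if the timeout was reached first.
    """
    waited = 0.0
    while True:
        if fetch_status() == "ready":
            return True
        if waited >= timeout:
            return False
        sleep(poll_interval)
        waited += poll_interval

# Simulated status sequence, so the sketch runs without calling the API
statuses = iter(["validating", "pending", "ready"])
print(wait_until_ready(lambda: next(statuses), poll_interval=0.01, timeout=1.0))
```

Passing the sleep function as a parameter keeps the helper easy to test, since a test can swap in a function that records the waits instead of actually pausing.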

Retrieve the unique identifier of the video

Now let's enumerate all the videos in our index. This allows us to retain the video ID of a specific video, the goal being to extract all the text embedded within it. Furthermore, akin to our methods in prior tutorials, I'm assembling a list of video IDs and their respective titles, designed to be subsequently fed into our Flask application.

# List all the videos in an index
default_header = {
    "x-api-key": API_KEY
}
INDEXES_VIDEOS_URL = f"{API_URL}/indexes/{INDEX_ID}/videos"
response = requests.get(INDEXES_VIDEOS_URL, headers=default_header)

response_json = response.json()
pprint(response_json)

# Keep only the video ID and file name of each video for the Flask app
video_id_name_list = [{'video_id': video['_id'], 'video_name': video['metadata']['filename']} for video in response_json['data']]
pprint(video_id_name_list)


{'data': [{'_id': '###a917186daab572f349243',
           'created_at': '2023-04-27T14:18:48Z',
           'metadata': {'duration': 1300.173875,
                        'engine_id': 'marengo2.5',
                        'filename': 'A Brief History of Film.mp4',
                        'fps': 23.976023976023978,
                        'height': 720,
                        'size': 188214297,
                        'width': 1280},
           'updated_at': '2023-04-27T14:20:11Z'},
          {'_id': '###3da86daab572f349241',
           'created_at': '2023-04-27T13:08:19Z',
           'metadata': {'duration': 550.7,
                        'engine_id': 'marengo2.5',
                        'filename': 'GPT - Explained!.mp4',
                        'fps': 30,
                        'height': 720,
                        'size': 22838593,
                        'width': 1152},
           'updated_at': '2023-04-27T13:08:42Z'}],
 'page_info': {'limit_per_page': 10,
               'page': 1,
               'total_duration': 5402.873875,
               'total_page': 1,
               'total_results': 3}}

[{'video_id': '###a849b86daab572f349242',
  'video_name': 'A Brief History of Film.mp4'},
 {'video_id': '###a73da86daab572f349241', 'video_name': 'GPT - Explained!.mp4'}]
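
Rather than copying a video ID by hand from the printed list, you could look it up by file name. The sketch below assumes a list shaped like video_id_name_list; the helper name find_video_id and the placeholder IDs are mine, purely for illustration.

```python
def find_video_id(video_list, video_name):
    """Return the video_id whose video_name matches, or None if absent."""
    for video in video_list:
        if video["video_name"] == video_name:
            return video["video_id"]
    return None

# Sample data in the same shape as video_id_name_list (IDs are placeholders)
sample_videos = [
    {"video_id": "id-film", "video_name": "A Brief History of Film.mp4"},
    {"video_id": "id-gpt", "video_name": "GPT - Explained!.mp4"},
]
print(find_video_id(sample_videos, "GPT - Explained!.mp4"))
```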

Extract the text that appears on the screen

Time to put our plan into action! We'll now proceed to extract all textual content from the chosen video:

VIDEO_ID = '###a849b86daab572f349242'
TEXT_IN_VIDEO_URL = f"{API_URL}/indexes/{INDEX_ID}/videos/{VIDEO_ID}/text-in-video"

response = requests.get(TEXT_IN_VIDEO_URL, headers=default_header)
print(f"Status code: {response.status_code}")
ocr_data = response.json()
pprint(ocr_data)


Status code: 200
{'data': [{'end': 3, 'start': 1, 'value': 'Film Thought Project'},
          {'end': 6, 'start': 5, 'value': 'Film'},
          {'end': 22,
           'start': 18,
           'value': "'L'arrivée d'un train en gare de La Ciotat"},
          {'end': 28, 'start': 18, 'value': 'Year:'},
          {'end': 28, 'start': 23, 'value': '2015'},
          {'end': 28, 'start': 23, 'value': 'Production Co.'},
          {'end': 28, 'start': 23, 'value': 'Alejandro G. Iñárritu'},
          {'end': 28, 'start': 23, 'value': 'Regency Enterprises'},
          {'end': 28, 'start': 23, 'value': "'The Revenant'"},
          {'end': 30, 'start': 29, 'value': "Let's"},
          {'end': 40, 'start': 32, 'value': 'Film:'},
          {'end': 34, 'start': 33, 'value': 'Film Thought Project'},
          {'end': 40, 'start': 35, 'value': 'Director:'},
          {'end': 40, 'start': 35, 'value': 'Production Co.'},
          {'end': 40, 'start': 36, 'value': 'Alfred Hitchcock'},
          {'end': 40, 'start': 36, 'value': '1958'},
          {'end': 40, 'start': 36, 'value': 'Alfred J. Hitchcock Productions'},
          {'end': 40, 'start': 37, 'value': 'Year:'},
          {'end': 40, 'start': 38, 'value': "'Vertigo'"},
          {'end': 45, 'start': 44, 'value': 'PRESS START'},
          {'end': 46, 'start': 45, 'value': '2020'},
          {'end': 47, 'start': 46, 'value': '2018'},
          {'end': 48, 'start': 47, 'value': '1975'},
          {'end': 53, 'start': 49, 'value': '1870s'},
          {'end': 61, 'start': 67, 'value': 'Eadweard Muybridge'},
          {'end': 69, 'start': 75, 'value': 'See you soon'}],
 'id': '###a849b86daab572f349242',
 'index_id': '###a73aa8b1dd6cde172a933'}

As you can see, the API extracted the on-screen text line by line, each entry carrying the start and end time (in seconds) during which it is visible. You can save this text as metadata for downstream workflows such as filtering, classifying and searching content.
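
To make the "downstream workflows" idea concrete, here is a small sketch that treats the OCR entries as local metadata. It assumes entries shaped like the items under 'data' in the response above (start, end, value); the function names find_entries and text_on_screen_at are hypothetical helpers of my own, not part of the API.

```python
def find_entries(ocr_entries, keyword):
    """Return OCR entries whose text contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [e for e in ocr_entries if needle in e["value"].casefold()]

def text_on_screen_at(ocr_entries, second):
    """Return the text values whose [start, end] interval covers `second`."""
    return [e["value"] for e in ocr_entries if e["start"] <= second <= e["end"]]

# A few entries copied from the response shown above
sample_entries = [
    {"start": 1, "end": 3, "value": "Film Thought Project"},
    {"start": 36, "end": 40, "value": "Alfred Hitchcock"},
    {"start": 38, "end": 40, "value": "'Vertigo'"},
]

print(find_entries(sample_entries, "hitchcock"))
print(text_on_screen_at(sample_entries, 39))
```

The first helper is a plain substring filter, and the second answers "what text is visible at this second?", which is handy when tagging or classifying segments.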

Text-in-video Search - searching for specific text within all indexed videos

Launching our search query utilizing the text_in_video search option to uncover pertinent text matches within our collection of indexed videos:

# Construct the URL of the `/search` endpoint
SEARCH_URL = f"{API_URL}/search/"

# Declare a dictionary named `data`
data = {
    "index_id": INDEX_ID,
    "query": "horse",
    "search_options": [
        "text_in_video"
    ]
}

# Make a search request
response = requests.post(SEARCH_URL, headers=default_header, json=data)
if response.status_code == 200:
    print(f"Status code: {response.status_code} - Success")
else:
    print(f"Status code: {response.status_code}")
search_data = response.json()
pprint(search_data)


Status code: 200 - Success
{'data': [{'confidence': 'high',
           'end': 64,
           'metadata': [{'text': 'THE HORSE IN MOTION.',
                         'type': 'text_in_video'}],
           'score': 92.28,
           'start': 63,
           'video_id': '###a849b86daab572f349242'},
          {'confidence': 'high',
           'end': 91,
           'metadata': [{'text': 'THE HORSE IN MOTION.',
                         'type': 'text_in_video'}],
           'score': 92.28,
           'start': 88,
           'video_id': '###a849b86daab572f349242'}],
 'page_info': {'limit_per_page': 10,
               'page_expired_at': '2023-05-12T00:03:43Z',
               'total_results': 2},
 'search_pool': {'index_id': '###a73aa8b1dd6cde172a933',
                 'total_count': 3,
                 'total_duration': 5403}}

💡Bear in mind that the text-in-video search feature locates all occurrences within the indexed videos where the input query aligns (not necessarily word-for-word) with the text visually presented on screen as the video plays. For instance, if I enter "horse moving", the system will identify instances where the on-screen text reads "horse in motion". However, the confidence level of this match will be lower than when I input "horse in motion". The confidence level depends on the percentage of words matched with the natural language query we input. For example, a two out of three-word match will yield a higher confidence level than a match with only one word.

A peek at Twelve Labs Playground's text-in-video search results for a given query
The specific video instance aligning with the input query being played
The model's confidence increases as soon as the query aligns with the on-screen text
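
To build some intuition for the "percentage of words matched" idea in the note above, here is a toy sketch that computes the fraction of query words found verbatim in the on-screen text. This is only an illustration under my own assumptions and is not the platform's scoring: the real search also matches text that is not word-for-word identical, which this literal comparison cannot do. The function name word_overlap is hypothetical.

```python
import re

def word_overlap(query, on_screen_text):
    """Fraction of query words that also appear in the on-screen text."""
    query_words = re.findall(r"\w+", query.lower())
    text_words = set(re.findall(r"\w+", on_screen_text.lower()))
    if not query_words:
        return 0.0
    matched = sum(1 for word in query_words if word in text_words)
    return matched / len(query_words)

print(word_overlap("horse in motion", "THE HORSE IN MOTION."))
print(word_overlap("horse moving", "THE HORSE IN MOTION."))
```

With this literal measure, every word of "horse in motion" is found on screen, while only one of the two words in "horse moving" is, which mirrors the ordering of confidence levels described in the note.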

Preparing the data for the Flask application to ensure our results will be presented neatly:

# Group the search results by the video each result belongs to
video_search_dict = {}
for d in search_data['data']:
    vd = {'start': d['start'], 'end': d['end'], 'confidence': d['confidence'], 'text': d['metadata'][0]['text']}
    video_search_dict.setdefault(d['video_id'], []).append(vd)

pprint(video_search_dict)



{'###a849b86daab572f349242': [{'confidence': 'high',
                               'end': 64,
                               'start': 63,
                               'text': 'THE HORSE IN MOTION.'},
                              {'confidence': 'high',
                               'end': 91,
                               'start': 88,
                               'text': 'THE HORSE IN MOTION.'}]}

Further data preparation for the video OCR results, followed by our standard procedure of pickling everything:

video_id = ocr_data.get('id')
data_list = ocr_data.get('data')

data_to_save = {
    'video_id': video_id,
    'data_list': data_list,
    'video_id_name_list': video_id_name_list,
    'video_search_dict': video_search_dict
}

import pickle

# Save data to a pickle file
with open('data.pkl', 'wb') as f:
    pickle.dump(data_to_save, f)

Building the Demo App

We're now at the final leg of our video OCR adventure - bringing together all elements to animate our results. Besides the standard configuration we implement for fetching videos from the local folder and loading the pickled data dispatched from the Jupyter notebook, this time we have some additional requirements - a conversion of timestamps from a seconds-only format to a minutes-and-seconds format. This makes the data visualization on the webpage more intuitive. Here's the code for the file:

from flask import Flask, render_template, send_from_directory
import pickle
import os
from collections import defaultdict

app = Flask(__name__)

# Load data from a pickle file
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

# Access the data
video_id = loaded_data['video_id']
data_list = loaded_data['data_list']
video_id_name_list = loaded_data['video_id_name_list']
video_search_dict = loaded_data['video_search_dict']

VIDEO_DIRECTORY = os.path.join(os.path.dirname(os.path.realpath(__file__)), "static")

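# Serve a video file from the static folder (the URL pattern here is an assumption)
@app.route('/<path:filename>')
def serve_video(filename):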
    print(VIDEO_DIRECTORY, filename)
    return send_from_directory(directory=VIDEO_DIRECTORY, path=filename)

@app.route('/')
def home():
    for item in data_list:
        if ":" not in str(item['start']):
            item['start'] = int(item['start'])
            item['start'] = f"{item['start'] // 60}:{item['start'] % 60:02}"
        if ":" not in str(item['end']):
            item['end'] = int(item['end'])
            item['end'] = f"{item['end'] // 60}:{item['end'] % 60:02}"

    video_id_name_dict = {video['video_id']: video['video_name'] for video in video_id_name_list}
    # video_name = video_id_name_dict.get(video_id)
    return render_template('index.html', data=data_list[:10], video_id_name_dict=video_id_name_dict, video_id=video_id, video_search_dict = video_search_dict)

if __name__ == '__main__':
    app.run()
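
The timestamp conversion inside home() can also be factored into a pair of small helpers, which makes the round trip between seconds and minutes-and-seconds easy to test in isolation. This is a sketch only; format_timestamp and parse_timestamp are names I'm introducing here and they do not appear in the app code above.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to a 'minutes:seconds' string, e.g. 67 -> '1:07'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02}"

def parse_timestamp(stamp):
    """Convert a 'minutes:seconds' string back to a number of seconds."""
    minutes, secs = stamp.split(":")
    return int(minutes) * 60 + int(secs)

print(format_timestamp(67))
print(parse_timestamp("1:07"))
```

parse_timestamp mirrors what a player needs before seeking, since video playback positions are expressed in seconds.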

HTML Template

Now, it's time to craft the final piece: our Jinja2-based HTML template code. This utilizes all the data we've passed in from the Flask file. Our first task is to display the Video OCR results. The video player will cover the entire duration of the video, and beneath it, a table will display the start time, end time, and text found on screen during that interval. For clarity, the timestamps are presented in a minutes-and-seconds format, and the start times are clickable, enabling us to jump to that point in the video. Note that the JavaScript function playVideo receives the timestamp in minutes-and-seconds format and converts it back to seconds, because the video element's currentTime property expects a value in seconds.

<!DOCTYPE html>
<html>
<head>
    <link rel="shortcut icon" href="#" />
    <title>Video OCR</title>
    <style>
        body {
            text-align: center;
            font-family: Arial, sans-serif;
            color: #333;
            background-color: #f5f5f5;
        }
        h1, h2 {
            color: #444;
        }
        table {
            margin: 0 auto;
            border-collapse: collapse;
            width: 80%;
            margin-top: 20px;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: center;
        }
        th {
            padding-top: 12px;
            padding-bottom: 12px;
            text-decoration: underline;
            color: black;
        }
        video {
            width: 40%;
            height: auto;
            margin-top: 20px;
        }

        /* search style */
        .video-container {
            text-align: center;
            margin-bottom: 2em;
            padding: 1em;
            background-color: #fff;
            border: 1px solid #ddd;
            border-radius: 4px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        table {
            margin: 0 auto;
            margin-bottom: 1em;
        }
        th, td {
            padding: 0.5em;
            border: 1px solid #ddd;
        }
    </style>
    <script>
        // Convert a "minutes:seconds" string to seconds, seek to it and start playback
        function playVideo(timeString) {
            var timeParts = timeString.split(":");
            var time = parseInt(timeParts[0]) * 60 + parseInt(timeParts[1]);
            var video = document.querySelector('#mainVideo');
            video.currentTime = time;
            video.play();
        }
    </script>
</head>
<body>
    <h1>Video OCR</h1>
    <h3>Video file: <i>{{ video_id_name_dict[video_id]}}</i></h3>
    <video id="mainVideo" controls>
        <source src="{{ url_for('static', filename=video_id_name_dict[video_id]|string) }}" type="video/mp4">
        Your browser does not support the video tag.
    </video>
    <br /> <br /> <br />
    <table>
        {# Column labels are assumed; they follow the fields rendered in each row #}
        <tr>
            <th>Start</th>
            <th>End</th>
            <th>Text</th>
        </tr>
        {% for item in data %}
        <tr>
            <td><a href="javascript:void(0)" onclick="playVideo('{{ item['start'] }}')">{{ item['start'] }}</a></td>
            <td>{{ item['end'] }}</td>
            <td>{{ item['value'] }}</td>
        </tr>
        {% endfor %}
    </table>
    <br /> <br />
    {% for video_id, results in video_search_dict.items() %}
    <div class="video-container">
        <h1>Text-in-video Search Results</h1>  
        <h2>Video file: <i>{{ video_id_name_dict[video_id] }}</i></h2>
        <h2>Entered query: <i>{{input_query}}</i></h2>
        {% for result in results %}
        <video controls preload="metadata" style="width: 40%;">
            <source src="{{ url_for('static', filename=video_id_name_dict[video_id]) }}#t={{ result['start'] }},{{ result['end'] }}" type="video/mp4">
            Your browser does not support the video tag.
        </video>
        <table>
            {# Column labels are assumed; they follow the fields rendered in each row #}
            <tr>
                <th>Start</th>
                <th>End</th>
                <th>Confidence</th>
                <th>Text</th>
            </tr>
            <tr>
                <td>{{ result['start'] }}</td>
                <td>{{ result['end'] }}</td>
                <td>{{ result['confidence'] }}</td>
                <td>{{ result['text'] }}</td>
            </tr>
        </table>
        {% endfor %}
    </div>
    {% endfor %}
</body>
</html>

Running the Flask app

Awesome! Let’s just run the last cell of our Jupyter notebook to launch our Flask app:


You should see an output similar to the one below, confirming that everything went as anticipated 😊:

After clicking on the URL link, you should be greeted with the following web page:

Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial -


Anticipate more thrilling content on the horizon! If you haven't already, I warmly invite you to become part of our lively Discord community, teeming with individuals who share a fervor for multimodal AI.

See you next time,


Crafting stellar Developer Experiences @Twelve Labs

