This blog post is co-authored with Rob Gonsalves - Engineering Fellow at Avid
‍
Quickly and easily finding the perfect content in vast media libraries is crucial in the world of media production. Traditionally, this meant manually tagging media assets with keywords, but this method is limited in accuracy, scalability, and its grasp of context. AI-powered semantic search analyzes the content itself to understand the context, meaning, and relationships between media assets, so users can find relevant content based on its semantic meaning, not just keywords.
Thanks to rapid advancements in multimodal AI, semantic search is now a reality in media production. Foundation models help our smart machines understand media content. They power semantic search engines that index media assets, making them searchable based on their semantic content.
Semantic search plays a huge role in media production. It helps media professionals find exactly what they need quickly, saving time and sparking new ideas for content repurposing and creative storytelling. Plus, it can uncover hidden gems that may have been overlooked due to inadequate manual tagging.
In this post, we'll explore the awesome applications and benefits of semantic search in post-production, the key technologies powering it, how it integrates with media asset management systems, and where it's headed in the future.
‍
The transition from metadata-based searches to semantic searches represents a significant advancement in media production workflows.
Avid’s MediaCentral | Production Management and MediaCentral | Asset Management systems have successfully enabled teams of up to hundreds of users to effectively log and search metadata for many years. This has included using AI services from cloud providers to enrich metadata with automated tagging, speech-to-text transcription, optical character recognition, and more, generating more searchable data.
These traditional metadata-based searches rely on manually extracted information or predefined taxonomies, which, while highly effective, can limit the ability to find truly relevant content.
Metadata-based searches traditionally have several limitations:
‍
In contrast, semantic search leverages state-of-the-art foundation models to understand the actual meaning and context behind the content. By analyzing the visual elements, spoken words, and other data within the media assets, semantic search engines can comprehend the underlying concepts and relationships, rather than relying solely on predefined keywords or taxonomies.
The semantic search process is depicted above:
Twelve Labs offers a powerful Semantic Video Search solution that simultaneously integrates all modalities inside video and captures the complex relationships among them to deliver a more nuanced, humanlike interpretation. That results in much faster and far more accurate video search and retrieval from cloud object storage. Instead of time-consuming and ineffective manual tagging, video editors can use natural language to quickly and accurately search vast media archives to unearth video moments and hidden gems that otherwise might go unnoticed.
The accuracy and efficiency of semantic search are particularly valuable in media production environments, where vast libraries of text, audio, video, and image assets need to be searched and retrieved quickly. By understanding the true meaning and context of the content, semantic search engines can deliver highly relevant results, even when the user's query does not match the exact keywords or metadata associated with the media assets.
‍
OpenAI's Contrastive Language-Image Pre-training (CLIP) model is at the heart of modern semantic search capabilities. CLIP is a neural network that learns to encode both images and text into a shared embedding space. By training on a massive dataset of image-text pairs, CLIP develops the ability to associate visual concepts with their linguistic representations.
The CLIP model consists of two main components: a visual encoder and a text encoder. The visual encoder, typically a Vision Transformer (ViT), analyzes the image and generates a visual embedding. Simultaneously, the text encoder, a Transformer-based language model, encodes the text input into a textual embedding. These embeddings are then compared, and the model learns to align the visual and textual representations, enabling cross-modal retrieval and understanding. You can see how this works in the diagram below.
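As a rough illustration of how the two encoders are pulled into alignment, the sketch below computes a symmetric contrastive loss over a small batch of matched image and text embeddings in plain Python. The function names, the toy two-dimensional vectors, and the temperature value are assumptions chosen for illustration; this is a simplified stand-in, not CLIP's actual training code.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own caption, and each caption its own image.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: hypothetical embeddings, not real model outputs.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
# Matched pairs yield a much smaller loss than shuffled pairs.
```

Minimizing a loss of this shape rewards the encoders for placing an image and its description close together in the shared space while pushing mismatched pairs apart.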
For example, if the user searches for “youth hockey coach,” CLIP encodes this text and compares it to embeddings from the media library to find matches. The system ranks video clips by relevance. The highest-scoring video closely aligns with the search, demonstrating CLIP's ability to understand and retrieve content semantically.
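The ranking step described above can be sketched with cosine similarity over precomputed embeddings. The clip names and three-dimensional vectors below are made-up placeholders; in a real system the vectors would come from the model's text and visual encoders and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_clips(query_embedding, clip_embeddings, top_k=3):
    """Return the top_k (clip_id, score) pairs, highest similarity first."""
    scored = [
        (clip_id, cosine_similarity(query_embedding, embedding))
        for clip_id, embedding in clip_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical library: toy vectors standing in for real clip embeddings.
library = {
    "hockey_practice.mp4": [0.9, 0.1, 0.0],
    "swim_meet.mp4": [0.1, 0.9, 0.1],
    "city_traffic.mp4": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of "youth hockey coach"
results = rank_clips(query, library, top_k=2)
```

With these toy values the hockey clip ranks first because its vector points in nearly the same direction as the query vector.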
‍
Building upon CLIP's success, researchers have developed advanced models to enhance semantic search capabilities across different media formats and languages. One notable extension is Multilingual CLIP, which extends the original CLIP text encoder to support multiple languages. By leveraging techniques like cross-lingual teacher learning, Multilingual CLIP enables cross-lingual search and retrieval, allowing users to query media content using text in various languages.
Another significant development is LAION's CLAP (Contrastive Language-Audio Pretraining) model, which brings audio encoding into the multimodal framework. CLAP learns to encode audio waveforms and text into a shared embedding space, so audio can be searched with natural-language descriptions and used alongside visual embeddings for a more comprehensive semantic understanding of multimedia content.
‍
Twelve Labs' Marengo-2.6 model provides advanced video encoding and retrieval capabilities for video search applications. As a state-of-the-art video foundation model, Marengo-2.6 extracts semantic features from video content, allowing users to search for and retrieve relevant video clips based on text queries or reference videos.
Marengo-2.6's expanded capabilities allow for any-to-any (cross-modality) retrieval tasks, making it a versatile tool for a wide range of applications. These include text-to-video, text-to-image, text-to-audio, audio-to-video, and image-to-video tasks, bridging different media types. Watch the webinar session below for a qualitative demonstration of these capabilities:
These multimodal models work together to enhance media search capabilities across different formats and languages. CLIP and its extensions, such as Multilingual CLIP and CLAP, encode images, text, and audio into searchable embeddings. These embeddings are then stored in embedding databases, enabling efficient retrieval and matching based on semantic similarity. For video content, Marengo-2.6 leverages self-supervised learning with contrastive loss to embed and search video clips based on text queries or reference videos.
By combining these technologies, users can perform semantic searches across vast media libraries, finding relevant content based on their intent and the contextual meaning of their queries.
‍
Semantic search introduces transformative benefits and applications in post-production tasks. By using advanced foundation models mentioned above, media professionals can easily locate specific clips and images through descriptive queries. For instance, a producer might search for "intense soccer match under rain at night," and the system will retrieve video clips that visually match this description without relying on precise tags.
AI-based systems can provide enhanced analytics and insights through the utilization of clustering and semantic mapping. Semantic search can analyze video frames and cluster them into meaningful groups, allowing editors to quickly find scenes of interest or discover thematic patterns across large datasets. For example, semantic embeddings can be used to plot a 2-dimensional semantic map of video clips, providing a visual representation of content relationships and thematic consistencies. You can see an example of this in the image below.
The image shows a representation of CLIP video frame embeddings from a sports highlight reel reduced to two dimensions. You can see how similar frames in the reel are grouped together by semantic similarity, like the swimming shots in groups 9, 15, and 12.
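Grouping frames by semantic similarity, as in the map described above, can be approximated with a clustering algorithm. Below is a minimal k-means sketch in plain Python, assuming frame embeddings are already available as lists of floats. The two-dimensional points are toy data, and initializing centroids from the first k points is a simplification for determinism; it is not necessarily the method used to produce the figure.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical frame embeddings: two visually distinct kinds of shot.
frames = [[0.9, 0.1], [0.1, 0.9], [0.85, 0.2], [0.15, 0.8]]
labels, centroids = kmeans(frames, k=2)
```

In practice the embeddings would be high-dimensional, and a dimensionality-reduction step would be applied before plotting them on a 2D map.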
Extending semantic search capabilities to include spoken phrases and ambient sounds enriches the scope of search in audiovisual content. The integration of media embedding models like Marengo from Twelve Labs and LAION’s CLAP enhances the ability to search video and audio content by semantic similarity, not just text match, allowing users to find media that contains specific looks and sounds, such as bustling cityscapes or serene nature scenes.
‍
Semantic search extends beyond simple retrieval to provide comprehensive insights and analytics. This capability is exemplified by the potential to create interactive displays from semantic embeddings, enabling producers and editors to derive in-depth analytics from media content. For instance, using media embedding models, users can visually explore how different themes are represented across a media library, identify trends, and predict future content preferences.
Furthermore, semantic search can drastically enhance the process of metadata management in media libraries. Typically, metadata is manually tagged, which is labor-intensive and prone to inconsistencies. By automatically generating rich, descriptive metadata from the content, semantic search tools can ensure that every asset is uniformly described, making it far easier to retrieve and analyze. This automated metadata enrichment process leverages the deep learning capabilities of media embedding models to interpret complex media content, including the mood, themes, and key visual elements, thus providing a richer dataset for further analysis and utilization.
These insights are valuable in understanding the existing content and guiding the creation of new media that aligns with audience interests and ongoing trends. The ability to analyze semantic relationships and cultural contexts within media libraries opens up possibilities for predictive analytics and targeted content recommendations.
Integrating semantic search technologies into existing media asset management (MAM) systems can significantly enhance the efficiency and effectiveness of media libraries. This integration facilitates more intelligent search capabilities, which can understand the content and context of media files, thereby improving the accessibility and discoverability of assets.
Integration of semantic search into MAM systems also facilitates better archival and retrieval processes, crucial in post-production workflows. For example, when editors need to pull content from archives that span decades, semantic search can quickly filter through various formats and eras to find content that matches current production needs without manual browsing. This capability speeds up the retrieval process and ensures valuable archival footage is more accessible, promoting its reuse and maximizing the value of existing assets. This represents a significant shift from traditional keyword-based systems, which often require extensive manual input and upkeep to remain effective.
Moreover, semantic search can provide context-aware recommendations based on the user's current project or past searches. This feature speeds up the workflow and inspires new creative ideas by exposing editors to potentially relevant content they might not have considered.
Avid has demonstrated research in this area through various proofs of concept at major trade show events such as NAB and IBC. These have included a Recommendation Engine in the web-based application MediaCentral | Cloud UX, where journalists are offered media related to the script they are writing or to the audio in a voiceover on a timeline. The system not only offers suggestions based on a literal analysis of the text; it also generates related sentences or phrases based on the context of the script to offer further suggestions.
Avid is continuing to implement AI-enabled technologies within a range of products, under the banner of Avid Ada – an overarching framework for AI across its portfolio.
Twelve Labs has integrated with multiple MAM providers to bring video understanding to their users. A notable example is our partnership with Vidispine - An Arvato Systems Brand. We first worked together on behalf of a joint client in the sports industry to improve its video browsing experience. The joint solution enables easier navigation through video content and uncovers previously undetectable elements, such as specific moves or player conversations. It quickly became clear that the integration had the potential to go even further.
Integrating Twelve Labs’ video-language foundation models in the intuitive user interface of Vidispine’s MediaPortal changes the way users can search for material as it eliminates the need to index all static metadata fields in the core service VidiCore. Vidispine users can now find exact moments within their videos using natural language queries and combine them with metadata from Vidispine applications.
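One way to picture combining natural language queries with existing metadata is a two-stage search: filter on structured fields first, then rank the survivors by semantic similarity. The sketch below is a generic illustration under that assumption. The asset records, field names, and `hybrid_search` function are hypothetical and do not represent the Vidispine or Twelve Labs APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hybrid_search(query_embedding, assets, filters, top_k=5):
    """Keep assets whose metadata matches every filter, then rank by similarity."""
    candidates = [
        asset for asset in assets
        if all(asset["metadata"].get(field) == value for field, value in filters.items())
    ]
    candidates.sort(
        key=lambda asset: cosine_similarity(query_embedding, asset["embedding"]),
        reverse=True,
    )
    return [asset["id"] for asset in candidates[:top_k]]

# Hypothetical assets with toy two-dimensional embeddings.
assets = [
    {"id": "match_2019_goal", "metadata": {"sport": "soccer", "year": 2019}, "embedding": [0.9, 0.1]},
    {"id": "match_2019_interview", "metadata": {"sport": "soccer", "year": 2019}, "embedding": [0.2, 0.9]},
    {"id": "match_2021_goal", "metadata": {"sport": "soccer", "year": 2021}, "embedding": [0.95, 0.05]},
]
query = [1.0, 0.0]  # stand-in for the embedding of a query such as "goal celebration"
hits = hybrid_search(query, assets, filters={"year": 2019})
```

Filtering before ranking keeps the semantic comparison limited to assets that already satisfy the structured constraints, which is one common design choice among several.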
‍
While semantic search technologies have made significant strides in recent years, several challenges remain in their implementation and widespread adoption in the media production industry.
‍
One of the primary challenges is the significant computational power and resources required to effectively process and analyze large volumes of multimedia data. Generating high-quality semantic embeddings and performing complex contextual understanding demands substantial computational resources, including powerful hardware accelerators (GPUs) and ample storage capacity. As media libraries continue to grow exponentially, the computational demands will only increase, necessitating the development of more efficient algorithms and hardware acceleration techniques to make semantic search scalable and practical.
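To give a feel for the storage side of this, the back-of-the-envelope calculation below estimates the raw size of an embedding index. Every number in it (library size, segments per clip, embedding dimensions, 4-byte floats) is an illustrative assumption rather than a figure from any specific product.

```python
def embedding_storage_bytes(num_clips, segments_per_clip, dimensions, bytes_per_value=4):
    """Raw bytes needed to store one embedding vector per indexed segment."""
    return num_clips * segments_per_clip * dimensions * bytes_per_value

# Assumed scenario: one million clips, 20 indexed segments each,
# 1024-dimensional embeddings stored as 32-bit floats.
total_bytes = embedding_storage_bytes(
    num_clips=1_000_000, segments_per_clip=20, dimensions=1024
)
total_gib = total_bytes / 1024**3
print(f"{total_gib:.1f} GiB")  # prints "76.3 GiB"
```

This counts only the vectors themselves. Index structures, replicas, and the compute needed to generate the embeddings in the first place come on top of it.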
Although current language and vision foundation models have made remarkable progress in understanding context, there is still room for improvement in capturing nuanced meaning, handling ambiguity, and accounting for real-world knowledge. Developing more sophisticated multimodal foundation models that can better grasp the intricate context and relationships within multimedia content will be crucial for enhancing the relevance and accuracy of search results.
‍Seamlessly integrating and fusing diverse modalities, such as text, images, video, and audio, into a unified semantic search framework presents technical challenges. Advancing methods to align and combine these heterogeneous data sources will be important for delivering comprehensive, cross-modal search capabilities that can effectively leverage the complementary information present in different modalities.
Despite these challenges, the future of semantic search in media production holds immense potential and promises to revolutionize the way media professionals search, discover, and utilize content.
The ongoing development of multimodal foundation models, which aim to capture and fuse information across various modalities, could pave the way for more sophisticated semantic search engines. These models (such as Marengo and Pegasus from Twelve Labs), trained on massive multimodal datasets, have the potential to uncover intricate relationships and patterns across different data types, enabling more accurate and comprehensive search capabilities.
Moreover, integrating other forms of production data, such as knowledge graphs, scripts, and transcripts, into a semantic search system can significantly enhance its capabilities. Knowledge graphs can provide a structured representation of relationships between various entities, enriching the search process with contextual information. Scripts and transcripts offer a detailed textual account of media content, allowing the search engine to index and retrieve specific dialogues, scenes, and narrative elements. By leveraging these diverse data sources, semantic search systems can deliver more precise and contextually relevant results, ultimately improving the efficiency of content discovery and utilization in media production.
‍Furthermore, the incorporation of personalized semantic search, which tailors search results based on user preferences and past behavior, could enhance the relevance and utility of search results in media production environments. By understanding the specific needs and contexts of individual users, personalized semantic search can surface the most pertinent content, facilitating more efficient and effective content discovery and utilization.
‍
Semantic search is definitely the new cool kid on the block in the world of news, broadcast and, of course, post-production. It's all about harnessing the power of advanced AI techniques to understand the deeper meaning and context of media assets. Forget the old-school keyword-based search methods: this is a transformative approach that is revolutionizing how we manage and use media in production workflows.
Think about models like OpenAI’s CLIP, innovations like Multilingual CLIP, LAION’s CLAP, and Twelve Labs’ Marengo, and the continuous advancements from Avid. These are just a few examples of how fast things are moving in this field. They're making the search process more intuitive, helping media professionals find content that matches their creative vision with unprecedented precision and speed. With the amount of digital media out there, being able to quickly find what you need will become increasingly vital.
The journey of semantic search is still unfolding, with each new development adding a whole new level of sophistication and capability. By embracing semantic search, we're making things more efficient and boosting our creative processes, giving content creators a whole new way to tell their stories.
‍
Semantic search technologies are integral to the future of media production. For media professionals, adopting these innovations is crucial.
Twelve Labs' semantic video search solution stands at the forefront of this revolution. Our video understanding platform seamlessly integrates with existing media asset management systems, empowering their users to navigate vast video libraries with unprecedented ease. Check out our recent integrations with Vidispine, Blackbird, EMAM, Nomad, and Cinesys.
Over the past few years, Avid has conducted research on the use of AI for media production, including semantic media search. They developed Avid Ada, a digital assistant that helps make workflows more efficient. In addition to feeding the results of their research into their product roadmaps, Avid also publishes its findings and shares them with the media industry.