The Multimodal Evolution of Vector Embeddings

Implementing deep learning models has become an increasingly important machine learning strategy for companies looking to build data-driven products. To build and power these models, companies collect hundreds of millions of terabytes of multimodal data and feed it into them. As a result, embeddings (deep learning models’ internal representations of their input data) are quickly becoming a critical component of building machine learning systems.

For example, they make up a significant part of Spotify’s item recommender systems, YouTube’s video recommendations, and Pinterest’s visual search. Even when they are not explicitly presented to the user through recommendation system UIs, embeddings are used internally at places like Netflix to make content decisions about which shows to develop based on the popularity of user preferences.

The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper. The Transformer architecture builds and expands on the concepts in Word2Vec: its self-attention mechanism is a much more specialized way of calculating context around a given word. It has become the de facto way to learn representations of growing multimodal vocabularies, and its rise in popularity in both academia and industry has made embeddings a staple of deep learning workflows.

However, the concept of embeddings can be elusive because they are neither data flow inputs nor output results - they are intermediate elements that live within machine learning services to refine models. So it is helpful to define them explicitly from the beginning.

1 - What Are Embeddings?

A dense embedding is a vector that distributes information related to a concept with multiple elements, indicating that elements can be tuned separately to allow more concepts to be encoded efficiently in a relatively low-dimensional space. Such representations can be compared to symbolic representations, such as one-hot encoding, which uses an element with a value of one to indicate the presence of a concept locally and values of zero for other elements.
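
To make the contrast concrete, here is a minimal sketch comparing a one-hot encoding with a dense embedding. The vocabulary and the vector values are made up purely for illustration; real embeddings are learned and have many more dimensions.

```python
# Toy vocabulary; the words and vector values are assumptions for illustration.
vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Symbolic (local) representation: a single 1 marks the concept."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Dense (distributed) representation: every element carries some information.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Distinct one-hot vectors are always orthogonal, so they carry no notion of similarity.
print(dot(one_hot("cat"), one_hot("dog")))  # 0.0

# Dense vectors can place related concepts closer together than unrelated ones.
print(dot(dense["cat"], dense["dog"]) > dot(dense["cat"], dense["car"]))  # True
```

Note that the dense vectors here use two dimensions for three concepts, while the one-hot vectors need one dimension per concept.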

In deep learning, the term “embedding” often refers to a mapping from a one-hot vector representing a word or image category to a distributed representation of real-valued numbers. More specifically, the process of embedding does three things:

Transforms multimodal input into representations that are easier to perform intensive computation on, in the form of vectors, tensors, or graphs.
Compresses input information for an ML task, such as summarizing a blog post or performing a semantic search on a large video corpus. The process of compression changes variable feature dimensions into fixed inputs, allowing them to be passed efficiently into downstream components of machine learning systems.
Creates an embedding space that is specific to the data the embeddings were trained on but that, in the case of deep learning representations, can also generalize to other tasks and domains through transfer learning.
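
The second point, turning variable-length input into a fixed-size representation, can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional `embedding_table` and its values are hypothetical, and mean pooling is just one simple way to get a fixed-size vector (learned models typically use more expressive pooling).

```python
# Hypothetical 3-dimensional embedding table; values are illustrative only.
embedding_table = {
    "video": [0.2, 0.7, 0.1],
    "search": [0.6, 0.1, 0.3],
    "engine": [0.5, 0.2, 0.3],
}
DIM = 3

def embed(tokens):
    """Map a variable-length token list to one fixed-size vector by mean pooling."""
    vectors = [embedding_table[t] for t in tokens if t in embedding_table]
    if not vectors:
        # Unknown or empty input falls back to the zero vector.
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

short = embed(["video"])
longer = embed(["video", "search", "engine"])

# Inputs of different lengths produce vectors of the same size.
print(len(short), len(longer))  # 3 3
```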

Creating an embedding for a word, image, or video to represent an artifact in multidimensional space offers us many possibilities. For instance, in tasks that concentrate on content understanding in a video recommendation system, we are often interested in comparing two given items to assess their similarity. We can perform this task with mathematical precision by transforming videos into vectors and comparing video frames in a shared embedding space.

2 - Unimodal Embeddings

2.1 - Language Representations

Text embeddings are a type of representation for text data that maps words or phrases to real-valued vectors in a lower-dimensional space. These vectors capture the meaning and context of the text and enable a variety of natural language processing (NLP) tasks. Their applications include search engines, product recommendations, social media content moderation, email spam filtering, and customer support chatbots.

Text embeddings can be generated using a neural network that extracts high-level features from the text and converts them into a fixed-size vector (e.g., 1,500 dimensions). These embeddings can be used to compare and analyze text data, allowing for tasks such as text classification, text retrieval, and text summarization.

In the past, recurrent neural networks like long short-term memory (LSTM) or gated recurrent unit (GRU) language models incorporated information from all past words stored in a fixed-length recurrent vector when predicting a current word. Other commonly used methods for generating word embeddings include the continuous bag-of-words model, skip-grams, and global vectors (GloVe).

Since 2017, Transformers have been heavily used to learn text embeddings. The Vanilla Transformer is a model originally proposed for NLP and uses a self-attention mechanism to achieve state-of-the-art results on various NLP tasks. Many derivative models have been proposed following the success of the Vanilla Transformer, such as BERT, BART, GPT, Longformer, Transformer-XL, and XLNet. A pre-trained Transformer model can be a powerful text embedding generator.

2.2 - Visual Representations

Image embeddings are a way to represent images as real-valued vectors in a lower-dimensional space. These vectors capture the visual content of the image, which can be used for various computer vision tasks, such as image search, object detection, facial recognition, and content-based image retrieval.

We can obtain image embeddings using the output values from the final layers of image classification models such as AlexNet, VGGNet, GoogLeNet, and ResNet. These models won or placed near the top of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) between 2012 and 2015: AlexNet won the classification task in 2012, GoogLeNet in 2014 (with VGGNet as runner-up), and ResNet in 2015. Image embeddings can be used to compare and analyze image data, enabling tasks such as image classification, image retrieval, and image similarity search.

Alternatively, more direct features can be used as visual embeddings, such as convolutional features and associated class labels from selected regions identified by object detection models. Models using this approach include the region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN.

Transformers are currently the most popular tool for NLP, and researchers are now exploring how they can be used in other areas, such as visual domains. One such area is the use of a Vision Transformer (ViT), which applies the encoder of a Transformer to images. ViT and its variations have been successfully used for various computer vision tasks, including low-level tasks, recognition, detection, and segmentation. They work well for both supervised and self-supervised visual learning. Recent studies have also provided further understanding of ViT, such as its robustness in internal representation and the continuous behavior of its latent representation propagation. A pre-trained ViT model can be a powerful image embedding generator.

2.3 - Audio Representations

Audio embeddings are numerical representations of audio signals that capture the acoustic content of the audio in a compact and meaningful way. These vectors aim to capture the semantic and contextual information of audio signals. They have many applications in audio processing, especially for tasks such as audio classification, audio retrieval, speaker recognition, and music recommendation.

We can use pre-trained models, such as VGGish or SoundNet, which have been trained on large-scale audio datasets like AudioSet or UrbanSound, to generate these audio embeddings. These models typically start from the raw waveform or from low-level features, such as spectrograms or mel-frequency cepstral coefficients (MFCCs), and encode them into higher-level embeddings.

Similar to the language and visual domain, we can also leverage the Transformer architecture to generate these audio embeddings. Some examples include PaSST, Audio Transformer, CTAL, SSAST, and Audio Spectrogram Transformer.

3 - Multimodal Embeddings

Although significant advancements have been made in representing vision, language, or speech, it is theoretically insufficient to model a complete set of human concepts using only one modality. For instance, the idea of a "beautiful picture" is grounded in visual representation, making it hard to describe through natural language or other non-visual ways. That's why it's crucial to learn joint embeddings that use multiple modalities to represent such concepts better. Generally speaking, the field of Multimodal AI looks at building AI systems that can extract embeddings from multimodal data.

3.1 - A Quick Note on Multimodal AI

Multimodal AI has been a significant research area in recent decades. The world we live in is a multimodal environment, and both our observations and behaviors are multimodal. For example, an AI navigation robot requires multimodal sensors to perceive the real-world environment. These can include cameras, LiDAR, radar, ultrasonic sensors, GNSS, HD maps, and odometers. Additionally, human behaviors, emotions, events, actions, and humor are also multimodal. As a result, various human-centered Multimodal AI tasks are widely studied, including multimodal emotion recognition, multimodal event representation, understanding multimodal humor, face-body-voice-based video person-clustering, and more.

Thanks to the advancements in internet technology and the proliferation of intelligent devices, an ever-growing volume of multimodal data is transmitted over the web. This has given rise to a plethora of multimodal application scenarios. In today's world, we can observe a broad range of such applications, including commercial services (like e-commerce retrieval, vision-language navigation, and audio-visual navigation), communication methods (such as lip-reading and sign language translation), human-computer interaction, healthcare AI, and surveillance AI.

In the era of Deep Learning, Multimodal AI has made significant progress thanks to deep neural networks. Among the most competitive architectures are the Transformers, which offer new challenges and opportunities to Multimodal AI. Recent successes with large language models and their multimodal derivatives, such as Frozen, VL-Adapter, Flamingo, BEiT, and PaLI, show that Transformers have great potential for creating foundation models for Multimodal AI.

3.2 - Multimodal Pre-Training

In 2021, CLIP was proposed as a new milestone. It uses multimodal pre-training to recast classification as a retrieval task, which enables pre-trained models to tackle zero-shot recognition. CLIP thus shows how large-scale multimodal pre-training can enable zero-shot learning, and it has become a major breakthrough for many multimodal tasks. Recently, the idea behind CLIP has been further studied in other works, such as CLIP-based zero-shot semantic segmentation, ALIGN, MAD, ALBEF, and CoCa.
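
The idea of treating classification as retrieval can be sketched in a few lines. This is not CLIP's actual code or API: the embeddings below are hand-written toy values standing in for the outputs of an image encoder and a text encoder, and the prompts are hypothetical. The point is only that the predicted class is the text prompt whose embedding lies closest to the image embedding.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(vec)))
        for label, vec in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins for text-encoder outputs, one per candidate class prompt.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy stand-in for an image-encoder output.
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(label)  # a photo of a dog
```

Because the classes are just text prompts, new classes can be added at inference time without retraining, which is what makes the approach zero-shot.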

3.3 - Trends in Multimodal Big Data

In recent years, the internet has developed rapidly with new applications like social media and online retail. This has led to many new multimodal datasets being created. Some of the most well-known ones include Conceptual Captions, COCO, VQA, Visual Genome, SBU Captions, Cooking312K, LAIT, e-SNLI-VE, ARCH, Adversarial VQA, OTT-QA, MultiModalQA, VALUE, Fashion IQ, LRS2-BBC, ActivityNet, and VisDial.

The new trends among these datasets are:

The datasets are becoming larger in scale. Some of the new ones released are million-scale, like Product1M, Conceptual12M, RUC-CAS-WenLan (30M), HowToVQA69M, HowTo100M, ALT200M, LAION-400M, and LAION-5B.
There are more modalities. In addition to vision, text, and audio, new diverse modalities are emerging, like Pano-AVQA (the first large-scale spatial and audio-visual question-answering dataset on 360-degree videos), YouTube-360 (YT-360) (360-degree videos), AIST++ (a new multimodal dataset of 3D dance motion and music), and ArtEmis (affective language for visual arts). MultiBench goes further and covers ten modalities.
There are more scenarios. In addition to common caption and QA datasets, more applications and scenarios have been studied, like CIRR (real-life images), Bed and Breakfast (BnB) (vision-and-language navigation), M3A (financial dataset), and X-World (autonomous driving).
The tasks are more difficult. Beyond the straightforward tasks, more abstract multimodal tasks are proposed, like MultiMET (a multimodal dataset for metaphor understanding) and Hateful Memes (hate speech in multimodal memes).
Instructional videos are becoming more popular, like cooking videos with YouCook2. Aligning a sequence of instructions to a video of someone carrying out a task is an example of a powerful pre-training pretext task (as shown in What’s Cookin’?). Pretext tasks are pre-designed problems to force the models to learn representation by solving them.

Transformers, like other deep neural network architectures, require a lot of data. The combination of high-capacity models and multimodal big data therefore underpins the success of Transformer-based multimodal machine learning. For instance, big data brings zero-shot learning capability to vision-language pre-training (VLP) Transformer models.

4 - Video Representations

Many people are turning to video as a way to present information. It is an excellent medium because it combines images, motion, and sound, making it engaging and informative. However, working with video data can be challenging because it involves processing images, language, and audio all at once (in other words, video is multimodal). Additionally, there are unique challenges to the video domain, such as dealing with a high number of dimensions and modeling motion dynamics.

Video embeddings are numerical representations of videos that capture their visual, linguistic, and audio content in a compact and meaningful way. Similar to how word embeddings represent words as dense vectors in natural language processing, video embeddings aim to capture the semantic and contextual information of videos.

Video embeddings have various applications in computer vision and multimedia analysis. They can be used for tasks such as video classification, video retrieval, video summarization, and video recommendation. For instance, by representing videos as embeddings, it becomes easier to compare and search for similar videos based on their content, rather than relying solely on metadata or textual descriptions.

The Transformer architecture is versatile and can be used to model various data types. Recently, many Video Transformer works have adapted the Transformer to model videos. Before being input into the Transformer, a video must go through some processing: tokenization, embedding, and positioning. For example, the figure above is from the Video Vision Transformer (ViViT) paper. The authors consider two simple methods for mapping a video to a sequence of tokens (uniform frame sampling and tubelet embedding). They then add the positional embedding and reshape the tokens to obtain the final input to the Transformer.

It's important to note that the specific strategies for tokenization, embedding, and positioning can vary depending on the specific video modeling task and the available data.
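
As one concrete instance of the first step, uniform frame sampling can be sketched as picking evenly spaced frame indices from a video. The function name and the choice of taking the centre of each segment are assumptions made for this illustration, not the method of any particular paper.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video."""
    if num_samples <= 0 or num_frames <= 0:
        return []
    if num_samples >= num_frames:
        # Not enough frames to subsample: keep them all.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the centre of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Each sampled frame would then be embedded (for example, by splitting it into patches) before positional information is added.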

4.1 - Video Tokenization Techniques

Tokenization is the process of breaking down a video into smaller units called tokens, which can be used as input to a Transformer model. Here are some common tokenization techniques for videos:

1 - Frame-level tokenization

In this technique, each frame of the video is treated as a token. The frames are sequentially encoded and fed into the Transformer model. This approach allows the model to capture temporal information at the frame level.

2 - Clip-level tokenization

Instead of treating each frame as a token, this technique groups consecutive frames into clips. Each clip is then tokenized and processed as a single unit. This approach reduces the number of tokens and can be more efficient for longer videos.

3 - Co-tokenization

This technique involves jointly tokenizing both the video and its corresponding text description. The video and text tokens are then fused together to create a joint representation that captures both visual and textual information. This approach has been shown to be effective for video question-answering tasks.

4 - Object-level tokenization

In this technique, objects or regions of interest in the video are detected and segmented, and each object or region is treated as a separate token. This approach can be useful for tasks like object detection or action recognition.
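
The difference between frame-level and clip-level tokenization can be sketched with plain lists. Here integers stand in for decoded frames, and the helper names are hypothetical; in a real model each token would be a tensor that is later projected into an embedding.

```python
def frame_tokens(frames):
    """Frame-level tokenization: one token per frame."""
    return [[f] for f in frames]

def clip_tokens(frames, clip_len):
    """Clip-level tokenization: group consecutive frames into non-overlapping clips."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

frames = list(range(10))  # stand-ins for 10 decoded frames

print(len(frame_tokens(frames)))  # 10
print(clip_tokens(frames, 4))     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Grouping by clips cuts the sequence length from 10 tokens to 3 in this example, which matters because the cost of self-attention grows with the square of the number of tokens.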

4.2 - Video Embedding Strategies

Different embedding strategies used in video models can have varying effects on their performance. Here are some examples:

1 - Image-based embedding

This strategy involves using pre-trained image models, such as Inception-V3 or ResNet, to extract visual features from each frame of the video. These features are then transformed into embeddings using techniques like pooling or fully connected layers. Image-based embeddings can capture detailed visual information and are effective for tasks like object detection or scene understanding.

2 - Audio-based embedding

In addition to visual features, audio features can also be extracted from the video. Techniques like spectrogram analysis or audio embeddings can be used to convert the audio signals into embeddings. Audio-based embeddings can capture sound-related information and are useful for tasks like audio-visual synchronization or sound event detection.

3 - Joint audio-visual embedding

This strategy combines both visual and audio features to create joint embeddings. The visual and audio features are typically fused at some point in the model architecture, allowing the model to capture the correlations between the two modalities. Joint embeddings can be beneficial for tasks like video captioning or video retrieval.

4 - Temporal embedding

Temporal embedding strategies focus on capturing the temporal dynamics of the video. This can be done by incorporating temporal information, such as optical flow or motion vectors, into the embedding process. Temporal embeddings can help the model understand the motion and temporal relationships between different frames or clips in the video.

5 - Spatial embedding

Spatial embedding strategies aim to capture the spatial relationships between different regions of interest in the video. This can be achieved by using techniques like region-based embeddings or attention mechanisms that attend to specific spatial locations. Spatial embeddings are useful for tasks like object tracking or action recognition.

4.3 - The Role of Positional Embeddings

Self-attention is the essential operation of the Transformer. It enhances each token embedding with information from other embeddings in a sequence. Since self-attention operates on sets, it is crucial to indicate positional information to utilize the spatiotemporal structure of videos. This is achieved using positional embeddings (PE), which encode the temporal position information of video frames or clips within the Transformer.

Some popular positioning methods for video data in Transformer models include:

1 - Learned positional embedding

This approach learns the positional embeddings during the training of the Transformer model: each position of a frame or clip in the video sequence gets its own trainable vector. This method allows the model to capture the temporal order of the video data.

2 - Fixed positional encoding

In this method, fixed positional encodings are added to the video embeddings. These encodings provide information about the absolute positions of the frames or clips in the video sequence. Fixed positional encodings can be based on sine and cosine functions or other predefined patterns.

3 - Relative positional encoding

This approach incorporates relative position information between frames or clips in the video sequence. It takes into account the distance or time gap between consecutive frames or clips. Relative positional encodings can be added as additional inputs to the Transformer model or incorporated into the self-attention mechanism.

4 - Hybrid approaches

Some methods combine different types of positional encodings to capture both the absolute and relative positions of the video data. This can involve using a combination of fixed and learned positional encodings or incorporating relative position information into the self-attention mechanism.
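
As an illustration of the fixed and relative variants, the sketch below builds sine/cosine encodings in the style of the original Transformer and a matrix of signed offsets between positions. The function names are assumptions for this example, and the relative part only shows the offsets that such schemes typically use to look up a learned bias; it does not implement a full relative attention layer.

```python
import math

def sinusoidal_positional_encoding(num_positions, dim):
    """Fixed sine/cosine encodings: one dim-sized vector per position."""
    encodings = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encodings.append(row)
    return encodings

def add_positions(token_embeddings, encodings):
    """Element-wise sum of token embeddings and their positional encodings."""
    return [
        [t + p for t, p in zip(tok, enc)]
        for tok, enc in zip(token_embeddings, encodings)
    ]

def relative_offsets(num_positions):
    """Signed offsets j - i between every pair of positions."""
    return [[j - i for j in range(num_positions)] for i in range(num_positions)]

encodings = sinusoidal_positional_encoding(3, 4)
print(encodings[0])         # [0.0, 1.0, 0.0, 1.0]
print(relative_offsets(3))  # [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]
```

A learned positional embedding would replace `sinusoidal_positional_encoding` with a trainable table of the same shape.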

5 - Embeddings In Production

Systems based on embeddings (such as foundation models) can be computationally expensive to build and maintain. The need to create, store, and manage embeddings has also recently given rise to an entire ecosystem of related products, notably vector databases, which facilitate production-ready nearest neighbor semantic queries in machine learning systems.

5.1 - Embeddings In Practice

In a production environment, it is possible to train your own embedding model to gain insight into the model's internals if you have access to a lot of GPUs. Alternatively, you can use pre-trained embeddings and adapt them to your specific use cases. Many companies today are using embeddings in both of these contexts. Notably, YouTube was one of the first large companies to publicly share their work on video embeddings in the context of a production recommender system with "Deep Neural Networks for YouTube Recommendations."

As YouTube has over 800 million pieces of content (videos) and 2.6 billion active users, the application needs to recommend existing content to users while also being able to generalize to new content, which is frequently uploaded. These recommendations need to be served quickly at inference time—when the user loads a new page—with low latency.

YouTube shares how they created a two-stage recommender system for videos based on two deep learning models. The machine learning task is to predict the correct next video to show the user at a given time in YouTube recommendations so that they click. The final output is formulated as a classification problem: given a user's input features and the input features of a video, can we predict a class for the user that includes the predicted watch time for the user for a specific video with a specific probability?

To develop the model, they use two sets of embeddings as input data: (1) embeddings that represent the user and their context as features, and (2) embeddings that represent the video items. The model has many features, both tabular and embedding-based. They include:

User watch history - a vector that shows which videos the user has watched, represented as a sparse video ID mapped to a dense vector.
User's search history - shows which videos the user clicked on after a search term. This is also represented as a sparse vector mapped to the same space as the user watch history.
User's geography, age, and gender - represented as tabular features.
The number of previous impressions a video had, normalized per user over time.

All of these features are combined into one item embedding. For the user, all the embeddings are blended into one user embedding. These embeddings are then fed into the model's softmax layers. The softmax layers compare the output of the layer (i.e., the probability that the user will interact with an item) to a set of ground truth items. The ground truth items are a set of items that the user has already interacted with. The unnormalized log probability (logit) of an item is the dot product of two n-dimensional vectors: the query and item embeddings.
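
The scoring step can be sketched as a dot product followed by a softmax. This is a toy illustration, not YouTube's implementation: the user and video embeddings are made-up 3-dimensional values, and a production system would score a large candidate set with approximate search rather than a full softmax.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical user (query) and video (item) embeddings.
user_embedding = [0.5, 1.0, -0.5]
video_embeddings = {
    "video_a": [0.4, 0.9, -0.3],
    "video_b": [-0.5, 0.1, 0.8],
    "video_c": [0.1, 0.2, 0.1],
}

# Logit of each video = dot product of the user and video embeddings.
logits = {vid: dot(user_embedding, emb) for vid, emb in video_embeddings.items()}
probs = dict(zip(logits, softmax(list(logits.values()))))

best = max(probs, key=probs.get)
print(best)  # video_a
```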

5.2 - Embeddings as an Engineering Problem

A production-level ML system using embeddings has many moving components: generating embeddings, storing embeddings, detecting concept drift, optimizing inference and latency, evaluating the system offline and online, etc. Let’s look at the first two stages since they directly deal with embeddings.

Embeddings Generation

We have observed that embeddings are typically generated as a byproduct of training neural network models: they are taken from the penultimate layer, which precedes the final output layer used for classification or regression. There are two approaches to producing these embeddings. We can train our own models, as YouTube has done.

However, one of the major advantages of deep learning models is that we can also use pre-trained models. A pre-trained model is any model similar to the one we are considering for our task that has already been trained on vast amounts of training data and can be used for downstream tasks by fine-tuning.

When we fine-tune a model, we follow the same steps as training from scratch. We have training data, a model, and a loss function to minimize. However, some differences exist. We create a new model by copying the existing pre-trained model, except for the final output layer, which we replace with a new layer sized for our new task. During training, we initialize the parameters of this new output layer randomly and train them from scratch, while the parameters of the earlier layers start from their pre-trained values and are only adjusted slightly so that they focus on the new task. In this way, we can refocus the model without having to train on a gigantic amount of data.

The generated embeddings can be further improved by reducing their dimensionality (using techniques like PCA, or t-SNE when the goal is mainly visualization) and indexing them using an appropriate data structure to facilitate fast and efficient retrieval during search or query processing (more below).

Embeddings Retrieval

After we finish training the model, we have to extract the embeddings from it. The trained model gives us a data structure that holds everything about the model, like weights, biases, layers, and learning rate. The embeddings are one of the layers in this model object and initially live in memory. We include them as part of the model object when we serialize the model to disk, and we load them back into memory when we retrain or run inference.
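
The round trip from memory to disk and back can be sketched as follows. This is a minimal illustration that stores a hypothetical table of embeddings on its own as JSON; in practice the embeddings are usually saved together with the rest of the model in the training framework's own checkpoint format.

```python
import json
import os
import tempfile

# Hypothetical embeddings keyed by item ID; values are illustrative only.
embeddings = {
    "video_1": [0.1, 0.2, 0.3],
    "video_2": [0.4, 0.5, 0.6],
}

def save_embeddings(path, table):
    """Serialize the embedding table to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(table, f)

def load_embeddings(path):
    """Load the embedding table back into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.json")
    save_embeddings(path, embeddings)
    restored = load_embeddings(path)

print(restored == embeddings)  # True
```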

There are different things we can do with embeddings when building a model. The easiest way to store embeddings is by using an in-memory numpy array. But, for more advanced tasks, we can:

Get them in batches or one by one during inference.
Analyze the quality of the embeddings offline.
Transform the embeddings to create new features.
Update the embeddings with new models.
Keep track of different versions of the embeddings.
Convert new documents into embeddings.

The most complex and customizable software that can handle most of these tasks is called a vector database. However, there are simpler options such as vector search plugins for databases like Postgres and SQLite, and caches like Redis.

When we work with embeddings, the most important thing we do is called vector search. This helps us find embeddings that are similar to the one we have, which is useful for finding similar items. To do vector searches, we need a way to search through our data structures efficiently and compare them to find the most similar ones.
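
In its simplest form, vector search is an exhaustive scan: score every stored vector against the query and keep the best k. The sketch below does exactly that with cosine similarity over a toy in-memory index; the item IDs and vectors are assumptions for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, index, k):
    """Exact search: score every vector in the index and return the k best IDs."""
    scored = ((cosine_similarity(query, vec), item_id) for item_id, vec in index.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Toy in-memory index of item ID -> embedding.
index = {
    "v1": [1.0, 0.0],
    "v2": [0.9, 0.1],
    "v3": [0.0, 1.0],
    "v4": [-1.0, 0.0],
}

print(top_k([1.0, 0.0], index, 2))  # ['v1', 'v2']
```

This scan touches every vector for every query, which is why larger collections need dedicated index structures.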

Relational databases use a B-tree structure to keep items sorted within a hierarchy of nodes, which makes them faster to read. But we can't look up our vectors quickly using columns, so we need different structures for them. One example is inverted indices, which many vector stores use to do vector searches effectively.

A general-form embedding store has three main components: the embeddings themselves, an index that maps them back to words, pictures, or text, and a way to compare the similarity between different types of embeddings using various nearest neighbor algorithms.

One common measure for comparing embeddings is cosine similarity, but computing it exhaustively becomes slow and inefficient when comparing millions of vectors. To solve this, approximate nearest neighbor (ANN) algorithms were developed; they group vectors into neighborhoods and search only a small part of the collection to find a vector's k-nearest neighbors. Two of the most widely used options are HNSW (hierarchical navigable small world graphs), an algorithm, and FAISS, a library that implements several ANN indexes. Both are available in standalone libraries and are also part of many existing vector stores.

The trade-off between exact and approximate nearest neighbor search is that the latter is less precise but much faster. When evaluating precision and recall, it is important to weigh this trade-off and determine how accurate our retrieval needs to be and how much inference latency we can tolerate.
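
One common way to quantify this trade-off is recall@k: the share of the true top-k neighbors (from an exact search) that the approximate search also returned. The helper below is a minimal sketch, and the two result lists are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)

# Hypothetical results for one query: exact search vs. an ANN index.
exact = ["v7", "v2", "v9", "v4"]
approx = ["v7", "v9", "v1", "v2"]

print(recall_at_k(approx, exact, 4))  # 0.75
```

Averaging this value over a set of test queries, alongside the measured query latency, gives a simple basis for tuning an ANN index.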

6 - Conclusion

We have now walked through the multimodal evolution of vector embeddings and their role in various applications.

We started with a brief definition of embeddings. We then walked through various unimodal embeddings that can represent text, images, and audio.
Embeddings have become even more important in the modern explosion of multimodal representations of data and the versatility of Transformer models. We dived a bit into video representations, namely strategies around embeddings and tokenization.
Finally, we have understood the engineering context of working with embeddings in production, including generating and storing embeddings.

The evolution of vector embeddings has opened up new possibilities for understanding and analyzing complex data, particularly in the realm of video data. As the field continues to develop and mature, we can expect to see even more exciting applications and innovations in the future.

At Twelve Labs, we are developing foundation models for multimodal video understanding. In other words, we extract video embeddings from raw video data and power various downstream video understanding tasks, such as video search and video classification. Our goal is to help developers build programs that can see, listen, and understand the world as we do with the most advanced video-understanding infrastructure.

If you would like to learn more, please sign up at https://playground.twelvelabs.io/ and join our Multimodal Minds Discord community to chat about all things Multimodal AI!

Implementing deep learning models has become an increasingly important machine learning strategy for companies looking to build data-driven products. In order to build and power deep learning models, companies collect and feed hundreds of millions of terabytes of multimodal data into deep learning models. As a result, embeddings — deep learning models’ internal representations of their input data — are quickly becoming a critical component of building machine learning systems.

For example, they make up a significant part of Spotify’s item recommender systems, YouTube video recommendations of what to watch, and Pinterest’s visual search. Even if not explicitly presented to the user through recommendation system UIs, embeddings are used internally at places like Netflix to make content decisions around which shows to develop based on user preference popularity.

The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper. Building and expanding on the concepts in Word2Vec, the Transformer architecture, with its self-attention mechanism, a much more specialized case of calculating context around a given word, has become the de-facto way to learn representations of growing multimodal vocabularies, and its rise in popularity both in academia and in the industry has caused embeddings to become a staple of deep learning workflows.

However, the concept of embeddings can be elusive because they are neither data flow inputs nor output results - they are intermediate elements that live within machine learning services to refine models. So it is helpful to define them explicitly from the beginning.

1 - What Are Embeddings?

A dense embedding is a vector that distributes information related to a concept across multiple elements. Because the elements can be tuned separately, more concepts can be encoded efficiently in a relatively low-dimensional space. Such representations can be contrasted with symbolic representations, such as one-hot encoding, which uses a single element with a value of one to indicate the presence of a concept locally and values of zero for all other elements.

In deep learning, the term “embedding” often refers to a mapping from a one-hot vector representing a word or image category to a distributed representation of real-valued numbers. More specifically, the process of embedding includes three steps:

Transform multimodal input into representations that are easier to perform intensive computation on, in the form of vectors, tensors, or graphs.
Compress input information for an ML task, such as summarizing a blog post or performing a semantic search on a large video corpus. The process of compression changes variable feature dimensions into fixed-size inputs, allowing them to be passed efficiently into downstream components of machine learning systems.
Create an embedding space that is specific to the data the embeddings were trained on but that, in the case of deep learning representations, can also generalize to other tasks and domains through transfer learning.
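
To make the mapping from a one-hot vector to a distributed representation concrete, here is a minimal sketch. The three-word vocabulary and the values in `embedding_table` are made up for illustration; in a real model the table is learned during training.

```python
# Toy vocabulary and embedding table (hypothetical values, not learned).
vocab = ["cat", "dog", "car"]
embedding_table = [
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # car
]


def one_hot(word):
    """Symbolic representation: a single 1 marks the word, all other elements are 0."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed(word):
    """Dense representation: the one-hot vector multiplied by the embedding table."""
    oh = one_hot(word)
    dim = len(embedding_table[0])
    return [
        sum(oh[i] * embedding_table[i][j] for i in range(len(vocab)))
        for j in range(dim)
    ]


print(one_hot("dog"))  # [0.0, 1.0, 0.0]
print(embed("dog"))    # [0.8, 0.2]
```

Multiplying a one-hot vector by the table simply selects one row, which is why embedding layers are usually implemented as a row lookup rather than an actual matrix product.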

Creating an embedding for a word, image, or video to represent an artifact in multidimensional space offers us many possibilities. For instance, in tasks that concentrate on content understanding in a video recommendation system, we are often interested in comparing two given items to assess their similarity. We can perform this task with mathematical precision by transforming videos into vectors and comparing video frames in a shared embedding space.
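
As a sketch of what comparing two items in a shared embedding space can look like, the snippet below uses cosine similarity, one common choice of similarity measure. The three video vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical video embeddings; real ones would come from a trained model.
video_a = [0.9, 0.1, 0.3]
video_b = [0.8, 0.2, 0.4]
video_c = [-0.5, 0.9, -0.2]

print(cosine_similarity(video_a, video_b))  # close to 1: similar content
print(cosine_similarity(video_a, video_c))  # negative: dissimilar content
```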

2 - Unimodal Embeddings

2.1 - Language Representations

Text embeddings are representations of text data that map words or phrases to real-valued vectors in a lower-dimensional space. These vectors capture the meaning and context of the text and enable a variety of natural language processing (NLP) tasks. They have many NLP applications, including search engines, product recommendations, social media content moderation, email spam filtering, and customer support chatbots.

Text embeddings can be generated using a neural network that extracts high-level features from the text and converts them into a fixed-size vector (e.g., 1,500 dimensions). These embeddings can be used to compare and analyze text data, allowing for tasks such as text classification, text retrieval, and text summarization.
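
The key property here is that texts of any length end up as vectors of the same size. The sketch below illustrates this with mean pooling over per-token vectors. `token_vector` is a hypothetical, deterministic stand-in for a learned model, and `DIM` is kept tiny for readability.

```python
DIM = 4  # illustrative only; real text embeddings have hundreds of dimensions or more


def token_vector(token):
    """Hypothetical stand-in for a learned per-token vector (not trained)."""
    code = sum(ord(c) for c in token)
    return [((code * (j + 1)) % 17) / 17.0 for j in range(DIM)]


def embed_text(text):
    """Mean-pool the token vectors into one fixed-size vector."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("cannot embed empty text")
    vectors = [token_vector(t) for t in tokens]
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(DIM)]


short = embed_text("a short sentence")
long = embed_text("a much longer sentence with many more words in it")
print(len(short), len(long))  # both equal DIM
```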

In the past, recurrent neural networks like long short-term memory (LSTM) or gated recurrent unit (GRU) language models incorporated information from all past words stored in a fixed-length recurrent vector when predicting a current word. Other commonly used methods for generating word embeddings include the continuous bag-of-words model, skip-grams, and global vectors (GloVe).
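
For intuition about how the continuous bag-of-words and skip-gram models see text, the sketch below builds their training examples from a token sequence. It covers only the data preparation step, not the training itself, and the `window` size is an arbitrary choice.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Continuous bag-of-words: predict the center word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


sentence = ["videos", "are", "multimodal"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```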

Since 2017, Transformers have been heavily used to learn text embeddings. The Vanilla Transformer is a model originally proposed for NLP and uses a self-attention mechanism to achieve state-of-the-art results on various NLP tasks. Many derivative models have been proposed following the success of the Vanilla Transformer, such as BERT, BART, GPT, Longformer, Transformer-XL, and XLNet. A pre-trained Transformer model can be a powerful text embedding generator.
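
The self-attention mechanism can be sketched in a few lines. This is a deliberately simplified single-head version that uses the token vectors directly as queries, keys, and values; a real Transformer applies learned projection matrices first, which are omitted here as a simplifying assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual = self_attention(tokens)
print(contextual)
```

Each output vector mixes in information from the other tokens, which is what makes the resulting embeddings context-specific.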

2.2 - Visual Representations

Image embeddings are a way to represent images as real-valued vectors in a lower-dimensional space. These vectors capture the visual content of the image, which can be used for various computer vision tasks, such as image search, object detection, facial recognition, and content-based image retrieval.

We can obtain image embeddings using the output values from the final layers of image classification models (such as AlexNet, VGGNet, GoogLeNet, and ResNet). AlexNet, GoogLeNet, and ResNet won the image classification task of the ImageNet Large Scale Visual Recognition Challenge in 2012, 2014, and 2015, respectively, and VGGNet was the runner-up in 2014. Image embeddings can be used to compare and analyze image data, enabling tasks such as image classification, image retrieval, and image similarity search.

Alternatively, more direct features can be used as visual embeddings, such as convolutional features and associated class labels from selected regions identified by object detection models. Models using this approach include the region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN.

Transformers are currently the most popular tool for NLP, and researchers are now exploring how they can be used in other areas, such as visual domains. One such area is the use of a Vision Transformer (ViT), which applies the encoder of a Transformer to images. ViT and its variations have been successfully used for various computer vision tasks, including low-level tasks, recognition, detection, and segmentation. They work well for both supervised and self-supervised visual learning. Recent studies have also provided further understanding of ViT, such as its robustness in internal representation and the continuous behavior of its latent representation propagation. A pre-trained ViT model can be a powerful image embedding generator.
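
To apply a Transformer encoder to an image, ViT first turns the image into a sequence of tokens by cutting it into fixed-size patches and flattening each one. The sketch below shows only that step on a toy 4x4 grayscale image; the learned linear projection and positional embeddings that follow in ViT are omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping, flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
print(len(patches))  # 4 patches
print(patches[0])    # [0, 1, 4, 5]
```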

2.3 - Audio Representations

Audio embeddings are numerical representations of audio signals that capture the acoustic content of the audio in a compact and meaningful way. These vectors aim to capture the semantic and contextual information of audio signals. They have many applications in audio processing, especially for tasks such as audio classification, audio retrieval, speaker recognition, and music recommendation.

We can use pre-trained models, such as VGGish or SoundNet, which have been trained on large-scale audio datasets like AudioSet or UrbanSound, to generate these audio embeddings. These models extract features from the audio signals, often starting from representations such as spectrograms or mel-frequency cepstral coefficients (MFCCs), and encode them into embeddings.

Similar to the language and visual domains, we can also leverage the Transformer architecture to generate these audio embeddings. Some examples include PaSST, Audio Transformer, CTAL, SSAST, and Audio Spectrogram Transformer.

3 - Multimodal Embeddings

Although significant advancements have been made in representing vision, language, or speech, it is theoretically insufficient to model a complete set of human concepts using only one modality. For instance, the idea of a "beautiful picture" is grounded in visual representation, making it hard to describe through natural language or other non-visual ways. That's why it's crucial to learn joint embeddings that use multiple modalities to represent such concepts better. Generally speaking, the field of Multimodal AI looks at building AI systems that can extract embeddings from multimodal data.

3.1 - A Quick Note on Multimodal AI

Multimodal AI has been a significant research area in recent decades. The world we live in is a multimodal environment, and both our observations and behaviors are multimodal. For example, an AI navigation robot requires multimodal sensors and data sources to perceive the real-world environment. These include cameras, LiDAR, radar, ultrasonic sensors, GNSS, HD maps, and odometers. Additionally, human behaviors, emotions, events, actions, and humor are also multimodal. As a result, various human-centered Multimodal AI tasks are widely studied, including multimodal emotion recognition, multimodal event representation, understanding multimodal humor, face-body-voice-based video person-clustering, and more.

Thanks to the advancements in internet technology and the proliferation of intelligent devices, an ever-growing volume of multimodal data is transmitted over the web. This has given rise to a plethora of multimodal application scenarios. In today's world, we can observe a broad range of such applications, including commercial services (like e-commerce retrieval, vision-language navigation, and audio-visual navigation), communication methods (such as lip-reading and sign language translation), human-computer interaction, healthcare AI, and surveillance AI.

In the era of Deep Learning, Multimodal AI has made significant progress thanks to deep neural networks. Among the most competitive architectures are Transformers, which bring both new challenges and new opportunities to Multimodal AI. Recent successes with large language models and their multimodal derivatives, such as Frozen, VL-Adapter, Flamingo, BEiT, and PaLI, show that Transformers have great potential for creating foundation models for Multimodal AI.

3.2 - Multimodal Pre-Training

In 2021, CLIP was proposed as a new milestone. It uses multimodal pre-training to convert classification into a retrieval task, which enables pre-trained models to tackle zero-shot recognition. CLIP thus shows how large-scale multimodal pre-training can enable zero-shot learning, which has become a major breakthrough for many multimodal tasks. The idea behind CLIP has since been studied further in other works, such as CLIP pre-trained model-based zero-shot semantic segmentation, ALIGN, MAD, ALBEF, and CoCa.
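
Converting classification into retrieval can be illustrated as follows: each candidate label is written as a text prompt and embedded, and the predicted class is the prompt whose embedding is most similar to the image embedding. This is a sketch of the idea only. The embeddings below are invented, whereas in CLIP they come from jointly trained image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label prompt most similar to the image, plus all scores."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(vec)))
        for label, vec in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical embeddings in a shared image-text space.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best)
```

Because the labels are just text, new classes can be added at inference time by embedding new prompts, without retraining a classification head.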

3.3 - Trends in Multimodal Big Data

In recent years, the internet has developed rapidly with new applications like social media and online retail. This has led to many new multimodal datasets being created. Some of the most well-known ones include Conceptual Captions, COCO, VQA, Visual Genome, SBU Captions, Cooking312K, LAIT, e-SNLI-VE, ARCH, Adversarial VQA, OTT-QA, MultiModalQA, VALUE, Fashion IQ, LRS2-BBC, ActivityNet, and VisDial.

The new trends among these datasets are:

The datasets are becoming larger in scale. Some of the new ones released are million-scale or larger, like Product1M, Conceptual12M, RUC-CAS-WenLan (30M), HowToVQA69M, HowTo100M, ALT200M, LAION-400M, and LAION-5B.
There are more modalities. In addition to vision, text, and audio, new diverse modalities are emerging, like Pano-AVQA (the first large-scale spatial and audio-visual question-answering dataset on 360-degree videos), YouTube-360 (YT-360) (360-degree videos), AIST++ (a new multimodal dataset of 3D dance motion and music), and ArtEmis (affective language for visual arts). Notably, MultiBench is a benchmark that spans ten modalities.
There are more scenarios. In addition to common caption and QA datasets, more applications and scenarios have been studied, like CIRR (real-life images), Bed and Breakfast (BnB) (vision-and-language navigation), M3A (financial dataset), and X-World (autonomous driving).
The tasks are more difficult. Beyond the straightforward tasks, more abstract multimodal tasks are proposed, like MultiMET (a multimodal dataset for metaphor understanding) and Hateful Memes (hate speech in multimodal memes).
Instructional videos are becoming more popular, like cooking videos with YouCook2. Aligning a sequence of instructions to a video of someone carrying out a task is an example of a powerful pre-training pretext task (as shown in What’s Cookin’?). Pretext tasks are pre-designed problems to force the models to learn representation by solving them.

Transformers, like other deep neural network architectures, require a lot of data. Their high capacity, combined with the availability of multimodal big data, has driven the progress of Transformer-based multimodal machine learning. For instance, big data brings zero-shot learning capability to vision-language pre-training (VLP) Transformer models.

4 - Video Representations

Many people are turning to video as a way to present information. It is an excellent medium because it combines images, motion, and sound, making it engaging and informative. However, working with video data can be challenging because it involves processing images, language, and audio all at once (in other words, video is multimodal). Additionally, there are unique challenges to the video domain, such as dealing with a high number of dimensions and modeling motion dynamics.

Video embeddings are numerical representations of videos that capture their visual, linguistic, and audio content in a compact and meaningful way. Similar to how word embeddings represent words as dense vectors in natural language processing, video embeddings aim to capture the semantic and contextual information of videos.

Video embeddings have various applications in computer vision and multimedia analysis. They can be used for tasks such as video classification, video retrieval, video summarization, and video recommendation. For instance, by representing videos as embeddings, it becomes easier to compare and search for similar videos based on their content, rather than relying solely on metadata or textual descriptions.

The Transformer architecture is versatile and can be used to model various data types. Recently, many Video Transformer works have adapted the Transformer to model videos. Before being input into the Transformer, a video must go through some processing: tokenization, embedding, and positioning. For example, the figure above is from the Video Vision Transformer paper. The authors consider two simple methods for mapping a video to a sequence of tokens (uniform frame sampling and tubelet embedding). They then add the positional embedding and reshape the tokens to obtain the final input to the Transformer.

It's important to note that the specific strategies for tokenization, embedding, and positioning can vary depending on the specific video modeling task and the available data.
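
The two token-mapping methods mentioned above can be sketched on a toy video stored as nested lists indexed by frame, row, and column. This is an illustrative approximation, not the paper's implementation: it shows frame sampling and the extraction of flattened spatiotemporal blocks (tubelets), and leaves out the learned projection applied to each token afterwards. The function names and the toy video are assumptions.

```python
def uniform_frame_sampling(video, num_frames):
    """Pick num_frames frames evenly spaced across the video."""
    step = len(video) / num_frames
    return [video[int(i * step)] for i in range(num_frames)]


def tubelet_tokens(video, t, h, w):
    """Cut the video into non-overlapping t x h x w blocks and flatten each one."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    tokens = []
    for t0 in range(0, frames - t + 1, t):
        for r0 in range(0, height - h + 1, h):
            for c0 in range(0, width - w + 1, w):
                tokens.append([
                    video[f][r][c]
                    for f in range(t0, t0 + t)
                    for r in range(r0, r0 + h)
                    for c in range(c0, c0 + w)
                ])
    return tokens


# Toy video: 8 frames of 4x4 pixels; each value encodes frame, row, and column.
video = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(4)] for f in range(8)]

sampled = uniform_frame_sampling(video, 4)
tokens = tubelet_tokens(video, t=2, h=2, w=2)
print(len(sampled))  # 4 frames
print(len(tokens))   # (8/2) * (4/2) * (4/2) = 16 tokens
print(tokens[0])     # [0, 1, 10, 11, 100, 101, 110, 111]
```

Unlike per-frame patches, each tubelet spans several frames, so some temporal information is already fused at tokenization time.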

4.1 - Video Tokenization Techniques

Tokenization is the process of breaking down a video into smaller units called tokens, which can be used as input to a Transformer model. Here are some common tokenization techniques for videos:

1 - Frame-level tokenization

In this technique, each frame of the video is treated as a token. The frames are sequentially encoded and fed into the Transformer model. This approach allows the model to capture temporal information at the frame level.

2 - Clip-level tokenization

Instead of treating each frame as a token, this technique groups consecutive frames into clips. Each clip is then tokenized and processed as a single unit. This approach reduces the number of tokens and can be more efficient for longer videos.

3 - Co-tokenization

This technique involves jointly tokenizing both the video and its corresponding text description. The video and text tokens are then fused together to create a joint representation that captures both visual and textual information. This approach has been shown to be effective for video question-answering tasks.

4 - Object-level tokenization

In this technique, objects or regions of interest in the video are detected and segmented, and each object or region is treated as a separate token. This approach can be useful for tasks like object detection or action recognition.
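
The difference between frame-level and clip-level tokenization is easiest to see in the number of tokens produced. In the sketch below, each frame is assumed to already be represented by a feature vector, and a clip token is formed by averaging the frames in the clip. Averaging is only one possible way to summarize a clip and is an assumption made here for simplicity.

```python
def frame_level_tokens(frame_features):
    """One token per frame."""
    return [list(f) for f in frame_features]


def clip_level_tokens(frame_features, clip_len):
    """One token per group of clip_len consecutive frames (mean of the group)."""
    tokens = []
    for start in range(0, len(frame_features), clip_len):
        clip = frame_features[start:start + clip_len]
        dim = len(clip[0])
        tokens.append([sum(f[j] for f in clip) / len(clip) for j in range(dim)])
    return tokens


# Hypothetical per-frame features for a 6-frame video.
frame_features = [[float(i), float(2 * i)] for i in range(6)]

print(len(frame_level_tokens(frame_features)))    # 6 tokens
print(clip_level_tokens(frame_features, 3))       # [[1.0, 2.0], [4.0, 8.0]]
```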

4.2 - Video Embedding Strategies

Different embedding strategies used in video models can have varying effects on their performance. Here are some examples:

1 - Image-based embedding

This strategy involves using pre-trained image models, such as Inception-V3 or ResNet, to extract visual features from each frame of the video. These features are then transformed into embeddings using techniques like pooling or fully connected layers. Image-based embeddings can capture detailed visual information and are effective for tasks like object detection or scene understanding.

2 - Audio-based embedding

In addition to visual features, audio features can also be extracted from the video. Techniques like spectrogram analysis or audio embeddings can be used to convert the audio signals into embeddings. Audio-based embeddings can capture sound-related information and are useful for tasks like audio-visual synchronization or sound event detection.

3 - Joint audio-visual embedding

This strategy combines both visual and audio features to create joint embeddings. The visual and audio features are typically fused at some point in the model architecture, allowing the model to capture the correlations between the two modalities. Joint embeddings can be beneficial for tasks like video captioning or video retrieval.

4 - Temporal embedding

Temporal embedding strategies focus on capturing the temporal dynamics of the video. This can be done by incorporating temporal information, such as optical flow or motion vectors, into the embedding process. Temporal embeddings can help the model understand the motion and temporal relationships between different frames or clips in the video.

5 - Spatial embedding

Spatial embedding strategies aim to capture the spatial relationships between different regions of interest in the video. This can be achieved by using techniques like region-based embeddings or attention mechanisms that attend to specific spatial locations. Spatial embeddings are useful for tasks like object tracking or action recognition.
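
As a rough illustration of how image-based, audio-based, and joint embeddings relate, the sketch below pools per-frame visual features and per-segment audio features into one vector each, then fuses them by concatenation. Concatenation after pooling is just one simple fusion choice and is an assumption here, as are all the feature values.

```python
import math


def mean_pool(vectors):
    """Average a list of equally sized vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def l2_normalize(v):
    """Scale to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def joint_embedding(frame_features, audio_features):
    """Pool each modality, normalize it, and concatenate the results."""
    visual = l2_normalize(mean_pool(frame_features))
    audio = l2_normalize(mean_pool(audio_features))
    return visual + audio


# Hypothetical features: 3 frames with 3 visual dims, 2 audio segments with 2 dims.
frame_features = [[0.2, 0.8, 0.1], [0.3, 0.7, 0.2], [0.1, 0.9, 0.0]]
audio_features = [[0.5, 0.5], [0.7, 0.3]]

video_embedding = joint_embedding(frame_features, audio_features)
print(len(video_embedding))  # 3 visual + 2 audio = 5 dimensions
```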

4.3 - The Role of Positional Embeddings

Self-attention is the essential operation of the Transformer. It enhances each token embedding with information from other embeddings in a sequence. Since self-attention operates on sets, it is crucial to indicate positional information to utilize the spatiotemporal structure of videos. This is achieved using positional embeddings (PE), which encode the temporal position information of video frames or clips within the Transformer.

Some popular positioning methods for video data in Transformer models include:

1 - Learned positional embedding

This approach involves learning the positional embeddings during the training of the Transformer model. The model learns a vector for each position of the frames or clips in the video sequence. This method allows the model to capture the temporal order of the video data.

2 - Fixed positional encoding

In this method, fixed positional encodings are added to the video embeddings. These encodings provide information about the absolute positions of the frames or clips in the video sequence. Fixed positional encodings can be based on sine and cosine functions or other predefined patterns.

3 - Relative positional encoding

This approach incorporates relative position information between frames or clips in the video sequence. It takes into account the distance or time gap between pairs of frames or clips rather than their absolute positions. Relative positional encodings can be added as additional inputs to the Transformer model or incorporated into the self-attention mechanism.

4 - Hybrid approaches

Some methods combine different types of positional encodings to capture both the absolute and relative positions of the video data. This can involve using a combination of fixed and learned positional encodings or incorporating relative position information into the self-attention mechanism.
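
As an example of a fixed positional encoding based on sine and cosine functions, the sketch below follows the formulation from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), and adds it to a sequence of token embeddings. Treating each position as one frame or clip is an assumption made for illustration.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed encoding for one position: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        if i + 1 < dim:
            encoding.append(math.cos(angle))
    return encoding


def add_positions(token_embeddings):
    """Add the encoding of each position to the token at that position."""
    dim = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(token, sinusoidal_position(pos, dim))]
        for pos, token in enumerate(token_embeddings)
    ]


# Three identical frame tokens become distinguishable once positions are added.
frames = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
positioned = add_positions(frames)
print(positioned[0])  # [0.0, 1.0, 0.0, 1.0]
```

A learned positional embedding would replace `sinusoidal_position` with a trainable table that has one row per position.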

5 - Embeddings In Production

Engineering systems based on embeddings (such as building foundation models) can be computationally expensive to build and maintain. The need to create, store, and manage embeddings has also recently resulted in the explosion of an entire ecosystem of related products, notably the rise of vector databases that facilitate production-ready use of nearest-neighbor semantic queries in machine learning systems.

5.1 - Embeddings In Practice

In a production environment, it is possible to train your own embedding model to gain insight into the model's internals if you have access to a lot of GPUs. Alternatively, you can use pre-trained embeddings and adapt them to your specific use cases. Many companies today are using embeddings in both of these contexts. Notably, YouTube was one of the first large companies to publicly share their work on video embeddings in the context of a production recommender system with "Deep Neural Networks for YouTube Recommendations."

As YouTube has over 800 million pieces of content (videos) and 2.6 billion active users, the application needs to recommend existing content to users while also being able to generalize to new content, which is frequently uploaded. These recommendations need to be served quickly at inference time—when the user loads a new page—with low latency.

YouTube shares how they created a two-stage recommender system for videos based on two deep learning models. The machine learning task is to predict the correct next video to show the user at a given time in YouTube recommendations so that they click. The final output is formulated as a classification problem: given a user's input features and the input features of a video, can we predict, with a specific probability, a class for the user that reflects their expected watch time for that video?

To develop the model, they use two sets of embeddings as input data: (1) embeddings that represent the user and their context as features, and (2) embeddings that represent the video items. The model has many features, including ones based on tables and embeddings. The embeddings-based features include:

User watch history - a vector that shows which videos the user has watched, represented as a sparse video ID mapped to a dense vector.
User's search history - shows which videos the user clicked on after a search term. This is also represented as a sparse vector mapped to the same space as the user watch history.
User's geography, age, and gender - shown as table features.
The number of previous impressions a video had, normalized per user over time.

All of these features are combined into one item embedding. For the user, all the embeddings are blended into one user embedding. These embeddings are then fed into the model's softmax layers. The softmax layers compare the output of the layer (i.e., the probability that the user will interact with an item) to a set of ground truth items. The ground truth items are a set of items that the user has already interacted with. Up to the softmax normalization term, the log probability of an item is the dot product of two n-dimensional vectors: the query and item embeddings.
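
The relationship between dot products and probabilities can be sketched as follows: the dot product of the user (query) embedding with each item embedding gives a score, and a softmax over all items turns the scores into probabilities. The embeddings and video IDs are invented, and this toy version scores every item exhaustively, which would not be practical at YouTube's scale.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def item_probabilities(user_embedding, item_embeddings):
    """Softmax over the dot products between the user and every item."""
    logits = {item: dot(user_embedding, vec) for item, vec in item_embeddings.items()}
    m = max(logits.values())
    exps = {item: math.exp(score - m) for item, score in logits.items()}
    total = sum(exps.values())
    return {item: e / total for item, e in exps.items()}


# Hypothetical user and video embeddings.
user_embedding = [1.0, 0.2]
item_embeddings = {
    "video_a": [0.9, 0.1],
    "video_b": [0.2, 0.8],
    "video_c": [0.5, 0.5],
}

probs = item_probabilities(user_embedding, item_embeddings)
print(max(probs, key=probs.get))  # video_a has the largest dot product
```

Since the normalization term is shared by all items, ranking items by probability is the same as ranking them by dot product.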

5.2 - Embeddings as an Engineering Problem

A production-level ML system using embeddings has many moving components: generating embeddings, storing embeddings, detecting concept drift, optimizing inference and latency, evaluating the system offline and online, etc. Let’s look at the first two stages since they directly deal with embeddings.

Embeddings Generation

We have observed that embeddings are typically generated as a byproduct of training neural network models: they are taken from the penultimate layer, which precedes the final output layer used for classification or regression. There are two approaches to producing these embeddings. The first is to train our own models, as YouTube has done.

However, one of the major advantages of deep learning models is that we can also use pre-trained models. A pre-trained model is any model similar to the one we are considering for our task that has already been trained on vast amounts of training data and can be used for downstream tasks by fine-tuning.

When we fine-tune a model, we follow the same steps as training from scratch. We have training data, a model, and a loss function to minimize. However, some differences exist. We create a new model by duplicating the existing pre-trained model, except for the final output layer, which we initialize from scratch based on our new task. During training, the parameters of this new output layer start from random values, while the parameters of the earlier layers start from their pre-trained values and are only adjusted so that they focus on the new task, instead of being trained from scratch. In this way, we can refocus the fine-tuned model without having to train with a gigantic amount of data.
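
The sketch below illustrates a simplified variant of this procedure, often called feature extraction: the pre-trained layers are kept completely frozen and only a new, randomly initialized output layer is trained. `pretrained_features` is a hypothetical stand-in for the copied pre-trained layers, and the four-example dataset is invented.

```python
import math
import random


def pretrained_features(x):
    """Hypothetical frozen 'pre-trained' layers; never updated in this sketch."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_new_head(data, epochs=200, lr=0.5, seed=0):
    """Train only the new output layer (logistic regression on frozen features)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # new layer: random init
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = pretrained_features(x)
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    f = pretrained_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0


# Tiny made-up dataset for the new task.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.2, 0.9], 0)]
w, b = train_new_head(data)
print([predict(w, b, x) for x, _ in data])
```

Full fine-tuning would also update the parameters inside `pretrained_features`, typically with a small learning rate.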

The generated embeddings can be further improved by reducing their dimensionality (using techniques like PCA or t-SNE, the latter being mostly used for visualization) and indexing them using an appropriate data structure to facilitate fast and efficient retrieval during search or query processing (more below).

Embeddings Retrieval

After we finish training the model, we have to extract the embeddings from it. The trained model gives us a data structure that holds everything about the model's parameters, like weights, biases, layers, and learning rate. The embeddings are one of the layers in this model object that initially live in-memory. When we write the model to disk, the embeddings are serialized as part of the model object, and we load them back into memory when we retrain or run inference.
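
A minimal sketch of this round trip is shown below. The `model` dictionary, its field names, and the use of JSON are all assumptions made for illustration; real frameworks use their own checkpoint formats.

```python
import json
import os
import tempfile

# Hypothetical trained model object holding parameters and an embedding layer.
model = {
    "learning_rate": 0.01,
    "embeddings": {"video_1": [0.1, 0.2], "video_2": [0.3, 0.4]},
}


def save_model(model, path):
    """Serialize the whole model object, embeddings included, to disk."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_embeddings(path):
    """Load the model from disk and pull out just the embedding layer."""
    with open(path) as f:
        return json.load(f)["embeddings"]


path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model, path)
restored = load_embeddings(path)
print(restored["video_1"])  # [0.1, 0.2]
```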

There are different things we can do with embeddings when building a model. The easiest way to store embeddings is by using an in-memory numpy array. But, for more advanced tasks, we can:

Get them in batches or one by one during inference.
Analyze the quality of the embeddings offline.
Transform the embeddings to create new features.
Update the embeddings with new models.
Keep track of different versions of the embeddings.
Convert new documents into embeddings.

The most complex and customizable software that can handle most of these tasks is called a vector database. However, there are simpler options such as vector search plugins for databases like Postgres and SQLite, and caches like Redis.

When we work with embeddings, the most important thing we do is called vector search. This helps us find embeddings that are similar to the one we have, which is useful for finding similar items. To do vector searches, we need a way to search through our data structures efficiently and compare them to find the most similar ones.

Relational databases use a B-tree structure to keep items sorted in ascending order within a hierarchy of nodes, which makes it faster to read them. But we can't look up our vectors quickly using columns, so we need to create different structures for them. One example is inverted indices, which many vector stores use to do vector searches effectively.

A general-form embedding store has three main components: the embeddings themselves, an index that maps them back to words, pictures, or text, and a way to compare the similarity between different types of embeddings using various nearest neighbor algorithms.
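
The three components can be sketched as a small in-memory class. `EmbeddingStore` and its methods are hypothetical names, and the search is an exhaustive cosine-similarity scan rather than a real nearest neighbor index.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class EmbeddingStore:
    def __init__(self):
        self.vectors = []  # component 1: the embeddings themselves
        self.items = []    # component 2: index mapping each vector back to its item

    def add(self, item, vector):
        self.items.append(item)
        self.vectors.append(list(vector))

    def search(self, query, k=1):
        """Component 3: rank stored items by similarity to the query."""
        scored = [
            (cosine_similarity(query, vec), item)
            for vec, item in zip(self.vectors, self.items)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:k]]


store = EmbeddingStore()
store.add("cat.jpg", [0.9, 0.1])
store.add("dog.mp4", [0.8, 0.3])
store.add("car review", [0.1, 0.9])

print(store.search([1.0, 0.0], k=2))  # ['cat.jpg', 'dog.mp4']
```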

One common method for comparing embeddings is cosine similarity, but comparing a query against every stored vector can be slow and inefficient when there are millions of vectors. To solve this, approximate nearest neighbor (ANN) algorithms were developed, which group vectors into neighborhoods so that a vector's k-nearest neighbors can be found without scanning everything. Two of the most widely used options are HNSW (hierarchical navigable small worlds), an algorithm with standalone library implementations, and FAISS, a library that implements several ANN indexes. Both are also part of many existing vector stores.

The trade-off between exact (full) and approximate nearest neighbor search is that the latter is less precise but much faster. When evaluating precision and recall, it is important to weigh this trade-off and determine our requirements for both the accuracy of the retrieved results and the inference latency.
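
To make this trade-off tangible, the sketch below compares an exact scan with a simple bucket-based approximate search, in the spirit of an inverted index. It is neither HNSW nor FAISS. The centroids are just randomly chosen data points, and all sizes are arbitrary assumptions.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query, vectors, k):
    """Full scan: compare the query against every vector."""
    return sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))[:k]


def build_buckets(vectors, centroids):
    """Assign every vector to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approximate_search(query, vectors, centroids, buckets, k, n_probe=1):
    """Only scan the n_probe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: distance(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in buckets[c]]
    return sorted(candidates, key=lambda i: distance(query, vectors[i]))[:k]


rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(200)]
centroids = rng.sample(vectors, 8)
buckets = build_buckets(vectors, centroids)

query = [0.5, 0.5]
exact = exact_search(query, vectors, 5)
approx = approximate_search(query, vectors, centroids, buckets, 5)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall of approximate search: {recall:.2f}")
```

Raising `n_probe` scans more buckets, which moves the result closer to the exact answer at the cost of more distance computations.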

6 - Conclusion

We have now walked through the multimodal evolution of vector embeddings and their role in various applications.

We started with a brief definition of embeddings. We then walked through various unimodal embeddings that can represent text, images, and audio.
Embeddings have become even more important with the modern explosion of multimodal representations of data and the versatility of Transformer models. We dove into video representations, namely strategies around tokenization, embeddings, and positional embeddings.
Finally, we covered the engineering context of working with embeddings in production, including generating and storing embeddings.

The evolution of vector embeddings has opened up new possibilities for understanding and analyzing complex data, particularly in the realm of video data. As the field continues to develop and mature, we can expect to see even more exciting applications and innovations in the future.

Implementing deep learning models has become an increasingly important machine learning strategy for companies looking to build data-driven products. In order to build and power deep learning models, companies collect and feed hundreds of millions of terabytes of multimodal data into deep learning models. As a result, embeddings — deep learning models’ internal representations of their input data — are quickly becoming a critical component of building machine learning systems.

For example, they make up a significant part of Spotify’s item recommender systems, YouTube video recommendations of what to watch, and Pinterest’s visual search. Even if not explicitly presented to the user through recommendation system UIs, embeddings are used internally at places like Netflix to make content decisions around which shows to develop based on user preference popularity.

The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper. Building and expanding on the concepts in Word2Vec, the Transformer architecture, with its self-attention mechanism, a much more specialized case of calculating context around a given word, has become the de-facto way to learn representations of growing multimodal vocabularies, and its rise in popularity both in academia and in the industry has caused embeddings to become a staple of deep learning workflows.

However, the concept of embeddings can be elusive because they are neither data flow inputs nor output results - they are intermediate elements that live within machine learning services to refine models. So it is helpful to define them explicitly from the beginning.

1 - What Are Embeddings?

A dense embedding is a vector that distributes information related to a concept with multiple elements, indicating that elements can be tuned separately to allow more concepts to be encoded efficiently in a relatively low-dimensional space. Such representations can be compared to symbolic representations, such as one-hot encoding, which uses an element with a value of one to indicate the presence of a concept locally and values of zero for other elements.

In deep learning, the term “embedding” often refers to a mapping from a one-hot vector representing a word or image category to a distributed representation of real-valued numbers. More specifically, the process of embedding includes three steps:

Transforms multimodal input into representations that are easier to perform intensive computation on in the form of vectors, tensors, or graphs.
Compress input information for an ML task, such as summarizing a blog post or performing a semantic search on a large video corpus. The process of compression changes variable feature dimensions into fixed inputs, allowing them to be passed efficiently into downstream components of machine learning systems.
Creates an embedding space that is specific to the data the embeddings were trained on but that, in the case of deep learning representations, can also generalize to other tasks and domains through transfer learning.

Creating an embedding for a word, image, or video to represent an artifact in multidimensional space offers us many possibilities. For instance, in tasks that concentrate on content understanding in a video recommendation system, we are often interested in comparing two given items to assess their similarity. We can perform this task with mathematical precision by transforming videos into vectors and comparing video frames in a shared embedding space.

2 - Unimodal Embeddings

2.1 - Language Representations

Text embeddings are a type of representation for text data that map words or phrases to real-valued vectors in a lower-dimensional space. These vectors capture the meaning and context of the text and enable a variety of natural language processing (NLP) tasks. They have many NLP applications, including search engines, product recommendations, social media content moderation, email spam filtering, and customer support chatbots.

Text embeddings can be generated using a neural network that extracts high-level features from the text and converts them into a fixed-size vector (e.g., 1,500 dimensions). These embeddings can be used to compare and analyze text data, allowing for tasks such as text classification, text retrieval, and text summarization.

In the past, recurrent neural networks like long short-term memory (LSTM) or gated recurrent unit (GRU) language models incorporated information from all past words stored in a fixed-length recurrent vector when predicting a current word. Other commonly used methods for generating word embeddings include the continuous bag-of-words model, skip-grams, and global vectors (GloVe).

Since 2017, Transformers have been heavily used to learn text embeddings. The Vanilla Transformer is a model originally proposed for NLP and uses a self-attention mechanism to achieve state-of-the-art results on various NLP tasks. Many derivative models have been proposed following the success of the Vanilla Transformer, such as BERT, BART, GPT, Longformer, Transformer-XL, and XLNet. A pre-trained Transformer model can be a powerful text embedding generator.

2.2 - Visual Representations

Image embeddings are a way to represent images as real-valued vectors in a lower-dimensional space. These vectors capture the visual content of the image, which can be used for various computer vision tasks, such as image search, object detection, facial recognition, and content-based image retrieval.

We can obtain image embeddings using the output values from the final layers in image classification models (such as AlexNet, VGGNet, GoogLeNet, and ResNet). These models were winning or top-ranked entries in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) for image classification in 2012, 2014, and 2015. Image embeddings can be used to compare and analyze image data, enabling tasks such as image classification, image retrieval, and image similarity search.

Alternatively, more direct features can be used as visual embeddings, such as convolutional features and associated class labels from selected regions identified by object detection models. Models using this approach include the region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN.

Transformers are currently the most popular tool for NLP, and researchers are now exploring how they can be used in other areas, such as visual domains. One such area is the use of a Vision Transformer (ViT), which applies the encoder of a Transformer to images. ViT and its variations have been successfully used for various computer vision tasks, including low-level tasks, recognition, detection, and segmentation. They work well for both supervised and self-supervised visual learning. Recent studies have also provided further understanding of ViT, such as its robustness in internal representation and the continuous behavior of its latent representation propagation. A pre-trained ViT model can be a powerful image embedding generator.

2.3 - Audio Representations

Audio embeddings are numerical representations of audio signals that capture the acoustic content of the audio in a compact and meaningful way. These vectors aim to capture the semantic and contextual information of audio signals. They have many applications in audio processing, especially for tasks such as audio classification, audio retrieval, speaker recognition, and music recommendation.

We can use pre-trained models, such as VGGish or SoundNet, which have been trained on large-scale audio datasets like AudioSet or UrbanSound, to generate these audio embeddings. These models take low-level representations of the audio signals, such as spectrograms or mel-frequency cepstral coefficients (MFCCs), extract high-level features from them, and encode those features into embeddings.

Similar to the language and visual domain, we can also leverage the Transformer architecture to generate these audio embeddings. Some examples include PaSST, Audio Transformer, CTAL, SSAST, and Audio Spectrogram Transformer.

3 - Multimodal Embeddings

Although significant advancements have been made in representing vision, language, or speech, it is theoretically insufficient to model a complete set of human concepts using only one modality. For instance, the idea of a "beautiful picture" is grounded in visual representation, making it hard to describe through natural language or other non-visual ways. That's why it's crucial to learn joint embeddings that use multiple modalities to represent such concepts better. Generally speaking, the field of Multimodal AI looks at building AI systems that can extract embeddings from multimodal data.

3.1 - A Quick Note on Multimodal AI

Multimodal AI has been a significant research area in recent decades. The world we live in is a multimodal environment, and both our observations and behaviors are multimodal. For example, an AI navigation robot requires multimodal sensors and data sources to perceive the real-world environment. These include cameras, LiDAR, radar, ultrasonic sensors, GNSS, HD maps, and odometers. Additionally, human behaviors, emotions, events, actions, and humor are also multimodal. As a result, various human-centered Multimodal AI tasks are widely studied, including multimodal emotion recognition, multimodal event representation, understanding multimodal humor, face-body-voice-based video person-clustering, and more.

Thanks to the advancements in internet technology and the proliferation of intelligent devices, an ever-growing volume of multimodal data is transmitted over the web. This has given rise to a plethora of multimodal application scenarios. In today's world, we can observe a broad range of such applications, including commercial services (like e-commerce retrieval, vision-language navigation, and audio-visual navigation), communication methods (such as lip-reading and sign language translation), human-computer interaction, healthcare AI, and surveillance AI.

In the era of Deep Learning, Multimodal AI has made significant progress thanks to deep neural networks. Among the most competitive architectures are the Transformers, which offer new challenges and opportunities to Multimodal AI. Recent successes with large language models and their multimodal derivatives, such as Frozen, VL-Adapter, Flamingo, BEiT, and PaLI, show that Transformers have great potential for creating foundation models for Multimodal AI.

3.2 - Multimodal Pre-Training

In 2021, CLIP was proposed as a new milestone. It uses multimodal pre-training to convert classification into a retrieval task, which enables pre-trained models to tackle zero-shot recognition. CLIP thus demonstrates how large-scale multimodal pre-training can enable zero-shot learning, and it has become a major breakthrough for many multimodal tasks. The idea behind CLIP has since been studied further in other works, such as CLIP-based zero-shot semantic segmentation, ALIGN, MAD, ALBEF, and CoCa.
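
To make "classification as retrieval" concrete, here is a minimal sketch of the idea, not CLIP's actual code. It assumes an image encoder and a text encoder have already produced vectors in a shared space; the placeholder vectors, the prompt strings, and the fixed `temperature` value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """Pick the label whose text embedding is closest to the image embedding."""
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)  # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs


labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
# Placeholder vectors standing in for text-encoder and image-encoder outputs.
text_embs = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

best_label, probs = zero_shot_classify(image_emb, text_embs, labels)
```

Because the class names enter only as text, new classes can be added by embedding new prompts, with no retraining of a classification head.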

3.3 - Trends in Multimodal Big Data

In recent years, the internet has developed rapidly with new applications like social media and online retail. This has led to many new multimodal datasets being created. Some of the most well-known ones include Conceptual Captions, COCO, VQA, Visual Genome, SBU Captions, Cooking312K, LAIT, e-SNLI-VE, ARCH, Adversarial VQA, OTT-QA, MultiModalQA, VALUE, Fashion IQ, LRS2-BBC, ActivityNet, and VisDial.

The new trends among these datasets are:

The datasets are becoming larger in scale. Some of the new ones released are million-scale, like Product1M, Conceptual12M, RUC-CAS-WenLan (30M), HowToVQA69M, HowTo100M, ALT200M, LAION-400M, and LAION-5B.
There are more modalities. In addition to vision, text, and audio, new diverse modalities are emerging, like Pano-AVQA (the first large-scale spatial and audio-visual question-answering dataset on 360-degree videos), YouTube-360 (YT-360) (360-degree videos), AIST++ (a new multimodal dataset of 3D dance motion and music), and ArtEmis (affective language for visual arts). Astoundingly, MultiBench is a dataset including ten modalities.
There are more scenarios. In addition to common caption and QA datasets, more applications and scenarios have been studied, like CIRR (real-life images), Bed and Breakfast (BnB) (vision-and-language navigation), M3A (financial dataset), and X-World (autonomous driving).
The tasks are more difficult. Beyond the straightforward tasks, more abstract multimodal tasks are proposed, like MultiMET (a multimodal dataset for metaphor understanding) and Hateful Memes (hate speech in multimodal memes).
Instructional videos are becoming more popular, like cooking videos with YouCook2. Aligning a sequence of instructions to a video of someone carrying out a task is an example of a powerful pre-training pretext task (as shown in What’s Cookin’?). Pretext tasks are pre-designed problems to force the models to learn representation by solving them.

Transformers, like other deep neural network architectures, require a lot of data. Their high model capacity, combined with the availability of multimodal big data, is what drives the success of Transformer-based multimodal machine learning. For instance, big data brings zero-shot learning capability to vision-language pre-training (VLP) Transformer models.

4 - Video Representations

Many people are turning to video as a way to present information. It is an excellent medium because it combines images, motion, and sound, making it engaging and informative. However, working with video data can be challenging because it involves processing images, language, and audio all at once (in other words, video is multimodal). Additionally, there are unique challenges to the video domain, such as dealing with a high number of dimensions and modeling motion dynamics.

Video embeddings are numerical representations of videos that capture their visual, linguistic, and audio content in a compact and meaningful way. Similar to how word embeddings represent words as dense vectors in natural language processing, video embeddings aim to capture the semantic and contextual information of videos.

Video embeddings have various applications in computer vision and multimedia analysis. They can be used for tasks such as video classification, video retrieval, video summarization, and video recommendation. For instance, by representing videos as embeddings, it becomes easier to compare and search for similar videos based on their content, rather than relying solely on metadata or textual descriptions.

The Transformer architecture is versatile and can be used to model various data types. Recently, many Video Transformer works have adapted the Transformer to model videos. Before being input into the Transformer, a video must go through some processing: tokenization, embedding, and positioning. For example, the figure above is from the Video Vision Transformer (ViViT) paper. The authors consider two simple methods for mapping a video to a sequence of tokens (uniform frame sampling and tubelet embedding). They then add the positional embedding and reshape the tokens to obtain the final input to the Transformer.

It's important to note that the specific strategies for tokenization, embedding, and positioning can vary depending on the specific video modeling task and the available data.
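
The tubelet-style pipeline described above (tokenize, embed, add positions) can be sketched with plain NumPy reshapes. This is an illustration under stated assumptions rather than the paper's implementation: the video size, the tubelet size, `d_model`, and the random stand-ins for the learned projection and positional embeddings are all hypothetical.

```python
import numpy as np


def tubelet_tokens(video: np.ndarray, t: int, h: int, w: int) -> np.ndarray:
    """Split a (T, H, W, C) video into non-overlapping t x h x w tubelets.

    Returns an array of shape (num_tokens, t * h * w * C), one flattened
    tubelet per row. T, H, and W must be divisible by t, h, and w.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # Move the three grid axes to the front and the within-tubelet axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * h * w * C)


rng = np.random.default_rng(0)
video = rng.random((8, 32, 32, 3))                 # 8 frames of 32x32 RGB (toy size)
tokens = tubelet_tokens(video, t=2, h=16, w=16)    # 4 * 2 * 2 = 16 tokens

d_model = 64                                       # assumed model width
# Random stand-ins for weights that would normally be learned.
projection = rng.normal(size=(tokens.shape[1], d_model)) * 0.02
pos_embedding = rng.normal(size=(tokens.shape[0], d_model)) * 0.02

transformer_input = tokens @ projection + pos_embedding  # shape (16, 64)
```

Setting `t=1` would approximate per-frame patching, while larger `t` folds more temporal context into each token at the cost of temporal resolution.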

4.1 - Video Tokenization Techniques

Tokenization is the process of breaking down a video into smaller units called tokens, which can be used as input to a Transformer model. Here are some common tokenization techniques for videos:

1 - Frame-level tokenization

In this technique, each frame of the video is treated as a token. The frames are sequentially encoded and fed into the Transformer model. This approach allows the model to capture temporal information at the frame level.

2 - Clip-level tokenization

Instead of treating each frame as a token, this technique groups consecutive frames into clips. Each clip is then tokenized and processed as a single unit. This approach reduces the number of tokens and can be more efficient for longer videos.

3 - Co-tokenization

This technique involves jointly tokenizing both the video and its corresponding text description. The video and text tokens are then fused together to create a joint representation that captures both visual and textual information. This approach has been shown to be effective for video question-answering tasks.

4 - Object-level tokenization

In this technique, objects or regions of interest in the video are detected and segmented, and each object or region is treated as a separate token. This approach can be useful for tasks like object detection or action recognition.
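
The difference in token count between frame-level and clip-level tokenization can be shown with a small sketch. It assumes per-frame feature vectors are already available, and it uses averaging to turn a clip into one token, which is only one of several possible choices; the sizes are made up.

```python
import numpy as np


def frame_level_tokens(frame_features: np.ndarray) -> np.ndarray:
    """One token per frame: the features are used as they are."""
    return frame_features


def clip_level_tokens(frame_features: np.ndarray, clip_len: int) -> np.ndarray:
    """One token per clip: average the features of `clip_len` consecutive frames.

    Trailing frames that do not fill a whole clip are dropped in this sketch.
    """
    num_clips = frame_features.shape[0] // clip_len
    usable = frame_features[: num_clips * clip_len]
    return usable.reshape(num_clips, clip_len, -1).mean(axis=1)


rng = np.random.default_rng(0)
frame_features = rng.random((30, 128))   # 30 frames, 128-dim features (assumed sizes)

frame_tokens = frame_level_tokens(frame_features)            # 30 tokens
clip_tokens = clip_level_tokens(frame_features, clip_len=8)  # 3 tokens
```

Since self-attention cost grows with the square of the sequence length, going from 30 tokens to 3 is a large saving, paid for with coarser temporal detail.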

4.2 - Video Embedding Strategies

Different embedding strategies used in video models can have varying effects on their performance. Here are some examples:

1 - Image-based embedding

This strategy involves using pre-trained image models, such as Inception-V3 or ResNet, to extract visual features from each frame of the video. These features are then transformed into embeddings using techniques like pooling or fully connected layers. Image-based embeddings can capture detailed visual information and are effective for tasks like object detection or scene understanding.

2 - Audio-based embedding

In addition to visual features, audio features can also be extracted from the video. Techniques like spectrogram analysis or audio embeddings can be used to convert the audio signals into embeddings. Audio-based embeddings can capture sound-related information and are useful for tasks like audio-visual synchronization or sound event detection.

3 - Joint audio-visual embedding

This strategy combines both visual and audio features to create joint embeddings. The visual and audio features are typically fused at some point in the model architecture, allowing the model to capture the correlations between the two modalities. Joint embeddings can be beneficial for tasks like video captioning or video retrieval.

4 - Temporal embedding

Temporal embedding strategies focus on capturing the temporal dynamics of the video. This can be done by incorporating temporal information, such as optical flow or motion vectors, into the embedding process. Temporal embeddings can help the model understand the motion and temporal relationships between different frames or clips in the video.

5 - Spatial embedding

Spatial embedding strategies aim to capture the spatial relationships between different regions of interest in the video. This can be achieved by using techniques like region-based embeddings or attention mechanisms that attend to specific spatial locations. Spatial embeddings are useful for tasks like object tracking or action recognition.
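
As one simple instance of a joint audio-visual embedding, the sketch below pools each modality over time and concatenates the results (often called late fusion). The feature shapes are assumptions, and the random arrays stand in for the outputs of a pre-trained image model and audio model.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return x / np.linalg.norm(x)


def joint_audio_visual_embedding(frame_features, audio_features):
    """Late fusion: pool each modality over time, normalize, then concatenate."""
    visual = l2_normalize(frame_features.mean(axis=0))
    audio = l2_normalize(audio_features.mean(axis=0))
    return np.concatenate([visual, audio])


rng = np.random.default_rng(0)
frame_features = rng.random((30, 512))   # stand-in for per-frame image-model outputs
audio_features = rng.random((10, 128))   # stand-in for per-window audio-model outputs

video_embedding = joint_audio_visual_embedding(frame_features, audio_features)
```

Normalizing each modality before concatenation keeps either one from dominating the joint vector purely because of scale. Models that fuse earlier, for example through cross-attention, can capture finer correlations but are harder to sketch in a few lines.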

4.3 - The Role of Positional Embeddings

Self-attention is the essential operation of the Transformer. It enhances each token embedding with information from other embeddings in a sequence. Since self-attention operates on sets, it is crucial to indicate positional information to utilize the spatiotemporal structure of videos. This is achieved using positional embeddings (PE), which encode the temporal position information of video frames or clips within the Transformer.

Some popular positioning methods for video data in Transformer models include:

1 - Learned positional embedding

This approach involves learning the positional embeddings during the training of the Transformer model. The model learns one vector for each position of the frames or clips in the video sequence, which allows it to capture the temporal order of the video data.

2 - Fixed positional encoding

In this method, fixed positional encodings are added to the video embeddings. These encodings provide information about the absolute positions of the frames or clips in the video sequence. Fixed positional encodings can be based on sine and cosine functions or other predefined patterns.

3 - Relative positional encoding

This approach incorporates relative position information between frames or clips in the video sequence. It takes into account the distance or time gap between pairs of frames or clips. Relative positional encodings can be added as additional inputs to the Transformer model or incorporated into the self-attention mechanism.

4 - Hybrid approaches

Some methods combine different types of positional encodings to capture both the absolute and relative positions of the video data. This can involve using a combination of fixed and learned positional encodings or incorporating relative position information into the self-attention mechanism.
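
The fixed sine-and-cosine variant is compact enough to write out. The sketch below follows the standard sinusoidal formulation from the original Transformer and adds it to a sequence of clip embeddings; the sequence length and width are arbitrary toy values.

```python
import numpy as np


def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Fixed sine/cosine encodings of shape (num_positions, d_model).

    Even columns hold sines and odd columns hold cosines; d_model must be even.
    """
    assert d_model % 2 == 0
    positions = np.arange(num_positions)[:, None]           # shape (N, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # shape (d_model / 2,)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe


rng = np.random.default_rng(0)
clip_embeddings = rng.random((16, 64))    # 16 clips, 64-dim embeddings (toy values)
pe = sinusoidal_positional_encoding(16, 64)
positioned = clip_embeddings + pe         # same shape as the input
```

A learned positional embedding would replace `pe` with a trainable array of the same shape, and a hybrid approach could sum the two.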

5 - Embeddings In Production

Engineering systems based on embeddings (such as building foundation models) can be computationally expensive to build and maintain. The need to create, store, and manage embeddings has also recently given rise to an entire ecosystem of related products, notably vector databases that facilitate production-ready nearest-neighbor semantic queries in machine learning systems.

5.1 - Embeddings In Practice

In a production environment, it is possible to train your own embedding model to gain insight into the model's internals if you have access to a lot of GPUs. Alternatively, you can use pre-trained embeddings and adapt them to your specific use cases. Many companies today are using embeddings in both of these contexts. Notably, YouTube was one of the first large companies to publicly share its work on video embeddings in the context of a production recommender system with "Deep Neural Networks for YouTube Recommendations."

As YouTube has over 800 million pieces of content (videos) and 2.6 billion active users, the application needs to recommend existing content to users while also being able to generalize to new content, which is frequently uploaded. These recommendations need to be served quickly at inference time—when the user loads a new page—with low latency.

YouTube shares how they created a two-stage recommender system for videos based on two deep learning models. The machine learning task is to predict the correct next video to show the user at a given time in YouTube recommendations so that they click. The final output is formulated as a classification problem: given a user's input features and the input features of a video, can we predict a class for the user that includes the predicted watch time for the user for a specific video with a specific probability?

To develop the model, they use two sets of embeddings as input data: (1) embeddings that represent the user and their context as features, and (2) embeddings that represent the video items. The model has many features, including ones based on tables and embeddings. The embeddings-based features include:

User watch history - a vector that shows which videos the user has watched, represented as a sparse video ID mapped to a dense vector.
User's search history - shows which videos the user clicked on after a search term. This is also represented as a sparse vector mapped to the same space as the user watch history.
User's geography, age, and gender - shown as table features.
The number of previous impressions a video had, normalized per user over time.

All of these features are combined into one item embedding. For the user, all the embeddings are blended into one user embedding. These embeddings are then fed into the model's softmax layers. The softmax layers compare the output of the layer (i.e., the probability that the user will interact with an item) to a set of ground truth items. The ground truth items are a set of items that the user has already interacted with. The unnormalized log probability (logit) of an item is the dot product of two n-dimensional vectors: the query and item embeddings.
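
The scoring step can be sketched as a dot product between one user vector and every candidate item vector, followed by a softmax. This is a toy illustration and not YouTube's implementation; the sizes and the random vectors are assumptions standing in for learned embeddings.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn a vector of scores into probabilities that sum to one."""
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
num_videos, dim = 5, 8                                  # toy sizes
video_embeddings = rng.normal(size=(num_videos, dim))   # stand-ins for learned item vectors
user_embedding = rng.normal(size=dim)                   # stand-in for the blended user vector

logits = video_embeddings @ user_embedding   # one dot product per candidate video
probs = softmax(logits)                      # probability assigned to each candidate
ranking = np.argsort(-probs)                 # candidate indices, most likely first
```

At serving time only the ranking matters, so the softmax can be skipped and the dot products handed to a nearest-neighbor index instead.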

5.2 - Embeddings as an Engineering Problem

A production-level ML system using embeddings has many moving components: generating embeddings, storing embeddings, detecting concept drift, optimizing inference and latency, evaluating the system offline and online, etc. Let’s look at the first two stages since they directly deal with embeddings.

Embeddings Generation

We have observed that embeddings are typically generated as a byproduct of training neural network models, usually taken from the penultimate layer, which precedes the final output layer used for classification or regression. There are two approaches to producing these embeddings. The first is to train our own models, as YouTube has done.

However, one of the major advantages of deep learning models is that we can also use pre-trained models. A pre-trained model is any model similar to the one we are considering for our task that has already been trained on vast amounts of training data and can be used for downstream tasks by fine-tuning.

When we fine-tune a model, we follow the same steps as training from scratch. We have training data, a model, and a loss function to minimize. However, some differences exist. We create a new model by duplicating the existing pre-trained model, except for the final output layer, which we replace with a new layer sized for our new task. We initialize the parameters of this new output layer randomly and train them from scratch, while the parameters of the earlier layers start from their pre-trained values and are only adjusted slightly so that they focus on the new task. In this way, we can refocus the fine-tuned model without having to train with a gigantic amount of data.
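
The model-duplication step can be illustrated without any deep learning framework. In the sketch below, the layer names, the shapes, and the dictionary standing in for a pre-trained network are hypothetical, and the two learning rates reflect a common convention rather than a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: two hidden layers plus a 1000-class head.
pretrained = {
    "layer1": rng.normal(size=(32, 64)),
    "layer2": rng.normal(size=(64, 64)),
    "head": rng.normal(size=(64, 1000)),
}


def build_finetune_model(pretrained: dict, num_classes: int) -> dict:
    """Copy every layer except the head, which is re-initialized for the new task."""
    model = {name: w.copy() for name, w in pretrained.items() if name != "head"}
    hidden_dim = pretrained["head"].shape[0]
    model["head"] = rng.normal(size=(hidden_dim, num_classes)) * 0.01
    return model


model = build_finetune_model(pretrained, num_classes=10)

# Assumed convention: a small learning rate for copied layers, a larger one for the new head.
learning_rates = {name: (1e-2 if name == "head" else 1e-4) for name in model}
```

The copied layers carry over what was learned from the large source dataset, so only a modest amount of task-specific data is needed to adapt them.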

The generated embeddings can be further improved by reducing their dimensionality (using techniques like PCA or t-SNE) and indexing them using an appropriate data structure to facilitate fast and efficient retrieval during search or query processing (more below).
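
For the dimensionality reduction step, PCA can be written directly with a singular value decomposition. The sketch below is a minimal version that assumes the embeddings fit in memory; the array sizes and the number of components are arbitrary.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project embeddings onto their top `n_components` principal directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))            # 200 toy embeddings of width 64
reduced = pca_reduce(embeddings, n_components=8)   # shape (200, 8)
```

To reduce embeddings that arrive later, the mean and the rows of `vt` from the original fit would need to be stored and reused, so that old and new vectors live in the same reduced space.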

Embeddings Retrieval

After we finish training the model, we have to extract the embeddings from it. The trained model gives us a data structure that holds everything about the model's parameters, like weights, biases, layers, and learning rate. The embeddings are one of the layers in this model object and initially live in memory. We include them as part of the model object when we serialize the model to disk, and we load them back into memory when we retrain or do inference.

There are different things we can do with embeddings when building a model. The easiest way to store embeddings is by using an in-memory NumPy array. But, for more advanced tasks, we can:

Get them in batches or one by one during inference.
Analyze the quality of the embeddings offline.
Transform the embeddings to create new features.
Update the embeddings with new models.
Keep track of different versions of the embeddings.
Convert new documents into embeddings.
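
The simplest storage option, an in-memory NumPy array with an ID-to-row index, already covers single lookups, batch lookups, and a crude form of versioning through file names. The identifiers and the file name below are hypothetical.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
item_ids = ["video_001", "video_002", "video_003"]   # hypothetical identifiers
embeddings = rng.normal(size=(len(item_ids), 16)).astype(np.float32)

# The index maps each identifier to its row in the array.
row_of = {item_id: row for row, item_id in enumerate(item_ids)}


def get_embedding(item_id: str) -> np.ndarray:
    """Fetch a single embedding by identifier."""
    return embeddings[row_of[item_id]]


def get_batch(ids: list) -> np.ndarray:
    """Fetch several embeddings at once, in the order requested."""
    return embeddings[[row_of[i] for i in ids]]


# Persist one version of the embeddings and load it back.
path = os.path.join(tempfile.mkdtemp(), "embeddings_v1.npy")
np.save(path, embeddings)
restored = np.load(path)
```

This approach stops scaling once the array no longer fits in memory or once several services need concurrent updates, which is where dedicated vector stores come in.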

The most complex and customizable software that can handle most of these tasks is called a vector database. However, there are simpler options such as vector search plugins for databases like Postgres and SQLite, and caches like Redis.

When we work with embeddings, the most important thing we do is called vector search. This helps us find embeddings that are similar to the one we have, which is useful for finding similar items. To do vector searches, we need a way to search through our data structures efficiently and compare them to find the most similar ones.
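
Before any index is involved, vector search can be done exhaustively by scoring the query against every stored embedding. The sketch below does this with cosine similarity on a toy collection; the sizes and the perturbed query are assumptions for demonstration.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return the row indices of the k embeddings most similar to the query."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = emb_norm @ query_norm     # cosine similarity per row
    return np.argsort(-scores)[:k]     # highest scores first


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 32))               # toy collection
query = embeddings[42] + 0.01 * rng.normal(size=32)    # slightly perturbed copy of item 42

neighbors = top_k_cosine(query, embeddings, k=5)
```

This exact search touches every row for every query, which is fine for thousands of vectors and becomes the bottleneck at millions.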

Relational databases use a B-tree structure to sort items in ascending order within a hierarchy of nodes, which makes it faster to read them. But we can't look up our vectors quickly using columns, so we need to create different structures for them. One example is inverted indices, which many vector stores use to do vector searches effectively.

A general-form embedding store has three main components: the embeddings themselves, an index that maps them back to words, pictures, or text, and a way to compare the similarity between different types of embeddings using various nearest neighbor algorithms.

One common method for comparing embeddings is cosine similarity, but computing it exhaustively against every stored vector becomes slow and inefficient when a collection holds millions of vectors. To solve this, approximate nearest neighbor (ANN) algorithms were developed to organize vectors into neighborhoods and find a vector's k-nearest neighbors without scanning the whole collection. Two of the most widely used options are HNSW (hierarchical navigable small worlds), a graph-based algorithm, and FAISS, a library that implements several ANN index types. Both are available as standalone libraries and are also part of many existing vector stores.

The trade-off between exact (full) and approximate nearest neighbor search is that the latter is less precise but much faster. When evaluating precision and recall, we need to decide how accurate the retrieved neighbors must be and how much inference latency and computation we can afford.
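
The speed-versus-recall trade-off can be seen in a toy inverted-list index. This sketch is neither HNSW nor FAISS: it partitions the vectors around randomly chosen centroids (skipping k-means to stay short) and scans only the closest partitions at query time. All sizes and the `n_probe` values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)   # unit vectors: dot product equals cosine

# Build: use randomly chosen data points as coarse centroids.
n_lists = 20
centroids = data[rng.choice(len(data), size=n_lists, replace=False)]
assignments = np.argmax(data @ centroids.T, axis=1)
inverted_lists = [np.flatnonzero(assignments == c) for c in range(n_lists)]


def exact_search(query: np.ndarray, k: int) -> np.ndarray:
    """Score every vector and return the indices of the k best."""
    return np.argsort(-(data @ query))[:k]


def approximate_search(query: np.ndarray, k: int, n_probe: int) -> np.ndarray:
    """Scan only the `n_probe` lists whose centroids are closest to the query."""
    probe = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = data[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]


def recall_at_k(query: np.ndarray, k: int, n_probe: int) -> float:
    """Fraction of the exact top-k that the approximate search also returns."""
    exact = set(exact_search(query, k).tolist())
    approx = set(approximate_search(query, k, n_probe).tolist())
    return len(exact & approx) / k


query = data[0]
recall_few = recall_at_k(query, k=10, n_probe=2)         # scans only two of the lists
recall_all = recall_at_k(query, k=10, n_probe=n_lists)   # scans every list
```

Raising `n_probe` moves the search toward the exact result at the cost of scanning more vectors, which is the same kind of knob that production ANN indexes expose.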

6 - Conclusion

We have now walked through the multimodal evolution of vector embeddings and their role in various applications.

We started with a brief definition of embeddings. We then walked through various unimodal embeddings that can represent text, images, and audio.
Embeddings have become even more important in the modern explosion of multimodal representations of data and the versatility of Transformer models. We dived a bit into video representations, namely strategies around embeddings and tokenization.
Finally, we have understood the engineering context of working with embeddings in production, including generating and storing embeddings.

The evolution of vector embeddings has opened up new possibilities for understanding and analyzing complex data, particularly in the realm of video data. As the field continues to develop and mature, we can expect to see even more exciting applications and innovations in the future.

At Twelve Labs, we are developing foundation models for multimodal video understanding. In other words, we extract video embeddings from raw video data and power various downstream video understanding tasks, such as video search and video classification. Our goal is to help developers build programs that can see, listen, and understand the world as we do with the most advanced video-understanding infrastructure.

If you would like to learn more, please sign up at https://playground.twelvelabs.io/ and join our Multimodal Minds Discord community to chat about all things Multimodal AI!