What makes Foundation Models special?
James Le
Date Published
April 11, 2023
Foundation models
Transfer Learning
Multimodal AI

Since Stanford introduced the Foundation Models concept in 2021, it has become one of the hottest trends in almost every business sector. These models are also among the first AI systems to reach the general public. Put simply, a foundation model is a large pre-trained machine learning model that can be further fine-tuned for a specific task. Such models have achieved state-of-the-art performance on a wide range of tasks.

However, understanding the capabilities of foundation models and their potential applications can be challenging for audiences unfamiliar with the technical terminology. While our previous post looks at them through a technical lens, this article breaks down the capabilities of foundation models into simple terms and provides examples of how they can be used to solve real-world problems. By doing so, we hope to make this exciting technology more accessible and understandable to a broader audience.

1 - An Overview of Foundation Models

1.1 - Why Now?

Foundation models are large, complex models that are designed to be flexible and reusable across a wide range of domains and industries. These models work because of two things: transfer learning and scale.

Transfer Learning

Transfer learning is a technique used in machine learning (ML) that involves using knowledge gained from one task to help with another task. This can improve the performance of ML models by reducing the need for large amounts of data for each new task and allowing models to learn more quickly. Transfer learning helps solve many real-world applications - such as cold start problems in recommendation systems, simulation in robotics, and translation across different languages. If you want to learn more, this blog post from Georgian concisely explains transfer learning with further examples.
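To make the idea concrete, here is a minimal sketch of transfer learning in plain Python. It is a toy under stated assumptions, not how any production model is built: `pretrained_features` is a hypothetical stand-in for a frozen pre-trained network, and only a small task-specific "head" is trained on a handful of labeled examples.

```python
import math

def pretrained_features(x):
    """Hypothetical stand-in for a frozen pre-trained network.

    It turns a raw input into features and is never updated below.
    """
    return [x, x * x, 1.0]

def train_head(examples, epochs=2000, learning_rate=0.1):
    """Train only a small logistic-regression head on top of the frozen features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, label in examples:
            features = pretrained_features(x)
            score = sum(w * f for w, f in zip(weights, features))
            probability = 1.0 / (1.0 + math.exp(-score))
            # Gradient step on the head weights only; the features stay fixed.
            for i, feature in enumerate(features):
                weights[i] -= learning_rate * (probability - label) * feature
    return weights

def predict(weights, x):
    """Return 1 or 0 for the new task using the frozen features plus the head."""
    score = sum(w * f for w, f in zip(weights, pretrained_features(x)))
    return 1 if score > 0 else 0

# New task with only seven labeled examples: is the input far from zero?
examples = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]
head = train_head(examples)
```

Because the reusable features already exist, the new task needs only a few examples and a few parameters, which is the practical benefit described above.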

Source: https://crfm.stanford.edu/report.html

Scale

Scale makes foundation models powerful. It requires three ingredients:

  1. Improvements in computer hardware: The use of GPUs has dramatically improved the performance of deep learning models. GPUs can perform calculations in parallel, making them perfect for the math involved in training deep neural networks. Over the last 5 years, GPU throughput and memory have increased 10x to support larger foundation models.
  2. Highly parallel architectures: Transformer architectures process the elements of an input in parallel rather than one at a time, so larger deep learning networks can take advantage of the parallelism of the hardware. Their attention mechanism also lets Transformer networks capture long-range dependencies and high-order interactions between elements that sit far apart in the input data. We talked about this more in our previous post on multimodal foundation models.
  3. Huge amounts of training data: Large-scale datasets and improved data collection and annotation have enabled the development of larger and more powerful foundation models. GPT-2 was trained with 40 gigabytes of data, while GPT-3 was trained with 570 gigabytes of data, including a big chunk of the internet. Many companies and organizations also have access to proprietary datasets that aren't shared publicly, which they can use to train large models. Another approach is data augmentation, where existing datasets are enhanced by techniques like randomly substituting words or phrases in a sentence while preserving its overall meaning.
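The data augmentation idea from the third ingredient can be sketched in a few lines of Python. This is only an illustration: the `SYNONYMS` table and the `augment` function are made up for this example, and a real pipeline would rely on a much larger lexicon or a learned paraphrasing model.

```python
import random

# Hypothetical synonym table, invented for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def augment(sentence, swap_probability=0.5, seed=None):
    """Return a copy of the sentence with some words swapped for synonyms."""
    rng = random.Random(seed)
    augmented = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_probability:
            augmented.append(rng.choice(options))
        else:
            augmented.append(word)
    return " ".join(augmented)

original = "the quick dog looks happy"
variants = [augment(original, seed=s) for s in range(3)]
```

Each variant keeps the sentence length and overall meaning, so one labeled example can be turned into several training examples.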

1.2 - Technical Gains


While the engineering behind these models is impressive, AI researchers and engineers are most excited about their generalizability. Generalizability means that a well-trained foundation model can make accurate predictions or generate coherent text or images from data it has never seen before, without additional training or fine-tuning. This is possible because the model has already been trained on a large dataset and has learned basic features that are useful for many different tasks.

In contrast, traditional ML models require a large amount of labeled data to perform well on a given task. This labeling process can be time-consuming and expensive. Furthermore, you need to design a suitable architecture and iterate through multiple training cycles, thereby limiting the scalability and generalizability of the model.

Source: https://crfm.stanford.edu/report.html

Foundation models have achieved state-of-the-art performance in various research benchmarks across a wide range of modalities, including text, images, speech, tabular data, protein sequences, organic molecules, and reinforcement learning. Additionally, since data is naturally multimodal in some domains (such as videos), multimodal foundation models effectively combine relevant information about a domain and adapt to tasks involving multiple modes.


Out-of-the-box foundation models trained on general knowledge often struggle with domain-specific tasks. To improve the model’s performance to the point where business leaders feel comfortable using it, you must gather and prepare data for fine-tuning. For example, BloombergGPT is a foundation model developed by Bloomberg. It has been pre-trained on a vast amount of financial text data and finance-specific knowledge sources. As a result, it can perform financial natural language processing and fill in incomplete domain-specific phrases by reasoning over pre-existing knowledge or references.

Source: https://openai.com/research/instruction-following

To fine-tune a foundation model, the obvious approach is to create a large, domain-specific training dataset for your use cases and then adapt the model to that dataset. Another approach that has become popular lately is RLHF, which stands for reinforcement learning from human feedback. At a high level, the concept is to fine-tune a pre-trained LLM with information about human preferences, prodding it to produce desirable outputs. If you want to learn more, this blog post from Molly Welch provides an excellent overview of RLHF and its applications.
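One common ingredient of RLHF is a reward model trained on pairs of outputs that human labelers have ranked. The sketch below shows the pairwise preference loss often used for that step, assuming the reward model has already produced a score for each output. The function name and the example scores are hypothetical, and a full RLHF pipeline involves more stages than this.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the output humans preferred scores higher
    than the output they rejected, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward scores for two candidate answers to the same prompt.
agrees_with_humans = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_humans = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Minimizing this loss over many ranked pairs nudges the reward model toward human preferences, and that reward signal is then used to fine-tune the language model.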

1.3 - Economic Benefits

Foundation models can benefit enterprise businesses by reducing time-to-market, improving productivity, and increasing revenue. By democratizing these powerful models with unprecedented generalizability, the community can enable individuals, developers, and businesses to take advantage of these capabilities without building them from scratch. This is similar to how we draw power from nearby power plants instead of building our own generators.

These models can help businesses automate both manual and creative tasks, such as content creation and editing. This speeds up the development and iteration of products and services. They can also be used to develop chatbots and virtual assistants that provide customer support and answer common questions. This helps businesses save resources and still achieve their goals.

Although calculating cost of goods sold may be more complicated when switching from traditional SaaS to foundation models, applications powered by foundation models are expected to be much more intelligent. This enables hyper-personalized customer experiences, which can in turn increase revenue.

2 - Types of Foundation Models

2.1 - Language

Foundation models have had a significant impact on the field of natural language processing (NLP). For instance, OpenAI's GPT-3 (Generative Pre-trained Transformer 3) is a well-known foundation model that can produce human-like language. This model has been trained on an extensive amount of text data, making it adaptable for various language-related tasks such as text generation, question answering, and text summarization. Its successors in the GPT-3.5 series served as the foundation for OpenAI’s ChatGPT.

GPT-3 can be used in various ways, including generating high-quality news articles and blogs, scheduling appointments, generating poems, and translating one language to another.

2.2 - Vision

Computer Vision saw its first foundation model with OpenAI's release of CLIP (Contrastive Language–Image Pre-training). The CLIP model was trained on 400 million image-caption pairs and learned the relationship between language and images. Rather than writing captions itself, CLIP scores how well a piece of text matches an image. For example, if you show the model a picture of a cat along with candidate descriptions such as "a photo of a cat" and "a photo of a dog", it can tell you which description fits best, without having been trained on that specific labeling task. This ability to connect language and images has many potential applications, such as image recognition and image search, and it can serve as a building block for captioning systems and for creative tools that generate art or design products.
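The matching step behind a CLIP-style model can be sketched as follows. This is a simplified illustration, not OpenAI's implementation: the three-dimensional embeddings are made up, whereas real image and text encoders output vectors with hundreds of dimensions, and the `temperature` value is an arbitrary assumption.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors, ignoring their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, caption_embeddings, temperature=0.1):
    """Turn image-to-caption similarities into one probability per caption."""
    scores = {
        caption: cosine_similarity(image_embedding, embedding) / temperature
        for caption, embedding in caption_embeddings.items()
    }
    largest = max(scores.values())  # subtract the max for numerical stability
    exponentials = {c: math.exp(s - largest) for c, s in scores.items()}
    total = sum(exponentials.values())
    return {c: value / total for c, value in exponentials.items()}

# Made-up embeddings standing in for the outputs of image and text encoders.
image = [0.9, 0.1, 0.2]
captions = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}
probabilities = zero_shot_classify(image, captions)
best_caption = max(probabilities, key=probabilities.get)
```

Changing the task only requires changing the list of candidate captions, which is why this kind of model can classify images it was never explicitly trained to label.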

Source: https://www.theverge.com/2022/9/15/23340673/ai-image-generation-stable-diffusion-explained-ethics-copyright-data

Additionally, Stable Diffusion is an open-source model that can generate images from text. It uses a technique called a latent diffusion model to create realistic, high-quality images based on simple textual descriptions. The project has gained much attention because of its ability to produce stunning images that look like human artists created them. The best part is that it is free to use and can be downloaded by anyone who wants to experiment with image generation. Whether you're an artist, a developer, or just someone interested in AI, you can explore Stable Diffusion and see what kinds of images you can create from your textual descriptions.

2.3 - Speech

In addition to language and vision, foundation models can also process speech. OpenAI's Whisper is a speech foundation model that can understand spoken language in many different accents and languages. It can transcribe spoken words accurately, even in noisy environments. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web, which, according to OpenAI, has helped it approach human-level robustness and accuracy on English speech recognition.

Whisper has the potential to be used in a wide range of applications, such as digital assistants, transcription software, and even in cars and other noisy environments.
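Transcription accuracy for speech models is commonly reported as word error rate (WER): the number of word-level edits needed to turn the model's transcript into the reference transcript, divided by the length of the reference. Below is a minimal sketch of that metric; the example sentences are made up, and real evaluations usually normalize punctuation and casing first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)

# The transcript misses one of the six reference words.
error = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A lower WER means a more accurate transcript, and a WER of zero means the transcript matches the reference word for word.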

2.4 - Video

Finally, foundation models have also been developed for processing video data. These models are designed to understand the content of videos, including visual and audio elements. They can be used for various applications, such as video annotation, summarization, and search. For example, you could train a foundation model for video data to recognize specific objects, actions, or scenes within a video. The model could then automatically generate tags or captions for the video, making it easier to search for or share.

At Twelve Labs, we are building foundation models for long-tail multimodal video understanding. Our Video Understanding API provides developers with powerful tools for processing and analyzing video data, such as video search and classification. Our foundation models can be further fine-tuned for specific applications, making them highly adaptable to various industries.

3 - Potential Applications

Foundation models can be applied to a wide range of industries, from media and entertainment to sports analytics to consumer healthcare. Below are interesting AI applications from these industries that are already in production and could be replicated (and improved) using foundation models.

3.1 - Media and Entertainment
Source: https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd

Companies in the entertainment industry are using AI to create more personalized and engaging consumer experiences. One example is Netflix, which uses AI to help creators make better media, such as TV shows, trailers, movies, and promotional art. They have video understanding models that categorize characters, storylines, emotions, and cinematography, making it easier to find specific footage. This frees creators from the task of categorizing footage for hours so that they can focus on making creative decisions.

Foundation models could solve similar use cases, such as categorizing characters or emotions. Because they are pre-trained on broad data, they could improve the accuracy and efficiency of this categorization, allowing creators to quickly find specific footage without spending hours labeling it.

Foundation models can also be used to create more engaging and realistic content. For example, in the gaming industry, they can help make non-player characters (NPCs) more intelligent and realistic. Ubisoft has created a new AI tool called Ghostwriter that automatically generates dialogue for NPCs, while Roblox offers creators a platform called Roblox Studio to build immersive 3D experiences.

3.2 - Sports Analytics

AI can help track the ball and players' movements in sports analytics. At the FIFA World Cup 2022 in Qatar, AI was used for several jobs, including a semi-automated offside technology to detect offside offenses. This technology gathers data from video feeds and sensors to help referees make better calls on offside positions. 12 dedicated tracking cameras mounted underneath the stadium's roof tracked the ball and up to 29 data points on each player, 50 times per second, to calculate player positions and ball location.
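Once positions are known, the geometric part of the offside check is simple. The sketch below is a deliberate simplification and not FIFA's system: it reduces each player to a single coordinate along the pitch, whereas the real technology tracks many body points per player, and the function name, parameters, and the 105-meter pitch length are assumptions for illustration.

```python
def in_offside_position(attacker_x, ball_x, defender_xs, halfway_x=52.5):
    """Check the offside position of one attacker from tracked coordinates.

    Coordinates run along the pitch and grow toward the defending team's
    goal line (0 to 105 on an assumed 105-meter pitch). defender_xs must
    include the goalkeeper. Being level with the ball or with the
    second-last defender does not count as offside.
    """
    second_last_defender_x = sorted(defender_xs)[-2]
    in_opponents_half = attacker_x > halfway_x
    ahead_of_ball = attacker_x > ball_x
    ahead_of_second_last_defender = attacker_x > second_last_defender_x
    return in_opponents_half and ahead_of_ball and ahead_of_second_last_defender

# Hypothetical snapshot at the moment a pass is played.
defenders = [100.0, 85.0, 70.0]  # goalkeeper and two outfield defenders
flagged = in_offside_position(attacker_x=90.0, ball_x=80.0, defender_xs=defenders)
```

The hard part in practice is obtaining accurate positions at the exact moment the ball is played, which is what the cameras and sensors provide.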

Source: https://www.youtube.com/watch?v=FEPKAeGzwPU

Using foundation models, FIFA could improve the accuracy and efficiency of its semi-automated offside technology. A foundation model fine-tuned on soccer footage collected over the years could learn to recognize different offside situations, such as when a player is offside due to their position or movement. Additionally, it could predict the likelihood of offside situations, helping referees be more prepared and potentially avoid wrong calls.

3.3 - Healthcare

In consumer healthcare, AI has made it easier for healthcare workers to manage their time by handling specific tasks for them, such as filing insurance claims, dealing with paperwork, and writing notes from a doctor's visit. Additionally, AI can help predict patient outcomes based on factors such as age, medical history, and genetic information. Doctors can use this information to create personalized treatment plans. If you want to learn more about this topic, read Eric Topol's book "Deep Medicine."

Foundation models can be used to analyze medical images, such as X-rays and MRIs. For example, a foundation model can be trained to spot unusual things in medical images, like tumors or fractures, which can help doctors diagnose and treat diseases faster and more accurately.

Source: https://arxiv.org/abs/2210.04133

In addition to assisting with care, foundation models will significantly accelerate the rate of medical breakthroughs. The amount of data in biology is vast, and it is difficult for humans to keep track of all the ways complex biological systems work. However, there is already software that can analyze this data, infer pathways, search for targets on pathogens, and design drugs accordingly. Isomorphic Labs, whose technology is based on the AlphaFold breakthrough, is applying AI to drug discovery and revolutionizing the way medicine helps and heals people.


I find the above applications personally interesting, and these are only a few examples out of thousands that can be disrupted. The potential for foundation models is truly exciting, as they offer countless unseen and new opportunities that have yet to be imagined. These opportunities will surely emerge, allowing creative minds to transform their ideas into tangible, compelling products in record-breaking time.

Foundation models have limitations that should be considered before being used in production. One risk is that they may not perform as well on niche or specific tasks that were not included in the pre-training data. Therefore, it may be necessary to fine-tune the models on data that is more relevant to the specific task to improve their performance. Additionally, foundation models may inadvertently perpetuate biases present in the data used to train them. Preventing and mitigating these biases is an active area of research, and there are best practices that can be used to reduce the impact of biases in pre-trained models (which we will cover in future articles).

If you are interested in chatting about foundation models in general, join our Discord community to discuss all things Multimodal AI!
