The World of Multimodal Foundation Models

William McNamara • April 2, 2023

A primer on foundation models: what they are, how they've evolved, and where they're going.

The success of foundation models such as BERT, GPT-3, CLIP, and Codex has generated increased interest in models that combine vision and language modalities. These hybrid vision-language models have demonstrated impressive capabilities in challenging tasks, including image captioning, image generation, and visual question answering. More recently, a new paradigm has emerged: video foundation models, which apply the same principles to learn from video data.

This blog post provides an overview of foundation models, large language and vision-language models, and video foundation models. I'll review the architecture of foundation models as well as their training, fine-tuning paradigm, and scaling laws. Additionally, I'll review how vision-language models combine the power of computer vision and natural language processing and how they are being used to solve complex problems. Finally, I'll introduce video foundation models and how they are revolutionizing the understanding and analysis of video data.

Intro to Foundation Models


A foundation model is a type of machine learning model that learns from a wide range of data using self-supervision at scale. The idea is to create a model that can be used for many different tasks. By training on lots of data, the model can learn the general patterns in the data. When the model is used for a specific task, it can use this knowledge to quickly adapt.

Foundation models use deep neural networks, which have been popular since 2012, and self-supervised learning, which has been around for almost as long. Recent improvements in both areas have allowed for the creation of larger and more complex models. These models are trained on massive amounts of data, often without explicit labels.

The result is a model that learns a wide range of patterns and relationships, which can then be applied to many tasks. This has led to significant improvements in natural language processing, computer vision, and multimodal AI. With foundation models, we can create one model that serves many tasks rather than building a separate model for each, saving time and resources and speeding up progress in many fields.

Transfer Learning


Traditional machine learning (ML) models are typically trained from scratch and require large, domain-specific datasets to perform well. If you only have a small amount of data, however, you can leverage transfer learning. The idea of transfer learning is to take the "knowledge" learned from one task and apply it to another, so you don't need as much data as you would if you trained from scratch. For deep neural networks, pre-training is the dominant approach to transfer learning: you train the model on an original task (e.g., detecting cars on the street) and fine-tune it on a downstream task of interest (e.g., detecting a black Tesla Model 3).

As you can imagine, this is a very useful mechanism for computer vision. Most commonly, a practitioner will take a model pre-trained on ImageNet, keep most of its layers, and replace the top few layers with newly learned weights. Alternatively, the model can be fine-tuned end-to-end. Some of the most popular pre-trained models for computer vision tasks include AlexNet, ResNet, MobileNet, Inception, EfficientNet, and YOLO.
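As a rough sketch of the "keep most layers, replace the top" recipe (assuming PyTorch and torchvision; the exact weights argument varies by torchvision version), you might load an ImageNet-pretrained ResNet, freeze its backbone, and train only a new classification head for the downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 as the backbone
# (older torchvision versions use pretrained=True instead of the weights enum).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained layers so only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a head for our task
# (e.g. 2 classes: "black Tesla Model 3" vs. "other").
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on a stand-in batch of images and labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Unfreezing some or all of the backbone afterwards gives the end-to-end fine-tuning variant mentioned above.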

In natural language processing (NLP), pre-training was initially limited to just the first step: word embeddings. The input to a language model is words, and the simplest way to encode a word as a vector is one-hot encoding over the vocabulary. From a large corpus you can then learn an embedding matrix that maps each word into a real-valued vector space of much lower dimension (typically a few hundred dimensions). In principle, those dimensions correspond to semantic notions.


As an example, consider Word2Vec. It looks at which words frequently co-occur, and its training objective encourages words that appear in similar contexts to have similar embeddings (high cosine similarity). When you embed the words "king," "man," and "woman," you can do vector arithmetic (king − man + woman) and land close to the word "queen" in this embedding space.
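A hedged illustration of that vector arithmetic, assuming the gensim library and its downloadable pretrained Google News word2vec vectors (the model name comes from gensim's catalog; exact scores will vary):

```python
import gensim.downloader as api

# Download pretrained word2vec vectors (a large one-time download).
vectors = api.load("word2vec-google-news-300")

# Vector arithmetic: king - man + woman should land near "queen".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically something like [('queen', 0.71)]
```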

Seeing more context helps embed words correctly, because a word can play different roles in a sentence depending on its context, and contextual embeddings improve accuracy on downstream tasks. In the last few years several models, including ELMo, ULMFiT, and GPT, have empirically demonstrated how language modeling can be used for pre-training. All three employed pre-trained language models to achieve state-of-the-art results on a diverse range of NLP tasks, including text classification, question answering, natural language inference, coreference resolution, sequence labeling, and many others.

Transformers: The Underlying Architecture For Foundation Models


Prior to Transformers, the state of the art in NLP was based on recurrent neural networks (RNNs), such as LSTMs and the widely-used Seq2Seq architecture, which processed data sequentially – one word at a time, in the order that the words appeared.

The innovation delivered by Transformers is parallelized language processing: all the tokens in a given body of text are analyzed simultaneously rather than in sequence. Transformers rely on a mechanism known as attention to support this parallelization. Attention enables the model to consider relationships between words even when they are far apart in a text, and to determine which words and phrases in a passage are most important to pay attention to.
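As a minimal sketch of the attention computation itself (standard scaled dot-product self-attention, not any particular model's implementation): every token's query is compared against every other token's key in one matrix multiplication, which is what makes the processing parallel.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_model). Every token attends to every other token in parallel."""
    d_k = q.size(-1)
    # Similarity between each query and every key, scaled for numerical stability.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # how strongly each token attends to each other token
    return weights @ v                              # weighted sum of value vectors

# Toy example: a "sentence" of 5 tokens with 16-dimensional representations.
x = torch.randn(1, 5, 16)
out = scaled_dot_product_attention(x, x, x)   # self-attention: q, k, v come from the same tokens
print(out.shape)  # torch.Size([1, 5, 16])
```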

Parallelization also makes transformers much more computationally efficient than RNNs, allowing them to be trained on larger datasets and built with more parameters. Today's transformer models are characterized by their massive size.

Vision Transformers


Convolutional Neural Networks have been the dominant architecture in the field of computer vision. However, given the success of Transformers in NLP, researchers started adapting this architecture to image data. Enter the Vision Transformer (ViT) architecture, which applies the encoder block of the Transformer architecture to the image classification problem.


The idea is to split an image into patches and provide the sequence of linear embeddings of these patches as input to a Transformer; the image patches are treated the same way tokens are treated in the NLP setting. The architecture includes a stem that patches the image, a body based on the multi-layer Transformer encoder, and a multi-layer perceptron (MLP) head that transforms the global representation into the output label. The end result is that ViT matches or exceeds state-of-the-art results on many image classification datasets while being relatively inexpensive to pre-train.
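A small sketch of the patching stem, assuming PyTorch and ViT-Base-style hyperparameters (16×16 patches, 768-dimensional embeddings); a strided convolution is a common way to split and project the patches in one step:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Non-overlapping 16x16 patches, each linearly projected to a 768-dim embedding.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

patches = patch_embed(image)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)    # (1, 196, 768): a sequence of 196 patch tokens

# These tokens (plus a class token and position embeddings) feed the Transformer encoder.
print(tokens.shape)
```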

But ViTs aren't without their problems. One significant issue is that they struggle with high-resolution images: the computational cost of global self-attention grows rapidly with image size. Additionally, the fixed-scale tokens in ViTs are poorly suited to tasks that involve visual elements of varying sizes.

Transformer Variants


A flurry of research followed the original Transformer architecture, much of it enhancing the standard design to address the shortcomings mentioned above.


Swin Transformers have proven very useful in this regard, serving as general-purpose backbones for a wide range of tasks. The Swin Transformer introduced two key concepts: hierarchical feature maps and shifted window attention.

  1. The model builds hierarchical feature maps, which make it compatible with advanced techniques for dense prediction. It achieves linear computational complexity by computing self-attention locally within non-overlapping windows that partition the image. This makes Swin Transformers a good backbone for a variety of vision tasks.
  2. Shifting the windows between consecutive layers bridges the windows of the preceding layer, which enhances modeling power. The strategy is also efficient in terms of real-world latency: all query patches within a window share the same key set, making memory access in hardware easier. (A minimal window-partitioning sketch follows this list.)
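Here is the window-partitioning sketch referenced above, assuming PyTorch; shapes follow the commonly used Swin configuration (7×7 windows), and the shifted variant simply rolls the feature map before partitioning:

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C). Self-attention is then computed
    independently inside each window, giving linear complexity in image size."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Toy feature map: 56x56 spatial resolution, 96 channels, 7x7 windows.
feature_map = torch.randn(1, 56, 56, 96)
windows = window_partition(feature_map, window_size=7)
print(windows.shape)  # torch.Size([64, 7, 7, 96]): 8 x 8 = 64 windows
```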

Perceiver is another Transformer variant recently created by DeepMind that takes inspiration from biological systems. It uses attention-based principles to process various types of input, including images, videos, audio, and point clouds. It can also handle combinations of multiple types of input without relying on specific assumptions about the domain.


The Perceiver architecture introduces a small set of latent units that form an attention bottleneck. This sidesteps the cost of all-to-all attention and allows for very deep models. It attends to the most relevant inputs, informed by previous steps. However, in multimodal contexts, it is important to distinguish input from one modality or another. To compensate for the lack of explicit structure, the model associates position- and modality-specific features with every input element, similar to the labeled-line strategy used in biological neural networks.
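A hedged sketch of the latent bottleneck idea (an illustration of the principle in PyTorch, not DeepMind's implementation): a small set of learned latent vectors cross-attends to an arbitrarily long input array, so cost grows linearly with input size rather than quadratically.

```python
import torch
import torch.nn as nn

embed_dim, num_latents, num_inputs = 256, 64, 50_000   # e.g. 50k pixels, points, or audio samples

# A small set of learned latents forms the attention bottleneck.
latents = nn.Parameter(torch.randn(num_latents, embed_dim))

# Cross-attention: queries come from the latents, keys/values from the raw inputs.
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

inputs = torch.randn(1, num_inputs, embed_dim)   # inputs already tagged with position/modality features
queries = latents.unsqueeze(0)                   # (1, 64, 256)

compressed, _ = cross_attn(query=queries, key=inputs, value=inputs)
print(compressed.shape)  # torch.Size([1, 64, 256]): 50k inputs distilled into 64 latent tokens
```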

Large Language Models


Following the original Transformer paper, a flurry of innovation occurred as leading AI researchers built upon this foundational breakthrough - starting with the NLP domain.

GPT and GPT-2 came out a few years ago. The name stands for "generative pre-trained Transformer." They are decoder-only models that use masked (causal) self-attention: at any point in the output sequence, the model can only attend to the input sequence vectors that came before that point. While GPT embeddings can also be used for classification, the GPT approach is at the core of today's most well-known LLMs, such as ChatGPT.
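A small sketch of what masked (causal) self-attention means in practice, assuming PyTorch: attention scores for future positions are set to negative infinity before the softmax, so each position can only look backwards.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16
x = torch.randn(1, seq_len, d_model)        # token representations

# Causal mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = x @ x.transpose(-2, -1) / d_model ** 0.5
scores = scores.masked_fill(~causal_mask, float("-inf"))  # block attention to future tokens
weights = F.softmax(scores, dim=-1)

print(weights[0])  # upper triangle is all zeros: no peeking ahead
```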

GPT-2 was trained on text from 8 million web pages, and the largest model has 1.5 billion parameters. The training task is simply predicting the next word in all of this web text. The authors found that it works increasingly well as the number of parameters grows.



BERT (Bidirectional Encoder Representations from Transformers) came out around the same time. With 110 million parameters, it is an encoder-only Transformer designed for predictive modeling tasks, and it introduced masked-language modeling: during training, BERT masks out random words in a sequence and has to predict the masked words.
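A hedged example of masked-language modeling at inference time, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (exact predictions will vary):

```python
from transformers import pipeline

# Fill-mask pipeline: BERT predicts the most likely token for the [MASK] position.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Foundation models are trained on [MASK] amounts of data."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Typically predicts words like "large", "huge", "massive".
```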

T5 (Text-to-Text Transfer Transformer) came out in 2020. The input and output are both text strings, so you can specify the task the model is supposed to perform in the input itself. T5 has an encoder-decoder architecture and was trained on the C4 dataset (Colossal Clean Crawled Corpus), which is roughly 100x larger than Wikipedia. Its largest version has around 11 billion parameters.
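Because input and output are both text, the task is specified in the prompt itself. A hedged example, assuming Hugging Face transformers and the released t5-small checkpoint (which also requires the sentencepiece package):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task ("translate English to German") is simply part of the input string.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Das Haus ist wunderbar."
```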


The Rise of Large Vision-Language Models


Thanks to the Vision Transformer architecture, there has been increased interest in models that combine vision and language modalities. These hybrid vision-language models have demonstrated impressive capabilities in challenging tasks such as image captioning, image generation, and visual question answering. Typically, they consist of three key elements: an image encoder, a text encoder, and a strategy to fuse information from the two encoders. I'll review here some of the most well-known models in vision-language model research over the past two years.

In 2021, OpenAI introduced CLIP (Contrastive Language–Image Pre-training). CLIP was trained on 400 million image-text pairs crawled from the internet. It encodes text with a Transformer, encodes images with a Vision Transformer, and applies contrastive learning to train the model: contrastive training matches correct image-text pairs using cosine similarity.
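A sketch of the symmetric contrastive objective, as an illustration of the idea rather than OpenAI's code: embeddings of matching image-text pairs are pulled together, while all other pairings in the batch are pushed apart.

```python
import torch
import torch.nn.functional as F

batch_size, embed_dim = 8, 512

# Stand-ins for the outputs of the image encoder and the text encoder.
image_embeds = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
text_embeds = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)

temperature = 0.07
# Cosine similarity of every image with every text in the batch.
logits = image_embeds @ text_embeds.T / temperature   # (8, 8)

# The i-th image matches the i-th text, so the diagonal holds the correct pairs.
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```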


With this powerful trained model, you can map images and text into a shared embedding space, even for unseen data. There are two ways to use it for classification. One is a "linear probe": train a simple logistic regression model on top of the features CLIP outputs. The other is "zero-shot": encode all the candidate text labels and compare them to the encoded image. The linear probe approach typically performs slightly better.
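A hedged sketch of the zero-shot route using Hugging Face's CLIP wrappers (model name from the public openai/clip-vit-base-patch32 release; the image path is hypothetical): each candidate label is written as a sentence, encoded, and compared with the image embedding.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car.jpg")   # hypothetical local image
labels = ["a photo of a car", "a photo of a dog", "a photo of a cat"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```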


To clarify, CLIP does not map directly from image to text or vice versa; it works through embeddings. That embedding space, however, is extremely useful for searching across modalities.

CoCa, or Contrastive Captioner, is another foundation model, from Google, that combines contrastive learning (as in CLIP) with generative learning (as in SimVLM). It uses a modified encoder-decoder architecture trained with both a contrastive loss and a captioning loss. This allows it to learn global representations from unimodal image and text embeddings, as well as fine-grained region-level features from the multimodal decoder.

In late 2022, DeepMind introduced a family of visual language models called Flamingo. These models can perform many different tasks given just a few examples of input and output. They have two parts: a vision model that understands visual scenes, and a language model that handles reasoning, with the pre-trained knowledge of both working together. Flamingo models can also ingest high-resolution images or videos thanks to a Perceiver-based module (discussed in the section on Transformer variants) that takes a large number of visual input features and produces a small number of visual tokens.


Thanks to these new architectural innovations, the Flamingo models can connect strong pre-trained models for vision and for language, handle sequences of mixed visual and text data, and easily use images and videos as input. The Flamingo-80B, the biggest version with 80 billion parameters, set a new record in few-shot learning for many tasks that involve understanding language, images, and videos.

Microsoft, Google, and OpenAI released their own large vision-language models over the past few weeks, further propelling the trend toward multimodal AI.

  • Microsoft released Kosmos-1, a multimodal language model that can perceive different modalities, learn in context, and follow instructions. The model generates text based on the previous context and handles text and other modalities using a Transformer-based causal language model. It was trained on various types of data and has performed well in different scenarios, including understanding and generating language, recognizing images, and answering questions based on images.
  • Google's PaLM-E is an embodied multimodal language model that can handle various reasoning tasks based on observations from different sources and using different embodiments, drawing on internet-scale language, vision, and visual-language data. The biggest PaLM-E model, PaLM-E-562B, has 562 billion parameters and can perform zero-shot reasoning, such as telling jokes about an image or carrying out robot tasks involving perception, dialogue, and planning.
  • Lastly, OpenAI's GPT-4 is a large multimodal model capable of processing image and text inputs and producing text outputs. It scored in the 90th percentile on a simulated bar exam and in the 99th percentile (with vision) on the Biology Olympiad.

Conclusion


Foundation models are becoming multimodal. As foundation models increasingly serve as the basis of AI-powered software, developers will more and more often start with a pre-trained foundation model and fine-tune it on narrow tasks. However, the most difficult situations for these models remain the "long-tail" events they have not seen before, and these long-tail events will only become harder to handle in multimodal settings.




