A New Scaling Paradigm from Google
Google's DeepMind researchers have proposed a new framework for scaling AI models that could be a game changer for the field.
Moore’s Law
Generally, scaling laws predict continued improvement in model quality as we scale up the computational budget (e.g., bigger models or more data). OpenAI investigated the scaling laws of Transformer language models a few years ago and showed that scaling laws are predictive of future performance. Their findings showed that performance is a function of dataset size, number of parameters, and amount of compute.
More specifically, the experiments revealed that the test loss follows a power law with respect to the model size, dataset size, and compute used for training, spanning trends over seven orders of magnitude. This suggests that the relationships between these variables can be described by simple equations, which can be used to optimize training configurations for large language models. Additionally, the experiments indicate that other architectural details, such as network width or depth, have minimal effects within a wide range.
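As a rough sketch of those relationships, the fitted curves take a simple power-law form like the one below. The exponents are the approximate values reported by OpenAI; the scale constants are fitted quantities and are shown here only symbolically, so treat this as illustrative rather than exact.

```latex
% Approximate power-law fits for test loss from the OpenAI scaling-laws paper
% (N = non-embedding parameters, D = dataset size in tokens, C = training compute;
%  N_c, D_c, C_c are fitted scale constants, shown only symbolically).
\begin{align}
  L(N) &\approx \left(\frac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076 \\
  L(D) &\approx \left(\frac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095 \\
  L(C) &\approx \left(\frac{C_c}{C}\right)^{\alpha_C}, & \alpha_C &\approx 0.050
\end{align}
```

Each fit holds when the other two quantities are not the bottleneck, which is what makes these curves useful for planning training runs.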
Based on the experiments and derived equations, larger models are significantly more sample efficient. In other words, optimal compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Since the publication of that Scaling Laws paper, there has been significant interest in scaling up language models. GPT-3 was one of the state-of-the-art models in 2020. At 175 billion parameters, it was roughly 100 times larger than its predecessor GPT-2. Due to its size, GPT-3 exhibited unprecedented few-shot and zero-shot capabilities: the more in-context examples you give the model, the better it performs, and the larger the model, the stronger this effect becomes.
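To make the few-shot idea concrete, here is a minimal sketch of an in-context prompt. The task and examples are just illustrations (the English-to-French pattern is the classic demo), and no particular model API is assumed.

```python
# A toy few-shot prompt: the "training examples" live in the context window,
# not in any gradient update. A large model is expected to continue the pattern.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

# Zero-shot would be the same task description with no worked examples at all.
print(few_shot_prompt)
```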
DeepMind's Discovery
Last year, DeepMind proposed the "Chinchilla" scaling laws for training compute-optimal models. This is a more accurate scaling-law formula than the one originally proposed by OpenAI.
- They trained over 400 language models with 70 million to 16 billion parameters on 5 billion to 500 billion tokens. From these runs they derived formulas for the optimal model size and training-set size given a compute budget, and found that most large language models are "undertrained," meaning they haven't seen enough data for their size.
- To verify this, they compared against Gopher, a 280-billion-parameter model trained on 300 billion tokens. With Chinchilla, they kept a similar compute budget but cut the parameter count to 70 billion and increased the training data more than fourfold to 1.4 trillion tokens. Despite having far fewer parameters, Chinchilla exceeded Gopher's performance, suggesting that model size and training tokens are equally important (a rough calculation of this trade-off follows the list).
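As a back-of-the-envelope sketch of that trade-off: assuming the commonly quoted approximations that training compute is about 6 · N · D FLOPs and that compute-optimal training uses roughly 20 tokens per parameter (rules of thumb, not the paper's exact fitted constants), you can size a model for a given budget like this:

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Rough compute-optimal (parameters, training tokens) for a FLOP budget.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher-scale budget: ~280B parameters trained on ~300B tokens.
gopher_budget = 6.0 * 280e9 * 300e9
n, d = chinchilla_optimal(gopher_budget)
print(f"~{n / 1e9:.0f}B parameters on ~{d / 1e12:.1f}T tokens")
```

Plugging in a Gopher-sized budget gives roughly 65B parameters and about 1.3T tokens, close to the 70B / 1.4T configuration DeepMind actually chose for Chinchilla.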
"Scaling Vision Transformers," another paper from Google, shows that scaling laws apply not only to NLP but also to computer vision. The authors ran experiments with Vision Transformer models ranging from 5 million to 2 billion parameters, datasets ranging from 1 million to 3 billion training images, and compute budgets ranging from less than one TPUv3 core-day to more than 10,000 core-days. Their findings show that scaling total compute and model size together is effective: when additional compute is available, increasing the model's size is optimal.
Emergent Abilities of Large Language Models
Google recently published an important paper titled "Emergent Abilities of Large Language Models," which explores abilities that are present in larger models but not in smaller ones. The paper surveys research that analyzes the influence of scale, comparing models of different sizes trained with varying amounts of compute. For many tasks, performance either improves predictably with scale or surges unpredictably from near-random to well above random once a specific scale threshold is crossed (for instance, more than 70 billion parameters).
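Purely as a schematic illustration of that distinction (the numbers below are hypothetical, not results from the paper): a "smooth" capability improves gradually with model size, while an "emergent" one stays near chance until a scale threshold and then jumps well above it.

```python
def smooth_task(params_b: float) -> float:
    """Hypothetical accuracy that climbs gradually with model size (billions of parameters)."""
    return 100.0 * params_b**0.1 / (params_b**0.1 + 1.0)

def emergent_task(params_b: float, threshold_b: float = 70.0) -> float:
    """Hypothetical accuracy stuck near 25% (chance on a 4-way task) until a scale threshold."""
    return 85.0 if params_b > threshold_b else 25.0

for size_b in [1, 10, 70, 100, 500]:
    print(f"{size_b:>4}B params | smooth: {smooth_task(size_b):5.1f}% | emergent: {emergent_task(size_b):5.1f}%")
```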

Since the formal and empirical analysis of scaling laws, many more large language models (LLMs) have been released. These models have achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Notable examples include Megatron-LM (8.3B params), GLaM (64B params), LaMDA (137B params), Megatron-Turing NLG (530B params), and PaLM (540B params).
These discoveries will continue to push the field forward and yield ever more impressive results. I'm especially excited to see how this scaling will better equip large models for scientific tasks like image labeling, genome sequencing, and protein folding.