A New Scaling Paradigm from Google

William McNamara • January 17, 2023

Google's DeepMind researchers have proposed a new framework for scaling AI models that could be a game changer for the field.


Moore’s Law


Generally, scaling laws predict a continued improvement in model quality as the computational budget is scaled up (e.g., bigger models or more data). OpenAI investigated the scaling laws of Transformer language models a few years ago and showed that these laws are predictive of future performance. Their findings showed that performance is a function of dataset size, number of parameters, and amount of compute.


More specifically, the experiments revealed that the test loss follows a power law with respect to the model size, dataset size, and compute used for training, spanning trends over seven orders of magnitude. This suggests that the relationships between these variables can be described by simple equations, which can be used to optimize training configurations for large language models. Additionally, the experiments indicate that other architectural details, such as network width or depth, have minimal effects within a wide range.
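
To make the power-law relationship concrete, here is a minimal Python sketch of the loss-versus-scale curves described above. The exponents and constants are approximate values as reported in the OpenAI paper and are included purely for illustration; treat them as assumptions rather than exact figures.

```python
# A minimal sketch of the power-law scaling described above.
# The exponents/constants are approximate values from the OpenAI scaling-laws
# paper (Kaplan et al., 2020) and are used here for illustration only.


def loss_from_params(n_params: float, alpha_n: float = 0.076, n_c: float = 8.8e13) -> float:
    """Predicted test loss as a function of model size: L(N) = (N_c / N)^alpha_N."""
    return (n_c / n_params) ** alpha_n


def loss_from_data(n_tokens: float, alpha_d: float = 0.095, d_c: float = 5.4e13) -> float:
    """Predicted test loss as a function of dataset size: L(D) = (D_c / D)^alpha_D."""
    return (d_c / n_tokens) ** alpha_d


if __name__ == "__main__":
    # Each 10x increase in parameters lowers the predicted loss by a constant
    # factor, i.e. the curve is a straight line on a log-log plot.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```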

Based on the experiments and derived equations, larger models are significantly more sample efficient. In other words, optimal compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.




Since the publication of that scaling-laws paper, there has been significant interest in scaling up language models. GPT-3, the state of the art in 2020, was roughly 100 times larger than GPT-2, with 175 billion parameters. Thanks to its size, GPT-3 exhibited unprecedented few-shot and zero-shot learning capabilities: the more examples you give the model, the better its performance, and the larger the model, the better that performance gets.


DeepMind's Discovery


Last year, DeepMind proposed the "Chinchilla" scaling laws for training compute-optimal models. These are more accurate scaling laws than the original formulation proposed by OpenAI.

  • They trained over 400 language models ranging from 70 million to 16 billion parameters on 5 billion to 500 billion tokens. From these runs they derived formulas that predict the optimal amount of training data for a given number of model parameters, and concluded that most large language models are "undertrained": they haven't seen enough data for their size.
  • To verify this, they compared against Gopher, another DeepMind model with 280 billion parameters trained on 300 billion tokens. With Chinchilla, they kept the same compute budget but reduced the parameter count to 70 billion while increasing the training data roughly fourfold to 1.4 trillion tokens. Despite having fewer parameters, Chinchilla exceeded Gopher's performance, suggesting that model size and the number of training tokens should be scaled in roughly equal proportion (a rough compute-optimal split is sketched below).
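
The Chinchilla result is often summarized with two rough approximations: training compute is about 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is on the order of 20 training tokens per parameter. The Python sketch below uses only those two approximations, not the paper's exact fitted formulas, to split a FLOP budget into a parameter count and token count; the numbers it produces are illustrative.

```python
import math

# Rough Chinchilla-style compute-optimal allocation, using two common
# approximations rather than the paper's exact fitted formulas:
#   * training compute  C ~= 6 * N * D   (N = parameters, D = training tokens)
#   * compute-optimal   D ~= 20 * N      (~20 tokens per parameter)

TOKENS_PER_PARAM = 20  # approximate ratio implied by the Chinchilla results


def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Split a FLOP budget into a (parameters, tokens) pair, Chinchilla-style."""
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # A Gopher-scale budget: 280B parameters trained on 300B tokens.
    gopher_flops = 6 * 280e9 * 300e9
    n, d = compute_optimal_allocation(gopher_flops)
    print(f"~{n / 1e9:.0f}B params trained on ~{d / 1e12:.1f}T tokens")
    # Prints roughly 65B params and 1.3T tokens, close to Chinchilla's 70B / 1.4T.
```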


Google's "Scaling Vision Transformers" shows that scaling laws apply not only to NLP tasks but also to computer vision tasks. The authors conducted experiments with Vision Transformer models ranging from 5 million to 2 billion parameters, datasets ranging from 1 million to 3 billion training images, and compute budgets ranging from less than one TPUv3 core-day to more than 10,000 core-days. Their findings show that scaling total compute and model size together is effective: when additional compute is available, increasing the model's size is optimal.



Emergent Abilities of Large Language Models


Google recently published an important paper titled "Emergent Abilities of Large Language Models," which explores the emergent abilities that are present in larger models but not in smaller ones. The paper examines research that analyzes the influence of scale, comparing models of different sizes trained with varying computational resources. For many tasks, the behavior of the model either predictably grows with scale or surges unpredictably from random performance to above-random at a specific scale threshold (for instance, more than 70 billion parameters).


Since the formal and empirical analysis of scaling laws, many more large language models (LLMs) have been released. These models have achieved state-of-the-art few-shot results on many tasks by scaling up model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Notable examples include Megatron-LM (8.3B params), GLaM (1.2T sparsely activated params), LaMDA (137B params), Megatron-Turing NLG (530B params), and PaLM (540B params).


These exciting discoveries will continue to push the field forward and produce ever more impressive results. I'm especially excited to see how this scaling will better equip large models for scientific tasks like image labeling, genome sequencing, and protein folding.

