The Core Primitive of AI: Experimental Learning
How LLMs can effectively extrapolate beyond their training data
There is this idea that LLMs can’t generalize beyond their training data. This can be interpreted in a lot of ways, and in some sense it’s trivially true, but it’s often incorrectly used as an argument that LLMs shouldn’t be treated as an intelligence, only as an incredibly advanced data retrieval tool. This misunderstands the depth of the training data.
The training data we feed to core LLMs spans a massive spectrum of human scientific and reasoning processes, and a model trained on it can speed-run through all of our discoveries, forming latent connections and learning higher-order properties of how humans reason about the world. So while it may be true we can’t ‘generalize’ beyond the training data, the training data itself embeds within it estimates of the base scientific primitives we’ve developed.
Given that an LLM is learning some function, what do we think that function looks like? We know it’s an intractably complex mess of matrices, but what class of statistical functions does it ascend from?
Our Corpus
In a previous piece, I argued that the corpus of human thought embeds within it sufficient data to train an artificial general intelligence. My argument then was that the apparent magic that happens during training and inference, which is fundamentally “just” predicting the next token, is due to a model learning the data generating process of human reasoning.
We know that, in theory, any non-parametric algorithm will learn the true distribution of the underlying data given the right structure, data, and compute. There is a fundamental principle from statistical theory behind this, the Glivenko-Cantelli Theorem. In simple terms, this theorem states that, as the sample size grows, the empirical distribution function of the sample converges uniformly to the true underlying distribution function.
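To make the theorem concrete, here’s a minimal simulation (nothing LLM-specific, just a standard normal distribution and NumPy/SciPy): as the sample size grows, the largest gap between the empirical CDF and the true CDF shrinks toward zero.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_gap(sample):
    """Largest pointwise gap between the empirical CDF and the true N(0,1) CDF."""
    x = np.sort(sample)
    n = len(x)
    ecdf_hi = np.arange(1, n + 1) / n      # ECDF just after each data point
    ecdf_lo = np.arange(0, n) / n          # ECDF just before each data point
    true_cdf = norm.cdf(x)
    return max(np.abs(ecdf_hi - true_cdf).max(), np.abs(ecdf_lo - true_cdf).max())

for n in [100, 1_000, 10_000, 100_000]:
    print(n, round(sup_gap(rng.standard_normal(n)), 4))
# The sup-distance shrinks toward 0 as n grows -- Glivenko-Cantelli in miniature.
```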
Applying this to LLMs, think of the entire corpus of human language as a 'universe' of knowledge, with its own true, but unknown, distribution. Each piece of text that the LLM is trained on can be seen as a sample from this universe. The training process, then, involves the LLM constructing an empirical distribution based on these samples.
Experimental Data
Over the last year, I’ve come to think about what this means in a more refined way: Aside from simply saying we are learning the data generating process of reasoning, we can think of our training data as being a composition of every experiment ever recorded. By learning this distribution, we learn estimates across the vastness of human experimentation.
Suppose you are in ancient Crete, and somehow get access to a GPT-4-style model trained on data from the time (let’s assume a large number of documents were available). It would still be pretty useful: you could ask it about facts embedded within its training data, or folk wisdom on farming, known remedies, or recipes. But the more advanced questions you ask may have your LLM advising you to make sacrifices to the gods.
A modern LLM won’t advise us to sacrifice a cow to defeat our enemies, and the reason it won’t, obviously, is because that’s not a common suggestion in our corpus. The reason it isn’t a common suggestion in our corpus, though, is due to our scientific development as a species.
The statistical formalizations humans have built over the centuries have ultimately been in service of how we communicate with one another to run experiments. The human corpus is embedded with billions of observational experiments. There are the literal experiments, like John Snow’s cholera investigation, and there are also meta-experiments, where we all observed Snow’s work and reacted to its occurrence.
The corpus of our textual data is a spiraled tower of portals ascending throughout time, with every artifact representing an incremental piece of information for a model to learn cause-and-effect, and the counterfactual of an event taking place vs. not taking place, forming complex estimates of the joint distribution of any question that can be posed within text.
There are implicit experiments hidden in every recorded interaction we have with reality. Even the boring, trivial documents in the training corpus are experiments: any case where we interact with the world in some capacity and report the outcome in text. The fundamental structure of an experiment only requires that we act on the world in some way, and that the variation from that act gets recorded.
The statistical motivation of this involves viewing contexts as stemming from a hierarchy of joint random variables, each representing different levels of abstraction within the data. In this framework, contexts that are thematically similar, whether in an explicit or a more abstract way, form clusters or mixtures within a hierarchical model.
For example, if we consider two contexts, say Ci and Cj, which belong to the same cluster in the hierarchy, then the model’s responses (or outcomes) Oi and Oj are expected to be similar due to the shared statistical characteristics of their cluster.
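Here’s a toy generative sketch of that hierarchy (the dimensions, cluster structure, and linear outcome map are all invented for illustration): contexts drawn from the same latent cluster produce outcomes that vary far less among themselves than outcomes drawn across clusters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy hierarchical generative process:
#   cluster k  ->  context C_i  ->  outcome O_i
# Contexts from the same cluster share a latent "theme" vector,
# so their outcomes are statistically similar.
n_clusters, dim = 3, 8
cluster_themes = rng.normal(size=(n_clusters, dim))           # latent factors
outcome_weights = rng.normal(size=dim)                        # shared outcome map

def sample_context_and_outcome(k):
    context = cluster_themes[k] + 0.3 * rng.normal(size=dim)  # C_i ~ cluster k
    outcome = context @ outcome_weights + 0.1 * rng.normal()  # O_i | C_i
    return context, outcome

# Outcomes within a cluster vary far less than outcomes across clusters.
outcomes = {k: [sample_context_and_outcome(k)[1] for _ in range(200)]
            for k in range(n_clusters)}
for k, o in outcomes.items():
    print(f"cluster {k}: mean O = {np.mean(o):+.2f}, sd = {np.std(o):.2f}")
print("sd across all clusters:", round(np.std(sum(outcomes.values(), [])), 2))
```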
While it’s true that our training data is text, this text is merely the interface overlaying the actual information of our world. Unlike mathematical notation or matrices in your computer, textual contexts don’t immediately look and act like the random variables we are used to dealing with, but there is no reason why they shouldn’t be treated as random variables.
You can loosely imagine similar documents, or contexts, as being drawn from the same joint distribution. In practice the taxonomy of these distributions won’t be discrete; each one is drawn from a sort of hierarchical distribution, with common latent factors giving way to child nodes that encapsulate more object-level mixtures.
It’s true that the continuous nature of text makes it far more awkward to deal with contexts as random variables. The fact that we can learn these functions, though, shows that these distributions aren’t intractable; they’re just too messy and awkward to write down as functions to estimate. Once we realize that contexts can be viewed as random variables, we can take it another step and treat each segment of the training data as an experiment, where the provided context acts as a 'treatment' and the subsequent text as the 'outcome'.
Imagine every piece of text as a micro-experiment. The context, or Ci in our notation, is the treatment, and the following text, Oi, is the outcome. The model learns to map these contexts to outcomes, f(Ci) → Oi, essentially learning from a multitude of 'experiments'.
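Here’s a deliberately tiny sketch of that framing (the corpus, contexts, and outcomes below are all made up): aggregating every micro-experiment that shares a context gives you an empirical conditional distribution over outcomes, the crude, discrete ancestor of what an LLM learns over continuous text.

```python
from collections import Counter, defaultdict

# A toy corpus of (context, outcome) "micro-experiments".
# Real training data is continuous text; here contexts are hand-made strings.
corpus = [
    ("patient reports fever and stiff neck", "advised urgent evaluation"),
    ("patient reports fever and stiff neck", "referred to hospital"),
    ("patient reports mild seasonal sneezing", "suggested antihistamines"),
    ("patient reports mild seasonal sneezing", "suggested rest and fluids"),
    ("patient reports fever and stiff neck", "advised urgent evaluation"),
]

# "Training" = aggregating every experiment that shares a context,
# yielding an empirical conditional distribution P(O | C).
outcome_counts = defaultdict(Counter)
for context, outcome in corpus:
    outcome_counts[context][outcome] += 1

def f(context):
    """Empirical estimate of the outcome distribution for a given context."""
    counts = outcome_counts[context]
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

print(f("patient reports fever and stiff neck"))
# roughly {'advised urgent evaluation': 0.67, 'referred to hospital': 0.33}
```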
Think of a simple Reddit post detailing a social conflict. The post itself is a treatment. The varied responses it garners are outcomes. Different posts on similar conflicts map to similar responses, showcasing the model's understanding of human interaction dynamics.
Another example: an academic paper and its subsequent rebuttal can be seen as another experiment. The original paper is the treatment; the rebuttal and ensuing discussion are the outcomes. Papers on similar topics tend to generate similar kinds of critiques and discussions, revealing how the model learns from patterns in scientific discourse. We see this plainly in contemporary LLMs, where a model can form critiques of papers it has never seen before.
It’s true that in these cases text is the interface between human and computer, but there is no loss of generality or experimental rigor when we reframe our science from matrices of statistical data to continuous textual documents. When you switch from matrices in memory to text you lose some of the rigor that comes from being able to run formal statistical tests, but text and matrices are both just abstractions over information. Plus, what you may lose from the matrices, you gain from the ability to jointly model every implicit experiment in the whole human corpus.
While it would be intractable for a human to collect every case in our corpus where a specific class of event occurred, and track all forms of responses to it, this happens naturally within the sweet hum of a GPU cluster. If we conceive of our corpus as a vast number of observational experiments, then the training process is like any other experimental statistical model, jointly fitting an outcome variable conditional on a training input.
Of course, as in any statistics problem, many of those estimates will struggle due to sparsity, meaning that the effective sample size for that context is low. We could even have sparsity at the level of scientific primitives, due to us posing a question that requires a more refined philosophy of science to answer. Would an LLM be aware of the pitfalls of trusting scientific research before the replication crisis? Not as much as it is now.
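The back-of-the-envelope version of the sparsity point (standard error of an estimated proportion, worst case p = 0.5):

```python
import math

# Rough arithmetic for why sparse contexts give shaky estimates:
# the standard error of an estimated proportion shrinks like 1/sqrt(n).
for n in [5, 50, 500, 5_000]:
    se = math.sqrt(0.5 * 0.5 / n)          # worst case, p = 0.5
    print(f"n = {n:>5}:  standard error ~ {se:.3f}")
# A context seen 5 times supports only the vaguest estimate;
# one seen 5,000 times pins the conditional distribution down tightly.
```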
What I want to impress here is that, in this framework, the human corpus is a hierarchical network of nodes, where each node is itself some type of structural observational experiment. The higher nodes of the tree represent the latent space of experimental reasoning. Even in cases where we don’t have estimates for a context, because it’s not in the training data, we do have a hierarchical structure that lets the model reason about what information it needs, and how to get it, to construct such an estimate.
Of course, we can’t see this, we can’t write it down, and we can’t estimate it. We don’t have the ability or tooling to write down the full data generating process in the traditional statistical sense. We instead are left trying to speculate as to the actual function that we are learning.
What’s Missing?
I’m not claiming any of this is enough for a model to act upon the world; that would require a motivation to ask questions, to deploy scientific experiments (broadly defined as manipulating reality, including manipulating humans), and to observe the outcomes. I think it’s more accurate to call it the prior of AGI. An LLM trained on the textual corpus represents a high-dimensional prior over all the questions and spaces covered by the generality of human reasoning.
But the fundamental building blocks seem to exist, lacking only a handful of further tricks to glue them together. As clever as attention or transformers are, they don’t hit the same as the original Turing machine or von Neumann architecture, which required a generational talent at the edge of human mathematical ability to conjure into existence.
I don’t view this as a fundamental limitation with the approach. It’s just a bunch of edge-case bullshit, maybe missing a trick or two, or it needs to be carved slightly more into an agent. AI engineers will work these details out incrementally. The days are long but the years are fast.
I think the meta-learning approach of TabPFN (https://arxiv.org/abs/2207.01848) is a step in this direction. It's been trained on synthetic data, sampling over structural causal models and then sampling over data drawn from those models. Then it's trained to perform in-context learning to solve tabular classification problems. So you end up with a Transformer that's basically been trained to do science (make predictions by marginalizing over the space of causal models, weighted by how well they explain your in-context data). You can imagine training such a model, not merely to perform classification, but to tell you whether you should aggregate or segment data, or whether this would be invalid due to Simpson's Paradox or Berkson's Paradox. In other words, you can imagine training a model to make all these judgment calls that are a "necessary evil" in science, statistics, and data science, but in a fully-standardized fashion that avoids the human tendency (and social incentives) to exploit these judgment calls to put our fingers on the scales in favor of our desired conclusions.
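For flavor, here’s a stripped-down sketch of that prior-generation step (this is not TabPFN’s actual code; the function class, graph sampling, and labeling rule are simplified stand-ins): sample a random structural causal model, simulate a dataset from it, and emit a classification task for the meta-learner to solve in context.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_scm_task(n_features=4, n_samples=64):
    """Sample a random linear structural causal model, then a dataset from it.

    A toy stand-in for TabPFN-style prior generation: real priors use richer
    function classes, noise models, and feature transformations.
    """
    # Random DAG via a strictly lower-triangular weight matrix (ancestors -> descendants).
    weights = np.tril(rng.normal(size=(n_features, n_features)), k=-1)
    weights *= rng.random(weights.shape) < 0.5            # sparsify edges

    X = np.zeros((n_samples, n_features))
    for j in range(n_features):
        # Structural equation: X_j = sum of weighted parents + noise.
        X[:, j] = X @ weights[j] + rng.normal(size=n_samples)

    # Label = thresholded function of a randomly chosen causal feature.
    target_feature = rng.integers(n_features)
    y = (X[:, target_feature] > np.median(X[:, target_feature])).astype(int)
    return X, y

# Each call is a fresh "world"; a meta-learner trained across millions of such
# worlds learns, in effect, to marginalize over causal structures in-context.
X, y = sample_scm_task()
print(X.shape, y[:10])
```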
And the above approach doesn't even begin to exploit knowledge about the meaning of each feature in a dataset. One can imagine a model that takes in the names of features, and uses an LLM to weight possible causal models by how much they accord with all of humanity's prior knowledge.
We're bootstrapped. They can click around. You'd best believe the labs aren't putting out the internal tooling that gives these models the ability to experiment. But they can. It's bootstrapped.