Does the Textual Corpus for Large Language Models Have Enough Information to Train an AGI?
Given the size and contents of our textual data, probably.
Is the algorithm for general intelligence embedded within the collective textual corpus of humanity? Probably.
Intro: LLM Consensus
I see two hypotheses emerging for whether large language models (LLMs) are sufficient to progress to general AI. The first is the curve-fitting hypothesis. It says LLMs are glorified curve-fitting machines: while they may appear smart, they're only replaying our wisdom back to us, interpolating information in high-dimensional space, and can never approximate human intelligence.
The second hypothesis, which I believe and call the latent space hypothesis, claims that the textual corpus of humanity has enough information to train a general intelligence. Our textual corpus, which includes humanity's progressive improvement of our scientific heuristics for reasoning about the world, allows an LLM to reconstruct the core primitives of how to learn and infer in our world.
There isn't a consensus on which hypothesis is correct. AI scientists are trying out techniques that seem to be working, but excluding occasional architecture breakthroughs, progress comes from empirical, engineering-driven experiments. GPT-3 basically added more compute and parameters to existing frameworks, and the results have been revolutionary in their power.
Meanwhile, academics who should be paying attention seem stuck in the previous century (or are only just waking up), and everyone else is trying to figure out what's going on in real time on blogs and Twitter. Despite my own belief that LLMs are sufficient to approach general intelligence, there's no clear consensus on how to forecast what the mechanisms behind the current generation of models can achieve.
The lack of agreement among experts on what LLMs should be capable of achieving signals that we're dealing with a lot of uncertainty, and that we collectively lack an understanding of the mechanisms of these models. Most of the arguments I see for LLMs' success are empirical, extrapolating from their observed success. Roon makes a convincing technical argument for why text is an appropriate primitive for information transmission in his piece text as the universal interface.
Still, a lot of serious thinkers don't seem to think LLMs have what it takes to progress to any form of general intelligence. To sample a few of them: Nassim Taleb and François Chollet hold the view that these models don't 'know' anything in a meaningful sense and are simply curve-fitting exercises. From their perspective, functions learned on top of human text don't count as, and can't progress to, general intelligence.
II. Two Hypotheses
LLMs have been built up over decades from humble foundations in mathematical statistics and probability theory, but they have clearly evolved past our ability to simply open up the model and introspect what is happening. Within a classic statistical framework, it's typically possible to inspect the model and its parameters.
Our problem is that there are now things going on within the weights of an LLM that we don’t fully understand. We’re forced to reason about their behavior through analogy to historical models, or through empirically testing their capabilities.
Hypothesis 1: Curve-Fitting Hypothesis
This is also the most common hypothesis I see from academics and influencers who extrapolate from their statistical expertise about the mechanics of what is happening. Technically, the model doesn't know anything; it's just returning the next value that is statistically most likely. This is what happens when we take our mechanical understanding of older-generation NLP models, and of statistics more generally, and apply it to the world of billion-parameter models.
(Of course, it's true that the model is fitting a function to textual data to maximize some likelihood – that's what's happening. The stronger claim is that this precludes it from having any higher-order intelligence or reasoning.)
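For concreteness, the likelihood in question can be written as the standard autoregressive language-modeling objective (the notation here is mine, not something taken from the authors quoted below): the model is trained to assign high probability to each token given the tokens that came before it.

```latex
% Standard autoregressive language-modeling objective (a sketch).
% x_1, ..., x_T is a token sequence; \theta are the model parameters.
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```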
In this world, LLMs are curves that predict based on their in-sample data, a fundamental limitation of the algorithmic approach that rules out what we would consider true human-level intelligence. A passage from a relatively recent argument by Jacob Browning and Yann LeCun, Meta's chief AI scientist, summarizes it:
LLMs have acquired this kind of shallow understanding about everything. A system like GPT-3 is trained by masking the future words in a sentence or passage and forcing the machine to guess what word is most likely, then being corrected for bad guesses. The system eventually gets proficient at guessing the most likely words, making them an effective predictive system.
The issue they raise is that since an LLM is only memorizing facts and, at most, synthesizing them together, it's more of an advanced query engine that simply returns a curated list of facts and answers extracted from our collective writing and problem-solving. If the query engine ends up sounding like a human, it's only because it has read so many of our blog posts and tweets, and the statistical machinery is only talking our words back to us. It's not reasoning on its own.
If we imagine a space of training data, the model will forever live in the confines of this world, where it can’t ever truly ask and answer new questions that aren’t combinations of our existing knowledge base.
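As a deliberately crude illustration of that intuition (a one-dimensional polynomial fit, which says nothing about transformers themselves), here is a sketch of how a fitted curve behaves well inside the range of its training data and can fail badly outside it:

```python
# Toy illustration of the curve-fitting intuition, not a claim about LLMs:
# a curve fit to in-sample data interpolates well but extrapolates poorly.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.shape)

coeffs = np.polyfit(x_train, y_train, deg=7)  # the fitted "curve"

x_inside = 3.0            # inside the training range: interpolation
x_outside = 4 * np.pi     # far outside it: extrapolation
print(np.polyval(coeffs, x_inside), np.sin(x_inside))    # close to the truth
print(np.polyval(coeffs, x_outside), np.sin(x_outside))  # wildly off
```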
This hypothesis doesn’t claim LLMs can’t be extremely useful and powerful, but that these restrictions preclude the ability to progress to a general intelligence, since simply learning labels and generating text is fundamentally missing some key ingredient of true reasoning. From the same article above, they claim: “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”
Hypothesis 2: The Latent Space Hypothesis
The latent space hypothesis asks what is embedded within the textual corpus of humanity, and claims that the complexity in this latent space is sufficient for AGI. It doesn't refute the idea that an LLM is “only fitting a curve”; instead, it claims that this curve can be sufficiently complex to constitute general intelligence.
My intuition here comes from my background doing academic research on bond markets. The price series for the bond-market yield curve are only a few megabytes in size, yet they embed the marginal knowledge and beliefs of every investor about the growth of the US economy and the future of global growth.
Those series are the output of the beliefs and actions of millions of people. While the floating point representation probably won’t capture quite this level of detail, in theory even your personal consumption and retirement choices are included in those time-series. Decades of academic research, statistical modeling, and researcher careers have been spent on incrementally uncovering new insights from this data.
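As a quick back-of-the-envelope check on that size claim (the maturity count and history length below are my own assumptions, chosen only to show the order of magnitude):

```python
# Rough size of a daily yield-curve history (assumed figures, purely
# illustrative; not the author's actual dataset).
maturities = 11            # e.g., 1-month through 30-year points on the curve
observations = 252 * 60    # roughly 60 years of daily trading data
bytes_per_value = 8        # 64-bit floats
size_mb = maturities * observations * bytes_per_value / 1e6
print(f"~{size_mb:.1f} MB")  # on the order of a megabyte or two
```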
This is a few megabytes of financial data, but it contains information on our collective utility functions and beliefs about the economy. So what does the textual corpus of humanity hold?
When we think about text, what we’re really observing is the data generating process of humanity. We are a collection of individual computing machines and in a very literal sense, we are each a function. We take inputs and produce outputs. As a species that has evolved for language, text has always been our universal written interface. Our corpus of text represents the outputs of millions of humans. Embedded within that mess of information is all of our social graphs, the information we observe, and how we have reacted, processed, and written down that information.
It’s not even clear to me that we can reason correctly around the sheer information content within all of this text. It’s so far beyond what any single human could hope to read, or has read. The depth of connections an LLM can make within this space may quickly reach a level we’ve never seen or believed possible.
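To put a rough number on that gap (the reading-speed figures below are my own assumptions, and the ~300 billion tokens is the commonly reported size of GPT-3's training mix):

```python
# Back-of-the-envelope: a lifetime of dedicated reading vs. the amount of
# text GPT-3 was reportedly trained on. All figures are assumptions or
# rough public numbers, used only to compare scale.
words_per_minute = 250
hours_per_day = 8
years = 60
lifetime_words = words_per_minute * 60 * hours_per_day * 365 * years
gpt3_tokens = 300e9  # ~300B training tokens, as commonly reported

print(f"lifetime reading: ~{lifetime_words / 1e9:.1f}B words")
print(f"training corpus is roughly {gpt3_tokens / lifetime_words:.0f}x larger")
```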
Scientific Generating Functions
The latent space hypothesis claims that the functions that have produced all of our textual output are themselves embedded within this text and could be learned by a properly specified LLM with sufficient compute. The training data we have embeds within it the latent space of human reasoning. This means it has access to the same self-improving heuristics of the scientific method that we have discovered and written down in our attempt to understand the world.
When we talk about text, yes, we're talking about Stack Overflow answers and code snippets on blogs. But we're also talking about all of the scientific discovery, experimentation, and records of our learning as they have played out over centuries. Terabytes of textual data. These are heuristics we've never formally modeled before, because they have always lived within human text. They are usually intractable to formalize mathematically, having originated from our own evolutionary algorithmic processes.
Our motivation and heuristics for choosing to test a specific hypothesis always begin with some theory we've simulated within our brain about which causal aspects we ought to test. As a single famous example, John Snow constructed one of the first observational experiments on cholera by examining two water sources and noticing that one was the source of the cholera cases.
What's just as interesting as his experiment was his intuition – or, in statistical terms, his high-dimensional prior beliefs – that the prevailing theory of miasma was incorrect and that an alternative theory should be proposed and tested. Due to the limitations of our statistical tools, we've never considered his book, where he talks through his theorizing and intuition, as part of the model; we always skip to his numerical tables. Yet it's clearly the case that his book was a fundamental input into his modeling approach.
The number of artifacts in which humans walk through their reasoning for solving problems is astounding, and they embed within them all of our most primitive heuristics for unraveling the world. The same is true of our historical writing. Which events were the important ones? Why did we focus on them? The amount of our base reasoning we impart into text is massive.
Novels and Simulations
We can even see the relationship to other forms of narrative writing. In a 2004 piece, Tyler Cowen wrote about the similarity between models and novels, and in what ways a novel can be considered a model:
Clearly novels are not data, as a social scientist would use that term. By definition novels do not narrate true events. A novel may be “true data” about the mind of its creator, and the proclivity of that mind to draw connections and tell stories. But a novel remains a constructed tale. If novels cannot be data, I therefore consider two other major categories in economic theorizing -- models and simulations -- to see how novels might fit in.
The bottom line can be explained as follows. Economic theory is rigorous, or at least attempts to be. Yet how we evaluate economic theory, and how we choose economic theories, is often highly intuitive. A knowledge of novels can refine our intuitions in these tasks.
A lot of what we consider intuition or imagination comes from running a simulation in our heads of some possible phenomenon and modifying parts of it to see how it would resolve. I think the clearest examples of this come from Tolstoy's work, where he wrapped a novel around a historical event with the goal of fleshing out a social-scientific theory.
My favorite example is Tolstoy’s use of War and Peace as an argument against the great man theory of Napoleon, but Tyler highlights “Anna Karenina, in part, as a story of the prevalence of self-deception, set among the Russian nobility.” In both cases, the novel was a model that generated a simulation that seems to fit the data. Up until recently, this is a type of human generated data and reasoning that has been off limits to any type of computational model.
When Tyler wrote his paper, the idea of treating a novel as a model was theoretically interesting but intractable. We need to start seriously thinking about the function space of novels, and what it would mean to learn the functions that have generated historical novels or narrative structures. The space of novels is a vast store of human-generated simulation training data.
When I think of a more general intelligence, I'm not only thinking of a tool that can synthesize and extend software or mathematical formulations, but of something that can also reason and form an intuition about what might be interesting. Our books and documents aren't only encoding surface-level information; they are encoding a more primitive form of simulation-based reasoning and scientific discovery that exists beneath the surface. I have no doubt the data exists within our text; the primary question is whether LLMs can learn these functions.
Conclusion
If our textual space is complete enough to represent the generating functions of human reasoning, I don’t see any reason why fitting the correct function to that data would be insufficient to create a general intelligence. The textual space of humanity seems to encompass a latent space that embeds the functions we all use to ask and test scientific theories on the world.
Whether this is a function our current generation of LLMs can learn with more compute, I'm not sure. Empirically, GPT-3 seems to be the latest point on a shockingly steep curve of increasing capability.
I can think of a few reasons it might not work out. The first is that the specific ways in which we condense information into text could be too lossy. When we store information, we compress it from our full sensory data. A machine that can only read text will be missing the full context and scope of our interactions with the world, and the overwhelming amount of visual input we receive and process. It's possible this is too much lost information, and that our textual knowledge is entangled with our physical experience of the world in a way that a pure language model will be unable to reconstruct.
The second is that the true scientific generating functions embedded in the text might be too hard to learn. Even if we acknowledge that they exist in our data, it's possible that LLMs, even with massive compute, can't learn these functions. For example, we know the information content for how addition and other mathematical operations work is in the text. If the model can't reliably learn those functions, why should we be confident it can learn other latent functions? Perhaps some additional, undiscovered modeling approach, different from LLMs, would be necessary to learn from this data in a more computationally efficient way.
I believe these problems either won't be an issue or will be solvable within the general LLM framework. Even if they ultimately aren't, these problems seem empirical. I don't see a credible argument for claiming a priori that LLMs will not be sufficient to learn a general intelligence.
Thanks to roon for helpful comments and discussion.