My prediction of AGI in 2021-2023

In 2020 I made a prediction on Facebook that AGI was only 1-3 years away. Probably time to follow up on that.

At the time, Gwern’s “You can only prove the presence of an ability, not the absence of it” seemed solid. Someone would say “GPT-3 can’t do X”, and then later someone would find a prompt where it could reliably do X! There was a lot of talk about becoming a good “prompt engineer”: being good at wording things in a way that causes language models to reliably do what you want.


Relatedly, these days my favorite way to think about language models is as a Simulator (Janus popularized this framing here: https://generative.ink/posts/simulators/). Inside them are all these different people/personalities, each with different abilities (try character.ai to experience this). Prompt engineering pulls out a specific simulated person and inhibits many others. If you get the wrong person, it’ll fail at things other simulated people are very capable of!


But honestly, I think ChatGPT proved Gwern wrong. There’s a variation of that quote, something like


“if it’s not easy to get a model to demonstrate some ability, most people will assume it doesn’t have it, because they’ll have plenty of examples of it failing”.


In simulator terms, “if the default simulated personality can’t do some ability, to the public, the model might as well not have that ability”


GPT-3 could do most of the things that ChatGPT could do, but you didn’t need to prompt engineer ChatGPT! (At least, not nearly as much.)


I try to keep this new quote in mind when thinking about language model capabilities.


In the comments of my prediction post, I said “[GPT-3] seems to be getting close to human performance in the category of using and comprehending language. Because [the task of predicting the next token] is so general and includes things like doing academic research, the Turing test, reasoning, etc. it seems it may be sufficient to capture general intelligence once it surpasses human performance, which may happen soon”


I think GPT-4 is very close! And obviously closer than ChatGPT, which was closer than GPT-2. I also think prompt engineering is still important: stuff like asking the model to show its work or to “simulate a panel of experts and write their discussion and conclusion” makes a big difference!
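To make that concrete, here’s a minimal sketch of those two prompting tricks. The helper names are made up; they just build prompt strings for whatever chat model you happen to be using.

```python
# Rough sketch of two common prompt-engineering tricks (helper names are made up).

def show_your_work_prompt(question: str) -> str:
    """Nudge the model to reason step by step before committing to an answer."""
    return (
        f"{question}\n\n"
        "Show your work: reason through the problem step by step, "
        "then give your final answer on the last line."
    )

def panel_of_experts_prompt(question: str) -> str:
    """Ask the model to simulate several experts discussing the question."""
    return (
        "Simulate a panel of three domain experts discussing the question below. "
        "Write out their discussion, then their joint conclusion.\n\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    print(panel_of_experts_prompt("How many prime numbers are there below 100?"))
```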


But that probably seems like a cop-out. Most of you have worked a bit with ChatGPT and heard the hype. It’s good at some stuff, a fun and sometimes useful tool, but it still seems like it’s not quite there. Writing off its failures as “bad prompt engineering” seems unfair: even with very good prompt engineering, it really feels like there’s something missing. If you are a domain expert this is especially apparent: not only is it often wrong, but you can’t really teach it to be right the way humans can be taught. (For me, I see this in its poor math abilities and its buggy code.)


So where was I wrong?


First off, let me address memory: while modern chatbots have small memories and forget stuff, I don’t think that’s a fundamental issue, and it seems like the research community is steadily finding ways to patch it.


The next issue is hallucinations: language models make stuff up! Janus says “you cannot stop the dreaming”: that language models are fundamentally fiction writers that just see the real world as a particular genre of fiction. While that’s a useful frame for understanding model hallucinations, I don’t think it’s strictly true: retrieval-augmented models help a lot (try Bing’s Chat mode or perplexity.ai to see an AI that’s a little better at citing its sources). Also see Anthropic’s work on influence functions: https://www.anthropic.com/index/influence-functions
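To illustrate the retrieval-augmented idea, here’s a toy sketch: fetch the passages most relevant to the question and put them in the prompt so the model can cite instead of invent. The word-overlap “retriever” and the documents are placeholders; real systems use embeddings or a search index, but the overall shape is the same.

```python
# Toy retrieval-augmented generation: ground the model's answer in retrieved text.
# The retriever here is a crude word-overlap scorer standing in for a real one.

DOCUMENTS = [
    "GPT-3 was released by OpenAI in 2020 and has 175 billion parameters.",
    "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'.",
    "LLaMA 2 is a family of open language models released by Meta in 2023.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    """Prepend retrieved sources and ask the model to cite them rather than guess."""
    sources = retrieve(question, DOCUMENTS)
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer the question using ONLY the sources below, citing them like [1].\n"
        "If the sources don't contain the answer, say you don't know.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

print(build_grounded_prompt("When was the transformer architecture introduced?"))
```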


These two issues are part of a broader one: language models aren’t simple tools that follow instructions, nor are they “agents” (a singular person with cumulative knowledge and goals). They are simulators that gain and throw away goals, knowledge, ideas, and personalities freely, and it takes a lot of effort to keep them reined in. This makes them difficult to rely on, which prevents them from automating jobs they can, in theory, do 80-90% of the time.


I do think that as we develop more advanced ways of reining in their tendency to write fiction instead of doing work (though honestly, mood), we’ll see more useful applications with existing models. But still, it feels like there’s something missing.


The simplest answer, then, would be “timing”. Due to its size and increased amount of data (alongside many other smaller things), GPT-4 is better than ChatGPT. I was naïve in thinking GPT-3 was most of the way there (after seeing its outstanding ability at things the NLP community had considered “AGI-complete” just shortly before), but it turns out there’s still a ways to go! Reaching human-level abilities via next-word prediction might just require tons more compute.


Something to note here: language models are in an odd place. Part of my prediction was banking on major architecture improvements. For image generation models, we kept seeing advances that reduced how much data you need, which is a big part of why things like Stable Diffusion or Midjourney became possible. But for language models, beyond transformers (2017) there have only been a few very minor improvements. This isn’t because people weren’t trying! RWKV (RNN), Hungry Hungry Hippos (state space models), MegaByte, countless __-former architectures, etc. For some reason, none of them has surpassed small modifications of GPT-2’s architecture once you get to billions of parameters. This is weird!


Without any architecture improvements, the only way to go is more data and bigger models (following the empirical “scaling laws” we’ve measured to determine the right amount of each). Or is it?
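As a rough illustration of what those scaling laws say, here’s a back-of-envelope calculation using two commonly cited rules of thumb (training FLOPs ≈ 6·N·D, and a compute-optimal token count of roughly 20 tokens per parameter, à la Chinchilla). Treat the outputs as order-of-magnitude estimates, not gospel.

```python
# Back-of-envelope compute-optimal sizing in the style of the Chinchilla scaling laws.
# Rules of thumb: training FLOPs C ~ 6 * N * D, and compute-optimal D ~ 20 * N.

import math

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (parameters N, training tokens D) that roughly use the budget optimally."""
    # Substitute D = 20N into C = 6*N*D  ->  C = 120*N^2  ->  N = sqrt(C / 120)
    n_params = math.sqrt(flops_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal(budget)
    print(f"{budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

More compute means both a bigger model and more data, which is why “just scale it” gets expensive so quickly.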


If you’ve been following the open-source language model community, you may have heard of LLaMA (or, more recently, LLaMA 2). It’s currently the best open-source language model, and many people have been seeing how much better they can make it by fine-tuning on a relatively small amount of high-quality data. This can make a big difference! Nous-Hermes-Llama2-13b is my personal favorite; it’s fine-tuned on data from a few sources, though I think the biggest impact comes from the Wizard dataset.
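For a sense of what that fine-tuning looks like in practice, here’s a minimal sketch using the Hugging Face transformers + peft (LoRA) stack. The dataset file and hyperparameters are placeholders, and this is not the actual recipe behind Nous-Hermes; it’s just the general shape of “take a base LLaMA, train a small adapter on a small high-quality instruction set”.

```python
# Minimal LoRA fine-tuning sketch (placeholder dataset and hyperparameters).

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-2-13b-hf"  # gated repo; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train a few million adapter weights instead of all 13B base parameters.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

# A small instruction dataset; swap in whichever high-quality set you like.
data = load_dataset("json", data_files="instructions.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```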


Getting high-quality data made by humans is expensive, so many of these approaches use data made by a much better language model (GPT-4) instead. But there’s also this notion of “impossible distillation”, where you generate data from a language model in a careful way (and filter it), then use it to train that same language model. https://arxiv.org/abs/2308.08998 is a recent version of this. It’s weird to think that this is a version of recursive self-improvement (RSI) that actually kinda works. But as Paul Christiano said, we will have bad, slow RSI before we have good RSI. And maybe RSI has such diminishing returns that it’s always pretty meh.
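The loop itself is simple to state: generate, filter, retrain, repeat. Here’s that general shape in Python. This is not the exact procedure from the linked paper; generate, passes_filter, and finetune are placeholders for the model-specific parts, and the filtering step is where all the cleverness actually lives.

```python
# The general shape of "generate your own training data, filter it, train on it".
# NOT the exact recipe from the linked paper; the three helpers are placeholders.

import random

def generate(model, prompt: str) -> str:
    """Sample a candidate completion from the current model (placeholder)."""
    return model(prompt)

def passes_filter(prompt: str, completion: str) -> bool:
    """Keep only high-quality samples, e.g. via heuristics, a reward model,
    or a verifier that checks answers (placeholder: coin flip)."""
    return random.random() > 0.5

def finetune(model, dataset):
    """Fine-tune the model on the filtered self-generated data (placeholder)."""
    return model

def self_distill(model, prompts, rounds: int = 3):
    for _ in range(rounds):
        dataset = []
        for p in prompts:
            completion = generate(model, p)
            if passes_filter(p, completion):   # only keep the good stuff
                dataset.append((p, completion))
        model = finetune(model, dataset)       # train on your own best outputs
    return model

if __name__ == "__main__":
    dummy_model = lambda prompt: prompt + " ... some completion"
    self_distill(dummy_model, ["Explain RSI in one sentence."])
```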


If you are willing to train a model from scratch, there’s also the question of how good you can make a small model with very high-quality data. I really like “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?” https://arxiv.org/abs/2305.07759 and “Textbooks Are All You Need” https://arxiv.org/abs/2306.11644. There are still a lot of open questions here!


Anyway, right now it’s basically just an empirical question. Personally, I still think my prediction that “predicting the next word is enough” was correct; it’s just the timing that was wrong. If we use the rule of “multiply any estimate a software engineer gives by 6”, that gives us a window of 2026-2038, and I’ll make another follow-up in 2-3 years.
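(For the curious, that window is just the rule applied to the original 1-3 year guess from 2020.)

```python
# "Multiply by 6" applied to the original 1-3 year guess, starting from 2020.
low, high = 2020 + 1 * 6, 2020 + 3 * 6
print(low, high)  # -> 2026 2038
```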


Of course, there’s a big glaring issue here. What about robotics?


Yeah, so… that’s hard. Turns out it’s really hard. Obviously “predict the next word” isn’t all you need for that (though it may help as a language-understanding component, which we’ve seen in some recent papers). We may end up in a place where all computer-based jobs can be automated, but jobs like hairdresser, plumber, electrician, etc. are still quite a few years away. Even if scaling is the right approach here, there is so much less data available that you need things like impossible distillation to work really well, and right now it’s still very early days. Sim-to-real is also promising, but there’s a pesky simulator gap that is hard to fully close.


Even for settings like self-driving, where there is lots of data, many robotics applications have a “long tail”: getting to 98% functionality is relatively easy, but the last few percent involve so many edge cases that it takes 10-15 years to cover enough of them.


A similar lesson applies to many applications of language models. Of course, our criterion should probably not be “perfect” but rather “safer/more reliable than humans”; still, I think this is all a good reminder that humans are really impressive. We are all geniuses, especially in the little things we do every day that we don’t think very hard about, like walking and talking.
