Biweekly update

I'm working on a fairly detailed post about linguistic biases and how to formalize them as open-ended tasks, but it isn't quite ready yet. In the meantime, treating this post more as a journal entry, I'll try to give a summary of my thoughts and research directions.

Ideally, what I want to do is create a series of open-ended curricula that teach an agent "intelligence". Because that's pretty ambitious, I'm okay with settling for open-ended curricula that teach some important aspects of intelligence and help augment the curricula of language models in various ways. There are a few reasons this is still my research direction.

My main reason is safety:

(I'm still doing a lot of reading about AI safety, so my thoughts on these topics are constantly changing and I'm not an expert on them. These are just my current impressions.)

One way to create AI would be to feed a model very high-resolution EEG signals and have it infer the underlying model of the brain that produced them. Because we don't have a great EEG dataset, one step removed is to predict the actual outputs of the brain (what a person is doing). Two steps removed is to predict some specific kind of output, and the internet provides a very large amount of data produced by human brains. Thus, if we try to model the distribution of data produced on the internet, a good enough model will have to recreate many of the mechanisms that produced that data, leading to an intelligent model capable of modeling much of what makes up human brains.
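
To make the distribution-fitting idea concrete, here's a minimal toy sketch of my own (nothing like an actual large-scale setup, and the model and corpus here are purely illustrative): a tiny next-token model trained with cross-entropy on a scrap of text. The point is just that "fit the distribution" reduces to "predict the next token well", and doing that arbitrarily well requires capturing whatever process generated the text.

```python
# A deliberately tiny next-character model (toy illustration only).
# Minimizing cross-entropy on next-token prediction is exactly "fitting the
# distribution" of the text; scale data and model up enough and the model has
# to internalize the mechanisms that produced the data.
import torch
import torch.nn as nn

corpus = "the internet is a very large sample of things human brains produce. "
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in corpus])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)  # shift by one: predict each next character
for step in range(200):
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```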

This is a rough description of what OpenAI's approach to building an intelligent agent looks like. While I think the emphasis on "AGI" is a little odd (any agent we create will be better than humans in some ways and worse in others for a long time before it is better than humans in all ways, so it makes sense to try to decouple the aspects of intelligence and keep expectations moderated), the intuition is that they are trying to make an agent that is good at many of the economically valuable things humans currently do, and this distribution-fitting approach should get them there.

It's really exciting that there's a plausible path to an agent that matches human intelligence on many levels, but that path comes with risks. There are all the standard automation concerns, and beyond those, a couple of distinct safety concerns.

One safety concern is that we will have systems that are great at optimizing measurable goals, but those measurable goals will continually diverge from human happiness due to Goodhart's law. We could end up with a society where the measurements all say "things are going great" and the systems keep getting more complicated, yet individuals feel miserable. I think affective computing is a good approach to this problem in the short term. In the longer term, as systems get arbitrarily good at optimization, this approach leads towards wireheading (directly stimulating pleasure centers), and before then society can get stuck in feedback loops that seem locally desirable but not globally desirable. Still, as a principle this seems like a nice approach as long as our optimizers aren't too powerful.
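
As a toy illustration of the Goodhart failure mode (entirely made-up quantities, just to show the shape of the problem): suppose wellbeing depends on two things, a fixed budget is split between them, and our dashboard only measures one. Pushing the measured number up eventually makes the real objective worse.

```python
# Toy Goodhart's-law illustration (made-up quantities, fixed total budget of 10).
# The proxy only sees the measured feature; true wellbeing needs both features.
import math

def true_wellbeing(measured, unmeasured):
    # Diminishing returns in both the measured and the unmeasured feature.
    return math.sqrt(max(measured, 0.0)) + math.sqrt(max(unmeasured, 0.0))

def proxy_metric(measured, unmeasured):
    # The dashboard only tracks the measured feature.
    return measured

for pressure in (0.0, 0.3, 0.7, 1.0):
    measured = 10 * (0.5 + 0.5 * pressure)   # spend more of the budget on what's measured
    unmeasured = 10 - measured               # the unmeasured feature absorbs the cost
    print(f"pressure={pressure:.1f}  "
          f"proxy={proxy_metric(measured, unmeasured):5.1f}  "
          f"true wellbeing={true_wellbeing(measured, unmeasured):.2f}")
```

In this toy, the proxy climbs from 5.0 to 10.0 while true wellbeing falls from about 4.5 to about 3.2: the harder the optimizer pushes on what the measurement sees, the worse things get on what it doesn't.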

The other safety concern has to do with not understanding what our models are doing. There's this weird effect where from the outside these models seem very powerful, but from the inside you start feeling like they are really dumb, because you change some small thing that shouldn't matter and suddenly the model just says "ttt t tt tt t t" and never gets past that point. This means that the idea of some model "taking over society" or "killing all humans" or "making a paperclip optimizer" seems pretty silly to anyone currently working on AI research: most of the time we can't even get models to do the extremely well-defined tasks we want them to do.

However, I do think OpenAI is particularly good at countering this fallacy by developing a good understanding of the kinds of scaling timelines that might happen. This lets us recognize that our intuition for these things is biased (humans aren't good at thinking about exponentials) and account for that.

Anyway, there seem to be two main areas of addressing this question of "how to prevent AIs from getting spooky". The first direction (Paul Christiano) is saying "okay, practically, it's really hard to prove stuff about these systems. Due to the non-linearities, it ends up being tied to intractable problems or large open conjectures in mathematics. However, we're still going to be building these systems anyway, and if we don't, other people (like China) will. Thus, we should figure out practical engineering-type approaches to making these systems do what we want and not get spooky."

In particular, there are three pieces: making sure the agents are good at optimizing the reward you gave them, making sure they are only optimizing your reward and nothing else (this is called "inner alignment"), and making sure their reward is aligned with the designer's intentions (this is called "outer alignment"). Both inner and outer alignment are difficult, and that difficulty imposes an "alignment tax": it's additional work to make sure AI systems are aligned. Thus, it is very important to develop methods that help us do outer and inner alignment better while simultaneously improving the model's ability to optimize, so engineers will be incentivised to use them because they get them what they want. Ideally, the better we get at inner and outer alignment, the safer our systems will be.

The distinctions are actually slightly more nuanced and messy than this, and there's a lot of prerequisite work I'm still trying to understand. Paul and the others at OpenAI are targeting a specific subset of this problem that seems most promising, but that's the gist of it.

Kudzo Ahegbebu (one of the other Scholars) made a really interesting analogy to airplanes. We made flying machines far before we had theory to understand why they work, using trial and error based engineering. AI seems to be in a similar place. Because AI has serious risks, it makes sense to use all tools at our disposal to tackle those risks, so it's worth trying to use trial and error based engineering to help with AI alignment, even if it does feel a bit like alchemy at times.

The other research direction starts from the assumption that "guys, we are software engineers, and we know that we mess up all the time. We don't even know why our current models work, and they are relatively dumb. So why do we think we'd have any hope of understanding how extremely intelligent models work?"

There are two takes from here. The first is that we should focus on interpretability and understanding: maybe we can use scientific techniques discovered in other fields to understand these complex systems, and that will let us make sure they do what we want.

The other take is that mathematical proofs are the only way to have real guarantees, so we should take a step back and make sure we can build systems whose behaviour we can prove safety properties about.

Personally, I've been waffling between these camps. As a CS theory person and software dev I understand I make mistakes all the time, and so the second camp is fairly convincing to me. On the other hand, maybe the first camp is the necessary approach.

My research direction is trying to approach the clarity and mathematical questions from a task perspective. I'm hoping that if we can formalize the essential components of human tasks (by building synthetic tasks that are a superset of them), we can use that to prove things about the behaviour of our models. Or, in the worst case, we can use that knowledge to make empirical rules and guesses about what kinds of alignment failures to expect from very powerful models, and find ways to mitigate them.

The point is, I feel like not having a theoretical way to poke at human data is really limiting. There are a lot of representation and statistical techniques that approach the question in other ways, and I find many of them very promising, but I'm hoping this direction can complement them.

Practically, I've been spending most of my time trying to piece together what the "essential components" of human data are. It's a really tricky question that pokes into psychology, sociology, linguistics, economics, mathematics, and lots of prior work in AI, but I think it's possible to make some progress. I've been going back and forth between designing small-scale experiments and working from the top down (reading literature and prior work) to see where my theories do and don't hold. I've spent most of my time reading, because it seems like there's a lot of conceptual ground I need to cover first. Things are coming together, but there is still a lot of work to do. My next blog post should be a much more detailed take on these thoughts.

One fun unrelated point: apparently transformers can factor numbers? As part of understanding what problems these models can and can't do, I tried a fairly simple GPT-type model on few-digit factoring problems (factoring products of two primes), and it successfully learned an algorithm that generalized to a held-out test set. You have to make the base of your numbers fairly high before there is enough data to stop the model from memorizing single- or double-digit problems, and it's important to ask the model to output the factors in a predictable order (such as sorted), or it is forced to memorize. Before completely fitting, it also learned to "approximately factor": it would output two numbers that are prime about 80% of the time and whose product is within 1-2% of the right value. My initial impression of the scaling laws for factoring is that the required model size grows very quickly (maybe exponentially with respect to problem size?), so it's similar to brute force and not actually helpful for practical real-world problems. Still, I think it's interesting.
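
For concreteness, here's roughly how a dataset like that could be generated (a sketch with illustrative choices; the base, prime ranges, and token formatting in my actual runs differed): sample two primes, write their product as digits in a high base, and always emit the factors in sorted order so the target is deterministic.

```python
# Sketch of a factoring-dataset generator (illustrative parameters, not my exact setup).
from sympy import randprime

BASE = 128  # a fairly high base, so examples don't collapse to memorizable 1-2 digit problems

def to_digits(n, base=BASE):
    """Little-endian digit expansion of n in the given base (each digit becomes one token)."""
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    return digits or [0]

SEP = [BASE]  # reserve one token id just past the digit range as a separator

def make_example(lo=10**3, hi=10**5):
    p, q = randprime(lo, hi), randprime(lo, hi)
    a, b = sorted((p, q))  # sorted output order, so the model isn't forced to memorize which factor comes first
    # token sequence: digits of the product, separator, digits of the smaller
    # factor, separator, digits of the larger factor
    return to_digits(p * q) + SEP + to_digits(a) + SEP + to_digits(b)

examples = [make_example() for _ in range(10_000)]
```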
