On AI Alignment

(This isn't my bi-weekly progress blog post; it's a less polished series of ramblings about AI safety I've been musing over for the past few weeks. Feel free to skip it; my bi-weekly blog post is here.)

I use writing to work out my thoughts. This post is in two pieces: a summary of my thoughts and what I think the takeaways should be, and a longer "working out why things are the way they are" section. I don't expect anyone to read the "working out" sections; the summary should be the most useful and interesting part. This post is still in progress, and I keep tweaking it as I think of new things.

There are a few assumptions within alignment that can lead to vastly different predictions about what the right research direction is. Here's the rough process I've been through in understanding these opinions:

Assumption 1: There will be a slow takeoff, so we need to do "black box alignment" where multiple people build systems simultaneously.

It's been pretty clear to me for a while that an intelligence explosion could happen. I used to think that a fast takeoff (on the order of a few minutes to days) was most likely. However, I've spent some more time reading and thinking this week, and I think there are quite a few good arguments that we will instead have a "slow takeoff" over the course of about a year, and that it'll be clear in advance that we are working up to it.

A "slow takeoff" has a higher bar for alignment. Instead of just needing to make a single agent and "locking it in", there will be multiple people making AI at the same time. Thus, techniques need to ensure that everyone is using good safety techniques. Paul Christiano suggests the following criteria for the technology everyone is using:

Secure: Robust to adversarial inputs, since the real world can be adversarial

Competitive: Ideally, aligned techniques should produce systems at least as capable as unaligned alternatives, which encourages everyone to use them. The larger the penalty for using alignment techniques, the more likely someone will decide to skip them to gain a competitive edge.

Scalable: Techniques continue working as the systems get larger and larger

In practice, this assumption only really matters for the Competitive point: Scalable and Secure seem like necessary criteria for any AI alignment system. 

The Competitive point matters because it rules out alignment techniques that would work but wouldn't be adopted by most of the major parties. Many kinds of "provably aligned AIs" might end up requiring far more resources than hand-engineered "mostly aligned AIs", so just developing a provably aligned AI is not sufficient; we should constrain our search to techniques that apply to whichever AI approaches are most likely to initially cause an intelligence explosion.

Personally, I'm still not entirely sold on this point. I think it depends heavily on how difficult certain manufacturing technologies would be for an AI to build, which in turn depends on the physics of our world, and even Paul Christiano gives a 30% probability of fast takeoff. So I'm still in favor of people searching for provably aligned AIs, even if they are less efficient: it might be possible to convince enough people to devote resources to less efficient but provably safe AIs that we end up in a good place, and maybe they'll turn out to be more efficient than expected anyway. However, this leads me to my second point:

Assumption 2: We should accept that we cannot build provably correct alignment techniques in time

For a while, I had this hope that we could:

1. Formally specify how the brain works (or at least, formally specify that we want the system to spend its initial efforts formally specifying how the brain works)

2. Define preference satisfaction in terms of how the brain works, in a way that prevents wireheading (loosely defined as any setting that leads to pleasure maximization at the cost of human cultural/technological stagnation)

3. Make an AI that tries to do those things to the best of its abilities

Once we had such an AI, we could "lock it in" in an intelligence explosion, and then essentially we've created utopia. 

This might be possible, and it's good that people are still working on creating such things, but I can imagine a few cases where there are fundamental problems:

1. Prediction difficulties: Maybe the brain is capable of universal computation, so making long-term predictions about our preferences is as hard as solving the halting problem for Turing machines with a few million states. Many similar questions about our long-term preferences might just run into Rice's theorem.

2. Computational difficulties: Maybe preference aggregation runs into impossibility results, like those in social choice theory (a minimal example of the simplest such obstruction, a Condorcet cycle, is sketched just after this list). The hope is that we can make enough assumptions to make things feasible, but it's possible that's too optimistic and there really are computational/impossibility problems in preference aggregation that we can't avoid.

3. Emergent difficulties: Maybe the nature of human psychology (prediction errors giving pleasure) would keep pushing a preference-satisfaction AI into doing more "interesting" and "unpredictable" things, and this would naturally settle into basins of computationally asymmetrical problems: we can quickly verify that something is what we want, but it's very difficult for the agent to produce the desired thing (assuming P!=NP), even as its computational abilities far surpass ours.

4. Timing difficulties: It may be possible to make a provably aligned AI, but the intelligence explosion/lock-in will happen before we find it

5. Specification difficulties: Human behavior doesn't depend only on the structure of our brains: the physics of the real world, the culture and experiences we had growing up, etc. all shape our decisions and actions. We can have formal causal models of how systems in the world work, yet disambiguating between them requires experimentation (which might carry some unavoidable explore/exploit tradeoff), and incorporating those models in a way that affects the world leads to self-referential issues that are nontrivial to address (see some of the MIRI work)

6. Practical difficulties: In computer security, we've never been able to make a "provably" secure system because that's not how computer security works. Instead, you have to make assumptions with respect to some "threat model". Provable AI alignment seems to require some kind of universal threat model, so I don't see why we expect to solve it when we can't even construct a universal threat model in computer security, which seems like a simpler problem (but still pretty clearly unsolvable). Perhaps the assumption is that an aligned AI will be the most powerful thing around and no one will be powerful enough to attack it, but that's a strong assumption, and in an open-world setting there will likely be cases where it isn't true.
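To make point 2 concrete, here's a minimal sketch (my own toy illustration, not anything from the formal social choice machinery) of a Condorcet cycle: three voters with perfectly consistent individual rankings whose pairwise majority preferences are cyclic, so there is no coherent aggregate ranking to satisfy.

```python
from itertools import combinations

# Three voters, each with a consistent (transitive) ranking over options A, B, C.
# This is the classic Condorcet example; the voters and options are made up.
voters = [
    ["A", "B", "C"],  # voter 1 prefers A > B > C
    ["B", "C", "A"],  # voter 2 prefers B > C > A
    ["C", "A", "B"],  # voter 3 prefers C > A > B
]

def majority_prefers(x, y):
    """Return True if a majority of voters rank x above y."""
    votes_for_x = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return votes_for_x > len(voters) / 2

# Check every pairwise contest: each individual ranking is consistent,
# yet the aggregate majority preference is cyclic.
for x, y in combinations(["A", "B", "C"], 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# Prints: A over B, C over A, B over C -- a cycle with no consistent group ranking.
```

Real impossibility results (Arrow, Gibbard-Satterthwaite) generalize this kind of obstruction; the question for alignment is whether the extra assumptions needed to escape them are ones we actually believe about human preferences.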

If we accept that we cannot make a provably aligned AI, everything initially seems hopeless. However, there is an out: make sure that we don't ever get "locked in". This allows us to do engineering work on alignment techniques. When they don't do quite what we want, we just tweak them. Ideally, AI can even help us with this improvement process, but we still have the final say. This requires an AI that lets us tweak its value functions and optimization procedures, also known as a corrigible AI.

The hope is that corrigibility is something we can get a provable guarantee for. As long as we have that, we are probably okay, because we can just keep tweaking systems and continually improving them for a long time.

If we can't guarantee corrigibility, there might still be some hope if corrigibility has a "basin of attraction" that agents tend to fall into. Because corrigible agents tend to satisfy our preferences better than non-corrigible ones, it seems like preference satisfaction might naturally pull agents toward corrigibility.

Fortunately, there also seem to be some ways of getting around "not being able to formally define things". Providing examples of what we want and asking systems to "learn what we mean" can go quite far; see the recent reward modeling work at OpenAI. This works well in settings where you have good data, but when you don't have good examples it becomes somewhat non-trivial. It seems like some of the insights of On centering, solutionism, justice and (un)fairness become relevant here, but I'd still like to write a more detailed analysis of this.
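For concreteness, here's a minimal sketch of the pairwise-comparison flavour of reward modeling. This is my own toy illustration (assuming PyTorch is available; the feature dimensions and "comparison" data are made up), not the OpenAI implementation, but it shows the core idea: fit a reward function so that outcomes humans labelled as better get higher scores.

```python
import torch
import torch.nn as nn

# Toy reward model: maps a (made-up) feature vector describing an outcome to a scalar reward.
reward_model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred, rejected):
    """Bradley-Terry-style loss: push the reward of the preferred outcome
    above the reward of the rejected one."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -torch.log(torch.sigmoid(r_pref - r_rej)).mean()

# Fake "human comparisons": pairs where the first element was labelled as better.
preferred_batch = torch.randn(64, 8)
rejected_batch = torch.randn(64, 8)

for step in range(100):
    loss = preference_loss(preferred_batch, rejected_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The learned reward_model can then be used as the objective for an RL policy.
```

The learned reward model stands in for the formal specification we couldn't write down, which is exactly why the "what if we don't have good examples" case is the hard one.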

Note: Since I wrote this, I read The Rocket Alignment Problem, which I highly recommend. It points out that the point of "provably aligned AI" research is not necessarily to build a provably aligned AI. Instead it's to keep simplifying the theoretical setting until we can make a provably aligned AI in that setting, then add back the complexities and design "steering" mechanisms that can work in reality. By analogy, for a rocket it makes sense to plot a trajectory that would get to the moon assuming perfect conditions. Once we have a theoretical trajectory that seems plausible, we can add corrective devices to account for the real-world pieces, and then we should be able to land on the moon with high probability.

I think this is a good argument, and I'd be in favor of that kind of research being done. However, I still worry that in practice there will be too many computationally intractable pieces to make such a thing feasible.

Takeoff Speed

I've seen people fall into a few different camps, roughly ranked by predicted "takeoff speed of AI":

1. AI won't be that dangerous, and it'll take off really slowly and be capped at roughly human performance due to theoretical limitations/diminishing returns

In this case, AI safety is important, but it's comparable to the safety of other technologies. Efforts focused on preventing systemic discrimination, bias, and the perpetuation of inequality, and general "AI fairness" techniques, should be the main focus.

This camp also includes the misalignment danger deniers:

- "I think AI might get very intelligent, but our society also has very intelligent people already, and plenty of mechanisms for ensuring that humans don't do anything too awful. I'm confident the problem is essentially already addressed/we can figure it out as we go"

- "Nature seems to converge to social intelligences, therefore helpful, friendly intelligences are pretty common among the space of all intelligences, so we should expect to see one emerge in the creation of intelligence. This is especially true because we measure our systems by how good they are at tasks humans care about, therefore those systems should care about the things we care about"

There are also the AI deniers: "We aren't going to make AGI any time soon." This argument can be made in increasingly nuanced ways as we get closer to AGI.

I'm not going to devote this post to countering these arguments; I may write another blog post on that. Here I'll just say that I don't find them convincing.

2. AI will be dangerous, and there will be a "fast takeoff".

By "takeoff", I'm referring to AI capabilities rapidly expanding over a short period of time (days or weeks): singularity-type stuff. The argument goes as follows (a toy model of the feedback loop is sketched after the list):

- Building AI requires intelligence

- Once we make an AI that is more intelligent than us, it will be better than us at building smarter AI

- This quickly becomes a cycle of self-improvement that converges to a being of God-like intelligence.
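Here's a toy model of that feedback loop (entirely made-up dynamics, just to illustrate why the shape of the returns curve matters): if each unit of capability buys more-than-proportionate further improvement, the growth curve blows up in finite time, which is the "fast takeoff" picture; with proportionate or diminishing returns you get ordinary exponential or slower growth.

```python
# Toy takeoff model: capability grows at a rate proportional to capability**k.
# k > 1 roughly corresponds to "fast takeoff" (finite-time blow-up);
# k = 1 gives ordinary exponential growth; k < 1 gives diminishing returns.
def simulate(k, steps=200, dt=0.1, capability=1.0):
    trajectory = [capability]
    for _ in range(steps):
        capability += dt * capability**k
        if capability > 1e9:  # treat crossing this cutoff as "takeoff has happened"
            break
        trajectory.append(capability)
    return trajectory

for k in (0.5, 1.0, 1.5):
    traj = simulate(k)
    print(f"k={k}: reached {traj[-1]:.2e} after {len(traj)} steps")
# With k=1.5 the capability crosses the cutoff well before 200 steps;
# with k<=1 it grows smoothly for the whole run.
```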

If you believe this train of logic, it makes sense to tackle the "white-box AI alignment problem": we need to make a single aligned AI, formally prove it works, then "let it loose". At that point, we will be "locked in". The AI will quickly become more powerful than anyone else, and prevent anyone else from making any other AIs, including unaligned ones.

It also makes the development of AI seem very risky. If you spend your time working on capabilities and end up speeding up the development of AI to the point where this "lock in" happens before we have a good alignment solution, we could end up in a very scary place. Some people aren't worried about this because they figure taking the gamble is fine (worst case they are killed, best case they end up in a utopia, and global warming is an existential threat anyway), but it's important to also consider suffering-risks. These are outcomes where an unaligned AI spreads suffering across the entire universe, putting each of us in a sort of personalized hell for the rest of the lifetime of the universe. There are also plenty of intermediate misalignment outcomes that just result in the world getting progressively worse for humans, but don't immediately kill us.

However, there are people that disagree with the "fast takeoff" argument:

3. AI is dangerous, but there will be a "slow takeoff"

Slow Takeoff

The fast takeoff argument hinges on a discontinuity: we eventually pass a threshold where self-improvement happens faster than humans can improve AI. Paul Christiano has rebuttals to many of these arguments.

This is important because if we slowly approach an intelligence singularity (say, over the course of a year), then multiple people will have systems that are comparable in power. In this case, we need to solve the "black box" version of AI alignment: we don't just need one aligned system; we need to solve the coordination problems that ensure no AI system anyone creates results in harm. However, the "lock in" situation might be less likely because we will have multiple competing systems.

I used to lean towards thinking that "slow takeoff" was unlikely. In computer science, I'm very familiar with weird "boundaries" where a small, seemingly unimportant change leads to categorically different behavior. A few examples:

- When looking at models of computation, it's very easy to add a few features to your theoretical computer and make it Turing Complete. 

- When looking at computational problems, it's very easy to either run into uncomputable things, or NP-Complete problems.

- When looking at physical systems, it's very easy to accidentally run into chaotic behavior, where some questions about long-term behavior can't be answered without directly simulating the system (this is really a subset of the models-of-computation point, since the threshold often appears once the system can simulate powerful models of computation, but I think it stands well enough on its own; see the short numerical example after this list)

- A set of axioms can very quickly give rise to undecidable questions about which statements are provable

- There are many finitely presented groups (one of the simplest mathematical objects you could think of) for which basic questions, like the word problem, are undecidable

etc.
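As a quick numerical illustration of the chaotic-systems point, here's the standard logistic map: two starting points differing by one part in a billion become completely unrelated within a few dozen iterations, so answering questions about the long-run state is essentially no cheaper than running the simulation.

```python
# Logistic map x -> r*x*(1-x) with r=4.0, a standard example of chaos:
# nearby initial conditions separate exponentially fast.
def logistic_trajectory(x, r=4.0, steps=60):
    xs = [x]
    for _ in range(steps):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

a = logistic_trajectory(0.200000000)
b = logistic_trajectory(0.200000001)  # initial condition differs by 1e-9

for step in (0, 20, 40, 60):
    print(f"step {step:2d}: {a[step]:.6f} vs {b[step]:.6f} (diff {abs(a[step] - b[step]):.2e})")
# By around step 40 the two trajectories look unrelated, despite an
# initial difference of one part in a billion.
```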

This is somewhat related to Paul's "secret sauce" point: maybe there's some way of implementing intelligence that leads to an explosion that no one has thought of yet, because the algorithm ends up in some new "category of intelligent systems". Once we find it, our current hardware and compute levels are enough to get an intelligence explosion.

Finding a universal intelligence is easy

However, the examples I provided can also be used to argue against a discontinuous intelligence explosion.

Consider NP-complete problems, or uncomputable problems. There are tons of them, so it's very likely that someone working on some unrelated thing will stumble into one.

Similarly, consider Turing-complete machines. There are tons of them, so it's very likely that someone trying to create a computing machine will quickly converge on a Turing-complete model.

The point here is that "threshold effects" in mathematics often happen when the object is so common that it is easily encountered in practice, and many different approaches converge to them. Therefore, if threshold effects exist in intelligence, the "universal intelligent model" should be relatively easy to find, and fairly dense among the set of all things people could come up with.

There are some objects that are theoretically very common (such as transcendental numbers, which are dense among the reals) and yet we have relatively few explicit examples of them. However, the few examples we do have, such as the number pi, were found quite a long time ago.

If it's true that finding a "universal intelligent system" is pretty easy, then it's likely we already have it. Maybe LSTMs or transformers are already good enough.

Relative Efficiency

If it is easy to discover "universal intelligent systems", I think the situation becomes very comparable to Turing machines. The Church-Turing thesis posits that "Turing machines are all you need": any function that can be computed can be computed by a Turing machine.

This is an unproven claim about the physics of our world. There are plenty of hypothetical models of computation, such as computers with access to time machines, that are capable of computing functions that cannot be computed by Turing Machines. However, so far it doesn't seem possible to build such computers in our world.

There's a stronger claim (the extended Church-Turing thesis) that seems more relevant for AI: "any machine that can be built in our world is at most polynomially faster than a Turing machine". This is again an assumption about the physics of our world. In fact, quantum computing might be provably better in some ways; here's some discussion with a nuanced explanation of the differences. And any speedup only applies to certain problems, which ties nicely into no-free-lunch arguments.

Using this as an analogy to AI, we get the following hypothesis:

"Universal Learners (search processes over programs) are relatively easy to find, and many ideas work once they have enough compute. The sample complexity difference between ideas is at most polynomial, and empirically it seems to be a constant factor difference, so it's possible there are even stronger guarantees. Breakthroughs in physics could act as a 'jump', and the details of physics in our universe affect how large that jump is".

So far, quantum mechanics provides the only known physics 'jump' in computing (maybe black-hole computing is better? I'm not sure there's a conclusive answer to that question yet, and an AI using black holes would probably be noticeable). If quantum mechanics is useful for intelligence, it's very possible our brains are already using it, since quantum mechanical effects show up elsewhere in biology. Thus, if we can match the brain's performance with comparable classical compute, that is some evidence against quantum being much more helpful. (Counterargument: maybe large-scale quantum computing is infeasible in biological systems without engineering effort that isn't worth it, so the brain has to settle for classical computing.)

One counter-argument to this point could be something along the lines of: "There are certain computational problems we eventually proved were in P, but it took us a very long time to do so. Maybe our current techniques are inefficient approaches, and a sufficiently powerful AI will do something like solve P=NP, giving it very powerful computational tools to work with"

The best response to this seems like an appeal to nature: if evolution never found algorithmic tricks that greatly decreased the computational cost of intelligence, it's unlikely we will either. But this response isn't great; maybe there's just no clear path of mutations that would ever lead to such an implementation.

In summary: "Analogies from building models of computation suggest that there is no secret sauce, finding universal learners is relatively easy, and improvements involve small polynomial improvements in sample complexity for relevant real world tasks. Also, the physics of our world matters a lot".

Caveats

These aren't formal arguments, and it's possible the structure of learning algorithms is much more like algorithm design (improvements matter a lot for capabilities, and learning algorithms are somehow a never-ending source of problems that can lead to more algorithmic improvements) and much less like architecture design (pretty minimal ideas give you most of what you want). Personally, I read the classification results in learning theory as being somewhat ambiguous on this point, and read the scaling law literature as suggesting that we are mostly finding constant-factor improvements.

There's a good question about how much asymmetry there is. If many "natural tasks" have "optimal learners" that require many more bits to describe than the tasks themselves, that implies the barrier to an intelligence explosion is higher, and the process could even be self-defeating if solving tasks keeps creating more such tasks.

Limits of computation

If we accept the above argument that there is no "secret sauce", then you may get a few orders of magnitude of improvement from better techniques, but you'll eventually hit diminishing returns with "universal" architectures. This implies that scaling up compute is the main constraint on "fast takeoff", and that a fast takeoff driven by algorithmic improvements would only happen if our current algorithmic efforts are seriously lagging behind what's possible (maybe the notion of an "algorithmic overhang" makes sense here?).

If we look at computing, supercomputers follow a pretty steady Moore's law. If we read Moore's law as "coordinated human-level intelligence causes growth at roughly this rate", then the question is how much "more intelligence" would speed it up.
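As a rough back-of-the-envelope for that framing (assuming a doubling time of about two years, which is itself an assumption, and treating a "research speedup" as a straight division of calendar time, which is very crude):

```python
import math

# Back-of-the-envelope: with a doubling time of ~2 years (an assumption, not a
# measured constant), how long does it take to gain 3 orders of magnitude of
# compute, and how much does a k-fold speedup of the research process help?
DOUBLING_TIME_YEARS = 2.0
TARGET_ORDERS_OF_MAGNITUDE = 3  # a 1000x increase

doublings_needed = TARGET_ORDERS_OF_MAGNITUDE * math.log2(10)  # ~10 doublings

for research_speedup in (1, 2, 10):
    years = doublings_needed * DOUBLING_TIME_YEARS / research_speedup
    print(f"{research_speedup:2d}x faster research -> ~{years:.1f} years for 1000x compute")
# At 1x this is ~20 years; even a 10x research speedup still leaves about 2 years.
```

Even a 10x speedup of the whole research process leaves a timescale measured in years rather than days, which is part of why the compute-constrained picture points toward slow takeoff.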

Without secret sauce that causes discontinuous jumps, we'd need a discontinuous jump in hardware improvements. This becomes a physics question again. I'm not particularly familiar with manufacturing, but it seems likely the limiting constraint would be acquiring raw resources, which just takes time and is often slowed down by local regulations. Nano-machines seem like the biggest "tech overhang" that could result in a discontinuous jump in computing power. I'd need to do more reading about the viability of nano-manufacturing for producing more flops at a rate faster than Moore's law. It seems plausible as a source of discontinuous improvement; some people even suggest we should build AI as fast as possible to make sure we get AI before nano-computing.

Anyway, all of these arguments suggest to me that fast takeoff is somewhat likely (I'd put it at about 40%), but that slow takeoff seems more likely.



