New Research Direction

AI Safety

I spent the holidays reading AI safety literature. I was somewhat familiar with the general concepts before, but this reading gave me a deeper understanding of some of the particular research directions.

In doing this reading, I've become more convinced that scaling up AI is a bad idea. I think that scaling up AI, combined with techniques that improve sample efficiency, has a very practical chance of producing some kind of AI that is superhuman in intelligence, and that seems to be OpenAI's main research direction. I have the impression that they are fairly conscientious about safety in general, but I'm still not sure that scaling up is actually a good idea.

Basically, I feel this way because there is not yet a good answer to the question of "how do we do AI safety?" Stuart Russell's preference learning direction seems like it may eventually get to a decent place, but Paul Christiano's informed oversight approach still seems to have some fundamental issues. There's been some decent progress, and some new directions, but it's still unclear to me whether they'll be able to patch all of the problems they run into. Personally, I think DeepMind's formal verification work will be an important piece of AI safety, assuming the SDP techniques can scale up to large models with algorithmic improvements.

There's the fair point that "white box AI safety" is a hard problem (making an algorithm that is aligned and doesn't do catastrophic harm), but even if we solve white box AI safety, we also need to somehow prevent anyone from making an unaligned AI. I was trying to think about things in terms of sample complexity to make some impossibility claims (maybe building some world-destroying technology is just too hard for AI for complexity-theoretic reasons), but unfortunately what is hard or easy to build seems to depend mostly on our world's physics. And unfortunately, humans are an existence proof that human-level AI is possible, and nuclear weapons are an existence proof that human-level intelligence is sufficient to destroy the world under our kind of physics. Thus, there's a serious risk here. I do think that, depending on the physics (for example, how difficult nanomachines are to build, or asymmetries in the difficulty of defending versus attacking), complexity-theoretic arguments might be sufficient to argue that a singularity/intelligence explosion is impossible, and for any task there are usually diminishing returns anyway. But the point is that you don't need an intelligence singularity to have a dangerous system: human-level intelligence is enough, and scaling up seems like a very plausible approach to getting that.

There's a more general "black box AI safety" problem, which asks: "if you have no guarantees about what an AI is doing, can you still prevent things from going catastrophically bad?" If you think about it, this is literally just the question of "how can we set up society so we don't all kill each other?", because each human is its own "black box AI". We have no formal guarantees on the goals and preferences of humans, and yet our society is decent at not killing each other. Nuclear weapons led to a close call, and future tech might as well, but many of the techniques we have (democracy, trade, specialization, etc.) have helped prevent any serious failure cases. How applicable our "how to run a society" techniques are to AI alignment seems to depend on how different AI might be (maybe our techniques make implicit assumptions about human psychology) and how far past human intelligence AI can get.

Of course, in practice the systems we develop are not usually just put into the real world and told "go do your thing"; we have a little more oversight than that, and the actual dangers are more nuanced and complicated. But still, I am now fairly nervous about contributing to research that improves the sample efficiency of AI.

Old Research Direction

My research direction was essentially about trying to make a formal model of human data/inductive biases. I ran a few experiments and did more thinking, and came to the following conclusions (also see this recent paper):

1. It seems like the approach of "making synthetic data to encode inductive biases" will be useful for training language models faster (i.e., improving our ability to scale up). I think this is particularly true for tasks like theorem proving, where the task is formally specifiable in not too many bits, open-ended, and continues being useful as complexity increases.

2. It seems like it will not be very useful for giving us a deeper understanding of "what is being taught to language models", aside from vague intuitions/Microscope AI techniques that aren't particularly useful for safety when scaling up. This is partly because many tasks end up having a bell-curve-shaped transfer profile: they aren't useful at low levels of complexity, are helpful at medium levels, and become less helpful at high levels (see spelling of words: eventually the words get so long that it's unhelpful). There's also a lot of nuance about what kind of assumptions you make about the data and what kind of data you have, which gets into formal learning theory. There are lots of impossibility results one could prove around edge-of-chaos dynamics making long-term prediction of behaviour hard, and the only way around that is to make those assumptions. This brings me to my next point:

I do think there is a plausible way to make such a formal theory: it basically relies on Topological Data Analysis (and other, similar theories about assumptions on data), Learning Theory, and DeepMind's formal verification techniques. People say that learning theory doesn't apply to neural networks, but that's not really true. I think there's a promising approach to "predicting clearly what an AI will learn, and how long it takes to learn it" that essentially takes data and uses automated theorem proving to generate a sufficient set of assumptions to produce a sample complexity bound, which can then be verified to be correct or refined further. However, the level of math chops this direction requires is a little past what I have time to develop in the Scholars program, so I'm going to pivot. After the Scholars program I'll have some time to do independent research, and I may focus on building my skills to pursue this direction further.
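To make the "assumptions in, sample complexity bound out" step concrete, here's a minimal Python sketch that uses the classic realizable PAC bound for a finite hypothesis class as a deliberately simple stand-in for the much richer assumptions described above (the hypothesis class size, epsilon, and delta are purely illustrative):

```python
import math

def pac_sample_bound(hypothesis_space_size: int, epsilon: float, delta: float) -> int:
    """Realizable PAC bound for a finite hypothesis class H:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples suffice so that, with
    probability at least 1 - delta, any hypothesis consistent with the
    sample has true error at most epsilon."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative assumption: the target lies in a class of 2**20 candidate programs.
print(pac_sample_bound(hypothesis_space_size=2**20, epsilon=0.01, delta=0.05))  # prints 1686
```

The interesting part of the proposed direction would be automating the search for assumptions like that hypothesis class size from the data itself, rather than asserting them by hand.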

It's still unclear to me whether such an approach could prevent mesa-optimizers: algorithms that optimize for a different, more general objective that also happens to optimize for your task, but might fail catastrophically in out-of-distribution settings. But I think it's very possible that formal verification and out-of-distribution robustness techniques could eventually address that concern as well. Once you make your assumptions about your data explicit, checking whether they are violated (and thus whether you are out of distribution) is doable. There are also plenty of easier ways of checking for out-of-distribution inputs, like looking at the variance of an ensemble of predictions, or treating different instantiations of dropout as different predictions and using the variance of those estimates.
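As a concrete illustration of that last point, here is a minimal PyTorch sketch of the dropout-as-ensemble idea; the toy model, layer sizes, and the use of raw softmax variance as the score are placeholder choices for illustration, not anything from my experiments:

```python
import torch
import torch.nn as nn

# A hypothetical small classifier with dropout; in practice this would be the trained model.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 32) -> torch.Tensor:
    """Treat each stochastic dropout pass as one member of an implicit ensemble,
    and use the variance of the softmax outputs as an out-of-distribution score."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return preds.var(dim=0).mean(dim=-1)  # higher variance -> more likely out of distribution

scores = mc_dropout_uncertainty(model, torch.randn(8, 16))
print(scores)
```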

Of course, I realize that I'm not the only person working on this direction. Choosing not to work on the "experiments to help guide dataset design to improve sample efficiency" direction may not have any actual impact on the rate at which that direction is pursued, and there's also a fairly high probability that, if I were to pursue it, the results would be less helpful than something like active learning or just more scaling. Or I might push in some corner that isn't quite the right direction, while someone else finds a more promising approach elsewhere, because that is often how research goes. The other reason I'm switching directions is that what made this direction interesting and exciting to me was the potential to develop those formal theories. Now that I understand that the kinds of experiments I was going to run aren't particularly helpful for developing those formal theories (because of edge-of-chaos effects and dependence on particular choices of model), and that the right way forward is the more math-heavy direction above, I find the experiments much less interesting.

New Research Direction

Essentially, I'm going to be studying "feedback loops in AI systems". There are a few questions here that I want to look at. I'll be typing up a new blog post soon with a more detailed description of this approach, but here's the high-level picture:

1. Looking at models of opinion formation to try to model how giving an individual a larger slice of opinion-influencing ability (through AI models) will shape the overall information landscape. This includes disinformation, but also other settings. For example, if a model is (unintentionally) biased relative to the initial opinion distribution, then as the amount of text generated by that model grows, it has a larger sway on the set of opinions held by others, once you make assumptions about how opinions spread (see the first sketch after this list).

2. Looking at what happens when you train a language model on its own outputs. Will it collapse to something boring? Diverge toward entropy? Or do some of each in different domains? (See the second sketch after this list.)

3. Seeing if there are ways to manage both of those feedback problems to make them less of an issue.

4. Examining the deficiencies of different framings of this problem.
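For the first question, here is a minimal sketch of the kind of opinion-formation model I mean. The update rule, the uniform initial opinions, and the specific parameter values are all simplifying assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_opinions(n_people=200, model_opinion=0.9, model_share=0.3,
                      steps=200, learning_rate=0.05):
    """Each step, every person reads one piece of content: with probability
    `model_share` it was generated by a (biased) model holding `model_opinion`,
    otherwise it comes from a randomly chosen peer. The reader then moves
    their own opinion slightly toward what they read."""
    opinions = rng.uniform(0.0, 1.0, size=n_people)
    for _ in range(steps):
        from_model = rng.random(n_people) < model_share
        peers = opinions[rng.integers(0, n_people, size=n_people)]
        read = np.where(from_model, model_opinion, peers)
        opinions += learning_rate * (read - opinions)
    return opinions

# The model's share of generated content pulls the population mean toward its bias.
for share in (0.0, 0.1, 0.3, 0.6):
    print(f"model share {share:.1f}: mean opinion {simulate_opinions(model_share=share).mean():.3f}")
```

For the second question, here is a toy version of the train-on-your-own-outputs loop, with a simple categorical "model" standing in for a real language model; the vocabulary, corpus size, and number of generations are arbitrary. Because each generation refits on a finite sample of the previous generation's output, sampling noise compounds, and the distribution tends to drift toward lower entropy (and eventually collapses if you run it long enough):

```python
import math
import random

random.seed(0)
vocab = list("abcdefgh")
probs = {t: 1.0 / len(vocab) for t in vocab}  # generation 0: a uniform "model"

def entropy_bits(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

for generation in range(61):
    if generation % 10 == 0:
        print(f"generation {generation:2d}: entropy = {entropy_bits(probs):.3f} bits")
    # Generate a finite corpus from the current model, then refit the model on it.
    corpus = random.choices(list(probs), weights=list(probs.values()), k=100)
    counts = {t: corpus.count(t) for t in vocab}
    probs = {t: counts[t] / len(corpus) for t in vocab}
```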

Some related reading:

A systems approach to cultural evolution

Fairness and Abstraction in Sociotechnical Systems

Aligning Popularity and Quality in Online Cultural Markets

Popularity Signals in Trial-Offer Markets

Finally, here's a preview of a few cool pictures from my initial experiments on models of opinion formation; I'll be explaining these in a later blog post.