Posts

Starting up this blog again (MATS)

I’ll be attending MATS, studying mechanistic interpretability with Adrià Garriga-Alonso as my mentor. I’ve spent the last few months skilling up on this, and the program starts on Monday. I think a weekly or biweekly blog post would be useful for documenting what I learn. Broadly, it’s been interesting learning about mech interp. It’s a field where you rarely have concrete proofs; instead you find lines of evidence for and against a claim and try to build arguments on top of those. Those arguments may later be overturned, and that’s okay. This means that your “results” are less about what theories you hold and more about what evidence you found. The main areas I’m interested in (subject to change) are (I’ll talk about and explain these more in later posts if I end up working on them):
- Mechanistic interpretability of the steering vectors learned by Contrastive Activation Adding
- Implementing Activation Adding in some of the popular chatbot UIs (for example, emotion control) or m
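For readers who haven't seen Activation Adding before, here is a minimal sketch of the core idea: take residual-stream activations from two contrasting prompts, use their difference as a steering vector, and add it back in (scaled) during generation. The model, layer index, scale, and prompts below are illustrative assumptions, not the setup I'll actually be using.

```python
# Minimal sketch of activation steering, assuming a GPT-2-style HuggingFace
# model. The layer, scale, and prompts are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # residual-stream layer to steer (arbitrary choice)
SCALE = 4.0  # steering strength (arbitrary choice)

def resid_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token of `prompt`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Steering vector = difference between a "positive" and a "negative" prompt.
steering_vec = resid_at_layer("I feel joyful") - resid_at_layer("I feel miserable")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + SCALE * steering_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Today I", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(gen[0]))
finally:
    handle.remove()  # detach the hook so later calls are unaffected
```

As I understand it, Contrastive Activation Adding builds its vectors from many contrastive prompt pairs rather than the single pair above; this is just the smallest version that runs.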

My prediction of AGI in 2021-2023

In 2020 I made a prediction on Facebook that AGI was only 1-3 years away. Probably time to follow up on that. At the time, Gwern’s “You can only prove the presence of an ability, not the absence of it” seemed solid. Someone would say “GPT-3 can’t do X”, and then later someone would find a prompt where it could reliably do X! There was a lot of talk about becoming a good “prompt engineer”: being good at wording things in a way that causes language models to reliably do what you want. Relatedly, these days my favorite way to think about language models is as a Simulator (Janus popularized this framing here: https://generative.ink/posts/simulators/). Inside them are all these different persons/personalities, each with different abilities (try character.ai to experience this). Prompt engineering pulls out a specific simulated person and inhibits many others. If you get the wrong person, it’ll fail at things other simulated people are very capable of! But honestly, I think ChatGPT proved Gwe

Inductive Linguistic Biases

(unfinished blog post, posting for reference) Introduction Humans have "inductive biases". When humans see data that they could draw multiple conclusions from, they often pick one of those conclusions. There are certain kinds of conclusions that humans are more likely to come up with, and those conclusions are our inductive biases. For example, humans tend to assume the sequence (1,2,3) will be followed by (4,5,6,7,...) even though there are many other conclusions you could come to. These inductive biases are very important for understanding how humans can learn from so little data: many mechanisms in our society involve learning to coordinate, instead of learning the "correct" behaviour. Often these norms are determined by the previous generation's "best guesses", and since you share the same inductive biases, you are likely to make those same guesses, and so can learn to coordinate more easily. This means that if you can create AI models that have ve

Progressing on Open-Ended Interpretability through Programming Language and User Interface Design

Part of the reason that open-endedness research is so difficult is that many of the things AI agents create are weird. They aren't constrained to our ways of thinking, and so they'll find solutions that are really hard for us to understand. This makes it very difficult to diagnose where things have gone wrong, because you can hardly even tell what is going on. Maybe a system developed an entire culture and simulated an entire universe inside that uninterpretable mess of symbols. Maybe it's just utilizing noise and random combinations because that's easier than designing small things, and the simulation favors that. When we can't distinguish between these two cases, we have a bad problem and our research cannot go to space today. So how do we fix this? One answer is to just try and figure out what's going on anyway. I think that's a valid approach, and some of that will be needed. See the fantastic Circuits/Reverse Engineering Neural Networks work for some e

Current projects

I'm generally working in the area of trying to merge open-endedness, artificial creativity, and complex systems. I may update this post occasionally as I learn things, and will probably shift projects occasionally if things don't pan out. I'll try to have a policy of openly sharing my ideas because I don't care if I'm scooped. I just want cool stuff to be built, I don't care who does it (though don't worry, when I collaborate with others I don't use this policy because I understand others feel differently). My goal is to figure out if there's a way to make systems generate open-ended content that seems "human culture-like" in its form. While I like Stanley's approach and will use many of the insights of POET and its follow-ups, just body locomotion seems like it'll cap out in a way that's distinct from culture. Maybe there's a way to transfer via some reduction, but I'd like to see if it's possible to push

Open-ended concept creation through scaffolding

(The scholars program is over, but I'm going to continue doing independent AI research for at least a year, as I continue to build my research chops. My posts might be less frequent, as I also spend some of my time now doing game design; see my other blog) Most of my research direction is trying to determine if it's possible to create a language model with very little (or no) training data. Ideally, it should still have scaling laws. If we could do this, we could arbitrarily scale our models to continually increase capabilities, and it wouldn't be limited by the presence of data. Yet, in some ways this is nonsensical. The set of all distributions is very large, so a single model cannot "learn" them all. Thus, it must focus on learning a subset of all distributions. Somehow we'd need a way of specifying a distribution that isn't so large as to be intractable, while still containing human language and most of the things we care about. And even if we did that, th

OpenAI Scholars Project

My blog post is split into three parts:
- Literature Review (incomplete, as this is a very large field of study): Lenses of Analysis on Opinion Spread. This post is an older version that contains more detailed discussion and explains why I didn't include some things (like advertising), for those who are interested.
- Experiments and Analysis Section: Training models on their own outputs.
- Speculative Section: What we'll (probably) see before we build artificial cultures. This is just some research ideas and thoughts I had after doing lots of reading, take them with a grain of salt.