Starting up this blog again (MATS)

I’ll be attending MATS, studying mechanistic interpretability with Adrià Garriga-Alonso as my mentor.

I’ve spent the last few months skilling up on this, and the program starts on Monday. I think doing a weekly or biweekly blog post would be useful for documenting what I learn.

Broadly, it’s been interesting learning about mech interp. It’s a field where you rarely have concrete proofs; instead, you find lines of evidence for or against a claim and try to build arguments on top of those. Those arguments may later be overturned, and that’s okay. This means your “results” are less about what theories you have and more about what evidence you found.

The main areas I’m interested in (subject to change) are:

(I’ll explain these in more detail in later posts if I end up working on them)

- Mechanistic interpretability of the steering vectors learned by Contrastive Activation Addition (CAA; there’s a rough sketch of activation addition after this list)

- Implementing activation addition in some of the popular chatbot UIs (for example, emotion control), or making an instruction-following vector + Hermes (to get more people interested in improving activation addition)

- Making automated circuit discovery do more of the typical analysis steps / making the output nicer (it would be cool to have ACDC output a Jupyter notebook full of all the evidence it found, based on the data you gave it)

- Automatic discovery of multi-token entities, based on detecting when the first few layers shove meaning into an entity’s last token (and when that meaning is ablated away, the meaning at later tokens changes)

- Each head in each layer “enriches” tokens with additional meaning. Look at the statistics of this over a large dataset. Given any token, visualize its enrichments: what do the top PCA components unembed into? What are the statistics of unembedding each enrichment (scaled by its magnitude)? (There’s a rough sketch of this below.)

(LLMs also do some garbage enrichment, like confidently attaching a sport to someone who doesn’t play sports. This isn’t an issue for the LLM itself, because it won’t look up those wrong enrichments: its lookups are ANDs over name and sport. But it does mean I’ll get many irrelevant enrichments. Can I filter based on lookups?)
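Since a couple of these ideas build on activation addition, here’s a minimal sketch of the core move, using GPT-2 and PyTorch forward hooks as stand-ins; the layer, scale, and contrastive prompts are illustrative choices on my part, not anything taken from the CAA paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which block's output to steer (illustrative choice)

def block_output(prompt):
    """Residual-stream activation after block LAYER, at the last token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]  # hs[0] is the embedding output

# The steering vector is just the activation difference of a contrastive pair.
v_steer = block_output("I am feeling very happy.") - block_output("I am feeling very sad.")

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    # For simplicity the vector is added at every position and decode step.
    return (output[0] + 4.0 * v_steer,) + output[1:]  # scale picked by eye

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
ids = tokenizer("Today I went to the park and", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```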
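And here’s a rough sketch of one way to start on the enrichment idea, under the simplifying assumption that we look at a whole layer’s write to the residual stream rather than a single head; the token, layer, and tiny corpus are placeholders:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 4                                       # illustrative choice
TOKEN_ID = tokenizer(" Paris")["input_ids"][0]  # token whose enrichments we study

corpus = [  # placeholder; you'd want a large dataset for real statistics
    "I visited Paris last summer.",
    "Paris is the capital of France.",
    "The Paris marathon starts early in the morning.",
]

writes = []
for text in corpus:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    delta = hs[LAYER + 1] - hs[LAYER]  # what block LAYER wrote, per position
    for pos in (ids[0] == TOKEN_ID).nonzero().flatten():
        writes.append(delta[0, pos])

X = torch.stack(writes)  # (n_occurrences, d_model)
_, _, V = torch.pca_lowrank(X, q=min(3, X.shape[0]))  # centers X internally

# Crude logit-lens-style unembedding: which vocabulary each direction pushes toward.
with torch.no_grad():
    for i, direction in enumerate(V.T):
        logits = model.lm_head(model.transformer.ln_f(direction))
        top = logits.topk(5).indices.tolist()
        print(f"PC{i}:", [tokenizer.decode([t]) for t in top])
```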

I’ll also probably make a few blog posts just documenting some of the cool stuff I’ve learned in my readings.

So far, I really enjoy mech interp research! It has short feedback loops, which suits me better, and there’s been some major progress in a relatively short period of time! There seem to be many things to do, and I’m looking forward to my time at MATS :)

Otherwise, I’ve been working on reproducing the Stanford LLM AI sims project (generative agents), but a little more general and with open-source LLMs. I may make a post sometime on all the stuff I’ve learned in the process of doing that.
