Showing posts from January, 2024

Starting up this blog again (MATS)

I’ll be attending MATS, studying mechanistic interpretability with Adrià Garriga-Alonso as my mentor. I’ve been spending the last few months skilling up on this, and the program starts on Monday. I think doing a weekly/biweekly blog post would be useful for documenting what I learn.

Broadly, it’s been interesting learning about mech interp. It’s a field where you rarely have concrete proofs; instead, you find lines of evidence for or against a claim and try to build arguments based on those. Those arguments may later be overturned, and that’s okay. This means that your “results” are less about what theories you have and more about what evidence you found.

The main areas I’m interested in (subject to change) are listed below. I’ll talk about and explain these more in later posts if I end up working on them.

- Mechanistic interpretability of the steering vectors learned by Contrastive Activation Addition
- Implementing Activation Addition in some of the popular chatbot UIs (for example, emotion control) or m