Research Direction

My chosen research direction is the one I've been discussing previously: I want to try to make synthetic language tasks that help transfer performance to real world language tasks.

In this post I'm going to motivate this problem from two perspectives: Transfer Learning, and Understanding Inductive Biases.

Transfer Learning

When you want to teach a model a skill, one way to do this is by feeding the model labeled (input, output) pairs. If you don't give it enough data, it'll be confused about what you wanted and won't properly learn the task. Eventually you can give it enough data and it'll figure out what you are trying to get it to do, allowing it to generalize to data it hasn't seen. One way to think of this is that the model starts out with a hypothesis space of "here are all possible things they might be trying to teach me", and as you feed it data, it can eliminate hypotheses. Eventually it's thrown out so many hypotheses that only your task remains.

In formal learning theory, "how much data do you need before this elimination process converges" is known as the sample complexity of your task.

For machine learning models, a more helpful mental picture is that the model has a "preference" value for every hypothesis, and it slowly adjusts those values until the correct hypothesis has the highest preference.

In the real world, this "hypothesis filtering process" isn't the entire picture. Humans are able to use their past experience to help them solve new problems. In machine learning, this is known as transfer learning (or multi-task learning, if you care about all tasks equally). From the sample complexity perspective, transfer amounts to making your model assume that future tasks will be similar to previous tasks. Because of the way this "narrowing down process" actually happens in machine learning models, this assumption is built in, which is easier to see from the "preference value" perspective.

When this assumption is true, first training on some task A means that training on some task B will require less data, because your model can much more quickly narrow down what you are trying to teach it. Thus, if two tasks are related, training on one will decrease the sample complexity of the other. That's the whole motivation behind transfer learning and multi-task learning. "Related" isn't quite the right word, since some tasks might be supersets or subsets of each other, but hopefully the idea is clear. One way to think about task similarity is to imagine tasks as points in space, with distance measuring similarity. Euclidean distance introduces distortions when the tasks have a tree structure, so it probably isn't ideal, but even so, this idea of embedding tasks in space to predict transfer has been verified experimentally. It should be noted that we still don't have a great formal theory of generalization: testing theories against what happens in practice has shown that all of them fail in some way. However, some are pretty good, and will be correct about 90% of the time.

The standard way of doing this is called generative pre-training (that's what the GPT stands for in GPT-3): gather massive amounts of data from the internet and train the model to produce new data from a similar distribution. For text, just predicting the next word captures this reasonably well, though alternative objectives work better for other kinds of data. The point is that this understanding of real world data is extremely useful because it eliminates a lot of the weird hypotheses, like "the task is to count the number of a's and subtract the number of b's". Sure, that is a task you could teach your network, and pre-training probably won't help you very much there. But most of the tasks we care about involve some language understanding, and pre-training lets the model start with that understanding already available.
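To make the next-word objective concrete, here is a minimal sketch of a single next-token prediction training step. Everything in it (the toy vocabulary size, the tiny LSTM standing in for a real transformer, the random "text") is my own illustrative assumption, not how GPT-3 is actually trained:

```python
import torch
import torch.nn as nn

# Toy next-token prediction: the model reads tokens [t0 .. t_{n-1}] and is
# trained to predict [t1 .. t_n] at every position.
vocab_size, d_model = 100, 32                      # assumed toy sizes
embed = nn.Embedding(vocab_size, d_model)
rnn = nn.LSTM(d_model, d_model, batch_first=True)  # stand-in for a transformer
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 33))     # a batch of random "text"
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # targets are shifted by one position

hidden, _ = rnn(embed(inputs))                     # (batch, seq, d_model)
logits = head(hidden)                              # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
print(loss.item())
```

The only thing that makes this "pre-training" rather than ordinary supervised learning is that the targets come for free from the text itself.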

However, pre-training is not without faults. In particular:

1. Pre-trained models can learn to be biased in the same way some humans are, such as being racist or xenophobic.

2. Only 10-100 million words are enough to teach models most of the language understanding we study and test for, yet it is clear that many more words than that are needed for the impressive features we saw in GPT-3, like few shot learning, more cohesive storytelling, and common sense reasoning. It would be nice to know what those additional skills are, to make model behaviour more predictable.

3. It's possible some skills we care about are underrepresented in the training data. GPT-3 still isn't great at mathematical reasoning, yet transformer experiments have shown that much smaller models are already capable of much more advanced mathematical reasoning. Thus, it's very possible mathematical reasoning just wasn't very important for doing well on real world data. This can be addressed by improving the training data, but it would be nice to have more fine-grained control over how much expertise we want models to have in different areas.

4. If you don't know what distribution you are teaching your model, it's difficult to answer questions about how well it'll generalize to problems outside that distribution. This is an AI safety problem that is relevant today, as it could lead to unpredictable behaviour that isn't what the designers intended.

5. There are real world problems that we humans are bad at, and other problems we just don't know how to solve. In order to solve these, we need superhuman performance at these tasks. An AI that learns from all of human knowledge is likely to be better than most humans at many things, but it is still limited, as superhuman performance is out of distribution. On the other hand, synthetic tasks might be scalable past human performance, allowing us to decrease the sample complexity of the real world problems we care about. Thus, they can provide a form of pre-training that is relevant in more domains than just "humany tasks".

There are people working on interpreting the inner workings of models after training, and that seems like a promising way to approach these issues. A complementary approach is examining the training data itself, and within that approach, creation of synthetic data seems like a relatively underexplored area. In theory, it would allow formalization of the skills we care about: if we notice a model is lacking in a particular area, the pre-training data could be augmented with new data capturing that skill.

Creation of synthetic training data is a common approach in robotics (using simulators), and is also studied in image recognition (see for example synbols or learning shapes). But for language it seems relatively underexplored. One comparable area is the reinforcement learning field of emergent language learning, which studies various games agents play with each other to learn language-like skills. This is promising as a way to augment your training curricula, but isn't as helpful for approaching some of the "interpretability of the training distribution" questions. The closest things I'm aware of to what I'm trying to do are probably recent work on regular expressions, LIME, and this 2006 PhD thesis by Hal Daumé III.

One reasonable objection is that active learning and automated curriculum design might be better at addressing some of these issues, while also training models more quickly. These techniques automatically find samples the model is bad at and have it focus on those specific samples. I agree with this objection, but synthetic task design is complementary to automated curriculum design. In particular, I think the strength of synthetic task design is in cases where there may not yet be enough real world training data. I also think there are more precise questions you can ask about synthetic data that you can't ask of "clustered real world data" in quite the same way, and some of those questions may help us understand the deficiencies of our current models. This leads me to my second point:

Inductive Biases

I'll be sharing a more detailed post later about what some of the major inductive biases seem to be; in the meantime I highly recommend Inductive Biases for Deep Learning of Higher-Level Cognition. The idea is that the "preference values" don't start out equal for every skill: humans have biases for how they generalize from a given set of data. Those biases help us learn without needing massive amounts of data, especially because others around us share the same biases. Things like language and culture were shaped by innate human biases, so "our best guess is right, because the right answer is the best guess of the last generation".

Some of the most successful machine learning models (LSTMs, Transformers/attention, CNNs) were motivated - or could be rediscovered - by trying to encode some human inductive bias we hadn't captured before. Aside from encoding them in architectures, it's also possible to teach these biases with data, and maybe at some point the only penalty is needing more parameters compared to the "native" architecture with the bias built in.

To find an inductive bias, it seems like the two main routes are to either:

1. Take the literature of some field that studies humans (economics, psychology, neuroscience, sociology, linguistics, etc.) and try to formalize one of its insights into an inductive bias.

2. Look closely at some problem current systems have, try to formalize what that problem is, and try to come up with solutions.

This is not an exhaustive list, but my impression so far is that those are the two most common routes. Once you have a proposed inductive bias, you need to show that it is a useful inductive bias. To do this, people usually use benchmarks.

The idea is that model tweaks that need less data to get comparable performance on some human task have probably captured new inductive biases for that task. If no model has managed to get 99-100% accuracy (and for classification tasks you should also consider precision and recall separately), you probably just don't have enough data for your current inductive biases to completely learn the task. Improvements then come either from improving those biases or from increasing the data. Tweaks that help with a large range of human tasks probably capture a more fundamental human inductive bias.

Also, the number of parameters matters, as we can think of additional parameters as expanding the range of possible tasks the model can do. This is okay because regularization constrains models to start with simple tasks and only learn more complicated ones once the simple ones have been ruled out. Because of the random initialization of weights, it's also possible that more parameters increase the chance the model contains a subnetwork that already almost solves the task - this is called a lottery ticket - and all it has to do is some fine tuning to get exactly what you want. Anyway, the point here is that you can optimize for benchmark performance, optimize for needing as little data as possible, or optimize for needing as few parameters as possible. Those three approaches all give useful insights about sample complexity.
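As a rough sketch of the "optimize for needing as little data as possible" framing, here is how you might compare the sample efficiency of two models by training them on increasing amounts of data. The scikit-learn classifiers and the synthetic classification task are stand-ins I chose for brevity, not the language benchmarks discussed above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in task: which model reaches a given accuracy with the least data?
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for n in [50, 200, 1000, 4000]:              # increasing training set sizes
    for name, model in models.items():
        model.fit(X_tr[:n], y_tr[:n])
        acc = model.score(X_te, y_te)        # held-out accuracy
        print(f"{name:6s} n={n:4d} accuracy={acc:.3f}")
```

Under this framing, whichever model's curve rises faster is better matched to the inductive biases the task needs.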

By designing synthetic tasks that try to decrease the sample complexity of pre-training, I hope to make a little more progress towards a more formal theory of human inductive biases. Unlike models, which are difficult to formally analyze, formal tasks are more amenable to theoretical analysis. Even in the cases where they are too complex to be formally analyzed, formal tasks are still easier to ask concrete experimental questions about, as you have full control over the different aspects of the data.

Experiments

This gives some good background on why I'm working on this, but I haven't really explained precisely what I'm going to do. This is somewhat intentional. The goal of formalizing inductive biases as synthetic tasks is very exploratory, and my initial experiments will be about validating that it's a workable approach using really basic synthetic tasks (if not, my fallback is trying to experimentally relate scaling laws to VC dimension). My loose research goal is currently to train a model that is much smaller than GPT-3, but better than GPT-3 at some concrete language task, like dealing with negations or simple reasoning. I would not use any architectural innovations; this would be purely through using synthetic data as part of the training curriculum.

However, there are a few important points that are worth covering when talking about synthetic task design.

Extraction

For each task, there seem to be two distinct pieces:

- Formally defining the task

- Extracting instances of that task from real world data

For example, consider the task of learning the spellings of words. We can define this task in a few different ways (using automata, an n-gram model, etc.). Once we do that, we can construct a curriculum of progressively more difficult tasks and, if desired, understand the scaling laws of these tasks. Once a model is good at tasks that seem comparable in difficulty to English spelling, we can compare transfer performance to learning on real world text (and vice versa: see how a model trained on real world text does on these synthetic tasks at different levels of complexity).
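As a toy illustration of the "define the task with automata" option, here is one hypothetical way to generate spelling-like words from a small DFA and sort them into a crude length-based curriculum. The particular states, transitions, and difficulty measure are placeholders I made up, not a finalized design:

```python
import random

# A tiny DFA over made-up spelling rules: each state lists (character, next_state) edges.
DFA = {
    "start":           [("b", "after_consonant"), ("s", "after_consonant"), ("a", "after_vowel")],
    "after_consonant": [("a", "after_vowel"), ("e", "after_vowel"), ("o", "after_vowel")],
    "after_vowel":     [("n", "after_consonant"), ("t", "after_consonant"), ("", "end")],
}

def sample_word(max_len=8):
    """Walk the DFA, emitting characters until it reaches 'end' or max_len."""
    state, word = "start", ""
    while state != "end" and len(word) < max_len:
        char, state = random.choice(DFA[state])
        word += char
    return word

# A crude curriculum: treat longer words as harder and present them later.
words = [sample_word() for _ in range(1000)]
curriculum = sorted(set(words), key=len)
print(curriculum[:5], curriculum[-5:])
```

A real version would need a much richer automaton (or an n-gram model fit to real spellings), but the knob being turned is the same: the definition of the generator controls the difficulty of the task.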

However, it would be better to control for the other tasks that exist in real world text but are not part of our skill, such as grammar, word meaning, or reasoning. Doing that properly requires extracting real world English spellings into instances of our synthetic task, which means scraping real world data and performing analysis on it. For more complex tasks like reasoning, I expect this extraction process to be particularly difficult, and it might be necessary to just rely on other forms of analysis.

It's also very possible that a large portion of the improvement in loss comes from just learning better information about the world, which is only interesting insofar as it represents novel causal reasoning being developed. To counter this effect, one thing I can do is generate a set of data based on the pre-training data points whose loss improves when my task is learned. Unfortunately this is a self-fulfilling prophecy (of course we expect transfer, we chose points that were already exhibiting transfer); however, by examining those points we could create a more "real world"-ish dataset. This is an alternate way of doing data extraction. For example: create a synthetic reasoning task, look at the real world tasks the model gets better at, hypothesize the overall pattern, create more language examples that fit that pattern, and test on those. This is still a bit methodologically sketchy, and I'm still thinking about good ways of doing this process without it being self-fulfilling.

Real data transformation

An alternate approach to extraction is to try and transform real world language data to “remove” the more difficult tasks. For example, if you are trying to teach a model word spellings, you can shuffle the order of the words. This would partially remove some effects like grammar and reasoning. While this is not as controlled as task extraction, it’s still a promising alternative, and also allows you to construct “synthetic tasks” (I’ll refer to these as “transformed tasks” to make the distinction clear) in a different way.

These transformed tasks can be useful for seeing how important a particular skill is in language. For example, if a model can transfer quickly to real world text after being trained on a transformed task, the stuff you transformed away probably didn't make up the majority of what makes language learning difficult. A few examples of transformed tasks (a minimal code sketch of a couple of these follows the list):

- Replacing all words with their parts of speech

- Shuffling (or sorting) the letters inside words

- Replacing all nouns with the same noun

- Removing all words except nouns
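Here is that sketch: two of the transformations above applied to a toy sentence (the sentence and the exact implementations are just my illustration):

```python
import random

def shuffle_word_order(text):
    """Keep the words but destroy sentence structure (grammar, reasoning)."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def shuffle_letters_in_words(text):
    """Keep word order but destroy spelling information."""
    def scramble(word):
        letters = list(word)
        random.shuffle(letters)
        return "".join(letters)
    return " ".join(scramble(w) for w in text.split())

sentence = "the quick brown fox jumps over the lazy dog"
print(shuffle_word_order(sentence))
print(shuffle_letters_in_words(sentence))
```

The appeal is that each transformation is cheap to apply to arbitrarily large amounts of real text, while removing a reasonably well-defined slice of the skills that text normally teaches.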

Tasks

(I try to write this blog to make it somewhat approachable to non-technical people, but in this section that's really difficult without making it more verbose than necessary, sorry about that)

Here is the set of tasks I am currently considering, in order from least speculative to most speculative. These are intentionally slightly vague: I find that clarifying them requires justifying lots of particular decisions that are fairly ad hoc, so it makes more sense to guide those decisions through experimental verification.

I'm aware that many of these are subsets of other more general tasks, which are themselves subsets of more general tasks. The reality is that much of synthetic task design is about picking the right level. If you go too high level, everything becomes intractable and your biases aren't particularly transferable to real world problems. I don't think there's really an issue with going too low level.

Anyway, here's the list. I'll probably be updating this over time, and I plan on making a more detailed post eventually, once I have some experimental results to justify the particular decisions.

- Memorization: an important sort-of baseline skill, but one that also involves questions like how many times the model needs to see a sample to memorize it, what factors make a piece of data easy or hard to memorize, and how these answers change over time
- Spelling (DFAs), with data biased to be compositional
- Grammar (Grammar Induction)/Sequence Learning, also with data biased to be compositional (a toy generator sketch follows this list)
- Word meaning: learning synthetic word vectors for progressively more complicated context
- Mathematical reasoning (LIME and related setups), more generally considering subsets of substitution rules since Metamath can phrase all of mathematics as substitution rules. Also consider a simple variable binding type task (Legg, another scholar, is investigating variable binding in more detail), and variable negation.
- Partial Information settings where discourse is based on hidden information, and learning the rules requires inferring that hidden information
- Few shot learning variants of the above tasks
- Generalization through inferring relations between two pieces of data
- Hypotheses creation via translation from one grammar to another, where the translation is underspecified so the model needs to invent rules
- Mutual Exclusivity assumptions
- Variants of structure mapping theory that account for recent insights about honing
- Associative/Causal Reasoning, based on models of cognition and world models constructed from models of synthetic culture

Because this is a fairly obvious thing to try from a linguistics point of view, I don't expect to find simple computational problems that teach most of the needed skills. Instead, my guess is that the problems are either going to be computationally difficult (and thus have pretty spicy scaling laws), or the search space will be so large that using learning algorithms to help me navigate it will be necessary. Lots of structured prediction theory becomes relevant. As mentioned above, there's also the possibility that most of what is learned is just world knowledge. In that case, extraction/transformed tasks will be needed to capture most of the improvement in loss, and comparing to pre-training directly becomes less useful. Whatever the case ends up being, much of my research is about trying to explore and understand whether this research direction is feasible and, if so, where the difficult problems actually lie. I expect many of the insights from developmental robotics to be useful here, as one way to phrase this research direction is "developmental natural language processing".

In practice, I don't have time to approach everything here, so after a month of preliminary exploratory experiments, I'm going to narrow down to a particular question that seems promising and try to investigate it in detail for the last 2-3 months.
