Research Processes and Inductive Linguistic Biases

 In general, I’ve been thinking about two main questions:

  1. Good ways of conducting research

  2. Formalizing the notion of “breaking language learning into synthetic tasks that capture most of the problem”

1

I’ve received a few pieces of advice for research processes that I’ve been reflecting on.

There are two ways to go about “getting something to work”. One is to fix the task, then throw every tool you can think of at it until your model does what you want. The other is to use a standard algorithm on fairly default settings, and keep adjusting the task, problem setup, etc. until things work.

Either approach is warranted in different cases. The first makes more sense on clearly defined tasks we know we want to do well on but haven’t managed to do well yet. The second is much better for exploratory work (“trying to take a vague idea of behavior you want to see and condensing that into a formal setup”), as it keeps your perspective clearer and doesn’t muddy the waters with all the things you could potentially try.

In practice you can also do a combination of both, but it’s good to keep in mind which one you are doing (and why) or you can end up losing a lot of time for very little return.

It’s also very important to keep in mind what questions you are trying to answer. Because there are so many experiments you could do, it’s important to prioritize experiments that have a high chance of telling you something useful. Exactly what “useful” means depends on your research direction and perspective, but as a guiding principle this forces you to think carefully about the questions you actually care about. Asking yourself “if this works, what would that tell me? How much do I care?” is very helpful for doing this filtering. Sam Altman recommended we spend 80% of our time on problem selection and 20% of our time on problem solving, and this seems like a decent way of implementing that advice.

In the last two weeks, I’ve gained a much better firsthand understanding of the importance of this advice. I’ve spent a lot of time in “experiment rabbit holes” (like a variant of a transformer that uses relative attention so you don’t need positional embeddings, or investigating the impacts of various model parameters on the loss curve) that were somewhat interesting, insightful, and educationally helpful, but not very productive for actually answering the questions I want to answer.

2

On that note, I've been spending a lot of time thinking about my research direction of "breaking language learning into synthetic tasks that capture most of the problem". This still seems like a decent research direction, but I've encountered a lot of nuance about how it will actually work in practice.

When trying to formalize a vague idea into a concrete theory, I've found it is helpful to approach it from two directions.

The top down perspective involves two things: reading what other people have done, and trying to clarify/detail out pieces of what you are trying to do. Other ideas are helpful for giving you additional perspectives you may not have thought of, and detailing is useful for clarifying ambiguities in your initial idea. This is mostly about giving you guidance for generating theories, and helping you clarify how you'll know when you find what you're looking for.

The bottom up perspective is all about making proposed formal theories, trying them out, and seeing where they break. Sometimes they will be too general, sometimes they will be too specific. That's okay. You commit to a theory, see where it doesn't meet what you want, and then propose another theory.

It's really important to do both of these things. If you do too much of the top down, you can get stuck in idea land and never end up making something concrete. When you actually produce a formal theory, you realize which things are actually difficult or easy to solve, and can make significant progress. On the other hand, if you do too much of the bottom up, you can end up not spending your time effectively, getting distracted by interesting but unimportant side details, and focusing on things that are too far away from the idea you were trying to capture to be useful.

This is the perspective I've learned for software and game design, but in my very limited experience it seems relevant to research as well. So, applying it to my research question:

Top down:

One way of framing my problem was stated very nicely in LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning. The idea here is that humans have "inductive biases". When humans see data from which they could draw multiple conclusions, they tend to favor certain kinds of conclusions, and those preferences are our inductive biases. For example, humans tend to prefer the simplest possible representations and explanations of real-world phenomena that are fairly consistent with the data.

In machine learning, the hope is that encoding human inductive biases into models will improve their performance on human tasks. For example, convolutional neural networks seem to be fairly effective at vision, and they encode the bias of "if you shift something a little bit the result should be the same" (known as translational invariance). I care about inductive biases as a way of giving us a mathematical approach to understanding human cognition and human-generated data, but they are also practically useful.
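
To make that bias concrete, here's a minimal sketch (my own toy example in NumPy, not taken from any of the readings) of the property a convolution encodes: shifting the input shifts the feature map by the same amount, and pooling over positions then removes the dependence on the shift altogether.

```python
# Toy illustration of the translation bias in convolutions (my own example).
import numpy as np

def conv1d(x, kernel):
    """Valid 1D cross-correlation: slide the kernel over every position."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=20)
kernel = rng.normal(size=3)

shifted_x = np.roll(x, 2)                        # shift the input by 2 positions
out = conv1d(x, kernel)
out_shifted = conv1d(shifted_x, kernel)

# The feature map shifts along with the input (equivariance, up to edge effects)...
print(np.allclose(out[:-2], out_shifted[2:]))    # True
# ...so pooling over the aligned positions gives a shift-independent summary.
print(np.isclose(out[:-2].max(), out_shifted[2:].max()))  # True
```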

So I decided to study linguistic inductive biases in detail. I started by reading a few introductory pieces about phonology (rules for sound orderings), rules for turning text into sounds, morphology (see chapter 2 of the referenced book), and various aspects of syntax. I'm still fairly new to much of this and so am not qualified to comment on the many heated debates that seem to be going on in linguistics right now, but my impression is that aside from some modern work, many of these fields focus on categorizing and understanding the underlying rules and systems, not necessarily on why they come about. This makes sense as a practical study of the mechanisms, but isn't quite as helpful for my purpose of learning what the human linguistic inductive biases are. However, there were still some interesting tidbits: in terms of the Chomsky hierarchy, phonology seems to be a strict subset of the regular languages, morphology has a few common rules like reduplication that are not regular, and syntax and grammar end up being much more complex (at least context free). I also found this quote really insightful:

"Scientific explanation of any complex biological information-processing system occurs at three levels: (1) a computational theory, which explains what is computed and why; (2) a representation for the input and output of the process and the algorithm for the transformation; and (3) the hardware implementation, or the device in which the representation and algorithm are physically realized [...] a [biological] algorithm is likely to be understood more readily by understanding the nature of the problem being solved than by examining the mechanism (and hardware) in which it is embodied." (Marr, David. 1980. Vision. W.H. Freeman and Company.)

(from Computational Phonology – Part I: Foundations).
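
Coming back to the Chomsky-hierarchy tidbit above, here's a toy illustration (my own construction, not from the readings) of the difference in power it implies: a simple phonotactic-style constraint can be checked with a fixed, finite amount of memory, while full reduplication (strings of the form ww) requires remembering an arbitrarily long prefix, which is exactly what a finite-state machine cannot do.

```python
# Toy contrast between a finite-state (regular) check and reduplication.

VOWELS = set("aeiou")

def no_adjacent_consonants(word):
    """A phonotactic-style constraint: checkable with two states of memory
    (was the previous symbol a consonant or not), i.e., by a finite automaton."""
    prev_was_consonant = False
    for ch in word:
        is_consonant = ch not in VOWELS
        if is_consonant and prev_was_consonant:
            return False
        prev_was_consonant = is_consonant
    return True

def is_reduplication(word):
    """Reduplication check (word == w + w): requires comparing the whole first
    half against the second, so the memory needed grows with the input."""
    n = len(word)
    return n % 2 == 0 and word[:n // 2] == word[n // 2:]

print(no_adjacent_consonants("banana"), no_adjacent_consonants("bna"))  # True False
print(is_reduplication("wikiwiki"), is_reduplication("wiki"))           # True False
```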

Here's also a really interesting picture of the two pathways our brain uses for turning text into sound, from A Computational Theory of Writing Systems. This theory seems very plausible: they've found people with brain injuries where one of these pathways was damaged, and who made exactly the kinds of mistakes you'd expect from having to rely almost completely on the other pathway.

[Figure: the two pathways from written text to sound, from A Computational Theory of Writing Systems.]

Anyway, to answer the question of how humans acquire language and what that says about our inductive linguistic biases, the most relevant area of work seemed to be language acquisition and development. I started with A Bayesian view of language evolution by iterated learning, which describes a computational model of languages being passed from one generation to the next. They found that without any other constraints, a language passed down across generations will slowly change and converge to whatever kind of language the learners assume is most common (their linguistic priors, i.e., their inductive linguistic biases). This notion of using learning over multiple generations to understand the inductive priors of a system seems really interesting more generally.
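
Here's a minimal simulation of that result (my own toy version, only loosely in the spirit of the paper's model): each generation sees a few utterances from the previous speaker, does Bayesian inference over a small set of candidate "languages", samples one, and produces data for the next generation. With learners that sample from their posterior, the distribution of languages seen across generations drifts toward the learners' prior, regardless of where the chain starts.

```python
# Toy iterated-learning chain: languages converge to the learners' prior.
import numpy as np

rng = np.random.default_rng(0)
languages = np.array([0.2, 0.5, 0.8])    # each "language" = probability of emitting symbol 1
prior = np.array([0.6, 0.3, 0.1])        # the learners' inductive bias over languages
n_utterances, n_generations = 5, 20000

current = 2                              # start the chain far from the prior's favorite
counts = np.zeros(len(languages))
for _ in range(n_generations):
    data = rng.random(n_utterances) < languages[current]   # previous speaker produces utterances
    k = data.sum()
    likelihood = languages**k * (1 - languages)**(n_utterances - k)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    current = rng.choice(len(languages), p=posterior)       # learner samples a language
    counts[current] += 1

print("prior:                 ", prior)
print("generation frequencies:", np.round(counts / counts.sum(), 2))
# The frequencies approach the prior, not the language the chain started with.
```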

Language Acquisition Meets Language Evolution has a much more nuanced take, and has many fascinating points that seem to contradict some of the notions I see commonly discussed in AI research. Really, I just recommend reading it in full, but here's a rough summary.

Problems of learning can be broken into four categories along two axes: Innate Constraints Dominant vs. Learning Dominant, and Understanding and Manipulating the World ("N-Induction") vs. Coordinating with Others ("C-Induction"). Here are some examples from animal learning and human development:

[Figure: examples of N-Induction and C-Induction problems in animal learning and human development, from Language Acquisition Meets Language Evolution.]

C-induction is significantly easier than N-induction. The reason for this is also an answer to how humans can learn from so little stimulus:
When the child aims to learn an aspect of human culture (rather than an aspect of the natural world), the learning problem is dramatically simplified—because culture (including language) is the product of past learning from previous generations. Thus, in learning about the cultural world, we are learning to ‘‘follow in each other’s footsteps’’—so that our wild guesses are likely to be right—because the right guess is the most popular guess by previous generations of learners.

In general, the idea the paper presents is that language is not the result of pre-built, evolved structures in the brain that encode some "universal grammar". The main argument against this that I found convincing is that some aspects of what seems to be a "universal grammar" shared between languages are fairly arbitrary. It is:

1. Unlikely that through a few mutations we would suddenly develop much more complex linguistic skills, and

2. Unlikely that all humans would have kept the same linguistic toolbox. We would have expected our innate language functionality to diverge a bit and adapt to the environment, but instead it seems that any human has the tools to learn any language.

For these reasons, they posit that instead, language itself develops through evolution over time: "Language is shaped by our minds". We see this in the case of simple pidgin languages that develop into complex creole languages, and also in cases where a language has emerged from scratch and developed complexity over time, as with Nicaraguan Sign Language. The idea is that languages initially form as simple things, and then the process of teaching them to successive generations refines them. Properties of languages that make them more teachable, or that help people communicate better, are more likely to be passed on. There are four kinds of constraints that contribute to this refinement process. To quote the authors:

Perceptuo-motor factors: The motor and perceptual machinery underpinning language seems inevitably to influence language structure. The seriality of vocal output, most obviously, forces a sequential construction of messages. A perceptual system with a limited capacity for storing sensory input forces a code that can be interpreted incrementally (rather than the many practical codes in communication engineering, in which information is stored in large blocks). The noisiness and variability (across contexts and speakers) of vocal or signed signals may, moreover, provide a pressure toward dividing the phonological space across dimensions related to the vocal apparatus and to ‘‘natural’’ perceptual boundaries—though such subdivisions may differ considerably from language to language and thus do not form a finite universal phonological inventory. 

Cognitive limitations on learning and processing: Another source of constraints derives from the nature of cognitive architecture, including learning, processing, and memory. In particular, language processing involves extracting regularities from highly complex sequential input, pointing to a connection between sequential learning and language: Both involve the extraction and further processing of discrete elements occurring in complex temporal sequences. It is therefore not surprising that sequential learning tasks have become an important experimental paradigm for studying language acquisition and processing (sometimes under the guise of ''artificial grammar / language learning'' or ''statistical learning''); and, indeed, some linguists have argued that some important crosslinguistic regularities arise from sequential processing constraints.

Constraints from thought: The structure of mental representation and reasoning must, we suggest, have a fundamental impact on the nature of language. The structure of human concepts and categorization must strongly influence lexical semantics; the infinite range of possible thoughts presumably is likely to promote tendencies toward compositionality in natural language; the mental representation of time is likely to have influenced linguistic systems of tense and aspect; and, more broadly, the properties of conceptual structure may profoundly and richly influence linguistic structure. While the Whorfian hypothesis that language influences thought remains controversial, there can be little doubt that thought profoundly influences language. 

Pragmatic constraints: Similarly, language is likely to be substantially shaped by the pragmatic constraints involved in linguistic communication. Pragmatic processes may, indeed, be crucial in understanding many aspects of linguistic structure, as well as the processes of language change. Levinson notes that ‘‘discourse’’ and syntactic anaphora have interesting parallels, which provide the starting point for a detailed theory of anaphora and binding. As we discuss further below, Levinson argues that initially pragmatic constraints may, over time, become ‘‘fossilized’’ in syntax, leading to some of the complex syntactic patterns described by binding theory. Thus, one of the paradigm cases for arbitrary UG constraints may derive, at least in part, from pragmatics.

Language develops over time as a multi-objective optimization problem, simultaneously trying to satisfy all of these objectives. Different objectives matter more in different places and so result in slightly different languages, and there is also an element of random chance, since multiple valid solutions can decently satisfy the constraints. But my understanding so far is that language itself arises from a more general-purpose cognitive capability we have.

This is where I'm at so far. I plan on reading some of the surveys they linked to for details of those constraints; ideally they can be formalized in a purely mathematical way and the impact of each one studied. These are also just the views of two authors; I'm aware there are other takes, and I'm trying to keep my mind open for now. Finally, I plan on doing more reading about human and animal cognitive inductive priors.

Bottom up:

I started out trying to formalize some task like "spelling", "grammar", "vocab learning", etc., and my impression is that many of them come down to grammar induction. Grammar induction can be loosely defined as follows: there is some function F that classifies strings as "valid" or "invalid"; given lots of example strings alongside their labels, we want to infer what F is.

By itself, this is just binary classification/learning a function from some hypothesis space. The first tweak is to say that F falls within the space of context-free grammars, or, in an even simpler case, the regular languages. The second tweak I find interesting is to transform the problem from a classification problem into an autoregressive one (predict the next character). In that case, "learning" F means recreating it, and the encoding of F itself might be hard to extract from your learned model, but that's okay. There are a few ways of doing this, but starting with regular languages, using finite state machines where each node produces a symbol, seemed like a good start. There are a few questions about what the distribution of input strings should be, and those questions seem to lead naturally into many questions about language learning, so this seemed like a good place to begin. I've run a few experiments with transformers learning in this simple setting, and plan on running more. There's also quite a bit of literature on learning regular languages from examples (see, for example, Bayesian Inference of Regular Expressions from Human-Generated Example Strings), so this gives me plenty of reading to do for finding natural extensions.
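
Here's a sketch of the kind of data-generating setup described above (the states, symbols, and transition probabilities are just illustrative): a probabilistic finite state machine where each state emits a symbol, used to produce strings that a model then learns via next-character prediction.

```python
# Toy symbol-emitting finite state machine for generating training strings.
import random

# state -> (emitted symbol, [(next state, transition probability), ...])
FSM = {
    "S": ("a", [("X", 0.5), ("Y", 0.5)]),
    "X": ("b", [("X", 0.3), ("END", 0.7)]),
    "Y": ("c", [("Y", 0.6), ("END", 0.4)]),
}

def sample_string(max_len=20):
    state, out = "S", []
    while state != "END" and len(out) < max_len:
        symbol, transitions = FSM[state]
        out.append(symbol)
        next_states, probs = zip(*transitions)
        state = random.choices(next_states, weights=probs)[0]
    return "".join(out) + "."            # "." marks the end of the string

# Autoregressive training pairs: every prefix predicts the next character.
for s in [sample_string() for _ in range(3)]:
    pairs = [(s[:i], s[i]) for i in range(1, len(s))]
    print(s, pairs[:3])
```

The strings this particular machine produces ("a" followed by a run of "b"s or "c"s, then an end marker) form a regular language, which matches the "start with regular languages" simplification.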

I also find myself coming back to the question of whether there are any scaling laws or theory about transformers solving problems in different complexity classes. For example, it seems they can do theorem proving, which is undecidable in general. So while they provably can't always do theorem proving, what that probably means is that they build up a "bag of heuristics", similar to what humans do. Since (under cryptographic assumptions) there are no heuristic-free NP-complete problems, it seems like they could also solve many instances of NP-complete problems, and I'm curious what the scaling laws are for doing such a thing. Presumably it's exponential in the input size, so you don't get much gain over brute force, but I don't know. The question is also interesting for other complexity classes. Essentially it's a question about what kind of capabilities a possibly Universal Learning Machine actually has. The no free lunch theorem means it can't be good at everything, so there must be some sort of tradeoffs, but I'm not aware of a good characterization of what those tradeoffs are yet, though Adrian de Wynter's work seems like a potentially promising approach.
