Inductive Linguistic Biases

(unfinished blog post, posting for reference)

Introduction

Humans have "inductive biases". When humans see data that they could draw multiple conclusions from, they often pick one of those conclusions. There are certain kinds of conclusions that humans are more likely to come up with, and those conclusions are our inductive biases. For example, humans tend to assume the sequence (1,2,3) will be followed by (4,5,6,7,...) even though there are many other conclusions you could come to.

These inductive biases are very important for understanding how humans can learn from so little data: many mechanisms in our society involve learning to coordinate, rather than learning the "correct" behaviour. Often these norms are determined by the previous generation's "best guesses", and since you share the same inductive biases, you are likely to make those same guesses, and so can learn to coordinate more easily.

This means that if you can create AI models that have very similar inductive biases to humans, it will be fairly easy for them to learn these coordination norms, as they'll naturally make the same guesses as humans. Thus, there's a strong motivation to learn what those inductive biases are. Once we understand them, we can encode them into models, either by building them into the architecture, or by encoding them into pre-training tasks (once the model is good at those tasks, it'll have picked up those biases).

From a more speculative perspective, I also think these inductive biases have a good chance of giving insight into how human intelligence works. Ideally, the pre-training tasks would be open-ended, allowing for arbitrary improvement past human performance. For example, AI agents that create and use languages much more sophisticated than human language would find human language fairly easy to pick up. This might also help us frame "natural human problems" in a formal theoretical light, which would be really useful for AI safety questions. In other words:

"Scientific explanation of any complex biological information-processing system occurs at three levels: (1) a computational theory, which explains what is computed and why; (2) a representation for the input and output of the process and the algorithm for the transformation; and (3) the hardware implementation, or the device in which the representation and algorithm are physically realized [...] a [biological] algorithm is likely to be understood more readily by understanding the nature of the problem being solved than by examining the mechanism (and hardware) in which it is embodied." (Marr, David. 1980. Vision. W.H. Freeman and Company.)

Inductive biases arose because they helped us solve problems in the world, and I want to characterize what those problems are. Not the specific problems found in nature, but refined, simpler versions with scalable difficulty that capture the important bits.

There seem to be two levels to inductive biases: the brain level and the language level. I'm not an expert on these topics and don't have a "horse in the race" in some of these very heated debates; I'm just sharing my current perspective based on the readings I've done.

Brain level

In Cognitive Gadgets: The Cultural Evolution of Thinking, Heyes presents a fairly compelling argument that the set of biologically hardwired things in humans is fairly small, and that most of our abilities come through cultural learning. Thinking about thinking, trying to interpret others' mental states, and language are all socially learned mental technologies. The hardwired "biases" in humans that help us develop and learn these social technologies are:

  1. We are less aggressive than other primates, and less likely to cause aggression in others (allowing us to be more social)
  2. We have a stronger social motivation drive, which is initially rooted in finding pleasure when our actions have predictable effects
  3. We have stronger biases towards focusing our attention on faces and voices, allowing us more opportunities to learn
  4. We have more powerful cognitive systems.

How do our cognitive systems work? They seem to have two main components: Associative Learning and Executive Function.

Associative Learning

Associative Learning considers two types of events: brain inputs (stimuli) and actions. Both events are (roughly) represented by the brain's firing patterns when the events occur. The brain builds "associations" between events: stimulus -> stimulus associations, stimulus -> action associations, and action -> stimulus associations. Presumably there are also action -> action associations.

The evolutionary history goes like this:
  1. Associative learning worked by just associating things that occurred nearby in time
  2. If events A and B were associated, but one happened without the other, the association was decreased. More generally, the association strength became proportional to the conditional probability P(A | B), and from an external perspective this looks like estimating probabilities by creating generative models of events and sampling from those models (unless we are using explicit reasoning to override these signals)
  3. If an event is associated with lots of things (i.e. it seems like a very predictive signal), it can develop associations with new things more easily
Step three happened millions of years ago and exists in most animals today. Primates might be capable of forming more associations than other animals, but it's unclear if the human species is specifically better at associative learning.
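
To make the conditional-probability framing in step 2 concrete, here's a minimal toy sketch (my own construction, not something from Heyes) of association strength tracked as an estimated conditional probability:

```python
# Minimal toy model (my own, not from the book): association strength between
# events tracked as an estimate of the conditional probability P(A | B).
from collections import defaultdict

class AssociativeLearner:
    def __init__(self):
        self.count_b = defaultdict(int)    # how often event B occurred
        self.count_ab = defaultdict(int)   # how often A followed B

    def observe(self, b, a=None):
        """Record that event `b` occurred, optionally followed by event `a`."""
        self.count_b[b] += 1
        if a is not None:
            self.count_ab[(b, a)] += 1

    def association(self, b, a):
        """Association strength ~= P(a | b), estimated from counts."""
        if self.count_b[b] == 0:
            return 0.0
        return self.count_ab[(b, a)] / self.count_b[b]

learner = AssociativeLearner()
for pair in [("bell", "food"), ("bell", "food"), ("bell", None)]:
    learner.observe(*pair)
print(learner.association("bell", "food"))  # 2/3: weakened by the unpaired trial
```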

We have a lot of inputs, and are not capable of processing all of them at once, so we work on associations between a smaller subset of them. The following criteria cause us to focus our attention more on an action or stimulus:
  1. Are they "face-like" or "speech-like"? More generally, does the input seem "biological"?
  2. Are they consistent with our emotional state? (anxious and depressed moods bias us to focus on threatening inputs, positive moods bias us to focus on attractive and rewarding inputs)
  3. In the past, have they resulted in some kind of reward for us?
  4. In the past, were they useful for predicting other things?
  5. Are they unexpected according to our internal models? (curiosity driven attention based on prediction error)
  6. Are they very different from their surrounding stimuli? (this might just be 5, because we build a model for what are "normal" levels of variation in inputs and these stimuli are unexpected according to those models)
Because what we focus on is affected by the associations, and what associations we make is affected by what we focus on, they amplify each other in a feedback loop.
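
Here's a toy illustration of that feedback loop (my own sketch, purely to make the dynamics concrete):

```python
# Toy sketch of the attention/association feedback loop (my own illustration):
# stimuli that have been predictive (criterion 4) or surprising (criterion 5)
# get more attention, and more attention means larger association updates.
import random

attention = {"face": 1.0, "noise": 1.0}      # initial attention weights
association = {"face": 0.0, "noise": 0.0}    # association with a reward signal

def update(stimulus, reward, surprise, lr=0.1):
    # The association update is scaled by how much attention the stimulus gets.
    association[stimulus] += lr * attention[stimulus] * (reward - association[stimulus])
    # Attention grows for stimuli that turn out to be predictive or surprising.
    attention[stimulus] += lr * (association[stimulus] + surprise)

for _ in range(50):
    update("face", reward=1.0, surprise=random.uniform(0.0, 0.5))
    update("noise", reward=0.0, surprise=0.0)

print(attention, association)  # "face" ends up with more attention and stronger associations
```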

Executive Function

Executive function is composed of three systems:
  1. Inhibitory control is capable of modifying attention, emotions, thoughts, and behavior to tweak the impulses given by associative learning
  2. Working memory holds the "current thing we are thinking about"
  3. Cognitive flexibility guides the inhibitory control and working memory's function to change our perspective on a task, or change our current goal
Associative learning happens mostly subconsciously and can form many different associations at the same time, while executive function is thought to underlie our conscious thought and is mostly sequential and slower.

There's some evidence that executive function is more developed in humans; at a minimum, it seems to be connected to more things.

Language

Language Acquisition Meets Language Evolution proposes that language itself develops through evolution over time. "Language is shaped by our minds". We see this in the case of simple pidgin languages that develop into complex creole languages, and we also see it in cases where a language has emerged from scratch and developed complexity over time, as with Nicaraguan Sign Language. The idea is that languages initially form as simple things, then the process of using language and teaching it to successive generations refines the language itself. Properties of languages that make them more teachable, or that help people communicate better, are more likely to be passed on. They list four attributes that contribute to this refinement process, but I'm going to break them apart a little more and give further thoughts on them.

Constraints on the signal transmission mechanism:

Communication Bandwidth: There are only (roughly) four channels: pitch, resonance, sound type, and volume. Since resonance and similar sound characteristics vary based on biology, they wouldn't be very helpful for language, but relative pitch changes, sound type, and volume are all used for communicating various things, and are used more or less in different languages. Note that hand gestures and facial expressions give additional ways of communicating, and the use of speech itself doesn't seem to be innate, as sign languages seem to form and function just as well as speech.

Noisy signal: Different people say things slightly differently, the environment can be noisy, and some sounds are more easily confused with each other than others. This reduces the amount of information that can be transmitted in a period of time and leads to some redundancy in speaking.

These constraints lead to:

- Limitations on how much information can be transmitted in a period of time

- Communication being mostly one thing at a time

- Only a subset of the possible ways of communicating information are actually used in a language. For example, English doesn't use pitch to distinguish words, whereas many other languages do.

Though honestly, I don't think these constraints are super relevant because processing constraints in the next section also have the same effect and are much stronger.

Constraints on processing the signals:

Associative memory is trying its best to form associations between groups of sounds and work with executive function to draw conclusions, but it isn't perfect. Humans seem to be best at inferring grammatical rules when there is some perceptual difference between the classes.

There are a few major limitations:

Memory: We can't perfectly remember what we hear, so we have to remember only specific things on the fly, and predict which things are important to remember.

Sequential processing: Processing a sentence takes thinking effort, so we have to strategically allocate our mental efforts to understanding what is most important in a sentence. This also means that in speaking, it's important to say things in a way that can be reasonably understood by others, which often requires some context and knowledge of what they already understand.

Inference Restrictions: Humans seem to be much better at grouping two words into the same grammatical category when there is some perceptual difference between the categories (more on this in the acoustic-cues discussion below).

Thought: Our mental representations of concepts influence how we talk about them. For example, compositionality is a way of dealing with a practically infinite number of thoughts, and the way we think about time influences tense. However, most of these "thought mechanisms" predate language. (I'd like to understand this one more; it feels like it actually covers lots of different aspects.)

These constraints lead to:

- Language needs to be somewhat redundant and transmit info more slowly than it theoretically could be transmitted, because processing speed is the bottleneck (which leads to all of the same limitations that the constraints on the signal transmission mechanism impose)

- Language needs to have a predictable structure, so we don't need to spend processing effort on it (grammar, syntax, morphology, parts of speech, etc.). Some aspects of this structure, such as parts of speech, are innate and related to how we think about things, and so are shared between languages. Other aspects of the structure are arbitrary (simply having a predictable pattern is enough; which specific pattern it is doesn't really matter, since many would work), which leads to natural language variation. Many "universal" linguistic principles like pronoun binding rules can be explained from this practical perspective: for example, the pronoun binding rules help the brain figure out what pronouns are referring to. But pronoun binding can also be influenced by our mental concepts: in "After a wild tee-shot, Ernie found himself in a deep bunker", "himself" refers to his golf ball.

Infant Language Learning Experiments

Researchers have done various experiments presenting synthetic languages to infants and seeing how the infants react to them. The goal is to learn about the mechanisms that are available for language learning very early in life, since synthetic utterances can be constructed to test only one specific variable.

The first experiment was a test of "word boundaries", but it really seemed to me like a test of distribution learning. The idea is that after a given sound is uttered, some sounds are more probable than others. The hypothesis is that we use those probabilities to determine how valid a word is, and then use that validity test to determine word boundaries. To test only this aspect, they chose a random probability distribution over sounds, then played 8-month-old infants a two-minute recording of sounds drawn from that distribution. Afterwards, they presented "words" (groups of sounds with high probability under that distribution) and "non-words" (groups of sounds with low probability under that distribution) and watched whether the infant turned their head, which is apparently a decent measure of preference. They found that not only do infants learn to prefer high-probability words under the made-up distribution, they also show preference towards words that are closer to the distribution of their mother's language. These kinds of statistical learning patterns even seem to be applied to non-linguistic things like musical tone sequences, so "predicting the next token, and using that to determine word validity" seems to be a more general-purpose mechanism.

Later experiments extended this to test whether 12-month-old infants could determine which sentences were grammatical and which were ungrammatical, given just a subset of strings generated from a synthetic grammar. The infants were able to generalize the rules to classify unseen strings as grammatical or ungrammatical, and later experiments showed that infants were even able to recognize the grammatical pattern when all of the words were replaced with different words (without needing to retrain!).

For adult learners, this ability even generalizes across domains: they can be shown a sequence of strings following some grammatical pattern, and after learning that pattern, they will recognize it happening in tones and spoken words. This transfer ability also happens in the opposite direction.

Here are the two grammars they tested infants with:

[figure showing the two artificial grammars omitted]

However, the ability to "pattern match to grammars" in humans is not perfect. There seem to be two types of abstraction:

Pattern-based abstraction uses rules like equals, less than, or greater than to infer relationships. Sequences that go ABA are recognizable because the first and last sounds are equal. Many animals seem capable of recognizing greater-than/less-than relationships in various kinds of stimuli and generalizing this relationship to novel stimuli. These kinds of rules are tied to our perceptions, as "equals" and "less than" are relative to how we perceive the stimuli.

Category-based abstraction does a similar thing, but "equals" is determined based on category instead of perceptions. These categories might be things like parts of speech, object types, etc. 
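
Here's a toy sketch of the distinction (my own illustration, not from the papers): a pattern-based rule like ABA only needs equality tests over the raw tokens, which is why it transfers to never-before-seen syllables, while a category-based version compares category labels instead.

```python
# Toy illustration of the two abstraction types (my own sketch, not from the papers).
def matches_aba(seq, category=None):
    """True if the 3-element sequence follows the ABA pattern.

    Pattern-based abstraction: compare the raw (perceptual) tokens directly.
    Category-based abstraction: compare category labels instead (pass `category`).
    """
    a, b, c = (category[x] for x in seq) if category else seq
    return a == c and a != b

# Pattern-based: works for syllables never seen during "training".
print(matches_aba(["ga", "ti", "ga"]))   # True
print(matches_aba(["wo", "wo", "fe"]))   # False (this is ABB)

# Category-based: "equals" means same part of speech, not same sound.
pos = {"dog": "NOUN", "runs": "VERB", "cat": "NOUN"}
print(matches_aba(["dog", "runs", "cat"], category=pos))  # True: NOUN VERB NOUN
```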

TODO: Explain the negative result where without perceptual things humans have trouble distinguishing patterns.

First and second experiment:

Take 12 different syllables. There are two "languages"; each language has four words, each made of a randomly chosen set of 3 of the 12 syllables, such that no two words in a language share any syllables. A text-to-speech device presented the words in a completely random order, the only constraint being that the same word could not occur twice in a row.

This means that the transitional probability P(y | x) = (frequency of xy) / (frequency of x) is 1.0 for transitions inside words, 1/3 for P(start of a different word | end of a word), and 0 for P(start of the same word | end of that word). This allows creating nonsense "part-words" by recombining syllables across word boundaries.
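
As a sanity check, here's a quick simulation of that setup (my own sketch; the syllables and words are made up, not the ones from the study) showing how the within-word and across-word transitional probabilities fall out of the counts:

```python
# Sketch of the word-segmentation setup (made-up syllables, not the study's):
# four 3-syllable words, played in random order with no immediate repeats.
import random
from collections import Counter

words = [("tu", "pi", "ro"), ("go", "la", "bu"), ("bi", "da", "ku"), ("pa", "do", "ti")]

# Build a long syllable stream with no word repeated twice in a row.
stream, prev = [], None
for _ in range(10_000):
    w = random.choice([x for x in words if x != prev])
    stream.extend(w)
    prev = w

# Transitional probability P(y | x) = count(xy) / count(x).
pair_counts = Counter(zip(stream, stream[1:]))
syll_counts = Counter(stream[:-1])

def tp(x, y):
    return pair_counts[(x, y)] / syll_counts[x]

print(tp("tu", "pi"))  # within-word transition, ~1.0
print(tp("ro", "go"))  # across-word transition, ~1/3
print(tp("ro", "tu"))  # a word never immediately repeats, so ~0.0
```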

A later experiment presented two of the words twice as often as the others, to test whether infants rely only on transitional probabilities to differentiate the words (and not on frequency of co-occurrence). They found that infants could rely on transitional probabilities alone. Later research showed that rats were able to do similar differentiation, but mostly relied on symbol co-occurrence instead of transitional probabilities, and were not able to infer more complex grammar-like rules.

They found that infants paid closer attention to "novel" non-words/part-words than to words. My hypothesis is that the learning mechanism works like this:

1. Construct a model to predict stimuli. This probability model is constructed by randomly sampling from past experiences, and those past experiences are weighted based on novelty. (The mechanism for choosing the initial seed, the mechanism for hopping around, and the mechanism for deciding when the model is "good enough" would all need to be detailed.)

2. New experiences that don't match current predictions (so we are curious about them) are attended to for longer, and both the longer attention and the higher surprise lead to the experience being stored in memory with a higher weight.
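
Here's a very rough, purely illustrative sketch of that hypothesized loop (this is my speculation, not something taken from the papers):

```python
# Purely illustrative sketch of the hypothesized mechanism above (my speculation):
# experiences are stored with novelty-based weights, and a simple predictive
# model is fit by sampling past experiences in proportion to those weights.
import random
from collections import Counter

memory = []  # list of (experience, weight) pairs

def surprise(model, experience):
    """Higher when the experience is unlikely under the current model."""
    total = sum(model.values()) or 1
    return 1.0 - model[experience] / total

def observe(experience):
    # Step 1: fit a model by sampling past experiences, weighted by novelty.
    sample = random.choices([e for e, _ in memory],
                            weights=[w for _, w in memory], k=50) if memory else []
    model = Counter(sample)
    # Step 2: surprising experiences are attended to longer and stored with higher weight.
    memory.append((experience, 1.0 + surprise(model, experience)))

for e in ["ba", "ba", "ba", "ga", "ba"]:
    observe(e)
print(memory)  # "ga" is stored with a higher weight than the later "ba"s
```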


A Comparative Perspective on the Role of Acoustic Cues in Detecting Language Structure describes how acoustic cues help with determining categories (turning category-based abstraction into pattern-based abstraction), not only for individual words but also for higher-level grammatical information. Acoustic cues can even override statistical regularities.

At 9 months old, infants use stress to override statistical info when segmenting speech streams, but 7-month-olds rely on statistical info even if it contradicts stress cues. Infants also use stress cues to guess the meanings (in the form of visual objects) of novel nonsense words.

Learners were not able to predict AxC patterns (where x is a variable middle element) unless they had prosodic contours like rising pitch at the beginning and longer syllable duration at the end, or small pauses between nonsense words, to learn the rules defining the AxC structure beyond the statistical dependencies between the intervening elements.

The point is that "domain-general perceptual principles guide which item pairs enter the focus of attention and are thus associated across a distance"


The point is that any ways to perceptually differentiate categories makes learning easier as the auditory info can be used to signal "these are the important differences here to pay attention to".



Cues that are actually used:

- Prosody: intonation, tone, stress, and rhythm

- Pauses and prosodic changes (like pitch changes, pauses, or lengthening of sounds) at syllable, phrase, and utterance boundaries

- Frequency of co-occurrence (if the probability of seeing a group of sounds together is high, they are probably part of the same word)

- Transitional probability (a lower-probability transition between sounds suggests a word boundary)

- Positional information, and prosodic cues such as silent gaps between syllables, which can greatly assist speech stream segmentation

- Stress (PREsent, the noun, vs preSENT, the verb)

- Duration (lengthening or shortening of sounds)

- Intensity

- Pitch

- Rhythm: even non-human animals can distinguish between different languages by the rhythm of the words

Infants can group sounds based on their prosodic contours, and even produce language-specific prosodic contours in their cries. Utterances expressing admiration vs suspicion have two different contours, and Java sparrows could learn to distinguish them.

All of these help us differentiate things.

For example, speak the following Garden-path Sentences out loud (or in your head) as you read them:

"the old man the boat"

"the complex houses married and single soldiers and their families"

"the horse raced past the barn fell"

You'll find the sentences confusing. Figure out their actual meaning (the wiki page is helpful), then try speaking them again and notice how your speaking patterns change slightly. Those subtle patterns are very useful for helping language learners distinguish grammatical categories, as they move the problem from category-based abstraction to pattern-based abstraction.

These cues are useful for distinguishing grammatical categories, but also useful for helping predict future grammatical structures.

Infants that are 4 weeks old can already use these features to differentiate things like function words from content words.




Open questions: What differentiates the grammars they can learn from the grammars they struggle to learn? How do the statistical properties affect the kinds of things they can learn?

How does the process of category-based abstraction work, and when does it fail? Take those "filling in tables" examples and see if a transformer can do them. If it can (even though humans can't), what does this mean for us?

Actually look at the n-gram syllable distributions within and between words, and see how predictive they, word length, and other similar features are for things like part of speech.

Ideas: 

- It sounds like perceptual info is very useful for distinguishing categories, so it would make sense to include auditory info alongside written info as part of the initial word embeddings (a rough sketch of what I mean follows after this list)

- Garden-path sentence generator?

- Some kind of hierarchical embedding where different words can be grouped into different categories, and those categories can have their own embeddings

- Maybe when pattern-based abstraction from perceptual cues is not available, humans still infer it mentally?
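
For the first idea, here's a rough, entirely hypothetical sketch (the class and feature names are made up, not from any existing system) of what combining written and auditory/phonetic information in a word embedding could look like, using PyTorch:

```python
# Hypothetical sketch: word embeddings that concatenate a learned text embedding
# with a projection of phonetic/prosodic features (stress, syllable count, etc.).
import torch
import torch.nn as nn

class AudioTextEmbedding(nn.Module):
    def __init__(self, vocab_size, n_phonetic_features, text_dim=64, audio_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.audio_proj = nn.Linear(n_phonetic_features, audio_dim)

    def forward(self, token_ids, phonetic_features):
        # Concatenate the written-form embedding with the acoustic-feature embedding.
        return torch.cat(
            [self.text_embed(token_ids), self.audio_proj(phonetic_features)], dim=-1
        )

embed = AudioTextEmbedding(vocab_size=10_000, n_phonetic_features=8)
tokens = torch.tensor([[1, 2, 3]])       # a batch of token ids
features = torch.randn(1, 3, 8)          # per-token phonetic/prosodic feature vectors
print(embed(tokens, features).shape)     # torch.Size([1, 3, 80])
```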


The point is, perceptual cues are used to make pattern learning much easier, and also as a method for focusing attention and processing effort on particular pieces. This mechanism, paired with curiosity about novel things, allows humans to communicate the important pieces of language and to induce and develop complex grammatical structure.

There's a question here: can you learn the methods for communicating perceptual cues to focus attention and creativity in a setting simpler than "create your own language"? Something like "create your own birdsong". Once that teaches the mechanism of encoding the important pieces for structure learning and creativity, you can bootstrap to language.

A few key mechanisms:

Distinguishing different perceptions (can you tell if two things are the same or different)




There are finitely many equivalence/comparison categories to do inference over, and some of them also have to be used for distinguishing words. Since we can mostly only do inference over perceptual classes, this puts a limit on the kinds of grammar complexity that can exist.


* The point that "a constantly changing environment creates a selective pressure for generalists, a mostly fixed environment creates a selective pressure for specialists, and mechanisms that allow simple cultural development by requiring coordination are one way of achieving a constantly changing environment" seems very relevant for artificial life and open-endedness research.
