When you start learning a new language, the first thing you’ll notice is how it sounds. Specifically, you’ll notice the places where it sounds different from your native language: maybe there are tones, or unfamiliar consonant clusters, or speech sounds you’ve never heard before. I’ve always been curious about which languages sound similar. This blog post emerged from a project where I used math to quantify the difference between languages.
In the next section I’ll talk about how I framed this problem and the datasets I used. Then I’ll discuss the algorithm I developed and its results. I’ll conclude by talking about the significance of the results and the specific languages that turned up.
If you want to dive really deep into this topic, all the data analysis I did is in this notebook.
Phonemes are the sounds that occur in a language. If we want to understand how similar two languages sound, it makes sense to start by examining their phonemes.
Here’s what phonemes look like: /kʰ/ is the first sound in the English word king and /ŋ/ is the sound of “ng”. The International Phonetic Alphabet is the system that linguists use to transcribe the sounds of words - it lets us escape all the messiness of a particular language’s spelling rules. In IPA, we’d transcribe the word “king” as /kʰɪŋ/. As you may have already noticed, linguists put IPA transcriptions inside of forward /slæʃɪz/.
The collection of all of a language’s speech sounds is called its phoneme inventory. American English’s phoneme inventory is larger than you might expect, and looks like this:
b, d, d̠ʒ, f, h, j, kʰ, l, m, n, pʰ, s, tʰ, t̠ʃ, v, w, z, ð, ŋ, ɡ, ɹ, ʃ, ʒ, θ, aɪ, aʊ, eɪ̯, iɪ, oʊ, uː, æ, ɑ, ɔɪ, ə, ɚː, ɛ, ɪ, ʊ, ʌ
Other languages have larger and smaller inventories: the largest is !Xóõ with 161 phonemes and the smallest is Pirahã with only 11. Phoible is a dataset of phoneme inventories from 2,186 languages - about a third of all the world’s languages. I used Phoible for all of my analyses.
You can think of a phoneme inventory as a set of unique entries that describe a language. One language’s inventory might look like (a, b, c, i) and another might look like (d̠ʒ, a, b, k, f, kʰ, ʌ). We can go ahead and directly compare these two: count the overlapping phonemes and divide by the total number of distinct phonemes across both inventories. In this case there are two overlapping phonemes (a, b) and 9 distinct phonemes in total, so we can say that the similarity between these inventories is 2/9, or 0.222. This similarity measure has a name: the Jaccard index. It is also called intersection over union, because we’re dividing the size of the intersection of the sets by the size of their union. If we subtract this number from 1, we get the Jaccard distance, which here is 7/9, or 0.778.
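As a minimal sketch, the calculation above could look like this in Python. The function name `jaccard_distance` and the choice to treat two empty sets as identical are illustrative assumptions, not something taken from the original analysis.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty inventories count as identical.
        return 0.0
    return 1 - len(a & b) / len(union)


# The two example inventories from the text.
inventory_1 = {"a", "b", "c", "i"}
inventory_2 = {"d̠ʒ", "a", "b", "k", "f", "kʰ", "ʌ"}

distance = jaccard_distance(inventory_1, inventory_2)
similarity = 1 - distance

print(round(similarity, 3))  # 0.222 (2 shared phonemes out of 9 distinct)
print(round(distance, 3))    # 0.778
```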
I used Jaccard distance to rank the similarity between American English and every other language in the Phoible database, and the results were much better than random. However, I decided to create a custom metric and found even more compelling results.
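Ranking languages this way is just a sort by distance to a reference inventory. The sketch below uses made-up toy inventories (`lang_x`, `lang_y`, `lang_z` and `reference` are hypothetical, not real Phoible entries) to show the shape of the computation.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


# Hypothetical toy inventories, NOT real Phoible data.
reference = {"a", "b", "k", "m", "n", "s", "t"}
inventories = {
    "lang_x": {"a", "b", "k", "m", "s"},
    "lang_y": {"a", "i", "u", "p", "t"},
    "lang_z": {"a", "b", "k", "m", "n", "s"},
}

# Sort language names from closest to farthest from the reference.
ranked = sorted(
    inventories,
    key=lambda name: jaccard_distance(reference, inventories[name]),
)

for name in ranked:
    print(name, round(jaccard_distance(reference, inventories[name]), 3))
```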
The custom metric involves breaking phonemes down into their individual features. For instance, the English phoneme /k/ is a voiceless velar plosive: it lacks the voicing feature, has the velar feature, and has the plosive feature. There are about two dozen features that you can use to characterize each speech sound, and they’re all binary (present or absent).
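To make the feature idea concrete, here is a small sketch that represents each phoneme as the set of features it has and compares phonemes with Jaccard distance. The feature names and the four phonemes chosen are simplified assumptions for illustration; a real feature system has about two dozen features per sound.

```python
# Hypothetical, heavily simplified feature sets: only the features that are
# present are listed, and the feature names are illustrative assumptions.
K = frozenset({"velar", "plosive"})                        # /k/
K_ASPIRATED = frozenset({"velar", "plosive", "aspirated"})  # /kʰ/
G = frozenset({"velar", "plosive", "voiced"})               # /ɡ/
M = frozenset({"bilabial", "nasal", "voiced"})              # /m/


def feature_distance(p: frozenset, q: frozenset) -> float:
    """Jaccard distance between the feature sets of two phonemes."""
    union = p | q
    return 1 - len(p & q) / len(union) if union else 0.0


print(round(feature_distance(K, K_ASPIRATED), 3))  # 0.333: differ by one feature
print(round(feature_distance(K, G), 3))            # 0.333: differ by one feature
print(feature_distance(K, M))                      # 1.0: no features in common
```

With plain set comparison of phoneme symbols, /k/ and /kʰ/ would count as completely different; at the feature level they come out as near neighbors.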
After transforming each phoneme into a list of binary features, I took the Jaccard distance between every pair of phonemes from the two languages being compared and took the average over the closest pairs. This accounts for a situation where two languages might have two very similar phonemes, like /kʰ/ and /k/ - instead of treating these as completely different entities, there’s now a more fine-grained measure of similarity. Using my custom distance metric, I found the languages closest to American English. Here are some of the top-50 results:
- English (British)
- English (General)
- English (Australian)
- Belizean Creole
- Jamaican Creole
- Karipuna Creole
- Hill Jarawa
- Mauritian Creole
- Antiguan Creole
- English (New Zealand)
- Standard Malay
- San Miguel Chimalapa Zoque
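The feature-based metric described before this list can be read in more than one way. The sketch below shows one possible reading, stated as an assumption: for each phoneme in one language, find its closest phoneme in the other, average those distances, and then average the two directions so the result is symmetric. The function names and toy inventories are hypothetical.

```python
def feature_distance(p: frozenset, q: frozenset) -> float:
    """Jaccard distance between the feature sets of two phonemes."""
    union = p | q
    return 1 - len(p & q) / len(union) if union else 0.0


def directed_distance(a: list, b: list) -> float:
    """Average, over phonemes in a, of the distance to the closest phoneme in b.

    Assumes both inventories are non-empty.
    """
    return sum(min(feature_distance(p, q) for q in b) for p in a) / len(a)


def language_distance(a: list, b: list) -> float:
    """Symmetric version: mean of the two directed distances (an assumption)."""
    return (directed_distance(a, b) + directed_distance(b, a)) / 2


# Hypothetical toy inventories built from simplified feature sets.
K = frozenset({"velar", "plosive"})
K_ASPIRATED = frozenset({"velar", "plosive", "aspirated"})
M = frozenset({"bilabial", "nasal", "voiced"})

lang_a = [K_ASPIRATED, M]  # has /kʰ/ and /m/
lang_b = [K, M]            # has /k/ and /m/

print(round(language_distance(lang_a, lang_b), 3))  # 0.167
```

Comparing the same two toy inventories by phoneme symbol alone would give a Jaccard distance of 0.667, because /kʰ/ and /k/ would not match at all.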
Some of these are totally unsurprising: there are other English dialects, English-based creoles, closely related languages (like German and Scots), and some more distant cousins (like Persian and Slovene). I marked these languages in blue. A few of the languages on this list, however, are totally unrelated to English: Swahili is a Niger-Congo language from Sub-Saharan Africa, Tagalog is an Austronesian language from South-East Asia, and Bashkir is a Turkic language from Central Asia. Still, when you look at these languages’ phonemes it makes sense: they all seem very similar to English’s.
I wanted to understand whether my results were random or if there was some pattern. If closely-related languages are ranked as closer to English than unrelated languages, then the algorithm is performing well - after all, related languages should sound similar.
English is a member of the Indo-European family, which makes up 7% of the world’s languages, whereas 40% of the languages in the top-50 sample are Indo-European - a large difference. I performed a statistical test, which showed that this result was unlikely to occur by chance. This suggests that the algorithm is actually working: it’s selecting closely related languages more often than unrelated ones.
It also seems like the algorithm prefers languages from the unrelated Austronesian language family - 5% of the world’s languages are Austronesian, but 16% of the languages in the sample are Austronesian. I performed another statistical test which showed that this result was extremely unlikely to occur by chance - Austronesian languages actually do sound more like English than languages from other families.
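The post doesn’t say which statistical test was used. As one possible sketch, a one-sided binomial test asks how likely it is to see at least this many languages from a family in a sample of 50, if each slot were filled at the family’s worldwide rate. The counts below are derived from the percentages above (40% of 50 is 20, 16% of 50 is 8); the binomial model itself is an assumption and ignores that languages are drawn without replacement.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


sample_size = 50

# Indo-European: 7% worldwide, 20 of the top 50.
p_indo_european = binomial_tail(20, sample_size, 0.07)

# Austronesian: 5% worldwide, 8 of the top 50.
p_austronesian = binomial_tail(8, sample_size, 0.05)

print(f"Indo-European p-value: {p_indo_european:.2e}")
print(f"Austronesian p-value:  {p_austronesian:.2e}")
```

Under this simple model both p-values fall well below the conventional 0.05 threshold, with the Indo-European result being far more extreme than the Austronesian one.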
Interestingly, it doesn’t matter very much which distance metric you use: Austronesian languages are always disproportionately represented in the top 50 results.
The Austronesian languages Tagalog, Ivatan, Taba, Indonesian, and Malay have simple phoneme inventories: their consonants are a subset of English’s (except for the palatal nasal /ɲ/ in Indonesian). Their vowels are also a subset of English’s, except for Tagalog, which has a mid back rounded vowel that English lacks: /o̞/.
Toba-Batak resembles the rest of the languages in this family except that it possesses the palatals /d͡ʑ/ and /t͡ɕ/.
Cebuano’s consonants are often dentalized, making it slightly different from its relatives on this list.
In Javanese there are breathy-voiced consonants like /ɖ̥/ and dentalized stops like /t̪/.
All of these languages are non-tonal, like English. However, unlike in English, the sound /ŋ/ can occur at the beginning of a syllable as well as at the end: English is weird in that it bans /ŋ/ from starting a syllable.
I think it’s fair to conclude that Austronesian languages are weirdly similar to English in terms of their speech sounds. Despite being on opposite sides of the planet, these languages contain sounds that every English-speaker is familiar with. Learning Tagalog, Malay, or Cebuano shouldn’t present many problems for English speakers when it comes to pronunciation.
It’s worth keeping in mind, though, that all the distance metrics I explored are, by definition, symmetrical: the distance from point A to point B is always the same as the distance from B to A. The “learning difficulty” of a language’s speech sounds is definitely not symmetrical. The Austronesian languages I explored have phoneme inventories that are, with a few exceptions, subsets of English’s. That means it should be easier for an English-speaker to pronounce Tagalog words than for a Tagalog-speaker to pronounce English words.
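The asymmetry is easy to see with a set difference. The sketch below is a hypothetical illustration, not a metric from the analysis: it measures what share of a target language’s sounds are new to a learner, using made-up toy inventories where one is a subset of the other.

```python
def new_sound_share(learner: set, target: set) -> float:
    """Share of the target inventory that the learner's language lacks.

    Hypothetical illustrative measure; assumes a non-empty target inventory.
    """
    return len(target - learner) / len(target)


# Made-up toy inventories, NOT real Phoible data: `small` is a subset of `big`.
big = {"p", "t", "k", "m", "n", "ŋ", "s", "z", "f", "v"}
small = {"p", "t", "k", "m", "n", "ŋ", "s"}

print(new_sound_share(learner=big, target=small))  # 0.0: nothing new to learn
print(new_sound_share(learner=small, target=big))  # 0.3: three new sounds of ten
```

A symmetric distance would give the same number in both directions, while this directed measure separates the easy direction from the hard one.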
So if you’re an American and want to learn a language that’s easy to pronounce, I suggest taking a trip to Southeast Asia 🌴