Wednesday, July 29, 2015

Infographics time and the trouble with second languages

Infographics time again! Last time it was that pretty graph from the South China Morning Post that had a huge spread and lots of problems; see this post about that infographic and also this one about journalists, researchers and infographics. This time we've got two infographics of "second languages of the countries of the world" using almost the same data - one by the site MoveHub.com, which works in international moving, and one interactive map of countries' second languages made by Olivet Nazarene University. I'll go through some of the problems with these infographics, suggest some improvements and show some other neat relevant illustrations.

If you have any questions or if you're a journalist interested in writing about the languages of the world and require assistance, don't hesitate to contact us.

Both infographics feature data from the CIA's World Factbook - a US-run enterprise to provide information about the world for policymakers. The factbook is annual; the first classified edition came in 1962 and the first unclassified one in 1971. It provides a general summary of information about the world's countries, such as population size, military expenditures, GDP, energy consumption etc. The information is provided by the CIA, several other American governmental bodies (list here) and "several other public and private sources".

The list of languages per country in the CIA's World Factbook is said to be ranked by size. However, this is clearly not true in many cases, and there is no reference to whether native and non-native speakers are counted together, or how multilinguals are handled. It seems to be the case that official languages are listed first, regardless of size. Where the CIA gets its information about the languages of the world is unclear. This causes complications and makes this data source inappropriate for infographics: there are just plain errors, and this gives people a faulty view of these countries. Ethnologue might not be perfect, but in this case it would have been much better.

Both infographics have been picked up by Business Insider. If anyone knows why Business Insider seems to have been showing such keen interest in linguistics lately, feel free to let us know. These are not the only two stories; their language and linguistics tags have been quite active lately... we're pleased and puzzled ^^!

I'm complaining because it matters: these infographics now show a faulty view of the world, and either the CIA should do better or people should stop using their factbook. As you know, data visualisation is a dear topic to me. I believe it is becoming more and more important as science is very often communicated through viral infographics now, and we scientists need to step in and aid journalists who are trying to communicate about our fields of research.

On to the infographics and the trouble with "second languages". The illustrations are supposed to show "the second language of countries of the world", and by that they clearly state that they mean the language with the second largest speaker population in that country (according to the CIA's World Factbook, Wikipedia + various other sources). They do NOT mean what people mostly speak as a second language in that country. In the infographic by Olivet Nazarene they have been slightly smarter about things and selected the largest language after the official language(s) of the country. Meaning, if there is more than one official language they count the next most frequent after all of them. Alright, let's check it out.

I have to say, to make a map like this, as a moving company, is a great idea. Like they say, it can illustrate the "ancient furrows of conquest, colonisation and recent immigration trends" and it shows people who are inclined to travel the world what life is like in other places. One can see bits and pieces of the old colonial division of the world. As a person working on linguistic diversity and also having lived in 5 different countries, I get this. It's good! There are complications though.. (of course there are)...

The CIA's World Factbook, which both infographics are based on, is inconsistent in how it counts native and non-native speakers. It frequently starts by listing the formally official languages of the country, regardless of how many actually speak those languages. In the text accompanying the interactive map, Olivet Nazarene University say "The most spoken language in any country is pretty much a no-brainer: it’s the country’s official language.". This is not true; the official language(s) need not be the largest language.

Both infographics state that they have used extra information sources, like Wikipedia, besides the CIA's World Factbook. But they do not say exactly which and when. This makes matters worse I'm afraid.

Some concrete examples of troubles
For Sweden, English is listed as the second largest language in the infographic by MoveHub.com, which is probably true if we count in second language speakers, but incorrect if we count only native speakers (that would be Finnish). I'm a Swede and the Swedish state does not keep records of these kinds of things, but we still know enough to state this. Ok, well, that seems to be in the spirit of a moving company: who counts as a native speaker is less important, what their readers want to know is what is useful if they plan to move abroad. That makes some sense.

The infographic by Olivet Nazarene University has Finnish down for Sweden, which again is true if we're only counting native speakers. But, are we only counting native speakers?

If we then move across the Baltic Sea and have a look at our brethren in Finland, Swedish is marked down as the second largest language in the infographic from MoveHub.com. That seems highly implausible. Sure, many Finns have learned Swedish in school, but they also learned English, and I believe there are more people who know English well than who know Swedish (especially if we count in non-native speakers of Finnish living in Finland). I had a poke around the Finnish census, and I couldn't find numbers for competence in English over the entire population. It is only recently that Swedish has been removed as a compulsory language for all citizens (2014), but even so I suspect that the competence in English is still higher overall compared to Swedish. I got support for this from friends who are residents of Finland; more confirmation is appreciated though. There are more native speakers of Swedish than of English within Finland, and Swedish is an official language whereas English is not, that's for sure - but then again there are more native speakers of Finnish in Sweden than native English speakers. Aren't we lumping both native and non-native populations here though?

In the infographic by Olivet Nazarene University they show Sami as the second language of Finland (because Swedish is labeled as official they count that out). This is also very bad: there are many languages in Finland that are larger than Sami - most prominently Russian. This is actually even stated by the CIA's World Factbook. In addition, Ethnologue shows more speakers of Romani than Sami in Finland. So, why Sami is marked down as the second language for Finland in their map is not clear. (I also don't appreciate that small indigenous languages get the abbreviation "In" on the map; it is inconsistent and just plain wrong to lump like that.)

Matters get complex when we get to highly linguistically diverse places like Nigeria. There are more than 500 languages spoken in Nigeria. Both infographics show Hausa as the second language, but.. well.. things are more complex. Nigeria is one of the most linguistically diverse places on the face of the earth, so knowing the second largest language is actually maybe not that useful. And also, is it really Hausa?

Ethnologue states that there are 18,500,000 speakers of Hausa as a first language in Nigeria and 15,000,000 second language speakers. For Standard English there are 60,000,000 second language speakers and zero native speakers listed (this is rather odd, yes). For Nigerian Pidgin English we've got 30,000,000 first and second language speakers (they could not tease them apart). There are also the major languages Igbo (18,000,000) and Yoruba (18,900,000), and 500 more languages to keep track of. If we count by the language that has the most second language speakers, it'd be Standard English. If we count the language with the second largest speaker population regardless of native or non-native, it would be Hausa. If we count only native speakers, things depend on how many actually speak Nigerian Pidgin English natively: if most of them do, Pidgin comes first, Yoruba second and Hausa third; if few do, Yoruba comes first and Hausa second.
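To make the dependence on the counting rule concrete, here is a minimal Python sketch using the Ethnologue figures quoted above. The names (`SPEAKERS`, `ranking`, `pidgin_native_share`) are made up for illustration. Since Ethnologue does not split Nigerian Pidgin English into first and second language speakers, the split is an explicit assumption you can vary, and I also assume that the Igbo and Yoruba figures are first language counts.

```python
# Speaker counts in millions, as quoted from Ethnologue in the text above.
# Format: name -> (first-language speakers, second-language speakers).
# Assumption: the Igbo and Yoruba figures are first-language counts.
SPEAKERS = {
    "Hausa": (18.5, 15.0),
    "Standard English": (0.0, 60.0),
    "Igbo": (18.0, 0.0),
    "Yoruba": (18.9, 0.0),
}
PIDGIN_TOTAL = 30.0  # first + second language speakers, not separable


def ranking(rule, pidgin_native_share=0.0):
    """Rank languages by 'native', 'second' or 'total' speaker counts.

    pidgin_native_share is a hypothetical guess (0..1) at how many of the
    Nigerian Pidgin English speakers are native speakers.
    """
    data = dict(SPEAKERS)
    data["Nigerian Pidgin English"] = (
        PIDGIN_TOTAL * pidgin_native_share,
        PIDGIN_TOTAL * (1.0 - pidgin_native_share),
    )
    key = {
        "native": lambda c: c[0],
        "second": lambda c: c[1],
        "total": lambda c: c[0] + c[1],
    }[rule]
    # Largest population first.
    return sorted(data, key=lambda name: key(data[name]), reverse=True)


for rule in ("native", "second", "total"):
    print(rule, ranking(rule)[:3])
print("native, Pidgin mostly native", ranking("native", 0.9)[:3])
```

With these numbers Hausa is second by total speakers and second by native speakers when few Pidgin speakers are native, but drops to third among native speakers once most Pidgin speakers are assumed to be native. Same data, three different "second languages".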

Again, how useful is it to know this "second language" when we're comparing countries that have extremely few languages (American Samoa, Vatican, Iceland etc) to highly diverse countries (Papua New Guinea and Chad)? To miss out on Yoruba when learning about Nigeria seems like a terrible idea. This is why I recommend you check out the Greenberg Diversity Index of countries, which tells you how likely it is that two randomly chosen people in a country have different first languages. Perhaps do an infographic of that instead?
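For those who want to try it, the Greenberg Diversity Index is usually computed as one minus the sum of the squared population shares of each first language, i.e. the probability that two randomly picked people have different first languages. Below is a small Python sketch of that formula. The function name and the two toy "countries" are hypothetical and only there for illustration; they are not real census data.

```python
def greenberg_diversity(first_language_counts):
    """Return 1 - sum(p_i^2), where p_i is each language's share of speakers."""
    total = sum(first_language_counts.values())
    if total <= 0:
        raise ValueError("need at least one speaker")
    return 1.0 - sum((n / total) ** 2 for n in first_language_counts.values())


# Hypothetical toy countries, not real census data.
nearly_monolingual = {"A": 98, "B": 2}
highly_diverse = {name: 1 for name in "ABCDEFGHIJ"}  # ten equal-sized languages

print(round(greenberg_diversity(nearly_monolingual), 3))
print(round(greenberg_diversity(highly_diverse), 3))
```

A value near 0 means almost everyone shares a first language, and a value near 1 means two random people almost never do. In a country near 1, asking for "the second language" tells you very little.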

Here is one from Worldmapper where the size of each state is simply distorted with respect to the number of languages spoken. It is not the same as the Greenberg Diversity Index, but it is still very useful. Worldmapper has stated their sources here.

© Copyright Sasi Group (University of Sheffield) and Mark Newman (University of Michigan).

Then there's the case of Madagascar, which has Malagasy as its second language in the infographic by MoveHub (the other map has no information on Madagascar), even though Malagasy outclasses the official language French by more than 16 million speakers (even lumping native and non-native). That's wrong.

Libya is labeled as Italian in the map by MoveHub.com, even though the Italian population is outranked even by Punjabi in terms of native speakers according to Ethnologue. Where the CIA got the estimate of second language speakers of Italian in Libya, I do not know. The map by Olivet Nazarene University has Libya down for English, which seems more plausible.

Wolof is listed as the second language of Senegal in both infographics, after the official language French, even though more people speak Wolof than French according to Ethnologue. There are more cases, the list goes on. I'll stop here for now, it's just messy I'm afraid :(. I won't have time to go through all the issues that are present; the comment sections of these infographics are already overflowing with them.

I'm sorry for being such a drag, but science communication is important and it's not ok to get things this wrong. I'm also pointing this out because it is possible to do better: ask a linguist for advice, stop using the CIA's World Factbook or try and work on improving it.

Direct your criticism where it is needed
Remember that MoveHub.com and Olivet Nazarene University did not go out and gather this information on their own - they used the published source CIA's World Factbook, Wikipedia and other unnamed sources. If the error stems from there, be kind and direct your criticism there instead of to MoveHub.com, Olivet Nazarene University or Business Insider - they don't have power to make edits there anyway.

The troubles with CIA's World Factbook
These problems arise because the CIA's World Factbook is not a reliable peer-reviewed information source. It should not be used for infographics like this; there are far better sources. It does lots of things, and few of them as well as more specialised resources like Ethnologue or the International Monetary Fund. It gives an overview, but it should not be used as the sole source, ever. Why people use the CIA Factbook straight off like this is a mystery to me: they practically never give any detailed references to how they compiled their information and often contradict other more reliable sources. I know it's harsh, but just don't use it. Unless they drastically improve I don't see the benefits over, say, Wikipedia. Neither should be used as the sole source of information.

One might say that it's not easy to do lots of things well and that the CIA's World Factbook is an old institution that provides a general overview of lots of topics - it cannot be expected to hold up to academic standards. To that I only say: yes, yes it can do better and there is no reason to set the bar that low. It's not hard to do it better, and well, if you can't do it well... honestly just don't do it. You don't have to. We could use Wikipedia for a general overview and then go to specialised repositories; there's no reason why the US should keep a world encyclopaedia around. Wikipedia can be a mess, but these old-timey super general encyclopaedias are often not much better. Their main benefit is that they are accountable, but that's not always enough. Let's teach people to search for information and evaluate what they find on their own instead, teach scientific methods and thinking. Or, if it is really important to have a neat special factbook for the American public and policy makers - do a better job of keeping it updated and stating what facts are in there and from where. Ok, sorry. Perhaps that's a discussion for another time, let's leave that for now.

The main source of the trouble here is that the CIA's World Factbook indicates neither what is spoken as a first language and what is not, nor where they've got their information from. What MoveHub.com and Olivet Nazarene University took from Wikipedia and other sources I don't know, so I cannot understand what's going on there. Wikipedia tends to tease apart native and non-native competence though, as they often get their numbers from Ethnologue.

On using Ethnologue instead
Ethnologue is not perfect either, but it is better than the CIA's World Factbook when it comes to statistics of speaker populations. Way, way better. They are accountable, they state sources and they make explicit what they count. You can also of course contact them and help improve it!


Word of advice: if you're using Ethnologue, make sure you understand what these things mean when it comes to how Ethnologue works, otherwise you're bound to get tons of problems:
  • macro language
  • immigrant language
  • indigenous language
  • native language
Read what Ethnologue themselves have written about these terms and their work, it will make what you do better.

Parent languages and language families
Also, while we're at it: the infographic from MoveHub.com uses the term "parent languages". It is a rather odd choice of words for what are essentially language families. Sure, language families are theoretically assumed to have one parent language, but we usually name those hypothetical languages things like "Proto-Uralic" and "Proto-Afro-Asiatic". It might also be good to remember that this assumption, that they have one and only one origin, might not be accurate or useful. With contact and relationships between languages often being more similar to networks than trees, can we really assume that we get neat trees with one ancestor? ... oh well, let's not get into evolutionary debates about one or several origins of language. "Parent languages" is actually not that bad; it makes that theoretical assumption clear, which I think is a good thing.

On counting non-native competence
Counting non-native language competence (many people master more than 2 languages) is very hard. We've talked about this before: how should it be done? What is competent enough? How do we apply that scale consistently? How do we treat different censuses' approaches to this, given that we cannot test this ourselves? Should we use organisations like Alliance Française and TOEFL, and wouldn't that skew our understanding drastically? There are tons of tests available; are they of any use to us? There are lots of different scales, like ABLLS and CEFR; can they tell us something?

Ethnologue does keep some sporadic information on second language users, but it is not as comprehensive as the first language counts. The main source I know of for second language population counts is a publication by Bentz & Winter from 2013 that combines Ethnologue and other sources, free PDF here. If anyone knows of other sources, lemme know. (Thanks Seán Roberts for recommending the Bentz & Winter article, go read his excellent stuff on cultural evolution here.) Let's see if we can grab any of those fine infographic makers' attention and get a new shiny infographic but with Bentz & Winter's numbers!

Suggestions for illustrative infographics of the world's languages
I like the initiative behind these two infographics, however in order to illustrate what they want to illustrate might I suggest instead displaying:
  • language of education
  • official language
  • largest non-indigenous language
  • second language competence according to Bentz & Winter (2013)
  • Greenberg Diversity Index
  • number of languages
  • which countries have been colonised by whom
Might I also recommend having a look at these two maps to learn about the history of colonisation of the world, something that I believe was part of the message behind these two infographics.

1) A map of Africa from 1914, i.e. the end result of the so-called "Scramble for Africa". This image is from the brilliant site Atlas-historique.net and is made and copyrighted by Guillaume Balavoine. I highly recommend visiting his site. For the non-French readers: "allemandes" = German. The rest you should be able to figure out. This gives you a clear image of part of what MoveHub.com wanted to show, for example that Libya was colonised by Italians.

2) Exclusive Economic Zones (EEZ) of the world today. EEZs are regions that a state has economic power over, including those around overseas territories and dependencies. New Caledonia is for example in the EEZ of France, Guam is part of the US and the UK has several islands down in the South Atlantic. Exploring these entities as they are today is very interesting; it's sometimes easy to forget that Kiribati borders the US, France borders New Zealand, the UK borders the Maldives and that Norway has land below the equator. This map is by Theo Deutinger, click here to explore it in greater detail.

This map brings in a perspective that is not present in the two infographics we've been discussing: the South Pacific islands of Polynesia, Melanesia and Micronesia. In both infographics, these regions were partially or entirely excluded. For those who need it, click here for a map of those three regions.

© Theo Deutinger 2009

3 comments:

  1. The Olivet Nazarene map is pretty laughable, I've written about this here:
    http://www.languagesoftheworld.info/bad-linguistics/the-problematic-map-of-the-second-most-spoken-languages.html

  2. p.s. New Zealand controls Tokelau's EEZ, but (NZ argued so that) the Cooks and Niue have their own EEZs: http://www.teara.govt.nz/en/law-of-the-sea/page-3
