Common Voice - a few insights

Some time ago, we invited people to participate in the Mozilla Foundation's Common Voice project. https://voice.mozilla.org/

Gamification

In the meantime, a small competition for the most contributions has started in our company. I currently have 655 voice donations (recordings) and 623 validations. But that is only enough for second place.

The Mozilla community has published a draft for the future dashboard. I am already looking forward to it :)

Diversity

While validating, however, I noticed that there is hardly any diversity among the voices. The majority are white, male and speak standard High German. There are few Saxons, hardly any women's voices, no breaking adolescent voices, no Franconians or Bavarians. Maybe it only seems that way to me, but I suspect it does not.

And although the barrier to entry is as low as for playing Candy Crush, we don't seem to be able to achieve diversity. Maybe it has something to do with our filter bubbles, and the world out there needs to hear about Common Voice through other channels. So go ahead and share this opportunity through all channels.

Fortunately, there is more to go on than gut feeling: English-language data sets are already available for download. So I quickly downloaded 12.5 GB of data and evaluated the metadata.

Evaluate metadata

Of the approximately 380,000 English-language samples, 60.75 percent came from donors who gave no information about their gender. 9.23 percent were marked as female, 29.69 percent as male and 0.33 percent as other.

My feeling is that this does not match what I hear when validating sentences in the German-language samples.

I then subtracted the samples without a gender entry from the total and recalculated the distribution.

That leaves just under 150,000 samples, distributed as follows:

Women: 23.51%
Men: 75.64%
Other: 0.85%

UNBELIEVABLE. There is room for improvement here. That goes for the German-language samples too. Please contribute: https://voice.mozilla.org/
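For readers who want to retrace the adjustment of the gender shares, here is a minimal Python sketch. It works only with the rounded percentages quoted in this post, not with the raw sample counts, and the function name `renormalise` is made up for illustration.

```python
# Shares quoted in the post, in percent of all English-language samples.
reported = {"no entry": 60.75, "women": 9.23, "men": 29.69, "other": 0.33}


def renormalise(shares, drop="no entry"):
    """Drop one category and rescale the rest so they sum to 100 percent."""
    kept = {key: value for key, value in shares.items() if key != drop}
    total = sum(kept.values())
    return {key: round(100 * value / total, 2) for key, value in kept.items()}


adjusted = renormalise(reported)
print(adjusted)
```

Because the inputs are already rounded, the last digit can differ slightly from the figures above, which were presumably computed from the actual counts.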

Age

As a further piece of metadata, the age is recorded in decades. Ages above 89 do not seem to be relevant for Mozilla. I grouped the decades more coarsely to get more striking statements. To put it provocatively, I could put forward the thesis: younger men find out about Common Voice and tell their mothers.

Women
Up to 29 = 28%
30 to 49 = 37%
50 to 89 = 35%

Men
Up to 29 = 41%
30 to 49 = 41%
50 to 89 = 16%

Other
Up to 29 = 77%
30 to 49 = 16%
50 to 89 = 6%
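The coarse age grouping per gender can be sketched in a few lines of Python. This is not how the figures in this post were produced (that was done in OpenRefine), just one possible way to retrace them. The column names `age` and `gender`, the decade labels such as `twenties`, and the reading of the three groups as "up to 29", "30 to 49" and "50 to 89" are all assumptions. The sample rows are invented and only show the shape of the input.

```python
import csv
import io
from collections import Counter, defaultdict

# Assumed mapping from decade labels in the CSV to three coarse groups.
# Both the label spellings and the group boundaries are assumptions.
COARSE = {
    "teens": "up to 29", "twenties": "up to 29",
    "thirties": "30 to 49", "fourties": "30 to 49", "forties": "30 to 49",
    "fifties": "50 to 89", "sixties": "50 to 89",
    "seventies": "50 to 89", "eighties": "50 to 89",
}


def age_shares_by_gender(rows):
    """Return {gender: {coarse_group: percent}} for rows with both fields set."""
    counts = defaultdict(Counter)
    for row in rows:
        group = COARSE.get(row.get("age", ""))
        gender = row.get("gender", "")
        if group and gender:  # skip samples without an age or gender entry
            counts[gender][group] += 1
    return {
        gender: {g: round(100 * n / sum(c.values()), 1) for g, n in c.items()}
        for gender, c in counts.items()
    }


# Invented sample rows, NOT real Common Voice data.
SAMPLE = """filename,age,gender
a.mp3,twenties,male
b.mp3,thirties,male
c.mp3,,male
d.mp3,sixties,female
e.mp3,teens,female
f.mp3,twenties,
"""

shares = age_shares_by_gender(csv.DictReader(io.StringIO(SAMPLE)))
print(shares)
```

To run it against the real metadata, the `io.StringIO(SAMPLE)` part would be replaced by an opened CSV file, after checking that the column names and labels match.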

Donors who created a profile could also assign themselves to a dialect.

Dialect

Again, the largest group of samples had no entry: 66.02%. The remainder was dominated by US (15.74%), England (7.88%), India (2.99%) and Australia (2.16%). After removing the blanks and splitting by gender, I was surprised by the high share of Indian English among women (16.26%) compared to 6.59% among men.

Women
US = 40.84%
England = 21.93%
India = 16.26%

Men
US = 47.28%
England = 24.18%
India = 6.59%

Other
US = 54.92%
England = 21.00%
Ireland = 10.42%

Unfortunately, these numbers say nothing about the absolute number of speakers. It could be one 80-year-old woman who provided all the samples. Or it could be her many friends who together ... Hopefully the Mozilla Foundation dashboard will tell us more. And hopefully it will also provide real data on the German speakers.

I used one of my favourite tools to analyse the data. Thank you, OpenRefine.

For those who are interested in running their own evaluations but do not want to download 12.5 GB of data:

  • CSV metadata of the English-language Common Voice (8.9 MB) Password: CommonVoice2018

DeepL is a deep learning company that develops AI systems for languages. The company, based in Cologne, Germany, was founded in 2009 as Linguee, and introduced the first internet search engine for translations. Linguee has answered over 10 billion queries from more than 1 billion users.

Stephan Luckow

Stephan is an open source evangelist and constantly curious about technologies. Thematically, his blog posts can best be summarised as "curiosity satisfied".