Attempting isiXhosa text-to-speech

Training a computer to click

Nick Doiron
3 min read · Sep 30, 2019

With 11 million speakers, isiXhosa is the most widely spoken language in the world with click consonants. This makes it an interesting subject for training a text-to-speech model with deep learning.

Intro

If you haven’t heard the language before, here is a guide to the X click in isiXhosa:

And the song Qongqothwane:

While researching this post, I also stumbled onto an isiXhosa ASMR account, if you’re into that.

The model

When I looked up open source text-to-speech, I found Tacotron and its successor, Tacotron2. Mozilla’s TTS repo (dev branch) has the most straightforward explanation of working with your own data and training a Tacotron model on Google CoLab. CoLab is a Python notebook environment, integrated into Google Drive, which gives you a free GPU to run the code.
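A minimal setup sketch for a CoLab cell, assuming the 2019-era layout of the mozilla/TTS repo (the branch and requirements file may have moved since):

    # clone the dev branch of Mozilla TTS and install its dependencies
    !git clone -b dev https://github.com/mozilla/TTS
    %cd TTS
    !pip install -r requirements.txt

    # confirm that CoLab actually attached the free GPU
    import torch
    print(torch.cuda.is_available())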

The dataset

OpenSLR.org has 80 text and audio resources, including South African languages. Their isiXhosa dataset is 2,420 recordings totaling 907MB, only about 40% the size of a comparable free English dataset (LJ Speech), so I don’t expect the output to sound like a natural speaker.
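Since 907MB counts bytes rather than speech, a few lines of stdlib Python can restate the dataset in hours (the directory layout is taken from the training command below):

    import os
    import wave

    wav_dir = "xh_za/za/xho/wavs"
    clips = [f for f in os.listdir(wav_dir) if f.endswith(".wav")]
    total_seconds = 0.0
    for name in clips:
        with wave.open(os.path.join(wav_dir, name)) as w:
            total_seconds += w.getnframes() / w.getframerate()
    print(f"{len(clips)} clips, {total_seconds / 3600:.1f} hours of audio")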

After some trial and error, I was able to create a config.json which supported the provided WAV files and newer options in the TTS dev branch. Then it was time to train the model and wait to see the results:

python train.py --config_path ../config.json --data_path ../xh_za/za/xho/wavs/
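For reference, my config looked roughly like the sketch below. The field names are from memory of the 2019 dev branch and the values are placeholders (the sample rate, in particular, has to match the OpenSLR WAV files), so treat this as illustrative rather than exact. The phoneme settings turn out to matter later.

    {
      "run_name": "xhosa-tacotron",
      "audio": {
        "sample_rate": 48000,
        "num_mels": 80
      },
      "batch_size": 32,
      "epochs": 1000,
      "text_cleaner": "phoneme_cleaners",
      "use_phonemes": true,
      "phoneme_language": "en-us"
    }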

There were two issues during training. First, the default config is for 1,000 epochs, but a back-of-the-envelope calculation put CoLab’s 12-hour limit at about 100 epochs. Second, I was seeing a weird warning, but I found a GitHub issue dismissing it:

I had asked the model to synthesize a sentence from the isiXhosa Wikipedia:

IsiXhosa lolunye lweelwimi zaseMzantsi Afrika ezingundoqo nezaziwayo.
Xhosa is one of South Africa’s most important and popular languages.

After about 80 epochs, the output of CoLab had gone from an electric pulse to a few vocal sounds, spoken by an unintelligible fuzz:

Cooking faster on AWS SageMaker

Now I switched to a paid instance on AWS SageMaker. There’s still a 12-hour limit here, but I could run on a better setup (ml.p2.8xlarge, 8 GPUs, $7.20/hr) and set up 180 epochs. The three things to watch for (see the command sketch after this list) were:

  1. setting aside extra storage (50GB)
  2. installing libsndfile with conda install -y -c conda-forge libsndfile
  3. using distribute.py in place of train.py for multiple GPUs
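Put together, the SageMaker run looked roughly like this (distribute.py sits alongside train.py in the repo and, as used here, takes the same arguments):

    # libsndfile is missing from the SageMaker image
    conda install -y -c conda-forge libsndfile

    # same arguments as the CoLab run, but spread across all 8 GPUs
    python distribute.py --config_path ../config.json --data_path ../xh_za/za/xho/wavs/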

Unfortunately the output of this was also unintelligible.

The failure

Why is it not coming through?

One problem is that the text data does not have a standard spelling: there is no sign of the capital C, Q, and X typically used to represent the clicks. Where English is made easier by phonetic dictionaries, there is no such boost or parsing of isiXhosa words here.
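This is easy to check against the transcripts. Assuming OpenSLR’s usual tab-separated line_index.tsv layout for these datasets (file ID, then transcript), a short script counts how often the click letters appear at all:

    from collections import Counter

    click_counts = Counter()
    with open("xh_za/za/xho/line_index.tsv", encoding="utf-8") as f:
        for line in f:
            _, transcript = line.rstrip("\n").split("\t", 1)
            # c, q, and x (in either case) are the click consonants
            click_counts.update(ch for ch in transcript.lower() if ch in "cqx")
    print(click_counts)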

Update: A major problem with this attempt was that phonemizer, the library which converts the text into phonemes, was set to en-US. There is no isiXhosa setting within the library, so I would need to find or write a converter script. Other languages which are included in the library, or have a close cousin there (for example, Russian and Mongolian), would fare better.
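To see the mismatch concretely, here is roughly the phonemizer call that TTS makes under the hood (the library is real, https://github.com/bootphon/phonemizer; the wrapping inside TTS differs):

    # requires: pip install phonemizer, plus the eSpeak backend installed
    from phonemizer import phonemize

    text = "IsiXhosa lolunye lweelwimi zaseMzantsi Afrika"
    # en-us was the only language my config offered; there is no isiXhosa option
    print(phonemize(text, language="en-us", backend="espeak"))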

It’s possible that the training is more difficult because the audio dataset includes multiple speakers, and because it has only 40% of the data of an English dataset (counting bytes, not number of samples or length of content).

The main problem is that a serious project should reach the default 1,000 epochs before generating results, and I was far short of that after 12 hours. I don’t have the budget to run this project independently for a week or longer.

Future updates

This article is from 2019. For latest links, see https://github.com/mapmeld/use-this-now#text-to-speech
