Teaching Hindi to ELECTRA

Today I’m publishing a Hindi language model based on Google Research’s ELECTRA method for language training. This same process could be done for many languages, but I started with Hindi because it has more data and benchmarks, yet still no BERT model.

Building a Hindi Corpus

My first lead for a Hindi corpus was HindMonoCorp, a combination of CommonCrawl, Wikipedia, and other datasets totalling 9GB. Another download includes pre-tokenized text. The one downside is that it was compiled in March 2014.

I considered updating the content with a recent download of Wikipedia articles. There would be duplication in the corpus from old and new Wikipedia articles, but hopefully nothing too damaging.
Wikipedia offers regularly-updated XML dumps of every language wiki. At the time of my download, the articles were from late January / early February 2020. The XML file is about 1GB unzipped.

Once you have the wiki XML, use Giuseppe Attardi’s wikiextractor to output plain text for training. This brings us down to ~400MB.

Around this stage of my research, HuggingFace posted a tutorial using their Tokenizers and Transformers libraries to train EsperBERTo, an Esperanto language model.

That post let me know about OSCAR, which publishes an up-to-date and de-duped Hindi CommonCrawl, weighing in at 8.9GB of uncompressed text. This was last collected in April 2019. I decided to combine this with the latest Wiki articles to make my corpus.
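As a rough illustration of the combining step, here is a minimal sketch that concatenates plain-text files into one corpus file and drops the `<doc ...>` / `</doc>` wrapper lines that wikiextractor puts around each article. The function name and the file names in the usage note are hypothetical, and this is not necessarily how the corpus for this project was assembled.

```python
import re

# wikiextractor wraps each article in <doc id=... url=... title=...> ... </doc>
DOC_TAG = re.compile(r"^</?doc\b[^>]*>\s*$")


def merge_corpus(input_paths, output_path):
    """Concatenate text files into one corpus, skipping <doc> wrapper lines.

    Returns the number of lines written.
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if DOC_TAG.match(line.strip()):
                        continue  # drop the wrapper, keep the article text
                    # Normalize the line ending so files without a trailing
                    # newline do not run into the next file's first line.
                    out.write(line.rstrip("\n") + "\n")
                    written += 1
    return written
```

A call such as `merge_corpus(["wiki_hi.txt", "oscar_hi.txt"], "corpus_hi.txt")` (hypothetical file names) would then produce a single training file.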

Tokenizers

For a model to learn the connections between words or sub-word parts of language, it’s necessary to first split the source text into those parts. If you’re puzzling over sub-words, think about German compound words like Physikprofessor and English prefix and suffix rules in words like phonics, telephone, and phonology. In the EsperBERTo blog post, HuggingFace recommends letting Tokenizers work on the byte level to determine how characters work together and make up these sub-words.

Hindi and other Indic languages are also supported by a tokenizer in the Indic NLP Library (anoopkunchukuttan.github.io/indic_nlp_library/). This was recently cited in Facebook’s mBART post on machine translation.

For this project, ELECTRA expects one vocab.txt file of full words, and the finetuner tokenizer makes the same assumptions. I decided to use Tokenizers, but set to BERT-style whole word mode.
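To make the sub-word idea concrete, here is a small, self-contained sketch of the greedy longest-match lookup that BERT-style WordPiece tokenizers apply against a vocab.txt. The tiny vocabulary is invented for illustration (a real vocab.txt is learned from the corpus), and this toy function stands in for, rather than reproduces, what the Tokenizers library does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-words by greedy longest-match against vocab.

    Pieces that continue a word are stored in the vocab with a '##' prefix,
    following the BERT convention.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary built around the examples in the text
toy_vocab = {"tele", "phon", "##phone", "##ics", "##ology", "Physik", "##professor"}

for w in ["telephone", "phonics", "phonology", "Physikprofessor"]:
    print(w, wordpiece(w, toy_vocab))
```

With this vocabulary, "telephone" splits into `tele` + `##phone` and "Physikprofessor" into `Physik` + `##professor`, which is the kind of shared sub-word structure the model can then learn from.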

Training the ELECTRA way

ELECTRA is lighter-weight than BERT and other Transformer models. When I read about it, I was immediately convinced to switch this project from BERT to ELECTRA. It borrows an idea from GANs: a small generator network replaces some tokens in your corpus with plausible alternatives, and the main model is trained to detect which tokens were replaced.

Being smart about CoLab, GPUs, and TPUs

It’s possible to completely train an ELECTRA model on Google CoLab, especially CoLab Pro, but please consider a few different things:

  • Creating the pretraining data tfrecord files takes hours and is entirely CPU. Output is different for different model sizes. Once it’s done, copy it over to Google Drive or Cloud Storage. Don’t pay a premium for a GPU to be idle during this time.
  • If you use a prebuilt Deep Learning VM, select TensorFlow 1.15.
    If you are using CoLab, it now defaults to TensorFlow 2.x, so
    pip3 install tensorflow==1.15
  • The default in pretraining is a ‘small’ model. For a ‘base’ or ‘large’ model, you may need more than a CoLab GPU can offer.
    So start with small, use a GPU from GCloud / AWS / Azure, or use a TPU.
  • Using TPUs: upload data to G-Cloud Storage. If you’re on CoLab, select TPU instance type, and run:
    import os
    print(os.environ['TPU_NAME'])
    Call run_pretraining.py with --data-dir gs://address, and
    --hparams '{"use_tpu": true, "num_tpu_cores": 8, "tpu_name": "grpc://TPU_NAME", "model_size": "base"}'
  • Training time depends on the hardware type and size of your model. In my case, 9 GB + ‘small’ model was predicted to take 7.5 days on a CoLab GPU, not accounting for stopping and resuming training when it times out. A Google Cloud GPU reduced that to 4.5 days (with fewer interruptions), and a TPU needed only 2 days. If you have the budget, save some time.
  • If you’re using AWS / GCloud / Azure, you’ll want to start with a Deep Learning prepped VM, or sign up with NVIDIA and download the files for CUDA-10.0 and cuDNN. If you have to do a manual install of GPU drivers, read the Google Cloud instructions, and check that ELECTRA can run. Don’t waste GPU dollars on searching and debugging! Don’t get invested in a machine that doesn’t work with a GPU!
  • When you copy an incomplete model off of a server, also bring the tfrecords folders! These are continually updated during training, and if you try to resume training without them, you’ll be back to step 1.
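The TPU settings from the list above are easy to mistype as a hand-written JSON string, so here is a minimal sketch that builds the same arguments in Python. The helper name and the model name `hindi_electra` are hypothetical, and `gs://address` and `grpc://TPU_NAME` are the same placeholders used in the text; replace them with your own bucket and TPU address.

```python
import json
import shlex


def pretraining_command(data_dir, model_name, tpu_name, model_size="base"):
    """Build the argument list for ELECTRA's run_pretraining.py on a TPU."""
    hparams = {
        "use_tpu": True,
        "num_tpu_cores": 8,
        "tpu_name": tpu_name,
        "model_size": model_size,
    }
    return [
        "python3", "run_pretraining.py",
        "--data-dir", data_dir,
        "--model-name", model_name,
        # json.dumps emits valid JSON (lowercase true, straight double quotes)
        "--hparams", json.dumps(hparams),
    ]


# Placeholder values: substitute your own bucket, model name, and TPU address
cmd = pretraining_command("gs://address", "hindi_electra", "grpc://TPU_NAME")
print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the shell-quoted command lets you paste it into a terminal or notebook cell; passing `cmd` to `subprocess.run` would avoid shell quoting altogether.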

Comparing mBERT and Hindi-ELECTRA

After around 48 hours and 365,000 steps (out of the recommended million for a mini model), I was eager to compare results.
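For a back-of-envelope sense of what finishing the run would cost, the sketch below extrapolates from those two numbers. It assumes the same hardware and a constant step rate with no interruptions, which real runs rarely manage, so treat the result as a rough lower bound rather than a prediction.

```python
def hours_remaining(steps_done, hours_elapsed, target_steps):
    """Estimate remaining training hours, assuming a constant step rate."""
    steps_per_hour = steps_done / hours_elapsed
    return (target_steps - steps_done) / steps_per_hour


remaining = hours_remaining(365_000, 48, 1_000_000)
print(f"about {remaining:.1f} more hours ({remaining / 24:.1f} days)")
```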

Testing on XNLI

The key question is whether training on Hindi-specific data provides a value that could motivate developers to switch from Multilingual BERT.

The standard benchmark for mBERT, XLM-R, and other cross-lingual models is called XNLI. As I was thinking about how I would add this new benchmark and fine-tune my model, Chinese-ELECTRA appeared.

The developer had already added XNLI to the available tasks, and having this code to point to saved a great deal of time.

Unfortunately, XNLI works by training on English MNLI data and evaluating on multilingual data. Even though I filtered the XNLI test to English and Hindi sentences, the results were rarely an improvement over random chance. This mini model hasn’t seen enough English data for the training phase to transfer over.

With these disappointing results, I wanted to check if any language knowledge had been picked up.

Testing on Hindi Movie Reviews

I found a Hindi movie reviews dataset (similar to the common IMDB dataset) as a benchmark to compare Multilingual BERT and my new model on monolingual Hindi problems. There are about 3,500 each of negative, neutral, and positive-tagged reviews. Many include social media tags or misspellings which are not found in the original corpus.

I used train_test_split with a specific random seed (91) so every run of every model will get the same randomly sorted training and evaluation data.
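The point of the fixed seed is that the shuffle is reproducible. Below is a standard-library sketch of the same idea; it is a stand-in, not the function named above. With scikit-learn the corresponding call would be `train_test_split(rows, test_size=0.2, random_state=91)`. The 20% evaluation fraction and the sample rows are assumptions for illustration, since the actual split size is not stated here.

```python
import random


def seeded_split(rows, test_fraction=0.2, seed=91):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, label) rows standing in for the movie reviews
reviews = [(f"review {i}", i % 3) for i in range(10)]
train_rows, test_rows = seeded_split(reviews)

# The same seed gives the same split on every run, for every model
assert seeded_split(reviews) == (train_rows, test_rows)
```

Because both models are fine-tuned and evaluated on identical rows, a difference in accuracy reflects the models rather than a lucky or unlucky shuffle.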

After three epochs, SimpleTransformers + Multilingual BERT had 80.1% accuracy. Finetuning the early, mini Hindi-ELECTRA had 72.0% accuracy.

Continuing to improve

Though my accuracy is currently less than Google’s Multilingual BERT, I proved that Hindi-ELECTRA is learning Hindi sentiment. With more processing power and time, I think that this corpus has real value.

Today I started to re-run Hindi-ELECTRA with the ‘base’, medium-sized model. The GPU is the real cost for me, so I’ll switch to a lower cost GPU and increase the RAM.

How you can help

  • Come up with applications for a Hindi-ELECTRA model, so I can be motivated to keep developing this!
  • Suggest additional benchmarks for a trained Hindi language model
  • Find additional, larger Hindi corpuses (I don’t need a parallel English-Hindi corpus… just more Hindi)
  • Additional ML developer knowledge and GPU time

Updates?

This article is from March 2020. For my latest recommended NLP models, see github.com/mapmeld/use-this-now/blob/main/README.md#south-asian-language-model-projects

Nomadic web developer and mapmaker.