Teaching Hindi to ELECTRA

Building a Hindi Corpus

Tokenizers

Training the ELECTRA way

Being smart about Colab, GPUs, and TPUs

  • Creating the pretraining data tfrecord files takes hours and is entirely CPU-bound, and the output differs for different model sizes. Once it's done, copy it over to Google Drive or Cloud Storage; don't pay a premium for a GPU to sit idle during this step (a build sketch follows this list).
  • If you use a prebuilt Deep Learning VM, select TensorFlow 1.15.
    If you are using Colab, it now defaults to TensorFlow 2.x, so downgrade first:
    pip3 install tensorflow==1.15
  • The default in pretraining is a 'small' model. For a 'base' or 'large' model, you may need more than a Colab GPU can offer, so either start with small, use a GPU from GCloud / AWS / Azure, or use a TPU.
  • Using TPUs: upload your data to Google Cloud Storage. If you're on Colab, select a TPU instance type and read the TPU address:
    import os
    print(os.environ['TPU_NAME'])
    Call run_pretraining.py with --data-dir gs://address, and
    --hparams '{"use_tpu": true, "num_tpu_cores": 8, "tpu_name": "grpc://TPU_NAME", "model_size": "base"}'
    (a full invocation is sketched after this list).
  • Training time depends on the hardware type and the size of your model. In my case, a 9 GB corpus with the 'small' model was predicted to take 7.5 days on a Colab GPU, not counting time lost to stopping and resuming when the session times out. A GCloud GPU reduced that to 4.5 days (with fewer interruptions), and a TPU needed only 2 days. If you have the budget, buy back some time.
  • If you're using AWS / GCloud / Azure, start from a Deep Learning prepped VM, or sign up with NVIDIA and download the files for CUDA 10.0 and cuDNN. If you have to install GPU drivers manually, read the Google Cloud instructions and check that ELECTRA can run before you commit. Don't waste GPU dollars on searching and debugging, and don't get invested in a machine that doesn't work with a GPU!
  • When you copy an incomplete model off of a server, also bring the tfrecords folders! These are continually updated during training, and if you try to resume training without them, you'll be back to step 1 (a copy sketch follows below).
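
For the data step in the first bullet, here is a minimal build sketch using the ELECTRA repo's build_pretraining_dataset.py. It assumes google-research/electra is cloned and you're running from its root; the corpus, vocab, and output paths are placeholders, and the flag names are worth double-checking against the script's --help.

    # Turn raw text files into pretraining tfrecords (CPU-only, takes hours).
    # Paths here are hypothetical placeholders; swap in your own.
    python3 build_pretraining_dataset.py \
      --corpus-dir hindi_corpus/ \
      --vocab-file vocab.txt \
      --output-dir pretrain_tfrecords/ \
      --max-seq-length 128 \
      --num-processes 8 \
      --no-lower-case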
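
Putting the TPU pieces together, a sketch of the full pretraining call. The bucket and model name are hypothetical, and grpc://TPU_NAME stands in for whatever the environment variable printed above:

    # data-dir points at the Cloud Storage bucket holding pretrain_tfrecords/
    python3 run_pretraining.py \
      --data-dir gs://my-hindi-bucket \
      --model-name hindi_electra \
      --hparams '{"use_tpu": true, "num_tpu_cores": 8, "tpu_name": "grpc://TPU_NAME", "model_size": "base"}'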
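
And for the last bullet, a sketch of pausing work partway: copy both the checkpoints and the tfrecords with gsutil. This assumes ELECTRA's default data-dir layout (models/<model-name> next to pretrain_tfrecords); the bucket and model names are placeholders.

    # checkpoints live under <data-dir>/models/<model-name>
    gsutil -m cp -r gs://my-hindi-bucket/models/hindi_electra ./models/
    # the tfrecords are updated during training, so bring them along for resuming
    gsutil -m cp -r gs://my-hindi-bucket/pretrain_tfrecords ./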

Comparing mBERT and Hindi-ELECTRA

Testing on XNLI

Testing on Hindi Movie Reviews

Continuing to improve

How you can help

  • Come up with applications for a Hindi-ELECTRA model, so I'm motivated to keep developing it!
  • Suggest additional benchmarks for a trained Hindi language model
  • Find additional, larger Hindi corpora (I don't need a parallel English-Hindi corpus… just more Hindi)
  • Contribute additional ML developer knowledge and GPU time

Updates?
