Making a mini GPT-2 with dialect prompts

Creating the Arabic base model

For languages other than English, you will need to retrain GPT-2 on a large corpus. The two popular methods I've seen both start from one of the pretrained English models and replace its tokens and weights over several hours of training. The corpus can come from a recent dump of Wikipedia articles and/or a CommonCrawl archive of web content.
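
Under the hood, both methods amount to the same move: load the pretrained weights, swap in a tokenizer trained on the new-language corpus, and resize the embedding matrix to the new vocabulary. A minimal sketch with HuggingFace transformers (the tokenizer path is a placeholder, and this is not either tutorial's actual code):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical path: a BPE tokenizer trained on your Arabic corpus
tokenizer = GPT2TokenizerFast.from_pretrained("./arabic-tokenizer")

# Start from the 117M pretrained English weights
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Resize the embedding matrix to the new vocabulary size; the
# token-to-embedding mapping is then relearned during training
model.resize_token_embeddings(len(tokenizer))
```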

  • Tutorial 1: Portuguese GPT-2 notebook by Pierre Guillou. I've recommended updates to this code to resolve some issues with its dependencies. The notebook repeats code to track the new and old embeddings, and I lost the thread partway through; I was unable to train the model myself.
  • Tutorial 2: GPT-2 fork article by Ng Wai Foong. Generating the encoded NPZ corpus takes a lot of RAM; on my laptop I could process about 2,000,000 lines of Arabic Wikipedia (see the sketch after this list). Not great, but better than what was possible on Colab.
  • My current approach: download a wiki corpus with the method from Tutorial 1, and train the model with code from Tutorial 2.
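
To keep the NPZ encoding step inside laptop RAM, I truncated the wiki text before encoding. A rough sketch (filenames and the exact cap are illustrative):

```python
MAX_LINES = 2_000_000  # roughly what fit in memory on my laptop

# Copy the first MAX_LINES lines of the wiki dump into a smaller file
with open("arwiki.txt", encoding="utf-8") as src, \
     open("arwiki_subset.txt", "w", encoding="utf-8") as out:
    for i, line in enumerate(src):
        if i >= MAX_LINES:
            break
        out.write(line)
```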

Exporting the model

If you followed Tutorial 2, you will have a TensorFlow model checkpoint with a few associated files. Here's what you need to assemble a HuggingFace-compatible model (a loading sketch follows the list):

  • all files in checkpoints beginning with model-### (where ### is the highest completed step number); remove the step number from each filename.
  • a file in checkpoints simply named checkpoint
  • config files inside of models/117M (encoder, hparams, vocab)
  • a config.json from a similar GPT-2 model (example)
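
With those files gathered, one way to produce the HuggingFace-format weights is to load the renamed TF checkpoint with from_tf=True and re-save it. This is a sketch under assumptions (paths are illustrative, and the config.json must match your hparams):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# config.json copied from a similar GPT-2 model, edited to match hparams
config = GPT2Config.from_json_file("./checkpoints/config.json")

# Point at the renamed .index file from the checkpoints folder
model = GPT2LMHeadModel.from_pretrained(
    "./checkpoints/model.index", from_tf=True, config=config
)

# Writes pytorch_model.bin and config.json in HuggingFace format
model.save_pretrained("./arabic-gpt2")
```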

2021 Update

Check out AraGPT2 and an updated dialect model.

Making a new, finetuned model

I’ve previously reviewed and combined Arabic dialect datasets to train a classifier based on mBERT. For this new project, I combined those three dialect datasets into a finetuning corpus.
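
I'm not reproducing my exact preprocessing here, but the idea behind dialect prompts is to pair each training sentence with its dialect label so the finetuned model can be steered at generation time. An illustrative sketch with a made-up tag format:

```python
# Illustrative only: prepend a dialect tag to each labeled sentence
labeled_examples = [
    ("EGYPTIAN", "ازيك عامل ايه"),
    ("GULF", "شلونك شخبارك"),
]

with open("dialect_corpus.txt", "w", encoding="utf-8") as out:
    for dialect, text in labeled_examples:
        out.write(f"<{dialect}> {text}\n")
```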

Meanwhile, making an Adapter

Separately, I used the dialect datasets to train a dialect classifier, then published those weights as an adapter, a kind of packaged finetuned layer. You can find it on AdapterHub: https://adapterhub.ml/adapters/mapmeld/bert-base-arabert-ar-dialect/
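
To attach the adapter to a base model, the adapter-transformers fork of the transformers library can load it by name. A sketch under assumptions: the base-model ID and the adapter identifier below are placeholders; check the AdapterHub page above for the exact strings.

```python
# Requires the adapter-transformers fork, which installs as `transformers`
from transformers import AutoModelWithHeads

# Assumed AraBERT base model ID; the adapter page lists the exact one
model = AutoModelWithHeads.from_pretrained("aubmindlab/bert-base-arabert")

# Load and activate the dialect adapter (identifier is a placeholder)
adapter_name = model.load_adapter("mapmeld/bert-base-arabert-ar-dialect")
model.set_active_adapters(adapter_name)
```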

Potential improvements:

  • starting with a larger GPT-2 model (>117M)
  • using a larger corpus
  • more training time
  • also publishing an interactive demo on Streamlit / Gradio (a minimal Gradio sketch follows)
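
For the last item, here is a minimal Gradio sketch, assuming the exported model folder from earlier in this post:

```python
import gradio as gr
from transformers import pipeline

# Text-generation pipeline over the exported HuggingFace-format model
generator = pipeline("text-generation", model="./arabic-gpt2")

def generate(prompt):
    return generator(prompt, max_length=60)[0]["generated_text"]

# Launches a local web demo with a text box in and text out
gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```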

Updates?

This article was posted in September 2020. For new recommended models and datasets, please check https://github.com/mapmeld/use-this-now/blob/main/README.md#arabic-nlp
