Making a mini GPT-2 with dialect prompts

Nick Doiron
3 min read · Sep 18, 2020

This post is more casual than rigorous, but I wanted to review a current (September 2020) quick-and-easy way to set up a custom GPT-2 model.

If you’re working in English: you’re in luck! You can start with the main pre-trained GPT-2 models from OpenAI, and finetune them for a specific use-case. I recommend SimpleTransformers, and there are plenty of other tutorials out there.
If you’re in English but need to update the vocabulary/tokens of GPT-2, use a library based on Transformers, and adjust things before fine-tuning:
ft_model.tokenizer.add_tokens(["liopleurodon"])
ft_model.model.resize_token_embeddings(len(ft_model.tokenizer))
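
For context, here’s a minimal sketch of that workflow with SimpleTransformers’ LanguageModelingModel; the file name and training arguments are placeholders rather than settings from a real run:

from simpletransformers.language_modeling import LanguageModelingModel

# start from the pretrained English "gpt2" weights; mlm=False for causal LM training
ft_model = LanguageModelingModel("gpt2", "gpt2", args={"num_train_epochs": 1, "mlm": False})

# register the new token and grow the embedding matrix to match
ft_model.tokenizer.add_tokens(["liopleurodon"])
ft_model.model.resize_token_embeddings(len(ft_model.tokenizer))

# fine-tune on a plain-text file, one example per line
ft_model.train_model("train.txt")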

Creating the Arabic base model

For models outside of English, you will need to retrain GPT-2 on a large corpus. The two popular methods I’ve seen both start from one of the pretrained models and replace its tokens and weights over several hours of training. The corpus can come from a recent dump of Wikipedia articles and/or a CommonCrawl archive of web content.

  • Tutorial 1: Portuguese GPT-2 notebook by Pierre Guillou. I’ve recommended updating this code to resolve some issues with its dependencies. Between the repeated code for tracking new and old embeddings, I lost the plot and was unable to train the model myself.
  • Tutorial 2: GPT-2 fork article by Ng Wai Foong. This takes a lot of RAM to generate the encoded NPZ corpus; on my laptop I could process about 2,000,000 lines from Arabic Wikipedia, which is not great, but better than what was possible on Colab.
  • My current approach: download a wiki corpus with the method from Tutorial 1 (a corpus-building sketch follows below), and train the model with code from Tutorial 2.

So far I’ve been running training overnight (<24 hours, to fit Colab limits).
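
As a rough illustration of the corpus step, here is one way to flatten an Arabic Wikipedia dump to a text file with the Hugging Face datasets library. This is not Tutorial 1’s exact code; the dump/config name is an assumption, and some language configs need extra (Apache Beam) processing to build locally.

from datasets import load_dataset

# pull an Arabic Wikipedia dump and write one article per line
wiki = load_dataset("wikipedia", "20200501.ar", split="train")
with open("ar_wiki.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        f.write(article["text"].replace("\n", " ") + "\n")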

Exporting the model

If you followed Tutorial 2, you will have a TensorFlow model checkpoint with a few associated files. Here’s what you need to make a HuggingFace-compatible model (a rough assembly sketch follows this list):

  • all files in checkpoints beginning with model-### (where ### is the highest completed step number); remove the step number from each filename.
  • a file in checkpoints simply named checkpoint
  • config files inside of models/117M (encoder, hparams, vocab)
  • a config.json from a similar GPT-2 model (example)
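
Putting that together, here is a hypothetical assembly script; the checkpoint directory, step number, and output folder name are placeholders, so adjust them to your run.

import glob
import os
import shutil

STEP = "1000"            # your highest completed step number
SRC = "checkpoint/run1"  # wherever Tutorial 2 wrote its checkpoints

os.makedirs("argpt", exist_ok=True)

# copy the model-### files, dropping the step number from each name
for path in glob.glob(os.path.join(SRC, "model-" + STEP + ".*")):
    target = os.path.basename(path).replace("model-" + STEP, "model")
    shutil.copy(path, os.path.join("argpt", target))

# the small file simply named "checkpoint"
shutil.copy(os.path.join(SRC, "checkpoint"), os.path.join("argpt", "checkpoint"))

# encoder / hparams / vocab from the original models/117M download
for name in ["encoder.json", "hparams.json", "vocab.bpe"]:
    shutil.copy(os.path.join("models/117M", name), os.path.join("argpt", name))

# finally, drop in a config.json borrowed from a similar GPT-2 model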

You should be able to load the folder as a Transformers AutoModel, and export back as a PyTorch model:
from transformers import AutoModel

m = AutoModel.from_pretrained("./argpt", from_tf=True)
m.save_pretrained("export")

The vocab.json is a little trickier; at the end of the notebook I use code from the GPT-2 fork to load its custom vocab encoder and write those tokens back out in a more standard format. You will also need a tokenizer_config.json for the finished model.
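
As a generic sanity check (not code from the notebook), the exported folder should load with the standard GPT-2 classes once the tokenizer files described above are in place:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "./export" is the folder from save_pretrained above, plus the tokenizer files
tokenizer = GPT2Tokenizer.from_pretrained("./export")
model = GPT2LMHeadModel.from_pretrained("./export")
print(tokenizer.tokenize("مرحبا بالعالم"))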

I published the original Arabic GPT-2 as ‘Sanaa’ on the HuggingFace model hub. In addition to making the model easier for people to use in their projects, publishing it there adds a little model card and an API widget so people can try it out.

2021 Update

Check out AraGPT2 and an updated dialect model.

Making a new, finetuned model

I’ve previously reviewed and combined Arabic dialect datasets to train a classifier based on mBERT. For this new project, I combined those three datasets.

Next, I added special tokens to the dataset and to the tokenizer, where they serve as control characters (i.e. the following content is [dialect]). There is no need to add a space between the token and the next character in the sentence.

ft_model.tokenizer.add_tokens(
["[EGYPTIAN]", "[GULF]", "[LEVANTINE]", "[MSA]", "[MAGHREBI]"])
ft_model.model.resize_token_embeddings(len(ft_model.tokenizer))

Here’s the code which I used:

This model is now published as sanaa-dialect (a generation example follows below). I would still like to retrain both models on more complete data; what mattered here was to build a pipeline that holds up for this first pass of the data, and to identify pain points.
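
Here’s a rough sketch of trying the control tokens at generation time; the hub id monsoon-nlp/sanaa-dialect is my assumption for where the model lives, and the prompt is just an example:

from transformers import pipeline

# prompt with a dialect control token, with no space before the Arabic text
generator = pipeline("text-generation", model="monsoon-nlp/sanaa-dialect")
print(generator("[MAGHREBI]مرحبا", max_length=40))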

Meanwhile, making an Adapter

Separately, I used the dialect datasets to train a dialect classifier, and published those weights as an adapter, a kind of packaged finetuned layer. You can find it on AdapterHub: https://adapterhub.ml/adapters/mapmeld/bert-base-arabert-ar-dialect/
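
Loading it looks roughly like this with the adapter-transformers fork of Transformers; the base model name and the adapter specifier are assumptions based on the AdapterHub page, so check the listing there for the exact strings:

from transformers import AutoModelWithHeads, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
model = AutoModelWithHeads.from_pretrained("aubmindlab/bert-base-arabert")

# pull the packaged dialect-classification layer from AdapterHub
adapter_name = model.load_adapter("ar/dialect@mapmeld")
model.set_active_adapters(adapter_name)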

Potential improvements:

  • starting with a larger GPT-2 model (>117M)
  • using a larger corpus
  • more training time
  • also publishing an interactive model on Streamlit / Gradio

Updates?

This article was posted in September 2020. For new recommended models and datasets, please check https://github.com/mapmeld/use-this-now/blob/main/README.md#arabic-nlp
