Comparing Arabic dialect datasets

Nick Doiron
3 min read · Feb 8, 2020


In my last post, I was labeling disinformation Tweets by language, using Kedro for data flow visualization. In the past I used Google Colab to train a classifier model on one dataset, with Multilingual BERT (mBERT) as the source of word vectorization.
In this project, I mixed in more datasets, which should (in theory) have improved accuracy. Yet the combined dataset failed to train a predictive model.

Rooting out Problems

I didn’t know what sank my classifier, so I considered several possibilities and planned out how I would test each:

  • Are there too many differences between my datasets in structure and labeling? Some samples come from Tweets and some come from longer articles. Should I remove usernames and hashtags?
    I should test whether the Qatar and JHU datasets are independently predictable.
  • Did the difficulty increase when adding Maghrebi dialect?
    - Are dialect-specific words missing from mBERT’s vocabulary?
    - Are their samples more difficult to group in vector-space — i.e. would a Gulf vs. Egyptian binary classifier perform better than Gulf vs. Maghrebi? This would be bizarre as humans always tell me these are very distinct.
    - Did the UBC dataset include Maghrebi data, but label it differently?
  • Is this a shortcoming of mBERT? Would AraBERT or ELMo work better?
  • If all else checks out… was my code bad?
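One of these checks, whether dialect-specific words are missing from mBERT's vocabulary, can be approximated by counting how many subword pieces the tokenizer splits a word into: in-vocabulary words stay whole, while unknown words fragment. Here is a minimal sketch of BERT-style greedy WordPiece tokenization over a toy English vocabulary (illustration only; the real check would load mBERT's actual vocab file and run it over dialect word lists):

```python
# Greedy longest-match-first WordPiece tokenization, as used by BERT-style
# tokenizers. A word in the vocabulary stays one token; an out-of-vocabulary
# word fragments into several '##'-prefixed pieces, or becomes [UNK].
# Toy vocabulary for illustration -- mBERT's real vocab has ~120k entries.
VOCAB = {"play", "##ing", "##ed", "un", "##play", "##able", "book"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are ##-prefixed
            if sub in vocab:
                match = sub
                break
            end -= 1  # shrink the candidate until it matches
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("playing"))     # ['play', '##ing']
print(wordpiece("unplayable"))  # ['un', '##play', '##able']
print(wordpiece("book"))        # ['book']
```

A dialect word that consistently shatters into many one- or two-character pieces (or [UNK]) would be a hint that mBERT's vocabulary does not cover it well.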

The Actual Issue

The real problem was in how I used SimpleTransformers: eval_model was working correctly, but my manual tests of predict were unexpectedly returning cached results. I was pretty surprised when [a, b] and [b, a] returned the same prediction, and split that work off into its own post.
When I finally checked GitHub, the user who had reported the issue found that adding ‘use_cached_eval_features’: False during training and initialization fixed the problem:
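Roughly, the fix looked like this (arg names as in SimpleTransformers circa early 2020; the surrounding args and the commented model setup are illustrative, not my exact notebook code):

```python
# Training args for SimpleTransformers' ClassificationModel. The key line is
# use_cached_eval_features: without it, predict() could return stale cached
# features instead of featurizing the new input.
train_args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 1,
    "use_cached_eval_features": False,  # the fix from the GitHub issue
}

# model = ClassificationModel("bert", "bert-base-multilingual-cased",
#                             num_labels=5, args=train_args)
# model.train_model(train_df)
# model.predict(["sample tweet one", "sample tweet two"])
```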

Evaluating Dialect Datasets

I rewrote my script to load data from each dataset, balance each category, do an 80/20 train-test split, and train a model.
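The balance-then-split step can be sketched like this (a hypothetical helper, not the notebook's actual code; it downsamples every label to the size of the smallest class before splitting):

```python
import random
from collections import defaultdict

def balance_and_split(samples, train_frac=0.8, seed=0):
    """samples: list of (text, label) pairs. Downsample every label to the
    smallest class size, shuffle, and split into train/test sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    n = min(len(group) for group in by_label.values())  # smallest class size
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))  # downsample each label to n
    rng.shuffle(balanced)
    cut = int(len(balanced) * train_frac)
    return balanced[:cut], balanced[cut:]

# Toy data: three labels of 100 samples each -> 240 train / 60 test
data = [("text%d" % i, lab) for lab in ("MSA", "Gulf", "Egyptian")
        for i in range(100)]
train, test = balance_and_split(data)
print(len(train), len(test))  # 240 60
```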

Training notebook link

Qatar University’s DART
MCC = 0.933 (where 1 is perfect); miscategorized only 5% of test data
Error counts (not corrected for proportion in the test data):
{ MSA: 0, Levantine: 34, Gulf: 43, Egyptian: 55, Maghrebi: 60 }
This looks like a pro-Modern Standard Arabic (MSA) bias; ideally this section would include a confusion matrix.
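For reference, MCC (Matthews correlation coefficient) ranges from -1 to 1, with 1 meaning perfect agreement and 0 meaning no better than chance. In the binary case it is a single formula over confusion-matrix counts; scikit-learn's matthews_corrcoef generalizes it to multiclass tasks like the five-dialect scores here. A minimal binary sketch:

```python
import math

def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # define MCC as 0 when undefined

print(binary_mcc(50, 50, 0, 0))    # 1.0  (perfect classifier)
print(binary_mcc(25, 25, 25, 25))  # 0.0  (no better than chance)
```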

University of British Columbia
dataset is about 6x the size of DART; no Maghrebi data
MCC = 0.66
Overall: ~80% accurate on MSA and Egyptian, ~70% on Gulf and Levantine

Johns Hopkins University
dataset is about 50% larger than UBC, or 9x the size of DART
MCC = 0.781
Overall: 86% accurate, on unbalanced test data
Error counts: { Maghrebi: 563, Levantine: 698, MSA: 723, Egyptian: 781, Gulf: 860 }

All Combined (unbalanced)
MCC = 0.777
Test input and error analysis:
13,443 Gulf, 85% labeled correctly by model
24,220 MSA, 93% accuracy
4,079 Levantine, 65%
2,105 Maghrebi, 68%
5,555 Egyptian, 75%
Overall: 85% accurate, but only Gulf and MSA meet that number, and they outnumber the other categories in the test data

Combined balanced (~14,000/label; 28% of all combined, 75% of UBC)
MCC = 0.730
77% accurate on Gulf
80% on MSA
75% on Levantine
82% on Maghrebi
79% on Egyptian
Overall: 78% accurate, but that accuracy is spread more evenly across dialects; Maghrebi is slightly underrepresented in the data, yet had the best accuracy

Analyzing the Arabic disinfo Tweets

Using the model from combined/balanced datasets on 139,334 Tweets:
73.2% were predicted to have Gulf dialect
9.5% Egyptian
8.5% MSA
5.2% Levantine
3.5% Maghrebi
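The breakdown above comes down to tallying the model's predicted labels into percentages (sketch with stand-in predictions; the real labels came from running the combined balanced model over all 139,334 Tweets):

```python
from collections import Counter

def dialect_percentages(predictions):
    """Turn a list of predicted dialect labels into percentages,
    sorted from most to least common."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: round(100 * n / total, 1)
            for label, n in counts.most_common()}

# Stand-in predicted labels, roughly matching the real proportions.
preds = (["Gulf"] * 73 + ["Egyptian"] * 10 + ["MSA"] * 9
         + ["Levantine"] * 5 + ["Maghrebi"] * 3)
print(dialect_percentages(preds))
# {'Gulf': 73.0, 'Egyptian': 10.0, 'MSA': 9.0, 'Levantine': 5.0, 'Maghrebi': 3.0}
```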

(Pie chart of these percentages, used as the article’s preview image.)

I had expected most to come back as Gulf (it is Saudi disinfo, after all) or MSA. The low numbers for Levantine and Maghrebi lead me to believe these regions were not targeted, and that we might be seeing mislabeled data there.
MSA and Egyptian may be labeled accurately, and used as a strategic way to reach people in a wider region (Egypt produces many Arabic-language movies, so its dialect is understood more widely).

Prediction notebook link

Updates?

This article is from February 2020. For newer datasets and models, please read https://github.com/mapmeld/use-this-now#arabic-nlp


Written by Nick Doiron

Web->ML developer and mapmaker.