Building a better classifier
Exploring FastText from Facebook Research
Last month I reviewed three libraries for Arabic sentiment analysis. Each came with data for testing and training, and measured their own accuracy, but none separated out their functions in a way that I could easily train the model and then use it repeatedly on experimental data. At first I thought about forking the best project and releasing it as a Python package, but that seemed against the spirit of open source.
The Original Plan
I sketched out the process that I’m looking for in my ideal sentiment analysis pipeline:
- User selects a labeled dataset for the model to divide into training and testing. Docs can include instructions for downloading language-specific Tweets, reviews, and maybe more formal text samples.
- User selects a pre-trained word embedding / word vector file. For now I’ll use FastText’s Common Crawl or Wikipedia vectors. These are a technical improvement by Facebook over word2vec (and its gensim implementation), and they cover 157 languages, including Arabic and Kurdish (Kurmanji and Sorani).
- FastText tokenizes the input text and processes it into word vectors. Ideally this process respects emoji.
- Script uses SciKit AutoML to set the best hyperparameters for one of SciKit’s two Naive Bayes classifiers. The original repo compared several algorithms, but they could be redundant (using both Decision Tree and Random Forest, itself an ensemble of decision trees) or potentially unscientific (if we know Naive Bayes is the ‘correct’ algorithm for binary sentiment analysis, a higher score for another algorithm is more of a fluke than a ‘eureka’ moment).
- Trained model is saved to disk or kept in memory, ready for the user to send in the data they need classified.
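The first bullet, dividing a labeled dataset into training and testing sets, is easy to sketch with the standard library. This is my own illustration, not code from the repo, and the 80/20 split and fixed seed are assumptions:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle (label, text) pairs and split them into train and test lists."""
    rows = list(examples)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for repeatability
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

data = [('pos', 'great movie'), ('neg', 'boring'), ('pos', 'loved it'),
        ('neg', 'too long'), ('pos', 'well acted')]
train, test = train_test_split(data)
# 4 examples kept for training, 1 held out for testing
```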
Training the Model
I started with the Tweets in Dr. Motaz Saad’s arabic-sentiment-analysis repo. After researching FastText, I found that it can either return raw word vectors or train its own classifier, which combines several of the steps above. Me vs. Facebook… I’ll go with their built-in classifier.
Let’s reformat the training and test files and combine their two category files:
sed 's/pos\t/__label__pos /' train_pos_20181206_1k.tsv > train_pos_label.txt
sed 's/neg\t/__label__neg /' train_neg_20181206_1k.tsv > train_neg_label.txt
cat train_pos_label.txt train_neg_label.txt > train_combined_label.txt
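The same relabeling works in pure Python if sed isn’t handy; a minimal sketch assuming the same tab-separated pos/neg files as above:

```python
# Turn a "pos\t<text>" or "neg\t<text>" TSV row into fastText's
# "__label__pos <text>" training format.
def relabel(line):
    label, _, text = line.partition('\t')
    return '__label__' + label + ' ' + text

def combine(in_paths, out_path):
    """Relabel every row of the input TSVs into one combined training file."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in in_paths:
            with open(path, encoding='utf-8') as src:
                for line in src:
                    out.write(relabel(line))

# combine(['train_pos_20181206_1k.tsv', 'train_neg_20181206_1k.tsv'],
#         'train_combined_label.txt')
```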
There are Python bindings, but running these seemed to hang on my cloud server (not enough RAM?). I was able to kick off training on the command line with:

./fasttext supervised -input ../azraq-fasttext/train_combined_label.txt -output ../model_tweets -dim 300 -label __label__ -pretrainedVectors wiki.ar.vec
I didn’t see how to do this with the CommonCrawl vectors, which have completely different dimensions.
./fasttext test ../model_tweets.bin ../azraq-fasttext/test_combined_label.txt
Results are: P@1 0.94, R@1 0.94 (these are good).
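The matching precision and recall aren’t a coincidence: when every test line carries exactly one gold label and the model predicts exactly one label, P@1 and R@1 both reduce to plain accuracy. A toy check with made-up labels:

```python
# One gold label and one prediction per example.
gold = ['pos', 'neg', 'pos', 'neg']
pred = ['pos', 'neg', 'neg', 'neg']

correct = sum(g == p for g, p in zip(gold, pred))
precision_at_1 = correct / len(pred)  # correct / predictions made
recall_at_1 = correct / len(gold)     # correct / gold labels to recover

assert precision_at_1 == recall_at_1 == 0.75
```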
Trying it on NetflixMENA Tweets
I went back to my previous project and extracted Tweet text replying to the Jinn trailer:
# write each reply on its own line, in fastText's expected input format
with open('tweettest.txt', 'w') as op:
    for reply in tweets['replies']:
        op.write('__label__unknown ' + reply[8].replace('\n', ' ') + '\n')
Running FastText’s predict function returns a line-by-line prediction of positive or negative (plain predict prints no confidence score, though the predict-prob command adds a probability to each label):

./fasttext predict ../model_tweets.bin tweettest.txt
The model estimates 139 positive and 111 negative Tweets. FastText can support additional classes, including neutral, but my two-class training data rules that out here.
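Since predict writes one __label__… line per input Tweet, counts like these come from a simple tally of its output. A sketch that reads predictions from a list; passing `open(...)` on the saved output file works the same way:

```python
from collections import Counter

def tally(prediction_lines):
    """Count how many Tweets received each fastText label."""
    return Counter(line.strip() for line in prediction_lines)

counts = tally(['__label__pos', '__label__neg', '__label__pos'])
# counts['__label__pos'] == 2, counts['__label__neg'] == 1
```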
One sample positive:
Among many neutral or difficult-to-understand Tweets marked negative, this one is actually negative (emojis were not captured in my evaluation text):
Unfortunately this was also read as negative
Summary
What started as a journey toward a new text-processing module ended up as an exploration of FastText’s options. I was a little disappointed, as the work happens on the command line and no new module was necessary.
It does help me do sentiment analysis, and it opens up the possibility of doing analysis in other languages, so I can’t complain too much.
Updates?
This article is from June 2019. For newer models, datasets, and methods, start here: https://github.com/mapmeld/use-this-now#arabic-nlp