CRUD app for Text Classification
In the last post, I described a ‘CRUD for machine learning’ with these features:
- storing and serving SciKit-Learn models with a Flask/Python API (from an existing project)
- incremental learning: add training data over time to improve the model
- ELI5 to explain the model’s decision on every prediction
The next step for my project is supporting text classification. To support languages beyond English, I will use Facebook’s FastText to turn text into vectors; from there I can resume using SciKit-Learn and ELI5 as I would with any other dataset.
When using GPT-2, I learned two points which also apply here:
- FastText returns a vector for each word, but SciKit-Learn needs just one per sample, so I have to average the word vectors together. In this project I will see if also including the min and max of each dimension is more helpful. In the future I could use PyTorch or TensorFlow to build a neural network which accepts all of the word vectors.
- ELI5 (and the similar project Alibi) have no built-in explanation support for these classifiers, so we need their black-box method of comparing the predictions for many permutations of the text. This takes more computation time.
Modifying training for text classifiers
I created new endpoints, /train_text/create and /train_text/insert. For consistency, I moved much of the old /train/ code into common functions.
FastText has pre-trained models for hundreds of languages, which gensim can use like anything else based on word2vec:
import nltk
tokens = nltk.word_tokenize(sentence)  # split the sentence into words
...
from gensim.models.keyedvectors import KeyedVectors
# load the pre-trained Arabic FastText vectors (word2vec text format)
vectorizer = KeyedVectors.load_word2vec_format('wiki.ar.vec')
vectorizer.most_similar('مدرس')  # 'teacher': sanity-check the embeddings
Once the pre-trained word model is loaded, I can accept training data, even in incremental batches, because the full word-space should already be vectorized by FastText. Some words and typos are going to be missing from the vocabulary, so I replace these with a filler word (‘the’).
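A minimal sketch of this sentence-to-vector step, assuming the vectorizer loaded above; sentence_to_features and FILLER are illustrative names of my own, not from the project:

import numpy as np

FILLER = 'the'  # stand-in for words missing from the embedding vocabulary

def sentence_to_features(tokens, vectorizer):
    # replace out-of-vocabulary words and typos with the filler word
    tokens = [t if t in vectorizer else FILLER for t in tokens]
    vectors = np.array([vectorizer[t] for t in tokens])  # (n_words, 300)
    # pool the per-word vectors into one fixed-length row:
    # average, min, and max of each dimension (3 x 300 = 900 columns)
    return np.concatenate([vectors.mean(axis=0),
                           vectors.min(axis=0),
                           vectors.max(axis=0)])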
Every time that I change the server code, Flask reloads the Arabic word vectors, and that takes a lot of time. I should use a smaller model for development, or keep the vectors in an independent process, to make development smoother.
I’m not sure of the right way to smoothly combine all of the word vector data and non-text columns into one place. I have code working for now, but it’s ugly. This is for predict, and the train version is even worse (as it’s using a Pandas dataframe, which was zero help here).
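The original snippet isn’t reproduced here, but the general shape of the problem looks something like this hypothetical assembly step, reusing sentence_to_features from the sketch above (build_row and samples are illustrative names):

import numpy as np

def build_row(tokens, numeric_values, vectorizer):
    # 900 pooled word-vector columns followed by the non-text columns
    text_features = sentence_to_features(tokens, vectorizer)
    return np.concatenate([text_features,
                           np.asarray(numeric_values, dtype=float)])

# one SciKit-Learn feature matrix, each row mixing text and tabular data
X = np.vstack([build_row(tokens, numbers, vectorizer)
               for tokens, numbers in samples])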
I ran my tests several times and made adjustments until create, insert, and predict were all functioning. My next steps are to build a real classifier, see if it can make accurate predictions, and then reconnect ELI5 for text data.
Setting up a real classifier using my API
My goal will be to classify Tweets responding to NetflixMENA. My training data will be the positive- and negative-tagged Tweets from Professor Motaz Saad’s Arabic Sentiment Analysis repo. For English-language projects, you can easily find training data such as reviews from IMDB, Amazon, and Yelp. You also aren’t bound to FastText — you could use any modern transformer.
At first, I kept running into problems uploading the Arabic Tweet data in one POST request. How could I have forgotten the power of incremental machine learning? I split the training data up into 600-line pieces, used the create endpoint for the first file, and then used the insert endpoint. Soon my system was learning, even over an airplane internet connection.
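In script form, the batching could look something like the sketch below; the payload field, filename, and server address are assumptions, and only the two endpoint names come from this project:

import requests

BATCH = 600  # lines per request; one giant POST kept failing

with open('tweets_labeled.csv') as f:
    lines = f.readlines()

for start in range(0, len(lines), BATCH):
    chunk = ''.join(lines[start:start + BATCH])
    # the first batch creates the classifier; later batches add to it
    endpoint = '/train_text/create' if start == 0 else '/train_text/insert'
    requests.post('http://localhost:5000' + endpoint, data={'data': chunk})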
Activating ELI5 for Text
In the CRUD app, ELI5 returned a score for every column in the Titanic dataset. What makes our situation with text classifiers so different?
The problem is that the 300 dimensions returned by the FastText model (turned into 900 columns by resolving them into min, max, and average) have no human-expressible meaning. I’d be eager to see word vectors debunked or explained, but for the time being, we need ELI5 to use the LIME algorithm: perturbing the text many times and comparing predictions to explain the classifier on the word-by-word level.
LIME needs to call a pipeline with predict_proba, which returns class probabilities, so the Perceptron classifier was no longer useful. I retrained the model using the SGDClassifier.
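A sketch of how this could fit together with ELI5’s TextExplainer. Note that SGDClassifier only exposes predict_proba when trained with a probabilistic loss such as loss='log'; the predict_proba wrapper is my own assumption about the pipeline, built on the earlier sentence_to_features sketch:

import nltk
from sklearn.linear_model import SGDClassifier
from eli5.lime import TextExplainer

# Perceptron has no predict_proba; SGDClassifier with log loss does
clf = SGDClassifier(loss='log')
# ... fit clf on the pooled word-vector features as before ...

def predict_proba(texts):
    # LIME sends many permuted versions of the Tweet through this function
    rows = [sentence_to_features(nltk.word_tokenize(t), vectorizer)
            for t in texts]
    return clf.predict_proba(rows)

te = TextExplainer(random_state=42)
te.fit(tweet_text, predict_proba)  # perturb the text, fit a local model
te.show_prediction()               # word-level weights, shown in a notebook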
I struggled to set everything up so that I could accept JSON input, fill in dummy values for missing columns, and send it all into a SciKit pipeline for vectorizing and so on. Although it was originally cool that the system accepted partial rows and made use of Pandas, it might be easier for an internal system to have a fixed CSV format / array upload process.
Interpreting results
The JSON response returns a prediction and an explanation for each input.
The phrases from the first few Tweets with the highest scores were common words with little positive or negative to them: منزلين (two houses), وانتو (and you), and متى (when). I decided to start over without using min and max on each dimension as additional inputs. The new top words included يحمس (excite), حُب (love), and نفس (same), which were much more related to the positive or negative tone of the Tweet.
How would I continue to production-ize the CRUD app?
How will I take this from one model to a multi-model SaaS app? This is my plan for next steps, in order of importance:
- Give each classifier a unique ID, and a folder with its pickled model and data uploads (a rough sketch follows this list).
- Always store the English and Arabic word embeddings in memory (this might need to be a separate process to make server debugging less annoying).
- Store the classifier in memory, but remove it from memory after several minutes of inactivity.
- Store the text_type variable so we know what kind of model we have, and the names of all output classes of the classifier; right now these are wiped out whenever the server restarts.
- Use a database to store the text_type and column details.
- Display predictions on the frontend, and allow changing of weights.
- Store training data in a database and make it searchable.
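For the first item, a minimal sketch of what per-classifier storage could look like, using joblib; the folder layout and function names are assumptions, not the project’s actual code:

import os
import uuid
from joblib import dump, load

MODELS_DIR = 'models'

def save_classifier(clf):
    model_id = str(uuid.uuid4())  # unique ID for this classifier
    folder = os.path.join(MODELS_DIR, model_id)
    os.makedirs(folder, exist_ok=True)  # data uploads share this folder
    dump(clf, os.path.join(folder, 'model.joblib'))  # pickled model
    return model_id

def load_classifier(model_id):
    return load(os.path.join(MODELS_DIR, model_id, 'model.joblib'))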
Updates
This article is from 2019. For the latest links and recommendations, see:
https://github.com/mapmeld/use-this-now#model-editing