What goes on inside NLP neural nets?

Nick Doiron
Jun 28, 2020

Recently I was reading about ERASER, a benchmark for interpretable NLP. It got me thinking that for my projects, particularly around Tweets, the best explanation might be for the model to show similar messages from its training data. Now… I know that machine learning generalizes and doesn’t memorize specific examples. But there are Tweets where you might not know the context, or misclassified posts where it’d help to see how the model sees them.

My theory is that the most similar-meaning examples should have similar coordinates in a middle layer of the neural net. At the input layer the closest examples would have the same words or syntax (‘this is the best/worst day ever’) and at the final output layer they’d be too general (same level of positivity or negativity).

[diagram: my theory that a layer exists with the right level of meaningful similarity]

There is some real basis for this idea. First, a family of neural networks — the autoencoders used in deepfakes and other style transfer — is designed to narrow input down to some essential signals in the middle, and then regenerate detail at the end:

[diagram: an autoencoder narrowing then widening]
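As a toy illustration of that shape, here's a minimal Keras autoencoder (the dimensions are arbitrary) where a 32-node bottleneck plays the role of the "essential signals" layer:

import tensorflow as tf

# narrow from 784 inputs down to a 32-node bottleneck, then widen back out
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # e.g. a flattened 28x28 image
    tf.keras.layers.Dense(32, activation="relu"),     # bottleneck: the "essential signals"
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")     # trained to reconstruct its own input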

And recently Jesse Mu at Stanford posted a preprint specifically looking at the inner neurons behind natural language inference predictions:

we analyze the [multilayer perceptron] component, probing the 1024 neurons of the penultimate hidden layer for sentence-level explanations, so our inputs x are premise-hypothesis pairs.

This led me to believe the most important middle layer would be the one just before the final classification.

PyTorch / SimpleTransformers

My first idea for exploring the mind of a neural network was training in SimpleTransformers and then digging into its PyTorch internals. I used a Spanish-English sentiment analysis dataset from the University of Houston — part of their LinCE benchmark.
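The training run itself is short in SimpleTransformers — roughly this, assuming the LinCE sentiment data is already loaded into a pandas DataFrame in SimpleTransformers' expected format:

from simpletransformers.classification import ClassificationModel

# train_df has columns ["text", "labels"], with labels in {0, 1, 2}
# for negative / neutral / positive sentiment
model = ClassificationModel("bert", "bert-base-multilingual-cased", num_labels=3)
model.train_model(train_df)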

I was able to output the list of layers — a BERT layer from Transformers, a Dropout layer, then a Linear layer that narrows the results down to the three output classes (positive, neutral, or negative sentiment). I’d like to remove those last two layers and see the 768 out_features appearing at the end of the BERT module, or somehow isolate the BERT module and run only that step.
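A sketch of what isolating that step might look like, assuming model is the trained ClassificationModel from above (in practice I didn't get this working cleanly):

import torch

bert = model.model.bert          # the BERT submodule inside BertForSequenceClassification
tokenizer = model.tokenizer

inputs = tokenizer("this is the best day ever", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
pooled = outputs[1]              # the (1, 768) pooled vector, before dropout/classifier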
Neural network libraries don’t readily add and remove layers in place, especially when we need all of the text preprocessing. We can’t remove the final layers from the original code, because we need them during the training phase. I tried swapping the existing model for a new torch.nn.Sequential containing only the BERT step, but nothing panned out. So I decided to start over with Keras.

Keras / AutoKeras

The AutoKeras model uses an older form of tokenization, not a modern pretrained model. For now I can put that aside, because we’re just trying to edit a working neural network.
The AutoKeras TextClassifier comes with an IMDB data example. I ran that, exported a normal Keras neural network, and saw these layers:

[<tensorflow.python.keras.engine.input_layer.InputLayer>,
 <tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization>,
 <tensorflow.python.keras.layers.core.Dense>,
 <tensorflow.python.keras.layers.normalization_v2.BatchNormalization>,
 <tensorflow.python.keras.layers.advanced_activations.ReLU>,
 <tensorflow.python.keras.layers.core.Dense>,
 <tensorflow.python.keras.layers.core.Activation>,
 DictWrapper({'classification_head_1': <tensorflow.python.keras.losses.BinaryCrossentropy object>})]
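For reference, the run that produced that list was roughly this (a sketch — x_train and y_train stand in for the IMDB review strings and labels from the AutoKeras example):

import autokeras as ak

# one short search, then export a plain Keras model we can edit
clf = ak.TextClassifier(max_trials=1)
clf.fit(x_train, y_train)        # x_train: numpy array of review strings
model = clf.export_model()
print(model.layers)              # the list above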

We can keep the input and vectorization layers, and get rid of the final classification layer. Honestly, I probably removed more layers from the end than necessary. As with PyTorch, it wasn’t possible to edit the layer list in place, but I was able to create a new tf.keras.Sequential from the reduced layers.
I trained the model on the IMDB training set, then made vectors from the IMDB test set. Finally, I made up several phrases and found the closest test review to each one I typed, measuring straight-line (Euclidean) distance with all nodes of the layer weighted evenly.
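The truncation and nearest-neighbor lookup looked roughly like this (a sketch; the cut index is illustrative and should match the layer list above, and x_test is assumed to be the numpy array of test review strings):

import numpy as np
import tensorflow as tf

# rebuild a model that stops before the classification layers
truncated = tf.keras.Sequential(model.layers[:-3])

vectors = truncated.predict(x_test)                         # one vector per test review
query = truncated.predict(np.array(["i hate this movie"]))

# straight-line (Euclidean) distance, every node weighted evenly
distances = np.linalg.norm(vectors - query, axis=1)
print(x_test[np.argmin(distances)])                         # the closest IMDB review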

These were two examples where the test sample seemed remarkably similar to my invented review:

Me: “I hate this movie” to IMDB: “read the book forget the movie”
Me: “in short this movie has the worst dialogue the worst characters and the worst direction” to IMDB: “i <unknown> so much when I saw this film I nearly <unknown> myself awful acting <unknown> effects <unknown> <unknown> <unknown> and <unknown> slow <unknown> fighting…”

One detail that I noticed is that the closest review often had a similar number of words/tokens to my input; that’s true of these two examples as well. It could be that having fewer words means more null (zeroed) nodes, and those shared zeros pull the vectors closer together?

TensorFlow / Ktrain

Once I had one successful run with Keras, I figured I could try again with a TensorFlow-based neural network. I’ve used Ktrain before on the Hindi-BERT project — if you want to work with Transformers more directly and precisely than SimpleTransformers allows, I’d recommend it.
I trained on the LinCE dataset again and hit the expected 54–55% accuracy using the Multilingual BERT pretrained model.
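That flow, sketched out (the text and label variable names are placeholders for the LinCE data; exact hyperparameters are illustrative):

import ktrain
from ktrain import text

t = text.Transformer("bert-base-multilingual-cased", maxlen=128,
                     class_names=["negative", "neutral", "positive"])
trn = t.preprocess_train(train_texts, train_labels)
val = t.preprocess_test(val_texts, val_labels)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=16)
learner.fit_onecycle(5e-5, 3)    # learning rate, epochs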
By running ktrain.get_predictor(learner.model, preproc=t) I was able to extract a TensorFlow model with these layers:

[<transformers.modeling_tf_bert.TFBertMainLayer>,
 <tensorflow.python.keras.layers.core.Dropout>,
 <tensorflow.python.keras.layers.core.Dense>,
 ListWrapper([])]

This looks similar to how things worked in SimpleTransformers: I need to drop the last two layers and see what the BERT layer is saying.
Unfortunately, I got stuck here all over again. The original predictor and predictor.model would accept the test sentences as a list or a TensorFlow BatchDataset, but the BERT layer couldn’t work on its own (it needs the preprocessing), and swapping out predictor.model for a new tf.keras.Sequential led to a lot of errors about indices.
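If I revisit this, one route might be to call the extracted BERT layer with manually tokenized input, skipping the Sequential rebuild entirely. Speculative, since I haven't gotten it running:

from transformers import BertTokenizer

bert_layer = predictor.model.layers[0]     # the TFBertMainLayer from the list above
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("esta película is the worst", return_tensors="tf",
                   padding=True, truncation=True)
sequence_output, pooled_output = bert_layer(inputs)   # pooled_output: (1, 768)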

Reflecting back

Here’s where I’m at — I think with a delicate process, I could eventually put the right pieces in the right order to study the Ktrain TensorFlow model, or make a hacky version of Ktrain, or use Transformers to vectorize sentences and AutoKeras to build a model on top as if it were tabular data (sketched below).
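That last idea might look like this sketch, where the pooled BERT vector stands in for tabular features (the pooling choice and train_texts are assumptions):

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def vectorize(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs[1][0].numpy()       # 768-dim pooled vector

X = np.stack([vectorize(s) for s in train_texts])
# X could now feed autokeras.StructuredDataClassifier as if it were tabular data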

In practice, I was surprised to find the community generally uninterested in model introspection. Resources are sparse, even for these common neural network platforms. NLP adds an extra layer of difficulty, because the preprocessing and model Lego blocks depend more tightly on each other.

There are additional reasons to avoid reading from individual neurons. Jesse Mu’s paper (linked earlier) found common triggers for some neurons in computer vision and text models:

Do interpretable neurons contribute to model accuracy?

Unlike vision, most sentence-level logical descriptions recoverable by our approach are spurious by definition, as they are too simple compared to the true reasoning required for NLI. If a neuron can be accurately summarized by simple deterministic rules, this suggests the neuron is making decisions based on spurious correlations, which is reflected by the lower performance

Next steps: I will probably study Ktrain more, as it’s been helpful for two projects now and is my closest link to the inner workings of TensorFlow 2.x. I should also put together some more comprehensible and repeatable examples of model introspection, as those have been difficult to find.
If you are also analyzing NLP black-box models, please comment below!

Updates

This article was posted in June 2020. For my latest suggestions on model introspection, see https://github.com/mapmeld/use-this-now/blob/main/README.md#model-introspection
