Measuring loss on new GPT tokens

When I created GPT-NYC, one part of the experiment was adding hundreds of tokens to capture foods, neighborhoods, subway stations, and words common on the /r/AskNYC subreddit, including ‘halal’, ‘touristy’, and ‘gentrified’. My theory was that GPT models know about these concepts from sequences of existing tokens, but new tokens would make it easier for a model to pick them up and generate text in a particular New York voice.

I didn’t produce metrics to back up that theory or to compare the approach against fine-tuning the original model. I’d like to do that today, and use the opportunity to try out the new T0pp model.

Before fine-tuning

Measuring token-specific match during generative model fine-tuning

I start with the Causal Language Modeling example.
When iterating on a small Colab GPU, I’d recommend using a smaller model (gpt2, the smallest version on HuggingFace), a smaller training batch size (per_device_train_batch_size), and a smaller dataset until your code pipeline is working. In this case I loaded a small hate speech dataset until I was confident enough to use the CSV from GPT-NYC.

There are some options to get callbacks from Trainer, but none collected inputs, so I ended up creating a subclass, TokenTrackingTrainer, which only spies on / overrides compute_loss. In this initial step I printed the logit values, looking for higher probabilities (true positives), without yet measuring the actual loss, which accounts for both positives and negatives.
We could do some fancy loss function stuff to weight new tokens differently, but now isn’t the time.

I decided to pick one common and pre-existing token at first — ‘ train’ for the NYC questions dataset — and track the logit values for that token. If my fine-tuning code was working, I ought to see the probability of this token increasing from start to finish of training [final loss will need to measure both correct and incorrect application of the token, but this is only a test].
I found convincing evidence that the ‘train’ token was becoming more accurate in minutes (0.28 epochs) and it had significantly improved (though still having some negative scores) by the end of 3 epochs of training.
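To make the idea of tracking one token concrete, here is a minimal sketch in plain Python rather than PyTorch. It is not the code from TokenTrackingTrainer: the four-token vocabulary, the id standing in for ‘ train’, and the logit values are all made up for illustration. It averages the probability assigned to the target token at the positions where that token is the correct next token.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_target_probability(step_logits, labels, target_id):
    """Average probability given to target_id where it is the correct next token.

    step_logits[i] is the logit vector that predicts labels[i] (already shifted).
    Returns None when the target never appears in labels.
    """
    probs = [
        softmax(logits)[target_id]
        for logits, label in zip(step_logits, labels)
        if label == target_id
    ]
    return sum(probs) / len(probs) if probs else None

# Hypothetical setup: id 2 stands in for ' train' in a 4-token vocabulary.
TRAIN_ID = 2
labels = [0, 2, 1, 2]
# Made-up logits from an early and a late checkpoint.
early = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.0]]
late = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0],
        [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]

early_p = mean_target_probability(early, labels, TRAIN_ID)
late_p = mean_target_probability(late, labels, TRAIN_ID)
print(f"early: {early_p:.3f}, late: {late_p:.3f}")
```

As in the text, this only rewards true positives; it says nothing about positions where the token is predicted but wrong.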

Measuring loss within added tokens

I wasn’t sure which loss function I should use here. The HuggingFace docs include an example of a problem using a custom BCEWithLogitsLoss, but then I found the GPT-2 implementation using CrossEntropyLoss and specific processing of the tensors. Once I determined I was calculating loss in the right way, I then started filtering the tokens to only focus on where I expect to see the new tokens.
Shouldn’t I have the old tokens, too, for false positives? Ideally yes… but there’s enough variety within the new NYC-specific tokens that swapping a train station for a food, or ‘Airbnb’ for ‘skunk,’ ought to raise false positive alarms in an unready model.
And remember, the actual training loop is looking at the full input. This loss is a measure only for us to check that new tokens are learned.
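The filtering step can be sketched as follows. This is a plain-Python stand-in for the tensor code, assuming the usual causal language model shift (the output at position t predicts the token at position t + 1) and a mean cross-entropy; the function name focused_loss, the token ids, and the logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one position, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def focused_loss(logits, input_ids, focus_ids):
    """Mean cross-entropy over positions whose next token is in focus_ids.

    logits[t] predicts input_ids[t + 1], so the last logit row and the
    first input id are dropped before comparing.
    Returns None when no focus token appears as a target.
    """
    shifted_logits = logits[:-1]
    shifted_labels = input_ids[1:]
    losses = [
        -log_softmax(row)[label]
        for row, label in zip(shifted_logits, shifted_labels)
        if label in focus_ids
    ]
    if not losses:
        return None
    return sum(losses) / len(losses)

# Hypothetical 5-token vocabulary where ids 3 and 4 are newly added tokens.
NEW_TOKEN_IDS = {3, 4}
input_ids = [0, 3, 1, 4]
logits = [
    [0.0, 0.0, 0.0, 2.0, 0.0],  # predicts input_ids[1] == 3 (new token)
    [0.0, 1.0, 0.0, 0.0, 0.0],  # predicts input_ids[2] == 1 (old token, skipped)
    [0.0, 0.0, 0.0, 0.0, 0.0],  # predicts input_ids[3] == 4 (new token)
    [1.0, 0.0, 0.0, 0.0, 0.0],  # nothing left to predict; dropped by the shift
]
loss = focused_loss(logits, input_ids, NEW_TOKEN_IDS)
print(f"loss on new tokens: {loss:.4f}")
```

Because each loss term is computed over the whole vocabulary, a model that puts its probability on the wrong new token (a station where a food belongs) is still penalized, which is the false-positive signal described above.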

I noticed that there is an initial drop in loss, but improvement stagnates.

loss on new tokens

It doesn’t have the consistent downward trend of loss of the overall model:

the ideal: a gradually decreasing metric. loss from each step in the GPT-2 model

I next tried running and tracking loss on only one token, ‘bagel’. The loss report had a lot of NaN values. I determined that the array flattening in out_logits[focus_tokens][…, :-1, :].contiguous() isn’t appropriate for sequences of only one token. The losses I recorded earlier are still meaningful.
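One plausible reading of those NaN values, offered as an assumption rather than a confirmed diagnosis: the shift drops the last position, so a selection that is only one token long has nothing left to score, and a mean over zero values is undefined. The sketch below reproduces that effect in plain Python with hypothetical token ids and a dummy per-position loss.

```python
import math

def shifted_pairs(token_ids):
    """Pair each position with the token it should predict (causal-LM shift)."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def mean_or_nan(values):
    """Mean that returns NaN for an empty list, mimicking a mean over an empty tensor."""
    return sum(values) / len(values) if values else float("nan")

one_token = [42]           # hypothetical id for a lone 'bagel' token
three_tokens = [7, 42, 9]  # the same token with some context around it

pairs_one = shifted_pairs(one_token)      # nothing survives the shift
pairs_three = shifted_pairs(three_tokens)

# 0.5 is a dummy per-position loss, used only to show the averaging behavior.
loss_one = mean_or_nan([0.5 for _ in pairs_one])
loss_three = mean_or_nan([0.5 for _ in pairs_three])
print(loss_one, loss_three)

# When summarizing, NaN entries can be skipped instead of poisoning the average.
valid = [x for x in (loss_one, loss_three) if not math.isnan(x)]
```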

While debugging this, I also realized that tokens such as ‘bagel’ appear fewer than 100 times in the questions and answers, and neighborhoods Sunnyside and Astoria (often suggested) appear from 30 to over 300 times. The frequency explains why Astoria is so often suggested in generated text by GPT-NYC. This was discouraging, as the new tokens likely cannot be correctly incorporated into the model by training on so few examples. An alternative would be to fine-tune or train from scratch on a very large dataset, but it would be difficult to guarantee enough NYC-specific content about each neighborhood and subway station.
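Counting how often each term appears is a quick check worth running before training. A rough sketch, using three made-up sentences in place of the real questions-and-answers CSV and a simple whole-word match (so ‘bagels’ does not count as ‘bagel’):

```python
import re
from collections import Counter

def count_terms(texts, terms):
    """Count case-insensitive whole-word occurrences of each term across texts."""
    counts = Counter({term: 0 for term in terms})
    for text in texts:
        word_counts = Counter(re.findall(r"[a-z']+", text.lower()))
        for term in terms:
            counts[term] += word_counts[term.lower()]
    return counts

# Made-up examples standing in for rows of the dataset.
sample_texts = [
    "Where can I get a good bagel near Astoria?",
    "Try Sunnyside or Astoria, both have great food.",
    "Best bagel? Bagels are everywhere.",
]
counts = count_terms(sample_texts, ["bagel", "Astoria", "Sunnyside"])
for term, n in counts.most_common():
    print(term, n)
```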

I need to track the loss on the NYC terms as they are tokenized by the original GPT-2 model.
Can I compare the single-token loss to multiple-token loss? I have a shaky understanding of the process, so I tested whether loss is accumulative or multiplicative in some way. I believe that the loss is comparable for sequences with similar likelihood, regardless of sequence length.
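A small arithmetic check of that question, with made-up probabilities. Token probabilities multiply, so their negative logs add: a term predicted as one new token with probability 0.06 has the same summed loss as the same term predicted as two original tokens with probabilities 0.3 and 0.2. A per-token mean (the default reduction in PyTorch's CrossEntropyLoss) divides by the number of tokens, so the comparison holds for summed loss but not for the averaged value.

```python
import math

def sequence_nll(token_probs):
    """Summed negative log-likelihood of a sequence from per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def mean_token_loss(token_probs):
    """Per-token average of the negative log-likelihood."""
    return sequence_nll(token_probs) / len(token_probs)

# Hypothetical probabilities chosen so both sequences have likelihood 0.06.
single = [0.06]      # the term as one added token
multi = [0.3, 0.2]   # the same term split into two original tokens

print(f"summed: {sequence_nll(single):.4f} vs {sequence_nll(multi):.4f}")
print(f"mean:   {mean_token_loss(single):.4f} vs {mean_token_loss(multi):.4f}")
```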
