GPT-NYC Part 3 — what we token

I added ‘NYC words’ to the vocabulary. Was that necessary?

Nick Doiron
3 min read · Aug 21, 2021

While I was building GPT-NYC, one of my focus areas for customization was tokens. Long story short, GPT-2 doesn’t come with individual tokens for bagels, strollers, or the Bronx. A model doesn’t need a dedicated token to know a concept. Here we see GPT-2 [large] suggest the word ‘stroller’, associate bagel with ‘a sandwich’, and recommend pita and vegetarian foods for ‘a halal breakfast’.
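As a rough sketch of how you can check which words get split (this is not the exact GPT-NYC code, and the word list here is only illustrative), the Hugging Face tokenizer makes it easy to see the subword pieces and, if you want, to register new tokens before fine-tuning:

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and the
# public "gpt2" checkpoint. The word list is illustrative, not the exact
# list of NYC words used for GPT-NYC.
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

for word in [" bagel", " stroller", " Bronx", " halal"]:
    ids = tokenizer.encode(word)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    print(word, "->", pieces)  # more than one piece means no dedicated token

# Adding dedicated tokens means the embedding matrix must grow to match.
new_tokens = ["bagel", "stroller", "Bronx", "halal"]
tokenizer.add_tokens(new_tokens)

model = GPT2LMHeadModel.from_pretrained("gpt2")
# New embedding rows start out untrained, which is part of why fine-tuning
# data that actually uses these words matters.
model.resize_token_embeddings(len(tokenizer))
```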

I haven’t seen research on how tokenization affects the bias or usefulness of BERT and GPT models. Even if the potential for bias is slight, I’d like to know more about it. Generative models in particular probably write common European names that are in their vocabulary, such as ‘Alex’ or ‘Molly’, more often than names outside the vocabulary, such as ‘Sunita’ or ‘Ernesto’ (in my Anti-Explanations post, names from the original dataset appeared frequently in responses, while unfamiliar names were replaced by a pronoun).
If comparing the probabilities of names sounds like splitting hairs, remember that in a generative model everything is a probability, and many implementations simply pick the most likely token at each step.
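As a concrete illustration, here is a small sketch of how in-vocabulary and out-of-vocabulary names compete; the prompt and names are only examples, not measurements from GPT-NYC. A single-token name only needs to win the next-token distribution once, while a multi-token name needs each of its pieces to be chosen in turn.

```python
# Sketch, assuming the Hugging Face "gpt2" checkpoint; prompt and names are
# illustrative. Sums the log-probabilities of each name's tokens after a prompt.
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def name_log_prob(prompt: str, name: str) -> float:
    """Sum the log-probabilities of the name's tokens following the prompt."""
    context = tokenizer.encode(prompt)
    name_ids = tokenizer.encode(" " + name)
    total = 0.0
    with torch.no_grad():
        for token_id in name_ids:
            logits = model(torch.tensor([context])).logits[0, -1]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[token_id].item()
            context.append(token_id)
    return total

prompt = "My neighbor's name is"
for name in ["Alex", "Molly", "Sunita", "Ernesto"]:
    n_tokens = len(tokenizer.encode(" " + name))
    print(name, n_tokens, "token(s), log prob", round(name_log_prob(prompt, name), 2))
```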

The Research So Far

The closest paper I could think of was Superbizarre, which studies how the way derivationally complex words are split into subword tokens affects BERT’s interpretation of them.

After some searching, I found a blog post by Gergely Dániel Németh that measures this for names: all but one of the typically-White names in the sample were single tokens, while typically-Black names averaged 2–3 tokens. But the post doesn’t draw a specific connection from tokenization to bias.

Designing an experiment for single vs. multi-token

  • A/B testing: Fine-tune the foundation model as-is, then compare user satisfaction against models fine-tuned with your selected topic tokens added, with additional but rare tokens, and with additional randomly selected tokens, and determine which model end users prefer.
    It would be interesting to test whether a larger system measurably changes after tokenizing ‘halal’, for example giving better results when searching for grocery stores or generating a travel itinerary.
  • Difficulty of incorporating new words into generation: Invent a word that is unknown to the pre-trained model (Jorvalep) and use it to replace names before fine-tuning. Take care that the replacement is not too frequent and is not a 1:1 substitution for a celebrity name (‘actress Jorvalep Kendrick’) or a particular group; a rough sketch of this replacement step follows the list. Then study how often the generative model recommends the new token sequence, and compare to fine-tuning a generative model where the name is a single token.
  • Difficulty of learning new uses: Select many abstract nouns or adjectives that are rarely or never used as names (i.e. less common than ‘Patience’), use them as names, and measure how difficult it is to fine-tune a NER model to learn that these one-token and multi-token sequences are now names.
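For the second experiment, the replacement step might look something like the sketch below. Everything here is illustrative: the invented name, the sampling rate, and the assumption that name spans have already been located (for example with a NER pipeline). Replacing only a random fraction of occurrences keeps the new name rare and avoids making it a stand-in for any single person or group.

```python
# Illustrative sketch of the name-replacement step; "Jorvalep", the rate,
# and the span-finding assumption are all hypothetical choices.
import random

INVENTED_NAME = "Jorvalep"
REPLACE_RATE = 0.05  # keep the invented name infrequent in the corpus

def replace_some_names(text: str, name_spans: list[tuple[int, int]]) -> str:
    """Replace a random subset of (start, end) name spans with the invented name."""
    out = []
    cursor = 0
    for start, end in sorted(name_spans):
        out.append(text[cursor:start])
        if random.random() < REPLACE_RATE:
            out.append(INVENTED_NAME)
        else:
            out.append(text[start:end])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

example = "Maria met Ernesto at the bakery in Queens."
spans = [(0, 5), (10, 17)]  # character offsets of "Maria" and "Ernesto"
print(replace_some_names(example, spans))
```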

Updates?

This article was written in August 2021. If my recommendations change in the future, I’ll update it on this GitHub.
