Scrambling memorized info in GPT-2
Removing phone numbers from generative models
--
What do GPT-2 and GPT-3 know about us?
Since OpenAI’s release of GPT-2 in 2019 and GPT-3 in 2020, users have praised them for generating remarkably human-like text.
In recent research, members of Berkeley AI Research (BAIR) have shown that these text models memorize personal information, and that many memorized sequences, such as phone numbers and copyrighted works, can be teased out of the model weights. This information sits in a grey area: it was found on the public internet, but we don't know whether it was meant to be private, and we don't want it appearing in a generated story or chatbot message.
In February 2021, a company using generative text models on user chats was found to be leaking names and private information. Researchers traced the leak to an unusual model architecture that made memorization more likely, but the incident should serve as a warning to developers to take this risk seriously.
Can’t we remove this private information?
BAIR’s 2020 post concluded that removing this information is a difficult problem to automate at scale. Consider:
- personal/private information takes many forms
- some public information looks like private information (a restaurant phone number, the address of Apple headquarters)
- some personal information is public knowledge (Abraham Lincoln’s birthday)
GPT-2 does not directly encode where it read a sequence or how frequently that sequence appeared in training. The nuclear option of removing all dates, addresses, and numbers would constrain our model too much.
Let’s talk about phone numbers
Phone numbers are the easiest piece of personal information to identify. An app could put a simple filter between the generative model’s output and the internet. We need a more complex process if we’re publishing model weights and embeddings. There are three scenarios to avoid:
- A prompt such as 867-_ completes to a memorized phone number
- A complete phone number returns other information about the owner
- A name (Sarah Connor) returns their personal phone number
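The simple app-level filter mentioned above could be a single regex pass over generated text before it reaches users. A minimal sketch (the pattern and redaction string here are my own choices for illustration, not from the original work, and cover only common North American formats):

```python
import re

# Rough North American phone number pattern: optional parentheses around
# the area code, with dashes, dots, or spaces as separators. A real
# deployment would need broader international coverage.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def redact_phone_numbers(text: str) -> str:
    """Replace anything that looks like a phone number before output leaves the app."""
    return PHONE_RE.sub("[number removed]", text)

print(redact_phone_numbers("Call 202-555-0123 or (415) 555-0199 today."))
```

A filter like this only protects a hosted API; it does nothing for the three scenarios above once the weights themselves are published.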
Here we can look inside the model’s thinking after seeing the prompt “Call 202-”. I used Jay Alammar’s ecco library to show the token breakdown of the generated text, and probabilities of the middle and final digits of the number:
The model determined some sequences are much more likely to appear here.
With that context, here’s my thought process:
- Is it OK to generate a randomized but still valid phone number? Or should we use -555- phone numbers like in movies?
- The vocabulary includes a starter token and a suffix token for most three-digit numbers. If we disable all three-digit suffix tokens, the model would stop after generating "202-" or "(202) 707-". This anonymizes phone numbers, but would cause havoc for all counting numbers over 999.
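Disabling suffix tokens amounts to masking their logits before sampling. Here's a toy sketch of the idea over a mock vocabulary (not GPT-2's real one); GPT-2's tokenizer really does mark word-initial tokens with "Ġ", so bare digit tokens are the suffix tokens that can continue "202-":

```python
import math

# Mock vocabulary: "Ġ" marks a word-initial token; bare "123" is a suffix
# token that can continue a number already in progress.
vocab = ["Ġ202", "-", "123", "456", "Ġcall", "Ġ999", "000"]

def is_three_digit_suffix(token: str) -> bool:
    return len(token) == 3 and token.isdigit()

def masked_probs(logits):
    """Softmax over the vocabulary with three-digit suffix tokens banned."""
    exps = [0.0 if is_three_digit_suffix(t) else math.exp(x)
            for t, x in zip(vocab, logits)]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_probs([1.0, 0.5, 3.0, 2.5, 0.1, 0.2, 2.0])
```

Note that the ban hits every bare three-digit token, which is exactly the "havoc for counting numbers over 999" problem: "1,000" needs the suffix token "000".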
Exploring access points
There are four points where we can mess with GPT-2: the vocabulary, input embeddings, output embeddings, and internal weights.
- By changing the vocabulary and merges.txt, we could remove the ability to recognize some tokens, or swap tokens’ meanings.
- By moving input embeddings, we could give input numbers similar meaning (disconnecting them from following sequences).
- By moving output embeddings, we could make the model unlikely to generate numbers (for example, moving them close to ‘wheelbarrow’).
- By fine-tuning the whole model, we could overwrite previous weights and memorize new sequences. We could add and train a new set of more randomized number tokens.
Bet it all on +000
In this notebook, you can see my process to swap all three-digit number suffix tokens with other number tokens. I then recorded all of the token ids which I had changed, and moved their input and output embeddings to match +000.
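The embedding move can be sketched with plain numpy. This is a miniature stand-in, not the notebook's code: real GPT-2 has tied input/output embeddings of shape (50257, 768), and here I use small random matrices just to show why copying rows equalizes the probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["000", "123", "456", "789", "Ġcall", "-"]
d_model = 8

# Stand-ins for GPT-2's input (wte) and output (lm_head) embedding matrices.
input_emb = rng.normal(size=(len(vocab), d_model))
output_emb = rng.normal(size=(len(vocab), d_model))

target = vocab.index("000")
for i, tok in enumerate(vocab):
    if len(tok) == 3 and tok.isdigit():
        # Every three-digit suffix token now looks identical to "000",
        # both as input context and as an output candidate.
        input_emb[i] = input_emb[target]
        output_emb[i] = output_emb[target]

# With identical output rows, the logits for these tokens are equal,
# so softmax assigns them equal probability.
hidden = rng.normal(size=d_model)
logits = output_emb @ hidden
```

Because the rows are identical, the model can no longer prefer one memorized three-digit continuation over another.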
I use ecco again to view tokens generated to complete a phone number. Here you can see that all of these three-digit number suffix tokens have an equal probability:
Directly prompting phone numbers had odd results — some return text (like “Call 202-9728–Dental”), and some formed a four-digit number from two two-digit number tokens.
In the next round of development, I moved two-digit suffix tokens to +00, and I removed all numbers from merges.txt.
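Removing numbers from merges.txt can be done with a simple text filter over the merge rules. A sketch of the idea (the criterion here, dropping any merge rule containing a digit, is my assumption about the cleanup, not a quote from the notebook):

```python
def strip_number_merges(merges_lines):
    """Drop BPE merge rules that would build multi-digit tokens."""
    kept = []
    for line in merges_lines:
        if line.startswith("#"):
            kept.append(line)  # keep the "#version" header line
        elif any(ch.isdigit() for ch in line):
            continue           # e.g. the rule merging "1 0" into "10" is removed
        else:
            kept.append(line)
    return kept

merges = ["#version: 0.2", "t h", "1 0", "Ġ 2", "th e"]
cleaned = strip_number_merges(merges)
```

Without those merge rules, the tokenizer falls back to emitting digits as smaller pieces instead of assembling memorized multi-digit tokens.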
Luckily non-phone number prompts such as “Congratulations, you won 1,” continue to return valid numbers (“you won 1,000 points!”), now with equal weights for numbers.
Posting the model
This process seems to successfully confound GPT-2's handling of numbers, so I ended up not adding any new tokens or fine-tuning on new data.
It would be interesting to see this process work on a larger-sized GPT model, or in other countries’ phone number formats and numerals.
I published my new model vocabulary and embeddings on huggingface.co/monsoon-nlp/no-phone-gpt2 — it will still generate some phone numbers, but they are randomized, and the model is much less prone to producing memorized phone numbers than the original huggingface.co/gpt2. Let me know if you are able to decode the original tokens or prompt personal information!
Next steps: thoughts on addresses
The next type of information which I’d consider excising from GPT models would be global addresses. My thought is to collect lines of text from CommonCrawl containing addresses, and then fine-tune GPT-2 against completing the text. To avoid focusing on specific addresses, it might be better to predefine groups of tokens (avoiding all numbers, all city names, etc), or even include an address parser (libpostal) in the loss function. This would be more complex and time-consuming than most loss functions, but ought to successfully tune out memorized addresses.