GPT-NYC — Training an NLP model on local knowledge of New York City
When I recently chatted with members of a fiction-generation AI startup, one of my pre-written questions was whether a model could be trained for a specific, non-fiction location. The question was based on my own idea: a model trained on the AskNYC subreddit.
- In a semantic search approach, we would simply return the answers to a textually similar question that has already been asked.
- In a question-answering (QA) approach, this would be framed as a retrieval or knowledge graph problem where we give the model authoritative data and train it to retrieve appropriate and relevant answers.
- In a natural language generation (NLG) approach, which I’ll use in this post, we don’t have one answer or source for queries like ‘what are the most romantic restaurants’, so a GPT or T5 architecture model can be fine-tuned for an NYC use case.
A risk here is that results may sound right but not be factually true.
Challenge Questions
<written prior to training>
- How do I get to Governors Island? / Activities on Governors Island
- Where do you live and what is great about it?
- Is Bushwick good for families?
- How often do you go to Central Park? / Times Square?
- Where can I eat South Indian food such as dosa? / upma?
- Which is worst: the G train, LaGuardia, or JFK airport?
Getting the data
/r/AskNYC was created in August 2012. Locals’ Reddit activity is currently divided between /r/nyc and /r/newyorkcity, with community rules often redirecting questions from tourists and newer residents to /r/AskNYC. This split is more typical of state- or country-level subreddits (/r/HawaiiVisitors, /r/JapanTravel) than of cities, where one subreddit such as /r/chicago usually covers everything.
PushShift.io has monthly Reddit archives from January 2006 to December 2019, covering /r/AskNYC from its creation through to just before Covid-era questions.
I downloaded comments from January 2015 (32 GB of unzipped JSON lines) and filtered with cat RC_2015-01 | grep 't5_2uqch' > asknyc.jsonl, leaving only 5,478 comments.
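If disk space is tight, the same filter can run over the compressed dump directly. Here is a minimal Python sketch, assuming the bzip2-compressed 2015 monthly file (later months switched to xz and zstandard, so the open call changes accordingly):

```python
import bz2
import json

# stream the compressed dump instead of unzipping 32+ GB to disk;
# t5_2uqch is /r/AskNYC's subreddit id
with bz2.open("RC_2015-01.bz2", "rt") as dump, open("asknyc.jsonl", "w") as out:
    for line in dump:
        comment = json.loads(line)
        if comment.get("subreddit_id") == "t5_2uqch":
            out.write(line)
```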
When I downloaded June 2019 (164 GB unzipped), I found 19,824 /r/AskNYC comments for that month. I decided to include additional examples from other months: October 2016, August 2017, and April 2018.
I initially planned to scrape the original questions from Reddit, but found it easier to download and filter them from the PushShift submission archives as well.
Of the 67,195 collected comments, only 13,400 are direct responses to the parent question (in thread-speak: children of the post, not grandchildren).
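Filtering for those direct responses only needs Reddit’s fullname prefixes: a comment whose parent_id starts with t3_ replies to the submission itself, while t1_ marks a reply to another comment. A short sketch:

```python
import json

# keep only comments answering the original question (parent is the
# submission, prefix "t3_"), not replies to other comments ("t1_")
top_level = []
with open("asknyc.jsonl") as f:
    for line in f:
        comment = json.loads(line)
        if comment["parent_id"].startswith("t3_"):
            top_level.append(comment)

print(len(top_level), "direct responses")
```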
Considering Bias and Toxicity
We can anticipate dual issues of bias and toxicity in generated content. We want our NYC travel guide AI to be positive and knowledgeable about the many cultures and experiences of New Yorkers. It’s not enough to be polite if the model is limited to one borough, advises against trying Persian food, or describes religious practices as ‘strange’.
We can use upvote counts and initial moderation to help curate our comment dump. Admittedly /r/AskNYC is a public site with its own problems. Questions range from “Where do people buy matzah in NYC?” to “Why are all of the camera stores in NYC operated by Orthodox Jews?”, and “Homeless people in our basement” to “Any NYC’ers have experience helping the homeless sign up for their stimulus checks?”.
Any question about minority communities could be read as curious or as other-ing. People bring these questions to /r/AskNYC when they don’t know someone in their network who could answer, or when asking could be seen as confrontational, racist, or homophobic. Redditors tend to upvote succinct, personal experiences, and/or politically popular answers to these questions.
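Since upvotes are the main curation signal available, a first-pass filter over the PushShift fields might look like this sketch (the score threshold is an assumption to tune against the dump):

```python
import json

MIN_SCORE = 2  # assumed threshold; tune against the data

def keep(comment):
    # drop moderated or deleted content, then low-scoring replies
    if comment["body"] in ("[deleted]", "[removed]"):
        return False
    return comment.get("score", 0) >= MIN_SCORE

with open("asknyc.jsonl") as f:
    curated = [c for c in map(json.loads, f) if keep(c)]
```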
In Part 2, I plan to dig into several papers on how to better regulate the output of the GPT-NYC model.
NLG (GPT-2) approach
Background
The GEM benchmark combines multiple language generation tasks. Two of these tasks involve text constrained to a particular topic, such as a restaurant. One uses Czech examples, which I cannot evaluate; the other uses structured data entries / coded input to generate sentences limited to a restaurant topic.
Preparing my Dataset
CoLab link
You can download the dataset here.
Finding Additional Tokens
CoLab link
I would like to tokenize streets, neighborhoods, transit stations, people, foods, and events so we don’t drop NYC-specific information during training. I compared the corpus against the tokenizers to figure out which common tokens were missing from standard GPT-2 and T5.
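A minimal sketch of that comparison: count how often a corpus word splits into multiple subword pieces (the leading space matters for GPT-2’s byte-pair encoding), assuming a hypothetical asknyc.txt dump of the comment text:

```python
from collections import Counter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
counts = Counter()

with open("asknyc.txt") as f:
    for line in f:
        for raw in line.split():
            word = raw.strip(".,!?()[]\"'")
            # a word without a dedicated token splits into 2+ pieces
            if len(word) > 3 and len(tokenizer.tokenize(" " + word)) > 1:
                counts[word.lower()] += 1

for word, n in counts.most_common(200):
    print(word, n)
```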
The top 200+ missing tokens include:
touristy, gentrified, Midtown, nyc, blizzard, blotter, Bushwick, Adjudication, bellboy, oncoming, dumpling, bagel, deli, halal, stroller
along with many other places and street numbers.
This doesn’t mean that GPT-2 is blind to all of these words: if every occurrence of stroller has been baked into GPT’s stroll and +er tokens, then it might still be learned (try prompting it, though…). Missing ‘bagel’, ‘burrito’, ‘dumpling’, and ‘halal’ is going to under-serve a lot of New Yorkers, Americans, and people globally.
I decided to add this list, plus subway stations, street numbers from 1st to 220th, neighborhoods, and the foods from The world’s 50 best foods and 99 things to eat that will blow your mind (not NYC-specific, but the best lists I could find with both dosa and pho).
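Registering the new vocabulary is the standard two-step in HuggingFace: add the tokens, then resize the model’s embedding matrix. An abbreviated sketch:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# abbreviated; the full list covers subway stations, street numbers,
# neighborhoods, and foods
new_tokens = ["touristy", "gentrified", "Bushwick", "bagel", "halal", "dosa", "pho"]
tokenizer.add_tokens(new_tokens)

# rows added to the embedding matrix start out randomly initialized
model.resize_token_embeddings(len(tokenizer))
```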
When I first added the vocabulary, these new tokens were selected nonsensically, so we need to continue training:
[screenshot: sample generations with the untrained new tokens]
Fine-tuning GPT2
I started by fine-tuning GPT2 [small] on Google CoLab.
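A minimal sketch of that setup with the HuggingFace Trainer, assuming the question/answer pairs were flattened into a hypothetical asknyc_train.txt file (hyperparameters here are illustrative):

```python
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# in practice, reuse the tokenizer and model with the added NYC tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = TextDataset(tokenizer=tokenizer, file_path="asknyc_train.txt",
                      block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-nyc",
                           num_train_epochs=5,
                           per_device_train_batch_size=4),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("gpt2-nyc")
```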
I was pleasantly surprised with the results, which were generally coherent and sometimes funny, even if short on facts.
[screenshots of GPT-2 small’s sample answers]
I was motivated to continue my work on a larger model.
I also redesigned the prompts to the format: “question — additional info %% comment”. This avoids a failure mode where the model simply continues writing the question and additional info instead of answering.
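A helper for assembling the training strings (field values here are hypothetical; the separator characters must match exactly between training and inference):

```python
def format_example(question: str, info: str, comment: str) -> str:
    # "%%" marks where the prompt ends, so at inference time everything
    # generated after "%%" is read as the model's answer
    return f"{question} — {info} %% {comment}"

print(format_example("Is Bushwick good for families?",
                     "moving from Queens with two kids",
                     "It depends on the block..."))
```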
The GPT2-Medium model was too big to fine-tune on the CoLab GPU, so I needed to seek out an A100 GPU with >16 GB of on-device memory. It takes about 2 hours to train for 5 epochs. There were shortages of this machine type in the US, so I ended up running this command in an EU zone:
gcloud compute instances create Sample \
  --project Project \
  --zone europe-west4-a \
  --machine-type a2-highgpu-1g \
  --image-family pytorch-1-6-cu110 \
  --image-project deeplearning-platform-release \
  --boot-disk-size 100GB \
  --metadata "install-nvidia-driver=True,proxy-mode=project_editors" \
  --scopes https://www.googleapis.com/auth/cloud-platform \
  --maintenance-policy TERMINATE
Results
- How do I get to Governors Island?
[screenshot of GPT-NYC’s answers]
- Where do you live and what is great about it?
[screenshot of GPT-NYC’s answers]
- Is Bushwick good for families / parties?
[screenshot of GPT-NYC’s answers]
- How often do you go to Central Park / Times Square?
[screenshot of GPT-NYC’s answers]
- Where can I eat South Indian food such as dosa?
[screenshot of GPT-NYC’s answer]
This is actually accurate!
- Which is worst: the G train, LaGuardia, or JFK airport?
[screenshot of GPT-NYC’s answers]
I’m happy with these answers, though they seem to suffer from memorization and over-interest in the new vocabulary. A more rigorous probe might reveal that instead of answering the original question, the model falls back on a handful of responses: riffing on apartments, riffing on food, dismissing the question (such as ‘no’ or ‘anywhere’), or repeating memorized URLs and answers.
Demo
You can try it yourself with the HuggingFace hosted inference API widget.
Stick to the formula: question — info %%
https://huggingface.co/monsoon-nlp/gpt-nyc?text=What+food+should+I+try+in+the+Bronx%3F+%25%25
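If you would rather query it from Python, a minimal sketch with the transformers pipeline (sampling settings are my assumptions):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="monsoon-nlp/gpt-nyc")

prompt = "What food should I try in the Bronx? %% "
result = generator(prompt, max_length=60, do_sample=True, top_p=0.95)
print(result[0]["generated_text"])
```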