ML Arxiv Haul #3

Nick Doiron
Mar 4, 2022

From my Google Drive and open browser tabs, another round of papers to skim through. Thanks to conference schedules, most of these were just released in the first two months of 2022.

This is a survey paper trying to establish some grounding for prompt engineering. Instead of fine-tuning a language model on a large dataset, this approach steers a generative model with a carefully worded prompt, either by giving it a few examples (few-shot learning) or by tweaking the output (adding ‘on ArtStation’ to visual prompts for a stylized illustration). There’s now a hub for these papers at http://pretrain.nlpedia.ai
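To make the two flavors of prompting concrete, here is a minimal sketch. The function names and the "Input:/Output:" layout are my own assumptions for illustration, not anything prescribed by the survey.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: optional instruction, worked examples, then the open query."""
    parts = [instruction] if instruction else []
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # Leave the final "Output:" empty so the model completes it.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def stylize(prompt, modifier="on ArtStation"):
    """Append a style modifier to a visual prompt."""
    return f"{prompt}, {modifier}"


if __name__ == "__main__":
    examples = [("great movie", "positive"), ("boring plot", "negative")]
    print(build_few_shot_prompt(examples, "loved it", "Classify the sentiment."))
    print(stylize("a castle at dusk"))
```

The resulting string is what you would hand to whichever generative model you are using; no weights change anywhere.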

This is Meta’s computer vision model, built on the recipe that NLP uses: unsupervised learning to produce a pre-trained model. It includes a discussion of fairness in images (particularly gender, race, and age).
I was trying to work out whether this model was based on ResNet or vision transformers, and it seems to be something else?

This is an older challenge which I’m only researching now. It’s really great that there’s a computer vision dataset which goes beyond what you might see on the steps of a building at Oxford.

Sharpened Cosine Similarity

https://e2eml.school/scs.html

A similarity measure pitched as an alternative to the dot product inside convolutional layers. This was originally proposed in early 2020 and then resurfaced in 2022. Research into applying it to larger and larger problems is progressing in a bizarrely decentralized way (Twitter threads and the above link are the best resources). It’s very cool.
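Here is a minimal sketch of the measure for a single signal patch and kernel, assuming the formulation I recall from the linked page: sign(s·k) · (|s·k| / ((‖s‖ + q)(‖k‖ + q)))^p. The defaults for `p` and `q` are placeholders of mine; in the write-ups I have seen they are treated as learnable parameters.

```python
import math


def sharpened_cosine_similarity(signal, kernel, p=2.0, q=1e-3):
    """Cosine similarity raised to the power p, keeping the sign of the dot product.

    q is a small constant added to each norm so near-zero patches do not blow up.
    """
    dot = sum(s * k for s, k in zip(signal, kernel))
    norm_s = math.sqrt(sum(s * s for s in signal))
    norm_k = math.sqrt(sum(k * k for k in kernel))
    cosine_magnitude = abs(dot) / ((norm_s + q) * (norm_k + q))
    # Raising to p > 1 pushes weak matches toward zero ("sharpening").
    return math.copysign(cosine_magnitude ** p, dot)


if __name__ == "__main__":
    print(sharpened_cosine_similarity([1.0, 0.0], [1.0, 1.0], p=1.0))
    print(sharpened_cosine_similarity([1.0, 0.0], [1.0, 1.0], p=2.0))
```

A real layer would slide this over image patches the way a convolution does; this only shows the per-patch arithmetic.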

Fine-tuning while freezing most of the parameters and updating only the bias terms (in this context, not a human bias, but the parameters known as the bias). The results are comparable to full fine-tuning.
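As a toy illustration of the idea (a one-neuron model of my own invention, nothing like the paper's transformer setup), here is gradient descent where the weight is frozen and only the bias receives updates:

```python
def fine_tune_bias_only(weight, bias, data, lr=0.1, steps=200):
    """Minimise mean squared error of y = weight * x + bias, updating only the bias."""
    for _ in range(steps):
        # d(MSE)/d(bias) = 2 * mean(prediction - target)
        grad_bias = 2 * sum((weight * x + bias) - y for x, y in data) / len(data)
        bias -= lr * grad_bias  # the weight is frozen: it never receives an update
    return weight, bias


if __name__ == "__main__":
    # Hypothetical "new task": same slope as the pre-trained model, shifted by 3.
    data = [(x, 2.0 * x + 3.0) for x in (0.0, 1.0, 2.0, 3.0)]
    print(fine_tune_bias_only(weight=2.0, bias=0.0, data=data))
```

In a real framework the equivalent move is to mark every non-bias parameter as not trainable before running the usual training loop.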

This reminds me of a popular 2019 paper about neural networks having some kind of structural power and intelligence even when randomly created. Reading into this, it looks like they’re doing something with more intention (plotting the inputs on some kind of nearest-network feature space, then making that into a network somehow) but I may be missing it.

Differential privacy could be bad news on fairness metrics. One criticism I have is that I can’t figure out which tasks/metrics we are measuring against. On a re-read there are two tasks referenced, but I don’t know if the data is available for download.

All about evaluating generative models. I note that there’s a section claiming that automated methods are closer to experts than human crowd-workers are. This cuts against the tuning via Reinforcement Learning (often called alignment, because what are words) that led to InstructGPT.
The conclusions include a whole diagram of how you could lay out a report. At first I thought this was going to be another page saying ‘make a model card’ but it looks great.
There was a reference to a paper on AAVE in NLP, but it’s only a small part of the paper.

Unsupervised data generation to augment the data going into the actual training model. These frameworks are always a little peculiar because if GPT-3 is so good at replicating traits of the given task, wouldn’t GPT-3 be better for the actual task? Only makes sense if you’re going to deploy a smaller, distilled model for serving.

I haven’t been keeping up-to-date on distilled language models. This is a super-recent survey of distillation, quantization, pruning, etc. Apparently there is not a great deal of research into combining these methods into one mega-compressor technique. Probably, like in training, there is no perfect recipe to use on every task. There’s a small section of the paper on worsened performance on explainability and robustness.

Experiments on memorized text sequences in GPT-2. De-duplication sounds smart, though difficult to do over a huge corpus like the whole internet.
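For a sense of the easy version of the problem, here is a sketch of exact-match de-duplication by hashing normalized text. It is my own illustration, not the paper's method, and it will not catch near-duplicates or repeated passages inside longer documents, which is where the real difficulty lives at internet scale.

```python
import hashlib


def deduplicate(documents):
    """Drop exact duplicates after normalising case and whitespace; keep first occurrences."""
    seen = set()
    unique = []
    for doc in documents:
        normalised = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    print(deduplicate(["Hello  world", "hello world", "Something else"]))
```

Storing only digests keeps memory proportional to the number of unique documents rather than their total length.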

An improvement in ‘biologically-plausible’ ML… Essentially, even though neural networks are compared to our brains’ neurons, it’s mistaken to think that torch.nn works anything like a brain (with steps, epochs, activation functions, backprop, weights and optimizers and embeddings and dropout and other math…). The last thing that I saw about this was spiking neural networks but this is a different direction.
I wanted to highlight this because of some Tweets which I read in the AI Ethics space which emphasized that ‘neural networks = brains’ is not only misleading, but could instill dangerous levels of confidence.

Lots going on in this paper — there’s a whole history lesson about how Black jurors and women were allowed onto juries. For the technical portion, multiple ‘juries’ are made out of labeler functions and their predictions are used for votes. This emphasis on labeler functions sounds a lot like Snorkel (also from Stanford) but it isn’t mentioned here for whatever reason.
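Going only by the description above, the voting step might look something like this sketch. The labeler functions are hypothetical stand-ins of mine, and how the paper actually composes its juries is more involved than a plain majority vote.

```python
from collections import Counter


def jury_verdict(labelers, example):
    """Return the majority label among labelers; None means a labeler abstains."""
    votes = [labeler(example) for labeler in labelers]
    votes = [vote for vote in votes if vote is not None]
    if not votes:
        return None
    # On a tie, Counter keeps first-seen order, so the earliest labeler's vote wins.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical labeler functions for a toy sentiment task.
def says_great(text):
    return "positive" if "great" in text.lower() else None


def says_terrible(text):
    return "negative" if "terrible" in text.lower() else None


def ends_with_exclamation(text):
    return "positive" if text.endswith("!") else None


if __name__ == "__main__":
    jury = [says_great, says_terrible, ends_with_exclamation]
    print(jury_verdict(jury, "Great service!"))
```

Swapping which labelers sit on the jury changes the verdict, which is the knob the paper is interested in.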

There’s a growing movement of embeddings-as-a-service. If you’re familiar with word2vec mapping words by similarity, this is a step up where a search should map to a similar phrase, webpage, or article. Even if there is no word-for-word similarity, like in a traditional search engine, you should still find the best article for your query.
This paper isn’t exactly like that… instead of the embeddings being searched by a script, it’s a seq2seq model which aims to transform the query into the right docid / link.
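For contrast, the script-searches-embeddings approach that this paper moves away from can be sketched in a few lines. The vectors and doc ids below are made up; a real system would get them from an embedding model and use an approximate nearest-neighbor index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_doc(query_vector, index):
    """Return the doc id whose embedding is most similar to the query embedding."""
    return max(index, key=lambda doc_id: cosine_similarity(query_vector, index[doc_id]))


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings keyed by doc id.
    index = {"doc-cats": [0.9, 0.1, 0.0], "doc-tax": [0.0, 0.2, 0.9]}
    print(nearest_doc([0.8, 0.3, 0.1], index))
```

The seq2seq approach replaces the `index` lookup entirely: the model's output is the docid itself.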

This got some traction on ML Twitter. It’s long been known that large language models are bad at math, and it’s frustrating because the models are supposed to be smart, right? Is it due to the tokenization of numbers? Is it because language models memorize stuff? Is it because they are too fuzzy? There was another GPT math post around the same time, where it got a lot wrong yet tended to get the first and last digits right.
In this paper it’s shown that GPT-J is more likely to get the math right when the number-tokens involved appear frequently in the training data. Basically if the answer is ‘8,320’ but it’s not a popular number then maybe GPT is going to fuzz it.

Instead of pre-training or fine-tuning, here’s reprogramming. But what is it?
A frozen pre-trained model is sandwiched in between an ‘input transformation layer’ and an ‘output mapping layer’. This is similar to the AdapterHub idea of applying swappable layers for different tasks. There seems to be some value to the initial structure of language model neural nets. The paper discusses reprogramming a model for new domains, so I wonder if a language model could be reprogrammed for low resource languages?
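The sandwich structure can be sketched with plain functions. Everything here is a hand-written toy of mine: in the paper the input transformation and output mapping are learned, and the frozen model is a real pre-trained network rather than a keyword check.

```python
def frozen_model(tokens):
    """Stand-in for a pre-trained model; it is never modified."""
    positive_words = {"good", "great"}
    return "POS" if any(token in positive_words for token in tokens) else "NEG"


def reprogram(model, input_map, output_map):
    """Wrap a frozen model between an input transformation and an output mapping."""
    def wrapped(tokens):
        # Input transformation layer: translate new-domain tokens into the model's vocabulary.
        mapped_tokens = [input_map.get(token, token) for token in tokens]
        # Output mapping layer: translate the model's labels into the new task's labels.
        return output_map[model(mapped_tokens)]
    return wrapped


if __name__ == "__main__":
    # Hypothetical new task: Spanish reviews, recommend/skip labels.
    recommender = reprogram(
        frozen_model,
        input_map={"bueno": "good", "genial": "great"},
        output_map={"POS": "recommend", "NEG": "skip"},
    )
    print(recommender(["muy", "bueno"]))
```

Only `input_map` and `output_map` differ between tasks, which is why the comparison to swappable adapter layers comes to mind.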

Generating game levels via reinforcement learning. This is interesting to me because of my long-term project idea of generating leveled language lessons.

Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages

http://wietsedv.nl/files/devries_acl2022.pdf

Study of Universal Dependencies data / part-of-speech analysis. I’m not sure how much part-of-speech tagging is used these days, because transformer models just chomp on words without specifically building a grammatical / dependency model.

Advantages of Artificial Intelligences, Uploads, and Digital Minds

https://philpapers.org/archive/SOTAOA.pdf

Futurist view in favor of uploading minds.
