ML Arxiv Haul #3

7 min read · Mar 4, 2022

From my Google Drive and open browser tabs, another round of papers to skim through. Thanks to conference schedules, most of these were just released in the first two months of 2022.

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language…

This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub…

arxiv.org

This is a survey paper trying to establish some grounding for prompt engineering. Instead of fine-tuning a language model on a large dataset, this approach steers a generative model with a carefully worded prompt, either by giving a few examples (few-shot learning) or by tweaking outputs (adding ‘on ArtStation’ to visual prompts for a stylized illustration). There’s now a hub for these papers at http://pretrain.nlpedia.ai
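To make the two prompt styles concrete, here is a minimal sketch in plain Python. The function names, the "Review / Sentiment" template, and the sample texts are my own assumptions for illustration; the survey covers many other templates.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format labeled examples plus one unlabeled input as a single prompt."""
    blocks = []
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}\n")
    # The last block leaves the label blank for the model to complete.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)


def add_style_modifier(prompt: str, modifier: str = "on ArtStation") -> str:
    """Append a style phrase to an image-generation prompt."""
    return f"{prompt}, {modifier}"


# Hypothetical examples, not taken from the paper.
examples = [
    ("Loved every minute.", "positive"),
    ("A total waste of time.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Surprisingly good.")
print(prompt)
print(add_style_modifier("a castle at dusk"))
```

The resulting string would be sent to a generative model as-is; no model weights change in either case.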

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

Discriminative self-supervised learning allows training models on any random group of internet images, and possibly…

arxiv.org

This is Meta’s computer vision model, which borrows NLP’s approach of unsupervised learning followed by a reusable pre-trained model. It includes a discussion of fairness in images (particularly gender, race, and age).
I was checking whether this model is based on ResNet or vision transformers, and it seems to be something else?

Visually Grounded Reasoning across Languages and Cultures

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration…

arxiv.org

This is an older challenge which I was just researching now — really great that there’s a computer vision dataset which goes beyond what you might see on the steps of a building at Oxford.

Sharpened Cosine Similarity

https://e2eml.school/scs.html

A similarity measure that replaces the sliding dot product in a convolution layer, so a kernel responds to how well a patch matches its pattern rather than to how large the patch values are. This was originally proposed in early 2020 and then reappeared in 2022. Research into applying it to larger and larger problems is progressing in a bizarrely decentralized way (Twitter threads and the above link are the best resources). It’s very cool.
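Here is a rough sketch of the formula as I understand it from the linked page: cosine similarity, with a small constant q added to each norm and the result raised to a power p while keeping its sign. The default values for p and q are my own placeholders; in the versions I have seen, both can be learned.

```python
import math
from typing import Sequence


def sharpened_cosine_similarity(
    s: Sequence[float], k: Sequence[float], p: float = 2.0, q: float = 1e-3
) -> float:
    """sign(s.k) * (|s.k| / ((||s|| + q) * (||k|| + q))) ** p"""
    dot = sum(a * b for a, b in zip(s, k))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_k = math.sqrt(sum(b * b for b in k))
    # q keeps the denominator away from zero for near-empty patches.
    cosine = abs(dot) / ((norm_s + q) * (norm_k + q))
    # Raising to p pushes partial matches toward zero; copysign restores the sign.
    return math.copysign(cosine ** p, dot)


patch = [1.0, 2.0, 3.0]
kernel = [2.0, 4.0, 6.0]  # same direction, twice the magnitude
print(sharpened_cosine_similarity(patch, kernel))
print(sharpened_cosine_similarity([1.0, 0.0], [1.0, 1.0], p=1.0))
print(sharpened_cosine_similarity([1.0, 0.0], [1.0, 1.0], p=4.0))
```

A larger p makes the response more selective: a perfect match stays near 1 while a partial match shrinks.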

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of…

arxiv.org

Fine-tuning, but with most of the parameters frozen and only the bias parameters updated (in this context, not a human bias, but the parameters known as the bias). The results are comparable to full fine-tuning.
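A toy illustration of the idea, assuming a one-weight linear model rather than a transformer: the ‘pretrained’ weight stays frozen and gradient descent touches only the bias. The model, data, and hyperparameters are made up for this sketch and are not from the paper.

```python
from typing import Dict, List, Tuple


def fine_tune_bias_only(
    params: Dict[str, float],
    data: List[Tuple[float, float]],
    lr: float = 0.1,
    steps: int = 200,
) -> Dict[str, float]:
    """Minimize mean squared error of y = weight * x + bias, updating only bias."""
    weight, bias = params["weight"], params["bias"]
    for _ in range(steps):
        # Gradient of mean((weight*x + bias - y)^2) with respect to bias.
        grad_bias = sum(2 * (weight * x + bias - y) for x, y in data) / len(data)
        bias -= lr * grad_bias
        # The weight gets no update: it is frozen at its pretrained value.
    return {"weight": weight, "bias": bias}


pretrained = {"weight": 2.0, "bias": 0.0}
# New task: same slope as before, but shifted up by 3.
task_data = [(x, 2.0 * x + 3.0) for x in (0.0, 1.0, 2.0, 3.0)]
tuned = fine_tune_bias_only(pretrained, task_data)
print(tuned)
```

In a real framework the equivalent step would be marking every non-bias parameter as non-trainable before running the usual fine-tuning loop.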

Learning from Randomly Initialized Neural Network Features

We present the surprising result that randomly initialized neural networks are good feature extractors in expectation…

arxiv.org

This reminds me of a popular 2019 paper about neural networks having some kind of structural power and intelligence even when randomly created. Reading into this, it looks like they’re doing something with more intention (plotting the inputs on some kind of nearest-network feature space, then making that into a network somehow) but I may be missing it.

Differential Privacy and Fairness in Decisions and Learning Tasks: A Survey

This paper surveys recent work in the intersection of differential privacy (DP) and fairness. It reviews the conditions…

arxiv.org

Differential privacy could be bad news on fairness metrics. One criticism I have is that I can’t figure out which tasks/metrics we are measuring against. On a re-read there are two tasks referenced, but I don’t know if the data is available for download.

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are…

arxiv.org

All about evaluating generative models. I note that there’s a section arguing that automated methods come closer to expert judgments than human crowd-workers do. This goes against the tuning via Reinforcement Learning (often called alignment, because what are words) that’s led to InstructGPT.
The conclusions include a whole diagram of how you could lay out a report. At first I thought this was going to be another page saying ‘make a model card’ but it looks great.
There was a reference to a paper on AAVE in NLP, but it’s only a small part of the paper.

Towards Zero-Label Language Learning

This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used…

arxiv.org

Unsupervised data generation to augment the data going into the actual training model. These frameworks are always a little peculiar because if GPT-3 is so good at replicating traits of the given task, wouldn’t GPT-3 be better for the actual task? Only makes sense if you’re going to deploy a smaller, distilled model for serving.

A Survey on Model Compression for Natural Language Processing

With recent developments in new architectures like Transformer and pretraining techniques, significant progress has…

arxiv.org

I haven’t been keeping up-to-date on distilled language models. This is a super-recent survey of distillation, quantization, pruning, etc. Apparently there is not a great deal of research into combining these methods into one mega-compressor technique. Probably, like in training, there is no perfect recipe to use on every task. There’s a small section of the paper on worsened performance on explainability and robustness.

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences…

arxiv.org

Experiments on memorized text sequences in GPT-2. De-duplication sounds smart, though difficult to do over a huge corpus like the whole internet.
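For a sense of the simplest version, here is a sketch of exact-match de-duplication after light normalization. This is my own simplification and not necessarily the method used in the paper; near-duplicates and repeated substrings inside longer documents would need heavier tooling.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in docs:
        # Storing a fixed-size digest instead of the text keeps memory per
        # document constant, which matters on a large corpus.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = [
    "Call me at 555-0100.",
    "call me  at 555-0100.",
    "Something else entirely.",
]
print(deduplicate(corpus))
```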

Towards Scaling Difference Target Propagation by Learning Backprop Targets

The development of biologically-plausible learning algorithms is important for understanding learning in the brain, but…

arxiv.org

An improvement in ‘biologically-plausible’ ML… Essentially, even though neural networks are compared to our brains’ neurons, it’s mistaken to think that torch.nn works anything like a brain (with steps, epochs, activation functions, backprop, weights and optimizers and embeddings and dropout and other math…). The last thing that I saw about this was spiking neural networks but this is a different direction.
I wanted to highlight this because of some Tweets which I read in the AI Ethics space which emphasized that ‘neural networks = brains’ is not only misleading, but could instill dangerous levels of confidence.

Jury Learning: Integrating Dissenting Voices into Machine Learning Models

Whose labels should a machine learning (ML) algorithm learn to emulate? For ML tasks ranging from online comment…

arxiv.org

Lots going on in this paper — there’s a whole history lesson about how Black jurors and women were allowed onto juries. For the technical portion, multiple ‘juries’ are made out of labeler functions and their predictions are used for votes. This emphasis on labeler functions sounds a lot like Snorkel (also from Stanford) but it isn’t mentioned here for whatever reason.
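A loose sketch of the voting step as I read it, assuming labelers are plain functions grouped into a pool and a jury is defined by how many members to draw from each group. The group names, labelers, and majority-vote rule are all hypothetical stand-ins, not the paper's actual model.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

Labeler = Callable[[str], str]


def sample_jury(
    pool: Dict[str, List[Labeler]], composition: Dict[str, int], rng: random.Random
) -> List[Labeler]:
    """Draw the requested number of labelers from each group."""
    jury: List[Labeler] = []
    for group, count in composition.items():
        jury.extend(rng.sample(pool[group], count))
    return jury


def jury_verdict(jury: List[Labeler], comment: str) -> str:
    """Majority vote over the jurors' labels."""
    votes = Counter(labeler(comment) for labeler in jury)
    return votes.most_common(1)[0][0]


def strict(comment: str) -> str:
    return "toxic" if "idiot" in comment.lower() else "ok"


def lenient(comment: str) -> str:
    return "ok"


pool = {"group_a": [strict, strict, strict], "group_b": [lenient, lenient]}
rng = random.Random(0)
mostly_a = sample_jury(pool, {"group_a": 2, "group_b": 1}, rng)
mostly_b = sample_jury(pool, {"group_a": 1, "group_b": 2}, rng)
print(jury_verdict(mostly_a, "You idiot"))
print(jury_verdict(mostly_b, "You idiot"))
```

The point of the toy is that the same comment can get a different verdict depending on who sits on the jury.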

Transformer Memory as a Differentiable Search Index

In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all…

arxiv.org

There’s a growing movement of embeddings-as-a-service. If you’re familiar with word2vec mapping words by similarity, this is a step up where a search should map to a similar phrase, webpage, or article. Even if there is no word-for-word similarity, like in a traditional search engine, you should still find the best article for your query.
This paper isn’t exactly like that… instead of the embeddings being searched by a script, it’s a seq2seq model which aims to transform the query into the right docid / link.
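For contrast, this is roughly what the script-based embedding search looks like: rank documents by cosine similarity between a query vector and stored document vectors. The three-dimensional vectors and doc ids below are hand-made placeholders; a real system would get them from an embedding model. This is the baseline idea, not the paper's seq2seq approach.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query_vec: Sequence[float], index: Dict[str, Sequence[float]], top_k: int = 1
) -> List[str]:
    """Return the ids of the top_k documents most similar to the query."""
    ranked = sorted(
        index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical document embeddings.
index = {
    "doc-cooking": [0.9, 0.1, 0.0],
    "doc-travel": [0.1, 0.9, 0.2],
    "doc-finance": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]
print(search(query, index, top_k=2))
```

In the paper's setup there is no separate index to scan like this; the model itself emits the docid.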

Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few…

arxiv.org

This got some traction on ML Twitter. It’s long been known that large language models are bad at math, and it’s frustrating because the models are supposed to be smart, right? Is it due to tokenization of numbers, because language models memorize stuff, because they are too fuzzy, etc.? There was another GPT math post around that time, where it got a lot wrong yet tended to get the first and last digits right.
In this paper it’s shown that GPT-J is more likely to get the math right depending on how frequently those number-tokens appear in the training data. Basically if the answer is ‘8,320’ but it’s not a popular number then maybe GPT is going to fuzz it.

Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning

In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance…

arxiv.org

Instead of pre-training or fine-tuning, here’s reprogramming. But what is it?
A frozen pre-trained model is sandwiched in between an ‘input transformation layer’ and an ‘output mapping layer’. This is similar to the AdapterHub idea of applying swappable layers for different tasks. There seems to be some value to the initial structure of language model neural nets. The paper discusses reprogramming a model for new domains, so I wonder if a language model could be reprogrammed for low resource languages?
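The sandwich structure can be sketched as plain function composition. Everything here is a toy assumption: the ‘frozen model’ is a stand-in brightness classifier, and the input transformation is hand-set where a real one would have trained parameters.

```python
from typing import Callable, Dict, List


def frozen_model(pixels: List[float]) -> str:
    """Stand-in for a pre-trained source model; it is never modified."""
    return "bright" if sum(pixels) / len(pixels) >= 0.5 else "dark"


def make_reprogrammed(
    model: Callable[[List[float]], str],
    input_transform: Callable[[float], List[float]],
    output_mapping: Dict[str, str],
) -> Callable[[float], str]:
    """Wrap a frozen model between an input transform and an output mapping."""
    def predict(x: float) -> str:
        return output_mapping[model(input_transform(x))]
    return predict


def temperature_to_pixels(celsius: float) -> List[float]:
    # Map -10..40 C onto 0..1 and clip; these constants are hand-picked here,
    # whereas a real input transformation would be learned.
    scaled = min(max((celsius + 10.0) / 50.0, 0.0), 1.0)
    return [scaled] * 4


classify_temperature = make_reprogrammed(
    frozen_model, temperature_to_pixels, {"dark": "cold", "bright": "hot"}
)
print(classify_temperature(30.0))
print(classify_temperature(0.0))
```

Only the two outer pieces would change for a new domain; the function in the middle is reused untouched.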

Path of Destruction: Learning an Iterative Level Generator Using a Small Dataset

We propose a new procedural content generation method which learns iterative level generators from a dataset of…

arxiv.org

Generating game levels via reinforcement learning — this is interesting to me from my long-term project idea of generating leveled language lessons.

Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages

http://wietsedv.nl/files/devries_acl2022.pdf

Study of Universal Dependencies data / part-of-speech analysis. I’m not sure how much part-of-speech tagging is used these days, because transformer models just chomp on words without specifically building a grammatical / dependency model.

Advantages of Artificial Intelligences, Uploads, and Digital Minds

https://philpapers.org/archive/SOTAOA.pdf

Futurist view in favor of uploading minds.


Written by Nick Doiron