ML Arxiv Haul #11

8 min read · Nov 9, 2022

Adversarial Policies Beat Professional-Level Go AIs

We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a…

arxiv.org

Researchers break the current dominance of Go-playing AIs. Rather oddly, “the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary”. The strategy works by holding a high-scoring corner of territory, unoccupied by KataGo, at the moment when KataGo is tricked into ending the game. Human players do not follow the script and usually beat the adversary.

Broken Neural Scaling Laws

We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of…

arxiv.org

A controversial paper which claims to describe the scaling behavior of several ML tasks. The authors focus on two assertions:

‘inverse scaling’ tasks (as studied in the Inverse Scaling Prize) are really showing ‘U-shaped scaling’, and would perform better on appropriately large models
the scaling of models generally can be represented with a series of power law functions; this makes it possible to fit the scaling function from a few points around each break point

The controversy comes from whether this is oh-shit groundbreaking, something already understood (there are a few replies showing older papers with U-shaped scaling), or a tautology (what curve couldn’t be described with power laws… the challenge lies in finding those points).
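To make "a series of power laws" concrete, here is a rough sketch of a smoothly broken power law. The functional form follows my reading of the paper, and the parameter names and values (`a`, `b`, `c0`, `breaks`) are illustrative assumptions, not the authors' fitted numbers. Each break changes the slope of the curve on a log-log plot.

```python
import math

def broken_power_law(x, a, b, c0, breaks):
    """Offset a plus a power law b * x**-c0 whose slope changes at each break.

    breaks is a list of (d, c, f) tuples: break location d,
    change in log-log slope c, and smoothness f (smaller = sharper).
    """
    y = b * x ** (-c0)
    for d, c, f in breaks:
        y *= (1.0 + (x / d) ** (1.0 / f)) ** (-c * f)
    return a + y

def loglog_slope(fn, x, factor=10.0):
    """Average slope of fn on a log-log plot between x and x * factor."""
    return (math.log(fn(x * factor)) - math.log(fn(x))) / math.log(factor)

# Hypothetical curve with one break at x = 1000.
def curve(x):
    return broken_power_law(x, a=0.0, b=5.0, c0=0.2, breaks=[(1e3, 0.5, 0.1)])

print(loglog_slope(curve, 1.0))   # slope well before the break
print(loglog_slope(curve, 1e6))   # slope well after the break
```

Far from the break the curve behaves like a plain power law, which is why a few points on either side of each break can be enough to pin the segments down. Finding where the breaks are is still the hard part.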

Contrastive Decoding: Open-ended Text Generation as Optimization

Likelihood, although useful as a training loss, is a poor search objective for guiding open-ended generation from…

arxiv.org

Another entry in the alternative LLM decoder space (alongside typical decoding, RankGen, etc.). This compares the model's output to that of a smaller model and, assuming the smaller model is less intelligent, less fluent, or more error-prone, steers away from the smaller model's choices.
I would like to have a tool which makes it easier to switch between these decoders.
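A toy sketch of the idea, not the paper's implementation: score each candidate token by how much more the large ("expert") model likes it than the small ("amateur") model does. The `alpha` cutoff, which drops tokens the expert itself finds implausible, is how I understand the paper's safeguard; the function names and the probabilities below are made up for illustration.

```python
import math

def contrastive_scores(expert_probs, amateur_probs, alpha=0.1):
    """Return expert-minus-amateur log-probability for each plausible token.

    A token is plausible if the expert gives it at least alpha times
    the probability of the expert's favorite token.
    """
    cutoff = alpha * max(expert_probs.values())
    scores = {}
    for token, p_expert in expert_probs.items():
        if p_expert < cutoff:
            continue  # the expert itself considers this token unlikely
        scores[token] = math.log(p_expert) - math.log(amateur_probs[token])
    return scores

def pick_token(expert_probs, amateur_probs, alpha=0.1):
    """Choose the token with the highest contrastive score."""
    scores = contrastive_scores(expert_probs, amateur_probs, alpha)
    return max(scores, key=scores.get)

# Hypothetical next-token distributions from the two models.
expert = {"the": 0.5, "Paris": 0.3, "a": 0.199, "zzz": 0.001}
amateur = {"the": 0.6, "Paris": 0.05, "a": 0.3499, "zzz": 0.0001}
print(pick_token(expert, amateur))
```

Without the cutoff, a junk token like `zzz` could win just because the small model likes it even less than the large one does.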

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual…

arxiv.org

Introduces the NaturalInstructions dataset: simple language tasks where the instructions are provided by non-expert crowdworkers. I believe that I saw this being used for a model in the InstructGPT / T0 family. Since then, there’s already a multilingual version (Super-NaturalInstructions), and the following paper does a crosslingual version, also in this Instruct model mindset.

Crosslingual Generalization through Multitask Finetuning

Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot…

arxiv.org

As mentioned above, this takes the InstructGPT / T0 family and studies crosslingual transfer: when a model is fine-tuned on instructions (released here) in one language, can it then perform the task well on text in another language? The authors release these instruct-ified models, BLOOMZ and mT0, based on previous HuggingFace releases.

DiffusER: Discrete Diffusion via Edit-based Reconstruction

In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm…

arxiv.org

Diffusion models for text, which is sort of bizarre: you start out with random tokens and then generate valid text. Later the authors are able to perform summarization or translation using the model. This might be useful for editing text in place rather than generating it from scratch (so you can see why summarization or translation might be the best-suited tasks).

Diffusion Models for Adversarial Purification

Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative…

arxiv.org

Researchers create a diffusion model library (DiffPure) to un-break adversarial images.

Does AI Debias Recruitment? Race, Gender, and AI’s “Eradication of Difference” — Philosophy &…

In this paper, we analyze two key claims offered by recruitment AI companies in relation to the development and…

link.springer.com

Not on arxiv, but a useful study of worrying ‘AI interview’ companies.

Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model

Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires…

arxiv.org

Discusses the carbon emitted in training the BLOOM model, both from energy production and from general manufacturing of hardware etc. This is something we can research in more critical detail because it comes through HuggingFace / Big Science instead of a private LLM project.

The team specifically chose a French supercomputer where the energy would come from nuclear power (described in the paper only in terms of grid carbon intensity). This led to a complaint in the project issues that: “…an environmental impact assessment is far beyond CO2 emissions. Take for example the ecological (and societal !) disasters that the mining of the uranium…” (source). The paper does not mention ‘nuclear’ or ‘uranium’ at all.

Federated Unlearning: How to Efficiently Erase a Client in FL?

With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model…

arxiv.org

It’s already tricky to ‘unlearn’ something in ML, so it must be harder to do in federated learning. This describes a process which maximizes loss on the data to be unlearned, but doesn’t test on a problem beyond MNIST.
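A bare-bones sketch of "maximize loss on what you want to forget", using a tiny logistic regression in plain Python. This is only the core idea, not the paper's procedure: all names and numbers here are hypothetical, and a real method would need some constraint to stop the ascent from wrecking the model on everyone else's data, which this sketch leaves out.

```python
import math

def predict(w, x):
    """Logistic regression probability for feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, data):
    """Average cross-entropy loss over (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, x), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def mean_gradient(w, data):
    """Gradient of mean_loss with respect to the weights."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(data)
    return grad

def unlearn_step(w, forget_data, lr=0.5):
    """One gradient ASCENT step: move the weights to increase the loss."""
    grad = mean_gradient(w, forget_data)
    return [wi + lr * gi for wi, gi in zip(w, grad)]

# Hypothetical trained weights and a client's data to forget.
weights = [2.0, -1.0]
forget = [([1.0, 0.5], 1), ([0.5, 1.0], 0)]
updated = unlearn_step(weights, forget)
print(mean_loss(weights, forget), mean_loss(updated, forget))
```

The only difference from ordinary training is the sign of the update: `+ lr * gradient` instead of `- lr * gradient`.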

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any…

arxiv.org

Researchers are skeptical of typical few-shot models, such as T0, because there is research showing that the names of the labels, or the order of the examples, are hugely influential on how well the model parses few-shot prompts. I’m not super-clear on what their alternative method is, but it fine-tunes the model weights on a few examples instead of putting those examples in the prompt.

Fine-tuned Language Models are Continual Learners

Recent work on large language models relies on the intuition that most natural language processing tasks can be…

arxiv.org

Researchers at Facebook/Meta and Columbia take an instruct model (T0) and show that they can continue to fine-tune on new instruction tasks without losing performance on previous instruct tasks. Positive news for continual learning.

I wish I would have loved this one, but I didn’t: A multilingual dataset for counterfactual…

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual…

www.amazon.science

Amazon has an unusual approach to open research at times. Here we have an interesting dataset which covers different ways people express alternatives to reality (could have done this, would end up like this, etc). This makes it easier to test models’ understanding of natural hypotheticals and negation, beyond just keywords. The dataset also goes beyond English to include statements in German and Japanese.

Inverse scaling can become U-shaped

Although scaling language models improves performance on a range of tasks, there are apparently some scenarios where…

arxiv.org

This is the latest work on U-shaped scaling (it’s more recent than the Broken Neural Scaling Laws paper). The group finds that specific tasks within the Inverse Scaling Prize would do well on a 5x larger model (PaLM). They also explore whether chain-of-thought prompting (CoT, basically adding ‘let’s think step-by-step’) helps.

Large language models are not zero-shot communicators

Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of…

arxiv.org

Researchers find that LLMs perform poorly on conversational exchanges where, instead of saying ‘yes’ or ‘no’, the response gives some information or experience from which the answer has to be inferred. Humans manage this (86% accuracy), but AI still doesn’t get it (62% on T0, 72% on InstructGPT). I found it frustrating that the dataset mixes joke responses (‘can fish swim?’) in with directly relevant responses (‘we’re still looking’).
This paper takes a notable step of labeling InstructGPT as ‘UNK FT’ (a recent Tweet noted that OpenAI’s current instruct models use an ‘unknown’ fine-tuning method not described in current papers).

Large Language Models Can Self-Improve

Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM…

arxiv.org

Researchers improve results on math class word problems by providing the steps to solve a problem. The LLM generates the reasoning for other word problems, using diverse decoding for multiple possible solutions. A simple script reads the possible solutions to pick the most popular answer. These generated word problem work-throughs can be used to improve the model’s training and results on unseen problems.
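The "simple script" step could look something like the sketch below. It assumes each sampled generation has already been split into a reasoning string and a final answer string; the function names and the sample data are hypothetical, not taken from the paper.

```python
from collections import Counter

def majority_answer(samples):
    """Return the most common final answer and the share of samples agreeing.

    samples is a list of (reasoning, answer) pairs from repeated decoding.
    """
    counts = Counter(answer for _, answer in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

def consistent_paths(samples):
    """Keep only the work-throughs that reach the majority answer."""
    answer, _ = majority_answer(samples)
    return [(reasoning, ans) for reasoning, ans in samples if ans == answer]

# Hypothetical generations for one word problem.
generations = [
    ("2 apples plus 3 apples is 5 apples.", "5"),
    ("3 + 2 = 5.", "5"),
    ("2 times 3 is 6.", "6"),
]
print(majority_answer(generations))
print(len(consistent_paths(generations)))
```

The work-throughs that survive the vote are the ones that could be fed back in as extra training examples.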

Scaling Instruction-Finetuned Language Models

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model…

arxiv.org

Large Google paper using instruct models (Flan-PaLM). They find that fine-tuning their model on some tasks which use chain-of-thought (CoT) improves performance on all tasks. They describe CoT with a table: essentially you can use the prompt to encourage a ‘step-by-step’ generation before reaching the answer, and if you provide a one-shot example in the prompt you can use that to demonstrate the reasoning.
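For a sense of what those two prompt styles look like, here is a small sketch that assembles them as strings. The `Q:` / `A:` layout, the function name, and the example question are my own assumptions for illustration and may not match the paper's exact templates.

```python
def build_cot_prompt(question, example=None, trigger="Let's think step-by-step."):
    """Build a chain-of-thought prompt, optionally with a one-shot example.

    example is an optional (question, reasoning, answer) tuple that
    demonstrates the reasoning style before the real question.
    """
    parts = []
    if example is not None:
        ex_question, ex_reasoning, ex_answer = example
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    # The trigger phrase nudges the model to write out steps before answering.
    parts.append(f"Q: {question}\nA: {trigger}")
    return "\n\n".join(parts)

# Hypothetical one-shot demonstration.
demo = (
    "I have 2 apples and buy 3 more. How many apples do I have?",
    "I start with 2 apples. Buying 3 more gives 2 + 3 = 5.",
    "5",
)
print(build_cot_prompt("A train has 4 cars with 10 seats each. How many seats?", demo))
```

Calling it without `example` gives the zero-shot version, where only the trigger phrase asks for the step-by-step reasoning.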

GitHub - s2e-lab/SecurityEval: Repository for "SecurityEval Dataset: Mining Vulnerability Examples…

This repository contains source code for the paper titled SecurityEval Dataset: Mining Vulnerability Examples to…

github.com

SecurityEval isn’t on arxiv, so I’ll link to a GitHub repo instead of a PDF. The team follows the cited Asleep at the Keyboard? in using CodeQL and the CWE dataset to find vulnerable code. They focus on 130 issues which are also relevant to Python, and study results on InCoder and Copilot.

I would like to see the Big Code project use this dataset, and maybe add more examples, particularly if they are ‘bad’ examples included in the training data, to check that we are not copying learned errors.

SetFit: Efficient Few-Shot Learning Without Prompts

SetFit is significantly more sample efficient and robust to noise than standard fine-tuning. Few-shot learning with…

huggingface.co

I saw a video about this but don’t know a ton; essentially it’s an approach built on sentence transformers which makes few-shot learning work especially well.

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1…

www.automl.org

TabPFN makes progress on a long-standing problem (how to get neural networks performing well on tabular data?) but falters on datasets over 1,000 rows. That is a peculiar limit, and it would be nice to see if the next generation of this is more generally useful.

TVLT: Textless Vision-Language Transformer

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw…

arxiv.org

Expands the visual transformers world by including audio inputs instead of text. It’s interesting that audio and images can be ‘encoded’ with this system, which was originally designed for text that splits more easily into discrete pieces.

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and…

arxiv.org

Researchers are curious about why otherwise skilled multimodal transformers have difficulty figuring out descriptions of position, nested items, etc. They look through the dataset and find that many examples have confusing captions or unclear images. Even for the best ~43% of images, there remains an issue. The team uses text augmentation to create stranger and more obviously ‘wrong’ captions, but the models still don’t perform well on Winoground. The thought here is that the visual and text components of the model each understand their input independently, but cannot connect the distinctions.

— — —


Written by Nick Doiron
