ML Arxiv Haul #11
Adversarial Policies Beat Professional-Level Go AIs
We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a…
Researchers break the current dominance of Go-playing AIs. Rather oddly, “the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary”. The strategy wins by holding a high-scoring corner of territory, unoccupied by KataGo, at the moment when it tricks KataGo into ending the game. Human players do not follow the script and usually beat the adversary.
Broken Neural Scaling Laws
We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of…
A controversial paper which claims to describe the scalability of several ML model tasks. The authors focus on two assertions:
- ‘inverse scaling’ tasks (as studied in the Inverse Scaling Prize) are just showing ‘U-shaped scaling’, and performance would improve again on appropriately large models
- the scalability of models generally can be represented with a series of power law functions; this makes it possible to show the scaling function with a few points around each break point
The controversy comes from whether this is oh-shit groundbreaking, something already understood (there are a few replies showing older papers with U-shaped scaling), or a tautology (what curve couldn’t be described with power laws… the challenge lies in finding those points).
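To get a feel for what a ‘smoothly broken power law’ looks like, here is a minimal sketch. The functional form is the one I understand the paper to use, so treat it as an assumption, and every parameter value is made up for illustration rather than fitted to any real model.

```python
import math

def bnsl(x, a, b, c0, breaks):
    """Smoothly broken power law (assumed form):

        y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    `breaks` is a list of (c_i, d_i, f_i) tuples: change in log-log slope,
    location of the break, and how sharp the break is.
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

def loglog_slope(fn, x, eps=1e-3):
    """Numerical slope of fn on log-log axes around x."""
    lo, hi = x * (1 - eps), x * (1 + eps)
    return (math.log(fn(hi)) - math.log(fn(lo))) / (math.log(hi) - math.log(lo))

# Hypothetical 'U-shaped' error curve: error rises with scale at first
# (negative c0), then falls after a break located near x = 1e6.
def u_shaped(x):
    return bnsl(x, a=0.0, b=1.0, c0=-0.1, breaks=[(0.3, 1e6, 0.2)])

for x in (1e2, 1e5, 1e10):
    print(f"x={x:.0e}  error={u_shaped(x):.3f}  slope={loglog_slope(u_shaped, x):+.2f}")
```

Far from the break the curve behaves like a plain power law on either side, which is why a handful of points around each break is enough to pin it down, and also why the ‘tautology’ objection has some bite.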
Contrastive Decoding: Open-ended Text Generation as Optimization
Likelihood, although useful as a training loss, is a poor search objective for guiding open-ended generation from…
Another entry in the alternative LLM decoder space (along with typical decoding, RankGen, etc.). This compares the output of the model to that of a smaller model and, assuming the smaller model is less intelligent or fluent, or more error-prone, steers away from its choices.
I would like to have a tool which makes it easier to switch between these.
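The ‘steer away from the smaller model’ idea can be sketched for a single decoding step. This is a toy version under my own assumptions: the probabilities are invented, the function name is hypothetical, and the `alpha` cutoff is my reading of the paper's plausibility constraint, not its actual code.

```python
import math

def contrastive_next_token(expert_probs, amateur_probs, alpha=0.1):
    """Pick the token whose log-probability under the large (expert) model
    most exceeds its log-probability under the small (amateur) model.

    Only tokens the expert itself finds plausible are considered:
    p_expert(token) >= alpha * max(p_expert).
    """
    cutoff = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_exp in expert_probs.items():
        if p_exp < cutoff:
            continue  # implausible under the expert, so skip it
        score = math.log(p_exp) - math.log(amateur_probs[token])
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Invented next-token distributions over a four-word vocabulary.
expert = {"the": 0.5, "Paris": 0.3, "a": 0.199, "zzz": 0.001}
amateur = {"the": 0.6, "Paris": 0.05, "a": 0.3499, "zzz": 0.0001}

print(contrastive_next_token(expert, amateur))           # with the cutoff
print(contrastive_next_token(expert, amateur, alpha=0))  # without it
```

Without the cutoff, a rare token that both models dislike can win just because the smaller model dislikes it even more, so some plausibility filter seems necessary.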
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual…
Introduces the NaturalInstructions dataset: simple language tasks where the instructions are provided by non-expert crowdworkers. I believe that I saw this being used for a model in the InstructGPT / T0 family. Since then, there’s already a multilingual version (Super-NaturalInstructions), and the following paper takes a crosslingual approach in the same Instruct model mindset.
Crosslingual Generalization through Multitask Finetuning
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot…
As mentioned above, this takes the InstructGPT / T0 family and asks: when a model is fine-tuned on instructions (released here) in one language, can it perform the task well on text in another language? The authors release the instruct-ified models BLOOMZ and mT0, based on previous HuggingFace releases.
DiffusER: Discrete Diffusion via Edit-based Reconstruction
In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm…
Diffusion models in text are sort of bizarre: you start out with random tokens and then generate valid text. The authors are later able to perform summarization or translation using the model. This might be useful for editing text in place rather than generating it from scratch (so you can see why summarization or translation might be the best-suited tasks).
Diffusion Models for Adversarial Purification
Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative…
Researchers create a diffusion model library (DiffPure) to un-break adversarial images.
Does AI Debias Recruitment? Race, Gender, and AI’s “Eradication of Difference” — Philosophy &…
In this paper, we analyze two key claims offered by recruitment AI companies in relation to the development and…
Not on arxiv, but a useful study of worrying ‘AI interview’ companies.
Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model
Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires…
Discusses the carbon emissions from energy production and from general manufacturing of hardware, etc., needed to train the BLOOM model. This is something we can examine in more critical detail because it comes through HuggingFace / BigScience instead of a private LLM project.
The team specifically chose a French supercomputer where the energy would come from nuclear power (described in the paper only in terms of grid carbon intensity). This led to a complaint in the project issues that “…an environmental impact assessment is far beyond CO2 emissions. Take for example the ecological (and societal !) disasters that the mining of the uranium…” (source). The paper does not mention ‘nuclear’ or ‘uranium’ at all.
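To see why the choice of grid matters so much, the operational part of such an estimate boils down to energy used times grid carbon intensity. A back-of-the-envelope sketch, where the function name and all numbers are placeholders of mine and NOT figures from the paper:

```python
def training_emissions_tonnes(energy_kwh, grid_intensity_g_per_kwh):
    """Operational emissions only: energy drawn times grid carbon intensity.

    Ignores hardware manufacturing and idle consumption, which the paper
    also discusses. Returns tonnes of CO2-equivalent.
    """
    grams = energy_kwh * grid_intensity_g_per_kwh
    return grams / 1_000_000  # grams to tonnes

# Placeholder values for illustration only (not taken from the paper).
example_energy_kwh = 100_000
low_carbon_grid = 50     # gCO2eq per kWh, e.g. a mostly nuclear grid
high_carbon_grid = 400   # gCO2eq per kWh, e.g. a fossil-heavy grid

print(training_emissions_tonnes(example_energy_kwh, low_carbon_grid))
print(training_emissions_tonnes(example_energy_kwh, high_carbon_grid))
```

The same training run comes out several times ‘cleaner’ on the low-carbon grid. That is exactly the kind of single-number framing the complaint above objects to, since the number says nothing about mining or waste.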
Federated Unlearning: How to Efficiently Erase a Client in FL?
With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model…
It’s already tricky to ‘unlearn’ something in ML, so it must be harder to do in federated learning. This describes a process to maximize loss on an unlearned example, but doesn’t test on a problem beyond MNIST.
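As a toy illustration of ‘maximize loss on the example you want forgotten’, here is gradient ascent on a one-parameter model. The clamp that keeps the weight near the original model is my own assumption, added so the loss cannot grow without bound, and none of this is the paper's actual federated procedure.

```python
def squared_loss(w, x, y):
    """Loss of a one-weight linear model on a single example."""
    return (w * x - y) ** 2

def unlearn_example(w_ref, x, y, lr=0.05, steps=20, radius=1.0):
    """Gradient ASCENT on one example's loss, staying within `radius` of w_ref."""
    w = w_ref
    for _ in range(steps):
        grad = 2 * (w * x - y) * x  # d(loss)/dw
        w = w + lr * grad           # step uphill instead of downhill
        # Clamp back into the allowed interval around the reference weight.
        w = max(w_ref - radius, min(w_ref + radius, w))
    return w

# Hypothetical setup: the trained weight nearly fits the example to forget.
w_trained = 2.0
forget_x, forget_y = 1.0, 2.1

w_new = unlearn_example(w_trained, forget_x, forget_y)
print(squared_loss(w_trained, forget_x, forget_y), squared_loss(w_new, forget_x, forget_y))
```

A real evaluation would also need to check that performance on the remaining clients' data survives, which this sketch does not attempt.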
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any…
Researchers are skeptical of typical few-shot models, such as T0, because there is research showing that the names of the labels, or the order of examples, are hugely influential on how well the model parses few-shot prompts. I’m not super-clear on what their alternative method is, but it fine-tunes the model weights on a few examples instead of relying on prompting.
Fine-tuned Language Models are Continual Learners
Recent work on large language models relies on the intuition that most natural language processing tasks can be…
Researchers at Facebook/Meta and Columbia take an instruct model (T0) and show that they can continue to fine-tune it on new instruction tasks without losing performance on previous instruct tasks. Positive news for continual learning.
I wish I would have loved this one, but I didn’t: A multilingual dataset for counterfactual…
Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual…
Amazon has an unusual approach to open research at times. Here we have an interesting dataset which covers the different ways people express alternatives to reality (could have done this, would end up like this, etc.). This makes it easier to test models’ understanding of natural hypotheticals and negation, rather than just keywords. The dataset also goes beyond English to include statements in German and Japanese.
Inverse scaling can become U-shaped
Although scaling language models improves performance on a range of tasks, there are apparently some scenarios where…
This is the latest work on U-shaped scaling (it’s more recent than the Broken Neural Scaling Laws paper). The group finds that specific tasks within the Inverse Scaling Prize would do well on a 5x larger model (PaLM). They also explore whether chain-of-thought prompting (CoT, basically adding ‘let’s think step-by-step’) helps.
Large language models are not zero-shot communicators
Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of…
Researchers find that LLMs perform poorly on conversation exchanges where, instead of saying ‘yes’ or ‘no’, the response gives some information or experience from which the answer has to be inferred. Humans manage this (86% accuracy), but AI still doesn’t get it (62% on T0, 72% on InstructGPT). I found it frustrating that the dataset mixes joke responses (‘can fish swim?’) in with directly relevant responses (‘we’re still looking’).
This paper takes a notable step of labeling InstructGPT as ‘UNK FT’ (a recent Tweet noted that OpenAI’s current instruct models use an ‘unknown’ fine-tuning method not described in current papers).
Large Language Models Can Self-Improve
Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM…
Researchers improve results on math class word problems by providing the steps to solve a problem. The LLM generates the reasoning for other word problems, using diverse decoding for multiple possible solutions. A simple script reads the possible solutions to pick the most popular answer. These generated word problem work-throughs can be used to improve the model’s training and results on unseen problems.
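The ‘simple script’ step is essentially a majority vote over sampled solutions. A minimal sketch, with invented samples and hypothetical function names; how the final answer gets extracted from each generated solution is assumed to have happened already:

```python
from collections import Counter

def most_popular_answer(sampled_solutions):
    """Majority vote over (reasoning_text, final_answer) pairs.

    Returns the winning answer and the fraction of samples that agree with it.
    """
    counts = Counter(answer for _, answer in sampled_solutions)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_solutions)

def keep_for_finetuning(sampled_solutions):
    """Keep only the work-throughs that reach the majority answer."""
    winner, _ = most_popular_answer(sampled_solutions)
    return [reasoning for reasoning, answer in sampled_solutions if answer == winner]

# Invented samples for one word problem.
samples = [
    ("3 + 4 = 7 apples, doubled is 14.", "14"),
    ("Double 3 is 6, double 4 is 8, total 14.", "14"),
    ("3 * 4 = 12 apples.", "12"),
]

print(most_popular_answer(samples))
print(keep_for_finetuning(samples))
```

The kept work-throughs are what would be fed back in as training data; the agreement fraction could double as a confidence filter.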
Scaling Instruction-Finetuned Language Models
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model…
Large Google paper using instruct models (Flan-PaLM). They find that fine-tuning their model on some tasks which use chain-of-thought (CoT) improves performance on all tasks. They describe CoT with this table: essentially you can use the prompt to encourage a ‘step-by-step’ generation before reaching the answer, and if you provide a one-shot example in the prompt you can use that to demonstrate reasoning.
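The two prompt styles described above can be sketched as plain string templates. The exact wording and the `Q:` / `A:` layout are my own assumptions, not the paper's templates, and the questions are invented.

```python
def zero_shot_cot_prompt(question):
    """Nudge the model to reason before answering, with no worked example."""
    return f"Q: {question}\nA: Let's think step by step."

def one_shot_cot_prompt(example_question, example_reasoning, example_answer, question):
    """Show one worked example so the model imitates the reasoning format."""
    return (
        f"Q: {example_question}\n"
        f"A: {example_reasoning} The answer is {example_answer}.\n\n"
        f"Q: {question}\n"
        "A:"
    )

print(zero_shot_cot_prompt("A shop sells pens in packs of 6. How many pens are in 7 packs?"))
print()
print(one_shot_cot_prompt(
    "Ana has 3 apples and buys 4 more. How many does she have?",
    "She starts with 3 and adds 4, so 3 + 4 = 7.",
    "7",
    "A shop sells pens in packs of 6. How many pens are in 7 packs?",
))
```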
GitHub - s2e-lab/SecurityEval: Repository for "SecurityEval Dataset: Mining Vulnerability Examples…
This repository contains source code for the paper titled SecurityEval Dataset: Mining Vulnerability Examples to…
SecurityEval isn’t on arxiv, so I’ll link to a GitHub repo instead of a PDF. The team follows the cited Asleep at the Keyboard? in using CodeQL and the CWE dataset to find vulnerable code. They focus on 130 issues which are also relevant to Python, and study results on InCoder and Copilot.
I would like to see the Big Code project use this dataset, and maybe add more examples, particularly if they are ‘bad’ examples included in the training data, to check that we are not copying learned errors.
SetFit: Efficient Few-Shot Learning Without Prompts
SetFit is significantly more sample efficient and robust to noise than standard fine-tuning. Few-shot learning with…
I saw a video about this but don’t know a ton; essentially it’s an architecture built on sentence transformers which makes few-shot learning work especially well.
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1…
TabPFN makes progress on a long-standing problem (how to get neural networks performing well on tabular data?) but falters on datasets over 1,000 rows. Peculiar; it would be nice to see whether the next generation of this is more generally useful.
TVLT: Textless Vision-Language Transformer
In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw…
Expands the visual transformers world by including audio inputs instead of text. It’s interesting that audio and images can be ‘encoded’ with this system, which was originally designed for text that is much easier to split into discrete pieces.
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and…
Researchers are curious why skilled multimodal transformers have such difficulty figuring out descriptions of position, nested items, etc. They look through the dataset and find that many examples have confusing captions or unclear images. Even for the best ~43% of images, there remains an issue. The team uses text augmentation to create stranger and more obviously ‘wrong’ captions, but the models still don’t perform well on Winoground. The thought here is that the visual and text components of the model each understand their own input, but cannot connect the distinctions between them.
— — —
On a lighter note: