ML Arxiv Haul #11

Nick Doiron
8 min read · Nov 9, 2022

Researchers break the current dominance of Go-playing AIs. Rather oddly, “the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary”. The strategy works by keeping a high-scoring corner of territory unoccupied by KataGo at the moment it is tricked into ending the game. Human players do not fall for the script and usually beat the adversary.

A controversial paper which claims to describe the scalability of several ML model tasks. The authors focus on two assertions:

  • ‘inverse scaling’ tasks (as being studied in the Inverse Scaling Prize) are just showing ‘U-shaped scaling’ and would perform better on appropriately large models
  • the scalability of models generally can be represented with a series of power law functions; this makes it possible to show the scaling function with a few points around each break point

The controversy comes from whether this is oh-shit groundbreaking, something already understood (there are a few replies showing older papers with U-shaped scaling), or a tautology (what curve couldn’t be described with power laws… the challenge lies in finding those points).
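To make the "series of power laws" idea concrete, here is a minimal sketch of a piecewise power law that stays continuous at each break point. This is my own simplified illustration, not the functional form from the paper; the function name, the parameters, and the example numbers are all assumptions.

```python
def broken_power_law(x, a, exponents, breaks):
    """Piecewise power law that is continuous at every break point.

    Hypothetical illustration: `exponents` holds one exponent per segment,
    `breaks` holds the x-values (sorted ascending) where the exponent changes,
    so len(exponents) must equal len(breaks) + 1.
    """
    if len(exponents) != len(breaks) + 1:
        raise ValueError("need exactly one more exponent than break points")
    coeff = a          # value of the curve at the current anchor
    anchor = 1.0       # x-position where the current segment starts
    exponent = exponents[0]
    for brk, next_exponent in zip(breaks, exponents[1:]):
        if x <= brk:
            break
        # Carry the curve's value at the break into the next segment
        coeff = coeff * (brk / anchor) ** exponent
        anchor = brk
        exponent = next_exponent
    return coeff * (x / anchor) ** exponent


# Made-up 'U-shaped' curve: performance falls until x = 100, then rises
for scale in (10, 100, 1000):
    print(scale, broken_power_law(scale, a=100.0, exponents=[-1.0, 1.0], breaks=[100.0]))
```

With one break you need the starting coefficient, two exponents, and the break location, which is why a few measurements on either side of each break are enough to pin the curve down. Knowing where the breaks are is the hard part.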

Another entry in the alternative LLM decoder space (along with typical decoding, RankGen, etc.). This compares the output of the model to that of a smaller model and, assuming the smaller model is less intelligent, less fluent, or more error-prone, steers away from its choices.
I would like to have a tool which makes it easier to switch between these.
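Here is a rough sketch of how steering away from the smaller model could work for a single token. The scoring rule (difference of log-probabilities) and the `alpha` plausibility cutoff are my assumptions about one reasonable way to do it, and the probabilities are made up.

```python
import math

def contrastive_next_token(expert_probs, amateur_probs, alpha=0.1):
    """Pick the token the large model likes much more than the small one.

    Hypothetical sketch: both arguments map token -> probability.
    Tokens the large ('expert') model finds implausible are skipped,
    otherwise very rare tokens could win on ratio alone.
    """
    cutoff = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_expert in expert_probs.items():
        if p_expert < cutoff:
            continue  # not plausible enough under the large model
        p_amateur = max(amateur_probs.get(token, 0.0), 1e-12)
        score = math.log(p_expert) - math.log(p_amateur)
        if score > best_score:
            best_token, best_score = token, score
    return best_token


expert = {"the": 0.5, "Paris": 0.3, "a": 0.199, "zzz": 0.001}
amateur = {"the": 0.6, "Paris": 0.05, "a": 0.3499, "zzz": 0.0001}
print(contrastive_next_token(expert, amateur))
```

In this toy case the small model over-prefers generic words, so the contrast favors "Paris". Without the cutoff, the junk token "zzz" would win because its ratio is even higher.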

Introduces the NaturalInstructions dataset: simple language tasks where the instructions are provided by non-expert crowdworkers. I believe that I saw this being used for a model in the InstructGPT / T0 family. Since then, there’s already a multilingual version (Super-NaturalInstructions), and the following paper does a crosslingual version, also in this Instruct model mindset.

As mentioned above, this takes the InstructGPT / T0 family and studies crosslingual transfer: when a model is fine-tuned on instructions (released here) in one language, can it perform the task well on text in another language? The authors release these instruct-ified models, BLOOMZ and mT0, based on previous HuggingFace releases.

Diffusion models in text are sort of bizarre: you start out with random tokens and then generate valid text. Later they are able to perform summarization or translation using the model. This might be useful to edit text in place rather than generating it (so you can see why summarization or translation might be the best tasks).

Researchers create a diffusion model library (DiffPure) to un-break adversarial images.

Not on arxiv, but a useful study of worrying ‘AI interview’ companies.

Discusses the carbon from energy production and from general manufacturing of hardware etc. to train the BLOOM model. This is something we can research in more critical detail because it comes through HuggingFace / Big Science instead of a private LLM project.
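The basic accounting here is simple arithmetic: energy used times the grid's carbon intensity, plus a share of the emissions from manufacturing the hardware. A minimal sketch follows; the function name is hypothetical and the numbers are round placeholders, not figures from the paper.

```python
def training_emissions_kg(energy_kwh, grid_intensity_g_per_kwh, embodied_kg=0.0):
    """Estimate CO2-equivalent emissions in kilograms.

    Hypothetical helper: operational emissions are energy (kWh) times grid
    carbon intensity (grams CO2eq per kWh); `embodied_kg` is whatever share
    of hardware manufacturing emissions you attribute to this training run.
    """
    operational_kg = energy_kwh * grid_intensity_g_per_kwh / 1000.0
    return operational_kg + embodied_kg


# Placeholder numbers only: 1,000 kWh on a 50 g/kWh grid plus 10 kg embodied
print(training_emissions_kg(1000, 50, embodied_kg=10))  # 60.0
```

The same energy use on a grid with ten times the carbon intensity gives ten times the operational emissions, which is why the choice of data center matters so much.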

The team specifically chose a French supercomputer where the energy would come from nuclear power (described in the paper only in terms of grid carbon intensity). This led to a complaint in the project issues that: “…an environmental impact assessment is far beyond CO2 emissions. Take for example the ecological (and societal !) disasters that the mining of the uranium…” (source). The paper does not mention ‘nuclear’ or ‘uranium’ at all.

It’s already tricky to ‘unlearn’ something in ML, so it must be harder to do in federated learning. This describes a process to maximize loss on an unlearned example, but doesn’t test on a problem beyond MNIST.
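To show what "maximize loss on an unlearned example" means mechanically, here is a toy sketch using plain logistic regression. It takes a gradient step uphill instead of downhill, and it leaves out the federated part (clients, aggregation) entirely. The names and numbers are my own assumptions, not the paper's procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_loss(weights, features, label):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(w * x for w, x in zip(weights, features)))
    eps = 1e-12  # avoid log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def unlearn_step(weights, features, label, lr=0.1):
    """One gradient-ASCENT step: move weights to increase loss on the example."""
    p = sigmoid(sum(w * x for w, x in zip(weights, features)))
    grad = [(p - label) * x for x in features]  # d(loss)/d(weights)
    return [w + lr * g for w, g in zip(weights, grad)]


weights = [0.5, -0.2]
forget_x, forget_y = [1.0, 2.0], 1
before = example_loss(weights, forget_x, forget_y)
after = example_loss(unlearn_step(weights, forget_x, forget_y), forget_x, forget_y)
print(before, after)  # loss on the forgotten example goes up
```

A real method would also need to limit the damage to everything else the model knows, which is where most of the difficulty is.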

Researchers are skeptical of typical few-shot models, such as T0, because there is research showing that the names of the labels and the order of examples are hugely influential on how well the model parses few-shot prompts. I’m not super-clear on what their alternative method is, but it fine-tunes the model weights on a few examples instead of relying on prompting.

Researchers at Facebook/Meta and Columbia take an instruct model (T0) and show that they can continue to fine-tune on new instruction tasks without losing performance on previous instruct tasks. Positive news for continual learning.

Amazon has an unusual approach to open research at times. Here we have an interesting dataset which covers different ways people express alternatives to reality (could have done this, would end up like this, etc). This makes it easier to test models’ understanding of natural hypotheticals and negation, rather than just keywords. The dataset also goes beyond English to include statements in German and Japanese.

This is the latest work on U-shaped scaling (it’s more recent than the Broken Neural Scaling Laws paper). The group finds that specific tasks within the Inverse Scaling Prize would do well on a 5x larger model (PaLM). They also explore whether chain-of-thought prompting (CoT, basically adding ‘let’s think step-by-step’) helps.

Researchers find that LLMs perform poorly on conversational exchanges where, instead of saying ‘yes’ or ‘no’, the response gives some information or experience from which the listener has to infer the answer. Humans handle this well (86% accuracy), but AI still doesn’t get it (62% on T0, 72% on InstructGPT). I found it frustrating that the dataset mixes joke responses (‘can fish swim?’) in with directly relevant responses (‘we’re still looking’).
This paper takes a notable step of labeling InstructGPT as ‘UNK FT’ (a recent Tweet noted that OpenAI’s current instruct models use an ‘unknown’ fine-tuning method not described in current papers).

Researchers improve results on math class word problems by providing the steps to solve a problem. The LLM generates the reasoning for other word problems, using diverse decoding for multiple possible solutions. A simple script reads the possible solutions to pick the most popular answer. These generated word problem work-throughs can be used to improve the model’s training and results on unseen problems.
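The "simple script" step is just a majority vote over the final answers from each sampled solution. A minimal sketch, assuming the final answer has already been pulled out of each generated work-through as a string (the function name is hypothetical):

```python
from collections import Counter

def most_popular_answer(sampled_answers):
    """Return the most common final answer among sampled solutions.

    Hypothetical helper: blank answers are ignored; on a tie, the answer
    that appeared first wins.
    """
    counts = Counter(a.strip() for a in sampled_answers if a and a.strip())
    if not counts:
        return None
    answer, _count = counts.most_common(1)[0]
    return answer


# Five sampled solutions to the same word problem, three of which agree
print(most_popular_answer(["18", "18", "26", "18 ", "24"]))  # 18
```

The reasoning paths can differ as long as they land on the same answer, so sampling several diverse solutions and voting filters out a lot of one-off arithmetic slips.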

Large Google paper using instruct models (Flan-PaLM). They find that fine-tuning their model on some tasks which use chain-of-thought (CoT) improves performance on all tasks. They describe CoT with this table: essentially you can use the prompt to encourage a ‘step-by-step’ generation before reaching the answer, and if you provide a one-shot example in the prompt you can use that to demonstrate reasoning.
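The two prompt styles described above can be sketched as plain string templates. The exact wording and the "Q:/A:" layout are my assumptions, not the paper's templates, and the example question is made up.

```python
def zero_shot_cot_prompt(question):
    """Nudge the model to reason before answering, with no examples."""
    return f"Q: {question}\nA: Let's think step by step."

def one_shot_cot_prompt(example_question, example_reasoning, example_answer, question):
    """Show one worked example so the model imitates the reasoning style."""
    return (
        f"Q: {example_question}\n"
        f"A: {example_reasoning} The answer is {example_answer}.\n\n"
        f"Q: {question}\n"
        f"A:"
    )


print(one_shot_cot_prompt(
    "I have 3 apples and buy 2 more. How many apples do I have?",
    "I start with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "5",
    "A train has 4 cars with 10 seats each. How many seats are there?",
))
```

Either string is then sent to the model as-is. In the one-shot version, the worked example carries the instruction to reason step by step.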

SecurityEval isn’t on arxiv, so I’ll link to a GitHub repo instead of a PDF. The team follows the cited Asleep at the Keyboard? in using CodeQL and the CWE dataset to find vulnerable code. They focus on 130 issues which are also relevant to Python, and study results on InCoder and Copilot.

I would like to see the Big Code project use this dataset, and maybe add more examples, particularly if they are ‘bad’ examples included in the training data, to check that we are not copying learned errors.

I saw a video about this but don’t know a ton; essentially it’s an architecture on sentence transformers which makes few-shot learning work especially well.

TabPFN makes progress on a long-standing problem (how to get neural networks performing well on tabular data?) but falters on datasets over 1,000 rows. Peculiar; it would be nice to see whether the next generation of this is more generally useful.

Expands the visual transformers world by including audio inputs instead of text. It’s interesting that audio and images can be ‘encoded’ with this system, which was originally designed for text that splits more naturally into tokens.

Researchers are curious why otherwise skilled multimodal transformers struggle to figure out descriptions of position, nested items, etc. on Winoground. They look through the dataset and find that many examples have confusing captions or unclear images. Even on the best ~43% of images, the problem remains. The team uses text augmentation to create stranger and more obviously ‘wrong’ captions, but the models still don’t perform well. The thought here is that the visual and text components of the model each understand their input independently, but the model cannot connect the distinctions between them.

— — —

On a lighter note:
