ML Arxiv Haul #12

9 min read · Nov 16, 2022

We’re seeing a bunch of NLP pre-prints ahead of conference dates or end of the semester, so I’m already ready for another 20 papers in this ML Arxiv Haul. Maybe I should turn these into a monthly email newsletter?

A Contrastive Framework for Neural Text Generation

Text generation is of great importance to many natural language processing applications. However, maximization-based…

arxiv.org

In the previous haul I was looking at ‘contrastive decoding’ using a smaller model, and here is a totally different process named ‘contrastive search’. ‘Contrastive’ is also in two other papers in this post. So hot right now. HuggingFace had a blog about this paper last week, so I added it to my future decoders project. The process seems to start during model training, where the token representations are pushed apart to occupy more of the vector space. The contrastive method here selects a token from the most probable tokens, but includes a penalty based on how close its representation is to those of the previous tokens (i.e. highest penalty if it is the same token repeated). I wonder how this would handle situations where you intentionally repeat a word or token (I did not know that that would happen, how does a fly fly; incidentally I am listening to a Grammar Girl podcast episode about double words now).

They use LCCC as a benchmark for Chinese text generation. I haven’t seen this before, and it’s encouraging to see decoder research outside of English. Here they found significant human preference over other decoding methods, but the English models needed both components of their algorithm (SimCTG) to improve.
One downside is that, though the decoding method is supposedly workable with other models, their examples (I think) all include a fine-tuning step first.
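To make the selection step concrete, here is a rough sketch of how I understand the scoring: each top-k candidate gets its model probability minus a penalty equal to its highest cosine similarity to any earlier token's representation. The function names, the toy 2-D vectors, and the `alpha` default are my own assumptions for illustration, not the authors' code or the HuggingFace API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def contrastive_pick(candidates, context_vecs, alpha=0.6):
    """Pick one token from the top-k candidates.

    candidates: list of (token, probability, hidden_vector) for the top-k tokens.
    context_vecs: hidden vectors of the tokens generated so far.
    alpha: weight of the repetition penalty (hypothetical default).
    """
    best_token, best_score = None, float("-inf")
    for token, prob, vec in candidates:
        # Penalty: similarity to the closest earlier token (0 if there is no context).
        penalty = max((cosine(vec, c) for c in context_vecs), default=0.0)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Toy example: "fly" is the most probable candidate, but its vector
# matches the last context token, so it is penalized heavily.
context = [[1.0, 0.0], [0.9, 0.1]]
top_k = [
    ("fly", 0.5, [0.9, 0.1]),
    ("buzz", 0.3, [0.0, 1.0]),
]
print(contrastive_pick(top_k, context))             # penalized choice
print(contrastive_pick(top_k, context, alpha=0.0))  # plain greedy choice
```

With `alpha=0.0` the penalty disappears and this falls back to greedy decoding, so an intentional repeat like "fly fly" would only survive when the model's probability outweighs the penalty.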

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations

Recent work on explainable NLP has shown that few-shot prompting can enable large pretrained language models (LLMs) to…

arxiv.org

The study looks at Winograd sentences (Angela tried to calm Carrie’s nerves at the airport because _ was scared of flying in airplanes.) and compares human-written explanations to ones written by GPT-3. I was expecting them to then do something with the explanations, but the explanation itself is accurate. They used the ‘Data Maps’ tool which came out in 2020 to pick out which examples are most difficult for a model, and the explanations from GPT-3 are still high quality.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural…

arxiv.org

BigScience / HuggingFace has released a bunch of papers about work on their collaborative multilingual model BLOOM — collecting the ROOTS dataset, carbon emissions, and an Instruct version of the model (BLOOMZ).
My Hindi-BERT model makes a tiny appearance here (!) when its tokenizer is compared to BLOOM’s.

CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation

The full power of human language-based communication cannot be realized without negation. All human languages have some…

arxiv.org

The researchers hire crowd-workers to collect sentences from Wikipedia which include more complex negation. The crowd-workers create variations on the sentence. Then the QA task is interpreting what did and didn’t happen based on a human understanding.
The best results come from UnifiedQA (I’m unfamiliar with this one; it’s based on T5) and (when few-shot is allowed) InstructGPT with chain-of-thought.

Data Feedback Loops: Model-driven Amplification of Dataset Biases

Datasets scraped from the internet have been critical to the successes of large-scale machine learning. Yet, this very…

arxiv.org

This is an emerging fear in AI Ethics land which I’m still a bit skeptical of. Essentially if AIs create a significant % of the text and images on the internet, this creates a cycle where their generated content influences and passes on biases to future models. There is some interesting research on how that works and what might be done to make generation from sampling less likely to pass on biases.

Do Users Write More Insecure Code with AI Assistants?

We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety…

arxiv.org

New research from Stanford on GitHub Copilot. When students were given several encryption scripting tasks and Copilot, they were likely to create vulnerable code, and were more likely to create these vulnerabilities the more that they accepted suggestions and highly rated its output. It wouldn’t be enough to use static analysis tools listed in the paper, because the security bugs are often “conceptual”.

Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

Machine learning models are now able to convert user-written text descriptions into naturalistic images. These models…

arxiv.org

Most comprehensive external audit which I’ve seen of DALL-E image generation. The authors find several stereotypes in careers, emotions, roles, families, etc. It’s difficult to push the model to generate some scenarios.

Evaluating the Adversarial Robustness of Adaptive Test-time Defenses

Adaptive defenses, which optimize at test time, promise to improve adversarial robustness. We categorize such adaptive…

arxiv.org

The authors endorse a model robustness tool (AutoAttack) but instead decide to focus on robustness measures which occur after the model is trained (in other words: at test time). I’m not familiar with these techniques so this is a good overview. I believe what they’re saying is the tools modify the model weights or the input data before the test or between batches (?). Anyway, their conclusion is that this category of test-time tools is disappointing and computationally expensive.

Harmonizing the object recognition strategies of deep neural networks with humans

The many successes of deep neural networks (DNNs) over the past decade have largely been driven by computational scale…

arxiv.org

The Twitter sales pitch for this paper was that as models become more accurate, their technique becomes less aligned with how humans view images.

As I scanned this, I wondered where the human attention maps come from. They come from the ClickMe dataset where “Participants select important image parts with their mouse by ‘painting’ translucent bubbles on screen.” As you might suspect, reporting based on clicking and dragging is not the same as eye-tracking or showing portions of the image to the user. The two examples which I’d highlight are the snake (where people have highlighted the closer, front half of the snake) and the ball (where models highlight the player). The humans seem to be highlighting the minimum area to recognize the object.

Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and…

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text…

arxiv.org

There is significant interest in NLP to ‘align’ language models to human intent as with InstructGPT, code models (which are easier to test by running their code), etc. The major research labs have been doing this with reinforcement learning, and Allen AI is letting us in on the process here. The repo for this paper is “RL4LMs” which is more optimistic than the paper’s title implies. Finally they introduce a new RL algorithm and a benchmark.
Unfortunately I failed to run these on a CoLab GPU :( maybe next time.

It's Hard for Neural Networks To Learn the Game of Life

Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods…

arxiv.org

It’d seem that neural networks could think through the basic rules of Conway’s Game of Life and predict the image a few rounds into the future. The researchers seemed interested in discussing the ‘lottery ticket’ phenomenon but instead were disappointed by neural networks’ general performance.
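For reference, the rule the networks are asked to learn fits in a few lines. This is a plain-Python sketch of one update step and of rolling it forward a few rounds; treating cells beyond the edge as dead is my assumption, and the paper's boundary handling may differ.

```python
def life_step(grid):
    """Apply one round of Conway's Game of Life to a 2D list of 0/1 cells."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue
                rr, cc = r + dr, c + dc
                # Cells outside the grid count as dead (assumption).
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += grid[rr][cc]
        return total

    new_grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            # A live cell survives with 2 or 3 neighbors; a dead cell is born with exactly 3.
            alive = n == 3 or (grid[r][c] == 1 and n == 2)
            row.append(1 if alive else 0)
        new_grid.append(row)
    return new_grid

def life_n_steps(grid, n):
    """Predict the board n rounds into the future."""
    for _ in range(n):
        grid = life_step(grid)
    return grid

blinker = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
print(life_step(blinker))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The "blinker" flips between horizontal and vertical every round, so two steps return the starting board.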

Large Language Models Struggle to Learn Long-Tail Knowledge

The internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code …

arxiv.org

There are scripts for finding what facts neural models know, editing facts in a model, etc., but this brings it back to the training data. The researchers find that a model’s ability to recall a fact is related to the number of times it sees that fact in pre-training. The study uses TriviaQA to benchmark this ability. I wish they had looked at categories of facts which are difficult for models; for example, they may choose a well-known city (Rio) for ‘what is the capital of Brazil?’.

Mutation Models: Learning to Generate Levels by Imitating Evolution

Search-based procedural content generation (PCG) is a well-known method for level generation in games. Its key…

arxiv.org

I am interested in game-generation algorithms in case they could be applied to language learning. Here they have a game level generator which works through evolution, and then attempt to train a neural network to mimic this evolutionary behavior.

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and…

arxiv.org

Expanding on previous collaborations on Bahasa Indonesia language benchmarks, this paper announces sentiment analysis data for 10 other Indonesian languages. I’m intrigued by the community organizing and collaboration visible for Indonesia, plus this win for preserving minority languages.
From Twitter convos, I learned that there are two models which came out around the same time both called ‘IndoBERT’.
I am hopeful this could carry over to some regional cooperation on AI (i.e. with Bahasa Melayu).

Operationalizing Machine Learning: An Interview Study

Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in…

arxiv.org

The authors interview machine learning engineers and managers about their jobs. There’s an interest in speeding up validation/iteration, and tracking versions of models. The paper includes a list of anti-patterns. I thought that there would be more discussion of feature-engineering-type work, but there’s just one point made about changing tools to SparkSQL.

Patching open-vocabulary models by interpolating weights

Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still…

arxiv.org

The researchers compare CLIP and a CLIP fine-tuned on a new image task or synthetic data, and use their PAINT code to improve the model based on that. They manage to do this without disrupting accuracy on previous tasks. They describe this as ‘patching’ the model, when it seems more like ‘fine-tune then amplify’?
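Going by the paper's title, my reading is that the 'patch' is a linear mix between the original and the fine-tuned parameters. Below is a minimal sketch with plain dicts of lists standing in for model weights; `interpolate_weights` and `alpha` are names I made up for illustration, not the PAINT code's API.

```python
def interpolate_weights(original, fine_tuned, alpha):
    """Return (1 - alpha) * original + alpha * fine_tuned for every parameter.

    original, fine_tuned: dicts mapping parameter name -> list of floats.
    alpha: mixing coefficient in [0, 1]; 0 keeps the original model,
           1 keeps the fine-tuned one.
    """
    if original.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same parameter names")
    patched = {}
    for name, base in original.items():
        tuned = fine_tuned[name]
        patched[name] = [(1 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return patched

# Toy weights (hypothetical values, not from any real checkpoint).
zero_shot = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
print(interpolate_weights(zero_shot, tuned, alpha=0.5))
```

Presumably the mixing coefficient would be picked on held-out data from both the new task and the old ones, so that the gain on one is traded off against any loss on the other.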

GitHub - devglobalpartners/ramp-code: Open-source repository for the ramp (Replicable AI for…

Our team aspires to turn over control of the data value chain to humanitarians. The Replicable AI for Microplanning…

github.com

The Humanitarian OpenStreetMap Team (HOT) announced that they are almost ready to release an AI mapping tool (named fAIr). This works as a competitor to Meta’s RapiD. That was released in 2019 and is now previewing a v2.0 (the main changes are including layers from Esri and Microsoft, and rendering the map in a canvas element instead of SVG).

I’m a bit embarrassed that I hadn’t heard about HOT’s fAIr project until today. That project seems to have benefited from ramp, which the repo shows tracing buildings in several countries.

RealTime QA: What's the Answer Right Now?

We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on…

arxiv.org

Researchers maintain the changing answers to a set of questions (how many home runs has <player> hit) and evaluate models which are ‘closed book’ or use a standardized retrieval process. As of November 12th, the best results come from GPT-3 and a custom search engine.
Interestingly, they even release a new set of questions each week, based on newspaper quizzes: github.com/realtimeqa/realtimeqa_public/tree/main/latest

Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural…

arxiv.org

Last year, a paper on Contrastive Authoring and Reviewing Pairing (CARP) was shown to evaluate whether humans would like a story. This paper fine-tunes GPT-2 to generate new stories based on CARP’s preference score. This finally explains why the Stable Diffusion team’s text and code team is called Carper AI! The paper brings in prompt-tuning (CoOp) and reinforcement learning to improve the stories.

SEAL : Interactive Tool for Systematic Error Analysis and Labeling

With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards…

arxiv.org

I think what’s going on here is that they select the highest-loss examples of the text dataset, use k-means clustering, and then present these clusters together so ideally you can spot a trend. They study some examples where GPT-3 describes the clusters. It seems rather strange for them to use an ALBERT model for the processing and GPT-3 for explaining the errors.
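Here is a sketch of that pipeline as I understand it: keep the highest-loss examples, then cluster their embeddings with k-means. The losses and 2-D embeddings are hand-made toy values (in the real tool they would come from the model), and the tiny k-means with first-k seeding is a simplification of my own, not SEAL's implementation.

```python
def top_loss_examples(examples, losses, n):
    """Return the n examples with the highest loss, highest first."""
    ranked = sorted(zip(losses, range(len(examples))), reverse=True)
    return [examples[i] for _, i in ranked[:n]]

def kmeans(points, k, iterations=10):
    """Very small k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# (text, embedding) pairs with made-up embeddings and losses.
examples = [
    ("not bad at all", [0.0, 0.0]),
    ("hardly terrible", [0.1, 0.0]),
    ("sick beat, love it", [5.0, 5.0]),
    ("this slaps", [5.1, 5.0]),
    ("great movie", [9.0, 0.0]),
    ("awful service", [0.0, 9.0]),
]
losses = [2.0, 1.9, 1.8, 1.7, 0.1, 0.2]

hard = top_loss_examples(examples, losses, n=4)
labels = kmeans([emb for _, emb in hard], k=2)
for (text, _), label in zip(hard, labels):
    print(label, text)
```

In this toy data the hard examples split into a negation-like cluster and a slang-like cluster, which is the kind of trend you would hope to spot when the groups are shown together.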

Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey

Large language models (LLMs) have achieved state-of-the-art performance on a series of natural language understanding…

arxiv.org

NLP researchers want to avoid models discovering ‘shortcuts’ on tasks instead of a more diverse and robust set of signals. In this paper, they look at explainability and other metrics to detect a shortcut in the process. Considering that their example of robustness is susceptibility to unrelated information in a text / summary, the easiest fix is adversarial training. They mention several other methods, such as ‘worst-group loss minimization’. This is a survey paper so you would have to track down repos for each method.
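As a small illustration of the 'worst-group loss' idea: instead of the average loss over all examples, you look at the average loss inside each group and focus on the worst one. This sketch only computes that quantity; the training loop is left out, and the group names and loss values are hypothetical.

```python
from collections import defaultdict

def group_losses(losses, groups):
    """Average the per-example losses within each group."""
    buckets = defaultdict(list)
    for loss, group in zip(losses, groups):
        buckets[group].append(loss)
    return {g: sum(v) / len(v) for g, v in buckets.items()}

def worst_group_loss(losses, groups):
    """Return (group, mean loss) for the group with the highest mean loss."""
    means = group_losses(losses, groups)
    worst = max(means, key=means.get)
    return worst, means[worst]

# Hypothetical groups: examples where the shortcut gives the right answer,
# and examples where it points to the wrong one.
losses = [0.25, 0.75, 1.5, 2.5]
groups = ["shortcut_helps", "shortcut_helps", "shortcut_hurts", "shortcut_hurts"]

print(sum(losses) / len(losses))         # average loss hides the gap
print(worst_group_loss(losses, groups))  # worst group exposes it
```

A training method built on this would minimize the worst group's loss rather than the overall mean, so the model cannot look good by only solving the shortcut-friendly examples.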

Written by Nick Doiron
