ML Arxiv Haul #12
We’re seeing a bunch of NLP pre-prints ahead of conference dates or the end of the semester, so I already have another 20 papers for this ML Arxiv Haul. Maybe I should turn these into a monthly email newsletter?
A Contrastive Framework for Neural Text Generation
Text generation is of great importance to many natural language processing applications. However, maximization-based…
In the previous haul I was looking at ‘contrastive decoding’ using a smaller model, and here is a totally different process named ‘contrastive search’. ‘Contrastive’ is also in two other papers in this post. So hot right now. HuggingFace had a blog about this paper last week, so I added it to my future decoders project. The process seems to start during model training, where the token representations are pushed apart to occupy more of the vector space. The contrastive method here selects a token from the most probable candidates, but includes a penalty based on how similar its representation is to those of the previous tokens (i.e. the highest penalty if it is the same token repeated). I wonder how this would handle situations where you intentionally repeat a word or token (‘I did not know that that would happen’, ‘how does a fly fly’; incidentally I am listening to a Grammar Girl podcast episode about double words now).
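To make the selection rule concrete, here is a minimal sketch of one contrastive search step as I read the paper: each top-k candidate is scored by its probability minus a penalty equal to its highest cosine similarity to any token already in the context. The toy vectors, the function names, and the default alpha are assumptions for illustration; in the real method the vectors come from the language model’s hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_pick(candidates, context_vectors, alpha=0.6):
    """Pick one token from the top-k candidates.

    candidates: list of (token, probability, vector) tuples (hypothetical format).
    context_vectors: vectors of the tokens generated so far.
    alpha: weight of the similarity penalty (an assumed default).
    """
    best_token, best_score = None, -math.inf
    for token, prob, vec in candidates:
        # Penalty: similarity to the closest token already in the context.
        penalty = max((cosine(vec, c) for c in context_vectors), default=0.0)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy example: "fly" is already in the context, so repeating it is penalized.
context = [[1.0, 0.0]]
top_k = [("fly", 0.5, [1.0, 0.0]), ("bird", 0.4, [0.0, 1.0])]
choice = contrastive_pick(top_k, context)
```

With alpha set to 0 the rule falls back to picking the most probable token, which is one way to see that intentional repeats (‘a fly fly’) depend on how heavily the penalty is weighted.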
They use LCCC as a benchmark for Chinese text generation. I haven’t seen this before, and it’s encouraging to see decoder research outside of English. Here they found significant human preference over other decoding methods, but the English models needed both components of their algorithm (SimCTG) to improve.
One downside: although the decoding method is supposedly workable with other models, their examples (I think) all include a fine-tuning step first.
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
Recent work on explainable NLP has shown that few-shot prompting can enable large pretrained language models (LLMs) to…
The study looks at Winograd sentences (e.g. ‘Angela tried to calm Carrie’s nerves at the airport because _ was scared of flying in airplanes.’) and compares human-written explanations to ones written by GPT-3. I was expecting them to then do something with the explanations, but the focus is on whether the explanation itself is accurate. They used the ‘Data Maps’ tool, which came out in 2020, to pick out which examples are most difficult for a model, and the explanations from GPT-3 are still high quality.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural…
BigScience / HuggingFace has released a bunch of papers about work on their collaborative multilingual model BLOOM — collecting the ROOTS dataset, carbon emissions, and an Instruct version of the model (BLOOMZ).
My Hindi-BERT model makes a tiny appearance here (!) when its tokenizer is compared to BLOOM’s.
CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation
The full power of human language-based communication cannot be realized without negation. All human languages have some…
The researchers hire crowd-workers to collect sentences from Wikipedia which include more complex negation. The crowd-workers create variations on the sentence. Then the QA task is interpreting what did and didn’t happen based on a human understanding.
The best results come from UnifiedQA (I’m unfamiliar with this one; it’s based on T5) and (when few-shot is allowed) InstructGPT with chain-of-thought.
Data Feedback Loops: Model-driven Amplification of Dataset Biases
Datasets scraped from the internet have been critical to the successes of large-scale machine learning. Yet, this very…
This is an emerging fear in AI Ethics land which I’m still a bit skeptical of. Essentially, if AIs create a significant percentage of the text and images on the internet, this creates a cycle where their generated content influences future models and passes biases on to them. There is some interesting research on how that works and what might be done to make generation from sampling less likely to pass on biases.
Do Users Write More Insecure Code with AI Assistants?
We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety…
New research from Stanford on AI code assistants (the study uses an assistant built on OpenAI’s codex-davinci-002 model rather than GitHub Copilot itself). When students were given several encryption scripting tasks and the assistant, they were likely to create vulnerable code, and the more they accepted its suggestions and the more highly they rated its output, the more likely they were to introduce these vulnerabilities. It wouldn’t be enough to use the static analysis tools listed in the paper, because the security bugs are often “conceptual”.
Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale
Machine learning models are now able to convert user-written text descriptions into naturalistic images. These models…
This is the most comprehensive external audit that I’ve seen of DALL-E image generation. The authors find several stereotypes in careers, emotions, roles, families, etc. It’s difficult to push the model to generate some scenarios.
Evaluating the Adversarial Robustness of Adaptive Test-time Defenses
Adaptive defenses, which optimize at test time, promise to improve adversarial robustness. We categorize such adaptive…
The authors endorse a model robustness tool (AutoAttack), but they focus on robustness measures which occur after the model is trained (in other words: at test time). I’m not familiar with these techniques, so this is a good overview. I believe what they’re saying is that the tools modify the model weights or the input data before the test or between batches (?). Anyway, their conclusion is that this category of test-time tools is disappointing and computationally expensive.
Harmonizing the object recognition strategies of deep neural networks with humans
The many successes of deep neural networks (DNNs) over the past decade have largely been driven by computational scale…
The Twitter sales pitch for this paper was that as a model becomes more accurate, its technique becomes less aligned with how humans view images.
As I scanned this, I wondered where the human attention maps come from. They come from the ClickMe dataset, where “Participants select important image parts with their mouse by ‘painting’ translucent bubbles on screen.” As you might suspect, reporting based on clicking and dragging is not the same as eye-tracking or showing portions of the image to the user. The two examples which I’d highlight are the snake (where people have highlighted the closer, front half of the snake) and the ball (where models highlight the player). The humans seem to be highlighting the minimum area needed to recognize the object.
Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and…
We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text…
There is significant interest in NLP in ‘aligning’ language models to human intent, as with InstructGPT, code models (which are easier to test by running their code), etc. The major research labs have been doing this with reinforcement learning, and Allen AI is letting us in on the process here. The repo for this paper is “RL4LMs”, which is more optimistic than the paper’s title implies. Finally, they introduce a new RL algorithm and a benchmark.
Unfortunately I failed to run these on a Colab GPU :( maybe next time.
It's Hard for Neural Networks To Learn the Game of Life
Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods…
It’d seem that neural networks could think through the basic rules of Conway’s Game of Life and predict the image a few rounds into the future. The researchers seemed interested in discussing the ‘lottery ticket’ phenomenon but instead were disappointed by neural networks’ general performance.
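For reference, the function the networks are asked to learn is tiny. Below is a minimal sketch of one Game of Life step using the standard rules (a dead cell with exactly three live neighbors becomes alive; a live cell with two or three live neighbors survives). Treating everything outside the board as dead is my assumption, and the paper’s exact setup may differ.

```python
def life_step(grid):
    """Advance a Game of Life board (list of lists of 0/1) by one round."""
    rows, cols = len(grid), len(grid[0])
    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            # Count the live cells among the up-to-8 neighbors on the board.
            neighbors = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbors += grid[rr][cc]
            alive = neighbors == 3 or (grid[r][c] == 1 and neighbors == 2)
            new_row.append(1 if alive else 0)
        new_grid.append(new_row)
    return new_grid


# A "blinker" flips between a horizontal and a vertical bar every round.
blinker = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
after_one = life_step(blinker)
after_two = life_step(after_one)
```

Predicting the board a few rounds ahead just means composing this step with itself, which is what makes the networks’ difficulty surprising.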
Large Language Models Struggle to Learn Long-Tail Knowledge
The internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code …
There are scripts for finding what facts neural models know, editing facts in a model, etc., but this paper brings it back to the training data. The researchers find that a model’s ability to recall a fact is related to the number of times it sees that fact in pre-training. The study uses TriviaQA to benchmark ability. I wish they had looked at categories of facts which are difficult for models; for example, they may choose a well-known city (Rio) for ‘what is the capital of Brazil?’.
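The kind of analysis described above can be sketched as bucketing questions by how often their fact appears in the training data, then computing accuracy per bucket. This is only an illustration: the record format, the power-of-ten buckets, and the numbers are made up, and the paper’s own counting procedure is more involved.

```python
from collections import defaultdict


def accuracy_by_count_bucket(records):
    """Group QA results by order of magnitude of the fact's document count.

    records: iterable of (relevant_doc_count, answered_correctly) pairs
             (a hypothetical format).
    Returns {bucket_lower_bound: accuracy}, with bucket 0 for unseen facts.
    """
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for count, correct in records:
        # 1-9 -> 1, 10-99 -> 10, 100-999 -> 100, and so on.
        bucket = 0 if count < 1 else 10 ** (len(str(count)) - 1)
        tallies[bucket][0] += 1 if correct else 0
        tallies[bucket][1] += 1
    return {b: tallies[b][0] / tallies[b][1] for b in sorted(tallies)}


# Made-up results: rarer facts are answered correctly less often.
toy_records = [
    (0, False),
    (3, False), (7, True),
    (50, True), (80, False),
    (1200, True), (5000, True),
]
buckets = accuracy_by_count_bucket(toy_records)
```

A breakdown by fact category (capitals, birthdays, and so on) could reuse the same function with a category label in place of the count bucket.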
Mutation Models: Learning to Generate Levels by Imitating Evolution
Search-based procedural content generation (PCG) is a well-known method for level generation in games. Its key…
I am interested in game-generation algorithms in case they could be applied to language learning. Here they have a game level generator which works through evolution, and then attempt to train a neural network to mimic this evolutionary behavior.
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and…
Expanding on previous collaborations on Bahasa Indonesia language benchmarks, this paper announces sentiment analysis data for 10 other Indonesian languages. I’m intrigued by the community organizing and collaboration visible for Indonesia, plus this win for preserving minority languages.
From Twitter convos, I learned that there are two models which came out around the same time both called ‘IndoBERT’.
I am hopeful this could carry over to some regional cooperation on AI (i.e. with Bahasa Melayu).
Operationalizing Machine Learning: An Interview Study
Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in…
The authors interview machine learning engineers and managers about their jobs. There’s an interest in speeding up validation/iteration and in tracking versions of models. The paper includes a list of anti-patterns. I thought that there would be more discussion of feature-engineering-type work, but there’s just one point made about changing tools to SparkSQL.
Patching open-vocabulary models by interpolating weights
Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still…
The researchers compare CLIP and a CLIP fine-tuned on a new image task or synthetic data, and use their PAINT code to interpolate between the two sets of weights. They manage to improve on the new task without disrupting accuracy on previous tasks. They describe this as ‘patching’ the model, when it seems more like ‘fine-tune then amplify’?
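Going by the title, the core operation is a linear interpolation between the original and the fine-tuned weights. Here is a minimal sketch under that assumption; the function name, the dict-of-lists weight format, and the alpha values are mine, not taken from the PAINT repo.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Mix two sets of weights: (1 - alpha) * zero_shot + alpha * fine_tuned.

    Both arguments map parameter names to flat lists of floats
    (a simplified stand-in for real model tensors).
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [
            (1 - alpha) * z + alpha * f
            for z, f in zip(zero_shot[name], fine_tuned[name])
        ]
        for name in zero_shot
    }


# Toy weights: alpha = 0 keeps the original model, alpha = 1 the fine-tuned one.
original = {"layer.weight": [0.0, 2.0]}
tuned = {"layer.weight": [1.0, 4.0]}
patched = interpolate_weights(original, tuned, alpha=0.5)
```

Choosing alpha is the interesting part: too low and the new task does not improve, too high and accuracy on the old tasks can drop.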
GitHub - devglobalpartners/ramp-code: Open-source repository for the ramp (Replicable AI for…
Our team aspires to turn over control of the data value chain to humanitarians. The Replicable AI for Microplanning…
The Humanitarian OpenStreetMap Team (HOT) announced that they are almost ready to release an AI mapping tool (named fAIr). This works as a competitor to Meta’s RapiD, which was released in 2019 and is now previewing a v2.0 (the main changes are including layers from Esri and Microsoft, and rendering the map in a canvas element instead of SVG).
I’m a bit embarrassed that I hadn’t heard about HOT’s fAIr project until today. That project seems to have benefited from ramp, which the repo shows tracing buildings in several countries.
RealTime QA: What's the Answer Right Now?
We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on…
Researchers maintain the changing answers to a set of questions (how many home runs has <player> hit) and evaluate models which are ‘closed book’ or use a standardized retrieval process. As of November 12th, the best results come from GPT-3 and a custom search engine.
Interestingly, they even release a new set of questions each week, based on newspaper quizzes: github.com/realtimeqa/realtimeqa_public/tree/main/latest
Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning
Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural…
Last year, a paper showed that Contrastive Authoring and Reviewing Pairing (CARP) could evaluate whether humans would like a story. This paper fine-tunes GPT-2 to generate new stories based on CARP’s preference score. This finally explains why the Stable Diffusion team’s text and code team is called Carper AI! The paper brings in prompt-tuning (CoOp) and reinforcement learning to improve the stories.
SEAL : Interactive Tool for Systematic Error Analysis and Labeling
With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards…
I think what’s going on here is that they select the highest-loss examples of the text dataset, use k-means clustering, and then present these clusters together so that ideally you can spot a trend. They study some examples where GPT-3 describes the clusters. It’s rather strange for them to use an ALBERT model for the processing step and GPT-3 for explaining the errors.
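My reading of the pipeline (keep the highest-loss examples, then cluster them) can be sketched in a few lines. Everything here is an assumption for illustration: the example format, the made-up losses and 2-D embeddings, and the bare-bones k-means that starts from the first k points. The real tool would take losses and embeddings from the model.

```python
def top_loss_examples(examples, fraction=0.5):
    """Keep the given fraction of examples with the highest loss."""
    ranked = sorted(examples, key=lambda ex: ex["loss"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Made-up examples: two groups of hard examples plus some easy ones.
toy_examples = [
    {"text": "easy 1", "loss": 0.1, "embedding": [2.0, 2.0]},
    {"text": "negation A", "loss": 2.0, "embedding": [0.0, 0.1]},
    {"text": "sarcasm A", "loss": 1.9, "embedding": [5.0, 5.0]},
    {"text": "negation B", "loss": 1.8, "embedding": [0.1, 0.0]},
    {"text": "sarcasm B", "loss": 1.7, "embedding": [5.1, 4.9]},
    {"text": "easy 2", "loss": 0.2, "embedding": [2.1, 2.0]},
    {"text": "easy 3", "loss": 0.3, "embedding": [1.9, 2.2]},
    {"text": "easy 4", "loss": 0.1, "embedding": [2.0, 1.8]},
]
hard = top_loss_examples(toy_examples, fraction=0.5)
clusters = kmeans([ex["embedding"] for ex in hard], k=2)
```

Each cluster of hard examples would then be shown to a person (or described by a language model) to see whether a common error pattern stands out.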
Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey
Large language models (LLMs) have achieved state-of-the-art performance on a series of natural language understanding…
NLP researchers want to avoid models discovering ‘shortcuts’ on tasks instead of a more diverse and robust set of signals. In this paper, they look at explainability and other metrics to detect a shortcut in the process. Considering that their example of robustness is susceptibility to unrelated information in a text / summary, the easiest fix is adversarial training. They mention several other methods, such as ‘worst-group loss minimization’. This is a survey paper, so you would have to track down repos for each method.
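As a rough idea of what ‘worst-group loss minimization’ optimizes: instead of the average loss over all examples, training targets the average loss of whichever group is doing worst. The sketch below only computes that quantity; the group labels and losses are hypothetical, and the survey’s cited methods add more machinery on top.

```python
from collections import defaultdict


def worst_group_loss(losses, groups):
    """Return (group, mean_loss) for the group with the highest mean loss."""
    per_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        per_group[group].append(loss)
    means = {g: sum(vals) / len(vals) for g, vals in per_group.items()}
    worst = max(means, key=means.get)
    return worst, means[worst]


# Hypothetical groups: examples where a shortcut feature helps vs. misleads.
toy_losses = [0.2, 0.4, 1.0, 2.0]
toy_groups = ["shortcut_helps", "shortcut_helps", "shortcut_misleads", "shortcut_misleads"]
group, loss = worst_group_loss(toy_losses, toy_groups)
average = sum(toy_losses) / len(toy_losses)
```

Here the overall average hides the problem, while the worst-group value points straight at the examples where the shortcut fails.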