ML Arxiv Haul #7

9 min read · Jul 24, 2022

Just last month I made my last Arxiv haul post, with most of the papers coming from Twitter, Reddit, and wherever else. More papers have stacked up, and I’ve decided to summarize ~half of them now rather than keep falling behind.

Adversarially trained neural representations may already be as robust as corresponding biological…

Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking…

arxiv.org

The paper explores adversarial examples in brain scans — very unlike anything else which I’ve seen in the ML world. When you look at the images you can still tell what they are, but the authors have evidence that your neurons get a bit of a workout in figuring it out. I could have sworn it was a bit (annoying? uncomfortable?) to scan through the example images at first.

auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and…

Applications of machine learning in healthcare often require working with time-to-event prediction tasks including…

arxiv.org

Specialized library for health ML, where time-to-event data is often censored: for some patients the event (such as death) is never observed during the study.
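To make the censoring idea concrete, here is a minimal sketch of the classic Kaplan-Meier survival estimate in plain Python. This is not the auton-survival API; the function name and the toy cohort are made up for illustration.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time for each patient
    observed:  True if the event (e.g. death) was seen, False if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Only observed events lower the curve.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Censored patients still shrink the risk set for later times.
        at_risk -= exits[t]
    return curve

# Hypothetical toy cohort: months of follow-up, and whether death was observed.
durations = [2, 3, 3, 5, 8]
observed = [True, False, True, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"month {t}: estimated survival {s:.2f}")
```

Censored patients never count as deaths, but they do leave the at-risk group, which is why throwing them away or treating them as survivors forever would both bias the estimate.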

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News…

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow…

aclanthology.org

A collaborative paper from Masakhane (African NLP community) about translation between 16 languages, including language pairs without parallel corpora. The team discusses challenges for low-resource languages, and creates T5 and ByT5 models.

Backward baselines: Is your model predicting the past?

When does a machine learning model predict the future of individuals and when does it recite patterns that predate the…

arxiv.org

This goes a bit too deep into ‘causal ML’ diagrams, but it is a possible framework for seeing whether models are ‘predicting’ an outcome or just finding a person with similar demographic info in the training dataset.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale…

arxiv.org

This is my first co-authorship credit! In spring 2021, Google opened a repo to accept voluntarily contributed benchmarks: an opportunity to get onto a major paper and have your task run against new non-public LLMs. It’s been pretty cool seeing my tasks (disambiguation_qs, which_wiki_edit) pop up or get directly discussed in Google and DeepMind papers. It looks like there’s a variety of approaches to taking subsets of tasks, but it isn’t yet visibly being picked up by OpenAI, Microsoft, AllenAI, etc. I did a presentation at work which I should turn into a blog post here soon.

Side note about citations — when this launched I joked about citing it as Doiron et al., and everyone got named in a citation in Long Range Language Modeling via Gated State Spaces (weirdly Nick → Nicholas there). Hundreds of authors is an oddity in CS, but common in particle physics and genetics, so you can just include the title and a few of the first listed authors if you like.

Beyond neural scaling laws: beating power law scaling via data pruning

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both…

arxiv.org

Experiments in removing (pruning) training data from Imagenet without impacting accuracy.

Controlling Translation Formality Using Pre-trained Multilingual Language Models

This paper describes the University of Maryland’s submission to the Special Task on Formality Control for Spoken…

arxiv.org

I’ve been looking into this space because the European NLP people said a major concern is summarizing and/or simplifying materials for language learners. They fine-tune mT5 and mBART to do a sort of translation plus formality change, with some unclear work to include zero-shot experiments. mT5 includes many languages so this could be useful to explore.

GitHub — huggingface/diffusers

🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a…

github.com

When I was at the Probabilistic AI course in Finland, everyone was excited about generating content with diffusion models. HuggingFace, which already has the popular transformers library, is looking for another hit with a diffusers repo.

FactPEGASUS: Factuality-Aware Pre-training and Fine-tuning for Abstractive Summarization

We present FactPEGASUS, an abstractive summarization model that addresses the problem of factuality during pre-training…

arxiv.org

Work on text summarization which preserves factual accuracy.

First the Worst: Finding Better Gender Translations During Beam Search

Danielle Saunders, Rosie Sallis, Bill Byrne. Findings of the Association for Computational Linguistics: ACL 2022. 2022.

aclanthology.org

When we run an NLG model, there’s a choice about how each next token is picked: taking the highest-probability token, sampling from the probabilities, or using one of a variety of somewhat-more-complex decoding methods such as beam search. On WinoMT, which measures gender bias in translations, this paper’s beam search method improves performance.
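A minimal sketch of those three decoding choices, using a tiny hand-written next-token table in place of a real model. The table, its probabilities, and the function names are all hypothetical, and this is plain beam search, not the paper's gender-aware variant.

```python
import math
import random

# Hypothetical toy "model": next-token probabilities given the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"doctor": 0.55, "nurse": 0.45},
    "a": {"doctor": 0.9, "nurse": 0.1},
    "doctor": {"</s>": 1.0},
    "nurse": {"</s>": 1.0},
}

def greedy(start="<s>"):
    """Always take the single most likely next token."""
    seq = [start]
    while seq[-1] != "</s>":
        probs = NEXT[seq[-1]]
        seq.append(max(probs, key=probs.get))
    return seq

def sample(rng, start="<s>"):
    """Draw each next token in proportion to its probability."""
    seq = [start]
    while seq[-1] != "</s>":
        probs = NEXT[seq[-1]]
        seq.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return seq

def beam_search(beam_width=2, start="<s>"):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(0.0, [start])]  # (log probability, tokens)
    finished = []
    while beams:
        candidates = []
        for logp, seq in beams:
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
    return sorted(finished, key=lambda c: c[0], reverse=True)

print("greedy:", greedy())
print("sample:", sample(random.Random(0)))
for logp, seq in beam_search():
    print(f"beam:   {seq} p={math.exp(logp):.2f}")
```

In this toy table greedy decoding commits to "the" first and ends on a sequence with probability 0.33, while beam search keeps "a" alive long enough to find a sequence with probability 0.36. Beam search also returns several ranked candidates, which is what makes it possible to pick among them by some other criterion.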

Forecasting Future World Events with Neural Networks

Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict…

arxiv.org

A dataset to see if language models can predict actual events. I would say that you need to take care that your pretrained model is itself older than the events (the researchers use GPT-2 and T5). For prompting, the model gets one article with information available at prediction time. Overall this is a little interesting, but it is held back by the small size of the models for 2022, and there is not much focus on the difficulty of a prediction (such as collecting numbers from a prediction market).

Googling for Abortion: Search Engine Mediation of Abortion Accessibility in the United States

Among the myriad barriers to abortion access, crisis pregnancy centers (CPCs) pose an additional difficulty by…

arxiv.org

This is an early 2022 pre-print, and cites other recent work. I remember seeing Google and Siri being ‘uncomfortable’ about directing people to abortion clinics, or getting tricked by ‘crisis centers’. This paper tracks searches in multiple locations over time, and finds that clinics are usually returned, but that ‘crisis centers’ get better positioning in poorer and more rural areas.

How Good is the Bayes Posterior in Deep Neural Networks Really?

During the past five years the Bayesian deep learning community has developed increasingly accurate and efficient…

arxiv.org

After going through a Probabilistic AI mini-course which had a strong interest in Bayesian methods and Bayesian neural networks, it’s fascinating to see this 2020 Google paper about what aspects of this are accepted or rejected in industry. Discusses a ‘cold posterior’ which goes against Bayesian dogma, and would continue to be discussed in 2021, but is not super popular in research that I could find.

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation…

arxiv.org

With a subset of ImageNet, the researchers try to predict the amount of data needed to reach a target accuracy. Their final method makes several tests and compares multiple prediction curves as it goes through training, so there’s no one ‘method’ to estimating this.
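As a rough illustration of what one of those prediction curves could look like, here is a sketch that fits a single power law, error = a * n^(-b), to a few pilot runs and inverts it for a target error. The power-law form, the function names, and the pilot numbers are assumptions for illustration, not the paper's final method.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = math.exp(my - slope * mx)
    return a, -slope

def samples_needed(a, b, target_error):
    """Invert error = a * n**(-b) for n."""
    return (a / target_error) ** (1 / b)

# Hypothetical pilot runs: training-set sizes and validation error,
# constructed so that error falls by 20% per doubling of data.
sizes = [1000, 2000, 4000, 8000]
errors = [0.40, 0.32, 0.256, 0.2048]

a, b = fit_power_law(sizes, errors)
n = samples_needed(a, b, target_error=0.10)
print(f"fitted exponent b = {b:.3f}")
print(f"estimated samples for 10% error: {n:,.0f}")
```

Real learning curves rarely follow one clean power law all the way out, so an extrapolation like this can be badly off; that is a good reason to compare several curve shapes and re-check as more data arrives.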

Language Model Cascades

Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a…

arxiv.org

After a paper went viral for fixing GPT math errors by prompting “Let’s think step by step”, this paper is pitched as some type of study or taxonomy of this approach, putting forth the name “language model cascades”. The text gets a little muddled and I wonder if this paper initially sought to show something else? In any case it looks like a good method.

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own claims and predict which questions they will be…

arxiv.org

Asking language models to put a number on their own accuracy. Straightforward and useful.
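One common way to check whether those self-reported numbers mean anything is calibration: among answers given with 90% confidence, are about 90% correct? Below is a minimal sketch of expected calibration error (ECE) in plain Python. The function name, binning choice, and toy data are assumptions for illustration, and the paper's own evaluation may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and actual accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical self-reported confidences and whether each answer was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.2f}")
```

Here the model says 0.9 on four answers but only gets three right, so it is a little overconfident in that bin, while its 0.5 answers are right exactly half the time.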

Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities

With the advent of large language models, methods for abstractive summarization have made great strides, creating…

arxiv.org

Interesting dataset which the authors have framed as a summarization problem that matters. Expert summaries of very long legal documents. Includes different granularities of summarization.
For 256 GB of just legal documents, see Pile of Law.

P-Adapters: Robustly Extracting Factual Information from Language…

Recent work (e.g. LAMA (Petroni et al., 2019)) has found that the quality of the factual information extracted from…

openreview.net

Facebook/Meta’s work on updating factual associations inside of language model knowledge led to benchmarks LAMA and mLAMA. Salesforce proposes a trainable layer between the token/embedding layer and the rest of the network, which I guess is shifting specific tokens to update those answers? It’s tricky to say without tinkering whether this is just swapping tokens around for specific questions (i.e. “current US president” returning an updated name, but “characteristics and policy of current US president” concept not changing).

Plex: Towards Reliability using Pretrained Large Model Extensions

A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have…

arxiv.org

Google has an AI blog post announcing these ‘Plex’ models but sort of hand-waving about what makes them ‘reliable’. The main improvements are work on robustness and in reporting uncertainty. Though there are comparisons to Bayesian neural networks and probabilistic ML, the different approaches are not discussed in detail.
This work references the uncertainty-baselines project which has been developed for the past 2 years.

Quark: Controllable Text Generation with Reinforced Unlearning

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may…

arxiv.org

Using reinforcement learning to tune model responses (e.g. making them less toxic). The ‘unlearning’ here is about unwanted behaviors, not about forgetting information and facts. Still very cool.

SALSA: Attacking Lattice Cryptography with Transformers

Currently deployed public-key cryptosystems will be vulnerable to attacks by full-scale quantum computers…

arxiv.org

A number of major internet companies have been praising NIST’s pick of a few lattice-based cryptography algorithms for classical computers to stay secure in the post-quantum era. Here a Facebook/Meta team applies ML to the task of decrypting information (itself a pretty big task) by pointing it at this less-tested algorithm, starting with a small key size and scaling up. To clarify, they are exploring whether ML on a classical computer is likely to learn to decrypt the system.

“The invisible gorilla strikes again: Sustained inattentional blindness in expert observers”

We like to think that we would notice the occurrence of an unexpected yet salient event in our world. However, we know…

www.ncbi.nlm.nih.gov

Researchers paste a cartoon gorilla into lung cancer images, referencing previous experiments in change blindness and limited observation skills. Most experts did not report the gorilla, but eye-tracking shows that they did fixate on it as an anomaly:

of the 20 radiologists who did not report the gorilla, 12 looked directly at the gorilla’s location when it was visible. The mean dwell time on the gorilla amongst this group was 547ms

Towards Robust Spanish Author Profiling and Lessons Learned from Adversarial Attacks

This paper popped up in my e-mail because it uses my seq2seq model for Spanish gender-reinflection.
Author profiling accuracy drops when you break tokenization with invisible characters (tbh not a great attack if someone pre-processes their text). When you use my seq2seq model (labeled Counterfactual here) the author profiling-by-gender accuracy drops from 0.738 to 0.515 (almost doubling the error).
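A toy version of the invisible-character trick, and of the pre-processing that undoes it. This is a sketch under assumptions, not the paper's attack: the whitespace "tokenizer", the three-word vocabulary, and the helper names are all made up for illustration.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"

def perturb(text):
    """Insert a zero-width space inside every word longer than two characters."""
    return " ".join(
        w[:2] + ZERO_WIDTH_SPACE + w[2:] if len(w) > 2 else w
        for w in text.split(" ")
    )

def strip_invisible(text):
    """Drop Unicode 'format' characters (category Cf), which render as nothing."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Hypothetical vocabulary for a whitespace-tokenized classifier.
vocab = {"estoy", "muy", "cansada"}
original = "estoy muy cansada"
attacked = perturb(original)  # looks identical when rendered

known_before = sum(w in vocab for w in original.split())
known_after = sum(w in vocab for w in attacked.split())
known_cleaned = sum(w in vocab for w in strip_invisible(attacked).split())
print(known_before, known_after, known_cleaned)
```

The attacked string looks the same on screen, but none of its words match the vocabulary until the invisible characters are stripped. Dropping all of category Cf is blunt (it also removes the zero-width joiner used in some emoji sequences), so a real pipeline might prefer a narrower list.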

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those…

arxiv.org

I took a look into other toxic language models after the ‘GPT-4chan’ debacle… were other models in the HuggingFace / NLP ecosystem also full of toxic text? ToxiGen is a well-designed dataset which instead has a mix of in-the-wild, adversarially-generated, and human-in-the-loop processes to build up a large dataset of toxic texts.

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model…

arxiv.org

I’m not super interested in the paper or problem, but this is the first time that I’m seeing the ‘Right-for-the-Right-Reason’ (RRR) term in explainable AI, which is descriptive and necessary.
It goes back to an IJCAI-17 paper with different authors at another university, but the term does not get used often.

Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning

The success of machine learning is fueled by the increasing availability of computing power and large training…

arxiv.org

In the AI + cybersecurity world, there are a variety of theories about how attackers will approach ML systems. This paper does a survey of methods to either fill the training data space with misleading examples, or craft examples which in training effectively build a backdoor/shortcut to override intentional features (i.e. I associate a kid-safe social media account with unique patterns or phrases which then will pass through content filters). They also discuss defenses by ‘sanitizing’ data or analyzing the model.

Long reads / Overview docs

An Introduction to Lifelong Supervised Learning

This primer is an attempt to provide a detailed summary of the different facets of lifelong learning.

arxiv.org

(lifelong learning = continually-trained models)

A Path Towards Autonomous Machine Intelligence

How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could…

openreview.net

AI legend Yann LeCun describes his vision for the future of ML, and posts it on OpenReview for public comment. Controversial response by Schmidhuber (who frequently asks LeCun and others to cite his early work in ML as the original). I swear there was a controversy about ethics not being emphasized enough here, and two AI ethics critics being the first commenters, but after digging I must have mixed it up with something else?

Written by Nick Doiron