ML Arxiv Haul #9

Nick Doiron

Sep 20, 2022

New Yorker captions, bias bounties, and more:

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first- and…

arxiv.org

Researchers describe an optimizer that they report works better than the popular Adam and AdamW methods.

An Algorithmic Framework for Bias Bounties

We propose and analyze an algorithmic framework for “bias bounties”: events in which external participants are invited…

arxiv.org

After Twitter was criticized for racially biased image-cropping, the company investigated, changed the algorithm, and held a groundbreaking ‘bias bounty’ competition at DEF CON in 2021. Some of these developers have set up a continuing organization at https://biasbounty.ai. This paper is a bit more abstract: it proposes an automated way to run such a contest, based on the theoretical change in bias and Bayesian analysis.

ASR2K: Speech Recognition for Around 2000 Languages without Audio

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource…

arxiv.org

Carnegie Mellon researchers use text from 1,909 languages in a Crúbadán dataset, and 700 languages in the CMU Wilderness dataset, to build speech recognition models for low-resource languages. They point out a concern that the audio portion of the model is pre-trained on Indo-European languages and phonemes, so it does not transfer as well to more distant languages or to longer words.

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large…

arxiv.org

I saw this described on Twitter as being able to train two separate models and merge their weights, so I was curious. The researchers describe a mixture-of-experts system where there is weighting to a tree of language models (how many parameters are in each model?). They can then keep this tree structure or merge these parameters into one model. Either way, they outperform a 1.3B parameter language model with only 40% of the compute.
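The "merge these parameters into one model" step can be pictured as a weighted average of expert weights. This is a minimal sketch, assuming every expert shares the same architecture and that parameters are plain lists of floats; the `merge_experts` helper and the toy weights are hypothetical and not the paper's code.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

def merge_experts(experts: List[Params], weights: List[float]) -> Params:
    """Weighted average of same-shaped expert parameters (illustrative only)."""
    if not experts or len(experts) != len(weights):
        raise ValueError("need at least one expert and one weight per expert")
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the weights sum to 1
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(size)
        ]
    return merged

# Two toy "experts" with identical parameter names and shapes (invented values).
expert_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
expert_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

# Weight expert_b three times as heavily as expert_a.
merged = merge_experts([expert_a, expert_b], weights=[1.0, 3.0])
print(merged)
```

Keeping the experts separate and weighting their outputs instead would be the tree/ensemble option; this sketch only covers the single-model option.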

Deep Unlearning via Randomized Conditionally Independent Hessians

Recent legislation has led to interest in machine unlearning, i.e., removing specific training samples from a…

arxiv.org

I had an ‘unlearning’ paper in the last Arxiv haul. This method is tested on NLP tasks, pre-trained models, and a person-identification model.

Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption…

We challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption…

arxiv.org

A new fav. Multi-modal models are given a cartoon and caption, while text-only models are given the caption along with a written description of the cartoon or the output of an image-captioning service. Models are evaluated on whether they can match the correct caption to a cartoon, pick the editors’ and readers’ favorite captions from the weekly caption contest, and generate a satisfying explanation for the cartoon’s humor.
The authors (from a variety of institutions) find that humans are better at every task, save for GPT-3 picking New Yorker editors’ picks. It seems difficult to imagine a model winning the caption contest with its own jokes anytime soon. But there’s room for improvement in how the models note the visual gags in the images, and in the fluency of the explanations, even with a basic model.
More details will be made available at capcon.dev. Also, major props to a section of the paper which includes links to several humor + ML projects.

DORA: Exploring outlier representations in Deep Neural Networks

Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers…

arxiv.org

A technique to prevent unusual or possibly damaging data from entering neural network training. The researchers claim that DORA should be dataset-agnostic. They do warn that it could be overrun by a large-scale, systemic deception.

Ethereum Emissions: A Bottom-up Estimate

The Ethereum ecosystem is maintained by a distributed global network of computers that currently require massive…

arxiv.org

Ethereum recently switched from Proof of Work to Proof of Stake, reducing its energy footprint by ~99.5% (it remains to be seen whether miners switch to other cryptocurrencies or what). In this paper from late 2021, one researcher makes a detailed estimate of emissions of the Ethereum network based on hashrate and other publicly available data.

Explaining Bayesian Neural Networks

To make advanced learning machines such as Deep Neural Networks (DNNs) more transparent in decision making, explainable…

arxiv.org

A method to highlight salient/significant regions of an image fed into a Bayesian Neural Network.

Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive…

The problems of unfaithful summaries have been widely discussed under the context of abstractive summarization. Though…

arxiv.org

You can’t always summarize an article by taking pieces and gluing them back together. This group from UNC Chapel Hill builds a dataset of extractive summaries which fall into a few pitfalls.

Faithful Reasoning Using Large Language Models

Although contemporary large language models (LMs) demonstrate impressive question-answering capabilities, their answers…

arxiv.org

DeepMind project to search a model’s knowledge and present reasoning for why it chooses an answer. This is preferable to GPT-3 style generation of explanations, which are difficult to trace back to source material, or to prove were the model’s actual reason for doing something.
I was curious whether this could still be tricked into generating nonsensical explanations for unexplainable content, and I appreciated their approach:

If the trace does not terminate within a specified number of steps then the answer is considered to be ‘Unknown’, allowing us to filter model answers and increase answer precision
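The quoted rule amounts to a step budget on the reasoning loop. Here is a rough sketch, assuming a hypothetical `step` callable that returns either a final answer or `None` to keep going; the function names, the default budget, and the toy steppers are all invented for illustration and are not DeepMind's implementation.

```python
from typing import Callable, List, Optional

UNKNOWN = "Unknown"

def answer_with_budget(
    question: str,
    step: Callable[[str, List[str]], Optional[str]],
    max_steps: int = 5,
) -> str:
    """Run reasoning steps until one yields an answer or the budget runs out."""
    trace: List[str] = []
    for i in range(max_steps):
        result = step(question, trace)
        if result is not None:
            return result  # the trace terminated with an answer
        trace.append(f"step {i + 1}")  # record that another step was taken
    return UNKNOWN  # budget exhausted without terminating

def halting_step(question: str, trace: List[str]) -> Optional[str]:
    # Toy stepper: produces an answer once two steps are on the trace.
    return "Paris" if len(trace) >= 2 else None

def endless_step(question: str, trace: List[str]) -> Optional[str]:
    # Toy stepper: never terminates on its own.
    return None

answers = {
    "halts": answer_with_budget("capital of France?", halting_step),
    "loops": answer_with_budget("unanswerable?", endless_step),
}

# Filtering out 'Unknown' keeps only answers backed by a finished trace.
kept = {name: ans for name, ans in answers.items() if ans != UNKNOWN}
print(kept)
```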

If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare

If life on Earth had to achieve n ‘hard steps’ to reach humanity’s level, then the chance of this event rose as time to…

arxiv.org

Another non-ML Arxiv paper which has been on my list for a while, I think since it came up on the Lex Fridman podcast.
This is sort of a power-of-big-numbers argument about whether we should expect other civilizations in our galaxy.

We estimate that loud alien civilizations now control 40-50% of universe volume

Wild.

If loud aliens arise from quiet ones, a depressingly low transition chance (< ∼10^−4) is required to expect that even one other quiet alien civilization has ever been active in our galaxy.

Put another way: even if only 1 in 10,000 basic quiet civilizations graduates to being a loud/grabby (expanding, sustaining, easily detectable, spacefaring) civilization, it’s odd that we’re not in their ~50% of the universe. The authors’ explanation would be that it’s rare for a galaxy to have more than one civilization, and that our nearest loud aliens started thousands of galaxies away.

I think I’d understand if aliens simply found it unrewarding to travel between galaxies? It’s 2.5 million light-years to Andromeda, and our galaxy is only 100,000 light-years across, so that’s a long way to haul your plants/animals/habitat without a pit stop.

Online Learning for Recommendations at Grubhub

We propose a method to easily modify existing offline Recommender Systems to run online using Transfer Learning. Online…

arxiv.org

Note that ‘online’ here means a continuously learning model. The author says, “Recommendations drive 80% of revenue at Grubhub,” so an efficient and smart system is a major goal for them. The team’s original approach to freshness was retraining the model on the past n days of data, every day. Continuous learning allows them to train on just the one new day of data every day.
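The difference in daily training cost can be sketched with a toy model. This assumes a hypothetical one-parameter model trained by stochastic gradient descent; `sgd_pass`, `make_day`, the 7-day window, and the data are all made up for illustration and say nothing about Grubhub's actual system.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (feature, label)

def sgd_pass(w: float, data: List[Example], lr: float = 0.01) -> float:
    """One pass of SGD for the model y ~ w * x under squared error."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w

def make_day(size: int = 50) -> List[Example]:
    # Deterministic toy data with a true slope of 3.
    xs = [1.0 + (i % 5) * 0.1 for i in range(size)]
    return [(x, 3.0 * x) for x in xs]

days = [make_day() for _ in range(10)]
window = 7  # the "past n days" used by the retraining approach

w_online = 0.0
per_day_online: List[int] = []
per_day_windowed: List[int] = []
for today, new_day in enumerate(days, start=1):
    # Online: warm-start from yesterday's weights, train on the one new day.
    w_online = sgd_pass(w_online, new_day)
    per_day_online.append(len(new_day))

    # Windowed: retrain from scratch on up to `window` most recent days.
    recent = [ex for day in days[max(0, today - window):today] for ex in day]
    w_windowed = sgd_pass(0.0, recent)
    per_day_windowed.append(len(recent))

print(per_day_windowed[-1], per_day_online[-1])
```

Once the window is full, the retraining approach touches `window` times as many examples per day as the online approach, while both toy models end up near the same slope.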

Out of One, Many: Using Language Models to Simulate Human Samples

We propose and explore the possibility that language models can be studied as effective proxies for specific human…

arxiv.org

The team creates multiple responses to a question, as if you were surveying multiple people. Each persona is prompted with some demographic information and political party affiliation. They create a task where fluency is not so important (making word lists), and GPT’s lists are similar to, and not distinguishable from, those of demographically similar humans.

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that…

arxiv.org

A previous version of the paper described it as ‘Goldilocks selection’. Basically we’re looking for an active learning strategy where training time is spent on examples which benefit the model, without burning time on extras. The authors compare their method to active learning baselines, but it appears that those don’t have awareness of the true labels, and this method does, so it knows which points it still needs to learn. The paper shows this method matching the baseline's accuracy in 1/18th of the time, then continuing on to higher accuracy.
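One way to picture "learnable, worth learning, and not yet learnt" is to score each candidate point by how much of its current loss looks reducible. This is a minimal sketch, assuming the score is the current training loss minus the loss that a separate model trained on held-out data assigns to the same point (the paper's reducible holdout loss, as best it can be summarized here). The `select_points` helper and every loss value are invented for illustration.

```python
from typing import List

def select_points(train_loss: List[float], holdout_loss: List[float], k: int) -> List[int]:
    """Return indices of the k points with the largest reducible loss."""
    reducible = [t - h for t, h in zip(train_loss, holdout_loss)]
    ranked = sorted(range(len(reducible)), key=lambda i: reducible[i], reverse=True)
    return ranked[:k]

# Toy batch of four candidate points (all numbers invented):
#   0: already learnt             -> low loss under both models
#   1: noisy or mislabeled        -> high loss under both models
#   2: learnable, not yet learnt  -> high current loss, low holdout loss
#   3: partly learnt
train_loss = [0.1, 2.5, 2.0, 0.9]
holdout_loss = [0.1, 2.4, 0.2, 0.3]

chosen = select_points(train_loss, holdout_loss, k=2)
print(chosen)
```

Both losses need the true labels, which is the advantage over label-free active learning baselines noted above.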

TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN…

arxiv.org

A Modern Standard Arabic translation model with links to corpora. It looks like they’ve fine-tuned an Arabic T5 model. There’s also some research into different decoding options, but most of these also involve sampling? Maybe it averages out with the number of translation pairs, but I’d like to see the variance in accuracy scores if you drew 100 samples for each translation.

Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post hoc…

Despite the plethora of post hoc model explanation methods, the basic properties and behavior of these methods and the…

arxiv.org

Explanations of individual model decisions do not always agree. The researchers state that LIME and several other methods are different takes which can all be framed as local function approximation.
