ML Arxiv Haul #9

Nick Doiron
7 min read · Sep 20, 2022

New Yorker captions, bias bounties, and more:

Researchers describe an optimizer which works better than the popular Adam and AdamW methods.

After Twitter was criticized for racially biased image-cropping, they investigated, changed the algorithm, and held a groundbreaking ‘bias bounty’ competition at DEF CON in 2021. Some of these developers have set up a continuing organization at https://biasbounty.ai . This paper is a bit more abstract, coming up with an autonomous way to maintain such a contest based on its theoretical change in bias and Bayesian analysis.

Carnegie Mellon researchers use text from 1,909 languages in a Crúbadán dataset, and 700 languages in the CMU Wilderness dataset, to come up with audio recognition models for low-resource languages. They point out a concern that the audio portion of the model is pre-trained on Indo-European languages and phonemes, so it does not transfer as well to more distant languages, or longer words.

I saw this described on Twitter as being able to train two separate models and merge their weights, so I was curious. The researchers describe a mixture-of-experts system where there is weighting to a tree of language models (how many parameters are in each model?). They can then keep this tree structure or merge these parameters into one model. Either way, they outperform a 1.3B parameter language model with only 40% of the compute.
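To make the "merge these parameters into one model" step concrete, here is a minimal sketch that assumes merging means a weighted average of matching parameters. That is my reading, not a quote from the paper, and `merge_parameters` and the toy parameter dicts are hypothetical names for illustration.

```python
def merge_parameters(experts, weights):
    """Weighted average of parameter dicts (name -> list of floats).

    Assumption: every expert shares the same architecture, so parameter
    names and shapes line up one-to-one.
    """
    if len(experts) != len(weights):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    merged = {}
    for name, reference in experts[0].items():
        merged[name] = [
            sum(w * expert[name][i] for expert, w in zip(experts, weights)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy "experts" with identical parameter names (made-up values).
expert_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
expert_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_parameters([expert_a, expert_b], [0.5, 0.5])
print(merged)
```

Keeping the tree instead would mean holding on to every expert and mixing their outputs at inference time; averaging trades that flexibility for a single model of the original size.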

I had an ‘unlearning’ paper in the last Arxiv haul. This method is tested in NLP, pre-trained models, and a person-identifying model.

A new fav. Multi-modal models are given a cartoon and caption, while text-only models are given the caption along with a written description of the cartoon and the output of an image-captioning service. Models are evaluated on whether they match the correct caption, pick the editors’ and readers’ favorite captions from the weekly caption contest, and generate a satisfying explanation for the cartoon’s humor.
The authors (from a variety of institutions) find that humans are better at every task, save for GPT-3 picking New Yorker editors’ picks. It seems difficult to imagine a model winning the caption contest with its own jokes anytime soon. But there’s room for improvement in how the models note the visual gags in the images, and in the fluency of the explanations, even with a basic model.
More details will be made available at capcon.dev. Also major props to a section of the paper which includes links to several humor + ML projects.

A technique to prevent unusual or possibly damaging data from entering neural network training. The researchers claim that DORA should be agnostic to the dataset. They do warn that it could be overrun by a large-scale, systemic deception.

Ethereum recently switched from Proof of Work to Proof of Stake, reducing its energy footprint by ~99.5% (it remains to be seen whether miners switch to other cryptocurrencies or what). In this paper from late 2021, one researcher makes a detailed estimate of emissions of the Ethereum network based on hashrate and other publicly available data.

A method to highlight salient/significant regions of an image fed into a Bayesian Neural Network.

You can’t summarize an article by taking pieces and gluing them back together. This group from UNC Chapel Hill makes a dataset of explanations which fall into a few pitfalls.

DeepMind project to search a model’s knowledge and present reasoning for why it chooses an answer. This is preferable to GPT-3 style generation of explanations, which are difficult to trace back to source material, or to prove were the model’s actual reason for doing something.
I was curious whether this could still be tricked into generating nonsensical explanations for unexplainable content, and appreciated their approach:

If the trace does not terminate within a specified number of steps then the answer is considered to be ‘Unknown’, allowing us to filter model answers and increase answer precision
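The step limit in that quote can be sketched as a loop with a budget. This is only an illustration of the filtering idea, not DeepMind's implementation: `answer_with_trace`, `toy_step`, and the default `max_steps` are all hypothetical.

```python
def answer_with_trace(question, step_fn, max_steps=5):
    """Run reasoning steps until step_fn reports it is done.

    step_fn takes the current state and returns (next_state, done).
    If the budget runs out first, the answer is 'Unknown'.
    """
    trace = []
    state = question
    for _ in range(max_steps):
        state, done = step_fn(state)
        trace.append(state)
        if done:
            return state, trace
    return "Unknown", trace


def toy_step(state):
    # Hypothetical stand-in for one reasoning step: drop one character,
    # and finish once a single character is left.
    next_state = state[1:]
    return next_state, len(next_state) <= 1


short_answer, short_trace = answer_with_trace("abc", toy_step, max_steps=5)
long_answer, long_trace = answer_with_trace("abcdefgh", toy_step, max_steps=3)

# Filtering: keep only answers whose trace terminated in time.
kept = [a for a in (short_answer, long_answer) if a != "Unknown"]
print(short_answer, long_answer, kept)
```

Dropping the 'Unknown' answers lowers coverage, but the answers that remain each come with a finished trace.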

Another non-ML Arxiv paper which has been on my list for a while, I think since it came up on the Lex Fridman podcast.
This is sort of a power-of-big-numbers argument about whether we should expect other civilizations in our galaxy.

We estimate that loud alien civilizations now control 40-50% of universe volume

Wild.

If loud aliens arise from quiet ones, a depressingly low transition chance (< ∼10⁻⁴) is required to expect that even one other quiet alien civilization has ever been active in our galaxy.

Put another way: even if only 1 in 10,000 basic quiet civilizations graduates to being a loud/grabby (expanding, sustaining, easily-detectable, spacefaring) civilization, it’s odd that we’re not in their ~50% of the universe. The authors’ explanation would be that it’s rare for a galaxy to have more than one civilization, and that our nearest loud aliens started 1,000s of galaxies away.

I think I’d understand if aliens simply found it unrewarding to travel between galaxies? It’s 2.5 million light-years to Andromeda, and our galaxy is only 100,000 light-years across, so that’s a long way to haul your plants/animals/habitat without a pit stop.
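For a sense of scale, here is the back-of-envelope arithmetic behind that "long way". The distances are the ones quoted above; the cruise speed of 10% of light speed is purely my assumption, and the estimate ignores acceleration, relativity, and the galaxies' own motion.

```python
ANDROMEDA_DISTANCE_LY = 2_500_000   # figure quoted in the text
MILKY_WAY_DIAMETER_LY = 100_000     # figure quoted in the text


def galaxy_widths(distance_ly, diameter_ly):
    """How many galaxy diameters fit into the trip."""
    return distance_ly / diameter_ly


def travel_time_years(distance_ly, fraction_of_c):
    """Rest-frame travel time at a constant fraction of light speed."""
    if not 0 < fraction_of_c <= 1:
        raise ValueError("fraction_of_c must be in (0, 1]")
    return distance_ly / fraction_of_c


widths = galaxy_widths(ANDROMEDA_DISTANCE_LY, MILKY_WAY_DIAMETER_LY)
years = travel_time_years(ANDROMEDA_DISTANCE_LY, 0.1)  # assumed speed: 0.1c
print(f"{widths:.0f} galaxy widths, roughly {years:,.0f} years at 0.1c")
```

So the gap is about 25 times the width of our own galaxy, and on the order of 25 million years at the assumed speed.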

Note that ‘online’ here means a continuously learning model. The author says, “Recommendations drive 80% of revenue at Grubhub,” so an efficient and smart system is a major goal for them. The team’s original approach to freshness was retraining the model on the past n days of data, every day. Continuous learning allows them to train on just the one new day of data every day.
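A rough way to see the savings is to count how many examples each schedule has to process. This sketch assumes equal-sized days and a 7-day window, both made up for illustration; the function names are hypothetical, and it says nothing about Grubhub's actual pipeline.

```python
def examples_processed_full_retrain(daily_counts, window_days):
    """Every day, retrain from scratch on the most recent window_days of data."""
    total = 0
    for day in range(len(daily_counts)):
        start = max(0, day - window_days + 1)
        total += sum(daily_counts[start:day + 1])
    return total


def examples_processed_incremental(daily_counts):
    """Every day, update the existing model on that day's new data only."""
    return sum(daily_counts)


daily_counts = [100] * 10  # ten days of made-up traffic
full = examples_processed_full_retrain(daily_counts, window_days=7)
incremental = examples_processed_incremental(daily_counts)
print(full, incremental)
```

Once the window is full, the daily retrain touches about n times as much data as the incremental update, which is where the efficiency argument comes from. Whether the continuously trained model stays as accurate is a separate question that this count does not address.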

The team generates multiple responses to a question, as if surveying multiple people. Each persona is prompted with some demographic information and political party affiliation. They create a task where fluency is not so important (making word lists), and GPT’s lists are similar to, and not distinguishable from, those of demographically similar humans.

A previous version of the paper described it as ‘Goldilocks selection’. Basically we’re looking for an active learning strategy where training time is spent on examples which benefit the model, without burning time on extras. The authors compare their method to active learning baselines, but it appears that those don’t have awareness of the true labels, and this method does, so it knows which points it still needs to learn. The paper shows getting to a better accuracy in 1/18th the time through this method, then continues on to higher accuracy.

MSA Arabic translation model with links to corpora. It looks like they’ve fine-tuned an Arabic T5 model. There’s also some research into different decoding options, but most of these also involve sampling? Maybe it averages out with the number of translation pairs, but I’d like to see if you did 100 samples on each translation, what’s the variance in accuracy scores?
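The experiment I have in mind would look something like this. It is a sketch only: `toy_sample_and_score` is a hypothetical stand-in for "sample one translation and score it against the reference" and just returns random numbers, so the printed values mean nothing about the actual model.

```python
import random
import statistics


def score_spread(sample_and_score, n_samples=100, seed=0):
    """Draw n_samples scored outputs and summarize how much the score varies."""
    rng = random.Random(seed)
    scores = [sample_and_score(rng) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.variance(scores)


def toy_sample_and_score(rng):
    # Hypothetical placeholder: a real version would decode with sampling
    # and compute a translation metric for that one output.
    return rng.uniform(0.2, 0.4)


mean_score, var_score = score_spread(toy_sample_and_score)
print(f"mean={mean_score:.3f} variance={var_score:.5f}")
```

Running this per source sentence, rather than once over the whole test set, would show whether the sampling noise really does average out.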

Explanations of individual model decisions do not always agree. The researchers state that LIME and several other methods are different takes which can all be based on local function approximation.
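As a one-dimensional illustration of local function approximation: perturb the input around a point, weight the perturbed samples by how close they are, and fit a line. This is not the LIME library and not the paper's framework, just a sketch of the general idea; the function name, the Gaussian sampling, and the kernel settings are my assumptions.

```python
import math
import random


def local_linear_surrogate(f, x0, n_samples=200, scale=0.5,
                           kernel_width=0.25, seed=0):
    """Fit a weighted least-squares line to black-box f near x0."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, scale) for _ in range(n_samples)]
    ys = [f(x) for x in xs]
    # Exponential kernel: samples close to x0 count for more.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]

    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))

    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept


# The "explanation" of f(x) = x^2 near x = 1 is the local slope, close to 2.
slope, intercept = local_linear_surrogate(lambda x: x ** 2, x0=1.0)
print(f"local slope near x=1: {slope:.2f}")
```

Changing the sampling distribution or the kernel width changes the fitted slope, which is one intuition for why explanation methods built on different local neighborhoods can disagree about the same prediction.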
