ML Arxiv Haul / Speed Run

6 min read · Dec 8, 2021

This is going to be experimental writing, reminding me to clear out the machine learning papers in my open browser tabs. Generally I skim these until I get the gist (i.e. please do not quiz me on these) and file them into this giant Google Doc.
If a paper has a lot of math, it might just go into the ‘ML Math’ section, and if its conclusions are unclear, I might just close the tab. I really only hold onto these if I feel like a few weeks from now, I’ll be having a conversation and desperately need a link to back up my vague memory.

Sustainable AI: Environmental Implications, Challenges and Opportunities

This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective…

arxiv.org

This Facebook paper is a good look behind the curtain into a huge AI research lab. We get statistics about how long models are trained for, how big the biggest internal models are, and what share of the energy is carbon-free. A particular focus is Neural Architecture Search, where trials should be stopped early to avoid an up-to-3000x energy spend. Also: quantizing / 16-bit models, and clever AI scheduling to find optimal times to run servers and do training.
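To make the 16-bit point concrete, here is a minimal sketch (my own illustration, not code from the paper) that round-trips a few made-up weights through half precision using only the standard library. The `weights` values and the helper name `to_float16_and_back` are hypothetical; the point is just that storage halves compared with 32-bit floats while the rounding error stays small.

```python
import struct

def to_float16_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision ('e' format)."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Hypothetical weights, chosen only for illustration.
weights = [0.1, -1.2345678, 3.14159265]
halved = [to_float16_and_back(w) for w in weights]

# Storage cost if each weight were kept as 32-bit vs 16-bit.
bytes_fp32 = len(weights) * struct.calcsize("f")
bytes_fp16 = len(weights) * struct.calcsize("e")

# Largest absolute rounding error introduced by the 16-bit representation.
max_error = max(abs(w - h) for w, h in zip(weights, halved))

print(bytes_fp32, bytes_fp16, max_error)
```

Real quantization schemes (int8, mixed precision, etc.) are more involved than this; the sketch only shows the size / precision trade-off.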

Adversarial Attacks on ML Defense Models Competition

Due to the vulnerability of deep neural networks (DNNs) to adversarial examples, a large number of defense techniques…

arxiv.org

I was searching for updates from the people who organized an interesting Adversarial AI Village when I was at DEFCON in 2018. This covers a Tsinghua University / Alibaba Security-sponsored competition around a computer-vision conference in 2021. They’ve introduced an adversarial benchmark. The paper covers technical details of the top 6 teams. I’m interested in using AutoAttack in the future as a starting point.

Florence: A New Foundation Model for Computer Vision

Automated visual understanding of our diverse and open world demands computer vision models to generalize well with…

arxiv.org

This paper first drew attention on Twitter because it accepts and continues the ‘foundation models’ label given by Stanford. It’s also interesting in its own right. I assumed this wouldn’t be so different from the ‘visual transformers’ models which have appeared online since OpenAI’s CLIP and DALL-E in January 2021. The main differences here seem to be a way of handling a many-to-many relationship between images and possible / matching captions, and a new image encoder (they call it CoSwin). They show state-of-the-art results on several benchmarks.

The Values Encoded in Machine Learning Research

Machine learning (ML) currently exerts an outsized influence on the world, increasingly affecting communities and…

arxiv.org

A team discusses what values researchers attribute to their research.

This paper has been online since June, but recently got more attention after getting dismissed by NeurIPS reviewers and meta-reviewers. The results, such as that papers aren’t introspective enough to reliably cover their negative impacts, remind me of last year’s discussion of the NeurIPS broader impacts statement.

My first instinct was to be skeptical of the authors’ selection of 100 papers from two conferences × four years (2008, 2009, 2018, 2019), taking care to pick the most-cited papers. Sampling is necessary to make this a manageable task, but I can’t help but wonder whether these papers truly are representative. Citations are a weird game, with some big companies (Google, Facebook, OpenAI) and celebrities (Hinton’s capsule networks) getting more attention for their authorship or scale, while papers which do pay a good deal of attention to ethics (such as this one) might be getting other conference placements and fewer citations.

Building Scalable, Explainable, and Adaptive NLP Models with Retrieval

Natural language processing (NLP) has witnessed impressive developments in answering questions, summarizing or…

ai.stanford.edu

This is a good peek into retrieval-based NLP and Stanford’s current challenges in this space. Even though large language models are known to have some internalized knowledge about celebrities, countries, etc. that lets them do analogies or complete sentences, they can’t possibly know everything. I was really interested in ‘patching’ the models’ internalized knowledge, which the HuggingFace / BigScience group misinterpreted as being a type of retrieval, so I ended up reading more into both of these topics.

MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

As major progress is made in open-ended text generation, measuring how close machine-generated text is to human…

arxiv.org

Text generation models are often measured by perplexity (e^cross-entropy loss). Even high-scoring models are subject to repetition or unusual wording. This new MAUVE metric is compared against known quality differences in text generated by different-sized versions of GPT-2, and against humans’ actual ratings of the text as human-like, interesting, and ‘sensible’.
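For reference, the perplexity relationship mentioned here can be sketched in a few lines. This is only the standard definition (the exponential of the mean negative log-likelihood), not anything from the MAUVE paper, and the token probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp(cross-entropy), where cross-entropy is the mean negative
    log-probability the model assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    cross_entropy = sum(nll) / len(nll)
    return math.exp(cross_entropy)

# Hypothetical per-token probabilities from two imaginary models.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))  # about 2: like choosing among 2 options
print(perplexity(uncertain_model))  # about 10: like choosing among 10 options
```

Lower is better, but the number only says how well the model predicts the next token. It says nothing directly about repetition or odd wording, which is the gap a metric like MAUVE is trying to fill.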

Mastering Atari Games with Limited Data

Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key…

arxiv.org

I know almost nothing about reinforcement learning (experimenting in RL takes a lot of resources which I don’t have) but one of the ongoing problems is that it takes an absurd, inhuman amount of time for most models to learn Atari games. This paper has results after only 2 hours of real-time game experience.

Randomness In Neural Network Training: Characterizing The Impact of Tooling

The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise…

arxiv.org

Unlike the ‘random seed’ optimization paper, these authors are looking at whether the ML framework, CUDA drivers, etc. may affect the reproducibility of an experiment, and of the top results.

A General Language Assistant as a Laboratory for Alignment

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose…

arxiv.org

The first paper from Anthropic, a team which splintered off from OpenAI. The authors introduce ‘HHH’ (helpful, honest, and harmless) as the ‘alignment’ goals of their assistant bot. I initially thought that the bot was a chat bot that helped with editing and summarizing documents, but there’s also a Code Correctness task which gets a lot of attention.
Supplementary material suggests that the AI would be able to do just about anything:

Writing an essay from bullet points
Teaching a third-grader about fractions
Identifying useful papers for a researcher
Explaining a convoluted legal contract
Providing a recipe and advice for baking a cherry tart
Comforting a parent whose daughter has left for college
Suggesting songs based on your favorite music
Fixing a bug in javascript code

Are My Deep Learning Systems Fair? An Empirical Study of Fixed-Seed Training

Are My Deep Learning Systems Fair? An Empirical Study of Fixed-Seed Training Part of Advances in Neural Information…

proceedings.neurips.cc

Sort of related to the ‘Randomness in Neural Network Training’ reproducibility paper above, this looks at how variance in training and in de-biasing tools can leave a system insufficient to withstand legal challenges. Even on ‘fixed-seed identical training runs’ the software implementation and floating-point calculations can change outputs and fairness metrics by a significant amount.
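A toy illustration of why floating-point calculations alone can change an outcome (my own example, not from the paper): floating-point addition is not associative, so summing the same numbers in a different order, as a parallel implementation might, can give a slightly different result. The threshold of 0.6 is a hypothetical decision boundary.

```python
# The same three numbers, summed in two different orders.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)

# The two sums differ in the last bit.
print(left_to_right == right_to_left)  # False

# Hypothetical decision rule: accept if the score is strictly above 0.6.
THRESHOLD = 0.6
decision_1 = left_to_right > THRESHOLD
decision_2 = right_to_left > THRESHOLD
print(decision_1, decision_2)  # True False
```

In a real network the differences are spread over millions of operations and amplified by training, but the mechanism is the same: identical seeds do not guarantee identical arithmetic.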

AI and the Everything in the Whole Wide World Benchmark

There is a tendency across different subfields in AI to valorize a small collection of influential benchmarks. These…

arxiv.org

This is a fun paper inspired by a Sesame Street book (which we definitely had in my house!). Dr. Bender has been discussing this on Twitter. Overall I would summarize it as ‘metrics aren’t everything, because they cannot possibly contain everything’.

End-to-End Weak Supervision

Aggregating multiple sources of weak supervision (WS) can ease the data-labeling bottleneck prevalent in many machine…

arxiv.org

A new approach to weakly supervised learning (data with messy / fuzzy labeling, or labeling functions), similar to Snorkel.
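If labeling functions are unfamiliar, here is a minimal sketch of the idea. It is not the paper’s method and not Snorkel’s API: the three labeling functions are hypothetical, and the votes are combined with a plain majority vote, which is the simple baseline that learned aggregation approaches try to improve on.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical labeling functions: cheap, noisy heuristics that may abstain.
def lf_contains_great(text):
    return POS if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return POS if text.endswith("!") else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_exclamation]

def majority_vote(text):
    """Combine the non-abstaining votes; abstain when there are none or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

for example in ["This was great!", "A terrible sequel", "It was fine"]:
    print(example, "->", majority_vote(example))
```

The resulting weak labels can then be used to train an ordinary classifier on data that nobody labeled by hand.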

Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic…

arxiv.org

Facebook’s research into speech recognition benchmarks: unfortunately, success on one benchmark does not transfer well to another, so they suggest training on publicly submitted data. Luckily Facebook has a lot of that!

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP)…

arxiv.org

GEM collected a lot of these augmentation tools into one repo.

Written by Nick Doiron