ML Arxiv Haul / Speed Run

Nick Doiron
Dec 8, 2021

This is going to be experimental writing, reminding me to clear out the machine learning papers in my open browser tabs. Generally I skim these until I get the gist (i.e. please do not quiz me on these) and file them into this giant Google Doc.
If a paper has a lot of math it might just go into the ‘ML Math’ section, and if its conclusions are unclear, I might just close the tab. I really only hold onto these if I feel like, a few weeks from now, I’ll be having a conversation and desperately need a link to back up my vague memory.

This Facebook paper is a good look behind the curtain into a huge AI research lab. We get statistics about how long models are trained for, how big the biggest internal models are, and what share of the energy is carbon-free. A particular focus is Neural Architecture Search, where trials should be stopped early to avoid an up-to-3000x energy spend. Also: quantizing / 16-bit models, and clever AI scheduling to find optimal times to run servers and do training.

I was searching for updates from the people who organized an interesting Adversarial AI Village when I was at DEFCON in 2018. This covers a Tsinghua University / Alibaba Security-sponsored competition around a computer-vision conference in 2021. They’ve introduced an adversarial benchmark. The paper covers technical details of the top 6 teams. I’m interested in using AutoAttack in the future as a starting point.

This paper first drew attention on Twitter because it accepts and continues the ‘foundation models’ label given by Stanford. It’s also interesting in its own right. I assumed this wouldn’t be so different from the ‘visual transformers’ models which have appeared online since OpenAI’s CLIP and DALL-E in January 2021. The main differences here seem to be a way of handling a many-to-many relationship between images and possible / matching captions, and a new image encoder (they call it CoSwin). They show state-of-the-art results on several benchmarks.

A team discusses what values researchers attribute to their research.

This paper has been online since June, but recently got more attention after getting dismissed by NeurIPS reviewers and meta-reviewers. The results, such as that papers aren’t introspective enough to reliably cover their negative impacts, remind me of last year’s discussion of the NeurIPS broader impacts statement.

My first instinct was to be skeptical of the authors selecting 100 papers from two conferences x four years (2008, 2009, 2018, 2019) with care to take the most-cited papers. It’s necessary to take a sample of papers to make this a manageable task, but I can’t help but wonder whether these truly are representative. Citations are a weird game, with some big companies (Google, Facebook, OpenAI) and celebrities (Hinton’s capsule networks) getting more attention for their authorship or scale, and papers which do pay a good deal of attention to ethics (such as this one) might be getting other conference placements and fewer citations.

This is a good peek into retrieval-based NLP and Stanford’s current challenges in this space. Even though large language models are known to have some internalized knowledge about celebrities, countries, etc. that lets them do analogies or complete sentences, they can’t possibly know everything. I was really interested in ‘patching’ the models’ internalized knowledge, which the HuggingFace / BigScience group misinterpreted as being a type of retrieval, so I ended up reading more into both of these topics.

Text generation models often are measured in perplexity (e^cross-entropy loss). Even high-scoring models are subject to repetition or unusual wording. This new MAUVE metric is compared to known quality differences in text generated by different-sized versions of GPT-2, and humans’ actual rating of the text being human-like, interesting, and ‘sensible’.
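For my own reference, here is a minimal sketch of the perplexity formula mentioned above. It assumes the cross-entropy loss is the average negative natural log of the probability the model gave to each correct token. The function name and the probability lists are made up for illustration and do not come from the paper.

```python
import math

def perplexity(token_probs):
    """Return exp(cross-entropy), where cross-entropy is the mean of -ln(p)."""
    # Assumption: token_probs holds the probability the model assigned
    # to each *correct* next token in a sequence.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# Hypothetical per-token probabilities, not real model output.
confident = [0.5, 0.5, 0.5, 0.5]   # model puts 50% on each correct token
uncertain = [0.1, 0.1, 0.1, 0.1]   # model puts 10% on each correct token

print(perplexity(confident))  # roughly 2
print(perplexity(uncertain))  # roughly 10
```

One way to read the number: a perplexity of about 10 means the model was, on average, as unsure as if it were choosing uniformly among 10 tokens. Nothing in that number penalizes repetition or odd wording, which is the gap a metric like MAUVE is trying to fill.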

I know almost nothing about reinforcement learning (experimenting in RL takes a lot of resources which I don’t have) but one of the ongoing problems is that it takes an absurd, inhuman amount of time for most models to learn Atari games. This paper has results after only 2 hours of training.

Unlike the ‘random seed’ optimization paper, these authors are looking at whether the ML framework, CUDA drivers, etc. may affect the reproducibility of an experiment, and of the top results.

The first paper from Anthropic, a team which splintered off from OpenAI. The authors introduce ‘HHH’ (helpful, honest, and harmless) as the ‘alignment’ goals of their assistant bot. I initially thought that the bot was a chat bot that helped with editing and summarizing documents, but there’s also a Code Correctness task which gets a lot of attention.
Supplementary material suggests that the AI would be able to do just about anything:

Writing an essay from bullet points
Teaching a third-grader about fractions
Identifying useful papers for a researcher
Explaining a convoluted legal contract
Providing a recipe and advice for baking a cherry tart
Comforting a parent whose daughter has left for college
Suggesting songs based on your favorite music
Fixing a bug in javascript code

Sort of related to the ‘Randomness in Neural Network Training’ reproducibility paper above, this talks about variance and de-biasing tools in the context of putting together a system sufficient to withstand legal challenges. Even on ‘fixed-seed identical training runs’ the software implementation and floating-point calculations can change outputs and fairness metrics by a significant amount.
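A tiny illustration of why floating-point math alone can make ‘identical’ runs diverge: addition is not associative, so adding the same numbers in a different order (as parallel hardware may do) can round differently. This is a toy sketch of the general effect, not the paper’s experiment; the `accumulate` helper is mine.

```python
def accumulate(values):
    """Add values left to right, the way a simple sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]

forward = accumulate(values)             # (0.1 + 0.2) + 0.3
backward = accumulate(reversed(values))  # (0.3 + 0.2) + 0.1

print(forward, backward)
print(forward == backward)  # False
```

The gap here is only in the last digit, but a training run repeats operations like this an enormous number of times, so small rounding differences have plenty of room to compound.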

This is a fun paper inspired by a Sesame Street book (which we definitely had in my house!). Dr. Bender has been discussing this on Twitter. Overall I would summarize it as ‘metrics aren’t everything, because they cannot possibly contain everything’.

A new release of weakly supervised learning (data with messy / fuzzy labeling, or labeling functions) similar to Snorkel.
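To make ‘labeling functions’ concrete, here is a minimal sketch of the idea: several noisy heuristics each vote on a label or abstain, and the votes are combined. This is plain Python with a simple majority vote, not the API of Snorkel or of the release above. The function names, the spam / ham labels, and the heuristics are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical heuristics for a toy spam / ham task.
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 2 else ABSTAIN

def lf_mentions_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_mentions_meeting]

def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # top two labels tied, so no confident answer
    return ranked[0][0]

print(weak_label("FREE prizes!!! Click now"))       # spam
print(weak_label("Agenda for tomorrow's meeting"))  # ham
print(weak_label("Free lunch at the meeting"))      # None (one vote each)
```

Majority vote is the simplest possible combiner. It treats every heuristic as equally reliable, which is rarely true with messy real-world labels.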

Facebook’s research into speech recognition benchmarks: unfortunately, success on one benchmark does not transfer well to another, so they suggest training on publicly submitted data. Luckily Facebook has a lot of that!

GEM collected a lot of these augmentation tools into one repo.
