How to read an ML paper
Why do most of my projects focus on NLP? It’s partly because that’s where I’ve done the most reading of blogs, code, and papers (most of those papers have actually been pre-prints posted on arXiv.org). Reading papers has been a steep learning curve: when I worked as a web developer, I was never expected to look through research papers. Recently I’ve tried to read papers in other subfields of machine learning, on some more beginner-friendly topics.
Best places to start
The #1 piece of advice I ever saw about reading a pre-print is to plan on reading it in multiple passes. Huh? But allowing myself to skim, or to jump over the math on a first pass, instead of reading straight through until I hit a roadblock, made a huge difference. I went looking for where I first read this, and it was likely this lecture by Andrew Ng, or something based on his outline:
Papers with Code has papers and open source code, side by side.
I’ve also landed on this repo multiple times and bookmarked it to read more later; it’s a collection of annotated papers:
Part 1: Finding a Friendly Paper and Abstract
Let’s study a paper on data augmentation — the practice of expanding your training data by generating a variety of new and mutated examples (in this case: images).
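For instance, here’s a minimal sketch (my own illustration, not anything from the paper) of a few classic image augmentations using torchvision:

```python
from torchvision import transforms

# Each training image gets randomly flipped, shifted, and recolored,
# so the model sees slightly different examples every epoch without
# collecting any new labeled data.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),  # 32x32 assumes CIFAR-sized images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])
```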
I knew I could stay focused enough to finish this paper, for two reasons:
- in NLP we also do data augmentation — so the concept is familiar to me
- the paper follows one of my favorite formats: “X is recommended, but… why? Are we sure X is always the effective way to accomplish that?” This helps because, as a newbie, I’ve often only had time to learn which practices I should follow, and not why.
“we seek to quantify how data augmentation improves model generalization”
“we introduce interpretable and easy-to-compute measures: Affinity and Diversity”
Model generalization = good; it’s the opposite of your model overfitting to the training data and becoming useless on new examples.
Affinity and Diversity: what do these mean in this context? Are they used here as opposite or orthogonal terms? I’ll expect the paper to define them, rather than trying to understand them from how I use these words myself.
Part 2: What data are they using?
“We present an extensive study of 204 different augmentations on CIFAR-10 and 223 on ImageNet, varying both broad transform families and finer transform parameters”
CIFAR and ImageNet are large datasets which you see again and again in machine learning. I’m more familiar with ImageNet — over a million images in a thousand categories (if you are getting started, I’d recommend the smaller imagenette instead).
It’s always good if the paper has a familiar, public dataset and open source code, because you could potentially repeat the experiment or do a riff on it. For a counterexample, ICLR recently debated a submission whose experiments used MuJoCo, a proprietary physics simulator, which makes them harder to reproduce.
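If you want to poke at the same data, here’s a minimal sketch of downloading CIFAR-10 with torchvision (my setup, not the paper’s code):

```python
from torchvision import datasets

# CIFAR-10: 60,000 32x32 color images in 10 classes
# (50,000 train / 10,000 test); the download is roughly 170 MB.
train_set = datasets.CIFAR10(root="data", train=True, download=True)
print(len(train_set))     # 50000
print(train_set.classes)  # ['airplane', 'automobile', 'bird', ...]
```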
Part 3: Picking up surprising takeaways from intro and results
I might not remember or understand every part of a paper, but I’m content to pick up surprising pieces of info. If in the future I have an issue with data augmentation, I need to remember only enough detail to look up this paper from my Google Doc or bookmarks and reread from there.
- Image distortion works well as an augmentation technique
- Augmentation methods perform differently on ImageNet and CIFAR (so there probably is no one all-purpose image augmentation step)
- “Images were pre-processed by dividing each pixel value by 255 and normalizing by the data set statistics” — I’ve never done this on my image data, and I wonder if it’s a good practice? (see the sketch after this list)
- The paper discusses several combined augmentation methods — RandAugment, AutoAugment, and mixup — which are closer to the drop-in solutions that I’m looking for when starting a new project (RandAugment also appears in the sketch below)
- “Data augmentation has the potential to amplify bias” — concerning
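On those two middle bullet points: here’s a hedged sketch of what that preprocessing, plus RandAugment as a drop-in augmentation, could look like in torchvision (the mean/std values are commonly cited CIFAR-10 channel statistics, not numbers from the paper; RandAugment needs a reasonably recent torchvision):

```python
from torchvision import transforms

# Commonly cited CIFAR-10 channel statistics; verify for your own dataset.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_tfms = transforms.Compose([
    transforms.RandAugment(),  # drop-in combined augmentation policy
    transforms.ToTensor(),     # divides each pixel value by 255
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),  # normalize by dataset statistics
])
```

As far as I can tell, normalizing like this keeps every input channel on a similar scale, which tends to make training more stable, so it’s widely considered good practice.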
Part 4: The definitions
“Affinity: a simple metric for distribution shift”
“Diversity: A measure of augmentation complexity”
The terms from the title and abstract each get their own section of the paper, with a definition.
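From my first pass, my rough understanding is that Affinity compares how a model trained on clean data performs on augmented versus clean validation data, while Diversity captures how complex or varied the augmented training data is. Here’s the Affinity intuition as a sketch (my paraphrase, not the paper’s exact formula; model, clean_val, augment, and evaluate are all hypothetical stand-ins):

```python
def affinity_estimate(model, clean_val, augment, evaluate):
    """Rough paraphrase of the Affinity idea: how much does accuracy
    change when a model trained on clean data is evaluated on
    augmented validation data? (Not the paper's exact definition.)"""
    acc_clean = evaluate(model, clean_val)
    augmented_val = [(augment(x), y) for (x, y) in clean_val]
    acc_augmented = evaluate(model, augmented_val)
    return acc_augmented - acc_clean  # near zero = small distribution shift
```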
Part 5: The conclusions
Usually papers have more core content here; for example, if the authors compared their new method to previous work, you’ll see what did and didn’t work. This paper answers its original questions (which may be why you picked it up), then recommends more, and broader, data augmentation.
If you’re looking for more academic papers, I’ll repeat my recommendations of Papers with Code and annotated_research_papers. If you’re more of a YouTube person, stay subscribed to Rasa, Weights and Biases, and Yannic Kilcher for some educational content.