How to read an ML paper
Why do most of my projects focus on NLP? It’s partly because that’s where I’ve done the most reading of blogs, code, and papers (most of those papers have actually been pre-prints posted on arXiv.org). Reading papers has been a steep learning curve: when I worked as a web developer, I was never expected to look through research papers. Recently I’ve tried to read papers in other subfields of machine learning, on some more beginner-friendly topics.
Best places to start
The #1 piece of advice I ever saw about reading a pre-print is to plan on reading it in multiple passes. Huh? But allowing myself to skim, or to jump over the math on a first pass, instead of reading straight through until I hit a roadblock, made a huge difference. I went looking for where I first read this, and it was likely this lecture by Andrew Ng, or something based on his outline:
Papers with Code has papers and open source code, side by side.
I’ve also landed on this repo multiple times and bookmarked it to read more later; it’s a collection of annotated papers:
Part 1: Finding a Friendly Paper and Abstract
Let’s study a paper on data augmentation — the practice of expanding your training data by generating a variety of new and mutated examples (in this case: images).
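For instance, here’s a minimal sketch (my own illustration, not anything from the paper) of a few classic image augmentations using torchvision:

```python
from torchvision import transforms

# Each training image gets randomly flipped, shifted, and recolored,
# so the model sees slightly different examples every epoch without
# collecting any new labeled data.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),  # 32x32 assumes CIFAR-sized images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])
```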
I knew I could stay focused enough to finish this paper, for two reasons:
- in NLP we also do data augmentation — so the concept is familiar to me
- the paper follows one of my favorite formats: “X is recommended, but… why? Are we sure X is always the effective way to accomplish that?” This helps because, as a newbie, I’ve often only had time to learn which practices I should follow, and not why.
“we seek to quantify how data augmentation improves model generalization”
“we introduce interpretable and easy-to-compute measures: Affinity and Diversity”
Model generalization = good; it’s the opposite of your model overfitting to the training data and becoming useless on new examples.
Affinity and Diversity: what do these mean in this context? Are they used here as opposite or orthogonal terms? I’ll expect the paper to define them, rather than trying to understand them from how I use these words myself.
Part 2: What data are they using?
“We present an extensive study of 204 different augmentations on CIFAR-10 and 223 on ImageNet, varying both broad transform families and finer transform parameters”
CIFAR and ImageNet are large datasets which you see again and again in machine learning. I’m more familiar with ImageNet — over a million images in a thousand categories (if you are getting started, I’d recommend the smaller imagenette instead).
It’s always good if the paper has a familiar, public dataset and open source code, because you could potentially repeat the experiment or do a riff on it. For a counterexample, ICLR recently debated a submission whose experiments used MuJoCo, a proprietary physics simulator, which makes them harder to reproduce.
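If you want to poke at the same data, here’s a minimal sketch of downloading CIFAR-10 with torchvision (my setup, not the paper’s code):

```python
from torchvision import datasets

# CIFAR-10: 60,000 32x32 color images in 10 classes
# (50,000 train / 10,000 test); the download is roughly 170 MB.
train_set = datasets.CIFAR10(root="data", train=True, download=True)
print(len(train_set))     # 50000
print(train_set.classes)  # ['airplane', 'automobile', 'bird', ...]
```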
Part 3: Picking up surprising takeaways from intro and results
I might not remember or understand every part of a paper, but I’m content to pick up surprising pieces of info. If in the future I have an issue with data augmentation, I need to remember only enough detail to look up this paper from my Google Doc or bookmarks and reread from there.
- Image distortion works well as an augmentation technique
- Augmentation methods perform differently on ImageNet and CIFAR (so there probably is no one all-purpose image augmentation step)
- “Images were pre-processed by dividing each pixel value by 255 and normalizing by the data set statistics” — I’ve never done this on my image data, and I wonder if it’s a good practice? (see the sketch after this list)
- The paper discusses several combined augmentation methods — RandAugment, AutoAugment, and mixup — which are closer to the drop-in solutions that I’m looking for when starting a new project (RandAugment also appears in the sketch below)
- “Data augmentation has the potential to amplify bias” — concerning
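On those two middle bullet points: here’s a hedged sketch of what that preprocessing, plus RandAugment as a drop-in augmentation, could look like in torchvision (the mean/std values are commonly cited CIFAR-10 channel statistics, not numbers from the paper; RandAugment needs a reasonably recent torchvision):

```python
from torchvision import transforms

# Commonly cited CIFAR-10 channel statistics; verify for your own dataset.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_tfms = transforms.Compose([
    transforms.RandAugment(),  # drop-in combined augmentation policy
    transforms.ToTensor(),     # divides each pixel value by 255
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),  # normalize by dataset statistics
])
```

As far as I can tell, normalizing like this keeps every input channel on a similar scale, which tends to make training more stable, so it’s widely considered good practice.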
Part 4: The definitions
“Affinity: a simple metric for distribution shift”
“Diversity: A measure of augmentation complexity”
The terms from the title and abstract each get their own section of the paper, with a definition.
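From my first pass, my rough understanding is that Affinity compares how a model trained on clean data performs on augmented versus clean validation data, while Diversity captures how complex or varied the augmented training data is. Here’s the Affinity intuition as a sketch (my paraphrase, not the paper’s exact formula; model, clean_val, augment, and evaluate are all hypothetical stand-ins):

```python
def affinity_estimate(model, clean_val, augment, evaluate):
    """Rough paraphrase of the Affinity idea: how much does accuracy
    change when a model trained on clean data is evaluated on
    augmented validation data? (Not the paper's exact definition.)"""
    acc_clean = evaluate(model, clean_val)
    augmented_val = [(augment(x), y) for (x, y) in clean_val]
    acc_augmented = evaluate(model, augmented_val)
    return acc_augmented - acc_clean  # near zero = small distribution shift
```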
Part 5: The conclusions
Usually papers have more core content here; for example, if the authors compared their new method to previous work, you’ll see what did and didn’t work. This paper answers its original questions (which may be why you picked it up), then recommends more, and broader, data augmentation.
If you’re looking for more academic papers, I’ll repeat my recommendations of Papers with Code and annotated_research_papers. If you’re more of a YouTube person, stay subscribed to Rasa, Weights and Biases, and Yannic Kilcher for some educational content.