Labeling ML Data with Snorkel

Nick Doiron
2 min read · Nov 2, 2019


In previous posts on the AOC Reply Dataset, I mentioned how hard it is to train a troll detector with Google AutoML or scikit-learn when I don’t want to manually label 110k Tweets. In practice, I would use SQL to find overtly profane keywords, then bundle all Tweets from those same authors into one category.
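That keyword-and-author approach can be sketched in a few lines of plain Python (the keyword list and sample Tweets here are hypothetical placeholders, not the real dataset):

```python
# Hypothetical sketch of keyword-based bulk labeling: flag any author who
# used an overtly profane keyword, then label ALL of that author's Tweets
# together as one category.
tweets = [
    {"author": "a", "text": "great work today"},
    {"author": "b", "text": "you are a damn disgrace"},
    {"author": "b", "text": "resign already"},
]
PROFANE = {"damn", "disgrace"}  # hypothetical keyword list

flagged_authors = {
    t["author"] for t in tweets
    if PROFANE & set(t["text"].lower().split())
}
# 1 = troll, 0 = not troll; every Tweet by a flagged author gets labeled 1
labels = [1 if t["author"] in flagged_authors else 0 for t in tweets]
```

Note that the second Tweet by author "b" gets labeled even though it contains no keyword, which is exactly the author-bundling shortcut described above.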
Since then, I learned that the term for my problem is “weak supervision,” and Snorkel is a leading tool for building a better supervised learning dataset with labeling functions. Recent research around Snorkel includes Snuba, DryBell, and SuperGLUE. Generally useful elements seem to get merged back into the main Snorkel library, so we will stick to that.

The core concept is writing several different labeling functions, which Snorkel then figures out how to combine and weight. For example, a troll Tweet could contain profanity, weird conspiracy theories, certain hashtags, etc. These are all red flags, but in a world of probabilities some carry more weight and meaning than others. Explicit, racist hashtags are almost always going to be used negatively, but profanity can go either way (“keep fucking rocking it”).
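Here is a hand-rolled sketch of that idea: a few labeling functions that each either vote or abstain, combined by a simple majority vote. The keyword triggers are hypothetical, and the real Snorkel library learns weights for each function (via its `LabelModel`) rather than voting naively like this:

```python
# Sketch of the labeling-function concept with a naive majority vote.
# Snorkel's actual LabelModel estimates each function's accuracy instead.
ABSTAIN, NOT_TROLL, TROLL = -1, 0, 1

def lf_profanity(tweet):
    # Weak signal: profanity can be hostile or supportive.
    return TROLL if "fucking" in tweet.lower() else ABSTAIN

def lf_hashtags(tweet):
    # Stronger signal: certain hashtags (hypothetical example).
    return TROLL if "#maga" in tweet.lower() else ABSTAIN

def lf_supportive(tweet):
    # Counter-signal: clearly supportive phrasing (hypothetical example).
    return NOT_TROLL if "rocking it" in tweet.lower() else ABSTAIN

def combine(tweet, lfs=(lf_profanity, lf_hashtags, lf_supportive)):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [vote for vote in (lf(tweet) for lf in lfs) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

With this toy combiner, `combine("check out #MAGA truth")` returns `TROLL` while `combine("you're rocking it")` returns `NOT_TROLL`; the point of Snorkel is to replace the majority vote with learned per-function weights, so the strong hashtag signal outweighs the ambiguous profanity signal.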

Snorkel has other tutorials which could apply to you, such as validating crowdsourced data and generating similar text or images (data augmentation).

Update: you can now listen to a podcast that TWIML did about Snorkel!

I didn’t see a solution to my #1 problem (suggesting additional keywords for my labeling functions), but if you’re developing a programmatic labeling solution for your project and want to avoid the hurdles of building a SQL database, I would highly recommend Snorkel for your pre-processing and labeling tasks.
