Labeling ML Data with Snorkel

Nick Doiron
2 min read · Nov 2, 2019


In previous posts on the AOC Reply Dataset, I mentioned how hard it is to train a troll detector with Google AutoML or scikit-learn when I don’t want to manually label 110k Tweets. In practice, I would use SQL to find overtly profane keywords, then bundle all Tweets from those same authors into one category.
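That keyword-and-author approach can be sketched in a few lines of plain Python (the keyword list and sample Tweets here are hypothetical placeholders, not the real dataset):

```python
# Hypothetical sketch of keyword-based bulk labeling: flag any author who
# used an overtly profane keyword, then label ALL of that author's Tweets
# together as one category.
tweets = [
    {"author": "a", "text": "great work today"},
    {"author": "b", "text": "you are a damn disgrace"},
    {"author": "b", "text": "resign already"},
]
PROFANE = {"damn", "disgrace"}  # hypothetical keyword list

flagged_authors = {
    t["author"] for t in tweets
    if PROFANE & set(t["text"].lower().split())
}
# 1 = troll, 0 = not troll; every Tweet by a flagged author gets labeled 1
labels = [1 if t["author"] in flagged_authors else 0 for t in tweets]
```

Note that the second Tweet by author "b" gets labeled even though it contains no keyword, which is exactly the author-bundling shortcut described above.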
Since then, I learned that the term for my problem is “weak supervision,” and Snorkel is a leading tool for building a better supervised learning dataset with labeling functions. Recent research around Snorkel includes Snuba, DryBell, and SuperGLUE. Generally useful elements seem to get merged back into the main Snorkel library, so we will stick to that.

The core concept is writing several different labeling functions, which Snorkel then figures out how to combine and weight. For example, a troll Tweet could contain profanity, weird conspiracy theories, certain hashtags, etc. These are all red flags, but in a world of probabilities some carry more weight and meaning than others. Explicit, racist hashtags are almost always going to be used negatively, but profanity can go either way (“keep fucking rocking it”).
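Here is a hand-rolled sketch of that idea: a few labeling functions that each either vote or abstain, combined by a simple majority vote. The keyword triggers are hypothetical, and the real Snorkel library learns weights for each function (via its `LabelModel`) rather than voting naively like this:

```python
# Sketch of the labeling-function concept with a naive majority vote.
# Snorkel's actual LabelModel estimates each function's accuracy instead.
ABSTAIN, NOT_TROLL, TROLL = -1, 0, 1

def lf_profanity(tweet):
    # Weak signal: profanity can be hostile or supportive.
    return TROLL if "fucking" in tweet.lower() else ABSTAIN

def lf_hashtags(tweet):
    # Stronger signal: certain hashtags (hypothetical example).
    return TROLL if "#maga" in tweet.lower() else ABSTAIN

def lf_supportive(tweet):
    # Counter-signal: clearly supportive phrasing (hypothetical example).
    return NOT_TROLL if "rocking it" in tweet.lower() else ABSTAIN

def combine(tweet, lfs=(lf_profanity, lf_hashtags, lf_supportive)):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [vote for vote in (lf(tweet) for lf in lfs) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

With this toy combiner, `combine("check out #MAGA truth")` returns `TROLL` while `combine("you're rocking it")` returns `NOT_TROLL`; the point of Snorkel is to replace the majority vote with learned per-function weights, so the strong hashtag signal outweighs the ambiguous profanity signal.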

Snorkel has other tutorials which could apply to you, such as validating crowdsourced data and generating similar text or images (data augmentation).

Update: you can now listen to a podcast that TWIML did about Snorkel!

I didn’t see a solution to my #1 problem (suggesting additional keywords for my labeling functions), but if you’re developing a programmatic labeling solution for your project and want to avoid the hurdles of building a SQL database, I would highly recommend Snorkel for your pre-processing and labeling tasks.
