Designing Data Fails
I’m creating a website where I share datasets with issues. You don’t want to read through 10,000 rows of data, so the objective is to think like a data scientist: detect and resolve data problems programmatically.
Try it:
I’m starting out with editing some real datasets and placing them in Observable notebooks.
First problem, with one analysis and one query revealing the bug:
Harder problems with election data, some hints, but no solutions provided:
Using Real Datasets
Chad Skelton gave a talk about finding datasets for a data journalism class. One of my main takeaways was to use real data; otherwise the audience will know it’s fake, and the outputs will not be as interesting.
He recommends using datasets that are interesting to the students. I’ll be looking at https://github.com/pplonski/datasets-for-start in the future, but for now I have census and election data.
Designing Fair Problems
The problem should be something that I introduce into the dataset, that is unambiguous once found, and fixable by replacing values, deleting a row, or deleting a column.
For example, a Gender column with four values (‘Male’, ‘Female’, ‘M’, and ‘F’) is doubling up on categories. It can be fixed in easy ways (consolidating into two categories) or more complex ways (opening up a discussion about non-binary gender, often represented as ‘X’, or whether a gender column is necessary at all).
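Here’s a minimal pandas sketch of the easy fix. The sample rows and the `canonical` mapping are made up for illustration; they are not taken from the site’s datasets.

```python
import pandas as pd

# Hypothetical sample; the real datasets are not shown here.
df = pd.DataFrame({"Gender": ["Male", "F", "Female", "M", "Male", "F"]})

# Detect: listing the distinct values reveals the doubled-up categories.
counts = df["Gender"].value_counts()
print(counts)

# Resolve (the easy way): map the abbreviations onto the full labels.
canonical = {"M": "Male", "F": "Female"}
df["Gender"] = df["Gender"].replace(canonical)

print(sorted(df["Gender"].unique().tolist()))
```

Mapping in the other direction (full labels to abbreviations) would be just as valid, which is part of why checking a solution is tricky.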
For numbers: it would be fair to have a Height column with blank values, outliers with absurdly tall or short values, or every value multiplied by about 2.5 because of an inches-to-centimeters conversion (the exact factor is 2.54). It would be unfair to add a few inches to all heights and expect the user to know the expected distribution.
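A rough sketch of how each of the fair Height problems could be spotted with pandas. The heights are invented, the 1.5 × IQR rule is one common outlier heuristic rather than the only choice, and `looks_like_centimeters` with its 100-inch threshold is a hypothetical helper that assumes the column is documented as inches.

```python
import pandas as pd

# Hypothetical heights in inches; the values are invented for illustration.
heights = pd.Series(
    [64.0, 70.5, None, 68.0, 66.5, 690.0, 71.0, 62.0], name="Height"
)

# Blank values: count the missing entries.
n_blank = int(heights.isna().sum())

# Outliers: flag values far outside the interquartile range (1.5 * IQR rule).
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
outliers = heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)]


def looks_like_centimeters(series: pd.Series) -> bool:
    """Hypothetical check for a column that is supposed to be in inches.

    Assumption: no group of people has a median height above 100 inches,
    so a larger median suggests the whole column was converted.
    """
    return bool(series.median() > 100)


print(n_blank)
print(outliers)
print(looks_like_centimeters(heights * 2.54))
```

The unit check only works because the column’s unit is stated somewhere; a shift of a few inches would pass all three checks, which is why I’d call that version unfair.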
If the problem were that all people named ‘Sam’ were assigned the same gender, or that ‘Amaris’ is usually a female name, that would be realistic, but it’s not a fair expectation. The problem should be findable by comparing rows to other rows within the dataset.
Election data problems have some cultural knowledge tied to them, but these can also be rooted out as outliers or unpaired data.
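As a sketch of what “outliers or unpaired data” could look like in practice, here is one way to check it with pandas. The two tables, their column names, and the numbers are all hypothetical; the real election problems may be structured differently.

```python
import pandas as pd

# Hypothetical tables; column names and numbers are invented for illustration.
votes = pd.DataFrame({
    "county": ["Adams", "Baker", "Clark", "Dover"],
    "votes_cast": [4200, 3900, 15100, 2800],
})
registered = pd.DataFrame({
    "county": ["Adams", "Baker", "Clark", "Elgin"],
    "registered_voters": [6000, 5500, 9000, 4100],
})

# Unpaired data: an outer merge with indicator=True adds a "_merge" column
# that marks rows found in only one of the two tables.
merged = votes.merge(registered, on="county", how="outer", indicator=True)
unpaired = merged[merged["_merge"] != "both"]

# Outliers: turnout above 100% is impossible, no cultural knowledge needed.
paired = merged[merged["_merge"] == "both"].copy()
paired["turnout"] = paired["votes_cast"] / paired["registered_voters"]
impossible = paired[paired["turnout"] > 1.0]

print(unpaired[["county", "_merge"]])
print(impossible[["county", "turnout"]])
```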
What environment does the user solve problems in?
Most of the problems should be findable using Pandas or dataframe-js. I thought of a few problems (involving maps and visualizations) which go beyond this.
In any case, I think the user should use a notebook. I started with JS notebooks at Observable, but because data scientists love Python, I plan on looking into nbdev’s Python notebooks.
What happens when I get it right?
When I first thought about this project, it seemed easy. When someone ‘finds’ the problem, they see that they’ve solved it!
As I start making example code, it’s harder to make that experience interesting.
- I could ask for a row ID or column name, but then lazy users might guess values until they find the right one.
- I could ask for a DataFrame with only the ‘problem’ rows, but future problems will cover sample size and bias, and I won’t want people to remove or problematize these underrepresented groups.
- I could expect one ‘correct’ dataset, but users might have different solutions to patch the dataset or to remove rows with incomplete data.
- I could create a colorful data visualization that only ‘looks right’ after the problem is fixed, but this could visually reveal the data problem. Also, data viz is time-consuming!
- I could ask users to program a script, but then I also need server-side execution to validate the solution (like LeetCode), which is hard.
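One possible middle ground for the ‘one correct dataset’ option, sketched below, is to check properties of the submitted DataFrame instead of comparing it to a single answer. This is only an illustration: `check_height_fix`, the `max_dropped` limit, and the sample data are hypothetical, and since it runs inside the user’s own notebook it does not solve the server-side validation problem.

```python
import pandas as pd


def check_height_fix(
    submitted: pd.DataFrame, original: pd.DataFrame, max_dropped: int = 1
) -> list[str]:
    """Return a list of complaints; an empty list means the fix is accepted.

    Hypothetical checker: it accepts both patching and row removal,
    as long as no blanks remain and few rows were deleted.
    """
    complaints = []
    if submitted["Height"].isna().any():
        complaints.append("Height still has blank values")
    if len(original) - len(submitted) > max_dropped:
        complaints.append("too many rows were removed")
    return complaints


# Hypothetical dataset with one blank height.
original = pd.DataFrame({"Height": [64.0, None, 68.0, 70.5, 66.5]})

# Two different but reasonable fixes.
dropped = original.dropna(subset=["Height"])
patched = original.fillna({"Height": original["Height"].median()})

print(check_height_fix(dropped, original))
print(check_height_fix(patched, original))
print(check_height_fix(original, original))
```

Both fixes pass and the untouched dataset fails, but a user could also pass by filling the blank with nonsense, so property checks need careful design too.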
Looking for feedback
If a platform like this already exists, or you’d like to get involved, please let me know!