Reviewing dataset explorer tools

Testing Manifold and Yellowbrick

Nick Doiron
4 min read · Mar 7, 2020

The Holy Grail of exploratory data analysis is one place to drop in a dataset and reveal interesting results. It can’t be that simple: after all, if our data could be understood from 2D scatter plots, we would just use Excel and not machine learning. Also, as I was planning this post, I realized these data tools work only with certain types of problems, and mostly with numeric tables; many of my ideas (IMDB dataset, images, pet adoptions) weren’t compatible.

Anyway, here are some picks out of my exploring-data to-try list:

Manifold

Uber’s Manifold is “a model-agnostic visual debugging tool for machine learning” which can review tabular and geospatial data. Let’s use it with a model estimating the price of Airbnb listings in Amsterdam, with a dataset updated this month on Inside Airbnb.

I tried a simple setup with Pandas, and LinearRegression or XGBRegressor as the predictor. The accuracy of the price model varied a ton when I re-split the train/test data, and not much when I changed the model, so the actual predictive power is nil. In a real situation, I would expand the training set, and add columns with neighborhood information and counts of local features on OpenStreetMap. Anyway, when I happened to draw a set which had an r² of 0.014, I decided we should review what it looks like in Manifold.
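To show what "varied a ton when I re-split" looks like in practice, here is a minimal sketch that refits a LinearRegression on ten different random splits and compares the r² scores. The data is synthetic (a weak signal buried in heavy-tailed noise), not the Inside Airbnb table, and the feature count, sizes, and seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings table (NOT the real Airbnb data):
# five numeric features, only one of which nudges a noisy, skewed "price".
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
price = 150 + 5 * X[:, 0] + rng.lognormal(mean=4.0, sigma=1.0, size=2000)

scores = []
for seed in range(10):
    # Same data and model each time; only the random split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, price, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(r2_score(y_test, model.predict(X_test)))

print(f"r2 across splits: min={min(scores):.3f}, max={max(scores):.3f}")
```

If the spread between splits is as large as the scores themselves, the score is mostly telling you about the split, not the model.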

Extract three CSVs: X_test, y_predict, and y_test (which Manifold calls ground-truth). Keep headers, but remove any Pandas index columns.
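A minimal sketch of that export step with Pandas, assuming the test features live in a DataFrame and the predictions and true prices in Series. The tiny example values, the `manifold_input` folder, and the `price` column name are placeholders of mine, not anything Manifold requires; the part that matters is `index=False`, which keeps the header row but drops the Pandas index.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the real test features, predictions, and prices.
X_test = pd.DataFrame(
    {
        "latitude": [52.37, 52.36],
        "longitude": [4.89, 4.90],
        "room_type": ["Private room", "Hotel room"],
    },
    index=[101, 57],  # leftover shuffled index from a train/test split
)
y_test = pd.Series([120.0, 310.0], index=X_test.index, name="price")
y_predict = pd.Series([135.5, 180.2], index=X_test.index, name="price")

out_dir = Path("manifold_input")  # assumed output folder
out_dir.mkdir(exist_ok=True)

# index=False keeps the header row but leaves out the index column.
X_test.to_csv(out_dir / "X_test.csv", index=False)
y_predict.to_frame().to_csv(out_dir / "y_predict.csv", index=False)
y_test.to_frame().to_csv(out_dir / "y_test.csv", index=False)
```

All three files keep the same row order, so row N in one file lines up with row N in the others.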

Introducing Clusters in Manifold

This took a minute to understand, but it’s important. When grouping all 9,850 price predictions by error, the top 72 are so off-base that they fall into these two clusters, which we then combine to form Group 0. The pink highlight will make it easier to study this unpredictable cluster.
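A rough way to get a feel for this kind of grouping outside Manifold is to cluster rows by their absolute error. The sketch below uses scikit-learn's KMeans on synthetic predictions; it is my assumption about the general idea, not Manifold's actual implementation, and the cluster count and the injected bad predictions are made up for the example. It also keeps only the single worst cluster, where the screenshot above combines two.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic prices and predictions (NOT the Airbnb results).
rng = np.random.default_rng(42)
y_test = rng.uniform(50, 300, size=1000)
y_predict = y_test + rng.normal(0, 20, size=1000)
y_predict[:15] += 2000  # make a handful of predictions wildly wrong

# Cluster rows on a single feature: the absolute error of each prediction.
abs_error = np.abs(y_test - y_predict).reshape(-1, 1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(abs_error)

# Rank clusters by mean error; the worst one is the "unpredictable" group.
cluster_means = {c: abs_error[labels == c].mean() for c in np.unique(labels)}
worst = max(cluster_means, key=cluster_means.get)
group_0 = np.flatnonzero(labels == worst)

print(f"{len(group_0)} of {len(y_test)} rows fall in the highest-error cluster")
```

The row positions in `group_0` can then be used to pull the matching rows out of X_test and look for what they have in common.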

Feature importance in Manifold

This was a cool screen. We can see the pink unpredictable places on the map. Below, see that price is closely related to whether a host has many listings and reviews, and to the room type (shared, private, hotel, or full apartment). We see that the distribution of pink properties usually matches the grey.

latitude and longitude columns (or similar names) give you a pretty map

Inspecting the room_type chart, I notice that the pink group has proportionally more hotel rooms. Maybe these are less typical listings? Manifold recognizes this is categorical, but orders them in a peculiar way from most populous in the grey group, to least populous (at least I think this was the logic?).

One thing that I don’t like is that the density of pink properties in the city center is deceptive: grey dots are super-dense there too, and the map does not visually distinguish a slightly dense area from a super-dense one. After zooming in, it’s possible to see a little more variation:

a closer look at the map

The islet in the left/middle here has four pink dots plus one across a canal, but the islet on the right has none. My instinct is that this is meaningful — is this neighborhood especially more or less desirable? If you look closely you can see a park and some other features which could characterize the neighborhood.

Verdict: I like this tool, but I want to revamp the maps: basemap, density layer, and maybe different colors for unexpectedly-high and unexpectedly-low prices.

Yellowbrick (SciKit-Learn)

District Data Labs makes Yellowbrick, “a suite of visual diagnostic tools… that extend the scikit-learn API to allow human steering of the model selection process.”
I set up a new journal to create train and test sets from the Austin Animal Center Outcomes dataset and followed a process similar to their model selection tutorial to create this multidimensional plot.

ParallelCoordinates visualization

I don’t really know the ins and outs of multi-dimensional data plots, so I’d like to try this with some more established demo datasets to understand how to read it.
Their model-selection stuff looks good:

directly from their tutorial

Overall, Yellowbrick seems more like a shared toolbox or a TensorBoard for scikit-learn, rather than a place to drop some data in and get back results.
