Geodata and Airbnb price predictions
Making predictions with AutoKeras
In an earlier post, I used Uber’s Manifold dashboard to review feature importance and variance on an Airbnb rental-price dataset, and to map the least predictable rows of the dataset.
In addition to some visualization changes, I want to try adding data from OpenStreetMap to improve predictions. I’ve done projects with OSM data before, but often in countries where OSM coverage was not uniform, or while supplying data to a black-box model.
Associating OSM data
There are many OSM download services, including Geofabrik and the HOT Export Tool, but I used Interline’s OSM Extracts, a service previously hosted by Michal Migurski and Mapzen.
I can compare these OSM features to the location of each Airbnb listing: specifically, counts of nearby restaurants, hotels, and other features which could give the model the je ne sais quoi of ‘desirability’ or ‘tourist appeal’ that influences the price of a listing.
The Amsterdam GeoJSON file consists of over 5.1 million points, lines, and polygons, each on a separate line. It’s not feasible to load this all into memory at once with geopandas.read_file or json.loads, nor is it necessary. We can read line by line, strip any trailing comma or newline, and only add elements to our dataframe if they are features that we want to keep.
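A minimal sketch of that streaming reader, assuming one feature per line in the file; the tag-matching rules here are illustrative placeholders, not the full tag list from my script:

```python
import json

# Illustrative mapping from (property key, value) to our own label names.
WANTED = {
    ("amenity", "restaurant"): "restaurant",
    ("tourism", "hotel"): "hotel",
}

def filter_features(path):
    """Stream a GeoJSON file with one feature per line, keeping only
    features whose properties match a wanted tag."""
    kept = []
    with open(path) as fh:
        for line in fh:
            line = line.strip().rstrip(",")  # drop newline and trailing comma
            try:
                feature = json.loads(line)
            except json.JSONDecodeError:
                # The FeatureCollection wrapper lines are not valid JSON alone.
                continue
            if feature.get("type") != "Feature":
                continue
            props = feature.get("properties") or {}
            for (key, value), label in WANTED.items():
                if props.get(key) == value:
                    feature["label"] = label
                    kept.append(feature)
                    break
    return kept
```

This keeps peak memory proportional to the features you care about, not to the 5.1 million lines in the file.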
I verified that all of my tags exist in the GeoJSON. Considering it’s Amsterdam, it’s no surprise that the ‘bike’ tag is popular:
{'museum': 275, 'fuel': 712, 'religion': 1,245, 'school': 1,886, 'camp': 713, 'sight': 1,303, 'parking': 19,268, 'hotel': 1,044, 'bank': 247, 'bike': 30,382, 'restaurant': 6,819, 'hostel': 78, 'big_store': 1,115, 'info': 1,327, 'tower': 5,165, 'bench': 7,955, 'small_store': 416, 'hair': 833, 'hardware': 70, 'car': 187, 'clothes': 1,824, 'building': 2,756,604, 'pier': 4,525, 'corp': 414, 'realestate': 108, 'govt': 45, 'bridge': 16,082, 'residential': 1,916, 'forest': 81,294, 'farm': 11,151}
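Given point features like these, the per-listing counts (e.g. restaurants within a kilometer) can be sketched with a plain haversine distance. This brute-force loop is far too slow for millions of features; in practice a spatial index (PostGIS, or a GeoPandas spatial join) does the real work, but it shows the calculation:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in metres."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def count_within(listing, features, radius_m=1000):
    """Count how many (lat, lon) features lie within radius_m of a listing."""
    lat, lon = listing
    return sum(1 for flat, flon in features
               if haversine_m(lat, lon, flat, flon) <= radius_m)
```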
GeoPandas was kind of a roundabout way to do this; if I were redoing it, I would use PostGIS again.
Wait — how should distance work to a polygon?
For my first run of the script, I am only reading point geometries. For a second run, it’d make sense to pick a central point from lines and polygons.
Could you also add the distance to the nearest {school/hotel/farm}?
Totally possible. In practice, there is some noise in the coordinates that I have for the Airbnbs, and a risk of overfitting (if two listings are 107m away from the nearest restaurant, and 225m from the post office, likely they are in the same building with similar prices).
Predictions with Neural Networks
After seeing a presentation at ScaledML and getting abysmal results with traditional methods, I’m graduating from scikit-learn to AutoKeras. TensorFlow 2 is a dependency, so on Google Colab you will need to upgrade and restart the runtime first.
AutoKeras accepts a data table and does train_test_split internally to compare trial runs during its neural architecture search. I need to know the validation split afterward, to compare the accuracy of predictions in Manifold, so I use the validation_data parameter to set it explicitly.
The regression prediction class, StructuredDataRegressor, doesn’t have documentation at the moment. I am thinking about adding this as an example! Anyway, if you’re familiar with any other machine learning library, this should look familiar:
import autokeras as ak

clf = ak.StructuredDataRegressor(max_trials=10)
clf.fit(X_train, y_train, validation_data=(X_test, y_test))
y_predict = clf.predict(X_test)
Extract X_test, y_test, and y_predict, and upload them into Uber’s Manifold.
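A sketch of that export step, assuming a simple CSV layout; Manifold’s demo accepts CSV uploads, but the file names and column headers below are my assumptions, not a documented format:

```python
import csv

def export_for_manifold(header, x_rows, y_true, y_pred, prefix="manifold"):
    """Write the held-out features and the actual vs. predicted targets
    to CSV files for upload. File names and columns are assumptions."""
    with open(f"{prefix}_features.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(x_rows)
    with open(f"{prefix}_targets.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["actual", "predicted"])
        writer.writerows(zip(y_true, y_pred))
```

Note that AutoKeras returns predictions as a 2-D array, so flatten y_predict before zipping it against the ground truth.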
[Manifold chart: listings fall into segment_3 and segment_2; the 28% in segment_2 skew lower]

Evaluating the Prediction
Did OpenStreetMap make prices more predictable? Do the features make sense?
After adding the OpenStreetMap data, all of the most predictive features come from it. It makes sense that the value of an apartment is connected with restaurants, hotels, and clothing stores. With additional data from Yelp, we would be able to tell the difference between high-end, well-reviewed restaurants and low-end restaurants, fitting closer to the neighborhood value.
Look at the restaurants feature. I notice that the grey, more-predictable distribution peaks low, and the pink, less-predictable distribution peaks around 94 restaurants within a kilometer radius. Does this mean that the pink listings are deeper in Amsterdam’s tourist center? Do these less-predictable apartments skew overpriced or underpriced? I can’t really say.
I also liked that restaurants, hotels, and clothes appeared in the top features, and not something without personal appeal, like bridges.
By valuing Airbnbs by nearby hotels, aren’t you using them as a proxy for tourism? And by valuing neighborhoods with high rents, won’t you continue to drive up values and speculation in those neighborhoods?
If you’re planning to invest in or predict real estate values, you can use calculations like this, and you can also look at recent sales.
If you’re looking for your own home, there’s no end to the factors you might consider: what the houses look like, whether it’s close to your future school or office, whether you can go for a run or a swim… And many of us wouldn’t want to live somewhere that’s mobbed daily by tourists, or known for high crime.
What I’m trying to say is: prices are predictable, but value is complicated and personal.
On a lighter note: next time you do predictive analytics, consider AutoKeras and Manifold! They work well together!