Near Future Predictions for Machine Learning
Three predictions for 2020–2022
1. The test set should actually be larger than the training set
Currently, most machine learning guides tell you to use train_test_split
to divide your data 80%–20%, with the larger portion going to the training set.
It’s intuitive: we want the machine learning model to be built with as much knowledge of our dataset as possible, and the data left out for the testing phase is ‘sacrificed’ so we can compare the accuracy of the trained models.
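For reference, the conventional split that most tutorials recommend looks something like this (a minimal sketch using scikit-learn, with placeholder X and y arrays standing in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 1,000 records with 20 features each
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# The usual advice: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```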
As a thought experiment, suppose you have a million records to train a spam filter. You use a randomly-selected 80% to train three models, and are about to compare results to choose the best option. Then your boss comes down and presents you with a million new records. What do you do next?
- Continue to compare the trained models on the original test set
- Retrain the models with 80% of the old data and 80% of the new data
- Compare the trained models by testing on 100% of the new dataset
Your end goal is to make a spam filter which continues to work on new spam messages, so I would use this new batch of data as the ‘test set’.
If the first million records cannot train a model that fits the second million, then training on either dataset, or even on a combined one, is unlikely to fix the problem.
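A minimal sketch of that choice, assuming three already-fitted scikit-learn-style classifiers and a hypothetical load_new_records helper (neither is from the original post), would be to score every candidate on 100% of the new batch:

```python
from sklearn.metrics import accuracy_score

# Hypothetical: the new million records become the entire test set
X_new, y_new = load_new_records()  # assumed helper for loading the new batch

candidates = {"model_a": model_a, "model_b": model_b, "model_c": model_c}

# Evaluate every already-trained candidate on 100% of the new data
for name, model in candidates.items():
    preds = model.predict(X_new)
    print(name, accuracy_score(y_new, preds))
```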
My theory is that in the near future we’ll be using AutoML/NAS libraries to test many different models, and we’ll always make our test set bigger, chronologically more recent, or otherwise harder than our training data. If our model is good enough, it should be able to handle it! And if we don’t have enough data for a 50–50 or 20–80 split, maybe we don’t have enough data to build lasting machine learning solutions in the first place?
2. People will fight a lot about weight-agnostic and random-weight neural networks
Two papers that stuck in my memory from 2019 were explorations of what neural networks actually are and how they work.
Weight Agnostic Neural Networks, from Google Brain:
In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training.
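The core idea can be sketched in a few lines: keep the topology fixed and evaluate it with a single shared weight value, asking whether the wiring alone encodes useful behaviour. This toy version is my own illustration, not the paper’s code, and uses binary connectivity masks with a tanh activation:

```python
import numpy as np

def forward(x, architecture, shared_weight):
    """Run a fixed architecture where every connection uses one shared weight."""
    h = x
    for mask in architecture:  # each layer is just a binary connectivity mask
        h = np.tanh((mask * shared_weight) @ h)
    return h

# Toy architecture: two layers of fixed 0/1 connectivity, no trained weights
rng = np.random.default_rng(0)
architecture = [rng.integers(0, 2, size=(8, 4)), rng.integers(0, 2, size=(2, 8))]
x = rng.random(4)

# The weight-agnostic question: does the architecture work across many shared weights?
for w in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]:
    print(w, forward(x, architecture, w))
```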
What’s Hidden in a Randomly Weighted Neural Network, from Allen Institute for AI and University of Washington:
…we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these “untrained subnetworks” exist, but we provide an algorithm to effectively find them.
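In that framing, a “subnetwork” is nothing more than a binary mask applied over fixed random weights; the search is over the mask, never the weights. The sketch below uses random scores as a stand-in for the scores that the paper’s algorithm actually learns, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)

# A randomly weighted layer whose values are never trained
W = rng.standard_normal((64, 32))

# Keep the top 50% of connections by score; in the real algorithm the
# scores are learned, here they are random placeholders
scores = rng.standard_normal(W.shape)
threshold = np.quantile(scores, 0.5)
mask = (scores >= threshold).astype(W.dtype)

x = rng.standard_normal(32)
hidden = np.maximum((W * mask) @ x, 0.0)  # ReLU over the masked random layer
print(mask.mean(), hidden.shape)          # ~50% of weights kept, output shape (64,)
```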
There is related neural network architecture research, like Facebook AI’s lottery ticket work, which is also interesting.
We still train neural networks as if these papers never happened, but with deeper understanding and hardware optimization, these counterintuitive techniques could turn out to be faster and better. I wonder whether people who subscribe to these methods will come into conflict with the current approach, or simply be seen as philosophically different, like ‘Bayesians’ in statistics.
3. EU will pass XAI rules with industry impact greater than GDPR
Some initial regulation is going to happen very soon. What might it look like?
Small regulation: Explainable AI (XAI) for public-facing systems, federated learning for health data
Global intervention: GDPR-inspired protections letting people opt out of, or remove their information from, AI systems, meaning companies must frequently re-train those systems. Complex rules around gender/race/ethnicity, both in what can be stored and in what values must be supported.
Full chaos: Banning entire fields (killing for-profit facial recognition, dynamic pricing on airlines and hotels). Federated learning for any personal information.
Will it end up being bigger than GDPR? I wanted the prediction to sound cool, so, yes.