Near Future Predictions for Machine Learning
Three predictions for 2020–2022
1. The test set should actually be larger than the training set
Currently, most machine learning guides tell you to use train_test_split
to divide your data 80%–20%, with the larger portion going to the training set.
It’s intuitive: we want the machine learning model to be built with as much knowledge of our dataset as possible, and the data left out for the testing phase is ‘sacrificed’ so we can compare the accuracy of the trained models.
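For reference, the conventional split that most tutorials recommend looks something like this (a minimal sketch using scikit-learn, with placeholder X and y arrays standing in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 1,000 records with 20 features each
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# The usual advice: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```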
As a thought experiment, suppose you have a million records to train a spam filter. You use a randomly-selected 80% to train three models, and are about to compare results to choose the best option. Then your boss comes down and presents you with a million new records. What do you do next?
- Continue to compare the trained models on the original test set
- Retrain the models with 80% of the old data and 80% of the new data
- Compare the trained models by testing on 100% of the new dataset
Your end goal is to make a spam filter which continues to work on new spam messages, so I would use this new batch of data as the ‘test set’.
If the first million records cannot train a model that fits the second million, then training on either dataset, or even on a combined one, is unlikely to fix the problem.
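A minimal sketch of that choice, assuming three already-fitted scikit-learn-style classifiers and a hypothetical load_new_records helper (neither is from the original post), would be to score every candidate on 100% of the new batch:

```python
from sklearn.metrics import accuracy_score

# Hypothetical: the new million records become the entire test set
X_new, y_new = load_new_records()  # assumed helper for loading the new batch

candidates = {"model_a": model_a, "model_b": model_b, "model_c": model_c}

# Evaluate every already-trained candidate on 100% of the new data
for name, model in candidates.items():
    preds = model.predict(X_new)
    print(name, accuracy_score(y_new, preds))
```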
My theory is that in the near future we’ll be using AutoML/NAS libraries to test many different models, and we’ll always make our test set bigger, chronologically more recent, or otherwise harder than our training data. If our model is good enough, it should be able to handle it! And if we don’t have enough data for a 50–50 or 20–80 split, maybe we don’t have enough data to build lasting machine learning solutions in the first place?
2. People will fight a lot about weight-agnostic and random-weight neural networks
Two papers that stuck in my memory from 2019 were explorations of what neural networks actually are and how they work.
Weight Agnostic Neural Networks, from Google Brain:
In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training.
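The core idea can be sketched in a few lines: keep the topology fixed and evaluate it with a single shared weight value, asking whether the wiring alone encodes useful behaviour. This toy version is my own illustration, not the paper’s code, and uses binary connectivity masks with a tanh activation:

```python
import numpy as np

def forward(x, architecture, shared_weight):
    """Run a fixed architecture where every connection uses one shared weight."""
    h = x
    for mask in architecture:  # each layer is just a binary connectivity mask
        h = np.tanh((mask * shared_weight) @ h)
    return h

# Toy architecture: two layers of fixed 0/1 connectivity, no trained weights
rng = np.random.default_rng(0)
architecture = [rng.integers(0, 2, size=(8, 4)), rng.integers(0, 2, size=(2, 8))]
x = rng.random(4)

# The weight-agnostic question: does the architecture work across many shared weights?
for w in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]:
    print(w, forward(x, architecture, w))
```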
What’s Hidden in a Randomly Weighted Neural Network, from Allen Institute for AI and University of Washington:
…we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these “untrained subnetworks” exist, but we provide an algorithm to effectively find them.
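In that framing, a “subnetwork” is nothing more than a binary mask applied over fixed random weights; the search is over the mask, never the weights. The sketch below uses random scores as a stand-in for the scores that the paper’s algorithm actually learns, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)

# A randomly weighted layer whose values are never trained
W = rng.standard_normal((64, 32))

# Keep the top 50% of connections by score; in the real algorithm the
# scores are learned, here they are random placeholders
scores = rng.standard_normal(W.shape)
threshold = np.quantile(scores, 0.5)
mask = (scores >= threshold).astype(W.dtype)

x = rng.standard_normal(32)
hidden = np.maximum((W * mask) @ x, 0.0)  # ReLU over the masked random layer
print(mask.mean(), hidden.shape)          # ~50% of weights kept, output shape (64,)
```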
There is related neural network architecture research, like Facebook AI’s lottery ticket work, which is also interesting.
We still train neural networks as if these papers never happened, but with deeper understanding and hardware optimization, these counterintuitive techniques could turn out to be faster and better. I wonder whether people who subscribe to these methods will come into conflict with the current approach, or simply be seen as philosophically different, like ‘Bayesians’ in statistics.
3. EU will pass XAI rules with industry impact greater than GDPR
Some initial regulation is going to happen very soon. What might it look like?
Small regulation: Explainable AI (XAI) for public-facing systems, federated learning for health data
Global intervention: GDPR-inspired protections letting people opt out of, or remove their information from, AI systems, meaning companies must frequently re-train those systems. Complex rules around gender/race/ethnicity, both in what can be stored and in what values must be supported.
Full chaos: Banning entire fields (killing for-profit facial recognition, dynamic pricing on airlines and hotels). Federated learning for any personal information.
Will it end up being bigger than GDPR? I wanted the prediction to sound cool, so, yes.