I feel that techniques for making machine learning work with small data are not talked about enough. This makes sense: many ML applications are only possible because a huge amount of data was collected, and running Kaggle competitions on small datasets would basically amount to running a “guess the random number” competition. Evaluating the performance of any particular “small data” approach would also require quite a lot of small datasets, which would be a hassle to work with, so no one really bothers.
Also, companies just love to boast about how big their data is. But there are many legitimate reasons why you might only be able to work with a small dataset:
- Data is too expensive to acquire
- There does not exist more data than is already collected
- Data goes out of date so fast that only a small quantity is ever usable for training
I think there are basically two approaches/philosophies to dealing with this problem:
1.) SCALE IS ALL YOU NEED – AGI IS COMING.
While there might not be enough data for the particular problem you are trying to solve, there might be a lot of data in slightly adjacent problems you can capitalize on. This is what meta-learning, few-shot learning, transfer learning and multi-task learning are all about. There are myriad techniques you can use here. If you want an overview, you can watch these free Stanford lectures by Chelsea Finn, one of the leading meta-learning researchers. But maybe you don’t really need to? It seems to me that Transformers are slowly emerging as a very general architecture that can handle all kinds of problems, given enough compute and data.
An example of this kind of approach is using a large language model that was trained on terabytes of text to solve your particular problem of estimating how angry your customers’ emails are. This is a “small data” problem because the labeled dataset is of limited size and you really don’t want to generate any more training samples.
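To make the pattern concrete, here is a minimal, dependency-free sketch of the “big pretrained model plus tiny trained head” idea. Everything in it is illustrative: `pretrained_features` is a hypothetical stand-in for frozen embeddings you would get from an actual pretrained language model (e.g. via a library like Hugging Face), and only the small logistic-regression head is trained on the handful of labeled emails.

```python
import math

def pretrained_features(text):
    # Stand-in for a frozen pretrained model's embeddings: a few cheap
    # hand-crafted signals that correlate with "angry" emails. In a real
    # setup this function would call the pretrained model and return its
    # (frozen) embedding vector.
    words = text.split()
    return [
        text.count("!") / max(len(text), 1),                   # exclamation density
        sum(w.isupper() for w in words) / max(len(words), 1),  # SHOUTING ratio
        1.0,                                                   # bias term
    ]

def train_head(samples, epochs=500, lr=0.5):
    """Train only a tiny logistic-regression head; the features stay frozen."""
    w = [0.0] * 3
    for _ in range(epochs):
        for text, label in samples:
            x = pretrained_features(text)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            # Gradient step on the cross-entropy loss.
            w = [wi + lr * (label - p) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, text):
    x = pretrained_features(text)
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# A genuinely "small data" training set: four labeled emails.
train = [
    ("WHERE IS MY REFUND?!!!", 1),
    ("THIS IS UNACCEPTABLE!!", 1),
    ("Thanks for the quick reply.", 0),
    ("Could you send the invoice again?", 0),
]
w = train_head(train)
```

The point of the sketch is the division of labor: the expensive representation comes from elsewhere, and the only thing your four labeled examples have to pin down is a three-parameter head.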
But the same approach can also be used for something like time-series data. Here, an adjacent problem could be old time series that are no longer current due to data drift. Or you could create synthetic data using augmentation techniques.
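Two of the most common time-series augmentations are jittering (adding small noise to every point) and magnitude scaling (multiplying the whole series by one random factor). A minimal sketch, with function names and noise levels that are purely illustrative:

```python
import random

random.seed(0)  # reproducibility for the sketch

def jitter(series, sigma=0.05):
    """Add small independent Gaussian noise to every point."""
    return [x + random.gauss(0.0, sigma) for x in series]

def scale(series, sigma=0.1):
    """Multiply the whole series by one random factor near 1."""
    factor = random.gauss(1.0, sigma)
    return [x * factor for x in series]

def augment(series, n_copies=4):
    """Turn one series into several plausible synthetic variants."""
    return [scale(jitter(series)) for _ in range(n_copies)]

original = [1.0, 1.2, 0.9, 1.4, 1.1]
synthetic = augment(original)  # four variants of the same shape
```

Whether such variants count as realistic extra data depends on the domain; the noise levels effectively encode your assumption about what kinds of perturbations leave the label unchanged.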
2.) I’m something of a large language model myself
Of course, there might also be situations where you don’t even have access to data from adjacent problems. In these dire situations, you only have one recourse: instead of letting the model decide how predictions should be made, you go back to deciding yourself how the model should make predictions. This is the traditional way statisticians have worked for decades. One tool I had a good experience with is TensorFlow Lattice. It lets you inject your domain knowledge by specifying constraints like monotonicity, convexity and pairwise trust. This is very handy because you can keep your model quite flexible while preventing overfitting.
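TensorFlow Lattice itself requires TensorFlow, so as a dependency-free illustration of the core idea — enforcing monotonicity as a hard constraint rather than hoping the model learns it — here is a sketch of isotonic regression via the pool-adjacent-violators algorithm. This is a stand-in for the concept, not the TensorFlow Lattice API:

```python
def isotonic_fit(y):
    """Fit the best non-decreasing sequence to y (least squares),
    using the pool-adjacent-violators algorithm."""
    # Each block holds [mean, weight]; merge whenever order is violated.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for m, w in blocks:
        fitted.extend([m] * w)
    return fitted

noisy = [0.1, 0.4, 0.3, 0.8, 0.7, 1.0]
fitted = isotonic_fit(noisy)  # roughly [0.1, 0.35, 0.35, 0.75, 0.75, 1.0]
```

The fitted curve is guaranteed monotone no matter how noisy the handful of observations is — which is exactly the kind of guarantee that strong domain assumptions buy you on small data.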
This way of working is a lot of fun because you get to bring your unique insights into the problem to bear. But there is also a risk: your assumptions about the problem could be wrong. Or the nature of the problem slowly changes over time, and that can’t be fixed by retraining because of the strong assumptions baked into the model.
What are your favourite approaches when working with small data? Do you know any cool libraries that make it easier? Let me know in the comments!