Everybody thinks everything is (or needs) Machine Learning. Teaching cameras on cars to detect traffic patterns? Machine Learning. Trying to find out what the key drivers of home prices are? Machine Learning. Want to determine correlations between KPIs for a financial presentation? Machine Learning. Want to find what percentile your 40-yard dash time would fall in compared to NFL athletes? Machine Learning. Though all of these problems are interesting and worth exploring, only the first truly warrants the use of Machine Learning (in my opinion). In this post, I will attempt to explain my dividing line succinctly and discuss which types of problems warrant each paradigm.
The big distinction for me is whether you want to know the HOW/WHY of your problem, or whether you want the most ACCURATE output possible.
If you are interested in the behind-the-scenes connections that drive your problem, or in summarizing data relevant to it, you are probably looking to apply STATISTICS. Statistics is very useful for identifying meaningful relationships and trends in data that can enhance discussions or point to influential variables. Statistical approaches are a necessity in many natural science fields because they allow researchers to analyze the underlying causes of various outcomes and conditions. They encompass rigorously proven methods for establishing significance that lend various phenomena to explanation and description. Because of the formality of these approaches, most come with strict assumptions that must be checked to ensure the validity of the results. They are typically fast techniques that do not require much tuning, and they can work on smaller datasets (provided the assumptions are still satisfied). Statistics is a very powerful tool set for exploring and explaining data and assigning confidence to the underlying connections.
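To make the "HOW/WHY" side concrete, here is a minimal sketch of the kind of interpretable output a statistical approach produces. The numbers are made up for illustration, and the hand-rolled closed-form least-squares fit stands in for what a statistics package would give you (along with p-values and confidence intervals):

```python
# Hypothetical toy data: square footage vs. sale price (in $1000s).
sqft = [850, 1200, 1500, 1800, 2100, 2400]
price = [155, 210, 255, 300, 340, 390]

def simple_linear_regression(x, y):
    """Closed-form least squares for one predictor: y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance(x, y) divided by variance(x)
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    # R^2: the share of variance in y explained by x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return b0, b1, r2

b0, b1, r2 = simple_linear_regression(sqft, price)
print(f"Each extra square foot adds about ${b1 * 1000:.0f} (R^2 = {r2:.3f})")
```

The slope is the whole point here: it directly answers "how much does one more square foot change the price?", which is exactly the kind of question statistics is built to answer with interpretable, checkable quantities.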
If you want the most accurate prediction possible, then you are more likely to find a satisfactory solution by applying MACHINE LEARNING. Machine Learning is all about pattern recognition and exploitation. It borrows techniques from many different fields, including signal processing, optimization, linear algebra, and calculus, to recognize the similarities and dissimilarities within data sets and leverage them to group, classify, regress, recommend, and so on. Machine Learning methods are often called “black box” techniques, especially when people are talking about Deep Learning, a very powerful family of Machine Learning techniques. I am not crazy about this description, because it implies Machine Learning is some evil mysterious power that nobody can tame. The mathematics that drive Deep Learning are not outrageously complicated; most people can understand the basics of the approach with basic knowledge of calculus and linear algebra. The “black box” description is accurate in that when you use most Machine Learning techniques, it can be unclear which variables are affecting the output. In Machine Learning, we lose the interpretability of the relationships between variables. There is no “key driver” or “cause” when an image recognition network identifies a stop sign; the image simply contains data that, once encoded by the algorithm, is similar to other data that contained stop signs. The algorithms can only learn from the data they are fed. (This has led to some very bad PR for Deep Learning. Machine Learning is not inherently racist or sexist; the data that Machine Learning models are fed can be.) These models need tons of data, but given the right inputs and configurations, their performance can be mind-blowing.
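To illustrate the "similar to other data" idea in the simplest possible form, here is a toy nearest-neighbor classifier in pure Python. The feature vectors and labels are invented for this sketch, and a real image network learns its encodings rather than using hand-made points, but the core move is the same: a new input gets the label of the training example it most resembles, with no single "key driver" to point at:

```python
import math

# Hypothetical labeled examples: feature vectors with known classes.
# In a real network these would be learned encodings of images; here
# they are just hand-made 2-D points for illustration.
training = [
    ((1.0, 1.0), "stop_sign"),
    ((1.2, 0.9), "stop_sign"),
    ((5.0, 5.0), "yield_sign"),
    ((5.3, 4.8), "yield_sign"),
]

def nearest_neighbor(point, examples):
    """Label a new point with the class of its most similar example."""
    _, label = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return label

print(nearest_neighbor((1.1, 1.1), training))  # lands near the stop_sign examples
```

Nothing here "explains" why the point is a stop sign; the prediction exists only because the input is close to data that carried that label, which is also why biased training data produces biased predictions.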
There is often a trade-off in this decision: if you select a Deep Neural Net to predict real estate prices, you will probably not be able to identify the most significant variable driving up a home's value, quantify its effect, or explain the model to your boss. If you use linear regression to try to recognize traffic signals and make decisions for an autonomous car, I hope you have good airbags. The scale of the data is also important; if you are doing your analyses in Excel, there is no reason to throw your data into a Neural Network and see what happens. There is a place for both approaches; I often find myself using statistical approaches to initially explore a data set and formalize the right question of interest, then deciding whether the interpretability or the performance of the output is most important. This is by no means an all-encompassing definition of either approach, but hopefully it illuminates the differences between them and how they can be best leveraged.