Monday, January 20, 2014

Statistical modeling: two ways to see the world.

This is a machine-learning-vs-traditional-statistics kind of blog post, inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're thinking "I've had enough of this machine learning vs. statistics discussion, but I would love to see beautiful beamer slides with an awesome font", then jump to the bottom of the post for my slides on this subject plus the source code.

I prepared presentation slides about the paper for a university course. Leo Breiman basically argued that there are two cultures of statistical modeling:
  • Data modeling culture: You assume you know the underlying data-generating process and model your data accordingly. For example, if you choose to model your data with a linear regression model, you assume that the outcome y is normally distributed given the covariates x. This is the typical procedure in traditional statistics.
  • Algorithmic modeling culture: You treat the true data-generating process as unknown and try to find a model that is good at predicting the outcome y given the covariates x. So you basically try to find a function of the covariates x that minimizes the loss between prediction and the true outcome y. This culture is associated with machine learning. A small code sketch contrasting the two cultures follows this list.
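
To make the contrast concrete, here is a minimal sketch in Python with scikit-learn. The data is synthetic and every model choice is illustrative, not taken from Breiman's paper: the linear model stands in for the data modeling culture, the random forest for the algorithmic one.

    # Two cultures on the same synthetic data; everything here is
    # illustrative, nothing is taken from the paper itself.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=500)

    # Data modeling culture: assume y = X beta + Gaussian noise,
    # then interpret the estimated coefficients.
    lm = LinearRegression().fit(X, y)
    print("estimated coefficients:", lm.coef_)
    print("linear model MSE:", mean_squared_error(y, lm.predict(X)))

    # Algorithmic modeling culture: treat the process as unknown and
    # search for any function of x that predicts y well.
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    print("random forest MSE:", mean_squared_error(y, rf.predict(X)))

On this made-up data the forest picks up the sin() term that the linear model cannot express, which is exactly the kind of situation Breiman had in mind.
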
Breiman was a supporter of the algorithmic modeling culture and argued that in many cases it is superior to the other. I recommend reading the paper when you have a quiet hour. I think it is compulsory reading for everyone who seriously analyses data.

The paper is 13 years old, and I guess most people in the statistics community still live the data modeling culture (at least at the statistics department where I study). But the world is not black and white: there are also professors and research associates working in the field of the algorithmic modeling culture: extension of boosting algorithms to a more flexible model class, introduction of permutation tests to random forests, study of the effect of different cross-validation schemes, and so on. Not only can traditional statistics be enhanced by machine learning; there is also a need to bring more statistics into machine learning. I think there is a lot of room for mutual benefit.
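
One of these crossover ideas is easy to demonstrate: permutation-based variable importance for a random forest. The following is only a sketch, using scikit-learn's permutation_importance on made-up data; it shuffles each feature in turn and measures how much the model's score drops.

    # Illustrative sketch of permutation-based importance for a
    # random forest; the dataset is synthetic.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    X, y = make_regression(n_samples=500, n_features=5,
                           n_informative=2, random_state=2)
    rf = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)

    # Shuffle each feature and record the drop in score; a large drop
    # means the forest relied on that feature.
    result = permutation_importance(rf, X, y, n_repeats=10, random_state=2)
    for i, imp in enumerate(result.importances_mean):
        print(f"feature {i}: importance = {imp:.3f}")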

My personal opinion on this subject is very pragmatic. I now use predictive accuracy more often as an evaluation of model goodness and have added random forests, (tree) boosting and others to my toolkit. But sometimes it is okay to pay some MSE, AUC or Gini to replace a complex random forest with a glm that has a nice interpretation. Even if you assumed the wrong data-generating process, the conclusions you draw from the fitted model are not fatally wrong in most cases.
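
To illustrate what "paying some AUC" looks like, here is a small, purely illustrative comparison: a random forest versus a logistic regression glm, both scored with cross-validated AUC on a synthetic dataset.

    # Hedged sketch: how much AUC do we "pay" for swapping a random
    # forest for an interpretable glm? Data and numbers are made up.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

    rf = RandomForestClassifier(n_estimators=200, random_state=1)
    glm = LogisticRegression(max_iter=1000)

    for name, model in [("random forest", rf), ("glm (logistic)", glm)]:
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean CV AUC = {auc:.3f}")

On well-behaved tabular data the gap is often small, which is the point: whether it is worth paying depends on how much you value the glm's coefficients.
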

So here are the slides. They contain notes, so they should be easy to follow.
If you want to know how to reproduce them, please visit my GitHub account.
