The author Leo Breiman (who you may know from such greatest hits as bagging and random forests) talks about going from stats academia to working in industry as a statistical consultant. When he eventually returned to academia, he experienced a sort of reverse culture shock
He frames academia and industry as different cultures of statistical modeling. Both share the goal of understanding the relationship between some input variables X and an output variable Y. The true relationship is unknown, a black box, but statisticians in both camps aim to approximate it.
The paper describes the data modeling (academic stats) culture as assuming that the black box is fundamentally orderly, stochastic, and parametric. The other culture (algorithmic modeling) isn't terribly interested in what's in the black box, focusing instead on predictive accuracy.
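To make the contrast concrete, here's a minimal sketch of my own (not from the paper or this thread), assuming scikit-learn and a made-up synthetic dataset: a parametric "data model" and an "algorithmic model" fit to the same data, both judged the way the algorithmic culture would, by error on a held-out test set.

```python
# Toy illustration of the two cultures (my example, not Breiman's):
# a parametric "data model" vs. an "algorithmic model", both judged
# by predictive accuracy on a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "black box": y depends on X nonlinearly, plus noise.
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: posit a simple stochastic, parametric form
# (linear in the inputs) and interpret its coefficients.
data_model = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: don't posit a form for the black box;
# let a flexible learner approximate it and judge it by test error.
algo_model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

for name, model in [("linear (data model)", data_model),
                    ("random forest (algorithmic)", algo_model)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.3f}")
```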
After spending time in industry, Breiman finds himself more sympathetic to algorithmic modeling culture, and he gets into plenty of formal and technical reasons why. I'm not gonna get into them in a tweet thread, but I recommend checking out the paper if you're interested in them
Some of his reflections from working in industry resonate strongly with me:
* Focus on finding a good solution; that's what consultants get paid for
* Live with the data before you plunge into modeling
It kind of caught me off guard to hear a statistician in industry say:
* Search for a model that gives a good solution, either algorithmic or data
* Predictive accuracy on test sets is the criterion for how good the model is
The first of those two is a little less surprising to me, given that "a good solution" could mean a lot of different things, but the second both makes a lot of sense to me and is weird to think about.
That's probably because of the type of DS I am. I've done some forecasting and productionized precious few ML models, but I've mostly modeled to do what Breiman describes as "extracting information about how nature is associating response variables to input variables"
I perceive caring about accuracy first as ML eng territory, and I perceive MLE as stemming directly from computer science. This is sort of silly of me, given that "data scientist" has been an overloaded junk title for most of the time I've held it (and if I'm honest still is)
Specialization within the DS world is still emerging, and while some of the boundaries between types of DS roles have gotten sharper in the last few years, they're all still coming from the same lineage
"Terrabytes of data are pouring into computers from many sources, both scientific, and commercial, and there is a need to analyze and understand the data," Breiman says, like a prophet foretelling HBR articles to come
And reflecting on his work in the '90s (!!!), he was already seeing how cross-functional this line of work can be: "there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines"
The problems we're solving with data today are greater in scale and complexity, which means we're getting the luxury of focusing on narrower subsets of those problems. Now we ask for analysts, scientists, MLEs, analytics engineers, etc. instead of just DSes or statisticians