Applying engineering best practices to data science is a well-intentioned effort, but it has to be done with care. The raw materials, goals, and organizational roles of the two professions are different, so treating DS like engineering sets it up to look like engineering done badly.
-
This article by @Mike_Kaminsky, which is about why git for data probably won't banish the specter of dAtA qUaLiTy that plagues so many data orgs, is a good example of just that. https://locallyoptimistic.com/post/git-for-data-not-a-silver-bullet/
-
TL;DR the data is PERFECT. YOU are the problem.
-
Nah, I'm just trolling. But the article does describe an interesting aspect of DS work: the data is fixed, and DSes create value by writing code to transform it for various purposes. Often, the problem is that DS stakeholders misunderstand the role of DATA in the DS value chain.
-
There are two main modes for creating this transformation code (as described at length in the must-read DataOps Manifesto): innovation and productionization. https://www.dataopsmanifesto.org/
-
Innovation code evaluates whether transformations of raw data are useful, and productionization code makes those transformations widely available.
-
(And as a side note, since I'm a shameless @LocalOptimistic stan, this article by @AyRenay and Caitlin Moorman captures the same innovation/productionization split under the names 'circular' and 'linear'! https://locallyoptimistic.com/post/linear-and-circular-projects-part-1/ )
-
At any rate, both workflows are similar to software engineering since they're based around writing code, and when it comes to the literal process of writing code, DSes should copy SWE by checking their work into version control, writing unit tests, etc.
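As a sketch of what copying SWE practice can look like here, this is a unit test for a small transformation. The `sessionize()` helper and its 30-minute gap rule are hypothetical examples, not something from the thread:

```python
# Minimal sketch: a log-transformation function plus a unit test for it,
# the kind of check DSes can borrow from SWE. All names are illustrative.
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Group (user_id, timestamp) events into sessions split at `gap`."""
    sessions = {}
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])  # gap exceeded: start a new session
    return sessions


def test_sessionize_splits_on_gap():
    t0 = datetime(2020, 1, 1, 12, 0)
    events = [("u1", t0),
              ("u1", t0 + timedelta(minutes=5)),
              ("u1", t0 + timedelta(hours=2))]
    sessions = sessionize(events)
    assert len(sessions["u1"]) == 2  # the 2-hour gap starts a new session
```

A test like this pins down the transformation's assumptions explicitly, so when the logs drift, the test fails loudly instead of the numbers drifting silently.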
-
The departure point is the code's input. Rather than working with a well-understood production database, DS works with the wild world of log data. And as Heraclitus said, no one ever steps in the same data lake twice, for it's not the same data lake and they are not the same person.
-
The fact that the shape and volume of input data can change so quickly is what makes writing DS code hard. Statistical (and data) modeling is all about encoding assumptions, and the logs of a fast-changing product can upend your assumptions in real time.
-
Rather than focusing on changing the raw data, DSes are better suited to making their data transformations more robust to variation in said raw data. This can be hard to wrap your head around if you think data should always perfectly reflect what happened in your product.
-
But it's also why it's called data SCIENCE. It's about finding signals in noise. It uses similar tools to SWE, but it's a fundamentally different craft. Being crisp about this distinction saves you the grief of looking like an amateurish engineer.