1) Lies, Damned Lies, and Statistics: My uneasy relationship with data
2) I went to MIT, became a quant trader, and then a fintech founder. Outside of work, I'm an Effective Altruist: what matters is maximizing the amount of positive impact you can have. So you'd think that I'd love data.
3) But the truth is that I think most people misuse, and overuse, statistics. So much so that many people would be better off ignoring data entirely than doing what they're currently doing. It took me a while to come to terms with this.
4) Some examples: a) Bob is running a consumer fintech company. He studies the multiples of exchange fees and B2B subscription fees; he finds that they're 20x and 80x, respectively. So he decides against building a mobile interface, and focuses on being a B2B liquidity source.
5) b) Alice is at a VC firm. She does a study of the correlation between employee count and market cap for their portfolio companies. Controlling for lots of other factors, it's +75%. In the next round, she mostly funds companies rapidly expanding their headcount.
6) c) Zed is trying to decide whether to do a Super Bowl commercial or a Facebook ad. They look at impressions per dollar, and decide the latter is cheaper; so they forgo the game.
8) The key insight: you're not choosing between looking at statistics or acting randomly. You have a prior coming in, based on your intuition and critical thinking. The question is whether data is more or less useful than your priors, and whether you combine them well.
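One rough way to picture "combining them well" is a precision-weighted average of the prior and the data. This is only a sketch: it assumes both the prior and the data's noise are normally distributed, which the thread does not claim, and the function name and parameters are illustrative.

```python
def combine_prior_and_data(prior_mean, prior_sd, data_mean, data_sd):
    """Precision-weighted blend of a prior belief and a noisy data estimate.

    Assumes (for illustration only) a normal prior and normal noise.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / data_sd ** 2
    # The noisier the data relative to the prior, the less weight it gets.
    weight_on_data = data_precision / (prior_precision + data_precision)
    posterior_mean = prior_mean + weight_on_data * (data_mean - prior_mean)
    posterior_sd = (prior_precision + data_precision) ** -0.5
    return posterior_mean, posterior_sd, weight_on_data
```

Under these assumptions, data that is three times as noisy as your prior earns only a tenth of the weight, so treating it as decisive moves you further from a sensible answer than mostly trusting your baseline judgement.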
9) In Bob's case, his data is technically correct! But there are two core issues: a) his revenue might not be the same in both cases; maybe the mobile app makes more than 4x as much revenue as the B2B product. b) also: valuation isn't all that matters! I'd prefer earnings.
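The "more than 4x" figure falls out of the two multiples Bob found (80x / 20x). A small sketch of that arithmetic follows; the revenue figures are hypothetical placeholders, not numbers from the thread.

```python
# Multiples from Bob's study: 20x for exchange fees, 80x for B2B subscription fees.
EXCHANGE_FEE_MULTIPLE = 20
B2B_SUBSCRIPTION_MULTIPLE = 80


def implied_valuation(revenue, multiple):
    """Valuation implied by applying a revenue multiple."""
    return revenue * multiple


def breakeven_revenue_ratio(low_multiple, high_multiple):
    """How many times more revenue the low-multiple business needs
    to match the valuation of the high-multiple business."""
    return high_multiple / low_multiple


# Hypothetical revenues, purely for illustration.
b2b_revenue = 1_000_000
mobile_revenue = 5_000_000  # 5x the B2B revenue, i.e. above the 4x break-even

ratio = breakeven_revenue_ratio(EXCHANGE_FEE_MULTIPLE, B2B_SUBSCRIPTION_MULTIPLE)
mobile_value = implied_valuation(mobile_revenue, EXCHANGE_FEE_MULTIPLE)
b2b_value = implied_valuation(b2b_revenue, B2B_SUBSCRIPTION_MULTIPLE)
```

With these made-up revenues the mobile app comes out ahead despite the lower multiple, which is the point: the multiple alone says nothing until you also estimate revenue.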
10) Bob probably would have been better off just saying "let's build the business that seems the best" and ignoring valuation. In Alice's case, her data is probably being misinterpreted.
11) Yes, there is a positive correlation between having 10k employees and being successful: You can only hire 10k employees if you've done well. So there's a correlation, but the direction of causation is probably wrong.
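A toy simulation can make the reversed causation concrete. In the sketch below, success is drawn at random and headcount is generated purely as a consequence of it; hiring never feeds back into success. The model, units, and parameters are all assumptions for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_correlation(n_companies=1000, seed=0):
    """Correlation between headcount and market cap in a toy world
    where success causes hiring and hiring causes nothing."""
    rng = random.Random(seed)
    # Market cap in arbitrary units, drawn independently of headcount.
    market_caps = [rng.uniform(0, 100) for _ in range(n_companies)]
    # Headcount is a noisy consequence of success, clamped at zero.
    headcounts = [max(0.0, 10 * cap + rng.gauss(0, 100)) for cap in market_caps]
    return pearson(headcounts, market_caps)


correlation = simulate_correlation()
```

The correlation comes out strongly positive even though, by construction, adding employees does nothing for market cap. A fund that picked companies for rapid hiring would be selecting on the effect rather than the cause.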
12) And how about Zed? Well, what, in the end, is an impression? One of the important properties of Super Bowl ads: they're talked about again and again and again, in lots of places that are hard to track. The direct views significantly underestimate their impact.
13) And in this case, a simple gut check might have made Zed realize that _obviously_ Super Bowl ads have a large impact, and a lot of that is the chatter. So there are lots of ways to use data poorly. That doesn't make it useless--there are also lots of ways to use it well!
14) But if you do a mediocre job of using data, it just adds noise which distracts you from your baseline reasonable judgement. There is a fairly high bar that statistical analysis has to overcome to be net useful!
16) And this is a failure mode that a _lot_ of people fall into. The vast majority of statistics that I see quoted are useless. The times when stats are more likely to be useful are when they are answering a very specific, intentional question.
17) If you've thought hard about a decision you have to make and think you really understand the various factors, and know which factor you're uncertain about, then it can be *extremely* helpful to get some data!
18) But aimlessly generating data just distracts. It's also very similar to a trap that some interview candidates fall into, particularly those with strong math backgrounds: Given a hard, messy question, they'll try to solve it exactly. And if they can't, they get flummoxed.
19) The flipside of overfit, irrelevant data: Fermi estimates. Trying to estimate quantitative factors without knowing all the relevant data is hard, but you can often get reasonable bounds on them. And those bounds can be extremely useful.
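A minimal sketch of a Fermi-style bound: give each uncertain factor a rough (low, high) range, multiply the lows and the highs, and take the geometric mean as a central guess. The function name and the example ranges are hypothetical and only there to show the mechanics.

```python
import math


def fermi_bounds(factors):
    """Multiply per-factor (low, high) ranges into overall bounds.

    Returns (low, high, midpoint), where midpoint is the geometric mean
    of the two bounds, a common central guess for order-of-magnitude work.
    """
    low = math.prod(lo for lo, _ in factors)
    high = math.prod(hi for _, hi in factors)
    midpoint = math.sqrt(low * high)
    return low, high, midpoint


# Hypothetical ranges for Zed's question; none of these numbers are real data.
zed_factors = [
    (50e6, 150e6),  # direct viewers of the ad (assumed range)
    (1.5, 5.0),     # amplification from follow-on chatter (assumed range)
]
low, high, midpoint = fermi_bounds(zed_factors)
```

Even with ranges this wide, the output tells you whether the hard-to-track chatter could plausibly dominate the direct views, which is the question Zed's impressions-per-dollar figure never asked.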