One of my first projects at a ~data science~ startup was imputing some characteristics of a client's book of business. Not because it needed sophisticated statistical inference. They could have just looked the relevant info up, if they had the names and addresses.
But they didn't have the names and addresses of their own clients, because someone accidentally deleted a column in the only copy of a spreadsheet. So they decided to try and machine learn them instead. Their own clients' names and addresses.
Some of the smartest people I know, like literal particle physicists, are working on the problem of data extraction from pdfs. You're probably thinking of "extracting insight" or something but I literally mean pulling tables that already exist out of pdfs and into csvs.
The pdf standard, whose ancestor PostScript is Turing complete by the way, apparently does not define a standard mechanism for representing tabular data. Or if it does, no one uses it. And when you get down to it, like what even is a table, man? Meanwhile, people need their csvs of tabular data.
The invisible hand's solution to this was to pull a bunch of people away from their postdocs in limning the fundamental nature of the universe, and set them to the task of getting numbers out of pdfs that other people had put into pdfs.
You don't pay these people's salaries by accident and so I have every confidence that this project will eventually pay for itself. But, when you zoom out, they're implementing one half of the world's most expensive identity function.
It's good for an academic field to have some competitive pressure from its cognate industry, giving people an escape hatch. And every decision en route to pdf_unfuckr 1.0 made sense in its immediate context.
The result is that people who were once studying the nature of reality are now just building up mental models of the insides of the pdf spec authors' heads, or worse, their own colleagues'. And probably finding it more taxing, too.