why do PDF readers need multiple seconds to search through something like a thousand pages of formatted text when grep can dig through the equivalent plaintext in ~50ms? is extracting plaintext from PDF that costly?
-
PDF doesn't work like HTML + CSS where there's at least supposed to be separation between content and style. It probably has to do layout to figure out the order of the text on the page and to make good guesses about how to format the plain text in terms of whitespace.
-
but does it actually have to lay out curves for that?
-
My guess would be this: as far as I know, the PDF format has no concept of 'flowing text'. It's all absolutely positioned characters, so you need to do a full layout pass (and then some heuristics!) to figure out what text constitutes a 'paragraph' or 'sentence'.
-
After all, the search algorithm needs to distinguish between "two chunks of text that happen to be close together but are distinctly separate to a user", and "two chunks of text that form a longer single chunk of text".
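To make that guess concrete, here is a minimal sketch of the kind of heuristic being described, assuming the reader has already decoded the page into individually positioned glyphs. The `Glyph` type, the `glyphs_to_lines` function, and the `line_tol` and `space_ratio` thresholds are all hypothetical names and values chosen for illustration; they are not taken from any real PDF library, and real readers likely use more elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical structure, for illustration)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF user space has y growing upward
    width: float  # horizontal advance


def glyphs_to_lines(glyphs, line_tol=2.0, space_ratio=0.3):
    """Group glyphs into lines of text, guessing where the spaces belong.

    line_tol:    max baseline difference for two glyphs to share a line.
    space_ratio: a horizontal gap wider than this fraction of the previous
                 glyph's width is treated as a word break.
    Both thresholds are arbitrary assumptions.
    """
    # Rough reading order: top of page first, then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    # Cluster glyphs whose baselines are within line_tol of the line's first glyph.
    lines = []
    for g in ordered:
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                text += " "  # gap is wide enough to count as a space
            text += cur.char
        result.append(text)
    return result
```

Even this toy version has to sort, cluster, and measure gaps for every glyph before a search can start, which is work grep never does on plaintext. It also ignores columns, rotated text, and hyphenation, so it says nothing certain about where a real reader spends its time.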