Still no nearer to reading a batch of pdfs and outputting the needed data without opening them really
Replying to @brazen_cabeza
like, reading for brain or reading for silicon? cuz this exists, somewhat flawed but generally useful: https://github.com/euske/pdfminer
Replying to @danlistensto
That's one thing I've got. Still have to work out how to use it. I tried a lesser library and it gave me maybe 5% of the data, & my concern is that I need OCR, as in tesseract, which means a steep learning curve.
Replying to @brazen_cabeza
I worked on a project that involved ingesting hundreds of PDFs per batch, with about 1 batch per week, and pdfs subject to (usually minor) changes in formatting and structure between batches. I went through the same process and ruled out OCR early.
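The thread doesn't say how OCR got ruled out. One common check, sketched here as an assumption rather than what was actually done on that project: extract the text page by page and see how many pages come back with real text. If most pages do, the PDFs have a text layer and OCR is probably unnecessary. The names `text_layer_coverage` and `likely_needs_ocr`, and both thresholds, are made up for illustration.

```python
from typing import Iterable


def text_layer_coverage(page_texts: Iterable[str], min_chars: int = 20) -> float:
    """Fraction of pages with at least min_chars non-whitespace characters."""
    pages = list(page_texts)
    if not pages:
        return 0.0
    # Strip all whitespace before counting so blank-looking pages don't pass.
    with_text = sum(1 for text in pages if len("".join(text.split())) >= min_chars)
    return with_text / len(pages)


def likely_needs_ocr(page_texts: Iterable[str], threshold: float = 0.5) -> bool:
    """Heuristic only: flag a PDF when fewer than `threshold` of its pages have text."""
    return text_layer_coverage(page_texts) < threshold
```

A result like "5% of the data" can also come from a weak extraction library rather than a missing text layer, so a check like this is worth rerunning with a better extractor before reaching for OCR.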
Replying to @danlistensto @brazen_cabeza
the tesseract lib itself does its one narrow job very well but building out a practical ETL pipeline to plug it into is 99% of the work and is very hard. pdfminer made the ETL pipeline part as easy as passing in a file handle.
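For a rough idea of what "passing in a file handle" can look like over a batch, here is a minimal sketch. It assumes the maintained pdfminer.six fork (the thread links the original euske/pdfminer, whose API differs) and its `pdfminer.high_level.extract_text` function. The helper names, the regex-per-field approach, and the sample patterns are all hypothetical, not the pipeline described in the thread.

```python
import re
from pathlib import Path
from typing import BinaryIO, Callable, Dict, List, Optional


def pdfminer_text(handle: BinaryIO) -> str:
    """Return the text of a PDF given an open binary file handle."""
    # Lazy import: requires the pdfminer.six fork (an assumption, see above).
    from pdfminer.high_level import extract_text
    return extract_text(handle)


def pick_fields(text: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply one regex per field; keep group 1 of the first match, else None."""
    found: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        found[name] = match.group(1).strip() if match else None
    return found


def process_batch(
    folder: str,
    patterns: Dict[str, str],
    extractor: Callable[[BinaryIO], str] = pdfminer_text,
) -> List[Dict[str, Optional[str]]]:
    """Extract the requested fields from every *.pdf in folder, in name order."""
    rows: List[Dict[str, Optional[str]]] = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdf_path.open("rb") as handle:
            text = extractor(handle)
        row: Dict[str, Optional[str]] = {"file": pdf_path.name}
        row.update(pick_fields(text, patterns))
        rows.append(row)
    return rows


# Hypothetical usage; real patterns depend on the PDFs in the batch:
# rows = process_batch("batch_01", {"total": r"Total:\s*([\d.,]+)"})
```

The extractor is a parameter so the field-picking logic can be tried on plain text first, and so a different extraction step can be swapped in when the formatting shifts between batches.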
Replying to @danlistensto @brazen_cabeza
it has a mild learning curve but I figured it out enough for practical use in under a week of lots of swearing-at-my-monitor work sessions. was worth it though.
this was helpful for the learning curve: https://www.youtube.com/watch?v=k34wRxaxA_c …
Replying to @danlistensto
I was trying to follow this video earlier today. Nevertheless encouraged.