Extracting data from multiple PDFs


I have 200 PDF files, all formatted similarly.

Currently I am opening each PDF and looking for the two relevant values and typing them into an Excel table, all manually.

I'm wondering if there is a way to automate this. My (non-IT background) idea is to write a program that OCR-scans all the files located in a folder, then finds and extracts the relevant data in CSV format, and transfers it to Excel.

Could anyone give me some pointers on how to approach this? Is something like this possible at all? Is there a language that's better suited to this task than others? Would VBA or Power Query be helpful here in any way?


There is 1 answer

Benji over_9000 'benchonaut'

OCR

For the OCR part there are tons of tools to choose from.

There are many guides on how to install these tools, unfortunately mostly for Linux.

Relevant Data

A good question, but not a detailed one (and you may get downvotes because you did not say whether you need tables extracted or just text).

Of course you can use any programming language. An easy approach would be to OCR each PDF into its own text file; then, for example, grep -l MYTERM myfiles will list the names of the files that contain the term (on Linux, or in Git Bash under Windows).

Finally, generate a CSV that you import into Excel (the easy approach), or find a way to generate "real" Excel files.
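To make the easy approach a bit more concrete, here is a minimal Python sketch of the search-and-CSV step. It assumes the OCR step has already written one .txt file per PDF into a folder. The labels "Invoice No" and "Total", the column names, and the function names are all hypothetical placeholders, since the question does not say which two values are needed; swap in whatever text actually precedes your values.

```python
import csv
import re
from pathlib import Path

# Hypothetical labels: replace these patterns with whatever text
# precedes the two values in your own documents.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No[.:]?\s*(\S+)"),
    "total": re.compile(r"Total[.:]?\s*([\d.,]+)"),
}


def extract_values(text):
    """Return one entry per pattern; empty string when a label is not found."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def folder_to_csv(text_dir, csv_path):
    """Scan every .txt file in text_dir and write one CSV row per file."""
    fieldnames = ["file", *PATTERNS]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for path in sorted(Path(text_dir).glob("*.txt")):
            # OCR output can contain odd bytes, so decode leniently.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": path.name, **extract_values(text)})
```

Calling folder_to_csv("ocr_output", "values.csv") (both names are placeholders) would give you a file that Excel opens directly. Rows with empty cells point at files where the OCR text did not match a pattern, so those are the ones worth checking by hand.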

Regards