Is it possible to remove PDF's images and only keep the OCR'd text?

Question

188 views Asked by Colin At 20 October 2023 at 13:03

I scanned a book and OCR'd it using ABBYY, but all I really care about is the text from the OCR and that it's organized by page. Is there a tool I could use to drop all of the scanned page images but keep all of the OCR text? I realize it wouldn't be human readable at that point, but that's OK because I'm processing the PDF with Python scripts.

There is 1 answer

**keen** · Accepted Answer · 2023-10-21T18:39:49+00:00

@johnwhitington's comment on the question worked great for me, but a comment is not a complete answer, so here it is written up:

cpdf -draft in.pdf -o out.pdf

You can get cpdf from https://github.com/coherentgraphics/cpdf-binaries

The -draft option removes images:

  -draft Remove images from the file
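
Since the question mentions Python scripts, the command above can be driven from Python. This is a minimal sketch, assuming the cpdf binary is installed and on your PATH; the function names `build_draft_command` and `strip_images` are made up for illustration and are not part of cpdf.

```python
import shutil
import subprocess


def build_draft_command(src, dst, cpdf="cpdf"):
    """Return the argument list equivalent to: cpdf -draft src -o dst."""
    return [cpdf, "-draft", str(src), "-o", str(dst)]


def strip_images(src, dst, cpdf="cpdf"):
    """Run cpdf so that dst is a copy of src with the images removed.

    Assumes the cpdf executable is on PATH (or that `cpdf` is a full path).
    """
    if shutil.which(cpdf) is None:
        raise FileNotFoundError(f"{cpdf!r} was not found on PATH")
    # check=True raises CalledProcessError if cpdf exits with an error.
    subprocess.run(build_draft_command(src, dst, cpdf), check=True)
```

Calling `strip_images("in.pdf", "out.pdf")` should then behave like the command line shown earlier. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.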

You need to make sure you actually have text in the file first, of course. In Acrobat, that's the "editable text and images" option in the OCR settings. If you can copy a block of text, paste it somewhere outside the PDF, and get readable text, you likely have a PDF that works for this.
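
The copy-and-paste check can also be approximated in a script. The sketch below is not part of the cpdf workflow; it assumes the third-party pypdf library is installed (`pip install pypdf`), and the helper names are hypothetical. It extracts text page by page and reports which pages have any.

```python
def pages_with_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is non-blank."""
    return [n for n, text in enumerate(page_texts, start=1) if text and text.strip()]


def extract_page_texts(pdf_path):
    """Return one string per page, in page order, using pypdf.

    pypdf is an assumption here; it is imported lazily so that
    pages_with_text() can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

If `pages_with_text(extract_page_texts("in.pdf"))` comes back empty, the file probably has no text layer, and removing the images would leave blank pages.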

This produces a perfectly human readable result (minus any supporting graphics, obviously).

Further information and documentation on the cpdf tool can be found at:

https://www.coherentpdf.com
https://www.coherentpdf.com/cpdfmanual.pdf

You may find a combination of -draft AND -blacktext useful (I did).
