I'm exploring options for semi-automated redaction of PDFs using various NLP techniques, and have been using PyMuPDF with Tesseract via ocrmypdf for OCR. This works pretty well overall, but management want to try Textract as an alternative. It's easy enough to call it against a single page of a PDF and read the resulting dictionary, but there's no simple way (that I've found yet) for mapping that back into the PDF as invisible text to create a searchable version of the page (all of which ocrmypdf does automatically).
For reference, here's an example of the dict that Textract produces. A given entry can be either a WORD or LINE.
'Id': 'be018daa-02c9-47d2-903a-73b69bdaa181',
'Text': "owners'",
'TextType': 'PRINTED'},
{'BlockType': 'WORD',
'Confidence': 95.73345947265625,
'Geometry': {'BoundingBox': {'Height': 0.014128071255981922,
'Left': 0.7538964748382568,
'Top': 0.7295616269111633,
'Width': 0.08705723285675049},
'Polygon': [{'X': 0.7539187669754028,
'Y': 0.7295616269111633},
{'X': 0.8409537076950073,
'Y': 0.7295762896537781},
{'X': 0.8409309983253479,
'Y': 0.7436897158622742},
{'X': 0.7538964748382568,
'Y': 0.7436745166778564}]},
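To make the structure concrete, here is a minimal sketch of pulling the WORD entries out of such a response. It assumes the entries sit in a list under a top-level "Blocks" key, as in the dict returned by Textract's text detection call; the function name iter_words and the min_confidence parameter are hypothetical, not part of any library.

```python
def iter_words(response, min_confidence=0.0):
    """Yield (text, bounding_box) for each WORD block in a Textract response.

    Assumption: `response` is the parsed dict and its entries live under
    the "Blocks" key. LINE (and any other) block types are skipped.
    """
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        # Confidence is a percentage (0-100), as in the sample entry above.
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        yield block["Text"], block["Geometry"]["BoundingBox"]
```

Calling list(iter_words(response, min_confidence=80)) would then give (text, bounding box) pairs with the normalized Left/Top/Width/Height values shown above.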
Has anyone done this in Python, or have suggestions?
I'm working through various options. One mechanism I was thinking of was using the polygon coordinates provided for each LINE or WORD to create a new PyMuPDF Rect, then calling insert_textbox() (insertTextbox() in older PyMuPDF releases) against that rectangle.
But then there's the problem of font size/face and making sure it all aligns, which means identifying what font was detected and its size.
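As a rough sketch of the first half of that idea: the polygon coordinates are fractions of the page width and height, so the enclosing rectangle in page units is the min/max of the scaled points. The function name polygon_to_rect is hypothetical, and the page size has to be supplied by the caller (for example from page.rect in PyMuPDF). This assumes an unrotated page.

```python
def polygon_to_rect(polygon, page_width, page_height):
    """Return (x0, y0, x1, y1) in page units enclosing a Textract Polygon.

    `polygon` is the list of {'X': ..., 'Y': ...} dicts from the
    'Geometry' entry; X and Y are normalized to the range 0..1.
    """
    xs = [point["X"] * page_width for point in polygon]
    ys = [point["Y"] * page_height for point in polygon]
    # Top-left is the smallest x/y, bottom-right the largest.
    return min(xs), min(ys), max(xs), max(ys)
```

The resulting tuple can be unpacked straight into a PyMuPDF rectangle, e.g. fitz.Rect(*polygon_to_rect(poly, page.rect.width, page.rect.height)).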
We also have the problem that our PDFs come from a variety of uncontrolled sources, and can variously contain 100% searchable, 100% image-only, or a mix of page types. And they can be produced by a whole range of applications, so there's no single option that will likely cover everything.
I have done that many times using PyMuPDF. There are a few things to watch out for:
Once you have solutions for the above (using PyMuPDF makes it fairly simple), insert the text on your output page using page.insert_text() with render mode 3, which makes the text invisible.
For point 3 above, use a PyMuPDF rectangle method: matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect). Then take a Textract bounding box and make a PyMuPDF-compatible rectangle from it, with top-left corner (x0, y0) and bottom-right corner (x1, y1): textract_rect = fitz.Rect(x0, y0, x1, y1). The corresponding bbox on your output page is then bbox = textract_rect * matrix.
I suggest using the Helvetica font for output: font = fitz.Font("helv").
Once you have your text and its output bbox, compute the font size like this: textlen = font.text_length(text, fontsize=1) gives the length the text would have if the font size were 1, so bbox.width / textlen should give you a good value for the font size.
The next problem is the insertion point (needed for page.insert_text()). bbox.bl (the bottom-left point) is a good start, but if your text contains characters that descend below the baseline (e.g. g, y), you need to move the insertion point upwards a little. Use font.descender and the computed font size to work out the adjustment.
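Putting those steps together, a minimal sketch might look like the following. The function names are hypothetical; page is expected to be a PyMuPDF page and font a fitz.Font("helv") object. It scales the Textract box by the page width and height directly, which should match the torect() matrix for an unrotated page, and it always shifts the baseline up by the descender, whether or not the word actually contains descending characters.

```python
def textract_bbox_to_rect(bbox, page_width, page_height):
    """Scale a normalized Textract BoundingBox to (x0, y0, x1, y1) page units."""
    x0 = bbox["Left"] * page_width
    y0 = bbox["Top"] * page_height
    x1 = (bbox["Left"] + bbox["Width"]) * page_width
    y1 = (bbox["Top"] + bbox["Height"]) * page_height
    return x0, y0, x1, y1


def place_invisible_word(page, font, text, bbox):
    """Insert `text` invisibly so that it roughly fills the Textract bbox.

    Returns (fontsize, insertion_point), or None if the text has no width.
    """
    x0, y0, x1, y1 = textract_bbox_to_rect(bbox, page.rect.width, page.rect.height)

    # Width of the text at font size 1; the real size scales linearly.
    unit_length = font.text_length(text, fontsize=1)
    if unit_length <= 0:
        return None
    fontsize = (x1 - x0) / unit_length

    # font.descender is negative, so adding it moves the baseline up
    # (y grows downwards in PyMuPDF page coordinates).
    baseline_y = y1 + font.descender * fontsize
    point = (x0, baseline_y)

    # render_mode=3 writes the text without painting it (invisible text).
    page.insert_text(point, text, fontname="helv", fontsize=fontsize, render_mode=3)
    return fontsize, point
```

A call would then be place_invisible_word(page, fitz.Font("helv"), word_text, block["Geometry"]["BoundingBox"]) for each WORD entry, followed by saving the document as usual.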