Attribute Error - Detecting Tables in PDFs with Borb


I need to detect tables in PDF files so I can create form fields in empty cells.

To test the functionality of borb to achieve that goal, I attempted to use the author's example as a guide.

My code:

from decimal import Decimal
from borb.pdf import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines as TDBL
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf.canvas.color.color import HexColor, X11Color

def main(infile, outfile):

    # get doc
    t = TDBL()
    doc = None
    with open(infile, "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [t])
    assert doc is not None

    # get page
    p = doc.get_page(0)

    # get Tables
    for r in t.get_table_bounding_boxes():
        r = r.grow(Decimal(5))
        p.add_annotation(SquareAnnotation(r, stroke_color=X11Color("Green")))

    with open(outfile, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

    return

if __name__ == "__main__":
    infile = "Parameter.pdf"
    outfile = "ParameterOut.pdf"
    main(infile, outfile)

I used my own pdf file here: "Parameter.pdf".


When I run the program, I receive this error:

# etc..
r = r.grow(Decimal(5))
AttributeError: 'int' object has no attribute 'grow'

What am I getting wrong?


1 Answer

Joris Schellekens On

disclaimer: I am the author of borb

0. About borb

Please check the signature of get_table_bounding_boxes:

    def get_table_bounding_boxes(self) -> typing.Dict[int, typing.List[Rectangle]]:
        """
        This function returns the bounding boxes (as Rectangle objects) of each Table
        that was recognized on the given page.
        """

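In short, this method returns a dict keyed by page number, where each value is a list of Rectangle objects. It does not return a list[Rectangle].

That's what's going wrong in your code. Iterating over a dict yields its keys, so r is an int, and you are trying to call grow on something that is not a Rectangle.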
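
To make the difference concrete, here is a minimal sketch that does not import borb. FakeRectangle is a hypothetical stand-in for borb's Rectangle, and its grow behaviour (expanding by d on every side) is an assumption made for illustration only. The point is the shape of the loop: iterate over the dict's items, then over each list.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List


@dataclass
class FakeRectangle:
    """Hypothetical stand-in for borb's Rectangle (illustration only)."""
    x: Decimal
    y: Decimal
    width: Decimal
    height: Decimal

    def grow(self, d: Decimal) -> "FakeRectangle":
        # Assumed behaviour: expand the rectangle by d on every side.
        return FakeRectangle(self.x - d, self.y - d,
                             self.width + 2 * d, self.height + 2 * d)


# Same shape as the return type of get_table_bounding_boxes():
# an int key mapped to a list of rectangles.
boxes: Dict[int, List[FakeRectangle]] = {
    0: [FakeRectangle(Decimal(10), Decimal(20), Decimal(100), Decimal(50))],
}

# Iterating over the dict directly yields the int keys, not rectangles.
keys = [k for k in boxes]

# Iterating over .items() gives each key together with its list of rectangles.
grown = []
for page_nr, rectangles in boxes.items():
    for r in rectangles:
        grown.append((page_nr, r.grow(Decimal(5))))

print(keys)
print(grown)
```

In the code from the question, the same pattern would presumably mean looping over t.get_table_bounding_boxes().items() and calling grow on each Rectangle in the inner list, rather than on the dict key.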

1. About your PDF

disclaimer2: I checked your document. I may be wrong. But this is my interpretation of what is happening.

Your PDF is doing something strange. At the lowest possible level, a PDF document uses PostScript-like operators to draw things (lines, curves, images, text) on a page. These instructions are typically stored compressed in the PDF.

I decompressed your PDF. It contains the following instructions:

q
72.504 706.06 77.4 13.464 re
W* n
 /Span <</MCID 0/Lang (en-US)>> BDC q
72.504 706.06 77.4 13.464 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 77.664 709.06 Tm
/GS10 gs
0 g
/GS11 gs
0 G
[<0057>17<0102018C>24<01020175>6<011E>9<019A>9<011E018C>] TJ
ET
Q

Now for some explanation (you can follow along in the PDF spec, Annex A - Operator Summary. A copy of the spec can be found in the borb repository):

q: Save graphics state
re: Append rectangle to path
W*: Set clipping path using even-odd rule
n: End path without filling or stroking

So, your document is instructing the PDF viewer to add a rectangle to the path and use it as a clipping path, but then never issues an instruction to actually fill or stroke that path. Nothing visible is drawn for it.
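
A rough way to check this yourself is to scan a decompressed content stream for path-painting operators. The sketch below is an illustration, not how borb does it: painted_paths is a hypothetical helper, and it uses a naive whitespace split, so it can misfire on operators that appear inside string literals. The operator set is the list of painting operators from the spec, leaving out n because n ends a path without painting it.

```python
from typing import List

# Path-painting operators from the PDF spec, excluding `n`,
# which ends the path without filling or stroking it.
PAINTING_OPERATORS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}


def painted_paths(content_stream: str) -> List[str]:
    """Return the painting operators found in the stream (naive tokenizer)."""
    return [tok for tok in content_stream.split() if tok in PAINTING_OPERATORS]


# Abbreviated version of the instructions shown above.
snippet = """
q
72.504 706.06 77.4 13.464 re
W* n
BT
/F1 11.04 Tf
0 g
0 G
ET
Q
"""

print(painted_paths(snippet))             # no painting operator is found
print(painted_paths("10 10 50 20 re S"))  # the rectangle is stroked here
```

For the snippet from your document the result is an empty list, which matches the observation that the rectangle is never stroked or filled.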

Just to drive that point home, this is the explanation the spec gives for the re operator:

Append a rectangle to the current path as a complete subpath, with lower-left corner (x, y) and dimensions width and height in user space. The operation x y width height re
is equivalent to
x y m
( x + width ) y l
( x + width ) ( y + height ) l
x ( y + height ) l
h

l: Append a straight line segment from the current point to the point (x, y). The new current point shall be (x, y).

h: Close the current subpath by appending a straight line segment from the current point to the starting point of the subpath. If the current subpath is already closed, h shall do nothing. This operator terminates the current subpath. Appending another segment to the current path shall begin a new subpath, even if the new segment begins at the endpoint reached by the h operation.
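
The equivalence quoted above can be sketched in a few lines of Python. expand_re is a hypothetical helper written for this answer. It only rewrites one re instruction into the m / l / h sequence from the spec and, like re itself, does not paint anything.

```python
from decimal import Decimal
from typing import List


def expand_re(x: str, y: str, width: str, height: str) -> List[str]:
    """Rewrite `x y width height re` as the equivalent m / l / h operators."""
    x0, y0, w, h = (Decimal(v) for v in (x, y, width, height))
    return [
        f"{x0} {y0} m",          # move to the lower-left corner
        f"{x0 + w} {y0} l",      # line to the lower-right corner
        f"{x0 + w} {y0 + h} l",  # line to the upper-right corner
        f"{x0} {y0 + h} l",      # line to the upper-left corner
        "h",                     # close the subpath
    ]


# The rectangle from the decompressed content stream.
for op in expand_re("72.504", "706.06", "77.4", "13.464"):
    print(op)
```

Note that the resulting sequence contains only path-construction operators. Whether anything becomes visible still depends on a painting operator following it.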

And the PDF spec clearly differentiates between building a path (and closing it) and actually stroking or filling it.

In Table 60 - Path-Painting Operators you can find the evidence:

S: Stroke the path.
s: Close and stroke the path. This operator shall have the same effect as the sequence h S.