I need to detect tables in PDF files so I can create form fields in empty cells.
To test the functionality of borb to achieve that goal, I attempted to use the author's example as a guide.
My code:
from decimal import Decimal
from borb.pdf import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines as TDBL
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf.canvas.color.color import HexColor, X11Color

def main(infile, outfile):
    # get doc
    t = TDBL()
    doc = None
    with open(infile, "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [t])
    assert doc is not None
    # get page
    p = doc.get_page(0)
    # get Tables
    for r in t.get_table_bounding_boxes():
        r = r.grow(Decimal(5))
        p.add_annotation(SquareAnnotation(r, stroke_color=X11Color("Green")))
    with open(outfile, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)
    return

if __name__ == "__main__":
    infile = "Parameter.pdf"
    outfile = "ParameterOut.pdf"
    main(infile, outfile)
I used my own pdf file here: "Parameter.pdf".
When I run the program, I receive this error:
# etc..
r = r.grow(Decimal(5))
AttributeError: 'int' object has no attribute 'grow'
What am I getting wrong?
Disclaimer: I am the author of borb.

0. About borb

Please check the signature of get_table_bounding_boxes. In short, this method returns a dict, and not a list[Rectangle]. That's what's going wrong in your code: iterating over a dict yields its keys, so you are trying to call grow on something that is not a Rectangle.

1. About your PDF
Disclaimer 2: I checked your document. I may be wrong, but this is my interpretation of what is happening.
Your PDF is doing something strange. At the lowest possible level, a PDF document uses PostScript-like operators to draw things (lines, curves, images, text) on a page. These instructions are typically stored compressed in the PDF.
I decompressed your PDF. It contains the following instructions:
Now for some explanation (you can follow along in the PDF spec, Annex A - Operator Summary. A copy of the spec can be found in the borb repository):
q: Save graphics state
re: Append rectangle to path
W*: Set clipping path using even-odd rule
n: End path without filling or stroking

So, your document is instructing the PDF viewer to add a rectangle to the path, but then never issues an instruction to actually fill or stroke that path.
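To make the "constructed but never painted" point concrete, here is a minimal sketch that walks the operators of a decompressed content stream and reports how each path ends. It is not part of borb: the function name classify_paths is made up for illustration, the operand values in the usage example are invented, and the whitespace tokenizer is a deliberate simplification that would be confused by string literals or inline images.

```python
from typing import List

# Operators that add segments to the current path.
PATH_CONSTRUCTION = {"m", "l", "c", "v", "y", "h", "re"}
# Operators that actually put ink on the page (stroke and/or fill).
PATH_PAINTING = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}
# Operator that ends the path without filling or stroking it.
END_PATH_NO_PAINT = "n"


def classify_paths(content_stream: str) -> List[str]:
    """Return 'painted' or 'not painted' for each path in the stream.

    Naive sketch: tokens are split on whitespace, which is only valid
    for simple streams without strings or inline image data.
    """
    results: List[str] = []
    path_open = False
    for token in content_stream.split():
        if token in PATH_CONSTRUCTION:
            path_open = True
        elif token in PATH_PAINTING and path_open:
            results.append("painted")
            path_open = False
        elif token == END_PATH_NO_PAINT and path_open:
            results.append("not painted")
            path_open = False
    return results


# Hypothetical stream with the same operator sequence as described above.
print(classify_paths("q 0 0 100 50 re W* n Q"))
# The same rectangle, but stroked instead of ended with n.
print(classify_paths("0 0 100 50 re S"))
```

A rectangle that only ever ends in n is used for clipping and leaves no visible line, which would explain why a detector that looks for drawn lines has nothing to work with.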
Just to drive that point home, this is the explanation the spec gives for the re operator: x y width height re is equivalent to a move (m) to (x, y), followed by three straight line segments (l) and a closing h, where:

l: Append a straight line segment from the current point to the point (x, y). The new current point shall be (x, y).
h: Close the current subpath by appending a straight line segment from the current point to the starting point of the subpath. If the current subpath is already closed, h shall do nothing. This operator terminates the current subpath. Appending another segment to the current path shall begin a new subpath, even if the new segment begins at the endpoint reached by the h operation.

And the PDF spec clearly differentiates between building a path (and closing it) and actually stroking or filling it.
In Table 60 - Path-Painting Operators you can find the evidence:
S: Stroke the path.
s: Close and stroke the path. This operator shall have the same effect as the sequence h S.
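Coming back to the AttributeError in the question: since get_table_bounding_boxes returns a dict, the loop has to go through the dict's values rather than its keys. The sketch below assumes the dict maps an int page number to a list of rectangles; the int keys match the error message, but the list values are an assumption, so check the signature before relying on it. The helper iter_boxes is hypothetical and not part of borb.

```python
from typing import Dict, Iterator, List, Tuple, TypeVar

R = TypeVar("R")


def iter_boxes(boxes_by_page: Dict[int, List[R]]) -> Iterator[Tuple[int, R]]:
    """Yield (page_number, rectangle) pairs from a per-page dict of boxes.

    Assumes each value is a list of rectangles for that page.
    """
    for page_nr, rectangles in boxes_by_page.items():
        for rectangle in rectangles:
            yield page_nr, rectangle


# Hypothetical usage with the names from the question (not executed here,
# and assuming the keys can be passed to doc.get_page as they are):
#
#   for page_nr, r in iter_boxes(t.get_table_bounding_boxes()):
#       r = r.grow(Decimal(5))
#       doc.get_page(page_nr).add_annotation(
#           SquareAnnotation(r, stroke_color=X11Color("Green"))
#       )
```

Note that this only removes the exception. If the rectangles in the document are never stroked, as described above, the detector may still return no boxes for the page.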