I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell within a PDF file.
These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, which does not provide colored-cell information. Is there an easy-to-follow solution in Python?
Thanks in advance.
disclaimer: I am the author of the library borb used in this answer
About PDF: PDF is not so much a "what you see is what you get" format, as it is a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draws something we humans interpret as a table. Something like:
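As an illustration only (this is not the snippet from the original answer), the sketch below holds a hand-written fragment of a PDF content stream in a Python string and splits it into operator/operand pairs. The operators `rg` (set fill colour), `re` (rectangle), `f` (fill), `m` (move to), `l` (line to) and `S` (stroke) are standard PDF operators; the helper name `tokenize` and the sample coordinates are invented for this example.

```python
# Hand-written sample of PDF rendering instructions (coordinates are invented).
CONTENT_STREAM = """
1 0 0 rg
100 700 80 20 re
f
100 700 m
180 700 l
S
"""

# Only the operators that appear in the sample above are recognised here.
OPERATORS = {"rg", "re", "f", "m", "l", "S"}


def tokenize(stream):
    """Split a simple content stream into (operator, operands) pairs.

    Hypothetical helper: it assumes whitespace-separated numeric operands
    followed by one of the operators listed in OPERATORS.
    """
    instructions = []
    operands = []
    for token in stream.split():
        if token in OPERATORS:
            instructions.append((token, operands))
            operands = []
        else:
            operands.append(float(token))
    return instructions


instructions = tokenize(CONTENT_STREAM)
for operator, operands in instructions:
    print(operator, operands)
```

Read top to bottom, the sample sets a red fill colour, fills one rectangle (a coloured cell background) and then strokes one horizontal line (a cell border). Nothing in the stream says "this is a table".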
Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
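To make the "lines intersecting at 90-degree angles" assumption concrete, here is a minimal sketch of such a heuristic. It is not how borb or any other library implements table detection; the function names, the tolerance and the `min_crossings` threshold are all assumptions chosen for this example.

```python
from typing import List, Tuple

# A line segment as (x0, y0, x1, y1).
Segment = Tuple[float, float, float, float]


def is_horizontal(s: Segment, tol: float = 0.5) -> bool:
    return abs(s[1] - s[3]) <= tol


def is_vertical(s: Segment, tol: float = 0.5) -> bool:
    return abs(s[0] - s[2]) <= tol


def count_right_angle_crossings(segments: List[Segment], tol: float = 0.5) -> int:
    """Count how often a horizontal segment meets a vertical one."""
    horizontals = [s for s in segments if is_horizontal(s, tol)]
    # Degenerate (point-like) segments are treated as horizontal only.
    verticals = [s for s in segments
                 if is_vertical(s, tol) and not is_horizontal(s, tol)]
    count = 0
    for h in horizontals:
        hx0, hx1 = sorted((h[0], h[2]))
        hy = h[1]
        for v in verticals:
            vy0, vy1 = sorted((v[1], v[3]))
            vx = v[0]
            if hx0 - tol <= vx <= hx1 + tol and vy0 - tol <= hy <= vy1 + tol:
                count += 1
    return count


def looks_like_table(segments: List[Segment], min_crossings: int = 4) -> bool:
    """Guess whether the segments form a table (threshold is arbitrary)."""
    return count_right_angle_crossings(segments) >= min_crossings


# A 2x2 grid: three horizontal and three vertical lines.
grid = [(0, y, 20, y) for y in (0, 10, 20)] + [(x, 0, x, 20) for x in (0, 10, 20)]
print(count_right_angle_crossings(grid))
print(looks_like_table(grid))
print(looks_like_table([(0, 0, 20, 0)]))
```

A heuristic like this can be fooled in both directions: a borderless table has no lines to count, and a page with decorative rules may cross the threshold without containing a table.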
I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document. You would use it as such:
As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.
For this, I would start at this particular line.
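Independently of borb's internals, the idea of tracking the fill colour can be sketched in plain Python. This is not borb's API: `FilledRect`, `collect_filled_rectangles` and the sample instruction list are hypothetical names and data made up for this example. The sketch only handles RGB fill colours set with the `rg` operator and ignores the graphics state stack (`q`/`Q`) and other colour spaces.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FilledRect:
    x: float
    y: float
    width: float
    height: float
    fill_rgb: Tuple[float, ...]


def collect_filled_rectangles(
    instructions: Sequence[Tuple[str, Sequence[float]]]
) -> List[FilledRect]:
    """Walk (operator, operands) pairs and record each filled rectangle
    together with the fill colour that was active when it was painted."""
    fill_rgb: Tuple[float, ...] = (0.0, 0.0, 0.0)  # PDF's default fill colour is black
    pending = []  # rectangles added to the current path, not yet painted
    filled = []
    for operator, operands in instructions:
        if operator == "rg":  # set RGB fill colour
            fill_rgb = tuple(operands)
        elif operator == "re":  # append rectangle x y w h to the path
            pending.append(tuple(operands))
        elif operator in ("f", "F", "f*", "B", "B*", "b", "b*"):  # path is filled
            for x, y, w, h in pending:
                filled.append(FilledRect(x, y, w, h, fill_rgb))
            pending = []
        elif operator in ("S", "s", "n"):  # path ends without being filled
            pending = []
    return filled


# Invented sample: a yellow cell background followed by a white one.
sample = [
    ("rg", [1.0, 1.0, 0.0]),
    ("re", [100.0, 700.0, 80.0, 20.0]),
    ("f", []),
    ("rg", [1.0, 1.0, 1.0]),
    ("re", [180.0, 700.0, 80.0, 20.0]),
    ("f", []),
]
cells = collect_filled_rectangles(sample)
for cell in cells:
    print(cell)
```

Matching each recorded rectangle against the bounding boxes of the detected table cells would then give the background colour per cell.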