I am working on a project where I want to extract text coordinates for a specific character range within a PDF document using PyMuPDF. I have a PDF file and a character range defined by the start and end indices. I want to locate the exact position (coordinates) of the text within this character range on the PDF.
An example: The text of the pdf is:
Erste Verordnung zum Sprengstoffgesetz in der Fassung der Bekanntmachung vom 31. Januar 1991 (BGBl. I S. 169), die zuletzt durch Artikel 1 der Verordnung vom 20. Dezember 2021 (BGBl. I S. 5238) geändert worden ist. Das Sprengstoffgesetz ist anzuwenden.
I have the character range from 21 to 38, which in this case represents the phrase Sprengstoffgesetz. I tried using the search_for function, but it gives me all instances of the term, not just the term in the character range.
Furthermore, the character range might not cover a full word, but only part of one. For example, the range from 32 to 38 represents just the part gesetz.
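To make the offsets concrete, here is a quick check with plain Python slicing. It assumes the indices are zero-based and the end index is exclusive, which is the only reading under which both ranges above match the stated text; the string is shortened from the example.

```python
# First part of the example text; the offsets are zero-based.
text = ("Erste Verordnung zum Sprengstoffgesetz in der Fassung der "
        "Bekanntmachung vom 31. Januar 1991")

start, end = 21, 38            # end index is exclusive, as in Python slicing
whole_word = text[start:end]   # the full word
part = text[32:38]             # only the tail of the same word

print(whole_word)  # Sprengstoffgesetz
print(part)        # gesetz
```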
Is there a way to find the coordinates of the given range?
I may have a solution that involves a few steps. I haven't fully tested it and it may still have some bugs. One problem is that it takes a long time if the PDF is large and the character range is at the end of the PDF.
clip parameter to find the words and iterate through the words to find the rectangles that contain parts of the term. `import fitz
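A different route that avoids word matching altogether might be to work with per-character boxes. The sketch below is not the code from this post. It assumes the offsets index into the page's characters in the order PyMuPDF's `get_text("rawdict")` returns them, with one newline counted after each line, so the offsets must have been computed against that same flattened string. The helper names `page_chars`, `range_rects` and `rects_for_range` are made up for illustration.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def page_chars(page) -> List[Tuple[str, Optional[BBox]]]:
    """Flatten a PyMuPDF page into (character, bbox) pairs using "rawdict"."""
    chars = []
    for block in page.get_text("rawdict")["blocks"]:
        for line in block.get("lines", []):      # image blocks have no "lines"
            for span in line["spans"]:
                for ch in span["chars"]:
                    chars.append((ch["c"], tuple(ch["bbox"])))
            chars.append(("\n", None))           # assumption: one newline per line
    return chars


def range_rects(chars, start: int, end: int, tol: float = 1.0) -> List[BBox]:
    """Merge the boxes of chars[start:end] into one rectangle per text line."""
    rects: List[BBox] = []
    for _, bbox in chars[start:end]:
        if bbox is None:                         # skip the synthetic newlines
            continue
        x0, y0, x1, y1 = bbox
        last = rects[-1] if rects else None
        # Same line if top and bottom edges agree within the tolerance.
        if last and abs(last[1] - y0) <= tol and abs(last[3] - y1) <= tol:
            rects[-1] = (min(last[0], x0), min(last[1], y0),
                         max(last[2], x1), max(last[3], y1))
        else:
            rects.append((x0, y0, x1, y1))
    return rects


def rects_for_range(pdf_path: str, page_number: int, start: int, end: int):
    """Open the PDF and return the rectangles for a page-relative range."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it
    with fitz.open(pdf_path) as doc:
        return range_rects(page_chars(doc[page_number]), start, end)
```

Because the offsets here are relative to a single page, a document-wide offset would first have to be mapped to a page, for example by accumulating `len(page_chars(page))` page by page until the range is reached. A range that wraps across lines yields one rectangle per line, and characters with clearly different box heights on the same line (such as a font change) would also start a new rectangle.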