How to handle ligatures in text extracted from a PDF with PyMuPDF


I need to capture some text from some PDFs, and I use PyMuPDF to do this. However, I run into a ligature issue when writing the selected text to a text file.

I use the following code snippet to read the PDF:

import fitz  # PyMuPDF

pdf = fitz.open("file_path")
full_text = ""
for page_n in range(pdf.page_count):
    page = pdf.load_page(page_n)
    full_text += page.get_text()  # plain-text extraction for this page
pdf.close()

# do some operations to get the desired text
desire_text = ...

And I use the following code snippet to write it to a text file:

with open('output.txt', 'w') as f:
    f.write(desire_text)

But I get this error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
     33 with open('output.txt', 'w') as f:
---> 34     f.write(desire_text)

File c:\Python311\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 2: character maps to <undefined>
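
The traceback points at `cp1252.py`, which suggests that `open()` fell back to the Windows default encoding, and cp1252 has no mapping for U+FB02. A minimal sketch of that cause, and of writing with an explicit UTF-8 encoding instead, is shown here. The sample string and the temporary file name are hypothetical stand-ins, not taken from the actual PDF.

```python
import os
import tempfile

# Hypothetical sample containing the "fl" ligature (U+FB02) from the traceback.
text = "re\ufb02ow"

# cp1252 cannot represent U+FB02, so encoding with it raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf-8" avoids depending on the platform default encoding.
out_path = os.path.join(tempfile.gettempdir(), "output_utf8.txt")  # hypothetical path
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the ligature survived.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

This keeps the ligature character in the output file as-is. It only changes how the file is encoded, not the extracted text.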

I know that the PDF contains some ligatures, such as the one reported in the traceback (U+FB02, the "fl" ligature), which cause the issue. I can replace them manually using string replacement, but I don't think handling this manually is efficient for large PDFs.
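
One possible way to avoid hand-written replacements is Unicode compatibility normalization from the standard library, sketched here under the assumption that expanding ligatures into plain letters is acceptable for the output. The helper names `expand_ligatures` and `replace_ligatures_only` are hypothetical. The second helper limits the change to the Latin ligature range U+FB00 to U+FB06, because NFKC also rewrites other compatibility characters such as superscripts.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand ligatures (and all other compatibility characters) via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Translation table covering only the Latin ligatures U+FB00..U+FB06,
# built from each character's own compatibility decomposition.
_LIGATURE_TABLE = str.maketrans(
    {chr(cp): unicodedata.normalize("NFKD", chr(cp)) for cp in range(0xFB00, 0xFB07)}
)


def replace_ligatures_only(text: str) -> str:
    """Expand only Latin ligatures and leave every other character untouched."""
    return text.translate(_LIGATURE_TABLE)


sample = "\ufb02ow \ufb01le o\ufb03ce"  # hypothetical text with fl, fi, ffi ligatures
expanded = expand_ligatures(sample)
targeted = replace_ligatures_only("x\u00b2 \ufb01t")  # the superscript two is kept
```

Either helper could be applied to `desire_text` before writing it out. PyMuPDF also documents a `TEXT_PRESERVE_LIGATURES` extraction flag for `get_text()`, which may be worth checking, though whether changing it is enough for this PDF has not been verified here.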
