Using Fitz (PyMuPDF) to extract images from a PDF along with the text to their left


I'm new to Python and I'm trying to cycle through a PDF catalogue, extract each image, and save it as a .png whose filename is the text to the left of the image in the PDF (see the screenshot for an example).

To complicate matters, as in the example, I need the smiley face saved as 4 separate images, one per label (Happy.png, Face.png, Teeth.png, Smiling.png).

Example PDF section: a left-justified header title; the subtitles 'Cat No' and 'ID' on one line underneath; a list of 4 single words under 'Cat No' and 4 numbers under 'ID', with a dotted line separating each row. The image is positioned to the right, across some of the lines.

I have used the code below to save the images, but how can I amend it so that each label to the left is used as the image name? Thanks for any help you can give!

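import os

import fitz  # PyMuPDF
from tqdm import tqdm  # import the tqdm callable, not just the module

workdir = "data"

for each_path in os.listdir(workdir):
    # Only process files with a .pdf extension
    if each_path.lower().endswith(".pdf"):
        doc = fitz.Document(os.path.join(workdir, each_path))

        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]  # cross-reference number of the image object
                pix = fitz.Pixmap(doc, xref)
                # PNG cannot store CMYK, so convert such pixmaps to RGB first
                if pix.n - pix.alpha > 3:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))

print("Task completed")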