How to extract font names using PyMuPDF without subsets?

Question

Asked by Sanika Girase on 26 March 2024 at 10:41 · 37 views

We are using PyMuPDF's Page.get_fonts() to extract font names from a PDF, but the names come back with subset prefixes. We tried calling fitz.TOOLS.set_subset_fontnames() in our code. It works for the font names returned by get_text(), but it has no effect on get_fonts().

Here is my sample code:

import fitz
fitz.TOOLS.set_subset_fontnames(False)

file_path = "sample.pdf"
pdf_document = fitz.open(file_path)
for page in pdf_document:
    # Page.get_fonts() is the page-level call; get_page_fonts() belongs to Document
    extracted_fonts = page.get_fonts(full=True)
    print(extracted_fonts)

Here is the output I am getting:

[
  (140, 'ttf', 'TrueType', 'XEAAAC+Arial Bold', 'F3', 'WinAnsiEncoding', 0), 
  (138, 'ttf', 'TrueType', 'XEAAAB+Times New Roman', 'F2', 'WinAnsiEncoding', 0),
  (137, 'ttf', 'TrueType', 'XEAAAA+Arial', 'F1', 'WinAnsiEncoding', 0)
]

I want the font names without the subset prefix, for example "Arial Bold" instead of "XEAAAC+Arial Bold".

There is 1 answer

**jepozdemir** · Answer 1 · 2024-03-26T10:45:25+00:00

You can split the font name by the '+' character and then select the last part, which represents the actual font name without the subset prefix:

import fitz

fitz.TOOLS.set_subset_fontnames(False)

file_path = "sample.pdf"
pdf_document = fitz.open(file_path)

for page in pdf_document:
    extracted_fonts = page.get_fonts(full=True)
    # With full=True each entry is
    # (xref, ext, type, basefont, name, encoding, referencer).
    # Keep only the part of basefont after the subset prefix, e.g. "XEAAAC+".
    cleaned_fonts = [
        (xref, ext, font_type, basefont.split('+')[-1], name, encoding, referencer)
        for xref, ext, font_type, basefont, name, encoding, referencer in extracted_fonts
    ]
    print(cleaned_fonts)
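
One caveat with splitting on '+': a font whose real name contains a plus sign would be cut short as well. The PDF specification defines a subset tag as exactly six uppercase letters followed by '+', so a slightly stricter variant can strip only that pattern. Below is a minimal sketch of that idea. The helper names strip_subset_tag and clean_fonts are hypothetical (they are not part of PyMuPDF), and the sample tuples are copied from the question's output so the snippet runs without a PDF file.

```python
import re

# A subset tag is six uppercase ASCII letters followed by '+', e.g. "XEAAAC+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(basefont):
    """Return basefont without a leading subset tag, if it has one."""
    return _SUBSET_TAG.sub("", basefont)


def clean_fonts(fonts):
    """Strip the subset tag from item 3 (basefont) of each font tuple.

    Assumes the tuple layout produced with full=True:
    (xref, ext, type, basefont, name, encoding, referencer).
    """
    return [
        tuple(f[:3]) + (strip_subset_tag(f[3]),) + tuple(f[4:])
        for f in fonts
    ]


# Sample data taken from the output shown in the question.
sample = [
    (140, "ttf", "TrueType", "XEAAAC+Arial Bold", "F3", "WinAnsiEncoding", 0),
    (138, "ttf", "TrueType", "XEAAAB+Times New Roman", "F2", "WinAnsiEncoding", 0),
    (137, "ttf", "TrueType", "XEAAAA+Arial", "F1", "WinAnsiEncoding", 0),
]

for entry in clean_fonts(sample):
    print(entry)
```

With a real document, the same helper could be applied per page as clean_fonts(page.get_fonts(full=True)). Names that have no subset tag pass through unchanged.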
