We are using PyMuPDF's Page.get_fonts() function to extract font names from a PDF, but the names come back with subset prefixes. We tried setting fitz.TOOLS.set_subset_fontnames() in our code. It works for the fonts returned by get_text(), but it has no effect on get_fonts().
Here is my sample code:
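import fitz

# Ask PyMuPDF to omit subset prefixes from font names
fitz.TOOLS.set_subset_fontnames(False)

file_path = "sample.pdf"
pdf_document = fitz.open(file_path)

for page in pdf_document:
    # full=True adds the referencer xref as the last tuple item
    extracted_fonts = page.get_fonts(full=True)
    print(extracted_fonts)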
Here is the output I am getting:
[
(140, 'ttf', 'TrueType', 'XEAAAC+Arial Bold', 'F3', 'WinAnsiEncoding', 0),
(138, 'ttf', 'TrueType', 'XEAAAB+Times New Roman', 'F2', 'WinAnsiEncoding', 0),
(137, 'ttf', 'TrueType', 'XEAAAA+Arial', 'F1', 'WinAnsiEncoding', 0)
]
I want the font names without the subset prefix, for example "Arial Bold" instead of "XEAAAC+Arial Bold".
You can split the font name by the '+' character and then select the last part, which represents the actual font name without the subset prefix:
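A minimal sketch of that approach, using the tuples from the output above so it runs without a PDF. The helper name `strip_subset_prefix` is hypothetical, and the code assumes the font name sits at index 3 of each tuple, as in the output shown:

```python
def strip_subset_prefix(basefont: str) -> str:
    """Return the font name without a leading 'ABCDEF+' style subset tag."""
    # split("+")[-1] keeps the text after the last '+';
    # names that contain no '+' are returned unchanged.
    return basefont.split("+")[-1]


# Sample tuples copied from the output in the question.
fonts = [
    (140, 'ttf', 'TrueType', 'XEAAAC+Arial Bold', 'F3', 'WinAnsiEncoding', 0),
    (138, 'ttf', 'TrueType', 'XEAAAB+Times New Roman', 'F2', 'WinAnsiEncoding', 0),
    (137, 'ttf', 'TrueType', 'XEAAAA+Arial', 'F1', 'WinAnsiEncoding', 0),
]

# Index 3 holds the font name in each tuple.
clean_names = [strip_subset_prefix(font[3]) for font in fonts]
print(clean_names)
```

With a real document, the same list comprehension can be applied to the list returned by page.get_fonts(full=True) inside the page loop.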