I am trying to make a simple tool which can look for keywords (from multiple txt files) in multiple PDFs. In the end, I would like it to produce a report in the following form:
| Name of the pdf document | Keyword document 1 | Keyword document... | Keyword document x |
|---|---|---|---|
| PDF 1 | 1 | 546 | 77 |
| PDF... | 3 | 8 | 8 |
| PDF x | 324 | 23 | 34 |
The numbers are the total number of occurrences of all keywords from that keyword document in that particular PDF.
This is what I have so far. The function successfully locates and counts the keywords, and relates the summed counts to each document:
import fitz
import glob

def keyword_finder():
    # access all PDFs from current directory
    for pdf_file in glob.glob('*.pdf'):
        # open files using PyMuPDF
        document = fitz.open(pdf_file)
        # count the number of pages in document
        document_pages = document.page_count
        # access all txt files (these contain the keywords)
        for text_file in glob.glob('*.txt'):
            # empty list to store the results
            occurrences_sdg = []
            # open and read the keywords file (closed automatically afterwards)
            with open(text_file, 'r') as inputs:
                keywords_list = inputs.read()
            # split the words by an 'enter'
            keywords_list_separated = keywords_list.split('\n')
            for keyword in keywords_list_separated[1:-1]:  # omit first and last entry
                occurrences_keyword = []
                # read in page by page
                for page in range(0, document_pages):
                    # load the page with this index
                    text_per_page = document.load_page(page)
                    # search for the keyword on the page, and count all occurrences
                    keyword_sum = len(text_per_page.search_for(keyword))
                    # add occurrences from each page to list per keyword
                    occurrences_keyword.append(keyword_sum)
                # sum all occurrences of a keyword in the document
                occurrences_sdg.append(sum(occurrences_keyword))
            if sum(occurrences_sdg) > 0:
                print(f'{pdf_file} has {sum(occurrences_sdg)} keyword(s) from {text_file}\n')
I tried using pandas, and I still believe it is the best choice. However, the number of nested loops makes it difficult for me to decide at which point the "skeleton" DataFrame should be created and when the results should be added to it. The final goal is to save this report as a CSV file.
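
One way to sidestep the "skeleton" question, as a rough sketch only: collect the totals in a plain nested dictionary while looping, and build the DataFrame once at the end. The helper `build_report`, the `counts` dictionary, and the file names below are hypothetical and used purely for illustration. In the function above, the `print` call could be replaced by something like `counts.setdefault(pdf_file, {})[text_file] = sum(occurrences_sdg)`.

```python
import pandas as pd


def build_report(counts):
    """Turn {pdf_name: {keyword_file: total}} into a table with one row per PDF."""
    # Outer keys become the row index, inner keys become the columns
    report = pd.DataFrame.from_dict(counts, orient="index")
    # A PDF with no entry for some keyword file gets 0 instead of NaN
    report = report.fillna(0).astype(int)
    report.index.name = "Name of the pdf document"
    return report


# Hypothetical totals, standing in for what the nested loops would collect
counts = {
    "a.pdf": {"keywords_1.txt": 1, "keywords_2.txt": 546},
    "b.pdf": {"keywords_1.txt": 3, "keywords_2.txt": 8},
}
report = build_report(counts)

# With no path, to_csv returns the CSV as a string
csv_text = report.to_csv()
```

Calling `report.to_csv("report.csv")` instead would write the same table to a file. Building the frame once at the end avoids having to know all row and column names before the loops start.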