Adding rows and columns to a pandas DataFrame in multiple loops


I am trying to make a simple tool which can look for keywords (from multiple txt files) in multiple PDFs. In the end, I would like it to produce a report in the following form:

| Name of the pdf document | Keyword document 1 | Keyword document ... | Keyword document x |
|---|---|---|---|
| PDF 1 | 1 | 546 | 77 |
| PDF ... | 3 | 8 | 8 |
| PDF x | 324 | 23 | 34 |

Where the numbers represent the total number of occurrences of all keywords from the keyword document in that particular file.

This is how far I have got: the function successfully locates and counts the keywords, and relates the summed totals to each document:

import fitz
import glob


def keyword_finder():
    # access all PDFs from the current directory
    for pdf_file in glob.glob('*.pdf'):
        # open the file with PyMuPDF
        document = fitz.open(pdf_file)
        # number of pages in the document
        document_pages = document.page_count

        # access all txt files (these contain the keywords)
        for text_file in glob.glob('*.txt'):
            # list to store the per-keyword totals
            occurrences_sdg = []
            # read the keywords file and split it on newlines
            with open(text_file, 'r') as inputs:
                keywords_list_separated = inputs.read().split('\n')

            for keyword in keywords_list_separated[1:-1]:  # omit first and last entry
                occurrences_keyword = []
                # read the document page by page
                for page in range(document_pages):
                    # load the text of the current page
                    text_per_page = document.load_page(page)
                    # search for the keyword on the page and count the hits
                    keyword_sum = len(text_per_page.search_for(keyword))
                    # record the occurrences found on this page
                    occurrences_keyword.append(keyword_sum)
                # sum all occurrences of the keyword across the document
                occurrences_sdg.append(sum(occurrences_keyword))

            if sum(occurrences_sdg) > 0:
                print(f'{pdf_file} has {sum(occurrences_sdg)} keyword(s) from {text_file}\n')

I did try using pandas, and I believe it is still the best choice. With this many nested loops, though, it is hard to decide at which point the "skeleton" dataframe should be created and when the results should be added to it. The final goal is to save the produced report as a CSV file.
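One approach that fits the loop structure above: accumulate the totals in a plain dict of dicts while looping (keyed first by PDF name, then by keyword file), and build the DataFrame only once, after all loops have finished. A minimal sketch, with hypothetical totals standing in for the values `keyword_finder` would collect:

```python
import pandas as pd

# Hypothetical totals, standing in for what the loops would collect:
# results[pdf_file][text_file] = sum(occurrences_sdg)
results = {
    'PDF 1.pdf': {'keywords_1.txt': 1, 'keywords_2.txt': 546},
    'PDF 2.pdf': {'keywords_1.txt': 3, 'keywords_2.txt': 8},
    'PDF x.pdf': {'keywords_1.txt': 324, 'keywords_2.txt': 23},
}

# One row per PDF, one column per keyword document; any missing
# PDF/keyword-file combination becomes 0.
report = pd.DataFrame.from_dict(results, orient='index').fillna(0).astype(int)
report.index.name = 'Name of the pdf document'

# Save the finished report as CSV
report.to_csv('keyword_report.csv')
```

Inside `keyword_finder`, that would mean initialising `results = {}` before the PDF loop and writing `results.setdefault(pdf_file, {})[text_file] = sum(occurrences_sdg)` where the `print` currently sits, so the DataFrame is constructed exactly once at the end rather than grown inside the loops.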
