" indicates the start o" /> " indicates the start o" /> " indicates the start o"/>

How to add each start and closing tag to a sections list in HTMLParser?

What I'm attempting in Python:

  • Open DocX file
  • Convert to HTML (provides tags)
  • Use HTMLParser to get start/end/data
  • "<p><strong>" indicates the start of a new section, so I need to append the current tempHTML text to the sections list and then empty the tempHTML string for the next run through.

My Code:

import mammoth
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):

    tempHTML = ''
    sections = list()
    index = 0

    def handle_starttag(self, tag, attrs):

        self.tempHTML += '<' + tag + '>'

        if(str(self.tempHTML) == "<p><strong>" and len(self.tempHTML) <= len("<p><strong>")):
            print("APPENDING")
            self.sections.append(self.tempHTML)
            self.tempHTML = ''
            print("AFTER RESET: " + str(self.tempHTML))
            self.index += 1
        else:
            self.sections.append(self.tempHTML)

    def handle_endtag(self, tag):

        self.tempHTML += '</' + tag + '>'

    def handle_data(self, data):

        self.tempHTML += data


with open("4Runner.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
    parser = MyHTMLParser()
    parser.feed(html)

    print(*MyHTMLParser.sections, sep="\n \n")

    print("done")

Current first few lines of output:

<p>

<p><strong>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>The new 2022 Toyota 4Runner is the vehicle you need to take on a test drive when you visit xxx.  </p><p>

I think the error is that once the first <p><strong> is read, the condition is true every time? I'm just not sure and am totally lost!
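
Judging from the output above, the condition is actually true only once: after the first reset, tempHTML starts with text rather than a tag, so it is never exactly "<p><strong>" again, and the else branch appends the whole growing buffer on every start tag. The following is only a minimal sketch of the splitting described in the bullet list, not a drop-in fix. The names SectionParser, split_sections and p_index are hypothetical, and it assumes the converted HTML never puts whitespace between <p> and <strong>.

```python
import html
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect HTML into sections that each begin at a "<p><strong>" pair."""

    def __init__(self):
        super().__init__()
        self.sections = []   # finished sections (per instance, not shared by the class)
        self.buffer = ""     # HTML of the section currently being collected
        self.p_index = None  # buffer offset of a <p> that was opened just now

    def handle_starttag(self, tag, attrs):
        if tag == "strong" and self.p_index is not None:
            # "<p><strong>" seen: everything before that <p> is a finished section.
            head = self.buffer[:self.p_index]
            self.buffer = self.buffer[self.p_index:]
            if head:
                self.sections.append(head)
        self.p_index = len(self.buffer) if tag == "p" else None
        # get_starttag_text() returns the raw tag text, so attributes are kept.
        self.buffer += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through unchanged.
        self.p_index = None
        self.buffer += self.get_starttag_text()

    def handle_endtag(self, tag):
        self.p_index = None
        self.buffer += f"</{tag}>"

    def handle_data(self, data):
        self.p_index = None
        # Character references arrive decoded, so escape them again.
        self.buffer += html.escape(data, quote=False)


def split_sections(html_text):
    parser = SectionParser()
    parser.feed(html_text)
    parser.close()  # flush any text the parser is still holding
    if parser.buffer:
        parser.sections.append(parser.buffer)
    return parser.sections


sample = (
    "<p>Intro</p>"
    "<p><strong>Q1?</strong></p><p>A1.</p>"
    "<p><strong>Q2?</strong></p><p>A2 &amp; more.</p>"
)
for section in split_sections(sample):
    print(section)
```

The buffer is only cut when a section boundary is actually reached, so each entry in sections holds one heading paragraph plus the paragraphs that follow it.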

There is 1 answer

Answer by AudioBubble:

For anyone in as odd a situation as me: the solution I used was to convert everything to one giant HTML string and then search it with a regular expression.

re.search(r"(?!<strong>)(?<=<p>)(.*)(?=</p>)", str(thing)).group(0)
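
If the goal is the list of sections from the question rather than a single match, a related regex idea is to split the HTML string in front of every "<p><strong>". This is a rough sketch of that variant, not the exact expression above. It assumes Python 3.7 or later (where re.split accepts patterns that match the empty string), and split_on_heading and SECTION_START are hypothetical names.

```python
import re

# Zero-width lookahead: matches the position just before each "<p><strong>".
SECTION_START = re.compile(r"(?=<p><strong>)")


def split_on_heading(html_text):
    # Splitting on the empty match keeps the tags inside each section;
    # empty pieces (e.g. when the text starts with a heading) are dropped.
    return [part for part in SECTION_START.split(html_text) if part]


html_text = (
    "<p>Intro</p>"
    "<p><strong>Q1?</strong></p><p>A1.</p>"
    "<p><strong>Q2?</strong></p><p>A2.</p>"
)
for section in split_on_heading(html_text):
    print(section)
```

This only works as long as the heading markup is literally "<p><strong>"; attributes or whitespace between the two tags would need a looser pattern.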