How to add each start and closing tag to section list in HTMLParser?

Question

49 views Asked by AudioBubble At 13 December 2021 at 04:12

What I'm attempting in Python:

Open DocX file
Convert to HTML (provides tags)
Use HTMLParser to get start/end/data
"<p><strong>" indicates the start of a new section, so I need to append the current tempHTML text to the sections list and then empty the tempHTML string for the next run through.

My Code:

import mammoth
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):

    tempHTML = ''
    sections = list()
    index = 0

    def handle_starttag(self, tag, attrs):

        self.tempHTML += '<' + tag + '>'

        if(str(self.tempHTML) == "<p><strong>" and len(self.tempHTML) <= len("<p><strong>")):
            print("APPENDING")
            self.sections.append(self.tempHTML)
            self.tempHTML = ''
            print("AFTER RESET: " + str(self.tempHTML))
            self.index += 1
        else:
            self.sections.append(self.tempHTML)

    def handle_endtag(self, tag):

        self.tempHTML += '</' + tag + '>'

    def handle_data(self, data):

        self.tempHTML += data


with open("4Runner.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
    parser = MyHTMLParser()
    parser.feed(html)

    print(*MyHTMLParser.sections, sep="\n \n")

    print("done")

Current first few lines of output:

<p>

<p><strong>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>

What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>The new 2022 Toyota 4Runner is the vehicle you need to take on a test drive when you visit xxx.  </p><p>

I think the error is that once the first <p><strong> is read, it's true every time? I'm just not sure and am totally lost!
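
For readers following along: in the code above, tempHTML is only reset when it is exactly "<p><strong>", and the else branch appends the still-growing buffer on every other start tag, which matches the repeated output shown. A minimal sketch of one way the intended splitting might be done with HTMLParser follows. The class name SectionParser, the after_p flag, and the sample string are all hypothetical, and attributes are dropped just as in the original code.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect HTML into sections, starting a new one at each <p><strong>."""

    def __init__(self):
        super().__init__()
        self.sections = []    # finished sections (per instance, not shared)
        self.current = ""     # HTML gathered for the section in progress
        self.after_p = False  # True only directly after a <p> start tag

    def handle_starttag(self, tag, attrs):
        if tag == "strong" and self.after_p:
            # <p><strong> seen: everything before that <p> is a finished section.
            finished = self.current[:-len("<p>")]
            if finished:
                self.sections.append(finished)
            self.current = "<p>"
        self.current += f"<{tag}>"
        self.after_p = (tag == "p")

    def handle_endtag(self, tag):
        self.current += f"</{tag}>"
        self.after_p = False

    def handle_data(self, data):
        self.current += data
        self.after_p = False

    def close(self):
        # Flush the last section once all input has been processed.
        super().close()
        if self.current:
            self.sections.append(self.current)
            self.current = ""


# Made-up sample input standing in for the mammoth output.
sample = (
    "<p><strong>Q1?</strong></p><p>A1.</p><p>More.</p>"
    "<p><strong>Q2?</strong></p><p>A2.</p>"
)
parser = SectionParser()
parser.feed(sample)
parser.close()
print(*parser.sections, sep="\n \n")
```

Keeping the state in __init__ rather than in class attributes means each parser instance gets its own sections list. Note that HTMLParser decodes character references in text by default, so the rebuilt strings may differ slightly from the input if it contains entities.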

Original Q&A

There is 1 answer

**AudioBubble** · Answer 1 · 2022-01-23T02:33:59+00:00

For anyone in as odd a situation as mine: the solution I used was to convert everything to one giant HTML string and then search it with a regular expression.

re.search(r"(?!<strong>)(?<=<p>)(.*)(?=</p>)", str(thing)).group(0)
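
A small, self-contained sketch of the regex idea, using a made-up sample string rather than real mammoth output. Because `.*` is greedy, on a single-line string with several paragraphs the pattern above would run through to the last `</p>`; the variant here uses a lazy `.*?` to pick out each non-heading paragraph instead. The `re.split` call with a lookahead is an assumed alternative for cutting the string into sections at each `<p><strong>`, and relies on Python 3.7+ support for splitting on empty matches.

```python
import re

# Made-up sample input; in the question this would be result.value from mammoth.
html = (
    "<p><strong>Q1?</strong></p><p>A1.</p>"
    "<p><strong>Q2?</strong></p><p>A2.</p>"
)

# Lazy variant of the answer's pattern: text of each <p> that does not
# start with <strong>.
body_pattern = r"(?!<strong>)(?<=<p>)(.*?)(?=</p>)"
bodies = re.findall(body_pattern, html)

# Split just before every "<p><strong>" and drop the empty leading piece.
sections = [part for part in re.split(r"(?=<p><strong>)", html) if part]

print(bodies)
print(*sections, sep="\n \n")
```

Regular expressions work here only because the converted HTML is flat and predictable; nested or attribute-heavy markup would likely need a real parser.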
