What I'm attempting in python:
- Open DocX file
- Convert to HTML (provides tags)
- Use HTMLParser to get start/end/data
- "
<p><strong>" indicates the start of a new section, so I need to append the current tempHTML text to the sections list and then empty thetempHTMLstring for the next run through.
My Code:
import mammoth
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
tempHTML = ''
sections = list()
index = 0
def handle_starttag(self, tag, attrs):
self.tempHTML += '<' + tag + '>'
if(str(self.tempHTML) == "<p><strong>" and len(self.tempHTML) <= len("<p><strong>")):
print("APPENDING")
self.sections.append(self.tempHTML)
self.tempHTML = ''
print("AFTER RESET: " + str(self.tempHTML))
self.index += 1
else:
self.sections.append(self.tempHTML)
def handle_endtag(self, tag):
self.tempHTML += '</' + tag + '>'
def handle_data(self, data):
self.tempHTML += data
with open("4Runner.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value
parser = MyHTMLParser()
parser.feed(html)
print(*MyHTMLParser.sections, sep="\n \n")
print("done")
Current first few lines of output:
<p>
<p><strong>
What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>
What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>
What is the MPG of the 2022 Toyota 4Runner?</strong></p><p>The mpg of the new 2022 Toyota 4Runner is great for a full size SUV. The new 2022 Toyota 4Runner has a large fuel tank capacity, so drivers are given a fair driving range that is perfect for all of the adventures you will go on in this vehicle. </p><p>The new 2022 Toyota 4Runner is the vehicle you need to take on a test drive when you visit xxx. </p><p>
I think the error is that once the first <p><strong> is read, it's true everytime? I'm just not sure and am totally lost!
For any in as odd of a situation/need as me. The solution I used was converting everything to giant html string and then searching with a regex expression.