lxml cannot parse an HTML fragment that contains a certain Unicode character

Question

Asked by Leonardo.Z on 26 September 2021 at 11:59 · 399 views

lxml cannot parse any HTML content that contains the character .

The Python code below cannot find the HTML element by XPath. Furthermore, the result of etree.tostring(root) contains many extra spaces.

Code:

from lxml import html, etree

text = """<div id="content">
  
</div>
"""
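root = html.document_fromstring(text)

# Serialize the parsed tree to inspect what lxml actually built
print(etree.tostring(root))
# Look up the div by its id; this should return one element
content = root.xpath("//div[@id='content']")
print(content)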

Output:

b'<html><body><p>d   i   v       i   d   =   "   c   o   n   t   e   n   t   "   &gt;   \n           1\x14/p></body></html>'
[]

Update: I believe this is due to an lxml bug that was fixed in lxml 4.4.3. However, after checking lxml's changelog and commit history between 4.4.2 and 4.4.3, I still don't know the root cause.
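One way to probe the garbled output, offered as a guess rather than a confirmed root cause: the pattern of one character followed by three spaces resembles markup that was stored as UTF-32 and then read byte by byte. The sketch below uses only the standard library. It assumes the problematic character is code point 119857 (taken from the `&#119857;` reference in the answer), and nothing in it is taken from lxml's source.

```python
import unicodedata

# Assumption: the character from the question is code point 119857 (U+1D431),
# inferred from the &#119857; reference used in the answer.
ch = "\U0001D431"
markup = '<div id="content">'

# In UTF-32-LE every ASCII character is one byte followed by three NUL bytes.
utf32 = markup.encode("utf-32-le")
print(utf32[:12])

# Showing the NUL bytes as spaces gives a "d   i   v" style pattern.
spaced = utf32.replace(b"\x00", b" ").decode("ascii")
print(spaced)

# The character itself lies outside the Basic Multilingual Plane.
ch_name = unicodedata.name(ch)
ch_bytes = ch.encode("utf-32-le")
print(ch_name, ch_bytes)
```

The first byte of the character's UTF-32-LE encoding is 0x31, the ASCII digit 1, which would be consistent with the stray `1` near the end of the output above. This is only circumstantial evidence and does not identify the change that fixed the behaviour in 4.4.3.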

There is 1 answer

**balderman** · Answer 1 · 2021-09-26T12:25:03+00:00

A working ElementTree-based solution is below.

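import xml.etree.ElementTree as ET

parser = ET.XMLParser()
parser.entity["#119857"] = 'x'
html = '''<html><body><p><div id='content'>&#119857;</div></p></body></html>'''
# Pass the configured parser explicitly; otherwise it is never used
root = ET.fromstring(html, parser=parser)
# ElementTree supports a limited XPath subset, including [@attrib="value"]
content = root.find('.//div[@id="content"]')
print(content.text)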

Output:

x
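As a related sketch, the raw character from the question can be converted to a numeric character reference before parsing, so the markup handed to the parser is pure ASCII. This uses the standard `xmlcharrefreplace` error handler and `xml.etree.ElementTree`. The names `raw`, `escaped`, and `doc` are illustrative, and the character is assumed to be code point 119857. ElementTree is an XML parser, so this only works for well-formed markup.

```python
import xml.etree.ElementTree as ET

# Assumption: the character is code point 119857 (U+1D431).
raw = "\U0001D431"

# xmlcharrefreplace rewrites non-ASCII characters as decimal references.
escaped = raw.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# The markup must be well-formed XML for ElementTree to accept it.
doc = f"<html><body><div id='content'>{escaped}</div></body></html>"
root = ET.fromstring(doc)

# The XML parser decodes the numeric reference back to the character.
content = root.find(".//div[@id='content']")
print(content.text == raw)
```

Whether the same pre-escaping step avoids the problem in affected lxml versions is not verified here.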
