lxml cannot parse an HTML fragment that contains a certain Unicode character


lxml cannot parse any HTML content that contains a certain Unicode character.

The Python code below cannot find the HTML element by XPath. Furthermore, the result of etree.tostring(root) contains many extra spaces.

Code:

from lxml import html, etree

text = """<div id="content">
  
</div>
"""
root = html.document_fromstring(text)

print(etree.tostring(root))
content = root.xpath("//div[@id='content']")
print(content)

Output:

b'<html><body><p>d   i   v       i   d   =   "   c   o   n   t   e   n   t   "   &gt;   \n           1\x14/p></body></html>'
[]

Update: I believe this is due to an lxml bug that was fixed in lxml 4.4.3. However, after checking lxml's changelog and the commit history between 4.4.2 and 4.4.3, I still don't know the root cause.
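To see whether an installed lxml already includes that release, a quick check along these lines may help. It is only a sketch: the helper name includes_fix is made up for illustration, and the 4.4.3 threshold is taken from the update above, not verified independently.

```python
def includes_fix(version):
    """Return True if an lxml version tuple is at least 4.4.3."""
    # Compare only major, minor and patch; lxml reports a 4-tuple.
    return tuple(version[:3]) >= (4, 4, 3)


try:
    from lxml import etree
except ImportError:
    print("lxml is not installed")
else:
    # etree.LXML_VERSION is a tuple such as (4, 4, 3, 0)
    print(etree.LXML_VERSION, includes_fix(etree.LXML_VERSION))
```

If the check reports False, upgrading lxml would be the first thing to try before looking for a workaround.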


There is 1 answer

Answered by balderman

A working ElementTree-based solution is below:

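import xml.etree.ElementTree as ET

# Parser with a custom entry in its entity table
parser = ET.XMLParser()
parser.entity["#119857"] = 'x'
html = '''<html><body><p><div id='content'>&#119857;</div></p></body></html>'''
# Pass the configured parser explicitly; otherwise it is never used
root = ET.fromstring(html, parser=parser)
# Look up the div by its id attribute
content = root.find('.//div[@id="content"]')
print(content.text)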

Output:

x
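
For reference, here is a minimal standard-library sketch of how a raw character maps to the &#119857; reference used in the snippet above. It assumes the character in question is U+1D431 (decimal 119857); that code point is inferred from the answer's snippet, since the character itself does not appear in the question. The names raw and escaped are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: the character is U+1D431 (decimal 119857), inferred from the
# &#119857; reference in the answer's snippet.
raw = '<div id="content">\U0001D431</div>'

# Replace every non-ASCII character with a numeric character reference.
escaped = raw.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# The XML parser decodes the reference back into the original character.
root = ET.fromstring(escaped)
print(hex(ord(root.text)))
```

In this sketch the numeric reference is decoded by the parser itself, so the element text comes back as the original character. Whether feeding the escaped string to lxml.html sidesteps the problem on affected lxml versions has not been tested here.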