Can't extract the result as expected when using requests_html

I can't get the expected result when using requests_html:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://www.amazon.com/dp/B07569DYGN')
>>> r.html.find("#productDetails_detailBullets_sections1")
[]

I can find the id 'productDetails_detailBullets_sections1' in the source content:

>>> """<table id="productDetails_detailBullets_sections1" class="a-keyvalue prodDetTable" role="presentation">""" in r.text
True

The same issue also exists in PyQuery.

Why can't requests_html find this element?

There is 1 answer

Answered by Alfe

I searched for #comparison_price_row, which is still found. The next id in the source is comparison_shipping_info_row, but searching for #comparison_shipping_info_row returns an empty list. The two elements are siblings (same parent). I examined all the source between the two but found no problem.

At first.

Then I saw that there is a NUL byte somewhere between the two which probably makes the library stumble.
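One way to confirm a suspicion like this is to look for NUL characters directly in the response text. The following is only a minimal sketch using the standard library; find_nul_offsets is a hypothetical helper name, and the short sample string is a made-up stand-in for r.text:

```python
def find_nul_offsets(text):
    """Return the index of every NUL character in text."""
    return [i for i, ch in enumerate(text) if ch == "\0"]


# Made-up stand-in for r.text: two sibling elements with a stray NUL between them.
sample = '<tr id="a"></tr>\0<tr id="b"></tr>'
offsets = find_nul_offsets(sample)
print(offsets)
```

An empty list would mean the text contains no NUL characters, so the cause would have to be something else.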

After removing the NUL bytes from the input, the wanted element could be found:

import requests_html
# Strip NUL characters before handing the markup to the parser.
r2 = requests_html.HTML(html=r.text.replace('\0', ''))
r2.find('#productDetails_detailBullets_sections1')

[<Element 'table' role='presentation' class=('a-keyvalue', 'prodDetTable') id='productDetails_detailBullets_sections1'>]
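If this comes up on more than one page, the cleanup step could be wrapped in a small helper that also reports how many NUL characters it removed. This is a sketch only; strip_nul is a hypothetical name, not part of requests_html, and the sample markup is made up:

```python
def strip_nul(text):
    """Return (cleaned_text, removed_count) with all NUL characters removed."""
    removed = text.count("\0")
    return text.replace("\0", ""), removed


# Made-up markup with one stray NUL character inside the table.
cleaned, removed = strip_nul('<table id="t">\0</table>')
print(removed, cleaned)
```

The cleaned string can then be passed to requests_html.HTML(html=cleaned) in the same way as above.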