lxml: Xpath works in Chrome but not in lxml

Question

153 views Asked by Oneechan69 At 04 December 2022 at 21:36

I'm trying to scrape information from this episode wiki page on Fandom, specifically the episode title in Japanese, 謀略Ⅳ：ドライバーを奪還せよ！:

Conspiracy IV: Recapture the Driver! (謀略Ⅳ：ドライバーを奪還せよ！, Bōryaku Fō: Doraibā o Dakkan seyo!)

I wrote this XPath, which selects the text in Chrome: //div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text(). It does not work in lxml when I do this:

import requests
from lxml import html

def getPageContent(url):
    # Download the page and parse the raw response bytes into an lxml tree.
    return html.fromstring(requests.get(url).content)

content = getPageContent("https://kamenrider.fandom.com/wiki/Conspiracy_IV:_Recapture_the_Driver!")
JapaneseTitle = content.xpath("//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()")
print(JapaneseTitle)

I had already written these xpaths to scrape other parts of the page which are working:

//h2[@data-source='name']/center/text(), the episode title in English.
//div[@data-source='airdate']/div/text(), the air date.
//div[@data-source='writer']/div/a, the episode writer a element.
//div[@data-source='director']/div/a, the episode director a element.
//p[preceding-sibling::h2[contains(span,'Synopsis')] and following-sibling::h2[contains(span,'Plot')]], all the p elements under the Synopsis section.

Original Q&A

There is 1 answer

**larsks** · Accepted Answer · 2022-12-04T22:11:06+00:00

As with all questions of this sort, start by breaking down your xpath into smaller expressions:

Let's start with the first expression...

>>> content.xpath("//div[@class='mw-parser-output']")
[<Element div at 0x7fbf905d5400>]

Great, that works! But if we add the next component from your expression...

>>> content.xpath("//div[@class='mw-parser-output']/span")
[]

...we don't get any results. A single / selects only direct children, so this tells us that the <div> element matched by the first component of your expression doesn't have any <span> elements as direct children.
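The same narrowing can be automated. The sketch below is only illustrative: `SAMPLE` is a made-up stand-in for the page (not the real Fandom markup), and `first_failing_prefix` is a hypothetical helper, not part of lxml. It appends one step at a time and reports the first cumulative expression that matches nothing.

```python
from lxml import html

# Made-up stand-in for the page as lxml parses it (an assumption, not the
# real markup): the spans are wrapped in a <p> inside the div.
SAMPLE = """
<html><body>
<div class="mw-parser-output">
<p><span><span class="t_nihongo_kanji">謀略Ⅳ：ドライバーを奪還せよ！</span></span></p>
</div>
</body></html>
"""

# The original expression, split by hand into its location steps.
STEPS = [
    "//div[@class='mw-parser-output']",
    "/span",
    "/span[@class='t_nihongo_kanji']",
    "/text()",
]


def first_failing_prefix(doc, steps):
    """Hypothetical helper: return the first cumulative XPath with no matches."""
    expr = ""
    for step in steps:
        expr += step
        if not doc.xpath(expr):
            return expr
    return None  # every prefix matched something


doc = html.fromstring(SAMPLE)
print(first_failing_prefix(doc, STEPS))
```

With this sample the helper reports the prefix ending in /span, the same step that came back empty above. The steps are split by hand because naively splitting an expression on "/" would break predicates that contain slashes.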

If we select the relevant element in Chrome and select "inspect element", and then "copy full xpath", we get:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]

And that looks like it should match. But if we match it (or at least a similar element) using lxml, we see a different path:

>>> res=content.xpath('//span[@class="t_nihongo_kanji"]')[0]
>>> tree = content.getroottree()
>>> tree.getpath(res)
'/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]'

The difference is here:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1] <-- extra <p> element
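For long paths, spotting the differing step by eye is error-prone. A small sketch that compares the two paths quoted above step by step; `first_difference` is a hypothetical helper written for this illustration and uses only the standard library:

```python
from itertools import zip_longest


def first_difference(path_a, path_b):
    """Return (index, step_a, step_b) for the first differing step, or None."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    # zip_longest pads the shorter path with None so a length mismatch
    # is also reported as a difference.
    for index, (a, b) in enumerate(zip_longest(steps_a, steps_b)):
        if a != b:
            return index, a, b
    return None


# Paths copied from the answer: Chrome's "copy full xpath" vs. lxml's getpath().
chrome_path = "/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]"
lxml_path = "/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]"

print(first_difference(chrome_path, lxml_path))
```

For these two paths the first mismatch is at step index 9, where Chrome has span and lxml has p[1].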

One solution is simply to ignore the difference in structure by sticking a // in the middle of the expression, which matches descendants at any depth, so that we have something like:

>>> content.xpath("(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()")
['謀略Ⅳ：ドライバーを奪還せよ！']
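The parentheses around the expression matter. Without them, [1] applies to each parent separately, so every span that is the first match within its own parent is returned; with them, [1] picks the first match in the whole document. A minimal sketch, assuming a made-up document (`SAMPLE` is hypothetical, not the real page) with two such spans:

```python
from lxml import html

# Hypothetical markup with two matching spans under different parents.
SAMPLE = """
<html><body>
<div class="mw-parser-output">
<p><span><span class="t_nihongo_kanji">first</span></span></p>
<p><span><span class="t_nihongo_kanji">second</span></span></p>
</div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# [1] is evaluated per parent: both spans are the first match in their parent.
per_parent = doc.xpath(
    "//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'][1]/text()"
)

# Parenthesized: [1] selects the first node of the combined result.
overall = doc.xpath(
    "(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()"
)

print(per_parent)
print(overall)
```

Here the first query returns both texts and the second returns only "first". Since xpath() returns a list in either case, it is worth checking that the list is non-empty before indexing into it.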
