lxml: XPath works in Chrome but not in lxml


I'm trying to scrape information from this episode wiki page on Fandom, specifically the episode title in Japanese, 謀略Ⅳ:ドライバーを奪還せよ!:

Conspiracy IV: Recapture the Driver! (謀略Ⅳ:ドライバーを奪還せよ!, Bōryaku Fō: Doraibā o Dakkan seyo!)

I wrote this XPath, which selects the text in Chrome: //div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text(). However, it does not work in lxml when I do this:

import requests
from lxml import html

getPageContent = lambda url : html.fromstring(requests.get(url).content)
content = getPageContent("https://kamenrider.fandom.com/wiki/Conspiracy_IV:_Recapture_the_Driver!")
JapaneseTitle = content.xpath("//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()")
print(JapaneseTitle)

I had already written these XPaths to scrape other parts of the page, and they work:

  • //h2[@data-source='name']/center/text(), the episode title in English.
  • //div[@data-source='airdate']/div/text(), the air date.
  • //div[@data-source='writer']/div/a, the episode writer a element.
  • //div[@data-source='director']/div/a, the episode director a element.
  • //p[preceding-sibling::h2[contains(span,'Synopsis')] and following-sibling::h2[contains(span,'Plot')]], all the p elements under the Synopsis section.

Best answer, by larsks:

As with all questions of this sort, start by breaking your XPath down into smaller expressions and testing each one.

Let's start with the first component...

>>> content.xpath("//div[@class='mw-parser-output']")
[<Element div at 0x7fbf905d5400>]

Great, that works! But if we add the next component from your expression...

>>> content.xpath("//div[@class='mw-parser-output']/span")
[]

...we don't get any results. It looks like the <div> element matched by the first component of your expression doesn't have any direct children that are <span> elements.
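
The same step-by-step check can be scripted. The following is only a minimal sketch: SAMPLE is a made-up stand-in for the page as lxml parses it (not the real Fandom markup), and count_matches_per_step is a hypothetical helper name. It evaluates each cumulative prefix of the expression and records how many nodes match, so the first prefix with zero matches points at the step that fails.

```python
from lxml import html

# Made-up stand-in for the page as lxml parses it: the spans sit inside a <p>.
SAMPLE = """<html><body>
<div class="mw-parser-output">
  <p><span><span class="t_nihongo_kanji">謀略Ⅳ</span></span></p>
</div>
</body></html>"""


def count_matches_per_step(tree, steps):
    """Return (prefix, match_count) for each cumulative prefix of the XPath."""
    report = []
    prefix = ""
    for step in steps:
        prefix += step
        report.append((prefix, len(tree.xpath(prefix))))
    return report


tree = html.fromstring(SAMPLE)
# The original expression, split by hand into its location steps.
steps = [
    "//div[@class='mw-parser-output']",
    "/span",
    "/span[@class='t_nihongo_kanji']",
    "/text()",
]
report = count_matches_per_step(tree, steps)
for prefix, count in report:
    print(count, prefix)
```

Passing the steps as a list avoids having to split the expression on "/" automatically, which would be fragile for predicates that themselves contain slashes.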

If we right-click the relevant element in Chrome, choose "Inspect", and then use "Copy" > "Copy full XPath" in the Elements panel, we get:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]

That looks like it should match. But if we locate the same element (or at least a similar one) using lxml and ask for its path, we see something different:

>>> res=content.xpath('//span[@class="t_nihongo_kanji"]')[0]
>>> tree = content.getroottree()
>>> tree.getpath(res)
'/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]'

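The difference is here:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1] <-- extra <p> element

Chrome reports the path in the DOM as the browser built it, while lxml parses the raw HTML that requests downloaded, so the two trees can differ.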
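
Spotting the differing step by eye gets harder with long paths. As a rough sketch, the comparison could be done with a small helper like the hypothetical first_difference below, which walks the two paths step by step. It assumes plain absolute paths of the kind getpath() and Chrome produce, with no slashes inside predicates.

```python
from itertools import zip_longest


def first_difference(path_a, path_b):
    """Return (index, step_a, step_b) for the first differing step, or None."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    # zip_longest pads the shorter path with None, so a missing step also counts.
    for index, (a, b) in enumerate(zip_longest(steps_a, steps_b)):
        if a != b:
            return index, a, b
    return None


chrome_path = "/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]"
lxml_path = "/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]"

diff = first_difference(chrome_path, lxml_path)
print(diff)  # (9, 'span', 'p[1]')
```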

One solution is simply to ignore the difference in structure by putting a // (descendant step) in the middle of the expression and taking the first match, so that we have something like:

>>> content.xpath("(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()")
['謀略Ⅳ:ドライバーを奪還せよ!']
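
To see the two expressions side by side without hitting the network, here is a minimal sketch. SAMPLE is again a made-up stand-in for the parsed page rather than the real markup, and first_text is a hypothetical helper that returns None instead of raising when nothing matches.

```python
from lxml import html

# Made-up stand-in for the page as lxml parses it: the spans sit inside a <p>.
SAMPLE = """<html><body>
<div class="mw-parser-output">
  <p><span><span class="t_nihongo_kanji">謀略Ⅳ</span></span></p>
</div>
</body></html>"""


def first_text(tree, xpath):
    """Return the first result of the XPath, or None if nothing matches."""
    matches = tree.xpath(xpath)
    return matches[0] if matches else None


tree = html.fromstring(SAMPLE)

# Child steps only: fails because the spans are not direct children of the div.
strict = "//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()"
# Descendant step: matches the span at any depth below the div.
relaxed = "(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()"

print(first_text(tree, strict))   # None
print(first_text(tree, relaxed))  # 謀略Ⅳ
```

Returning None for an empty result makes a structural mismatch easy to detect in a scraper, instead of failing later with an IndexError.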