I'm trying to scrape information from this episode wiki page on Fandom, specifically the episode title in Japanese, 謀略Ⅳ:ドライバーを奪還せよ!:
Conspiracy IV: Recapture the Driver! (謀略Ⅳ:ドライバーを奪還せよ!, Bōryaku Fō: Doraibā o Dakkan seyo!)
I wrote this XPath, which selects the text in Chrome: `//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()`, but it does not work in lxml when I do this:
import requests
from lxml import html
getPageContent = lambda url : html.fromstring(requests.get(url).content)
content = getPageContent("https://kamenrider.fandom.com/wiki/Conspiracy_IV:_Recapture_the_Driver!")
JapaneseTitle = content.xpath("//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()")
print(JapaneseTitle)
I had already written these XPath expressions, which scrape other parts of the page and are working:

- `//h2[@data-source='name']/center/text()`, the episode title in English.
- `//div[@data-source='airdate']/div/text()`, the air date.
- `//div[@data-source='writer']/div/a`, the episode writer `a` element.
- `//div[@data-source='director']/div/a`, the episode director `a` element.
- `//p[preceding-sibling::h2[contains(span,'Synopsis')] and following-sibling::h2[contains(span,'Plot')]]`, all the `p` elements under the Synopsis section.
As with all questions of this sort, start by breaking down your xpath into smaller expressions:
Let's start with the first component of your expression, `//div[@class='mw-parser-output']`...
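As a sketch of that first step (run against a small, hypothetical HTML fragment standing in for the served page, since the markup `requests` receives can differ from what Chrome renders; the fragment, including the intermediate `<p>`, is an assumption):

```python
from lxml import html

# Hypothetical, simplified stand-in for the HTML that requests receives;
# the real Fandom markup is larger and may nest elements differently.
doc = html.fromstring(
    '<div class="mw-parser-output"><p>'
    '<span><span class="t_nihongo_kanji">謀略Ⅳ:ドライバーを奪還せよ!</span></span>'
    '</p></div>'
)

# The first component alone matches the wrapper div:
# a one-element list containing the div.
print(doc.xpath("//div[@class='mw-parser-output']"))
```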
Great, that works! But if we add the next component from your expression, `/span`...
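Against the same hypothetical fragment (where an assumed `<p>` sits between the div and the spans), the child step finds nothing:

```python
from lxml import html

# Same hypothetical fragment; the intermediate <p> is an assumption.
doc = html.fromstring(
    '<div class="mw-parser-output"><p>'
    '<span><span class="t_nihongo_kanji">謀略Ⅳ:ドライバーを奪還せよ!</span></span>'
    '</p></div>'
)

# /span requires an *immediate* span child of the div, so nothing matches.
print(doc.xpath("//div[@class='mw-parser-output']/span"))  # → []
```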
...we don't get any results. It looks like the `<div>` element matched by the first component of your expression doesn't have any immediate descendants that are `<span>` elements.

If we select the relevant element in Chrome, choose "Inspect element", and then "Copy full XPath", we get:
And that looks like it should match. But if we match it (or at least a similar element) using `lxml`, we see a different path:

The difference is here:
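One way to see where the element actually sits in lxml's tree is `getroottree().getpath()`, which returns an absolute path for an element (sketched here against the same hypothetical fragment; the intermediate `<p>` is an assumption):

```python
from lxml import html

# Hypothetical fragment; the assumed <p> models whatever intermediate
# element the served HTML inserts between the div and the spans.
doc = html.fromstring(
    '<div class="mw-parser-output"><p>'
    '<span><span class="t_nihongo_kanji">謀略Ⅳ:ドライバーを奪還せよ!</span></span>'
    '</p></div>'
)

kanji = doc.xpath("//span[@class='t_nihongo_kanji']")[0]
# getpath() reports the element's real ancestry, which can differ from
# the path Chrome's inspector builds from the *rendered* DOM.
print(kanji.getroottree().getpath(kanji))
```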
One solution is simply to ignore the difference in structure by sticking a `//` in the middle of the expression, so that we have something like `//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji']/text()`, which matches the span at any depth beneath the div.
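A sketch of that relaxed expression; the fragment (and its intermediate `<p>`) is hypothetical, and on the live page you would run the same expression against `html.fromstring(requests.get(url).content)`:

```python
from lxml import html

# Hypothetical served markup with an assumed intermediate <p>.
doc = html.fromstring(
    '<div class="mw-parser-output"><p>'
    '<span><span class="t_nihongo_kanji">謀略Ⅳ:ドライバーを奪還せよ!</span></span>'
    '</p></div>'
)

# '//' descends to any depth below the div, so the intermediate
# element no longer matters.
title = doc.xpath(
    "//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji']/text()"
)
print(title)  # → ['謀略Ⅳ:ドライバーを奪還せよ!']
```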