I am using import.io to scrape some pages. I came across a page that uses internal hrefs like this: http://domain.com//Event (notice the double slash after the domain name). From my research, this is done for SEO purposes, but I need the URL without the double slash, so that it returns http://domain.com/Event.
I am trying to use XPath (which I'm very new to) and I can get the link fine with: //a[contains(@class, 'event-info-btn')]//@href.
My next step was to try fn:replace() with this: fn:replace(//a[contains(@class, 'event-info-btn')]//@href, 'http://domain.com//', 'http://domain.com/'). This isn't working: nothing is returned.
I'm not sure if my implementation is bad, or if import.io just doesn't support this.
I'll also note why I'm trying to do this: import.io fails on all of these URLs. If I manually remove the extra slash and try again, it works fine.
Note that import.io claims to support XPath 2.0.
Problem
You probably mean /@href rather than //@href, but that's not the real problem. Your XPath is returning a sequence of href attributes where replace() is expecting a string.
Solution
For this HTML,
this XPath,
will return
as requested.
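One way to express this, as a minimal sketch rather than a definitive expression: assume a link such as the one shown in the first comment below, and use an XPath 2.0 for expression so that replace() receives one href at a time instead of a sequence. The sample markup and the variable name $href are illustrative assumptions.

```xpath
(: Assumed input: <a class="event-info-btn" href="http://domain.com//Event">Info</a> :)
for $href in //a[contains(@class, 'event-info-btn')]/@href
return replace($href, 'http://domain.com//', 'http://domain.com/')
(: For the assumed input, this evaluates to: http://domain.com/Event :)
```

The second argument of replace() is a regular expression, so the unescaped dots match any character; that is harmless for this pattern.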
Update
You can see this working here.
You might try putting the XPath on a single line, then:
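A single-line form might look like the following, sketched under the assumption that the expression is a for loop over the matched href attributes ($href is an illustrative name):

```xpath
for $href in //a[contains(@class, 'event-info-btn')]/@href return replace($href, 'http://domain.com//', 'http://domain.com/')
```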
If that doesn't work, then import.io's claim that they support XPath 2.0 is not correct.