Replace double slash with single slash in import.io XPath selector

Question

Asked by Chris Rockwell on 05 September 2016 at 00:48 · 177 views

I am using import.io to scrape some pages. I came across a page that uses internal hrefs like this: http://domain.com//Event - notice the double slash after the domain name. From my research, this is done for SEO purposes, but I need to get the URL without the double slash, so that it returns http://domain.com/Event.

I am trying to use XPath (which I'm very new to) and I can get the link fine with: //a[contains(@class, 'event-info-btn')]//@href.

My next step was to try fn:replace() with this: fn:replace(//a[contains(@class, 'event-info-btn')]//@href, 'http://domain.com//', 'http://domain.com/'). This isn't working - nothing is returned.

I'm not sure if my implementation is bad, or if import.io just doesn't support this.

I'll also note the reason why I'm trying to do this: import.io is failing on all of the urls. If I manually remove the slash and try again, it works fine.

**kjhughes** · Accepted Answer · 2016-09-05T02:43:17+00:00

Note that import.io claims to support XPath 2.0.

Problem

You probably mean /@href rather than //@href, but that's not the real problem.

Your XPath returns a sequence of href attributes, whereas replace() expects a single string as its first argument.
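One way to check whether cardinality is really the issue is to hand replace() exactly one attribute. This is only a diagnostic sketch, assuming the processor genuinely implements XPath 2.0; the [1] predicate is added here for illustration and is not part of the original question.

```xpath
(: Diagnostic only: select just the first matching href so that
   replace() receives a single item instead of a sequence. :)
replace((//a[contains(@class, 'event-info-btn')]/@href)[1], 'http://domain.com//', 'http://domain.com/')
```

If this returns a cleaned URL while the original expression returns nothing, the multi-item sequence is the likely culprit. It only handles the first link, so it is a test rather than a fix.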

Solution

For this HTML,

<div>
  <a class="event-info-btn" href="http://domain.com//1">one</a>
  <a class="event-info-btn" href="http://domain.com//2">one</a>
  <a class="event-info-btn" href="http://domain.com//3">one</a>
</div>

this XPath,

for $href in //a[contains(@class, 'event-info-btn')]/@href 
    return replace($href, 'http://domain.com//', 'http://domain.com/')

will return

http://domain.com/1
http://domain.com/2
http://domain.com/3

as requested.

Update

In reply to the comment "This doesn't work in import.io and I'm having trouble finding a fiddle-like site to test it":

You can see this working here.

In reply to the comment "Import.io, it seems, only allows you to input one line of xpath":

You might try putting the XPath on a single line, then:

for $href in //a[contains(@class, 'event-info-btn')]/@href return replace($href, 'http://domain.com//', 'http://domain.com/')

If that doesn't work, then import.io's claim that they support XPath 2.0 is not correct.
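Another single-line form that may be worth trying, again only as a sketch that assumes real XPath 2.0 support, applies replace() as the final step of the path so that it runs once per href. The regular expression is an illustrative generalisation rather than something taken from the question: it collapses two or more slashes directly after any http or https host, so the domain does not have to be hard-coded.

```xpath
(: Assumes XPath 2.0: a function call is allowed as the last path step,
   and $1 in the replacement refers to the captured scheme and host. :)
//a[contains(@class, 'event-info-btn')]/@href/replace(., '^(https?://[^/]+)/{2,}', '$1/')
```

The comment can be dropped if the tool accepts only one line. Whether import.io evaluates either form is not something this sketch can confirm.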
