I need to parse a malformed HTML-page and extract certain URLs from it as any kind of Collection. I don't really care what kind of Collection, I just need to be able to iterate over it.
Let's say we have a structure like this:
<html>
<body>
<div class="outer">
<div class="inner">
<a href="http://www.google.com" title="Google">Google-Link</a>
<a href="http://www.useless.com" title="I don't need this">Blah blah</a>
</div>
<div class="inner">
<a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
<a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
</div>
</div>
</body>
</html>
And here is what I do so far:
// tagsoup version 1.2 is under apache license 2.0
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser());
GPathResult nodes = slurper.parse("test.html");
def links = nodes."**".findAll { it.@class == "inner" }
println links
I want something like
["http://google.com", "http://youtube.com"]
but all I get is:
["Google-LinkBlah blah", "Youtube-LinkBlah blah2"]
To be more precise I can't use all URLs, because the HTML-document, that I need parse is about 15-thousand lines long and has alot of URLs that I don't need. So I need the first URL in each "inner" block.
As The Trav says, you need to grab the
hrefattribute from each matchingatag.You've edited your question so the
classbit in thefindAllmakes no sense, but with the current HTML example, this should work:Edit
If (as you say after the edit) you just want the first
ainside anything marked withclass="inner", then try: