Correlating HTML elements when scraping with golang/goquery/colly


I've been using colly for some simple web scraping tasks. It works fine in most cases where the page layout is consistent or the logic is simple (many existing examples and projects amount to "here's how you find the second table").

I'm trying to do more context-aware scraping in order to enrich the results. For example, take the following representative page layout:

<h2>Beans Table</h2>
<table class="myTableClass">
<tr> <td></td><td></td><td></td><td></td> </tr>
<tr> <td></td><td></td><td></td><td></td> </tr>
<tr> <td></td><td></td><td></td><td></td> </tr>
</table>

<h2>Rice Table</h2>
<table class="myTableClass">
<tr> <td></td><td></td><td></td><td></td> </tr>
</table>

If I wanted to grab every cell in all myTableClass tables, I could do something like this:

   c.OnHTML(".myTableClass tr", func(e *colly.HTMLElement) {
      // e.DOM is the goquery selection for the matched <tr>
      goquerySelection := e.DOM
      goquerySelection.Find("td").Each(func(i int, s *goquery.Selection) {
          fmt.Printf("%d, Cell value: %s\n", i, s.Text())
      })
   })

Or, if I wanted to find the headings above the tables, I could do this:

   c.OnHTML("h2", func(e *colly.HTMLElement) {
      if strings.Contains(e.Text, "Beans") {
         log.Println("Beans table follows")
         // e.DOM is the goquery selection for the matched <h2>
         if html, err := e.DOM.Html(); err == nil {
            log.Println(html)
         }
      }
   })

But I don't see an easy way to correlate "this table is under this heading". The index values exposed by colly's objects are all relative to the matched selection after parsing, not to a position in the page, and the goquery APIs also look slanted towards "iterate all of these tags for me".

I have a partial solution right now: I pull colly.HTMLElement.DOM.Html() as part of the request/initialization and try to recover positional information with string matching. That doesn't seem very clean. Is there a supported way to maintain positional awareness when iterating over a webpage?
