Python - HTML Parser - Narrow Down Scrape


I am new to HTMLParser. I have written a spider in Python to crawl a website; the code is below. It collects every URL that appears in the href attribute of an "a" start tag. I would like to filter further and keep only URLs that contain a specific word. My current workaround is to write the crawled URLs to a txt file, read that file back, filter it by my keyword, and write the results to a new txt file. I feel it would be more efficient if the crawler itself only collected "a" tags whose href attribute contains word XXX.

Is there a way to expand the "if" statement in the handle_starttag method so that it only scrapes URLs which contain a specific word? The word usually appears in the href link in the HTML.

from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # HTMLParser.feed() calls this method for every opening tag; only <a> tags are handled here
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass


There is 1 answer

Answered by dskrypa

You may have an easier time using BeautifulSoup than the lower-level HTMLParser.

To add the filter to your current implementation, you could give the LinkFinder constructor an extra parameter, store its value, and use it in the conditional:

from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url, url_filter):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()
        # Substring that an href must contain to be collected
        self.url_filter = url_filter

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                # value is None for a bare attribute such as <a href>, so check it first
                if attribute == 'href' and value and self.url_filter in value:
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)
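
If you would rather match the word against the resolved URL than the raw href, and ignore case, a variant along these lines might work. This is only a sketch: the class name FilteredLinkFinder, the keyword parameter, the sample HTML, and the example.com base URL are all made up for illustration and are not part of your spider.

```python
from html.parser import HTMLParser
from urllib import parse


class FilteredLinkFinder(HTMLParser):
    """Hypothetical variant: collect <a href> links whose resolved URL contains a keyword."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword.lower()  # compare case-insensitively
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for attribute, value in attrs:
            # value is None for a bare attribute such as <a href>, so skip it
            if attribute == 'href' and value:
                url = parse.urljoin(self.base_url, value)
                if self.keyword in url.lower():
                    self.links.add(url)


# Made-up page content, used only to exercise the parser
sample_html = """
<a href="/jobs/python-developer">Python job</a>
<a href="/about">About</a>
<a href="https://example.com/Jobs/list">All jobs</a>
<a href>Empty</a>
<p href="/jobs/ignored">Not a link</p>
"""

finder = FilteredLinkFinder('https://example.com/', 'jobs')
finder.feed(sample_html)
print(sorted(finder.links))
```

One thing to watch with this variant: because the check runs on the joined URL, a keyword that also appears in the base URL itself would match every relative link on the page. Filtering on the raw href, as in the class above, avoids that.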