I am new to HTMLParser. I have written a spider in Python that crawls a website; my code is included below. It collects every URL found in an "a" start tag with an href attribute. However, I would like to filter this further so that it only scrapes URLs containing a specific word. My current workaround is to write the crawled URLs to a txt file, read that file back, filter it by my keyword, and write the results to a new txt file. I feel it would be more efficient to narrow the crawler itself so that it only looks at "a" tags with an href attribute where word XXX exists.
Is there a way to expand the "if" statement within the handle_starttag function so that it only scrapes URLs containing a specific word? The word is usually contained in the href link in the HTML as well.
from html.parser import HTMLParser
from urllib import parse

class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    # When we call HTMLParser feed() this function is called when it encounters an opening tag <a>
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)

    def page_links(self):
        return self.links

    def error(self, message):
        pass
You may have an easier time using BeautifulSoup than the lower-level HTMLParser.
To add the additional filter to your current implementation, you could add an additional parameter to your LinkFinder class, store the value, and use it in the conditional.
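A minimal sketch of that change, assuming the extra parameter is called `keyword` (a name chosen here for illustration, not part of your original code) and that a plain case-sensitive substring match on the raw href is what you want. The sample HTML and the `example.com` URLs are made up for the demonstration:

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):

    def __init__(self, base_url, page_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.keyword = keyword  # hypothetical extra parameter: the word to filter on
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for attribute, value in attrs:
            # value is None for a bare attribute such as <a href>, so guard against it
            if attribute == 'href' and value and self.keyword in value:
                self.links.add(parse.urljoin(self.base_url, value))

    def page_links(self):
        return self.links


# Made-up sample page: only the hrefs containing "products" should be kept
html = (
    '<a href="/products/1">One</a>'
    '<a href="/about">About</a>'
    '<a href="https://example.com/products/2">Two</a>'
)
finder = LinkFinder('https://example.com/', 'https://example.com/', 'products')
finder.feed(html)
print(sorted(finder.page_links()))
```

The check runs against the href as written in the page, before it is joined with the base URL, so a keyword that happens to appear in your own domain name will not match every link. If you need a case-insensitive match, comparing `self.keyword.lower()` against `value.lower()` is one option.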