I'm using Scrapy to crawl pages from different websites. With every scrapy.Request() I set some metadata, which is later used to yield an item. My code may also yield multiple scrapy.Request() objects for the same URL, each with different meta.
yield scrapy.Request(url='http://www.example.com', meta={'some_field': 'some_value'} ..)
I can set dont_filter=True so that Scrapy doesn't filter out the duplicate request:
yield scrapy.Request(url='http://www.example.com', meta={'some_other_field': 'some_other_value'}, dont_filter=True, ..)
However, for duplicate requests I'm only interested in the metadata set on scrapy.Request(), so I want to yield an Item from RFPDupeFilter or a CustomDupFilter, so that it gets written to JSON by the item pipeline.
class CustomDupFilter(BaseDupeFilter):
    def request_seen(self, request: Request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            yield request.meta['some_other_value']  # yield metadata as Item
            self.fingerprints.add(fp)
            return True
        else:
            return False
Any help is much appreciated.
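I don't think you can yield items from a dupefilter: request_seen() is expected to return a bool, and adding a yield turns it into a generator function, so calling it returns a generator object (always truthy) instead of True or False. One way around this is to disable the filter and handle duplicate requests in a custom spider middleware. Maybe something like this: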
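A minimal sketch of the middleware idea, not tested against a live crawl. The class name DuplicateRequestToItemMiddleware and the simplified _fingerprint helper are hypothetical; only the process_spider_output(response, result, spider) hook is Scrapy's own. To keep the snippet runnable without Scrapy installed, requests are detected by duck typing and stand-in objects are used in the demo. In a real project you would check isinstance(obj, scrapy.Request) and use Scrapy's own request fingerprinting instead.

```python
import hashlib
from types import SimpleNamespace


class DuplicateRequestToItemMiddleware:
    """Hypothetical spider middleware: turn repeated requests into items."""

    def __init__(self):
        self.seen = set()  # fingerprints of requests already let through

    @staticmethod
    def _fingerprint(request):
        # Simplified stand-in for Scrapy's request fingerprint (assumption):
        # only the HTTP method and the URL are hashed.
        method = getattr(request, "method", "GET")
        return hashlib.sha1(f"{method} {request.url}".encode()).hexdigest()

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Duck-typed check so the sketch runs without Scrapy;
            # isinstance(obj, scrapy.Request) is the safer test in a project.
            if not (hasattr(obj, "url") and hasattr(obj, "meta")):
                yield obj  # items pass through untouched
                continue
            fp = self._fingerprint(obj)
            if fp in self.seen:
                yield dict(obj.meta)  # duplicate: emit its meta as an item
            else:
                self.seen.add(fp)
                yield obj  # first occurrence: let the request be scheduled


# Demo with stand-in objects instead of real scrapy.Request instances.
first = SimpleNamespace(url="http://www.example.com",
                        meta={"some_field": "some_value"})
second = SimpleNamespace(url="http://www.example.com",
                         meta={"some_other_field": "some_other_value"})

mw = DuplicateRequestToItemMiddleware()
out = list(mw.process_spider_output(None, [first, second], None))
```

In the demo, the first request is passed on unchanged and the second comes out as a plain dict built from its meta, which the item pipeline can then export. The middleware would be enabled through the SPIDER_MIDDLEWARES setting. Note that in a real crawl meta can also contain keys that Scrapy adds itself, so you may want to copy only the fields you need.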