How to yield an item from RFPDupeFilter or CustomDupFilter

I'm using Scrapy to crawl pages from different websites. With every scrapy.Request() I set some metadata that is later used to yield an item. My code may also yield multiple scrapy.Request() objects for the same URL, each with different meta.

yield scrapy.Request(url='http://www.example.com', meta={'some_field': 'some_value'} ..)

I can set dont_filter=True so that Scrapy doesn't drop the duplicate request.

yield scrapy.Request(url='http://www.example.com', meta={'some_other_field': 'some_other_value'}, dont_filter=True, ..)

However, for duplicate requests I'm only interested in the metadata set on scrapy.Request(). I would therefore like to yield an item from RFPDupeFilter or a CustomDupFilter, so that the item pipeline writes it to JSON.

    class CustomDupFilter(BaseDupeFilter):

        def request_seen(self, request: Request) -> bool:
            fp = self.request_fingerprint(request)
            if fp in self.fingerprints:
                yield request.meta['some_other_field'] # yield metadata as Item
                self.fingerprints.add(fp)
                return True
            else:
                return False

Any help is much appreciated.

There is 1 answer

Answer by zaki98 (best answer)

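I don't think you can yield items from a dupefilter. request_seen() is expected to return a boolean, and adding a yield turns it into a generator function: the call then returns a generator object, which is always truthy, and its body never runs, so nothing reaches the item pipeline. One way around this is to disable the filter and handle duplicate requests in a custom spider middleware instead. Maybe something like this: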

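import scrapy
from scrapy.utils.request import fingerprint  # available in Scrapy 2.7+


class DupeFilterMiddleware:
    # Fingerprints of requests that were already let through
    seen_requests = set()

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, scrapy.Request) and fingerprint(output) in self.seen_requests:
                # Duplicate request: drop it and yield its meta as an item instead
                yield dict(output.meta)
            elif isinstance(output, scrapy.Request):
                # First time this request is seen: remember it and pass it on
                self.seen_requests.add(fingerprint(output))
                yield output
            else:
                # Items pass through unchanged
                yield output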
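
To try the middleware above, the built-in filter has to be switched off and the middleware registered in the project settings. A minimal sketch, assuming the class lives in a hypothetical module myproject/middlewares.py; the priority 543 is an arbitrary choice, not something from the question:

```python
# settings.py (sketch; the module path and the priority are assumptions)

# BaseDupeFilter never reports a request as seen, so this turns off
# Scrapy's built-in duplicate filtering for the whole project.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"

# Register the spider middleware so it sees the output of spider callbacks.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DupeFilterMiddleware": 543,
}
```

With the filter disabled this way, dont_filter=True should no longer be needed on individual requests, but nothing else deduplicates them either, so the middleware becomes the only duplicate check.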