The following code gets a 200 response, but the spider closes without returning the requested data.
I understand it may be a problem with the XPath expressions, but I have checked them all over and over again in scrapy shell and I think they are correct.
Very similar code has worked for me many times before, so I don't know what I am missing this time. The data are available in the page source of the website, so it does not appear to be a dynamic-loading problem.
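For reference, this is roughly how the selectors can be checked in scrapy shell (a sketch; it uses the same URL and XPath as the spider below, with the output elided):

scrapy shell "https://www.paginasamarillas.com.ar/buscar/q/funerarias/"
>>> # the listing blocks the spider iterates over
>>> response.xpath('//div[contains(@class, "figBox")]')
[...]
>>> # one of the fields extracted from each block
>>> response.xpath('//div[contains(@class, "figBox")]//span[@class="semibold"]/text()').getall()
[...]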
Thanks for any help.
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess


class Articulo(Item):
    nombre = Field()
    direccion = Field()
    telefono = Field()
    comunaregion = Field()


class SeccionAmarillaCrawler(CrawlSpider):
    name = 'scraperfunerarias'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36'
    }
    allowed_domains = ['paginasamarillas.com']
    start_urls = ["https://www.paginasamarillas.com.ar/buscar/q/funerarias/"]
    download_delay = 3

    rules = (
        Rule(
            LinkExtractor(
                allow=r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/?tieneCobertura=true'
            ),
            follow=True, callback="parseador"
        ),
    )

    def parseador(self, response):
        sel = Selector(response)
        funerarias = sel.xpath('//div[contains(@class, "figBox")]')
        for funeraria in funerarias:
            item = ItemLoader(Articulo(), funeraria)
            item.add_xpath('nombre', './/span[@class="semibold"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('direccion', './/span[@class="directionFig"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('telefono', './/span[@itemprop="telephone"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('comunaregion', './/span[@class="city"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            yield item.load_item()


process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'datos_scrapeados.csv'
})
process.crawl(SeccionAmarillaCrawler)
process.start()
OUTPUT
2022-03-16 16:30:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-16 16:30:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-35-generic-x86_64-with-glibc2.29
2022-03-16 16:30:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-16 16:30:23 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 '
'Safari/537.36'}
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet Password: 9eb59ae51c5aae24
2022-03-16 16:30:23 [py.warnings] WARNING: /home/maka/.local/lib/python3.8/site-packages/scrapy/extensions/feedexport.py:247: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-16 16:30:23 [scrapy.core.engine] INFO: Spider opened
2022-03-16 16:30:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-16 16:30:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paginasamarillas.com.ar/buscar/q/funerarias/> (referer: None)
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-16 16:30:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 346,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 34763,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.26536,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 16, 15, 30, 25, 219139),
'httpcompression/response_bytes': 367033,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 103419904,
'memusage/startup': 103419904,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 16, 15, 30, 23, 953779)}
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Spider closed (finished)
I found two problems:

1. ".ar" is missing in allowed_domains: the start URL is on paginasamarillas.com.ar, so any link the spider extracts is treated as offsite and filtered out.
2. "?" has a special meaning in regex, so you have to use "\?" instead of "?" in the allow pattern. But you can also use something simpler, like

allow=r'funerarias/p-\d+'

And now your code works for me. You can see the effect in your log: only the start URL is crawled and the spider closes immediately, because the pagination links never make it past the rule.
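To see why the original pattern never matches, here is a quick check (a sketch with a made-up page URL of the shape the pattern targets):

import re

url = 'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-2/?tieneCobertura=true'

# '/?' makes the slash optional instead of matching a literal '?', so this finds nothing
print(re.search(r'p-\d+/?tieneCobertura=true', url))   # None

# escaping the '?' matches the real URL
print(re.search(r'p-\d+/\?tieneCobertura=true', url))  # <re.Match object; ...>

And a minimal sketch of the corrected pieces, leaving the rest of the spider as you have it (FEEDS replaces the deprecated FEED_URI/FEED_FORMAT pair that the warning in your log mentions):

allowed_domains = ['paginasamarillas.com.ar']

rules = (
    Rule(
        LinkExtractor(allow=r'funerarias/p-\d+'),
        follow=True, callback='parseador'
    ),
)

process = CrawlerProcess({
    'FEEDS': {'datos_scrapeados.csv': {'format': 'csv'}},
})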