The following code gets a 200 response, but the spider closes without returning the requested data.
I understand it may be a problem with the XPath expressions, but I have checked them all over and over again in scrapy shell and I think they are correct.
Very similar code has worked for me many times before, so I don't know what I am missing this time. The data are available in the page source of the website, so it does not appear to be a dynamic-loading problem.
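For reference, this is roughly how the selectors can be checked in scrapy shell (a sketch; it uses the same URL and XPath as the spider below, with the output elided):

scrapy shell "https://www.paginasamarillas.com.ar/buscar/q/funerarias/"
>>> # the listing blocks the spider iterates over
>>> response.xpath('//div[contains(@class, "figBox")]')
[...]
>>> # one of the fields extracted from each block
>>> response.xpath('//div[contains(@class, "figBox")]//span[@class="semibold"]/text()').getall()
[...]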
Thanks for any help.
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess


class Articulo(Item):
    nombre = Field()
    direccion = Field()
    telefono = Field()
    comunaregion = Field()


class SeccionAmarillaCrawler(CrawlSpider):
    name = 'scraperfunerarias'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36'
    }
    allowed_domains = ['paginasamarillas.com']
    start_urls = ["https://www.paginasamarillas.com.ar/buscar/q/funerarias/"]
    download_delay = 3

    rules = (
        Rule(
            LinkExtractor(
                allow=r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/?tieneCobertura=true'
            ),
            follow=True, callback="parseador"
        ),
    )

    def parseador(self, response):
        sel = Selector(response)
        funerarias = sel.xpath('//div[contains(@class, "figBox")]')
        for funeraria in funerarias:
            item = ItemLoader(Articulo(), funeraria)
            item.add_xpath('nombre', './/span[@class="semibold"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('direccion', './/span[@class="directionFig"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('telefono', './/span[@itemprop="telephone"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            item.add_xpath('comunaregion', './/span[@class="city"]/text()', MapCompose(lambda i: i.replace('\n', '').replace('\r', '').replace('\t', '').strip()))
            yield item.load_item()


process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'datos_scrapeados.csv'
})
process.crawl(SeccionAmarillaCrawler)
process.start()
OUTPUT
2022-03-16 16:30:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-16 16:30:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-35-generic-x86_64-with-glibc2.29
2022-03-16 16:30:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-16 16:30:23 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 '
'Safari/537.36'}
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet Password: 9eb59ae51c5aae24
2022-03-16 16:30:23 [py.warnings] WARNING: /home/maka/.local/lib/python3.8/site-packages/scrapy/extensions/feedexport.py:247: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-16 16:30:23 [scrapy.core.engine] INFO: Spider opened
2022-03-16 16:30:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-16 16:30:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paginasamarillas.com.ar/buscar/q/funerarias/> (referer: None)
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-16 16:30:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 346,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 34763,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.26536,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 16, 15, 30, 25, 219139),
'httpcompression/response_bytes': 367033,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 103419904,
'memusage/startup': 103419904,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 3, 16, 15, 30, 23, 953779)}
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Spider closed (finished)
I found two problems:

1. ".ar" is missing in allowed_domains: the start URL is on paginasamarillas.com.ar, so any link the spider extracts is treated as offsite and filtered out.
2. "?" has a special meaning in regex, so you have to use "\?" instead of "?" in the allow pattern. But you can also use something simpler, like

allow=r'funerarias/p-\d+'

And now your code works for me. You can see the effect in your log: only the start URL is crawled and the spider closes immediately, because the pagination links never make it past the rule.
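To see why the original pattern never matches, here is a quick check (a sketch with a made-up page URL of the shape the pattern targets):

import re

url = 'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-2/?tieneCobertura=true'

# '/?' makes the slash optional instead of matching a literal '?', so this finds nothing
print(re.search(r'p-\d+/?tieneCobertura=true', url))   # None

# escaping the '?' matches the real URL
print(re.search(r'p-\d+/\?tieneCobertura=true', url))  # <re.Match object; ...>

And a minimal sketch of the corrected pieces, leaving the rest of the spider as you have it (FEEDS replaces the deprecated FEED_URI/FEED_FORMAT pair that the warning in your log mentions):

allowed_domains = ['paginasamarillas.com.ar']

rules = (
    Rule(
        LinkExtractor(allow=r'funerarias/p-\d+'),
        follow=True, callback='parseador'
    ),
)

process = CrawlerProcess({
    'FEEDS': {'datos_scrapeados.csv': {'format': 'csv'}},
})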