Python Scrapy: what does a 999 error mean? (LinkedIn company data scraping)


I am trying to scrape company data on LinkedIn. To reach a company page, you must be logged in.

I am using the following Scrapy spider:

import scrapy
from scrapy import Spider
from scrapy.http import FormRequest
import time


class BasicLoginSpider(Spider):
    name = 'basic_login_spider'

    def start_requests(self):
        login_url = 'https://www.linkedin.com/'
        yield scrapy.Request(login_url, callback=self.login)

    def login(self, response):
        # token = response.css("form input[name=csrf_token]::attr(value)").extract_first()
        yield FormRequest.from_response(
            response,
            formdata={'password': 'password',
                      'username': 'username'},
            callback=self.start_scraping)


    def start_scraping(self, response):
        time.sleep(10)

        self.company_pages = [
            'https://www.linkedin.com/company/nymbus?trk=public_jobs_topcard-org-name',
            'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle'
        ]

        company_index_tracker = 0
        first_url = self.company_pages[company_index_tracker]

        yield scrapy.Request(url=first_url, callback=self.parse_response,
                             meta={'company_index_tracker': company_index_tracker})

    def parse_response(self, response):
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Connection': 'keep-alive',
            'accept-encoding': 'gzip, deflate, br',
            'Referer': 'http://www.linkedin.com/',
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        }

        company_index_tracker = response.meta['company_index_tracker']
        print('***************')
        print('****** Scraping page ' + str(company_index_tracker + 1) + ' of ' + str(len(self.company_pages)))
        print('***************')

        company_item = {}
        company_item['name'] = response.css('.top-card-layout__entity-info h1::text').get(default='not-found').strip()
        company_item['summary'] = response.css('.top-card-layout__entity-info h4 span::text').get(default='not-found').strip()

        try:
            # all company details
            company_details = response.css('.core-section-container__content .mb-2')

            # industry line
            company_industry_line = company_details[1].css('.text-md::text').getall()
            company_item['industry'] = company_industry_line[1].strip()

            # company size line
            company_size_line = company_details[2].css('.text-md::text').getall()
            company_item['size'] = company_size_line[1].strip()

            # company founded line
            company_founded_line = company_details[5].css('.text-md::text').getall()
            company_item['founded'] = company_founded_line[1].strip()
        except IndexError:
            print("Error: Skipped Company - Some details missing")

        yield company_item

        company_index_tracker += 1
        if company_index_tracker <= (len(self.company_pages) - 1):
            next_url = self.company_pages[company_index_tracker]
            yield scrapy.Request(url=next_url, headers=headers, callback=self.parse_response,
                                 meta={'company_index_tracker': company_index_tracker})

Looking at the log, the login seems to succeed, but the very next request returns a 999 error and the spider stops:

scrapy crawl linkedin_company_profile
2023-09-03 16:30:11 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: basic_scrapy_spider)
2023-09-03 16:30:11 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.5 (main, Aug 24 2023, 15:23:14) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform macOS-11.4-x86_64-i386-64bit
2023-09-03 16:30:11 [scrapy.addons] INFO: Enabled addons:
[]
2023-09-03 16:30:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'basic_scrapy_spider',
 'NEWSPIDER_MODULE': 'basic_scrapy_spider.spiders',
 'SPIDER_MODULES': ['basic_scrapy_spider.spiders']}
2023-09-03 16:30:11 [py.warnings] WARNING: /Users/rob/basic-scrapy-project/venv/lib/python3.11/site-packages/scrapy/utils/request.py:248: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-09-03 16:30:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-09-03 16:30:11 [scrapy.extensions.telnet] INFO: Telnet Password: 1d83964a12a65b87
2023-09-03 16:30:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-09-03 16:30:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-09-03 16:30:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-09-03 16:30:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-09-03 16:30:12 [scrapy.core.engine] INFO: Spider opened
2023-09-03 16:30:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-03 16:30:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-03 16:30:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.linkedin.com/> (referer: None)
2023-09-03 16:30:12 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.linkedin.com/uas/login-submit> (referer: https://www.linkedin.com/)
2023-09-03 16:30:23 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/company/nymbus?trk=public_jobs_topcard-org-name> (referer: https://www.linkedin.com/uas/login-submit)
2023-09-03 16:30:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/company/nymbus?trk=public_jobs_topcard-org-name>: HTTP status code is not handled or not allowed
2023-09-03 16:30:23 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-03 16:30:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1810,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 35697,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/999': 1,
 'elapsed_time_seconds': 11.11855,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 9, 3, 14, 30, 23, 131831),
 'httpcompression/response_bytes': 159141,
 'httpcompression/response_count': 2,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/999': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 58908672,
 'memusage/startup': 58908672,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2023, 9, 3, 14, 30, 12, 13281)}
2023-09-03 16:30:23 [scrapy.core.engine] INFO: Spider closed (finished)

Does anybody know what I am missing? Or am I stretching this tool too far?

1 Answer

Answer from Сергей Мельник:
  1. The login request needs to go to https://www.linkedin.com/uas/login-submit.
  2. It is a POST request, so pass `method='POST'` (or use `FormRequest`, which POSTs by default).
  3. You need to pass more form parameters for the login to be accepted, including: loginCsrfParam, session_key, session_password, trk, controlId, pageInstance, apfc.
  4. Inspect the login request in your browser's developer tools (Network tab) to see exactly what is sent.
  5. A 999 status means LinkedIn could not recognize your request as a legitimate one and blocked it, which matches the malformed login request in your code.