        How to get text and href value in anchor tag with scrapy, xpath, python

        Asked by Claire Duong on 12 June 2020 at 08:02 (827 views)

        I have an HTML file like this:

        <div class="jokes-nav">
          <ul>
            <li><a href="http://link_1">Link 1</a></li>
            <li><a href="http://link_2">Link 2</a></li>
          </ul>
        </div>
        

        In the folder spiders, I have a file jokes.py like this:

        import scrapy
        from demo_project.items import JokeItem
        from scrapy.loader import ItemLoader
        
        class JokesSpider(scrapy.Spider):
            name = 'jokes'
        
            start_urls = [
                'http://www.laughfactory.com/jokes/'
            ]
        
            def parse(self, response):
                for joke in response.xpath("//div[@class='jokes-nav']/ul"):
                    # Build the item from the <ul> selector of the navigation block.
                    l = ItemLoader(item=JokeItem(), selector=joke)
                    # Collect the text of every anchor in the list.
                    l.add_xpath('joke_title', ".//li/a/text()")
        
                    # yield {
                    #     'joke_text': joke.xpath(".//div[@class='joke-text']/p").extract_first()
                    # }
        
                    yield l.load_item()
        

        I run the JokesSpider class from my main.py (this file is at the project root) with this code:

        from scrapy.crawler import CrawlerProcess
        from demo_project.spiders.jokes import JokesSpider
        
        process = CrawlerProcess(settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        })
        
        process.crawl(JokesSpider)
        process.start() # the script will block here until the crawling is finished
        

        I want to write the data to items.json, but when I run this code, items.json is empty. How can I solve this problem? Thank you very much.
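
On the title question itself (getting both the text and the href of each anchor), here is a minimal sketch. It is not the spider from the question: it uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors so that it runs without a Scrapy project, and it assumes the div's attribute is spelled class. The function name extract_links is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, wrapped in <body> so the <div> can be matched
# by its class attribute (assumption: the attribute is spelled "class").
HTML = """
<body>
  <div class="jokes-nav">
    <ul>
      <li><a href="http://link_1">Link 1</a></li>
      <li><a href="http://link_2">Link 2</a></li>
    </ul>
  </div>
</body>
"""


def extract_links(markup):
    """Return one {'text', 'href'} dict per <a> inside the jokes-nav list.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    anchors = root.findall(".//div[@class='jokes-nav']/ul/li/a")
    return [
        {"text": (a.text or "").strip(), "href": a.get("href")}
        for a in anchors
    ]


links = extract_links(HTML)
for link in links:
    print(link["text"], link["href"])
```

Inside the spider, the analogous step would presumably be to loop over joke.xpath(".//li/a") and read a.xpath("./text()").extract_first() and a.xpath("./@href").extract_first() for each anchor, so that every text stays paired with its own href.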

        python web-scraping scrapy web-mining

        There is 1 answer

        Answered by Patrick Klein on 13 June 2020 at 07:17 (best answer)

        You can set the FEED_FORMAT and FEED_URI settings to save the data to a JSON file.

        process = CrawlerProcess(settings={
            'FEED_FORMAT': 'json',
            'FEED_URI': 'items.json'
        })
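
One possible reason the FEEDS setting from the question wrote nothing is the installed Scrapy version: as far as I know, FEEDS was introduced in Scrapy 2.1, while older releases only read FEED_URI and FEED_FORMAT (which later versions deprecate). The sketch below picks the settings from a version string. The helper name feed_settings is hypothetical and not part of Scrapy, and the version cutoff is an assumption worth checking against the release notes.

```python
def feed_settings(scrapy_version, path="items.json", fmt="json"):
    """Build feed-export settings suited to the given Scrapy version string.

    Hypothetical helper (not a Scrapy API). Assumes FEEDS is available from
    Scrapy 2.1 onward and FEED_URI / FEED_FORMAT are needed before that.
    """
    # Compare only the major and minor components, e.g. "2.11.2" -> (2, 11).
    major, minor = (int(part) for part in scrapy_version.split(".")[:2])
    if (major, minor) >= (2, 1):
        return {"FEEDS": {path: {"format": fmt}}}
    return {"FEED_URI": path, "FEED_FORMAT": fmt}


print(feed_settings("2.1.0"))
print(feed_settings("1.8.0"))
```

In main.py this could be passed as CrawlerProcess(settings=feed_settings(scrapy.__version__)) after importing scrapy, so the same script works on either side of the cutoff.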
        
