I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have a few hundred domains that I want to fetch, and I have fetched many of them so far. I am curious when Nutch will revisit an already fetched document and refetch it if it has been updated. Is there a control parameter or something similar for this?
nutch time schedule to visit a page again
Asked by Hafiz Muhammad Shafiq
There is 1 answer
Nutch has several settings that control when a page is fetched again (see https://github.com/apache/nutch/blob/release-2.3.1/conf/nutch-default.xml).
db.fetch.interval.default is the fetch interval assigned to a page when it is fetched for the first time. Keep in mind that the default schedule implementation (db.fetch.schedule.class, https://github.com/apache/nutch/blob/release-2.3.1/conf/nutch-default.xml#L396) always adds this fixed interval to the last fetch time, regardless of whether the page has changed, so it is not ideal. I would recommend switching to the adaptive fetch schedule, which tries to optimize the next fetch time based on how often the page is updated (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java).
You can also specify a per-URL fetch interval at inject time using the nutch.fetchInterval metadata key in the seed file (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/InjectorJob.java#L59).
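As a rough sketch of how these settings might look in practice, the fragment below overrides them in conf/nutch-site.xml. The property names are the ones I recall from the linked nutch-default.xml, so please verify them there. The interval values (in seconds) are illustrative assumptions, not recommendations.

```xml
<!-- conf/nutch-site.xml: overrides for nutch-default.xml.
     All numeric values below are illustrative assumptions. -->
<configuration>
  <!-- Switch from the default fixed-interval schedule to the adaptive one. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Interval assigned on first fetch: 7 days, in seconds. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>
  <!-- Lower bound the adaptive schedule may shrink the interval to: 1 day. -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>
  </property>
  <!-- Upper bound the adaptive schedule may grow the interval to: 30 days. -->
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>
  </property>
</configuration>
```

For the per-URL override, the injector reads tab-separated name=value metadata after the URL, so a seed line would look something like http://example.com/ followed by a tab and nutch.fetchInterval=86400 (example.com and the value are placeholders).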