In our application, Heritrix is used as the crawl engine. Once a crawl job has finished, we manually call an endpoint to download the PDFs from the website. We would like to automate this PDF download so that it starts as soon as the crawl job is complete. Does Heritrix provide a URI or web service method that returns the status of a job, or do we need to write a polling app that continuously monitors the job status?
How do we know when Heritrix completes a crawl job?
Asked by bking007
I don't know of a way to do this without continuous monitoring, but you can use the Heritrix API to get the status of a job. A request for the job resource gives you XML from which you can read the job status.
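A minimal polling sketch of this idea in Python, using only the standard library. Several things here are assumptions rather than anything stated above: that the job XML is served at a URL such as `https://localhost:8443/engine/job/<jobname>`, that Heritrix is using digest authentication with a self-signed certificate, and that the state is reported in a `crawlControllerState` element whose value becomes `FINISHED`. Check the XML your own instance returns and adjust the tag name and value if they differ. The function names are made up for illustration.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET


def read_job_state(xml_text, tag="crawlControllerState"):
    """Return the text of the first <tag> element in the job XML, or None.

    The default tag name is an assumption about the Heritrix job XML.
    """
    root = ET.fromstring(xml_text)
    node = root.find(f".//{tag}")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def fetch_job_xml(job_url, user, password):
    """GET the job resource as XML (assumes digest auth, self-signed cert)."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, job_url, user, password)

    # Certificate verification is switched off because a self-signed
    # certificate is assumed; use a proper CA bundle if you have one.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPDigestAuthHandler(password_mgr),
    )
    request = urllib.request.Request(
        job_url, headers={"Accept": "application/xml"}
    )
    with opener.open(request, timeout=30) as response:
        return response.read().decode("utf-8")


def wait_until_finished(job_url, user, password, interval=60,
                        finished_state="FINISHED", fetch=fetch_job_xml):
    """Poll the job resource until its state equals finished_state."""
    while True:
        state = read_job_state(fetch(job_url, user, password))
        if state == finished_state:
            return state
        time.sleep(interval)
```

Once `wait_until_finished(...)` returns, the script can call the PDF download endpoint. The `fetch` parameter exists only so the polling loop can be exercised without a running Heritrix instance.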
Another, maybe easier (though less 'professional') option is to check whether your job's warcs directory contains a file with the .open extension. If it does not, the job is finished.
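The file check could look roughly like the sketch below. The directory path is a placeholder you would replace with your job's actual warcs directory, and the function name is made up for illustration. One caveat worth keeping in mind: a job that has not started writing yet would also show no .open files, so this check is probably only meaningful after the crawl has been launched.

```python
from pathlib import Path


def has_open_warcs(warcs_dir):
    """Return True if any file ending in .open exists under warcs_dir."""
    return any(Path(warcs_dir).rglob("*.open"))


def crawl_looks_finished(warcs_dir):
    """Treat the crawl as finished when no .open files remain.

    Assumes the crawl was already started; an untouched directory
    would also pass this check.
    """
    return not has_open_warcs(warcs_dir)
```

For example, `crawl_looks_finished("/path/to/jobs/myjob/warcs")` (a hypothetical path) can be called in a loop with a sleep between attempts, and the download step triggered once it returns `True`.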