I built a job to crawl web data with Heritrix 3.0. To run it, I have to start Heritrix.java as a Java application so that the server comes up, then open a browser at https://localhost:8443 to build the job, launch it, and finally unpause it. How can I set up a cron job so that the crawl runs automatically? Please use Java.
How to write a cron job for Heritrix3 web crawling?
Asked by 莫绮静
I have this automated for my FYP. You can use Java, but the Heritrix documentation gives its calls as cURL commands, so the best, easiest, and fastest approach is to use shell scripts that invoke cURL to get the task done.

Get Current Status of Engine:
Create new job for crawling in the Engine:
Build the Job:
Launch the Job:
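
The commands for these steps did not survive in the post, so here is a minimal sketch of what such a script might look like. It follows the cURL pattern from the Heritrix 3 REST API documentation as I remember it, so check it against your version. The credentials `admin:admin`, the job name `myjob`, the delay between steps, and all function names are assumptions for illustration. It also assumes the job's crawler-beans.cxml is already configured with seeds and operator contact URL, otherwise the build or launch will not succeed.

```shell
#!/bin/sh
# Sketch only. Host, port, credentials and job name are assumptions;
# change them to match your Heritrix installation.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"  # hypothetical user:password
JOB_NAME="${JOB_NAME:-myjob}"                  # hypothetical job name
STEP_DELAY="${STEP_DELAY:-10}"                 # seconds to wait between steps
CURL="${CURL:-curl}"                           # set CURL=echo for a dry run

# Send one request to Heritrix.
# -k accepts the self-signed certificate; --anyauth lets curl negotiate
# whichever authentication scheme the server asks for.
# $1 = URL, any remaining arguments are passed through to curl.
heritrix_call() {
    url="$1"
    shift
    "$CURL" -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -H "Accept: application/xml" "$@" "$url"
}

# Get current status of the engine (plain GET).
engine_status() { heritrix_call "$HERITRIX_URL"; }

# Create a new job directory in the engine.
create_job() { heritrix_call "$HERITRIX_URL" -d "createpath=$JOB_NAME&action=create"; }

# Build, launch and unpause the job (POSTs to the job resource).
build_job()   { heritrix_call "$HERITRIX_URL/job/$JOB_NAME" -d "action=build"; }
launch_job()  { heritrix_call "$HERITRIX_URL/job/$JOB_NAME" -d "action=launch"; }
unpause_job() { heritrix_call "$HERITRIX_URL/job/$JOB_NAME" -d "action=unpause"; }

# Run all steps in order. The first check only fails if curl cannot reach
# the server; HTTP-level errors are not detected by this sketch.
run_crawl() {
    engine_status || return 1
    create_job    # can be dropped if the job already exists
    sleep "$STEP_DELAY"
    build_job
    sleep "$STEP_DELAY"
    launch_job
    sleep "$STEP_DELAY"
    unpause_job
}

# Only start the crawl when invoked as: script.sh run
if [ "${1:-}" = "run" ]; then
    run_crawl
fi
```

To schedule it, you could add a crontab entry such as `0 2 * * * /path/to/heritrix-crawl.sh run >> /tmp/heritrix-crawl.log 2>&1` (the path, log file, and schedule are placeholders). The Heritrix server itself still has to be running before cron fires, for example started at boot. The fixed sleeps are a simplification; polling the job status until it reports that it is ready would be more robust.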