[Heritrix Basics Tutorial, Part 4] Code Walkthrough: The Full Flow of Starting a Crawl
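After creating a job, the next step is to run it. The full flow is as follows:
1. Start the job from the web UI
2. index.jsp
The source of that page contains the Start link: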
<a href='"+request.getContextPath()+"/console/action.jsp?action=start'>Start</a>
3. action.jsp
4. CrawlJobHandler.java
(1)
public void startCrawler() {
    running = true;
    if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
        // Ok, can just start the next job
        startNextJob();
    }
}
(2)
protected final void startNextJob() {
    synchronized (this) {
        if (startingNextJob != null) {
            try {
                startingNextJob.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
                return;
            }
        }
        startingNextJob = new Thread(new Runnable() {
            public void run() {
                startNextJobInternal();
            }
        }, "StartNextJob");
        startingNextJob.start();
    }
}
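The join-then-start idiom in (2) can be tried out in isolation. The following is a minimal sketch, not Heritrix code: the class `SerializedStarter`, its `log` list, and the `awaitIdle()` helper are hypothetical names introduced only for illustration. It shows that waiting on the previous starter thread before creating a new one keeps the background start-ups strictly in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the job handler; not part of Heritrix. */
class SerializedStarter {
    // The most recently created starter thread, or null if none yet.
    private Thread starting;
    // Records the order in which background start-ups actually ran.
    final List<String> log = Collections.synchronizedList(new ArrayList<String>());

    /** Runs the start-up for `name` on a background thread, after any earlier one has finished. */
    synchronized void startNext(final String name) {
        if (starting != null) {
            try {
                starting.join(); // wait for the previous starter to complete
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return;
            }
        }
        starting = new Thread(() -> log.add(name), "StartNextJob");
        starting.start();
    }

    /** Blocks until the most recent starter thread is done. */
    synchronized void awaitIdle() {
        if (starting != null) {
            try {
                starting.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Calling `startNext` three times in a row and then `awaitIdle()` leaves the names in `log` in call order, because each call joins the previous thread before starting its own.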
(3)
protected void startNextJobInternal() {
    if (pendingCrawlJobs.size() == 0 || isCrawling()) {
        // No job ready or already crawling.
        return;
    }
    this.currentJob = (CrawlJob)pendingCrawlJobs.first();
    assert pendingCrawlJobs.contains(currentJob) :
        "pendingCrawlJobs is in an illegal state";
    pendingCrawlJobs.remove(currentJob);
    try {
        this.currentJob.setupForCrawlStart();
        // This is ugly but needed so I can clear the currentJob
        // reference in the crawlEnding and update the list of completed
        // jobs. Also, crawlEnded can startup next job.
        this.currentJob.getController().addCrawlStatusListener(this);
        // now, actually start
        this.currentJob.getController().requestCrawlStart();
    } catch (InitializationException e) {
        loadJob(getStateJobFile(this.currentJob.getDirectory()));
        this.currentJob = null;
        startNextJobInternal(); // Load the next job if there is one.
    }
}
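Step (3) takes the first pending job, removes it from the queue, and falls through to the next job if initialization fails. Below is a minimal sketch of that fallback, under stated assumptions: `PendingQueueDemo`, `setup`, and the "bad" naming rule are hypothetical, jobs are plain strings, and a loop replaces the recursive call used in the original method.

```java
import java.util.TreeSet;

/** Hypothetical illustration of "take first pending job, skip it if setup fails". */
class PendingQueueDemo {
    /** Assumed rule for the demo only: names starting with "bad" fail to initialize. */
    static void setup(String job) {
        if (job.startsWith("bad")) {
            throw new IllegalStateException("cannot initialize " + job);
        }
    }

    /** Returns the first job that initializes successfully, or null if none does. */
    static String startNext(TreeSet<String> pending) {
        while (!pending.isEmpty()) {
            String job = pending.first(); // lowest-ordered pending job
            pending.remove(job);          // it is no longer pending either way
            try {
                setup(job);
                return job;               // this job becomes the current job
            } catch (IllegalStateException e) {
                // Setup failed: drop this job and try the next one.
            }
        }
        return null;
    }
}
```

The sorted set mirrors the `first()` call on `pendingCrawlJobs`; what Heritrix does with a job that failed to initialize (reloading its state file) is not modeled here.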
(4) (invoked on the controller returned by getController() in step (3))
public void requestCrawlStart() {
    runProcessorInitialTasks();
    sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
    String jobState;
    state = RUNNING;
    jobState = CrawlJob.STATUS_RUNNING;
    sendCrawlStateChangeEvent(this.state, jobState);
    // A proper exit will change this value.
    this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
    Thread statLogger = new Thread(statistics);
    statLogger.setName("StatLogger");
    statLogger.start();
    frontier.start();
}
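One detail in (4) is worth isolating: the exit status is set to "abnormal" before the crawl starts, and only a proper exit overwrites it. Here is a minimal sketch of that pessimistic-default idiom. The class `ExitStatusDemo` and the two status strings are hypothetical placeholders, not the actual `CrawlJob` constants, and the work runs synchronously here, whereas the real crawl runs on other threads.

```java
/** Hypothetical illustration of a pessimistic default exit status. */
class ExitStatusDemo {
    // Placeholder status strings; the real CrawlJob constants may differ.
    static final String FINISHED = "finished";
    static final String FINISHED_ABNORMAL = "finished-abnormal";

    /** Runs the work and reports how it ended. */
    static String run(Runnable work) {
        String exit = FINISHED_ABNORMAL; // assume the worst until proven otherwise
        try {
            work.run();
            exit = FINISHED;             // only a clean completion changes the value
        } catch (RuntimeException e) {
            // Leave exit as FINISHED_ABNORMAL.
        }
        return exit;
    }
}
```

The benefit of this ordering is that any path that never reaches the clean-exit assignment is reported as abnormal without extra bookkeeping.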