
[Heritrix Basics Tutorial, Part 4] Code Analysis of the Full Flow for Starting a Crawl

Published: 2024/1/23

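After a job has been created, the next step is to run it. The full flow for starting a job is as follows:

1. Start the job in the web UI

2. index.jsp

The source behind that page contains the Start link: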

    <a href='"+request.getContextPath()+"/console/action.jsp?action=start'>Start</a>

3. action.jsp

    String sAction = request.getParameter("action");
    if (sAction != null) {
        // Need to handle an action
        if (sAction.equalsIgnoreCase("start")) {
            // Tell handler to start crawl job
            handler.startCrawler();
        } else if (sAction.equalsIgnoreCase("stop")) {
            // Tell handler to stop crawl job
            handler.stopCrawler();
        } else if (sAction.equalsIgnoreCase("terminate")) {
            // Delete current job
            if (handler.getCurrentJob() != null) {
                handler.deleteJob(handler.getCurrentJob().getUID());
            }
        } else if (sAction.equalsIgnoreCase("pause")) {
            // Tell handler to pause crawl job
            handler.pauseJob();
        } else if (sAction.equalsIgnoreCase("resume")) {
            // Tell handler to resume crawl job
            handler.resumeJob();
        } else if (sAction.equalsIgnoreCase("checkpoint")) {
            if (handler.getCurrentJob() != null) {
                handler.checkpointJob();
            }
        }
    }
    response.sendRedirect(request.getContextPath() + "/index.jsp");
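
The dispatch logic can be tried outside a servlet container with a small stand-in. The following is a minimal sketch, not Heritrix code: `ActionDispatchSketch` and `RecordingHandler` are hypothetical names, and the handler only records which method was called. It illustrates that the match is case-insensitive and that a null or unknown action is silently ignored.

```java
import java.util.ArrayList;
import java.util.List;

class ActionDispatchSketch {
    /** Hypothetical stand-in for the real handler; it only records calls. */
    static final class RecordingHandler {
        final List<String> calls = new ArrayList<>();
        void startCrawler() { calls.add("startCrawler"); }
        void stopCrawler()  { calls.add("stopCrawler"); }
        void pauseJob()     { calls.add("pauseJob"); }
        void resumeJob()    { calls.add("resumeJob"); }
    }

    /** Same if/else-if shape as the JSP: null or unknown actions do nothing. */
    static void dispatch(String action, RecordingHandler handler) {
        if (action == null) {
            return;
        }
        if (action.equalsIgnoreCase("start")) {
            handler.startCrawler();
        } else if (action.equalsIgnoreCase("stop")) {
            handler.stopCrawler();
        } else if (action.equalsIgnoreCase("pause")) {
            handler.pauseJob();
        } else if (action.equalsIgnoreCase("resume")) {
            handler.resumeJob();
        }
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        dispatch("START", handler); // matched despite the different case
        dispatch("bogus", handler); // unknown action, ignored
        dispatch(null, handler);    // missing parameter, ignored
        System.out.println(handler.calls); // prints [startCrawler]
    }
}
```

Whatever the action was, the real page then redirects back to index.jsp, so the user always lands on the console again.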

4. CrawlJobHandler.java

(1) startCrawler()

    public void startCrawler() {
        running = true;
        if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
            // Ok, can just start the next job
            startNextJob();
        }
    }

(2) startNextJob()

    protected final void startNextJob() {
        synchronized (this) {
            if (startingNextJob != null) {
                try {
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    return;
                }
            }
            startingNextJob = new Thread(new Runnable() {
                public void run() {
                    startNextJobInternal();
                }
            }, "StartNextJob");
            startingNextJob.start();
        }
    }
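
The pattern in startNextJob() is to wait for any previous starter thread to finish before launching a new one, so two start requests never overlap. Below is a minimal sketch of that idea under stated assumptions: `SerializedStarterSketch`, `startNext`, and `awaitLast` are hypothetical names, the thread body just appends to a list, and, unlike the original, the sketch restores the interrupt flag instead of printing the stack trace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SerializedStarterSketch {
    private Thread starting; // most recent starter thread, if any
    final List<String> log = Collections.synchronizedList(new ArrayList<String>());

    /** Waits for the previous starter thread, then launches a new one. */
    synchronized void startNext(final String jobName) {
        if (starting != null) {
            try {
                starting.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and give up
                return;
            }
        }
        starting = new Thread(new Runnable() {
            public void run() {
                log.add(jobName); // stands in for the real start-up work
            }
        }, "StartNextJob");
        starting.start();
    }

    /** Demo helper: blocks until the last starter thread has finished. */
    synchronized void awaitLast() {
        if (starting != null) {
            try {
                starting.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        SerializedStarterSketch s = new SerializedStarterSketch();
        s.startNext("job-1");
        s.startNext("job-2"); // joins the job-1 thread before starting
        s.awaitLast();
        System.out.println(s.log); // prints [job-1, job-2]
    }
}
```

Because each call joins the previous thread while holding the lock, the start-up work runs strictly one job at a time, while the caller (here the web request) returns as soon as the new thread is started.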

(3) startNextJobInternal()

    protected void startNextJobInternal() {
        if (pendingCrawlJobs.size() == 0 || isCrawling()) {
            // No job ready or already crawling.
            return;
        }
        this.currentJob = (CrawlJob)pendingCrawlJobs.first();
        assert pendingCrawlJobs.contains(currentJob) :
            "pendingCrawlJobs is in an illegal state";
        pendingCrawlJobs.remove(currentJob);
        try {
            this.currentJob.setupForCrawlStart();
            // This is ugly but needed so I can clear the currentJob
            // reference in the crawlEnding and update the list of completed
            // jobs. Also, crawlEnded can startup next job.
            this.currentJob.getController().addCrawlStatusListener(this);
            // now, actually start
            this.currentJob.getController().requestCrawlStart();
        } catch (InitializationException e) {
            loadJob(getStateJobFile(this.currentJob.getDirectory()));
            this.currentJob = null;
            startNextJobInternal(); // Load the next job if there is one.
        }
    }
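
The control flow of startNextJobInternal() is: take the first pending job, try to set it up, and if setup fails, clear the current job and recurse to try the next one. The sketch below illustrates only that flow. `NextJobSketch` and `Job` are hypothetical, jobs are ordered by name purely for the demo, and an `IllegalStateException` stands in for the `InitializationException` of the original.

```java
import java.util.TreeSet;

class NextJobSketch {
    /** Hypothetical job; a `broken` job fails during setup. */
    static final class Job implements Comparable<Job> {
        final String name;
        final boolean broken;

        Job(String name, boolean broken) {
            this.name = name;
            this.broken = broken;
        }

        void setup() {
            if (broken) {
                throw new IllegalStateException("cannot initialize " + name);
            }
        }

        @Override
        public int compareTo(Job other) {
            return name.compareTo(other.name); // demo ordering only
        }
    }

    final TreeSet<Job> pending = new TreeSet<Job>();
    Job current; // null while nothing is crawling

    /** Starts the first pending job that initializes; skips jobs that fail. */
    void startNextInternal() {
        if (pending.isEmpty() || current != null) {
            return; // no job ready, or already crawling
        }
        current = pending.pollFirst(); // take and remove the first job
        try {
            current.setup();
        } catch (IllegalStateException e) {
            current = null;
            startNextInternal(); // try the next job, if there is one
        }
    }

    public static void main(String[] args) {
        NextJobSketch handler = new NextJobSketch();
        handler.pending.add(new Job("a-broken", true));
        handler.pending.add(new Job("b-ok", false));
        handler.startNextInternal();
        System.out.println(handler.current.name); // prints b-ok
    }
}
```

The recursion ends either when a job initializes successfully or when the pending set is empty, in which case no job is left running.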

(4) requestCrawlStart(), called on the controller returned by getController()

    public void requestCrawlStart() {
        runProcessorInitialTasks();
        sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
        String jobState;
        state = RUNNING;
        jobState = CrawlJob.STATUS_RUNNING;
        sendCrawlStateChangeEvent(this.state, jobState);
        // A proper exit will change this value.
        this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
        Thread statLogger = new Thread(statistics);
        statLogger.setName("StatLogger");
        statLogger.start();
        frontier.start();
    }
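
Two ideas in requestCrawlStart() are worth isolating: listeners are notified of each state change in order, and the exit status is set pessimistically to "abnormal" up front so that only a clean finish overwrites it. The following is a minimal sketch of both, assuming hypothetical names throughout (`CrawlStartSketch`, `StatusListener`, `finishCleanly`); the status strings are placeholders, not the actual Heritrix constant values.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlStartSketch {
    /** Hypothetical listener, loosely modeled on a crawl status listener. */
    interface StatusListener {
        void stateChanged(String state, String jobStatus);
    }

    private final List<StatusListener> listeners = new ArrayList<StatusListener>();
    String exitStatus; // placeholder strings, not Heritrix constants

    void addListener(StatusListener listener) {
        listeners.add(listener);
    }

    private void fire(String state, String jobStatus) {
        for (StatusListener listener : listeners) {
            listener.stateChanged(state, jobStatus);
        }
    }

    /** Announces STARTED, then RUNNING, and assumes the worst about the exit. */
    void requestStart() {
        fire("STARTED", "pending");
        fire("RUNNING", "running");
        exitStatus = "abnormal"; // a proper exit will change this value
    }

    /** A clean finish is the only path that overwrites the pessimistic default. */
    void finishCleanly() {
        exitStatus = "finished";
        fire("FINISHED", exitStatus);
    }

    public static void main(String[] args) {
        CrawlStartSketch crawl = new CrawlStartSketch();
        crawl.addListener(new StatusListener() {
            public void stateChanged(String state, String jobStatus) {
                System.out.println(state + " / " + jobStatus);
            }
        });
        crawl.requestStart();
        System.out.println("exit status so far: " + crawl.exitStatus);
        crawl.finishCleanly();
    }
}
```

If the process dies between requestStart() and finishCleanly(), the recorded status stays "abnormal", which is the point of setting it before any work begins.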
