
[Search Engine Jediael Development 4] V0.01 Complete Code

Published: 2024/1/23


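So far, the following features have been completed:

1. Given a URL, download the page to a local file with HttpClient.

2. Parse the page downloaded in step 1 with HtmlParser and extract the links it contains.

3. Download the pages that the links from step 2 point to and save them to local files.

Features to implement next:

1. Create a configuration file for the seed URLs, along with its data structure.

2. Create a data structure for the Todo information (URLs not yet downloaded).

3. Create a data structure for the Visited information (URLs already downloaded).

4. Update Todo and Visited as pages are downloaded.

5. Extract links from the pages downloaded in step 3 above and keep downloading until the Todo list is empty.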
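
The planned Todo and Visited bookkeeping could look roughly like the sketch below. It is not part of V0.01: the class name CrawlFrontier, its methods, and the fetchLinks parameter (a stand-in for "download the page and extract its links") are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the planned Todo/Visited bookkeeping; not part of V0.01.
class CrawlFrontier {
    // URLs discovered but not yet downloaded, in first-in first-out order.
    private final Deque<String> todo = new ArrayDeque<>();
    // URLs already downloaded, in the order they were visited.
    private final Set<String> visited = new LinkedHashSet<>();

    // Queue a URL unless it was already visited or is already waiting.
    void add(String url) {
        if (!visited.contains(url) && !todo.contains(url)) {
            todo.addLast(url);
        }
    }

    // Process URLs until the Todo list is empty. fetchLinks is an assumed
    // callback that downloads one page and returns the links found in it.
    Set<String> crawl(List<String> seeds, Function<String, Set<String>> fetchLinks) {
        seeds.forEach(this::add);
        while (!todo.isEmpty()) {
            String url = todo.removeFirst();
            visited.add(url);
            for (String link : fetchLinks.apply(url)) {
                add(link);
            }
        }
        return visited;
    }
}
```

In the crawler itself, fetchLinks could wrap a call to PageDownloader.downloadPageByGetMethod followed by HtmlParserTool.extractLinks. The linear todo.contains check keeps the sketch short; a large crawl would probably track queued URLs in a separate set.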

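The main classes are:

1. Main class MyCrawler

2. Page download class PageDownloader

3. Page content parsing class HtmlParserTool

4. Interface LinkFilter

The complete code is available in the archived code Jediael_v0.01

or
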

https://code.csdn.net/jediael_lu/daopattern/tree/d196da609baa59ef08176322ca61928fbfbdf813

or

http://download.csdn.net/download/jediael_lu/7382011

1. Main class MyCrawler

package org.ljh.search;

import java.util.Set;

import org.ljh.search.downloadpage.PageDownloader;
import org.ljh.search.html.HtmlParserTool;
import org.ljh.search.html.LinkFilter;

public class MyCrawler {

    public static void main(String[] args) {
        String url = "http://www.baidu.com";

        // Only follow links whose URL contains "baidu".
        LinkFilter linkFilter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.contains("baidu");
            }
        };

        try {
            // Download the seed page, then every accepted link extracted from it.
            PageDownloader.downloadPageByGetMethod(url);
            Set<String> urlSet = HtmlParserTool.extractLinks(url, linkFilter);
            for (String link : urlSet) {
                PageDownloader.downloadPageByGetMethod(link);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. Page download class PageDownloader

package org.ljh.search.downloadpage;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Downloads the page at a given URL to a local file.
public class PageDownloader {

    public static void downloadPageByGetMethod(String url) throws IOException {
        // 1. Execute an HttpGet to obtain the response.
        // Note: the URL must include the http:// prefix; otherwise HttpClient
        // throws a "Target host is null" exception.
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes the client and the response on every path.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) {
                return;
            }
            // 2. Get the response entity.
            HttpEntity entity = response.getEntity();
            if (entity == null) {
                return;
            }
            // 3. Get the InputStream and save its content.
            try (InputStream is = entity.getContent()) {
                saveToFile("D:\\tmp\\", getFileName(url), is);
            }
        }
    }

    // Writes the content of the input stream to the file fileName under path.
    // Copying bytes keeps the page's line breaks and encoding intact.
    private static void saveToFile(String path, String fileName, InputStream is) throws IOException {
        Files.copy(is, Paths.get(path, fileName), StandardCopyOption.REPLACE_EXISTING);
    }

    // Replaces characters in the URL that are not allowed in file names with underscores.
    private static String getFileName(String url) {
        // Strip the scheme prefix (http:// or https://).
        url = url.replaceFirst("^https?://", "");
        return url.replaceAll("[\\?:*|<>\"/]", "_") + ".html";
    }
}

3. Page content parsing class HtmlParserTool

package org.ljh.search.html;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Utility for parsing HTML documents.
public class HtmlParserTool {

    // Extracts the links embedded in the HTML document at the given URL.
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            // 1. Create a Parser and set its properties.
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");

            // 2.1 Custom filter that matches <frame> tags; their src attribute is read below.
            NodeFilter frameNodeFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // 2.2 Second filter that matches <a> tags.
            NodeFilter aNodeFilter = new NodeClassFilter(LinkTag.class);
            // 2.3 Combine the two filters above into a single OR filter.
            OrFilter linkFilter = new OrFilter(frameNodeFilter, aNodeFilter);

            // 3. Use the parser to collect all nodes that match the filter.
            NodeList nodeList = parser.extractAllNodesThatMatch(linkFilter);

            // 4. Process each matching node.
            for (int i = 0; i < nodeList.size(); i++) {
                Node node = nodeList.elementAt(i);
                String linkURL = "";
                if (node instanceof LinkTag) {
                    // The node is an <a> tag.
                    LinkTag link = (LinkTag) node;
                    linkURL = link.getLink();
                } else {
                    // The node is a <frame> tag.
                    String nodeText = node.getText();
                    int beginPosition = nodeText.indexOf("src=");
                    nodeText = nodeText.substring(beginPosition);
                    int endPosition = nodeText.indexOf(" ");
                    if (endPosition == -1) {
                        endPosition = nodeText.indexOf(">");
                    }
                    if (endPosition == -1) {
                        // No trailing space or ">": the attribute runs to the end of the tag text.
                        endPosition = nodeText.length();
                    }
                    // Skip src=" (5 characters) and drop the closing quote.
                    linkURL = nodeText.substring(5, endPosition - 1);
                }
                // Keep the URL only if it falls within the scope of this crawl.
                if (filter.accept(linkURL)) {
                    links.add(linkURL);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}

4. Interface LinkFilter

package org.ljh.search.html;

// The filter defined by this interface decides whether a URL
// falls within the scope of the current crawl.
public interface LinkFilter {
    public boolean accept(String url);
}
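
The filter in MyCrawler accepts any URL that contains the string "baidu", which also matches URLs such as http://example.com/?q=baidu. A stricter filter could compare the host instead. The following is a minimal sketch under that assumption: HostLinkFilter is a hypothetical class that is not part of V0.01, and LinkFilter is repeated here only so that the block compiles on its own.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Same shape as the article's LinkFilter, repeated so the sketch is self-contained.
interface LinkFilter {
    boolean accept(String url);
}

// Hypothetical helper: accepts only URLs whose host is the given domain
// or a subdomain of it.
class HostLinkFilter implements LinkFilter {
    private final String domain;

    HostLinkFilter(String domain) {
        this.domain = domain.toLowerCase();
    }

    @Override
    public boolean accept(String url) {
        if (url == null) {
            return false;
        }
        try {
            // getHost() returns null for relative links, mailto:, javascript: and the like.
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase();
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (URISyntaxException e) {
            // Malformed URLs are treated as out of scope.
            return false;
        }
    }
}
```

With this sketch, new HostLinkFilter("baidu.com") could be passed to HtmlParserTool.extractLinks in place of the anonymous filter.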
