[Search Engine Jediael Development Notes 3] Extracting Links from a Web Page with HtmlParser

Published: 2024/1/23

For the basics of HtmlParser, see the HtmlParser basic tutorial.

The example in this article extracts the links from an HTML document.

package org.ljh.search.html;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Utility class for parsing HTML documents
public class HtmlParserTool {

    // Extracts the links embedded in the HTML document at the given URL
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            // 1. Create a Parser and set its properties
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");

            // 2.1 Custom filter that matches <frame> tags; their src attribute is read in step 4
            NodeFilter frameNodeFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // 2.2 Second filter, which matches <a> tags
            NodeFilter aNodeFilter = new NodeClassFilter(LinkTag.class);
            // 2.3 Combine the two filters into a single logical OR filter
            OrFilter linkFilter = new OrFilter(frameNodeFilter, aNodeFilter);

            // 3. Use the parser to collect every node the filter accepts
            NodeList nodeList = parser.extractAllNodesThatMatch(linkFilter);

            // 4. Process the matched nodes
            for (int i = 0; i < nodeList.size(); i++) {
                Node node = nodeList.elementAt(i);
                String linkURL = "";
                if (node instanceof LinkTag) {
                    // The node is an <a> tag
                    LinkTag link = (LinkTag) node;
                    linkURL = link.getLink();
                } else {
                    // The node is a <frame> tag
                    String nodeText = node.getText();
                    int beginPosition = nodeText.indexOf("src=");
                    nodeText = nodeText.substring(beginPosition);
                    int endPosition = nodeText.indexOf(" ");
                    if (endPosition == -1) {
                        endPosition = nodeText.indexOf(">");
                    }
                    if (endPosition == -1) {
                        // src is the last attribute, so the value runs to the end of the text
                        endPosition = nodeText.length();
                    }
                    // Skip src=" and drop the closing quote (assumes a quoted value)
                    linkURL = nodeText.substring(5, endPosition - 1);
                }
                // Keep the URL only if it falls within the scope of this search
                if (filter.accept(linkURL)) {
                    links.add(linkURL);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}

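Some notes on the program:

(1) Node#getText() returns the node's text as a String. For a tag, this is the text between the angle brackets, which is why the frame filter tests for the prefix frame src=.

(2) node instanceof LinkTag identifies an <a/> node. There are many similar classes, such as TableTag; nearly every common HTML tag has a corresponding tag class. The official documentation describes the packages as follows: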

org.htmlparser.nodes	The nodes package has the concrete node implementations.
org.htmlparser.tags	The tags package contains specific tags.

This makes it possible to check directly whether a node is a particular kind of tag.
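
The frame branch of extractLinks takes the string returned by Node#getText() and cuts the src value out by position, which assumes the value is double-quoted. As a minimal sketch of a more tolerant alternative, the helper below uses a regular expression that also accepts single-quoted and unquoted values. FrameSrcExtractor and extractSrc are hypothetical names that belong neither to HtmlParser nor to the original project, and the input is assumed to be tag text without the angle brackets.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HtmlParser or the original project.
final class FrameSrcExtractor {

    // Matches src= followed by a double-quoted, single-quoted, or unquoted value.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    private FrameSrcExtractor() {
    }

    // tagText is assumed to be tag text without angle brackets,
    // for example: frame src="menu.html" name="menu"
    static Optional<String> extractSrc(String tagText) {
        Matcher matcher = SRC.matcher(tagText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Exactly one of the three capture groups holds the value.
        for (int group = 1; group <= 3; group++) {
            if (matcher.group(group) != null) {
                return Optional.of(matcher.group(group));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("frame src=\"menu.html\" name=\"menu\""));
        System.out.println(extractSrc("frame name=main src='body.html'"));
        System.out.println(extractSrc("frame src=top.html"));
        System.out.println(extractSrc("frame name=empty"));
    }
}
```

Returning an Optional lets the caller skip frames that have no src attribute instead of passing an empty string to the LinkFilter.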

The LinkFilter interface used above is defined as follows:

package org.ljh.search.html;

// Filter that decides whether a URL falls within the scope of the current search.
public interface LinkFilter {
    public boolean accept(String url);
}
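
The test program in this article accepts any URL that contains the substring baidu, which also matches URLs that only mention baidu in a path or query string. As a sketch of a stricter alternative, the filter below compares the host of the URL instead. DomainLinkFilter is a hypothetical class that is not part of the original project, and LinkFilter is repeated here only so that the block compiles on its own.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Copy of the interface from this article so the sketch is self-contained.
interface LinkFilter {
    boolean accept(String url);
}

// Hypothetical filter: accepts URLs whose host is the given domain or one of its subdomains.
final class DomainLinkFilter implements LinkFilter {

    private final String domain;

    DomainLinkFilter(String domain) {
        this.domain = domain.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean accept(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                // Relative or otherwise host-less strings are out of scope.
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (URISyntaxException e) {
            // Strings that are not valid URIs are rejected rather than reported.
            return false;
        }
    }

    public static void main(String[] args) {
        LinkFilter filter = new DomainLinkFilter("baidu.com");
        System.out.println(filter.accept("http://news.baidu.com"));
        System.out.println(filter.accept("http://example.com/?from=baidu.com"));
    }
}
```

Since LinkFilter declares a single method, on Java 8 or later a filter can also be passed as a lambda, for example url -> url.contains("baidu").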

The test program is as follows:

package org.ljh.search.html;

import java.util.Set;

import org.junit.Test;

public class HtmlParserToolTest {

    @Test
    public void testExtractLinks() {
        String url = "http://www.baidu.com";
        // Accept only URLs that contain "baidu"
        LinkFilter linkFilter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.contains("baidu");
            }
        };
        Set<String> urlSet = HtmlParserTool.extractLinks(url, linkFilter);
        // Print every extracted link
        for (String link : urlSet) {
            System.out.println(link);
        }
    }
}

The output is as follows:

http://www.hao123.com
http://www.baidu.com/
http://www.baidu.com/duty/
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
http://music.baidu.com
http://ir.baidu.com
http://www.baidu.com/gaoji/preferences.html
http://news.baidu.com
http://map.baidu.com
http://music.baidu.com/search?fr=ps&key=
http://image.baidu.com
http://zhidao.baidu.com
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=
http://www.baidu.com/more/
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w
http://wenku.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
http://www.baidu.com/cache/sethelp/index.html
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
http://tieba.baidu.com/f?kw=&fr=wwwt
http://home.baidu.com
https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
http://v.baidu.com
http://e.baidu.com/?refer=888
;
http://tieba.baidu.com
http://baike.baidu.com
http://wenku.baidu.com/search?word=&lm=0&od=0
http://top.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
