Parsing HTML in Java: a first try with HTMLParser
To scrape data from a web page, I tried using HtmlParser to build a small crawler.
Below is a small example that scrapes the content of a forum thread.
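1. Introduction to HtmlParser
htmlparser is an HTML parsing library written in pure Java, used mainly to transform or extract HTML. It is a good choice for analyzing fetched web pages; unfortunately, there is very little reference documentation.
Project homepage: http://htmlparser.sourceforge.net/
API documentation: http://htmlparser.sourceforge.net/javadoc/index.html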
2. Create a Maven project
Add the required dependency
pom.xml
2.1 Create a parser
Use a Parser to fetch and analyze a web page.
The parser does not handle asynchronous requests made by the page. After fetching the page, it parses the whole page into a DOM tree stored as nodes/tags of various kinds, and we can then use filters to select the nodes we want.
The node types that htmlparser provides all share one base interface:
org.htmlparser
Interface Node
Every node provides the methods below. Many node types add methods of their own; LinkTag, for example, has methods for getting the link's URL and for checking the URL's protocol type.
| Return type | Method | Description |
| --- | --- | --- |
| void | accept(NodeVisitor visitor) | Apply the visitor to this node. |
| Object | clone() | Allow cloning of nodes. |
| void | collectInto(NodeList list, NodeFilter filter) | Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria. |
| void | doSemanticAction() | Perform the meaning of this tag. |
| NodeList | getChildren() | Get the children of this node. |
| int | getEndPosition() | Gets the ending position of the node. |
| Node | getFirstChild() | Get the first child of this node. |
| Node | getLastChild() | Get the last child of this node. |
| Node | getNextSibling() | Get the next sibling to this node. |
| Page | getPage() | Get the page this node came from. |
| Node | getParent() | Get the parent of this node. |
| Node | getPreviousSibling() | Get the previous sibling to this node. |
| int | getStartPosition() | Gets the starting position of the node. |
| String | getText() | Returns the text of the node. |
| void | setChildren(NodeList children) | Set the children of this node. |
| void | setEndPosition(int position) | Sets the ending position of the node. |
| void | setPage(Page page) | Set the page this node came from. |
| void | setParent(Node node) | Sets the parent of this node. |
| void | setStartPosition(int position) | Sets the starting position of the node. |
| void | setText(String text) | Sets the string contents of the node. |
| String | toHtml() | Return the HTML for this node. |
| String | toHtml(boolean verbatim) | Return the HTML for this node. |
| String | toPlainTextString() | A string representation of the node. |
| String | toString() | Return the string representation of the node. |
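To make the description of collectInto more concrete, here is a minimal stdlib-only sketch of the idea: a depth-first walk that adds every node accepted by a filter to a list. SimpleNode and its fields are hypothetical stand-ins invented for this illustration, not HTMLParser's Node or NodeList, and the sketch assumes Java 8 or later for the lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical miniature node for illustration; not HTMLParser's Node interface. */
final class SimpleNode {
    final String text;
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String text) {
        this.text = text;
    }

    /** Appends a child and returns this node so calls can be chained. */
    SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    /** Depth-first walk: collect this node if the filter accepts it, then visit its children. */
    void collectInto(List<SimpleNode> out, Predicate<SimpleNode> filter) {
        if (filter.test(this)) {
            out.add(this);
        }
        for (SimpleNode child : children) {
            child.collectInto(out, filter);
        }
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("html")
                .add(new SimpleNode("div id=\"a\"").add(new SimpleNode("span")))
                .add(new SimpleNode("div id=\"b\""));

        List<SimpleNode> divs = new ArrayList<>();
        root.collectInto(divs, n -> n.text.startsWith("div"));
        for (SimpleNode n : divs) {
            System.out.println(n.text);
        }
    }
}
```

Matching nodes come out in document order because each parent is tested before its children.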
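Node filters. These filters can select nodes by node type or by the parent/child relationships between nodes, and you can also write your own. Several filters can be combined into a composite filter for multi-condition filtering,
for example with AndFilter, NotFilter, OrFilter, and XorFilter.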
| Filter | Description |
| --- | --- |
| AndFilter | Accepts nodes matching all of its predicate filters (AND operation). |
| CssSelectorNodeFilter | A NodeFilter that accepts nodes based on whether they match a CSS2 selector. |
| HasAttributeFilter | This class accepts all tags that have a certain attribute, and optionally, with a certain value. |
| HasChildFilter | This class accepts all tags that have a child acceptable to the filter. |
| HasParentFilter | This class accepts all tags that have a parent acceptable to another filter. |
| HasSiblingFilter | This class accepts all tags that have a sibling acceptable to another filter. |
| IsEqualFilter | This class accepts only one specific node. |
| LinkRegexFilter | This class accepts tags of class LinkTag that contain a link matching a given regex pattern. |
| LinkStringFilter | This class accepts tags of class LinkTag that contain a link matching a given pattern string. |
| NodeClassFilter | This class accepts all tags of a given class. |
| NotFilter | Accepts all nodes not acceptable to its predicate filter. |
| OrFilter | Accepts nodes matching any of its predicate filters (OR operation). |
| RegexFilter | This filter accepts all string nodes matching a regular expression. |
| StringFilter | This class accepts all string nodes containing the given string. |
| TagNameFilter | This class accepts all tags matching the tag name. |
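The composite filters behave like boolean combinators over simpler filters. The sketch below shows that idea with java.util.function.Predicate from the standard library (Java 8 or later), as an analogy and not the HTMLParser API. The class name, the field names, and the sample strings are all made up for illustration.

```java
import java.util.function.Predicate;

/** Analogy only: composes plain predicates the way AndFilter/OrFilter/NotFilter/XorFilter compose filters. */
final class FilterCombinators {
    // Two simple hypothetical filters over a tag's raw text.
    static final Predicate<String> IS_DIV = text -> text.startsWith("div");
    static final Predicate<String> HAS_ID = text -> text.contains("id=\"");

    static final Predicate<String> DIV_WITH_ID = IS_DIV.and(HAS_ID);   // like AndFilter
    static final Predicate<String> DIV_OR_HAS_ID = IS_DIV.or(HAS_ID);  // like OrFilter
    static final Predicate<String> NOT_DIV = IS_DIV.negate();          // like NotFilter
    // Exclusive or of two filters: exactly one of them accepts the text.
    static final Predicate<String> DIV_XOR_HAS_ID =
            text -> IS_DIV.test(text) ^ HAS_ID.test(text);

    public static void main(String[] args) {
        String[] samples = {"div id=\"r_123456\"", "div class=\"cell\"", "span id=\"x\"", "span"};
        for (String s : samples) {
            System.out.println(s
                    + " and=" + DIV_WITH_ID.test(s)
                    + " or=" + DIV_OR_HAS_ID.test(s)
                    + " not=" + NOT_DIV.test(s)
                    + " xor=" + DIV_XOR_HAS_ID.test(s));
        }
    }
}
```

In HTMLParser itself the same composition is done by passing filters to the composite filter's constructor, as the OrFilter in the thread example later in this post does.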
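Scraping a thread from http://www.v2ex.com
First fetch the page content, then analyze the structure of its elements and build the filters.
You can see that the id of every reply div is r_ followed by six digits, so matching with a regular expression is recommended. The topic div has the style border-bottom: 0px. (Be sure to check what the filters return, so that no extra nodes are pulled in.)
Create a method that returns the topic node and the reply nodes: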
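As a minimal sketch of the regular-expression match suggested above, the following uses only java.util.regex. The class name ReplyIdMatcher and the sample ids are hypothetical, and the pattern assumes ids keep the "r_ plus six digits" shape observed on the page.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of HTMLParser: recognises reply ids such as "r_123456". */
final class ReplyIdMatcher {
    // "r_" followed by exactly six digits.
    private static final Pattern REPLY_ID = Pattern.compile("r_\\d{6}");

    static boolean isReplyId(String id) {
        // matches() requires the whole string to match, so longer or shorter ids are rejected.
        return id != null && REPLY_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isReplyId("r_123456"));
        System.out.println(isReplyId("r_12345"));
        System.out.println(isReplyId("reply_123456"));
    }
}
```

This is stricter than a substring test for id="r_", which would also accept ids that merely start with that prefix. If the site ever changes the number of digits, the quantifier needs adjusting.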
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Get the topic node and all reply nodes from the HTML page.
 *
 * @param url    address of the thread to fetch
 * @param ENCODE character encoding of the page, e.g. "UTF-8"
 * @return the matching nodes, or null if parsing fails
 */
protected NodeList getNodelist(String url, String ENCODE) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding(ENCODE);
        // Filter for the topic div, identified by its inline style.
        NodeFilter filter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("style=\"border-bottom: 0px;\"");
            }
        };
        // Filter for all reply divs, whose id starts with "r_".
        NodeFilter replyfilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("id=\"r_");
            }
        };
        // Combine the two filters: a node is kept if either one accepts it.
        OrFilter allFilter = new OrFilter(filter, replyfilter);
        return parser.extractAllNodesThatMatch(allFilter);
    } catch (ParserException e) {
        e.printStackTrace();
        return null;
    }
}

Now that we have these nodes, the next step is to parse them.
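The example code only extracts some of the elements. The rest is legwork: work out how the nodes relate to each other, then locate the target nodes with filters or by walking the DOM tree.
The code below wraps the parsed node data in a bean.
That completes the parsing. Now write a test method that analyzes one thread: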
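@Test
public void test() throws Exception {
    // Html2Domain and Html2DomainImpl are the author's own classes from the attached project.
    Html2Domain parse = new Html2DomainImpl();
    parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
}

Here is the result of the run: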
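The output is too long, so the screenshot shows only the thread title and the thread content; try it yourself if you are interested. Pay attention to the URL: thread links on this site seem to expire, so if the test fails to fetch anything, try a different thread URL.
Project code attached (tested with JDK 1.6 + Eclipse Kepler):
http://pan.baidu.com/s/1mh9OuDi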