Parsing HTML in Java: a first attempt with HTMLParser

Published: 2024/3/12

To scrape data from a web page, I tried using HtmlParser to build a small crawler.

Below is a small example that scrapes the content of a forum thread.

1. Introduction to HtmlParser

htmlparser is a pure-Java library for parsing HTML, mainly used to transform or extract HTML. It is a good choice for analyzing scraped pages; unfortunately, reference documentation is scarce.
Project home: http://htmlparser.sourceforge.net/
API docs: http://htmlparser.sourceforge.net/javadoc/index.html

2. Creating a Maven project

Add the required dependencies

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.fancy</groupId>
    <artifactId>htmlParser</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.htmlparser</groupId>
            <artifactId>htmlparser</artifactId>
            <version>2.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

2.1 Creating a parser

Use a Parser to fetch and analyze a web page.

The parser does not process asynchronous requests made by the page. After fetching the page, it parses the whole page into a DOM tree stored as nodes/tags of various types, and we can then use filters to select the nodes we want.

The node types that htmlparser provides are listed below


org.htmlparser
Interface Node

All Superinterfaces:

Cloneable

All Known Subinterfaces:

Remark, Tag, Text

All Known Implementing Classes:

AbstractNode, AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, RemarkNode, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TextNode, TitleTag

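After a page is parsed, what you get is these nodes and the parent-child containment relationships between them.

Every node provides the following methods (many node types add more of their own; for example, LinkTag has methods for getting the link's URL, checking the URL's protocol type, and so on).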

Method Summary

void	accept(NodeVisitor visitor)	Apply the visitor to this node.
Object	clone()	Allow cloning of nodes.
void	collectInto(NodeList list, NodeFilter filter)	Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
void	doSemanticAction()	Perform the meaning of this tag.
NodeList	getChildren()	Get the children of this node.
int	getEndPosition()	Gets the ending position of the node.
Node	getFirstChild()	Get the first child of this node.
Node	getLastChild()	Get the last child of this node.
Node	getNextSibling()	Get the next sibling to this node.
Page	getPage()	Get the page this node came from.
Node	getParent()	Get the parent of this node.
Node	getPreviousSibling()	Get the previous sibling to this node.
int	getStartPosition()	Gets the starting position of the node.
String	getText()	Returns the text of the node.
void	setChildren(NodeList children)	Set the children of this node.
void	setEndPosition(int position)	Sets the ending position of the node.
void	setPage(Page page)	Set the page this node came from.
void	setParent(Node node)	Sets the parent of this node.
void	setStartPosition(int position)	Sets the starting position of the node.
void	setText(String text)	Sets the string contents of the node.
String	toHtml()	Return the HTML for this node.
String	toHtml(boolean verbatim)	Return the HTML for this node.
String	toPlainTextString()	A string representation of the node.
String	toString()	Return the string representation of the node.

Node filters can select nodes by node type or by the parent-child relationships between nodes, and you can also define your own. Several filters can be combined into a composite filter for multi-condition filtering, using AndFilter, NotFilter, OrFilter, or XorFilter.

Class Summary

AndFilter	Accepts nodes matching all of its predicate filters (AND operation).
CssSelectorNodeFilter	A NodeFilter that accepts nodes based on whether they match a CSS2 selector.
HasAttributeFilter	This class accepts all tags that have a certain attribute, and optionally, with a certain value.
HasChildFilter	This class accepts all tags that have a child acceptable to the filter.
HasParentFilter	This class accepts all tags that have a parent acceptable to another filter.
HasSiblingFilter	This class accepts all tags that have a sibling acceptable to another filter.
IsEqualFilter	This class accepts only one specific node.
LinkRegexFilter	This class accepts tags of class LinkTag that contain a link matching a given regex pattern.
LinkStringFilter	This class accepts tags of class LinkTag that contain a link matching a given pattern string.
NodeClassFilter	This class accepts all tags of a given class.
NotFilter	Accepts all nodes not acceptable to its predicate filter.
OrFilter	Accepts nodes matching any of its predicate filters (OR operation).
RegexFilter	This filter accepts all string nodes matching a regular expression.
StringFilter	This class accepts all string nodes containing the given string.
TagNameFilter	This class accepts all tags matching the tag name.
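
To make the idea of combining filters concrete, here is a minimal sketch that parses an in-memory HTML string instead of a live URL. The class name FilterDemo, the method replyTexts, and the class="reply" markup are made up for illustration and are not taken from the target site; the library calls are the ones listed in the tables above plus Parser.createParser.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterDemo {

    /** Returns the trimmed plain text of every div that has class="reply". */
    public static String[] replyTexts(String html) throws ParserException {
        // Build a parser over a string rather than fetching a URL.
        Parser parser = Parser.createParser(html, "UTF-8");

        // AndFilter accepts a node only if both inner filters accept it.
        NodeFilter replyDivs = new AndFilter(
                new TagNameFilter("div"),
                new HasAttributeFilter("class", "reply"));

        NodeList matches = parser.extractAllNodesThatMatch(replyDivs);
        String[] texts = new String[matches.size()];
        for (int i = 0; i < matches.size(); i++) {
            texts[i] = matches.elementAt(i).toPlainTextString().trim();
        }
        return texts;
    }

    public static void main(String[] args) throws ParserException {
        // Hypothetical markup: two matching divs, one p and one div that should be skipped.
        String html = "<div class=\"reply\">first</div>"
                + "<p class=\"reply\">not a div</p>"
                + "<div>no class</div>"
                + "<div class=\"reply\">second</div>";
        for (String text : replyTexts(html)) {
            System.out.println(text);
        }
    }
}
```

Swapping AndFilter for OrFilter would accept every div as well as every element with class="reply", which is why it is worth checking what a composite filter actually returns before building on it.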

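Scraping a thread from http://www.v2ex.com

First fetch the page content, then analyze the structure of the page elements in order to build the filters.

You can see that the id of every reply div is r_ followed by six digits, so matching with a regular expression is recommended. The topic div has the style border-bottom: 0px. Be sure to verify what each filter returns, so that no extra nodes are pulled in.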
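
The regular-expression match recommended above can be sketched with the standard library alone. This is only an illustration: the class name ReplyIdMatcher and the method isReplyId are hypothetical, and the pattern assumes the ids really are r_ followed by exactly six digits, as observed on the page at the time.

```java
import java.util.regex.Pattern;

public class ReplyIdMatcher {

    // Assumption from the text: reply ids look like r_ plus exactly six digits.
    private static final Pattern REPLY_ID = Pattern.compile("r_\\d{6}");

    /** Returns true only when the whole id value has the form r_NNNNNN. */
    public static boolean isReplyId(String id) {
        return id != null && REPLY_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isReplyId("r_123456"));
        System.out.println(isReplyId("r_12"));
        System.out.println(isReplyId("header"));
    }
}
```

Inside a custom NodeFilter, one way to use this would be to check that the node is a Tag, read its id with getAttribute("id"), and pass that value to isReplyId. Compared with a plain substring test on the tag text, this is less likely to pick up unrelated nodes whose text merely contains r_.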

Create a method that returns the collection of topic and reply nodes

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Gets the topic node and all reply nodes from the HTML.
 *
 * @param url    address of the page to fetch
 * @param ENCODE character encoding of the page
 * @return the matching nodes, or an empty list if parsing fails
 */
protected NodeList getNodelist(String url, String ENCODE) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding(ENCODE);
        // Filter that matches the topic div
        NodeFilter filter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("style=\"border-bottom: 0px;\"");
            }
        };
        // Filter that matches every reply div
        NodeFilter replyfilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("id=\"r_");
            }
        };
        // Combine the two filters
        OrFilter allFilter = new OrFilter(filter, replyfilter);
        return parser.extractAllNodesThatMatch(allFilter);
    } catch (ParserException e) {
        e.printStackTrace();
        // The original returned null here; an empty list lets callers iterate safely.
        return new NodeList();
    }
}

Now that we have these nodes, the next step is to parse them.

This example only extracts some of the elements. The rest is grunt work: analyze the node relationships step by step and locate the target nodes with filters or by walking the DOM tree.

The code below packs the parsed node data into beans

import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.util.NodeList;

public Forum parse2Thread(String url, String ENCODE) {
    List<Reply> replylist = new ArrayList<Reply>(); // list of replies
    Topic topic = new Topic(); // the topic
    NodeFilter divFilter = new NodeClassFilter(Div.class); // div filter
    NodeFilter headingFilter = new NodeClassFilter(HeadingTag.class); // heading filter
    NodeFilter tagFilter = new NodeClassFilter(TagNode.class); // matches any tag node
    NodeList nodeList = this.getNodelist(url, ENCODE);
    // Map each node onto the thread entities
    for (int i = 0; i < nodeList.size(); i++) {
        Node node = nodeList.elementAt(i);
        if (node.getText().contains("style=\"border-bottom: 0px;\"")) {
            // The node is the topic
            NodeList list = node.getChildren(); // children of the node
            // header div
            Node headerNode = list.extractAllNodesThatMatch(new NodeClassFilter(Div.class)).elementAt(0);
            // Thread title
            Node h1Node = headerNode.getChildren().extractAllNodesThatMatch(headingFilter).elementAt(0);
            topic.setTopicName(h1Node.toPlainTextString());
            // Poster information (the fixed indexes depend on the page layout)
            NodeList headerChrildrens = headerNode.getChildren();
            topic.setAnn_name(headerChrildrens.elementAt(15).toPlainTextString());
            topic.setTopicDescribe(headerChrildrens.elementAt(16).toPlainTextString());
            // Link to the poster's avatar
            Node frNode = headerChrildrens.extractAllNodesThatMatch(divFilter).elementAt(0);
            ImageTag imgNode = (ImageTag) frNode.getFirstChild().getFirstChild();
            topic.setAnn_img(imgNode.getImageURL());
            // cell div
            Node cellNode = list.extractAllNodesThatMatch(divFilter).elementAt(1);
            Node topic_content = cellNode.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            Node markdown_body = topic_content.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            // Plain text only; links and images are not included for now
            topic.setTopicBody(markdown_body.toPlainTextString());
        } else if (node.getText().contains("id=\"r_")) {
            // The node is a reply
            Reply reply = new Reply();
            Node tableNode = node.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            Node trNode = tableNode.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            // Tag nodes inside the reply row
            NodeList tagList = trNode.getChildren().extractAllNodesThatMatch(tagFilter);
            ImageTag reply_img = (ImageTag) tagList.elementAt(0).getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            reply.setReply_img(reply_img.getImageURL());
            //nodeList bodyNode = tagList;
            replylist.add(reply);
        }
    }
    System.out.println("-----------Entity----------------");
    Forum forum = new Forum(topic, replylist);
    System.out.println(forum.toString());
    return forum; // the original returned null here
}
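
The Forum, Topic and Reply beans used above are not shown in this post. The following is a minimal sketch of what they might look like: the setter names and the Forum constructor are taken from the calls in parse2Thread, while the field types, the getters, and the toString output are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bean: setter names follow the calls in parse2Thread.
class Topic {
    private String topicName;     // thread title
    private String topicDescribe; // description text from the header
    private String topicBody;     // plain-text body
    private String ann_name;      // poster's name
    private String ann_img;       // poster's avatar URL

    public String getTopicName() { return topicName; }
    public void setTopicName(String topicName) { this.topicName = topicName; }
    public void setTopicDescribe(String topicDescribe) { this.topicDescribe = topicDescribe; }
    public void setTopicBody(String topicBody) { this.topicBody = topicBody; }
    public void setAnn_name(String ann_name) { this.ann_name = ann_name; }
    public void setAnn_img(String ann_img) { this.ann_img = ann_img; }

    @Override
    public String toString() {
        return "Topic{name=" + topicName + ", author=" + ann_name + ", avatar=" + ann_img
                + ", describe=" + topicDescribe + ", body=" + topicBody + "}";
    }
}

// Hypothetical bean: only the avatar URL is filled in by the code in this post.
class Reply {
    private String reply_img; // replier's avatar URL

    public String getReply_img() { return reply_img; }
    public void setReply_img(String reply_img) { this.reply_img = reply_img; }

    @Override
    public String toString() {
        return "Reply{avatar=" + reply_img + "}";
    }
}

// Hypothetical aggregate of one topic and its replies.
class Forum {
    private final Topic topic;
    private final List<Reply> replies;

    public Forum(Topic topic, List<Reply> replies) {
        this.topic = topic;
        this.replies = replies;
    }

    public Topic getTopic() { return topic; }
    public List<Reply> getReplies() { return replies; }

    @Override
    public String toString() {
        return "Forum{topic=" + topic + ", replies=" + replies + "}";
    }
}

public class ForumBeansDemo {
    public static void main(String[] args) {
        Topic topic = new Topic();
        topic.setTopicName("Example thread");
        Reply reply = new Reply();
        reply.setReply_img("http://example.com/avatar.png");
        List<Reply> replies = new ArrayList<Reply>();
        replies.add(reply);
        System.out.println(new Forum(topic, replies));
    }
}
```

Overriding toString in each bean is what makes the System.out.println(forum.toString()) call in parse2Thread print something readable.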

The parsing is done. Now write a test method that analyzes one thread:

@Test
public void test() throws Exception {
    Html2Domain parse = new Html2DomainImpl();
    parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
}

Let's look at the result:

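The output is too long, so the screenshot only shows the thread title and the thread body; if you are interested, run the test yourself. Pay close attention to the URL: thread links on this site seem to expire after a while, so if the test fails to fetch the page, try a different thread URL.

Project code is attached. It was tested with JDK 1.6 and Eclipse Kepler.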

http://pan.baidu.com/s/1mh9OuDi
