Parsing HTML with jsoup

Published: 2025/3/20

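After fetching a page, we still need to parse it. We could parse the page with string-handling utilities or with regular expressions, but both approaches carry a high development cost, so we want a tool built specifically for parsing HTML pages.

jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.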

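jsoup's main features are:

Parse HTML from a URL, a file, or a string
Find and extract data using DOM methods or CSS selectors
Manipulate HTML elements, attributes, and text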
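As a rough illustration of the three features listed above, the sketch below parses a string, finds an element with a CSS selector, and then changes its attribute and text. The class name `JsoupFeaturesSketch` and the sample markup are made up for this example, and it assumes jsoup is already on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupFeaturesSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
          + "<body><a id=\"home\" class=\"nav\" href=\"/index.html\">Home</a></body></html>";

    // Parse a string, find the link with a CSS selector, and return its text.
    static String linkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a.nav").first(); // null when nothing matches
        return link == null ? "" : link.text();
    }

    // Change the link's href and text, then return the element's outer HTML.
    static String rewriteLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a.nav").first();
        if (link == null) {
            return "";
        }
        link.attr("href", "/start.html"); // set an attribute value
        link.text("Start");               // replace the element's text
        return link.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(linkText(HTML));
        System.out.println(rewriteLink(HTML));
    }
}
```

`select(...).first()` is used rather than a newer convenience method so that the sketch should also compile against the older jsoup version used in this article.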

(1) Add the pom dependencies

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>

(2) Parse a URL

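    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Test;

    @Test
    public void testUrl() throws Exception {
        // Parse the URL; the second argument is the timeout in milliseconds
        Document doc = Jsoup.parse(new URL("https://movie.douban.com"), 1000);
        // Get the text of the first <title> element
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }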

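Although jsoup can replace HttpClient by issuing the request and parsing the response itself, it is rarely used that way. Real crawlers need multithreading, connection pooling, proxies and so on, and jsoup's support for these is limited, so we generally use jsoup only as an HTML parsing tool.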
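A minimal sketch of that division of labour might look like the following: an HTTP client downloads the page and jsoup only parses the resulting string. It uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) as a stand-in for whichever HTTP client the crawler actually uses; the class name, method names, and timeout values are assumptions made for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchThenParseSketch {
    // Download the page body with the JDK HTTP client (an assumed stand-in).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Hand the downloaded text to jsoup; baseUri lets jsoup resolve relative links.
    static String titleOf(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.title();
    }

    public static void main(String[] args) throws Exception {
        String url = "https://movie.douban.com";
        System.out.println(titleOf(fetch(url), url));
    }
}
```

Keeping `fetch` and `titleOf` separate means the parsing step can be tested against saved HTML without any network access, which is also how the remaining examples in this article work.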

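(3) Parse a string

Here we save the HTML of the Douban Movie Top 250 page to disk, then use a utility class from the commons-io package to read that HTML into a string.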

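    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Test;

    @Test
    public void testString() throws Exception {
        // Read the file into a string with the commons-io utility class
        String content = FileUtils.readFileToString(
                new File("C:\\Users\\Desktop\\學習\\豆瓣電影 Top 250.html"), "utf8");
        // Parse the string
        Document doc = Jsoup.parse(content);
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }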

(4) Parse a file

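    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Test;

    @Test
    public void testFile() throws Exception {
        // Parse the file directly, telling jsoup which charset to use
        Document doc = Jsoup.parse(
                new File("C:\\Users\\Desktop\\學習\\豆瓣電影 Top 250.html"), "utf8");
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }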

(5) Traverse the document using the DOM

Function	Method
Find an element by id	getElementById
Find elements by tag	getElementsByTag
Find elements by class	getElementsByClass
Find elements by attribute	getElementsByAttribute

    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Test;

    @Test
    public void testDom() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(
                new File("C:\\Users\\Desktop\\學習\\豆瓣電影 Top 250.html"), "utf8");

        // 1. Get an element by id
        String element = doc.getElementById("content").text();
        System.out.println(element);

        // 2. Get elements by tag
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);

        // 3. Get elements by class
        String element2 = doc.getElementsByClass("inp-btn").first().text();
        System.out.println(element2);

        // 4. Get elements by attribute name, then by attribute name and value
        String element3 = doc.getElementsByAttribute("href").first().text();
        System.out.println(element3);
        String element4 = doc.getElementsByAttributeValue(
                "href", "https://movie.douban.com/subject/1292052/").text();
        System.out.println(element4);
    }
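The lookups above call `.text()` directly on the result, which throws a NullPointerException if the page does not contain the expected id or class: `getElementById` returns null when the id is absent, and `first()` returns null on an empty result. One possible way to guard against that is sketched below; the helper names and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomLookupSketch {
    // Text of the element with the given id, or the fallback when the id is absent.
    static String textById(Document doc, String id, String fallback) {
        Element el = doc.getElementById(id); // null when no element has this id
        return el == null ? fallback : el.text();
    }

    // Text of the first element with the given class, or the fallback when none match.
    static String firstTextByClass(Document doc, String className, String fallback) {
        Elements matches = doc.getElementsByClass(className);
        return matches.isEmpty() ? fallback : matches.first().text();
    }

    public static void main(String[] args) {
        // Hypothetical markup for illustration only.
        Document doc = Jsoup.parse("<div id=\"content\"><p class=\"note\">hello</p></div>");
        System.out.println(textById(doc, "content", "(missing)"));
        System.out.println(textById(doc, "sidebar", "(missing)"));
        System.out.println(firstTextByClass(doc, "note", "(missing)"));
        System.out.println(firstTextByClass(doc, "inp-btn", "(missing)"));
    }
}
```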

(6) Get data from an element

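    import java.io.File;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attributes;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.junit.Test;

    @Test
    public void testData() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(
                new File("C:\\Users\\Desktop\\學習\\豆瓣電影 Top 250.html"), "utf8");
        // Look up the element by id
        Element element = doc.getElementById("top-nav-appintro");
        String str = "";

        // 1. Get the element's id
        str = element.id();
        System.out.println(str);

        // 2. Get the element's class name, as one string and as a set
        str = element.className();
        System.out.println(str);
        Set<String> classSet = element.classNames();
        for (String s : classSet) {
            System.out.println(s);
        }

        // 3. Get the value of an attribute (the name must not contain a trailing space)
        str = element.attr("class");
        System.out.println(str);

        // 4. Get all attributes of the element
        Attributes attributes = element.attributes();
        System.out.println(attributes.toString());

        // 5. Get the element's text content
        str = element.text();
        System.out.println(str);
    }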
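When reading attributes it can help to know how jsoup behaves for missing attributes and relative URLs. The sketch below, using a hypothetical link and base URI, shows that `attr` returns an empty string for an attribute that is not present, that `hasAttr` can be used to tell the two cases apart, and that `absUrl` resolves a relative `href` against the document's base URI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AttributeSketch {
    // Parse the markup with a base URI and return the first <a> element (or null).
    static Element firstLink(String html, String baseUri) {
        return Jsoup.parse(html, baseUri).select("a").first();
    }

    public static void main(String[] args) {
        // Hypothetical markup and base URI for illustration only.
        Element a = firstLink(
                "<a class=\"nav item\" href=\"/top250\">Top 250</a>",
                "https://movie.douban.com/");

        System.out.println(a.attr("href"));      // raw attribute value as written in the HTML
        System.out.println(a.absUrl("href"));    // resolved against the base URI
        System.out.println(a.hasClass("item"));  // true when the class list contains "item"
        System.out.println(a.hasAttr("title"));  // false: the attribute is not present
        System.out.println(a.attr("title").isEmpty()); // missing attributes read as ""
    }
}
```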

(7) Selectors

Function	Syntax
Find elements by tag	tagname
Find an element by id	#id
Find elements by class name	.class
Find elements by attribute	[attribute]
Find elements by attribute value	[attr=value]

    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;

    @Test
    public void testSelector() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(
                new File("C:\\Users\\楊峰宇\\Desktop\\學習\\豆瓣電影 Top 250.html"), "utf8");

        // Find elements by tag
        Elements elements = doc.select("span");
        for (Element element : elements) {
            System.out.println(element.text());
        }
        System.out.println("==================");

        // Find an element by id, e.g. #id
        Elements elements1 = doc.select("#top-nav-appintro");
        System.out.println(elements1.first().text());
        System.out.println("==================");

        // Find elements by class, e.g. .class
        Elements elements2 = doc.select(".inp-btn");
        for (Element element : elements2) {
            System.out.println(element.text());
        }
        System.out.println("==================");

        // Find elements by attribute, e.g. [abc]
        Elements elements3 = doc.select("[href]");
        for (Element element : elements3) {
            System.out.println(element.text());
        }
        System.out.println("==================");

        // Find elements by attribute value, e.g. [abc=123]
        Elements elements4 = doc.select("[class=pic]");
        for (Element element : elements4) {
            System.out.println(element.text());
        }
    }
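The basic selectors in the table can also be combined in one query, which is often more precise than selecting every `span` or every `[href]`. The sketch below shows a tag plus class (`div.item`), a descendant combinator (a space), and a direct-child combinator (`>`). The markup is a made-up fragment, not the real Douban page, and the helper `texts` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class CombinedSelectorSketch {
    // Hypothetical fragment for illustration only.
    static final String HTML =
            "<div class=\"item\">"
          + "<div class=\"pic\"><a href=\"https://example.com/a\">pic link</a></div>"
          + "<span class=\"title\">First</span><span class=\"other\">x</span>"
          + "</div>"
          + "<div class=\"item\"><span class=\"title\">Second</span></div>";

    // Collect the text of every element matched by the CSS query.
    static List<String> texts(String html, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element el : Jsoup.parse(html).select(cssQuery)) {
            out.add(el.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // span.title elements anywhere inside a div.item
        System.out.println(texts(HTML, "div.item span.title"));
        // <a> elements with an href that are direct children of div.pic
        System.out.println(texts(HTML, "div.pic > a[href]"));
    }
}
```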
