
Scraping HTML Tags with Jsoup

Published: 2023/12/14 | HTML | 豆豆


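How does Jsoup fetch and parse HTML element content?

WeChat Moments share links usually need to scrape HTML tags to get the content of the current page and its first image. How do you scrape HTML elements? In Java this is usually done with Jsoup, an HTML parsing library for Java. Sometimes we need to extract useful information from a page's source, such as its title or body, and parsing the page with jsoup makes this kind of task very easy.

Jsoup can parse an HTML string, a URL, or an HTML file:

1: Parse a string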

// Requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document;
public static void parseHtmlText() {
    String html = "<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>demo</title>"
            + "</head><body><h1>你好，吃飯了嗎</h1></body></html>";
    // Parse the string into a Document and print its title
    Document document = Jsoup.parse(html);
    System.out.println("title:" + document.title());
}

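Notes:

parse(String html, String baseUri) parses the input HTML into a new Document. The baseUri argument is used to resolve relative URLs into absolute URLs and indicates which site the document was retrieved from. If that does not apply, you can use parse(String html) to parse an HTML string, as in the example above.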

As long as the string passed in is not null, parsing returns a sensibly structured document that contains (at least) a head and a body element.

Once you have a Document, you can use the appropriate methods in Document, or in its parent classes Element and Node, to get the data you need.

2: Parse a body fragment
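To make the role of baseUri concrete, here is a minimal sketch. The class name BaseUriExample, the helper firstAbsoluteLink, and the example.com base URI are placeholders chosen for illustration and are not part of the original article. It parses a snippet with a base URI and resolves the first link with absUrl.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Returns the absolute URL of the first <a href> in the HTML,
    // resolved against baseUri. Returns "" when there is no link
    // or when the URL cannot be made absolute.
    public static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/index.html\">Docs</a></p>";
        // example.com is a placeholder base URI (assumption for this sketch)
        System.out.println(firstAbsoluteLink(html, "https://example.com/"));
        // With an empty base URI the relative link cannot be resolved
        System.out.println("[" + firstAbsoluteLink(html, "") + "]");
    }
}
```

With a base URI the relative href can be resolved; with an empty base URI, absUrl has nothing to resolve against and returns an empty string.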

public static void parseHtmlBody() {
    String html = "<h1>Lorem ipsum</h1>";
    Document doc = Jsoup.parseBodyFragment(html);
    Element body = doc.body();
    System.out.print("body:" + body.html());
}

3: Parse an HTML file
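As a rough illustration of what parseBodyFragment does with incomplete markup, the sketch below counts the top-level elements that end up inside the body. The class and method names are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentExample {
    // Parses a fragment and returns how many direct child elements
    // the resulting <body> contains.
    public static int topLevelCount(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().children().size();
    }

    // Returns true when the parsed fragment was wrapped in a full
    // document shell, i.e. both <head> and <body> exist.
    public static boolean hasShell(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        // The two <p> tags are never closed; the parser closes them itself
        String fragment = "<h1>Title</h1><p>One<p>Two";
        System.out.println("children: " + topLevelCount(fragment));
        System.out.println("has shell: " + hasShell(fragment));
        System.out.println(Jsoup.parseBodyFragment(fragment).body().html());
    }
}
```

The point of the sketch is that a fragment does not need to be a complete or well-formed page: the parser wraps it in a document and balances the unclosed tags.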

// Requires: import java.io.File; import java.io.IOException;
public static void parseHtmlFile() {
    try {
        // Build the path to index.html from the directory of the current class
        String file = JsoupMain.class.getResource("").getPath() + "index.html";
        System.out.println("file:" + file);
        Document document = Jsoup.parse(new File(file), "UTF-8", "");
        System.out.println("title:" + document.title());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

4: Parse HTML from a URL

public static void parseHtmlUrl() {
    String url = "http://www.baidu.com/";
    try {
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println(title);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

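Note: this example fetches the HTML from a URL and reads its title (on Android, this method should be called from a background thread). connect(String url) creates a new Connection, and get() fetches and parses the HTML file. If an error occurs while fetching the HTML from the URL, an IOException is thrown and should be handled appropriately.

5: Get the img tags in an HTML document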
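The following sketch shows one way to configure the Connection before calling get() and to handle the IOException explicitly. The class name ConnectExample, the user-agent string, and the 5-second timeout are illustrative assumptions, not values taken from the article.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectExample {
    // Builds a configured Connection. Nothing is sent over the
    // network until get() or execute() is called on it.
    public static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JsoupDemo/1.0)") // illustrative value
                .timeout(5000); // milliseconds, illustrative value
    }

    // Fetches the page and returns its title, or "" if the fetch fails.
    public static String fetchTitle(String url) {
        try {
            Document doc = prepare(url).get();
            return doc.title();
        } catch (IOException e) {
            // Report the failure instead of only printing a stack trace
            System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://www.baidu.com/"));
    }
}
```

Splitting the configuration into prepare() keeps the network call in one place, which makes it easier to move that call onto a background thread on Android.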

public static void parseHtmlImgElement() {
    String url = "https://www.qq.com/";
    try {
        Document document = Jsoup.connect(url).get();
        // Get the <body> element
        Element element = document.body();
        Elements elements = element.getElementsByTag("img");
        for (Element img : elements) {
            // src is protocol-relative here, so "http:" is prepended
            System.out.print("Image URL:" + "http:" + img.attr("src"));
            System.out.print(";width:" + img.attr("width"));
            System.out.println(";height:" + img.attr("height"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Note: the string returned by attr("src") is not a full URL. It is protocol-relative, for example //www.baidu.com/img/gs.gif, which is why the code prepends "http:".

The output is as follows:

Image URL:http://mat1.gtimg.com/pingjs/ext2020/qqindex2018/dist/img/qq_logo_2x.png;width:100%;height:
Image URL:http://mat1.gtimg.com/pingjs/ext2020/test2017/netwatch.png;width:;height:
Image URL:http://img1.gtimg.com/ninja/2/2018/10/ninja153907290259802.png;width:;height:
Image URL:http://img1.gtimg.com/ninja/2/2018/10/ninja153907291410277.png;width:;height:
Image URL:http://inews.gtimg.com/newsapp_ls/0/7969043087_640330/0;width:;height:
Image URL:http://img1.gtimg.com/ninja/2/2019/03/ninja155169290388291.jpg;width:;height:
Image URL:http://inews.gtimg.com/newsapp_ls/0/7967777098_150120/0;width:;height:
Image URL:http://img1.gtimg.com/bj/pics/hv1/64/147/2305/149920174.jpg;width:;height:
Image URL:http://mat1.gtimg.com/news/zt2018/qqpclmlogo/logo_ssy.png;width:;height:
Image URL:http://img1.gtimg.com/ninja/2/2019/03/ninja155165943065253.jpg;width:;height:
Image URL:http://inews.gtimg.com/newsapp_ls/0/7967571215_640330/0;width:;height:
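Prepending a fixed scheme by hand only works when every src is protocol-relative. As an alternative sketch, jsoup can resolve each src against a base URI with absUrl, which should also cover relative paths. The class name ImageUrlExample and the example.com hosts below are placeholders, not part of the original article.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageUrlExample {
    // Collects the absolute URL of every <img> that has a src attribute.
    // Images whose src cannot be resolved against baseUri are skipped.
    public static List<String> imageUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            String abs = img.absUrl("src");
            if (!abs.isEmpty()) {
                urls.add(abs);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder markup: one protocol-relative src, one root-relative src
        String html = "<img src=\"//cdn.example.com/a.png\">"
                + "<img src=\"/b.jpg\">"
                + "<img alt=\"no src\">";
        for (String u : imageUrls(html, "https://example.com/")) {
            System.out.println(u);
        }
    }
}
```

For a share preview that only needs the first image, taking the first entry of the returned list (when it is not empty) would be enough. When the document comes from Jsoup.connect(url).get(), the base URI is already set from the fetched URL, so absUrl can be called directly.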

6: Get the <a> tags in an HTML document and print their links

public static void parseHtmlLinkElement() {
    String url = "https://www.baidu.com/";
    try {
        Document document = Jsoup.connect(url).get();
        // Get the <body> element
        Element element = document.body();
        Elements elements = element.getElementsByTag("a");
        for (Element link : elements) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println("linkHref:" + linkHref);
            System.out.println("linkText:" + linkText);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

7: Find elements and attributes with selector syntax

public static void selectHtmlLinkElement() {
    String url = "https://www.baidu.com/";
    try {
        Document document = Jsoup.connect(url).get();
        // Select <a> elements that have an href attribute
        Elements links = document.select("a[href]");
        // Select <img> elements whose src ends with .png
        Elements pngs = document.select("img[src$=.png]");
        // Select the first <div> whose class is s-isindex-wrap
        // (a class selector needs a leading dot)
        Element divElement = document.select("div.s-isindex-wrap").first();
        if (divElement != null) {
            System.out.println(divElement.data());
        }
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
        System.out.println("**************selectHtmlLinkElement***************");
        for (Element png : pngs) {
            System.out.println("http:" + png.attr("src"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

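Using selectors:
tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: find elements that have an attribute, e.g. [href]
[^attr]: find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: find elements whose attribute value matches a regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]

Reference: https://www.open-open.com/jsoup/parse-document-from-string.htm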
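The selectors listed above can be tried offline against a small in-memory document. The sketch below is only an illustration: the class name SelectorExample, the helper count, and the sample markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorExample {
    // Sample markup invented for this sketch
    static final String HTML =
            "<div id=\"logo\" class=\"masthead\">"
          + "<img src=\"/img/logo.PNG\" width=\"500\"></div>"
          + "<a href=\"/path/page.html\" data-id=\"7\">Page</a>"
          + "<a>No href</a>"
          + "<img src=\"/img/photo.jpeg\">";

    // Returns how many elements in the sample document match the query.
    public static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",                               // by tag
            "#logo",                           // by id
            ".masthead",                       // by class
            "[href]",                          // has attribute
            "[^data-]",                        // attribute name prefix
            "[width=500]",                     // attribute value
            "[href*=/path/]",                  // value contains
            "img[src~=(?i)\\.(png|jpe?g)]"     // value matches regex
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Running the queries against a fixed string rather than a live page makes it easier to see which elements each selector picks, since the markup does not change between runs.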
