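Learning Jsoup
1. Introduction to Jsoup
Jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
In web crawling, its main role comes after HttpClient has fetched a page: Jsoup extracts the needed information from that page, and its powerful jQuery-like CSS selectors make pulling out the data easy.
Jsoup can also fetch a page's source directly from a URL, but only over HTTP and HTTPS, so its fetching support is limited.
It is therefore used mainly for parsing HTML. The HTML to parse can come from a string, a URL, or a file.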
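The three input sources can be sketched as follows. This is a minimal illustration, not code from the original article: the class name ParseSourcesSketch and the inline HTML are made up for the example, and fromUrl needs network access, so main does not call it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: one helper per input source.
class ParseSourcesSketch {

    // Parse HTML that is already in memory as a string.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse an HTML file on disk; the charset name tells Jsoup how to decode it.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    // Fetch and parse a URL directly (HTTP/HTTPS only). Requires network access.
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    public static void main(String[] args) throws IOException {
        // Made-up page used for both the string and the file case.
        String html = "<html><head><title>Demo page</title></head>"
                + "<body><p>Hello</p></body></html>";

        System.out.println("from string: " + fromString(html).title());

        // Write the same HTML to a temporary file and parse it from disk.
        Path tmp = Files.createTempFile("jsoup-demo", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            System.out.println("from file:   " + fromFile(tmp.toFile()).title());
        } finally {
            Files.delete(tmp);
        }
    }
}
```

All three helpers return the same Document type, so the extraction code that follows does not need to know where the HTML came from.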
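org.jsoup.Jsoup converts the input HTML into an org.jsoup.nodes.Document object, and the desired elements are then taken from that Document.
org.jsoup.nodes.Document extends org.jsoup.nodes.Element, which in turn extends org.jsoup.nodes.Node. These classes provide a rich set of methods for retrieving HTML elements.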
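A small sketch of what that inheritance chain means in practice: a Document can be used wherever an Element or a Node is expected. The class name HierarchySketch, the helper bodyChildCount, and the sample HTML are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical demo class showing Document -> Element -> Node.
class HierarchySketch {

    // Counts the direct child elements of <body> in the parsed HTML.
    static int bodyChildCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<html><head><title>Demo</title></head>"
                + "<body><h1>Hi</h1><p>text</p></body></html>");

        // No cast is needed: a Document is an Element, and an Element is a Node.
        Element asElement = doc;
        Node asNode = doc;

        System.out.println("node name:       " + asNode.nodeName());
        System.out.println("root children:   " + asElement.children().size());
        System.out.println("body children:   " + doc.body().children().size());
        System.out.println("document title:  " + doc.title());
    }
}
```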
Jsoup official site: https://jsoup.org/
Latest Jsoup download: https://jsoup.org/download
Jsoup documentation in Chinese: http://www.open-open.com/jsoup/
2. Usage examples
1. Add the Maven dependencies:
<!-- HttpClient support -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<!-- Jsoup support -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
2. A simple example:
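import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {

    public static void main(String[] args) throws Exception {
        // Create the HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        // Create the HttpClient instance and run the request;
        // try-with-resources closes both even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity, "UTF-8"); // get the page content
            Document document = Jsoup.parse(content); // parse the page into a Document

            Elements elements = document.getElementsByTag("title"); // DOM elements whose tag is title
            Element element = elements.first(); // first DOM element, or null if none matched
            if (element != null) {
                String title = element.text(); // the element's text
                System.out.println("cnblogs title: " + title);
            }

            Element element2 = document.getElementById("site_nav_top"); // null if the id is absent
            if (element2 != null) {
                String navTop = element2.text();
                System.out.println("Motto: " + navTop);
            }
        }
    }
}
3. Finding DOM elements with Jsoup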
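The lookup methods on Document (getElementsByTag, getElementById, getElementsByClass, getElementsByAttribute, getElementsByAttributeValue) can also be tried without network access. The sketch below is only an illustration under stated assumptions: the class LookupSketch and its HTML string are invented, and the markup merely borrows the id and class names used in this article. It does not reflect the real cnblogs page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class exercising the five lookup styles offline.
class LookupSketch {

    // Made-up markup; only the id and class names echo the article.
    static final String HTML =
            "<html><head><title>Demo blog</title></head><body>"
            + "<div id=\"site_nav_top\">Code changes the world</div>"
            + "<div class=\"post_item\"><a href=\"/p/1\" target=\"_blank\">First post</a></div>"
            + "<div class=\"post_item\"><a href=\"/p/2\">Second post</a></div>"
            + "<img src=\"logo.png\" width=\"120\">"
            + "</body></html>";

    static Document demo() {
        return Jsoup.parse(HTML);
    }

    // Returns the text of the element with the given id, or null when it is absent.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        Document doc = demo();

        // 1. By tag name; first() returns null when nothing matched.
        Element title = doc.getElementsByTag("title").first();
        System.out.println("title: " + (title == null ? "(none)" : title.text()));

        // 2. By id.
        System.out.println("nav:   " + textById(doc, "site_nav_top"));

        // 3. By CSS class.
        System.out.println("post_item elements: "
                + doc.getElementsByClass("post_item").size());

        // 4. By attribute name.
        System.out.println("elements with width: "
                + doc.getElementsByAttribute("width").size());

        // 5. By attribute name and value.
        System.out.println("target=_blank elements: "
                + doc.getElementsByAttributeValue("target", "_blank").size());
    }
}
```

Checking for null after getElementById and first() matters on real pages, where an id or tag that existed yesterday may be gone after a redesign.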
A simple example:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest2 {

    public static void main(String[] args) throws Exception {
        // Create the HttpGet instance and send a browser-like User-Agent
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // Create the HttpClient instance and run the request;
        // try-with-resources closes both even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity, "UTF-8"); // get the page content
            Document document = Jsoup.parse(content); // parse the page into a Document

            /* 1. Get elements by tag */
            Elements elements = document.getElementsByTag("title"); // DOM elements whose tag is title
            Element element = elements.first(); // first DOM element, or null if none matched
            if (element != null) {
                System.out.println("cnblogs title: " + element.text());
            }

            /* 2. Get an element by id */
            Element element2 = document.getElementById("site_nav_top"); // null if the id is absent
            if (element2 != null) {
                System.out.println("Motto: " + element2.text());
            }

            /* 3. Get elements by CSS class */
            Elements elements3 = document.getElementsByClass("post_item");
            System.out.println("============ Elements by class ============");
            for (Element e : elements3) {
                System.out.println(e.html());
                System.out.println("------------------------------");
            }

            /* 4. Query the DOM by attribute name */
            Elements elements4 = document.getElementsByAttribute("width");
            System.out.println("============ Elements by attribute name ============");
            for (Element e : elements4) {
                System.out.println(e.toString());
                System.out.println("------------------------------");
            }

            /* 5. Query the DOM by attribute name and value */
            Elements elements5 = document.getElementsByAttributeValue("target", "_blank");
            System.out.println("============ Elements by attribute name and value ============");
            for (Element e : elements5) {
                System.out.println(e.toString());
                System.out.println("------------------------------");
            }
        }
    }
}
4. Finding elements with CSS-like or jQuery-like selectors
This uses the following method of the Element class: public Elements select(String cssQuery)
You pass in a CSS-like or jQuery-like selector string, and it returns the matching elements.
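To see how a few selector strings behave, select can be run against a small in-memory page. This is a hedged sketch, not the article's code: the class SelectSketch, its helpers count and texts, and the HTML are all invented for illustration, with class names borrowed from this article.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class for Element.select(String cssQuery).
class SelectSketch {

    // Made-up markup: two posts, one anchor without href, two images.
    static final String HTML =
            "<html><body>"
            + "<div id=\"post_list\">"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/1\">First post</a></h3></div></div>"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/2\">Second post</a></h3></div></div>"
            + "</div>"
            + "<a name=\"top\">anchor without href</a>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\">"
            + "</body></html>";

    // Number of elements in the demo page that match the selector.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    // Text of every element in the demo page that matches the selector.
    static List<String> texts(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            result.add(el.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Descendant selector: links inside post headings.
        System.out.println(texts(".post_item .post_item_body h3 a"));
        // Attribute presence: only anchors that have an href.
        System.out.println("a[href]: " + count("a[href]"));
        // Attribute suffix match: images whose src ends with .png.
        System.out.println("img[src$=.png]: " + count("img[src$=.png]"));
    }
}
```

A selector that matches nothing yields an empty Elements collection rather than null, so looping over the result is safe without a null check.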
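An example follows:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest3 {

    public static void main(String[] args) throws Exception {
        // Create the HttpGet instance and send a browser-like User-Agent
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // Create the HttpClient instance and run the request;
        // try-with-resources closes both even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity, "UTF-8"); // get the page content
            Document document = Jsoup.parse(content); // parse the page into a Document

            // 1. Find the DOM elements of all posts
            Elements elements = document.select(".post_item .post_item_body h3 a");
            for (Element ele : elements) {
                System.out.println("Blog title: " + ele.text());
            }
            System.out.println("------------------------ divider ------------------------");

            // 2. Find a elements that have an href attribute
            Elements hrefElements = document.select("a[href]");
            for (Element ele : hrefElements) {
                System.out.println(ele.toString());
            }
            System.out.println("------------------------ divider ------------------------");

            // 3. Find image DOM nodes whose src ends with .png
            Elements imgElements = document.select("img[src$=.png]");
            for (Element ele : imgElements) {
                System.out.println(ele.toString());
            }
            System.out.println("------------------------ divider ------------------------");

            // 4. Get the first DOM element whose tag is title (null if none matched)
            Element titleEle = document.getElementsByTag("title").first();
            if (titleEle != null) {
                System.out.println("Title: " + titleEle.text());
            }
        }
    }
}
5. Getting attribute values of DOM elements with Jsoup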
1. Goal: get the titles and URLs of the blog posts on cnblogs, and get the friend links
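2. Implementation:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest4 {

    public static void main(String[] args) throws Exception {
        // Create the HttpGet instance and send a browser-like User-Agent
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        // Create the HttpClient instance and run the request;
        // try-with-resources closes both even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity, "UTF-8"); // get the page content
            Document document = Jsoup.parse(content); // parse the page into a Document

            // 1. Use a selector to find every blog title and its link
            Elements ele = document.select("#post_list .post_item .post_item_body h3 a");
            for (Element e : ele) {
                System.out.println("Blog title: " + e.text() + " --- Blog URL: " + e.attr("href"));
            }

            // 2. Get the friend links (first() returns null if the id is absent)
            Element linkEle = document.select("#friend_link").first();
            if (linkEle != null) {
                System.out.println("Friend links as plain text: " + linkEle.text());
                System.out.println("Friend links as HTML: " + linkEle.html());
            }
        }
    }
}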
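The difference between attr, absUrl, text, and html can be checked offline. The sketch below is an illustration only: the class AttrSketch, the base URI https://example.com/, and the HTML are assumptions, with ids borrowed from this article. Passing a base URI to Jsoup.parse is what allows absUrl to resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class for reading attributes and content.
class AttrSketch {

    // Made-up markup; the ids echo the ones used in this article.
    static final String HTML =
            "<html><body>"
            + "<div id=\"post_list\"><h3><a href=\"/p/1\">First post</a></h3></div>"
            + "<div id=\"friend_link\">"
            + "<a href=\"https://a.example/\">Site A</a> "
            + "<a href=\"https://b.example/\">Site B</a>"
            + "</div>"
            + "</body></html>";

    // The base URI (an assumption for this demo) lets Jsoup resolve relative links.
    static Document demo() {
        return Jsoup.parse(HTML, "https://example.com/");
    }

    // The href exactly as written in the markup.
    static String firstHref() {
        return demo().select("#post_list a").first().attr("href");
    }

    // The href resolved against the document's base URI.
    static String firstAbsHref() {
        return demo().select("#post_list a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println("attr(\"href\"):   " + firstHref());
        System.out.println("absUrl(\"href\"): " + firstAbsHref());

        Element friends = demo().getElementById("friend_link");
        // text() drops the tags; html() keeps the inner markup.
        System.out.println("text(): " + friends.text());
        System.out.println("html(): " + friends.html());

        // A missing attribute yields an empty string, not null.
        System.out.println("missing attr is empty: "
                + friends.attr("title").isEmpty());
    }
}
```

When crawling, absUrl is usually the more useful of the two, since the next request needs a complete URL rather than a path relative to the page.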