Fetching Website Data with a Java Crawler

Published: 2023/12/14


I currently want to get the performance scores that correspond to various computer and CPU models. This site lists them:

https://www.cpubenchmark.net/cpu_list.php

Haha... (you can guess what comes next...)

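The number of models is finite after all, so the data can live in the database as something like a dictionary (lookup) table, which makes it easy to query and maintain later.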
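To make the "dictionary table" idea concrete, here is a minimal in-memory sketch of a model-to-score lookup. It is only an illustration: the class name `CpuScoreTable` and its methods are hypothetical, a real setup would use a database table instead of a map, and the handling of thousands separators (for example "12,345") is an assumption about how the scraped score text might look.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper (not part of the original code): an in-memory stand-in
// for a database lookup table that maps a CPU model name to its score.
class CpuScoreTable {
    private final Map<String, Integer> scores = new LinkedHashMap<>();

    // Assumption: the scraped score text may contain thousands separators.
    // Returns an empty OptionalInt for text that is not a number (e.g. a header cell).
    static OptionalInt parseScore(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        String digits = text.replace(",", "").trim();
        if (digits.isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    // Stores one row; returns false (and stores nothing) if the row is unusable.
    boolean put(String cpuModel, String scoreText) {
        OptionalInt score = parseScore(scoreText);
        if (cpuModel == null || cpuModel.trim().isEmpty() || !score.isPresent()) {
            return false;
        }
        scores.put(cpuModel.trim(), score.getAsInt());
        return true;
    }

    // Looks up the score for an exact model name.
    OptionalInt lookup(String cpuModel) {
        Integer score = scores.get(cpuModel);
        return score == null ? OptionalInt.empty() : OptionalInt.of(score);
    }

    int size() {
        return scores.size();
    }
}
```

Each scraped row would be passed to `put(modelText, scoreText)`, and rows that do not carry a numeric score are simply skipped.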

Here is the code straight away. I only half understand it myself and still don't really get all of it...

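import java.io.IOException;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// `log` (an slf4j-style logger) and ZnComputScore are defined elsewhere in the project.
public void getComputScore(List<ZnComputScore> result) {
    log.info("Start: crawling computer performance score data");
    // 1. Create the HTTP client
    CloseableHttpClient httpClient = HttpClients.createDefault();
    CloseableHttpResponse response = null;
    // 2. Create the GET request, like typing the URL into the browser's address bar
    HttpGet request = new HttpGet("https://www.cpubenchmark.net/cpu_list.php");
    // Set the request header so the crawler looks like a browser
    request.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
    try {
        log.info("Start executing the HTTP request for the crawl.");
        // 3. Execute the GET request, like pressing Enter in the address bar
        response = httpClient.execute(request);
        log.info("Finished executing the HTTP request");
        // 4. Only process the response when the status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            // 5. Read the response body
            HttpEntity httpEntity = response.getEntity();
            String html = EntityUtils.toString(httpEntity, "utf-8");
            // 6. Parse the HTML with Jsoup; elements can be looked up by tag,
            //    id, or class, much like in JavaScript
            Document document = Jsoup.parse(html);
            Elements trs = document.getElementsByTag("tr");
            log.info("Start wrapping [{}] Element rows", trs.size());
            for (Element row : trs) {
                // Skip rows with fewer than two cells, and header rows made of <th> cells
                if (row.children().size() < 2 || !"td".equals(row.child(0).tagName())) {
                    continue;
                }
                ZnComputScore comput = new ZnComputScore();
                comput.setCpuText(row.child(0).text());   // first cell: CPU model
                comput.setCpuScore(row.child(1).text());  // second cell: score
                result.add(comput);
            }
        } else {
            // Any other status, e.g. 404 (page not found), should be handled as needed; omitted here
            System.out.println("Response status is not 200");
            System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
        }
    } catch (IOException e) {
        // ClientProtocolException is a subclass of IOException, so one catch covers both
        log.error("Failed to crawl computer score data", e);
    } finally {
        // 7. Release the response and the client
        HttpClientUtils.closeQuietly(response);
        HttpClientUtils.closeQuietly(httpClient);
        log.info("End: crawled [{}] computer score records", result.size());
    }
}

Maven dependencies:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
    <scope>compile</scope>
</dependency>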

httpclient is used to send the request and fetch the data.

jsoup is used to parse the HTML returned by the request.

With that, it is ready to use as is....

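For other pages, you can write the returned HTML out to a file, inspect it, and then parse it according to your own needs to get the data you want.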
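Writing the fetched HTML to a file for inspection could look roughly like the sketch below. The class name `HtmlDump`, the method `save`, and the target path are all hypothetical; the `html` argument stands for the string obtained from the response body.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of the original code): dumps a fetched page
// to disk so its structure can be inspected before writing the parsing logic.
class HtmlDump {
    // Writes the HTML as UTF-8, creating parent directories if needed and
    // overwriting any existing file. Returns the path that was written.
    static Path save(String html, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `HtmlDump.save(html, Paths.get("dump", "page.html"))` would write the page to a local file; opening it in a browser or editor shows which tags, ids, or classes to target with jsoup.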
