
A Java crawler for downloading paid HTML web templates

Published: 2025/3/20


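Preface

Some time ago we had a small web project that required learning Bootstrap. The templates we wrote ourselves, however, fell far short of other people's finished work, both in looks and in compatibility on mobile. Midway through we gave up on our own pages and started lifting other people's templates. Some classmates who did not know how to do that even paid to download them; in this day and age, programmers still paying for templates. Once the project was over, it occurred to me to try writing a program that downloads templates automatically. After a lot of effort and bug fixing, it finally worked.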

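Approach

The rough idea: take one page of the template as the input url and, starting from that link, traverse every related link and store it in a HashSet (a breadth-first traversal, BFS, driven by a queue). "Related" is decided by a string check on the main domain at the front of the link. Links that lead off-site are not processed, so the crawl cannot expand without limit. Along the way, the different kinds of url are sorted into separate sets.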
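
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class `BfsSketch`, the helper `linksOf` and the in-memory `site` map are hypothetical stand-ins for fetching a page and listing its links, so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class BfsSketch {
    // Hypothetical stand-in for "fetch the page and list its links";
    // a real crawler would download and parse the HTML here.
    static List<String> linksOf(Map<String, List<String>> site, String page) {
        return site.getOrDefault(page, List.of());
    }

    // Breadth-first traversal that only follows links containing the start host.
    static Set<String> crawl(Map<String, List<String>> site, String start, String host) {
        Set<String> seen = new LinkedHashSet<>(); // remembers discovery order
        Queue<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            for (String link : linksOf(site, page)) {
                // Off-site links are skipped so the crawl stays bounded;
                // seen.add returns false for links already discovered.
                if (link.contains(host) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "http://example.com/t/index.html",
            List.of("http://example.com/t/about.html", "http://other.org/x.html"),
            "http://example.com/t/about.html",
            List.of("http://example.com/t/index.html", "http://example.com/t/contact.html"));
        System.out.println(crawl(site, "http://example.com/t/index.html", "example.com"));
    }
}
```

Like the string check in the article, `contains(host)` is a loose test: a foreign link that merely mentions the host in its query string would also pass.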

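HTML page analysis: collect the html links. The html text also has to be checked for css that may be hidden inside it (possibly background images). Collect the js links, image addresses and css addresses, and be sure to store absolute addresses rather than relative ones. Some references point to a parent directory and need special handling.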
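
As an alternative to trimming "../" segments by hand, the standard `java.net.URI.resolve` method can turn a relative reference into an absolute address. The sketch below assumes the base is the URL of the page or stylesheet in which the reference was found; the class name `ResolveSketch` and the example.com addresses are made up for illustration.

```java
import java.net.URI;

class ResolveSketch {
    // Resolve a possibly relative reference against the document it appeared in.
    static String toAbsolute(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference.trim()).toString();
    }

    public static void main(String[] args) {
        String css = "http://example.com/demo/css/style.css";
        System.out.println(toAbsolute(css, "../images/bg.png")); // parent directory
        System.out.println(toAbsolute(css, "icons/logo.png"));   // same directory
        System.out.println(toAbsolute(css, "/static/app.js"));   // site root
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for references containing characters such as spaces, so a real crawler would still wrap the call in try/catch and skip the bad reference.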

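css pages: analyzed line by line, because css may hold background images and other logos.
js: downloaded and saved directly.
html: downloaded and saved.
image: downloaded and saved.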
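
For the css step, a regular expression is one possible replacement for the split-and-contains approach used in this project. The sketch below is an assumption-laden illustration (the class `CssUrlSketch` and its pattern are not part of the original code): it pulls every `url(...)` reference out of a piece of css text, including several on the same line, and skips inline `data:` URIs because they need no download.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlSketch {
    // url( ... ) with optional whitespace and optional single or double quotes.
    private static final Pattern URL_REF =
        Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

    static List<String> extract(String css) {
        List<String> refs = new ArrayList<>();
        Matcher m = URL_REF.matcher(css);
        while (m.find()) {
            String ref = m.group(1).trim();
            if (!ref.startsWith("data:")) { // embedded data needs no download
                refs.add(ref);
            }
        }
        return refs;
    }

    public static void main(String[] args) {
        String css = "body{background:url('../img/bg.png')} "
                   + ".logo{background-image: url(logo.gif)}";
        System.out.println(extract(css));
    }
}
```

The extracted references are still relative, so each one has to be made absolute against the stylesheet's own address before it is downloaded.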

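Points to note:

Every download and every other network operation must run inside try/catch; in the catch block, skip the failed step and carry on with the next one.

The download directory is set in the download class and can be changed there (default F:\download).

Add the jsoup jar to the project.

Some images are hidden in js files and css files, so both kinds of file need to be inspected. My version only analyzes css, not js.

Because of limited time and energy, the project is not polished. My regular-expression skills were lacking at the time, so most of the parsing uses string splitting or contains checks, and some cases are bound to be missed.

So far the code has only been tested, and found to work, on some of the templates at 17sucai (17素材之家). Other sites have not been tested.

I am only a beginner and the code is long-winded and low-level, so experts, please go easy on me.
The code follows: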

Code

Main entry class: getmoban

import java.io.IOException;
import java.util.Scanner;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class getmoban {
    public static void main(String[] args) throws IOException {
        ExecutorService ex = Executors.newFixedThreadPool(6);
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter a URL (not too large a site, or the download will never finish)");
        String url = sc.nextLine();
        new geturl(url); // sets the start url and the site prefix
        System.out.println(geturl.file);
        geturl.judel(); // breadth-first crawl that fills the four url sets

        submitAll(ex, geturl.htmlurlset); // html pages
        submitAll(ex, geturl.jsset);      // js files
        for (String name : geturl.cssset) { // css may reference background images
            try {
                ex.execute(new download(name));
                csssearch.searchimage(name);
            } catch (Exception e) {
                // skip this stylesheet and carry on
            }
        }
        submitAll(ex, geturl.imgset);     // images
        ex.shutdown(); // accept no new tasks; queued downloads still finish
    }

    // Queue one download task per url; a failure skips only that url.
    static void submitAll(ExecutorService ex, Set<String> urls) {
        for (String name : urls) {
            try {
                ex.execute(new download(name));
            } catch (Exception e) {
                // skip this url and carry on
            }
        }
    }
}

Link analysis: geturl

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class geturl {
    public static String url = "http://www.17sucai.com/preview/1/2014-11-28/jQuery用戶注冊表單驗證代碼/index.html";
    static String head = "http:"; // scheme including the colon, e.g. "https:"
    static String file = url;     // host plus directory of the start page, without the scheme

    static Set<String> htmlurlset = new HashSet<>(); // html
    static Set<String> jsset = new HashSet<>();      // js
    static Set<String> imgset = new HashSet<>();     // images
    static Set<String> cssset = new HashSet<>();     // css stylesheets
    static Queue<String> queue = new ArrayDeque<>();

    public geturl(String url) {
        geturl.url = url;
        // Derive the prefix from the url actually passed in, not from the default.
        file = url;
        if (url.contains("http")) {
            head = file.split("//")[0];
            file = file.split("//")[1];
        }
        int last = file.lastIndexOf("/");
        if (last > 0) {
            file = file.substring(0, last);
        }
    }

    // Host part of the start url, used to keep the crawl on the same site.
    static String host() {
        int slash = file.indexOf("/");
        return slash > 0 ? file.substring(0, slash) : file;
    }

    public static void judel() throws IOException {
        queue.add(url);
        htmlurlset.add(url);
        while (!queue.isEmpty()) { // the host check keeps the crawl from growing without limit
            String teamurl = queue.poll(); // remove and return the head of the queue
            System.out.println(teamurl);
            // some sites are small and short, so this check may misjudge them
            if (!teamurl.endsWith(".com") && teamurl.contains(host())) {
                analyze(teamurl);
            }
        }
    }

    public static void analyze(String URL) {
        try {
            Document doc = Jsoup.connect(URL).timeout(20000)
                    .header("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")
                    .ignoreContentType(true).get();
            Elements all = doc.select("[style]"); // elements that carry an inline style
            Elements js = doc.getElementsByTag("script");
            Elements html = doc.select("a[href]");
            Elements img = doc.select("img");
            Elements css = doc.select("link[href]");

            for (Element e : all) {
                String tex = e.attr("style");
                if (tex.contains("url")) { // background image hidden in inline css
                    String urladress = file;
                    String imgurl = tex.split("url")[1];
                    imgurl = imgurl.split("\\(")[1].split("\\)")[0];
                    if (imgurl.startsWith("'") || imgurl.startsWith("\"")) { // strip the surrounding quotes
                        imgurl = imgurl.substring(1, imgurl.length() - 1);
                    }
                    while (imgurl.startsWith("..")) { // climb one directory per "../"
                        imgurl = imgurl.substring(imgurl.indexOf("/") + 1);
                        urladress = urladress.substring(0, urladress.lastIndexOf("/"));
                    }
                    imgset.add(head + "//" + urladress + "/" + imgurl);
                }
            }
            for (Element htmlelement : html) {
                String a = htmlelement.absUrl("href").split("#")[0];
                // not seen yet and on the same site: queue it for traversal
                if (!a.equals("") && !htmlurlset.contains(a) && a.contains(host())) {
                    queue.add(a);
                    htmlurlset.add(a);
                }
            }
            for (Element jselement : js) { // js files
                String team = jselement.absUrl("src");
                if (!team.equals("")) {
                    jsset.add(team);
                }
            }
            for (Element csselement : css) {
                String team = csselement.absUrl("href"); // absolute path
                if (!team.equals("")) {
                    cssset.add(team);
                }
            }
            for (Element imageelement : img) {
                String team = imageelement.absUrl("src"); // absolute path
                if (!team.equals("")) {
                    imgset.add(team);
                }
            }
        } catch (Exception e) {
            // skip this page; the loop in judel() moves on to the next queued url
            System.out.println("Skipped " + URL);
        }
    }
}

CSS analysis (css may hide images): csssearch

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Set;

public class csssearch {
    public static void searchimage(String ur) throws IOException {
        if (ur.toLowerCase().contains("bootstrap")) {
            return; // bootstrap.css is filtered out; it has no images
        }
        Set<String> imgset = new HashSet<>();
        String http = "http:";
        String fileurl = ur;
        if (fileurl.startsWith("http")) {
            http = fileurl.split("//")[0]; // keeps https when the site uses it
            fileurl = fileurl.split("//")[1];
        }
        fileurl = fileurl.substring(0, fileurl.lastIndexOf("/")); // directory of the stylesheet

        URLConnection conn = new URL(ur).openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(5000);
        conn.connect();
        try (BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String tex;
            while ((tex = buf.readLine()) != null) {
                if (!tex.contains("url")) {
                    continue;
                }
                try {
                    String urladress = fileurl;
                    String imgurl = tex.split("url")[1];
                    imgurl = imgurl.split("\\(")[1].split("\\)")[0];
                    if (imgurl.startsWith("'") || imgurl.startsWith("\"")) { // strip the surrounding quotes
                        imgurl = imgurl.substring(1, imgurl.length() - 1);
                    }
                    while (imgurl.startsWith("..")) { // climb one directory per "../"
                        imgurl = imgurl.substring(imgurl.indexOf("/") + 1);
                        urladress = urladress.substring(0, urladress.lastIndexOf("/"));
                    }
                    imgset.add(http + "//" + urladress + "/" + imgurl);
                } catch (RuntimeException e) {
                    // the line mentions "url" but not in the url(...) form; skip it
                }
            }
        }
        for (String team : imgset) {
            try {
                new Thread(new download(team)).start();
                System.out.println("Started downloading " + team);
            } catch (Exception e) {
                System.out.println("Download failed: " + team);
            }
        }
    }
}

download (run on the thread pool)

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class download implements Runnable {
    public String ur;

    public download() {}

    public download(String ur) {
        this.ur = ur;
    }

    public static void download(String ur) throws IOException {
        String fileplace = ur;
        if (fileplace.contains("http")) {
            fileplace = fileplace.split("//")[1]; // drop the scheme; the rest becomes the local path
        }
        URLConnection conn = new URL(ur).openConnection();
        conn.setConnectTimeout(4000);
        conn.setReadTimeout(5000);
        conn.connect();
        InputStream in = conn.getInputStream();
        File file = new File("F:\\download\\" + fileplace); // change the download directory here
        if (!file.exists()) {
            file.getParentFile().mkdirs();
            file.createNewFile();
        }
        // try-with-resources closes both streams even if the copy fails
        try (BufferedInputStream buf = new BufferedInputStream(in);
             BufferedOutputStream bufout = new BufferedOutputStream(new FileOutputStream(file))) {
            byte[] b = new byte[1024];
            int n;
            while ((n = buf.read(b)) != -1) {
                bufout.write(b, 0, n);
            }
        }
    }

    @Override
    public void run() {
        try {
            download(ur);
            System.out.println(Thread.currentThread().getName() + " downloaded " + ur);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
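
The download step mirrors each url's host and path under the download directory. A rough sketch of the same idea with `java.nio.file` is shown below; it is not the project's code. The class `SaveSketch`, the `index.html` fallback for urls ending in "/" and the check that the target stays inside the root directory are all assumptions added for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveSketch {
    // Map a url to root/<host>/<path>, e.g. download/example.com/demo/css/style.css.
    static Path localPath(Path root, String url) {
        URI uri = URI.create(url);
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path += "index.html"; // assumed default name for directory urls
        }
        if (path.startsWith("/")) {
            path = path.substring(1);
        }
        Path target = root.resolve(uri.getHost()).resolve(path).normalize();
        if (!target.startsWith(root)) { // refuse paths that climb out of root via ".."
            throw new IllegalArgumentException("Unsafe path: " + url);
        }
        return target;
    }

    // Fetch the url and write it to its mirrored location (needs network access).
    static void save(Path root, String url) throws IOException {
        Path target = localPath(root, url);
        Files.createDirectories(target.getParent());
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) {
        System.out.println(localPath(Path.of("download"), "http://example.com/demo/css/style.css"));
    }
}
```

Unlike the class above, `openStream()` sets no connect or read timeout, so a stalled server would block the calling thread; the `URLConnection` timeouts used in the article are the safer choice for a crawler.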

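Target site and results

Also note that you have to enter the initial page first, by clicking the X at the top.

The program can only download most of the templates on 17sucai. If I have enough time later, I may maintain it so that it can download more templates!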

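Project GitHub address

If you are interested in back-end development, crawlers, data structures and algorithms, you are welcome to follow my personal WeChat official account: bigsai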
