Scraping Qidian novels with a crawler: a hands-on walkthrough | Scraping the Qidian novel site

Published: 2023/12/14


Original title: A hands-on crawler walkthrough | Scraping the Qidian novel site

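Plenty of people enjoy reading novels, programmers especially. Wuxia and science fiction are long-standing favorites, and xianxia (cultivation) novels such as 凡人修仙傳, 武動乾坤 and 斗破蒼穹 have been everywhere lately. Today's post shares a small Python script for scraping a novel.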

Goal

Scrape one xianxia novel, download its chapters, and save them locally as txt files. This example uses "大周仙吏".

Project setup

Software: PyCharm

Third-party libraries: requests, fake_useragent, lxml

Site: https://book.qidian.com

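Site analysis

Open the book's catalog page, https://book.qidian.com/info/1020580616#Catalog, in a browser.

To check whether the page is statically loaded, press Ctrl+U to view the page source, press Ctrl+F to open the search box, and type: 第一章 ("Chapter 1").

The text is found in the source, so the chapter list is statically loaded and can be read from the raw HTML.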
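The same check can be scripted. The sketch below is a minimal example, assuming the page is served as UTF-8 HTML. `contains_marker` and `fetch_source` are hypothetical helper names, the two sample snippets are hand-written rather than real Qidian markup, and the download uses the standard library's `urllib` instead of `requests` so that it runs without extra installs.

```python
from urllib.request import Request, urlopen


def contains_marker(html: str, marker: str = "第一章") -> bool:
    """Return True if the marker text appears in the raw page source."""
    return marker in html


def fetch_source(url: str, user_agent: str = "Mozilla/5.0") -> str:
    """Download the raw HTML, i.e. what Ctrl+U shows (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration with two hand-written snippets (not real Qidian markup).
static_page = '<ul class="cf"><li><a href="//example.invalid/1">第一章 开始</a></li></ul>'
dynamic_page = '<div id="app"></div><script src="/bundle.js"></script>'

print(contains_marker(static_page))   # True
print(contains_marker(dynamic_page))  # False
```

Against the live site, `contains_marker(fetch_source(url))` would perform the same test. A `False` result suggests the chapter list is filled in by JavaScript after the page loads, in which case the XPath approach used in this article would find nothing.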

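Anti-scraping analysis

Repeated requests from the same IP address risk getting that address blocked. This example uses fake_useragent to send a randomly generated User-Agent request header. Note that this changes only the browser identity the site sees, not the IP address.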
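For the header to vary, it has to be regenerated for each request. Below is a minimal sketch of that idea using a small hand-written pool in place of fake_useragent, which may be handy when that library's data cannot be loaded. The `USER_AGENTS` entries and the `random_headers` helper are illustrative assumptions, not values from the original article.

```python
import random

# Illustrative User-Agent strings; any realistic browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Build a fresh headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


# Call random_headers() once per request so the header actually varies.
for _ in range(3):
    print(random_headers()["User-Agent"][:40])
```

Passing the result as `headers=random_headers()` on every `requests.get` call gives each request its own header, whereas a header built once in `__init__` stays fixed for the whole run.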

Implementation

1. Import the third-party libraries and define a class that inherits from object, with an __init__ method and a main method (both take self).

    import requests
    from fake_useragent import UserAgent
    from lxml import etree


    class qidian(object):
        def __init__(self):
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            # Older fake_useragent releases accepted verify_ssl=False;
            # recent ones reject that argument, so the default constructor is used.
            ua = UserAgent()
            # pick a random User-Agent
            self.headers = {
                'User-Agent': ua.random
            }

        def main(self):
            pass


    if __name__ == '__main__':
        spider = qidian()
        spider.main()

2. Send the request and fetch the page.

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.content.decode('utf-8')
        return html
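A single failed request would stop the whole crawl. As a hedged sketch of one way to handle that, the helper below retries a fetch a few times with a pause in between. `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the fetcher is passed in as a callable so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing `delay` seconds between tries.

    `fetch` is any callable that returns the page text or raises on failure,
    for example the get_html method above.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < retries:
                sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Offline demonstration: a fake fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "https://example.invalid/page", delay=0)
print(result, len(calls))  # <html>ok</html> 3
```

In the spider this could be called as `fetch_with_retry(self.get_html, url)`. The pause between attempts also spaces out requests, which is gentler on the site than retrying immediately.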

3. Get the link and name of each chapter.

    import requests
    from lxml import etree
    from fake_useragent import UserAgent


    class qidian(object):
        def __init__(self):
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            ua = UserAgent()
            self.headers = {
                'User-Agent': ua.random
            }

        def get_html(self, url):
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            return html

        def parse_html(self, html):
            target = etree.HTML(html)
            links = target.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
            names = target.xpath('//ul[@class="cf"]/li/a/text()')  # chapter names
            for link, name in zip(links, names):
                print(name + '\t' + 'https:' + link)

        def main(self):
            url = self.url
            html = self.get_html(url)
            self.parse_html(html)


    if __name__ == '__main__':
        spider = qidian()
        spider.main()

Printed output: one line per chapter, with the chapter name and its URL separated by a tab.
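The code builds each address with `'https:' + link`, which works as long as every href is protocol-relative (starts with `//`). A slightly more forgiving alternative, sketched here with the standard library's `urljoin`, also handles root-relative and already-absolute hrefs. `absolute_url` is a hypothetical helper and the hrefs are made-up examples, not real chapter addresses.

```python
from urllib.parse import urljoin

CATALOG_URL = "https://book.qidian.com/info/1020580616"


def absolute_url(href, base=CATALOG_URL):
    """Resolve an href (protocol-relative, root-relative or absolute) against base."""
    return urljoin(base, href)


# The hrefs below are made-up examples, not real chapter addresses.
print(absolute_url("//example.invalid/chapter/1"))
# https://example.invalid/chapter/1
print(absolute_url("/info/other"))
# https://book.qidian.com/info/other
print(absolute_url("https://example.invalid/chapter/1"))
# https://example.invalid/chapter/1
```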

4. Follow each link and extract the chapter content.

    def parse_html(self, html):
        target = etree.HTML(html)
        links = target.xpath('//ul[@class="cf"]/li/a/@href')
        for link in links:
            host = 'https:' + link
            # fetch and parse the chapter page
            res = requests.get(host, headers=self.headers)
            c = res.content.decode('utf-8')
            page = etree.HTML(c)
            names = page.xpath('//span[@class="content-wrap"]/text()')
            results = page.xpath('//div[@class="read-content j_readContent"]/p/text()')
            for name in names:
                print(name)
            for result in results:
                print(result)

Printed output: each chapter title followed by that chapter's paragraphs (the output is long, so the original post showed only part of it).
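Text nodes returned by an XPath `text()` query often carry leading indentation or are blank. The sketch below shows one way to tidy them before printing or saving, under the assumption that paragraphs may start with full-width spaces (U+3000), as is common in Chinese typesetting. `clean_paragraphs` is a hypothetical helper and the sample list is invented.

```python
def clean_paragraphs(paragraphs):
    """Strip ASCII and full-width (U+3000) whitespace and drop empty entries."""
    cleaned = []
    for text in paragraphs:
        stripped = text.strip(" \t\r\n\u3000")
        if stripped:
            cleaned.append(stripped)
    return cleaned


# Invented sample resembling what a text() query might return.
sample = ["\u3000\u3000第一段。", "\n", "  第二段。 "]
print(clean_paragraphs(sample))  # ['第一段。', '第二段。']
```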

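5. Save each chapter locally as a txt file. This block goes inside the "for name in names" loop, so that name is the chapter title.

    with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
        for result in results:
            # print(result)
            f.write(result + '\n')

Result: opening the output directory shows one .txt file per chapter, named after the chapter title.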
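Chapter titles can contain characters that are not valid in file names, and a fixed drive path only works on one machine. The following is a minimal sketch of a more portable save step. `safe_filename` and `save_chapter` are hypothetical helpers, and the demonstration writes to a temporary directory rather than the path used in the article.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Replace characters that Windows forbids in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"


def save_chapter(directory, title, paragraphs):
    """Write one chapter to <directory>/<title>.txt as UTF-8, one paragraph per line."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{safe_filename(title)}.txt"
    path.write_text("\n".join(paragraphs) + "\n", encoding="utf-8")
    return path


# Demonstration in a temporary directory instead of a fixed drive path.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(tmp, "第一章 测试: 开始?", ["第一段。", "第二段。"])
    print(saved.name)  # 第一章 测试_ 开始_.txt
    print(saved.read_text(encoding="utf-8"))
```

Unlike opening the file in 'a' (append) mode, `write_text` overwrites the file, so running the script twice does not duplicate a chapter's text.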

Complete code

    import requests
    from lxml import etree
    from fake_useragent import UserAgent


    class qidian(object):
        def __init__(self):
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            # Older fake_useragent releases accepted verify_ssl=False;
            # recent ones reject that argument, so the default constructor is used.
            ua = UserAgent()
            self.headers = {
                'User-Agent': ua.random
            }

        def get_html(self, url):
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            return html

        def parse_html(self, html):
            target = etree.HTML(html)
            links = target.xpath('//ul[@class="cf"]/li/a/@href')
            for link in links:
                host = 'https:' + link
                # fetch and parse the chapter page
                res = requests.get(host, headers=self.headers)
                c = res.content.decode('utf-8')
                page = etree.HTML(c)
                names = page.xpath('//span[@class="content-wrap"]/text()')
                results = page.xpath('//div[@class="read-content j_readContent"]/p/text()')
                for name in names:
                    print(name)
                    # append the chapter's paragraphs to <chapter title>.txt
                    with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
                        for result in results:
                            # print(result)
                            f.write(result + '\n')

        def main(self):
            url = self.url
            html = self.get_html(url)
            self.parse_html(html)


    if __name__ == '__main__':
        spider = qidian()
        spider.main()
