Batch-downloading novels with a Python crawler
Earlier I practiced scraping the content of a single novel page, and after that, extracting an entire novel.
For reference: scraping a whole novel (an earlier post).
Following on from that, I wanted to try batch-scraping novels, so I went looking for the target page: the 17k.com women's channel listing of completed novels of 500,000-1,000,000 characters, at https://www.17k.com/all/book/3_5_0_3_3_0_1_0_1.html.
然后打開開發(fā)者工具,發(fā)現(xiàn)內(nèi)容也都在相應(yīng)體中,那提取數(shù)據(jù)就十分簡單了,
頁面的跳轉(zhuǎn)的地址也很容易提取:
一段簡單的代碼實現(xiàn)跳轉(zhuǎn)頁面地址的提取,提取出來的地址少了協(xié)議,列表推導(dǎo)式完成地址的拼接:
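A minimal standalone sketch of this step, excerpted from the full code shown later in this post (the XPath and the listing URL are the ones used there; the full code wraps the same logic in a class):

import requests
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}

# Fetch the listing page of the 17k.com women's channel.
html = requests.get("https://www.17k.com/all/book/3_5_0_3_3_0_1_0_1.html",
                    headers=headers).content.decode()
e = etree.HTML(html)
# The per-book jump links sit in the third column of the listing table.
mid_link = e.xpath('//td[@class="td3"]/span/a/@href')
# The hrefs are missing the protocol, so prepend "https:".
mid_link = ["https:" + i for i in mid_link]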
Following a jump link, however, does not lead straight to the detail page; it lands on an intermediate "click to read" page instead.
So the intermediate jump address has to be extracted as well. It is just as easy to get, the addresses are again incomplete, and a list comprehension completes them:
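Continuing the sketch above (it reuses headers and mid_link from it), each jump page is requested and the "click to read" link is completed with the site prefix, mirroring the full code below:

mid_list = []
for link in mid_link:
    mid_html = requests.get(url=link, headers=headers).content.decode()
    e = etree.HTML(mid_html)
    # The "click to read" button lives in a <dt class="read"> element.
    mid_detail_link = e.xpath('//dt[@class="read"]/a/@href')
    # The href is relative, so prepend the site address.
    mid_detail_link = ["https://www.17k.com/" + i for i in mid_detail_link]
    mid_list.append({"mid_detail_link": mid_detail_link})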
After that request we finally reach the chapter-list detail page.
Inspecting it shows that extraction is simple here too: XPath pulls out the chapter links directly, along with the novel's title, and then each chapter's content page is requested:
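A sketch of this step, again excerpted from the full code below and continuing from mid_list in the previous snippet:

for item in mid_list:
    r = requests.get(url=item["mid_detail_link"][0], headers=headers).content.decode()
    e = etree.HTML(r)
    # Chapter links are listed under <dl class="Volume">; the title sits in the page header.
    content_page = e.xpath('//dl[@class="Volume"]/dd/a/@href')
    book_name = e.xpath('//div[@class="Main List"]/h1/text()')
    content_page = ["https://www.17k.com" + i for i in content_page]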
The text of each chapter page is likewise extracted with XPath, together with the chapter title. Two of the novels turned out to have no chapter links at all, so an if check skips them straight away, otherwise the following request would fail; saving to disk happens in this same step. The code is as follows:
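This continues the loop body from the previous sketch (it also assumes os has been imported); the full code below does the same thing inside its fin_detail method:

    if len(content_page) == 0:
        # Two of the books have no chapter links at all; skip them.
        continue
    if not os.path.exists(book_name[0]):
        os.mkdir(book_name[0])  # one folder per novel
    for content_url in content_page:
        content_html = requests.get(url=content_url, headers=headers).content.decode()
        e_content = etree.HTML(content_html)
        # Chapter title and body text, both via XPath.
        chapter = e_content.xpath('//div[@class="readAreaBox content"]/h1/text()')[0].replace('\t', '')
        text = e_content.xpath('//div[@class="p"]/p/text()')
        with open(book_name[0] + '/' + chapter + '.txt', 'w', encoding="utf-8") as f:
            f.write(chapter)
            f.write('\r\n')
            for i in text:
                f.write(i)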
With that the spider is complete. The full code:
'''
Scrape completed novels of 500,000-1,000,000 characters
from the 17k.com women's channel.
'''
import os
import requests
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent()


class Nover_Women():
    def __init__(self):
        self.start_url = "https://www.17k.com/all/book/3_5_0_3_3_0_1_0_1.html"
        self.headers = {'User-Agent': ua.random}

    def get_html(self, url):
        # Fetch the listing page.
        html = requests.get(url, headers=self.headers).content.decode()
        return html

    def mid_page(self, html):
        # Extract the per-book jump links from the listing table and add the protocol.
        e = etree.HTML(html)
        mid_link = e.xpath('//td[@class="td3"]/span/a/@href')
        mid_link = ["https:" + i for i in mid_link]
        return mid_link

    def mid_detail_page(self, mid_link):
        # Each jump link leads to a "click to read" page; extract the real detail link from it.
        mid_list = []
        for link in mid_link:
            mid_data = {}
            mid_html = requests.get(url=link, headers=self.headers).content.decode()
            e = etree.HTML(mid_html)
            mid_detail_link = e.xpath('//dt[@class="read"]/a/@href')
            mid_detail_link = ["https://www.17k.com/" + i for i in mid_detail_link]
            mid_data["mid_detail_link"] = mid_detail_link
            mid_list.append(mid_data)
        return mid_list

    def fin_detail(self, mid_list):
        # Visit each book's chapter list, then download and save every chapter.
        for url in mid_list:
            content = {}
            r = requests.get(url=url["mid_detail_link"][0], headers=self.headers).content.decode()
            e = etree.HTML(r)
            content_page = e.xpath('//dl[@class="Volume"]/dd/a/@href')
            if len(content_page) == 0:
                # A couple of books have no chapter links; skip them.
                continue
            book_name = e.xpath('//div[@class="Main List"]/h1/text()')
            content["name"] = book_name
            content_page = ["https://www.17k.com" + i for i in content_page]
            content["link"] = content_page
            if not os.path.exists(content["name"][0]):
                os.mkdir(content["name"][0])
            for content_url in content["link"]:
                content_html = requests.get(url=content_url, headers=self.headers).content.decode()
                e_content = etree.HTML(content_html)
                chapter = e_content.xpath('//div[@class="readAreaBox content"]/h1/text()')
                chapter = chapter[0].replace('\t', '')
                text = e_content.xpath('//div[@class="p"]/p/text()')
                with open(content["name"][0] + '/' + chapter + '.txt', 'w', encoding="utf-8") as f:
                    print("Writing: " + content["name"][0] + ':' + chapter)
                    f.write(chapter)
                    f.write('\r\n')
                    for i in text:
                        f.write(i)

    def run(self):
        url = self.start_url
        html = self.get_html(url)
        mid_link = self.mid_page(html)
        mid_list = self.mid_detail_page(mid_link)
        self.fin_detail(mid_list)


if __name__ == '__main__':
    Novel_Spider = Nover_Women()
    Novel_Spider.run()
The spider downloaded quite a lot, so I won't show every result here. The code is for reference only.