Python Learning Path: A Scraper for the Four Great Classical Novels
Continuing today's study: scraping the Four Great Classical Novels from the static site http://www.purepen.com/index.html
目標(biāo):
其他
Because I was debugging while writing the code, a full re-run is slow (each novel has 100-120 chapters), so I added a filename check: chapters that have already been scraped are not scraped again.
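The skip check can be sketched as below; `os.path.exists` is a simpler alternative to the try/except around `open()` used in the full script (`already_scraped` is a hypothetical helper name, not from the original):

```python
import os

# Sketch of the skip-if-already-scraped check: a chapter is skipped
# when its output file is already on disk.
def already_scraped(path):
    return os.path.exists(path)

print(already_scraped('definitely-missing-file-xyz.txt'))  # False
```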
Note: you need to manually create the 4 same-named directories yourself (I'll come back and update this later).
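The manual directory setup could also be automated; a minimal sketch, assuming the `./Text/<code>` layout used in the script below:

```python
import os

# Create ./Text/<code> for each novel; exist_ok=True makes this
# idempotent, so re-running the scraper is safe.
for code in ('hlm', 'shz', 'sgyy', 'xyj'):
    os.makedirs(os.path.join('Text', code), exist_ok=True)
```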
Problems encountered:
When parsing the HTML, `text()` failed to return the full content once it hit special characters, as shown in the figure.
This was finally solved by changing the parser's character set:
```python
html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
```

Complete code:

```python
import requests
from lxml import etree

# Request headers; purepen.com serves static GB-encoded pages.
head = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36",
    "Accept": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
}
url = "http://www.purepen.com/hlm/index.htm"


def main(bookName):
    print(bookName + " 原本:")
    # Fetch the book's index page and cache it to disk
    res = requests.get(url.replace('hlm', bookName), headers=head)
    with open("html.txt", "wb") as f:
        f.write(res.content)
    # Parse with gb18030 so special characters are not dropped
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
    # Chapter titles, and chapter links via the @href attribute
    title_list = html.xpath('//table//a/text()')
    link_list = html.xpath('//table//a/@href')
    # Skip chapters whose files already exist
    i = 1
    for _title in title_list:
        _file = './Text/' + bookName + '/' + str(i) + '.' + str(_title) + '.txt'
        try:
            # Already scraped: skip it
            f = open(_file)
            print('File: ' + _file + ' exists')
            f.close()
        except FileNotFoundError:
            # Not scraped yet: fetch the chapter and save it
            print('File: ' + _file + ' not found')
            _url = "http://www.purepen.com/" + bookName + "/" + link_list[i - 1]
            _content = getContent(_url)
            fileSave(_content, _file)
        i += 1


def getContent(_url):
    # Fetch one chapter page and extract its body text
    res = requests.get(_url, headers=head)
    with open("html.txt", "wb") as f:
        f.write(res.content)
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
    content = html.xpath('//center//font/text()')
    return content


def fileSave(_content, _path):
    # Strip raw newlines, turn full-width indentation into line breaks, save
    with open(_path, 'w') as f:
        f.write(_content[0].replace('\n', '').replace('\u3000\u3000', '\n').lstrip())
    return True


if __name__ == '__main__':
    main('shz')
```

where bookName is one of:
- Dream of the Red Chamber (紅樓夢): hlm
- Water Margin (水滸傳): shz
- Romance of the Three Kingdoms (三國演義): sgyy
- Journey to the West (西游記): xyj
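The gb18030 fix works because gb18030 is a superset of gb2312/gbk that covers all of Unicode. A minimal stdlib sketch, using an assumed example character (U+3400, which has no gb2312 mapping), shows why the narrower codec loses content:

```python
# '㐀' (U+3400) encodes in gb18030 but not in gb2312, so decoding
# its bytes with the narrower codec raises UnicodeDecodeError.
raw = '词曰：㐀'.encode('gb18030')
try:
    raw.decode('gb2312')
    narrow_ok = True
except UnicodeDecodeError:
    narrow_ok = False
print(narrow_ok)              # False: gb2312 cannot decode it
print(raw.decode('gb18030'))  # 词曰：㐀
```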
Results
Summary