Python Learning Path: A Scraper for the Four Great Classical Novels
Continuing today's study: scraping the Four Great Classical Novels from the static site http://www.purepen.com/index.html
目標(biāo):
其他
Because I was debugging while writing the code, a full re-run is slow (each novel has 100-120 chapters), so I added a filename check: chapters that have already been scraped are not scraped again.
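The skip check can be sketched as below; `os.path.exists` is a simpler alternative to the try/except around `open()` used in the full script (`already_scraped` is a hypothetical helper name, not from the original):

```python
import os

# Sketch of the skip-if-already-scraped check: a chapter is skipped
# when its output file is already on disk.
def already_scraped(path):
    return os.path.exists(path)

print(already_scraped('definitely-missing-file-xyz.txt'))  # False
```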
Note: you need to manually create the 4 same-named directories yourself (I'll come back and update this later).
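The manual directory setup could also be automated; a minimal sketch, assuming the `./Text/<code>` layout used in the script below:

```python
import os

# Create ./Text/<code> for each novel; exist_ok=True makes this
# idempotent, so re-running the scraper is safe.
for code in ('hlm', 'shz', 'sgyy', 'xyj'):
    os.makedirs(os.path.join('Text', code), exist_ok=True)
```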
Problems encountered:
When parsing the HTML, `text()` failed to return the full content once it hit special characters, as shown in the figure.
This was finally solved by changing the parser's character set:
```python
html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
```

Complete code:

```python
import requests
from lxml import etree

# Request headers; purepen.com serves static GB-encoded pages.
head = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36",
    "Accept": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
}
url = "http://www.purepen.com/hlm/index.htm"


def main(bookName):
    print(bookName + " 原本:")
    # Fetch the book's index page and cache it to disk
    res = requests.get(url.replace('hlm', bookName), headers=head)
    with open("html.txt", "wb") as f:
        f.write(res.content)
    # Parse with gb18030 so special characters are not dropped
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
    # Chapter titles, and chapter links via the @href attribute
    title_list = html.xpath('//table//a/text()')
    link_list = html.xpath('//table//a/@href')
    # Skip chapters whose files already exist
    i = 1
    for _title in title_list:
        _file = './Text/' + bookName + '/' + str(i) + '.' + str(_title) + '.txt'
        try:
            # Already scraped: skip it
            f = open(_file)
            print('File: ' + _file + ' exists')
            f.close()
        except FileNotFoundError:
            # Not scraped yet: fetch the chapter and save it
            print('File: ' + _file + ' not found')
            _url = "http://www.purepen.com/" + bookName + "/" + link_list[i - 1]
            _content = getContent(_url)
            fileSave(_content, _file)
        i += 1


def getContent(_url):
    # Fetch one chapter page and extract its body text
    res = requests.get(_url, headers=head)
    with open("html.txt", "wb") as f:
        f.write(res.content)
    html = etree.parse('html.txt', etree.HTMLParser(encoding='gb18030'))
    content = html.xpath('//center//font/text()')
    return content


def fileSave(_content, _path):
    # Strip raw newlines, turn full-width indentation into line breaks, save
    with open(_path, 'w') as f:
        f.write(_content[0].replace('\n', '').replace('\u3000\u3000', '\n').lstrip())
    return True


if __name__ == '__main__':
    main('shz')
```

where bookName is one of:
- Dream of the Red Chamber (紅樓夢): hlm
- Water Margin (水滸傳): shz
- Romance of the Three Kingdoms (三國演義): sgyy
- Journey to the West (西游記): xyj
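The gb18030 fix works because gb18030 is a superset of gb2312/gbk that covers all of Unicode. A minimal stdlib sketch, using an assumed example character (U+3400, which has no gb2312 mapping), shows why the narrower codec loses content:

```python
# '㐀' (U+3400) encodes in gb18030 but not in gb2312, so decoding
# its bytes with the narrower codec raises UnicodeDecodeError.
raw = '词曰：㐀'.encode('gb18030')
try:
    raw.decode('gb2312')
    narrow_ok = True
except UnicodeDecodeError:
    narrow_ok = False
print(narrow_ok)              # False: gb2312 cannot decode it
print(raw.decode('gb18030'))  # 词曰：㐀
```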
Results
Summary