Python Web Scraping: Completed Novels on Qidian (起点中文网)

Published: 2023/12/14

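Task

Scrape the first 5 pages (the page count can be changed) of completed novels on Qidian.

Save each novel's title, author, link, and synopsis to an Excel spreadsheet.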

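Analysis

Inspecting the page in the browser shows that the browser sends a GET request and that the response type is text/html. The page source can therefore be fetched with the get() function of the requests module: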

import requests

page_num = 1  # page to fetch
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
param = {'page': page_num}
url = 'https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&'
response = requests.get(url=url, params=param, headers=headers)
page_data = response.text
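To make the role of the page parameter concrete, here is a minimal standard-library sketch of how a page number ends up in the query string. It only illustrates the idea: requests builds the final URL itself when params is passed, and the exact string it produces may differ. The names BASE_URL and build_page_url are hypothetical and do not appear in the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same base URL as in the post; the name BASE_URL is hypothetical.
BASE_URL = ('https://www.qidian.com/finish?action=hidden&orderId=&style=1'
            '&pageSize=20&siteid=1&pubflag=0&hiddenField=2&')


def build_page_url(base_url: str, page_num: int) -> str:
    """Return base_url with a page=<page_num> query parameter appended."""
    parts = urlsplit(base_url)
    # keep_blank_values=True preserves empty parameters such as orderId=
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(('page', str(page_num)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == '__main__':
    # Print the URLs for the first three pages (no network access needed).
    for n in range(1, 4):
        print(build_page_url(BASE_URL, n))
```

Changing the range here corresponds to changing the number of pages scraped in the task above.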

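(I have only just started with web scraping, and I ran into a problem here: after getting page_data and printing it, the page source appeared to be missing the key content for this task, i.e. none of the novel data seemed to be in it. Yet after saving it to an HTML file, opening that file, creating a BeautifulSoup object from it, and printing that object, the novel data showed up. I do not fully understand why; if any reader can explain the mechanism, I would appreciate it.)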

from bs4 import BeautifulSoup

with open('qidian.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')
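The post does not establish what caused the puzzle described above. One way to narrow it down might be to query the string directly instead of reading printed output, since a very long page is hard to inspect by eye in a console. The sketch below counts occurrences of the class name book-mid-info (the one used in the complete code); the function name summarize_html and the sample string are hypothetical.

```python
def summarize_html(html: str, marker: str = 'book-mid-info') -> dict:
    """Report the page length and how often marker occurs, without printing the page."""
    return {
        'length': len(html),
        'marker_count': html.count(marker),
    }


if __name__ == '__main__':
    # Hypothetical stand-in for response.text; in the crawler this would be page_data.
    sample = ('<div class="book-mid-info">A</div>'
              '<div class="book-mid-info">B</div>')
    print(summarize_html(sample))
```

If marker_count is greater than zero for page_data, the novel data is already present in the response text, and the difference seen when printing has some other cause.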

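Once the BeautifulSoup object soup is available, regular expressions can be used to extract the target content. Each extracted value is appended to a list, and at the end the list is written to an Excel sheet so that the data is stored persistently.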
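The extraction step can be sketched on a single item as follows. The three patterns are copied from the complete code; SAMPLE_ITEM is a hypothetical fragment shaped to fit them, not real markup from the site, and extract_record is a hypothetical helper. The sketch also assumes that links are protocol-relative (they start with //), so it prepends https: to make them absolute.

```python
import re

# Patterns copied from the post.
findLink = re.compile(r'<a data-bid=.*? data-eid=.*? href="(.*?)"', re.S)
findName = re.compile(r'<a data-bid=.*? data-eid=.*? href=.*? target=.*?>(.*)</a></h4>')
findIntroduce = re.compile(r'<p class="intro">(.*?)</p>', re.S)

# Hypothetical fragment, shaped to fit the patterns; not real site markup.
SAMPLE_ITEM = """<div class="book-mid-info">
<h4><a data-bid="1" data-eid="x" href="//book.example.com/info/1" target="_blank">Sample Title</a></h4>
<p class="intro">
    A short synopsis.
</p>
</div>"""


def extract_record(item_html):
    """Return [title, link, synopsis] for one item, or None if a field is missing."""
    names = findName.findall(item_html)
    links = findLink.findall(item_html)
    # Drop whitespace-only matches and trim the rest.
    intros = [x.strip() for x in findIntroduce.findall(item_html) if x.strip()]
    if not (names and links and intros):
        return None  # avoids an IndexError when the markup does not match
    # Assumption: hrefs are protocol-relative ("//host/path"), so add the scheme.
    return [names[0], 'https:' + links[0], intros[0]]


if __name__ == '__main__':
    print(extract_record(SAMPLE_ITEM))
```

Returning None instead of indexing with [0] directly keeps one malformed item from stopping the whole run; whether to skip or log such items is a design choice left open here.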

Complete code

# -*- coding: utf-8 -*-
# @Time : 2020/12/27 11:17
# @author: 農夫三犭
# @File : main.py
# @Software: PyCharm
import re

import requests
import xlwt
from bs4 import BeautifulSoup

# Novel link (the part to capture goes inside parentheses)
findLink = re.compile(r'<a data-bid=.*? data-eid=.*? href="(.*?)"', re.S)
# Novel title
findName = re.compile(r'<a data-bid=.*? data-eid=.*? href=.*? target=.*?>(.*)</a></h4>')
# Novel author
findAuthor = re.compile(r'a class="name" data-eid=.*? href=.*? target="_blank">(.*?)</a><em>')
# Novel synopsis
findIntroduce = re.compile(r'<p class="intro">(.*?)</p>', re.S)

if __name__ == '__main__':
    savepath = "起點小說完結.xls"
    datalist = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&'
    for page_num in range(1, 6):  # first 5 pages
        page_num = str(page_num)
        param = {'page': page_num}
        response = requests.get(url=url, params=param, headers=headers)
        page_data = response.text
        # Save the page, then parse it back from the file
        with open('./qidian.html', 'w', encoding='utf-8') as fp:
            fp.write(page_data)
        with open('qidian.html', 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file, 'lxml')
        for item in soup.find_all('div', class_='book-mid-info'):
            data = []
            item = str(item)  # must be converted to a string before applying the regexes
            # Novel title
            name = re.findall(findName, item)[0]
            data.append(name)
            # Novel author
            author = re.findall(findAuthor, item)[0]
            data.append(author)
            # Novel link; assumes the href is protocol-relative ("//...")
            link = re.findall(findLink, item)[0]
            link = 'https:' + link
            data.append(link)
            # Novel synopsis: drop whitespace-only matches and trim the rest
            introduce = re.findall(findIntroduce, item)
            introduce = [x.strip() for x in introduce if x.strip() != ''][0]
            data.append(introduce)
            datalist.append(data)

    book = xlwt.Workbook(encoding="utf-8")  # create the workbook object
    sheet = book.add_sheet('起點小說完本', cell_overwrite_ok=True)  # create the worksheet
    column = ("小說名字", "小說作者", "小說鏈接", "小說簡介")  # title, author, link, synopsis
    for i in range(0, 4):
        sheet.write(0, i, column[i])  # header row
    for i in range(0, len(datalist)):
        print(f"第{i + 1}條寫入成功")  # "record i + 1 written"
        datas = datalist[i]
        for j in range(0, 4):
            sheet.write(i + 1, j, datas[j])  # data rows
    book.save(savepath)
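The row layout used when writing the sheet (one header row, then one row per novel) can also be shown with the standard library alone. The sketch below deliberately swaps in the csv module for xlwt so that it runs without third-party packages; it is not the method used in the post. COLUMNS, write_rows, and the sample row are hypothetical.

```python
import csv
import io

# English column names chosen for this sketch; the post uses Chinese headers.
COLUMNS = ("Title", "Author", "Link", "Synopsis")


def write_rows(datalist, fileobj):
    """Write a header row followed by one row per record to fileobj."""
    writer = csv.writer(fileobj)
    writer.writerow(COLUMNS)    # row 0: column names
    writer.writerows(datalist)  # rows 1..n: one record each


if __name__ == '__main__':
    # Hypothetical record in the same order as the crawler's data list.
    sample = [["Sample Title", "Some Author",
               "https://book.example.com/info/1", "A synopsis, with a comma"]]
    buffer = io.StringIO(newline='')
    write_rows(sample, buffer)
    print(buffer.getvalue())
    # For a real file, open it with newline='' as the csv module recommends:
    # with open('novels.csv', 'w', newline='', encoding='utf-8-sig') as fp:
    #     write_rows(sample, fp)
```

A CSV file opens in Excel as well, but it carries no sheet name or cell formatting, so xlwt remains the closer fit when an actual .xls workbook is required.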

(I have only just started learning web scraping, so corrections and suggestions are welcome.)
