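Web scraper: Qidian (起点中文网)
1. Goal:
Practice scraping the 24-hour bestseller ranking on Qidian (https://www.qidian.com/rank/hotsales). For each novel, collect the title, author, genre, status, plot synopsis, latest updated chapter, and latest update time, then save everything to a CSV file.
2. Code implementation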
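One step of the script in this section standardizes the update time. As a minimal standalone sketch of that step, assuming the page shows timestamps such as `2021-02-05 18:30` (an assumption; the real format on the site may differ), a hypothetical helper `normalize_time` could parse the scraped string and reformat it. The helper name and both format parameters are illustrative and not part of the original script.

```python
from datetime import datetime


def normalize_time(raw, fmt_in="%Y-%m-%d %H:%M", fmt_out="%Y/%m/%d %H:%M"):
    """Reformat a scraped timestamp string.

    fmt_in is an assumed input format. If the text does not match it,
    the stripped input is returned unchanged so no data is lost.
    """
    text = raw.strip()
    try:
        return datetime.strptime(text, fmt_in).strftime(fmt_out)
    except ValueError:
        return text
```

Returning the raw text on a parse failure keeps the row usable even if the site changes how it displays times.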
import time

import pandas as pd
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
}

result = {
    'title': [],     # novel title
    'author': [],    # author
    'type': [],      # genre
    'progress': [],  # status: completed or ongoing
    'intro': [],     # plot synopsis
    'now': [],       # latest updated chapter
    'time': [],      # latest update time
}

for i in range(1, 6):  # pages 1 to 5 of the ranking
    url = 'https://www.qidian.com/rank/hotsales?style=1&page={}'.format(i)
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each <li> in the book list holds one novel.
    li_list = tree.xpath('//div[@class = "book-img-text"]/ul/li')
    for li in li_list:
        title = li.xpath('./div[2]/h4/a/text()')[0].replace(' ', '').replace('\n', '')
        author = li.xpath('./div[2]/p[1]/a[1]/text()')[0].replace(' ', '').replace('\n', '')
        Type = li.xpath('./div[2]/p[1]/a[2]/text()')[0].replace(' ', '').replace('\n', '')
        progress = li.xpath('./div[2]/p[1]/span/text()')[0].replace(' ', '').replace('\n', '')
        intro = li.xpath('./div[2]/p[2]/text()')[0].replace(' ', '').replace('\n', '').replace('\t', '').replace('\r', '')
        # Drop the "最新更新" ("latest update") label that precedes the chapter name.
        now = li.xpath('./div[2]/p[3]/a/text()')[0].replace(' ', '').replace('\n', '').replace('最新更新', '')
        Time = li.xpath('./div[2]/p[3]/span/text()')[0].replace('\n', '').strip()
        # Standardize the scraped time. The original line formatted the current
        # clock time and never used Time. Assumes the page shows a date string
        # that pandas can parse; otherwise the raw text is kept.
        parsed = pd.to_datetime(Time, errors='coerce')
        ctime = Time if pd.isna(parsed) else parsed.strftime('%Y/%m/%d %H:%M')
        print(title)
        result['title'].append(title)
        result['author'].append(author)
        result['type'].append(Type)
        result['progress'].append(progress)
        result['intro'].append(intro)
        result['now'].append(now)
        result['time'].append(ctime)
    time.sleep(0.5)  # pause between page requests

# Write all collected rows once the loop has finished.
df = pd.DataFrame(result)
df.to_csv('qidian.csv', encoding='utf-8')

3. Results
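The output file can be checked by reading it back. The following is a minimal sketch using only the standard library; `summarize_csv` is a hypothetical helper name, and it assumes the file was written as UTF-8 with a header row, as the script does.

```python
import csv


def summarize_csv(path, encoding="utf-8"):
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])        # first row holds the column names
        n_rows = sum(1 for _ in reader)  # every remaining row is one novel
    return header, n_rows
```

Calling `summarize_csv('qidian.csv')` after the script has run should list an unnamed first column (pandas writes the row index by default) followed by the seven field names. The row count depends on how many books each of the five pages lists.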