
A Brief Look at Scraping the "工薪一族" Board of the Tianya Community (天涯社区)

Published: 2024/3/13


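1. Decide on the data structure

First, settle one question: what do we need to store?

Below is the data structure my final code produces:

    {
        "time": "2022-08-04 10:25:07",  // time the crawl started
        "pages": 3,                     // number of pages crawled
        "posts": [                      // outer list, one entry per list page
            {
                "page": 1,              // which page the posts below came from
                "posts": [              // the posts found on that page
                    {
                        "title": "歷史學習記錄",  // title
                        "post_time": "2022-08-04 03:37:49",  // time the post was published
                        "author_id": "潘妮sun",  // author
                        "url": "http://bbs.tianya.cn/post-170-917565-1.shtml",  // post link
                        "author_url": "http://www.tianya.cn/112795571",  // author link
                        "read_num": "8",  // read count
                        "reply_num": "4",  // reply count
                        "content": "黃帝和炎帝其實并不是皇帝，而是古書記載中黃河流域遠古..."  // post body (truncated here because the full text is long)
                    },
                    ......
                ]
            }
        ]
    }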

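So the things we need to store are:

Crawl time and page count
Post title and link
Post time
Post author and author link
Read count and reply count
Post content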
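As a rough illustration of that structure, here is a minimal sketch that builds it with plain dictionaries. The helper names make_post and make_result are hypothetical and do not appear in the original code; the sample values are taken from the structure shown above.

```python
import json
import time


def make_post(title, post_time, author_id, url, author_url,
              read_num, reply_num, content):
    """Build one post record with the fields listed above (hypothetical helper)."""
    return {
        "title": title,
        "post_time": post_time,
        "author_id": author_id,
        "url": url,
        "author_url": author_url,
        "read_num": read_num,
        "reply_num": reply_num,
        "content": content,
    }


def make_result(pages_of_posts):
    """Wrap a list of per-page post lists in the top-level structure."""
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
        "pages": len(pages_of_posts),
        "posts": [
            {"page": number, "posts": posts}
            for number, posts in enumerate(pages_of_posts, start=1)
        ],
    }


sample = make_post(
    "歷史學習記錄", "2022-08-04 03:37:49", "潘妮sun",
    "http://bbs.tianya.cn/post-170-917565-1.shtml",
    "http://www.tianya.cn/112795571", "8", "4", "...",
)
result = make_result([[sample]])
print(json.dumps(result, ensure_ascii=False, indent=4))
```

Keeping the counts as strings mirrors the sample data; converting them to integers would be a separate design choice.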

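2. Page analysis

2.1 Analyzing the list page

Open the target page: http://bbs.tianya.cn/list.jsp?item=170

Press F12 to open the developer tools and inspect the page structure.

The main table consists of 9 tbody elements. The first holds the table header; each of the other eight contains 10 posts, 80 in total.

Each of those tbody elements contains 10 tr rows, and each row records the post title and link, the author and author link, the click count, the reply count, and the time of the last reply.

At the bottom of each page there is a link to the next page, much like the next pointer in a linked list.

Note that on the first page the "next page" button is the second link, while on every other page it is the third.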
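Because the button's position changes between pages, one alternative is to pick the link by its text instead of its index. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, the pager markup and its href values are made up, and find_next_url is a hypothetical helper. With lxml, the same idea could be expressed as an XPath such as //a[text()="下一頁"]/@href.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://bbs.tianya.cn"


def find_next_url(links_markup, label="下一頁"):
    """Return the absolute URL of the link whose text equals label, or ''."""
    root = ET.fromstring(links_markup)
    for a in root.iter("a"):
        if (a.text or "").strip() == label:
            return urljoin(BASE, a.get("href", ""))
    return ""


# Hypothetical, simplified pager markup: the next-page link is the second
# link on the first page and the third link on a later page.
first_page = (
    '<div><a href="/list-170-1.shtml">首頁</a>'
    '<a href="/list.jsp?item=170&amp;nextid=2">下一頁</a></div>'
)
later_page = (
    '<div><a href="/list-170-1.shtml">首頁</a>'
    '<a href="/list.jsp?item=170&amp;nextid=1">上一頁</a>'
    '<a href="/list.jsp?item=170&amp;nextid=3">下一頁</a></div>'
)

print(find_next_url(first_page))
print(find_next_url(later_page))
```

Returning an empty string when no matching link exists also gives the caller a simple way to detect the last page.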

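2.2 Analyzing the post page

Open any post, for example http://bbs.tianya.cn/post-170-878768-1.shtml

Press F12 to open the developer tools and inspect the page structure.

The head tag of the HTML contains the post title (why this matters is explained later).

The post time appears in two forms.

In one, it sits in its own span inside a div, stored as plain text.

In the other, it is stored in a single block together with the click and reply counts.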
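Whichever layout a post uses, the timestamp itself can be pulled out of the raw text with a regular expression. The following is a small sketch under assumptions: the two sample strings only imitate the layouts described above and are not copied from the site, extract_post_time is a hypothetical helper, and the pattern is a slightly stricter variant of the one used later in this article.

```python
import re

# Matches timestamps of the form YYYY-MM-DD HH:MM:SS.
TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def extract_post_time(raw_text):
    """Return the first 'YYYY-MM-DD HH:MM:SS' timestamp in raw_text, or ''."""
    match = TIME_PATTERN.search(raw_text)
    return match.group(0) if match else ""


# Hypothetical raw strings imitating the two layouts.
standalone = "時間：2022-08-04 03:37:49"
combined = "時間：2022-08-04 03:37:49 點擊：8 回復：4"

print(extract_post_time(standalone))
print(extract_post_time(combined))
```

Returning an empty string instead of indexing into the result of re.findall avoids an IndexError when a page contains no timestamp at all.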

The post body is stored in the "bbs-content" div, with paragraphs separated by <br>.
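To show what splitting on <br> looks like in practice, here is a minimal sketch that uses the standard library's html.parser instead of lxml. It assumes that "bbs-content" is a class name on the div; the ContentParser class and the sample markup are hypothetical and only imitate the structure described above.

```python
from html.parser import HTMLParser


class ContentParser(HTMLParser):
    """Collect text inside the first div whose class list contains target_class."""

    def __init__(self, target_class="bbs-content"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # 0 means we are outside the target div
        self.done = False     # set once the target div has been closed
        self.paragraphs = []
        self._current = []

    def _flush(self):
        # Join the buffered text, drop full-width spaces and surrounding whitespace.
        text = "".join(self._current).replace("\u3000", "").strip()
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "br":
                self._flush()      # a <br> ends the current paragraph
            elif tag == "div":
                self.depth += 1    # track nested divs so we find the right close
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self._flush()
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self._current.append(data)


# Hypothetical markup imitating a post body.
sample = (
    '<div class="bbs-content clearfix">\r\n\t\u3000\u3000第一段<br>\r\n'
    '\u3000\u3000第二段<br><br>第三段</div><div class="other">忽略</div>'
)
parser = ContentParser()
parser.feed(sample)
parser.close()
print("\n".join(parser.paragraphs))
```

Empty segments between consecutive <br> tags are skipped, so the result holds one entry per non-empty paragraph.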

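3. Choosing the tools

To fetch the HTML we use the requests library.

To parse the HTML and extract data we use XPath, through lxml's etree module.

To store the data as formatted text we need the json library.

To record the time we need the time library.

To pull values out of text we may need regular expressions, so import the re library (optional).

*Note: at this point you can copy your browser's User-Agent and build the headers:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
    }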

4. Extraction

4.1 Extracting the list page

    import requests
    from lxml import etree

    url = 'http://bbs.tianya.cn/list.jsp?item=170'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
    }
    posts = []  # collected posts
    next = ''   # link to the next page

    raw = requests.get(url, headers=headers)  # fetch the page
    html = etree.HTML(raw.text)  # parse into an element tree that XPath can query

    # Get the next-page link
    next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
    # Check whether the second button is the next-page button; if not, use the third.
    # On every page except the first it is a[3] rather than a[2]
    # (the two steps must stay in this order, otherwise the first page raises an error).
    # The label must match the link text exactly as the site serves it.
    if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一頁':
        next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]

    tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')  # the 9 tbody elements
    tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
    for tbody in tbodys:
        items = tbody.xpath("./tr")
        for item in items:
            # titles contain line breaks and similar characters that must be removed
            title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
            post_url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
            author_id = item.xpath("./td[2]/a/text()")[0]  # author id
            author_url = item.xpath("./td[2]/a/@href")[0]  # author link
            read_num = item.xpath("./td[3]/text()")[0]  # read count
            reply_num = item.xpath("./td[4]/text()")[0]  # reply count
            post = {
                'title': title,
                'author_id': author_id,
                'url': post_url,
                'author_url': author_url,
                'read_num': read_num,
                'reply_num': reply_num,
            }
            posts.append(post)
            print(post)  # debug output

4.2 Extracting a single post

    import re

    # requests, etree, headers and title come from the snippet in 4.1
    post_time = ''     # post time
    post_content = ''  # post body
    post_url = 'http://bbs.tianya.cn/post-170-917511-1.shtml'
    postraw = requests.get(post_url, headers=headers)
    posthtml = etree.HTML(postraw.text)

    # Tianya stores the post time in two layouts; handle both
    try:
        posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # post time
    except IndexError:
        posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # post time
    # Use a regular expression to normalize the time to YYYY-MM-DD HH:mm:ss
    post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]

    if len(title) == 0:  # some posts have special formatting, so the list page yields no title
        title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_論壇_天涯社區', '')

    contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
    post_content = ''
    for string in contents:  # process each paragraph of the body
        # strip line breaks and similar characters, then add a newline between paragraphs
        string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
        post_content += string  # concatenate the paragraphs

4.3 Building the function

The goal here is to combine the single-post code with the list-page code so that all data on one page (titles, content, and counts) is extracted in one call.

Below is my implementation. It takes the page URL url and headers as input, and returns posts, a list holding all data from that page, together with next, the link to the next page.

    def get_posts(url, headers):
        raw = requests.get(url, headers=headers)
        code = raw.status_code
        posts = []
        next = ''  # if the page fails to load, return empty values instead of raising
        if code == 200:
            html = etree.HTML(raw.text)
            # Get the next-page link
            next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
            # On every page except the first it is a[3] rather than a[2]
            # (the two steps must stay in this order, otherwise the first page raises an error)
            if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一頁':  # is the second button the next-page button?
                next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]

            tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')
            tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
            for tbody in tbodys:
                items = tbody.xpath("./tr")
                for item in items:
                    # titles contain line breaks and similar characters that must be removed
                    title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
                    url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
                    author_id = item.xpath("./td[2]/a/text()")[0]  # author id
                    author_url = item.xpath("./td[2]/a/@href")[0]  # author link
                    read_num = item.xpath("./td[3]/text()")[0]  # read count
                    reply_num = item.xpath("./td[4]/text()")[0]  # reply count

                    # Fetch the post itself; the defaults are used if the request fails
                    post_time = ''
                    post_content = ''
                    postraw = requests.get(url, headers=headers)
                    postcode = postraw.status_code
                    if postcode == 200:
                        posthtml = etree.HTML(postraw.text)
                        try:
                            posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # post time
                        except IndexError:
                            posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # post time
                        post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]
                        if len(title) == 0:  # some posts have special formatting, so the list page yields no title
                            title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_論壇_天涯社區', '')
                        contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
                        for string in contents:
                            # strip line breaks and similar characters, then add a newline between paragraphs
                            string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
                            post_content += string  # concatenate the paragraphs

                    post = {
                        'title': title,
                        'post_time': post_time,
                        'author_id': author_id,
                        'url': url,
                        'author_url': author_url,
                        'read_num': read_num,
                        'reply_num': reply_num,
                        'content': post_content
                    }
                    posts.append(post)
                    print(title)  # debug output: post title
        return posts, next

4.4 Saving the data

Goal of this step: build the main function and save the result as formatted JSON.

    import json
    import time

    def main():
        url = 'http://bbs.tianya.cn/list.jsp?item=170'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
        }
        postss = {
            'time': time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
            'pages': 0,
            'posts': []
        }
        for i in range(3):  # crawl only the first three pages
            print("page: " + str(i + 1))  # debug output: page number
            posts, next = get_posts(url, headers)
            pages = {
                'page': i + 1,
                'posts': posts
            }
            postss['posts'].append(pages)
            url = next
            postss['pages'] += 1
            # save after every page so that a crash does not lose earlier pages
            with open('tianya.json', 'w', encoding='utf-8') as f:
                json.dump(postss, f, ensure_ascii=False, indent=4)
            if not url:  # the list page failed to load, so there is no next link to follow
                break
        with open('tianya.json', 'w', encoding='utf-8') as f:
            json.dump(postss, f, ensure_ascii=False, indent=4)  # indent=4 pretty-prints the JSON
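Rewriting tianya.json in place after every page protects earlier pages, but a crash in the middle of a write could still leave a truncated file. One possible refinement, shown here only as a sketch, is to write to a temporary file and then rename it over the target. The function name save_json_atomic is hypothetical and is not part of the original code.

```python
import json
import os
import tempfile


def save_json_atomic(data, path):
    """Write data as JSON to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        # os.replace is atomic when both paths are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # clean up the temp file if anything went wrong, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: save twice to the same target, as the crawl loop would after each page.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "tianya.json")
    save_json_atomic({"pages": 1, "posts": []}, target)
    save_json_atomic({"pages": 2, "posts": []}, target)
    with open(target, encoding="utf-8") as f:
        loaded = json.load(f)
print(loaded["pages"])
```

Creating the temporary file in the same directory as the target keeps both on the same filesystem, which is what makes the final rename safe.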

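5. Things to watch out for

Titles extracted directly from the list page contain stray characters (\r, \n, \t) that need to be removed.

Some titles on the list page have special styling and cannot be extracted there. For those, open the post and take the title from the title tag in its head.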
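Both points can be folded into one small helper. This is a sketch rather than the article's own code: clean_title and resolve_title are hypothetical names, the extra strip() call is an addition, and the suffix is the one removed in the code of this article.

```python
def clean_title(raw_title):
    """Remove carriage returns, newlines and tabs, then trim surrounding spaces."""
    return raw_title.replace("\r", "").replace("\n", "").replace("\t", "").strip()


def resolve_title(list_title, head_title, suffix="_工薪一族_論壇_天涯社區"):
    """Prefer the list-page title; otherwise use the head title minus the board suffix."""
    title = clean_title(list_title)
    if title:
        return title
    head_title = clean_title(head_title)
    if head_title.endswith(suffix):
        head_title = head_title[: -len(suffix)]
    return head_title


# A normal title, and a title that is empty once the stray characters are removed.
print(resolve_title("\r\n\t歷史學習記錄\r\n", "unused"))
print(resolve_title("\r\n\t", "歷史學習記錄_工薪一族_論壇_天涯社區"))
```

Removing the suffix only when it sits at the end of the string avoids touching a title that happens to contain the same text elsewhere.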

6. Final code

    import requests
    from lxml import etree
    import json
    import re
    import time


    def get_posts(url, headers):
        raw = requests.get(url, headers=headers)
        code = raw.status_code
        posts = []
        next = ''  # if the page fails to load, return empty values instead of raising
        if code == 200:
            html = etree.HTML(raw.text)
            # Get the next-page link
            next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
            # On every page except the first it is a[3] rather than a[2]
            # (the two steps must stay in this order, otherwise the first page raises an error)
            if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一頁':  # is the second button the next-page button?
                next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]

            tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')
            tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
            for tbody in tbodys:
                items = tbody.xpath("./tr")
                for item in items:
                    # titles contain line breaks and similar characters that must be removed
                    title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
                    url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
                    author_id = item.xpath("./td[2]/a/text()")[0]  # author id
                    author_url = item.xpath("./td[2]/a/@href")[0]  # author link
                    read_num = item.xpath("./td[3]/text()")[0]  # read count
                    reply_num = item.xpath("./td[4]/text()")[0]  # reply count

                    # Fetch the post itself; the defaults are used if the request fails
                    post_time = ''
                    post_content = ''
                    postraw = requests.get(url, headers=headers)
                    postcode = postraw.status_code
                    if postcode == 200:
                        posthtml = etree.HTML(postraw.text)
                        try:
                            posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # post time
                        except IndexError:
                            posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # post time
                        post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]
                        if len(title) == 0:  # some posts have special formatting, so the list page yields no title
                            title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_論壇_天涯社區', '')
                        contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
                        for string in contents:
                            # strip line breaks and similar characters, then add a newline between paragraphs
                            string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
                            post_content += string  # concatenate the paragraphs

                    post = {
                        'title': title,
                        'post_time': post_time,
                        'author_id': author_id,
                        'url': url,
                        'author_url': author_url,
                        'read_num': read_num,
                        'reply_num': reply_num,
                        'content': post_content
                    }
                    posts.append(post)
                    print(title)  # debug output: post title
        return posts, next


    def main():
        url = 'http://bbs.tianya.cn/list.jsp?item=170'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
        }
        postss = {
            'time': time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
            'pages': 0,
            'posts': []
        }
        for i in range(3):  # crawl only the first three pages
            print("page: " + str(i + 1))  # debug output: page number
            posts, next = get_posts(url, headers)
            pages = {
                'page': i + 1,
                'posts': posts
            }
            postss['posts'].append(pages)
            url = next
            postss['pages'] += 1
            # save after every page so that a crash does not lose earlier pages
            with open('tianya.json', 'w', encoding='utf-8') as f:
                json.dump(postss, f, ensure_ascii=False, indent=4)
            if not url:  # the list page failed to load, so there is no next link to follow
                break
        with open('tianya.json', 'w', encoding='utf-8') as f:
            json.dump(postss, f, ensure_ascii=False, indent=4)  # indent=4 pretty-prints the JSON


    if __name__ == '__main__':
        main()
