Scraping WeChat IDs from Baidu Tieba replies with Python (using plain HTTP requests)

Published: 2024/1/8

Author: 草小誠 (wellsmile@foxmail.com)
When reposting, please cite the original: https://blog.csdn.net/cxcjoker7894/article/details/85685115

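A while back my wife had a request: she wanted every WeChat ID posted in the replies to the recent threads of any given Tieba forum, to use for some micro-business (selling through WeChat) work, you know the kind. Some forums exist purely for micro-business sellers to add each other or for customers to leave their WeChat IDs, and others serve very specific user groups, so the targeting is very precise. We both figured this would be more efficient than other ways of adding contacts, so being able to extract WeChat IDs quickly and conveniently would be quite valuable (in hindsight, the conversion rate from WeChat ID to purchase was about 1%, which we were already happy with).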

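The requirement was clear, so we got straight to it and planned to have the tool working that same evening. I opened my laptop and started looking into BeautifulSoup and Scrapy, while my wife began picking target forums and thinking about keyword filters and regular expressions for matching replies. Watching her click through the site, I noticed that Baidu Tieba's URLs and CSS class names are extremely regular, that paging and the function buttons simply show up in the URL of a GET request, and that nothing is obfuscated or encrypted. In that case, why bother with a crawler framework? Just loop over HTTP requests and find a way to pull the WeChat IDs out of the returned HTML. Very simple.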

The examples below all use the “寶媽微信群” forum:

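1. Open the forum. The URL looks like this:

https://tieba.baidu.com/f?kw=寶媽微信群&ie=utf-8

Other forums follow the same format: the path is /f, with the parameters kw (keyword) and ie. kw is the forum name, and ie is the encoding, which can be left alone. Since we only need recent threads, we do not page through the thread list.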
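
As a minimal sketch of how that URL can be assembled, the snippet below builds the query string from the kw and ie parameters using only the standard library. The names TIEBA_INDEX_URL and build_index_url are made up for this illustration and do not appear in the script further down.

```python
from urllib.parse import urlencode

# Hypothetical constant name; the path /f comes from the URL shown above.
TIEBA_INDEX_URL = 'https://tieba.baidu.com/f'

def build_index_url(tieba_name):
    """Return the front-page URL of a forum, given its name without the trailing '吧'."""
    # urlencode percent-encodes the Chinese forum name as UTF-8 by default.
    query = urlencode({'kw': tieba_name, 'ie': 'utf-8'})
    return TIEBA_INDEX_URL + '?' + query

index_url = build_index_url('寶媽微信群')
print(index_url)
```

Passing the same dictionary as the params argument of requests.get has the same effect, which is what the full script does.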

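2. Open a thread. The URL looks like this:

https://tieba.baidu.com/p/5981196309

Compared with other threads, it really is very regular: /p/ means a thread, followed by the thread id, which is unique and never changes.
In the HTML of the thread list, an element whose markup contains 'j_th_tit' but not 'threadlist_title' is a thread link.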
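
The filter described above can be sketched as follows. The sample HTML lines are invented for illustration and only approximate the real markup, and the helper name extract_thread_links and the two regular expressions are assumptions of this sketch (the full script further down splits each line on spaces instead).

```python
import re

TIEBA_BASE_URL = 'https://tieba.baidu.com'
# Assumed attribute layout: href="/p/<id>" and title="<thread title>" on the same line.
HREF_RE = re.compile(r'href="(/p/\d+)"')
TITLE_RE = re.compile(r'title="([^"]*)"')

def extract_thread_links(html_lines):
    """Map thread URL -> thread title for every line that looks like a thread link."""
    links = {}
    for line in html_lines:
        # Keep lines with 'j_th_tit' but skip the wrapper that has 'threadlist_title'.
        if 'j_th_tit' not in line or 'threadlist_title' in line:
            continue
        href = HREF_RE.search(line)
        title = TITLE_RE.search(line)
        if href and title:
            links[TIEBA_BASE_URL + href.group(1)] = title.group(1)
    return links

# Invented sample lines, not copied from a real page.
sample_lines = [
    '<div class="threadlist_title pull_left j_th_tit ">',
    '<a rel="noreferrer" href="/p/5981196309" title="sample thread" target="_blank" class="j_th_tit ">sample thread</a>',
    '<a href="/home/main">not a thread</a>',
]
thread_links = extract_thread_links(sample_lines)
print(thread_links)
```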

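3. What about paging? Click on page five:

https://tieba.baidu.com/p/5144424400?pn=5

Bingo: adding a pn parameter takes you to page N. There is one catch: if pn exceeds the thread's last page, the last page is shown anyway. In other words, if a thread has only 5 pages, passing 100 still shows page 5. So when looping over pages, you have to check the page number displayed on the page to tell whether the page really advanced, and if it did not, stop and continue with the next thread.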
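
The stop condition can be sketched without touching the network. Here fetch_page is a hypothetical callable that returns the page number actually displayed together with the page content, and fake_fetch imitates the clamping behaviour described above for a 3-page thread. A thread that shows no page number at all is not handled in this sketch.

```python
def crawl_pages(fetch_page):
    """Collect page contents until the displayed page number stops advancing."""
    pages = []
    last_page = 0
    while True:
        displayed, content = fetch_page(last_page + 1)
        if displayed == last_page:
            # The site served the last page again, so there is nothing new.
            break
        pages.append(content)
        last_page = displayed
    return pages

def fake_fetch(pn, total_pages=3):
    """Stand-in for an HTTP request: pn values past the end show the last page."""
    shown = min(pn, total_pages)
    return shown, 'content of page %d' % shown

collected = crawl_pages(fake_fetch)
print(collected)
```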

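At this point all the HTML can be fetched. The reply content then has to be extracted from it, which breaks down into three parts:

Read the HTML line by line, split each line into pieces, and pick out the useful HTML elements by tag name and class name.

Filter out the noise by dropping the js snippets mixed into the page.

Handle each kind of useful element separately: extract WeChat IDs from replies, check the page number from the pager, and so on.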

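For example, the element with class="tP" shows the current page number, and elements containing 'd_post_content j_d_post_content' are replies. The HTML is huge, so I will not paste it here.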
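
A rough sketch of that extraction step, assuming each reply sits on a single line of HTML: keep the lines that carry the reply class and strip the tags with a regular expression. The sample lines are invented, and extract_reply_texts is a name made up for this example.

```python
import re

TAG_RE = re.compile(r'<.*?>')  # non-greedy match of any single HTML tag
REPLY_MARKER = 'd_post_content j_d_post_content'

def extract_reply_texts(html_lines):
    """Return the tag-free text of every line that contains a reply element."""
    replies = []
    for line in html_lines:
        if REPLY_MARKER in line:
            text = TAG_RE.sub('', line).strip()
            if text:
                replies.append(text)
    return replies

# Invented sample lines, not copied from a real page.
sample_html = [
    '<div id="post_content_1" class="d_post_content j_d_post_content ">  add me: abc_12345<br></div>',
    '<span class="tP">2</span>',
    '<div class="d_post_content j_d_post_content "><img src="x.png"></div>',
]
reply_texts = extract_reply_texts(sample_html)
print(reply_texts)
```

A regular expression is enough here only because the markup is so regular; for messier pages an HTML parser would be the safer choice.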

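Since I could not come up with a good rule for recognizing a WeChat ID, I simply filtered out any reply that contains no English letters or digits.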
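
As a simplified illustration of that heuristic, the check below keeps a reply only if it contains at least one ASCII letter or digit. The function name looks_like_contact and the sample replies are made up, and the full script applies the check to each space-separated word rather than to the whole reply.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

def looks_like_contact(reply):
    """True if the reply contains at least one ASCII letter or digit."""
    return any(ch in ALNUM for ch in reply)

# Invented sample replies.
replies = ['加我 abc_12345', '謝謝樓主', 'vx: wx2024test']
kept = [r for r in replies if looks_like_contact(r)]
print(kept)
```

This is deliberately loose: it lets through anything with a letter or digit, so the output still needs a manual pass.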

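Then write the while True loop that pages through each thread, and the job is done. Write every valid reply that was scraped to a file, and you have all the valid replies from all the threads on the forum's front page.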

Here is the result; this forum yielded about 2,000 entries:

[Screenshot of the results]

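The code is below. Feel free to use it as a tool. The first three variables are:

tieba_name: the forum name, without the trailing “吧” character.
tiezi_names: optional thread-title keywords. If set, only threads whose title contains one of the keywords are scraped.
store_file_path: the path of the file the results are written to.

Nothing else needs to be changed before use: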

#coding=utf-8
'''
@author: caoxiaocheng
'''
import requests
import re
import string
import time

tieba_name = '寶媽微信群'
tiezi_names = []
store_file_path = 'C:\\Users\\user\\Desktop\\貼吧爬取數據%s.txt' % str(int(time.time()))


def special_print(print_str, store_file=None):
    # Print a highlighted status line and, if a file is given, log it there too.
    print('★★★★★★★★ ' + str(print_str), flush=True)
    if store_file:
        store_file.write('★★★★★★★★ ' + str(print_str)+'\n')


def tiezi_check(tiezi_names, tieba_title_line):
    # With no keywords configured, every thread passes.
    if not tiezi_names:
        return True
    for tiezi_name in tiezi_names:
        if tiezi_name in tieba_title_line:
            return True
    return False


def word_check(word, check_words):
    # Drop words that contain known page noise.
    for check_word in check_words:
        if check_word in word:
            return False
    # Drop words that start with an xx:xx pattern, such as a timestamp.
    if re.match(r'..:..', word):
        return False
    # Keep only words that contain at least one ASCII letter or digit.
    num_or_letter = False
    for one_char in word:
        if one_char in string.digits or one_char in string.ascii_letters:
            num_or_letter = True
            break
    if not num_or_letter:
        return False
    return True


def get_resp_line(resp):
    # Decode the response body and split it into lines.
    resp_html = resp.content.decode('utf-8')
    resp_lines = resp_html.split('\n')
    return resp_lines


tieba_index_url = 'https://tieba.baidu.com/f'
tieba_detail_url = 'https://tieba.baidu.com'
tieba_index_params = {
    'kw': tieba_name,
    'ie': 'utf-8',
}
detail_link_mapping = {}
# Noise strings matched against the page text, so they stay in Chinese.
check_words = ['該樓層疑似違規已被系統折疊','送TA禮物','收起回復','function']

# Store the links of all threads on the target forum's front page
special_print('Fetching the thread list')
tieba_index_resp = requests.get(tieba_index_url, tieba_index_params)
tieba_index_lines = get_resp_line(tieba_index_resp)
for tieba_index_line in tieba_index_lines:
    if 'j_th_tit' in tieba_index_line and 'threadlist_title' not in tieba_index_line and tiezi_check(tiezi_names, tieba_index_line):
        title_line = tieba_index_line.strip()
        title_name = title_line.split(' ')[3][7:-1]
        title_link_tail = title_line.split(' ')[2][6:-1]
        title_link = tieba_detail_url + title_link_tail
        detail_link_mapping[title_link] = title_name
for detail_link in detail_link_mapping:
    print('Title: '+detail_link_mapping[detail_link])
    print('Link: '+detail_link)

# Go through every thread and scrape all of its replies
with open(store_file_path, 'a', encoding='utf-8') as store_file:
    for detail_link in detail_link_mapping:
        title_name = detail_link_mapping[detail_link]
        special_print('Scraping thread: '+title_name, store_file)
        last_page = 0
        current_page = 1
        try:
            while(True):
                special_print('Scraping page '+str(last_page+1), store_file)
                detail_link_resp = requests.get(detail_link+'?pn='+str(last_page+1))
                detail_link_lines = get_resp_line(detail_link_resp)
                more_page = False
                for detail_link_line in detail_link_lines:
                    detail_line = detail_link_line.strip()
                    # Strip all HTML tags from the line.
                    detail_line_fix = re.sub(u"\\<.*?\\>", "", detail_line)
                    if 'class="tP"' in detail_link_line:
                        # The pager shows the page actually displayed.
                        more_page = True
                        current_page = int(detail_line_fix.strip())
                        if current_page == last_page:
                            special_print('No more pages', store_file)
                            raise IndexError
                    if 'd_post_content j_d_post_content' in detail_link_line and 'topic_name' not in detail_link_line:
                        # Cut off anything from 'void' onward.
                        if 'void' in detail_line_fix:
                            detail_line_fix = detail_line_fix[:detail_line_fix.index('void')]
                        detail_line_words = [x for x in detail_line_fix.split(' ') if x]
                        for detail_line_word in detail_line_words:
                            if word_check(detail_line_word, check_words):
                                print(detail_line_word)
                                store_file.write(detail_line_word+'\n')
                if not more_page:
                    special_print('No more pages', store_file)
                    raise IndexError
                last_page = current_page
        except IndexError:
            # IndexError is used as the signal to move on to the next thread.
            continue
