Weibo Data Crawler: Fetching Basic User Info (Part 3)
Goal: fetch a Weibo user's basic profile information, such as the follow count, follower count, post count, and registration time.
First, obtain the page_id.

It can be extracted with a regular expression:
```python
add = urllib.request.Request(url="https://weibo.com/u/%s?is_all=1" % o_id, headers=headers)
r = urllib.request.urlopen(url=add, timeout=20).read().decode('utf-8')
page_id = re.findall(r'\$CONFIG\[\'page_id\']=\'(\d+)\'', r)[0]
```

Then extract the basic information by matching the info page:
```python
add = urllib.request.Request(url="https://weibo.com/p/%s/info" % page_id, headers=headers)
r = urllib.request.urlopen(url=add, timeout=20).read().decode('utf-8')
nums = re.findall(r'<strong class=\\"W_f.*?\\">(\d*)<\\/strong>', r)
regist_time = re.findall(r'注册时间:.*?<span class=\\"pt_detail\\">(.*?)<\\/span>', r)[0]
regist_time = regist_time.replace(" ", "").replace("\\r\\n", "")
dic["follow_num"] = nums[0]
dic["fun_num"] = nums[1]
dic["post_num"] = nums[2]
dic["regist_time"] = regist_time
```

The final code is as follows:
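The doubled backslashes in these patterns are needed because weibo.com embeds the profile HTML inside JavaScript strings, so quotes and slashes arrive escaped in the response body. A minimal sketch against a hypothetical sample fragment (the `W_f12`/`W_f16` class names and numbers are made up for illustration):

```python
import re

# Hypothetical fragment mimicking the JS-escaped HTML that weibo.com returns:
# quotes and slashes are backslash-escaped inside the JavaScript string.
sample = r'<strong class=\"W_f12\">321<\/strong> ... <strong class=\"W_f16\">4567<\/strong>'

# The pattern therefore matches a literal backslash (\\) before each quote and slash.
nums = re.findall(r'<strong class=\\"W_f.*?\\">(\d*)<\\/strong>', sample)
print(nums)  # ['321', '4567']
```

If Weibo ever returned plain (unescaped) HTML for this page, these patterns would stop matching, which is why they are tied to this particular response format.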
```python
import json
import re
import urllib.request

import config  # local module providing request headers (cookies, User-Agent)


def get_user_action(o_id):
    dic = {}
    headers = config.get_headers()

    # Step 1: fetch the user's home page and extract page_id
    # from the embedded $CONFIG JavaScript object.
    add = urllib.request.Request(url="https://weibo.com/u/%s?is_all=1" % o_id, headers=headers)
    r = urllib.request.urlopen(url=add, timeout=20).read().decode('utf-8')
    page_id = re.findall(r'\$CONFIG\[\'page_id\']=\'(\d+)\'', r)[0]

    # Step 2: fetch the info page and pull the counters and registration time.
    add = urllib.request.Request(url="https://weibo.com/p/%s/info" % page_id, headers=headers)
    r = urllib.request.urlopen(url=add, timeout=20).read().decode('utf-8')
    nums = re.findall(r'<strong class=\\"W_f.*?\\">(\d*)<\\/strong>', r)
    regist_time = re.findall(r'注册时间:.*?<span class=\\"pt_detail\\">(.*?)<\\/span>', r)[0]
    regist_time = regist_time.replace(" ", "").replace("\\r\\n", "")

    dic["follow_num"] = nums[0]
    dic["fun_num"] = nums[1]   # follower (fan) count
    dic["post_num"] = nums[2]
    dic["regist_time"] = regist_time
    return dic


if __name__ == '__main__':
    dic = get_user_action("1906123125")
    with open("data/data_userinfo.json", "w") as json_f:
        json.dump(dic, json_f, indent=4)
```
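The `[0]` indexing above raises an `IndexError` whenever a pattern finds nothing, which happens when Weibo serves a login wall or changes the page layout. A small defensive helper (a sketch, not part of the original code; the name `extract_first` is hypothetical) would make those failures explicit:

```python
import re


def extract_first(pattern, text, default=None):
    """Return the first regex capture found in text, or default when nothing matches."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else default


# Usage in the crawler would look like:
#   page_id = extract_first(r'\$CONFIG\[\'page_id\']=\'(\d+)\'', r)
#   if page_id is None:
#       ...  # log in again, retry, or skip this user
```

Returning a sentinel instead of crashing lets the crawler skip or retry a single user without aborting a long scraping run.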