Scraping the Weibo Hot Search List with a Crawler

Published: 2024/1/8 | Programming Q&A | 26 | 豆豆

Steps:

1. Open the Weibo hot search page: https://s.weibo.com/top/summary?
2. Right-click and choose Inspect to analyze the static page.
3. Save the scraped content as a CSV file.

Libraries to import

import requests
from lxml import etree
import pandas as pd

Enough talk, here is the source code!

import requests
from lxml import etree
import pandas as pd

url = 'https://s.weibo.com/top/summary?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36 Edg/91.0.864.70'
}


def get_url(url):
    """Fetch the page and return its HTML text, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.ConnectionError as e:
        print(e.args)
    return None


def get_hot():
    hotlist = []       # hot search titles
    hot_url_list = []  # hot search URLs
    index_list = []    # index numbers
    items = get_url(url)  # call the function to get response.text
    if items is None:
        print('Failed to fetch the page')
        return
    html = etree.HTML(items)  # parse the HTML
    # XPath locator; it can be copied directly from the browser
    hot_list = html.xpath('/html/body/div/section/ul/li')
    j = 1
    # iterate over every <li> entry
    for i in hot_list:
        # get the hot search title
        hot = i.xpath('./a/span/text()')[0]  # I never understood what the [0] means
        hotlist.append(hot)
        # get the URL of the entry
        hot_url = i.xpath('./a/@href')[0]
        # the prefix has to be added to get a URL that actually opens
        hot_url = "https://s.weibo.com/" + str(hot_url)
        hot_url_list.append(hot_url)
        print(j, hot, hot_url)
        index_list.append(j)
        j = j + 1
    # save to file (columns: index number, content, url)
    file = pd.DataFrame(data={'編號': index_list, '內容': hotlist, 'url': hot_url_list})
    file.to_csv('微博熱搜.csv', encoding='utf_8_sig')


# call the function to run the crawl
get_hot()

Run result:

Output file

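With that, today's Weibo hot search list has been fetched.

One thing to watch in the code above is assembling the URL: the href values in the page source do not include the "https://s.weibo.com/" prefix, so it has to be added before the links will open. I suppose that was a little test for me, haha!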
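As an alternative to plain string concatenation, the standard library's urljoin can assemble the full link. This is only a sketch: the helper name absolute_url and the sample href values are made up for illustration and are not taken from the actual page. One possible benefit is that an href starting with a slash does not end up with a double slash after the domain.

```python
from urllib.parse import urljoin

BASE = 'https://s.weibo.com/'  # same prefix the article adds by hand


def absolute_url(href):
    """Resolve a (possibly relative) href against the site root."""
    return urljoin(BASE, href)


# Hypothetical href values, for illustration only.
print(absolute_url('/weibo?q=example'))       # leading slash, no double slash in the result
print(absolute_url('weibo?q=example'))        # no leading slash, same result
print(absolute_url('https://example.com/x'))  # already absolute, left unchanged
```

In the loop, this would replace the concatenation with something like hot_url = absolute_url(hot_url).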

Also, in hot = i.xpath('./a/span/text()')[0], leaving off the trailing [0] causes an error, but I don't know what it means. I'd appreciate it if someone more experienced could explain.
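On the [0] question, a likely explanation: for a path expression like this one, lxml's xpath() returns a Python list of every match, even when there is only one. The [0] simply takes the first item out of that list, so hot becomes a string rather than a one-item list. The catch is that [0] raises IndexError when nothing matched. The sketch below uses plain lists as stand-ins for xpath() results, and first_or_default is a hypothetical helper name, not part of lxml.

```python
def first_or_default(matches, default=None):
    """Return the first XPath match, or `default` when nothing matched."""
    return matches[0] if matches else default


# Plain lists standing in for what i.xpath('./a/span/text()') would return.
matched = ['some hot search title']  # one <span> text was found
unmatched = []                       # the <li> had no matching <span>

print(first_or_default(matched))        # the string itself, not a list
print(first_or_default(unmatched, ''))  # empty string instead of an IndexError
```

In the loop this could be used as hot = first_or_default(i.xpath('./a/span/text()'), ''), which keeps the crawl going if one entry has a different structure.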

I only have a surface-level grasp of XPath and my knowledge is limited, so any pointers from passers-by are welcome!
