Asynchronously Scraping the Zhihu Hot List with Python

Published: 2025/3/20 python 19 豆豆

1. Faulty code: the excerpt and the detail URL cannot be retrieved

import asyncio

import aiohttp
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
}


async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the status code
            html = await resp.text()
            soup = BeautifulSoup(html, 'lxml')
            items = soup.select('.HotList-item')
            for item in items:
                title = item.select('.HotList-itemTitle')[0].text
                try:
                    abstract = item.select('.HotList-itemExcerpt')[0].text
                except IndexError:
                    # the excerpt is not present in the static HTML
                    abstract = 'No Abstract'
                hot = item.select('.HotList-itemMetrics')[0].text
                try:
                    # select() returns a list, so index it before reading 'src'
                    img = item.select('.HotList-itemImgContainer img')[0]['src']
                except (IndexError, KeyError):
                    img = 'No Img'
                print("{}\n{}\n{}".format(title, abstract, img))


if __name__ == '__main__':
    url = 'https://www.zhihu.com/billboard'
    asyncio.run(getPages(url))

2. Inspect the JS code

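It turns out that the detail link, the image link, the question excerpt and the other missing fields all live in the page's JS (the CSDN developer-assistant browser plugin really is handy for spotting this).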
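One way to confirm where the data lives without a browser plugin is to collect every script body and look for a known key. The sketch below uses only the standard library. The `ScriptCollector` class, the `scripts_containing` helper and the miniature `SAMPLE_HTML` page are hypothetical names and data made up for illustration, not real Zhihu markup.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def scripts_containing(html, needle):
    """Return the script bodies that contain the given substring."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [body for body in collector.scripts if needle in body]


# Hypothetical miniature page: the visible markup has only a title,
# while the full record sits inside a script element.
SAMPLE_HTML = (
    '<html><body><div class="HotList-item">Title only</div>'
    '<script>var x = 1;</script>'
    '<script type="text/json">{"hotList":[{"id":1}],"guestFeeds":[]}</script>'
    '</body></html>'
)

matches = scripts_containing(SAMPLE_HTML, '"hotList"')
print(len(matches))
```

If a search like this finds the key in a script body but the CSS selectors find nothing, the fields are being rendered client-side and have to be read from the embedded data instead.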

Use a regular expression to extract this information
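The extraction step can be tried offline first. The following is a minimal sketch that applies the article's pattern to a shortened, hypothetical stand-in for the page source. `SAMPLE_HTML` and the `extract_hot_list` helper are assumptions for illustration, as are the `re.S` flag and the empty-list fallback, which the original code does not use.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the page source.
SAMPLE_HTML = (
    '<script>{"hotList":['
    '{"target":{"titleArea":{"text":"Example question"}}}'
    '],"guestFeeds":[]}</script>'
)

# Same pattern as in the article; re.S is added here so the match
# could also span line breaks.
HOT_LIST_RE = re.compile(r'"hotList":(.*?),"guestFeeds":', re.S)


def extract_hot_list(html):
    """Return the parsed hot list, or an empty list if the marker is absent."""
    match = HOT_LIST_RE.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


items = extract_hot_list(SAMPLE_HTML)
print(items[0]["target"]["titleArea"]["text"])
```

The non-greedy `(.*?)` stops at the first `,"guestFeeds":` marker, so the captured group is exactly the JSON array that `json.loads` can parse. Returning an empty list when nothing matches avoids an `AttributeError` on `None` if the page layout changes.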

Here is the full code:

import asyncio
import json
import re

import aiohttp

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
}


async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the status code
            html = await resp.text()
            # the hot list is embedded as JSON between these two keys
            regex = re.compile('"hotList":(.*?),"guestFeeds":')
            text = regex.search(html).group(1)
            # print(json.loads(text))  # parse the JSON into Python objects
            for item in json.loads(text):
                title = item['target']['titleArea']['text']
                question = item['target']['excerptArea']['text']
                hot = item['target']['metricsArea']['text']
                link = item['target']['link']['url']
                img = item['target']['imageArea']['url']
                if not img:
                    img = 'No Img'
                if not question:
                    question = 'No Abstract'
                print("Title：{}\nPopular：{}\nQuestion：{}\nLink：{}\nImg：{}".format(title, hot, question, link, img))


if __name__ == '__main__':
    url = 'https://www.zhihu.com/billboard'
    asyncio.run(getPages(url))
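The field lookups above raise `KeyError` if an entry lacks one of the nested keys. A more defensive variant might look like the sketch below. The helpers `get_nested` and `parse_item` and the `sample` entry are hypothetical, and the key names are simply those used in the code above.

```python
def get_nested(data, *keys, default=""):
    """Walk nested dicts, returning default if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


def parse_item(item):
    """Turn one hot-list entry into a flat dict with fallback values."""
    target = item.get("target", {})
    return {
        "title": get_nested(target, "titleArea", "text"),
        "hot": get_nested(target, "metricsArea", "text"),
        "question": get_nested(target, "excerptArea", "text") or "No Abstract",
        "link": get_nested(target, "link", "url"),
        "img": get_nested(target, "imageArea", "url") or "No Img",
    }


# Hypothetical entry with the image and the excerpt missing.
sample = {
    "target": {
        "titleArea": {"text": "Example question"},
        "metricsArea": {"text": "100"},
        "link": {"url": "https://www.zhihu.com/question/1"},
    }
}

row = parse_item(sample)
print(row["img"], row["question"])
```

Keeping the parsing in a plain function, separate from the `aiohttp` request, also means it can be checked against saved page data without touching the network.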
