Asynchronously Scraping the Zhihu Hot List with Python
1. Faulty code: the excerpt and the detail URL cannot be retrieved

    import asyncio

    import aiohttp
    from bs4 import BeautifulSoup

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C',
    }


    async def getPages(url):
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                print(resp.status)  # print the status code
                html = await resp.text()
                soup = BeautifulSoup(html, 'lxml')
                # Each hot-list entry is rendered as a .HotList-item element
                items = soup.select('.HotList-item')
                for item in items:
                    title = item.select('.HotList-itemTitle')[0].text
                    try:
                        abstract = item.select('.HotList-itemExcerpt')[0].text
                    except IndexError:
                        abstract = 'No Abstract'
                    hot = item.select('.HotList-itemMetrics')[0].text
                    try:
                        # select() returns a list, so take the first match
                        img = item.select('.HotList-itemImgContainer img')[0]['src']
                    except (IndexError, KeyError):
                        img = 'No Img'
                    print("{}\n{}\n{}".format(title, abstract, img))


    if __name__ == '__main__':
        url = 'https://www.zhihu.com/billboard'
        asyncio.run(getPages(url))

2. Inspecting the JS code
It turns out that the detail links, image links, question excerpts, and so on are all inside the page's JS (CSDN's developer assistant plugin really is handy).
Use a regular expression to extract this information
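As a minimal sketch of this step, the snippet below applies the same pattern to a made-up page fragment instead of the live site. The `sample_html` string, its values, and the helper name `extract_hot_list` are hypothetical; only the `"hotList"` and `"guestFeeds"` key names and the item fields are taken from the code in this article.

```python
import json
import re

# Hypothetical page fragment; the real page embeds a much larger JSON object.
sample_html = (
    '<script>{"hotList":'
    '[{"target":{"titleArea":{"text":"Example question"},'
    '"link":{"url":"https://www.zhihu.com/question/1"}}}]'
    ',"guestFeeds":[]}</script>'
)


def extract_hot_list(html):
    """Return the JSON list found between "hotList": and ,"guestFeeds": or []."""
    match = re.search(r'"hotList":(.*?),"guestFeeds":', html)
    if match is None:
        # The page layout changed or the request was blocked.
        return []
    return json.loads(match.group(1))


items = extract_hot_list(sample_html)
print(items[0]['target']['titleArea']['text'])
```

The non-greedy `(.*?)` stops at the first `,"guestFeeds":`, so this approach depends on that key directly following the hot list in the page source.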
Here is the full code

    import asyncio
    import json
    import re

    import aiohttp

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C',
    }


    async def getPages(url):
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                print(resp.status)  # print the status code
                html = await resp.text()
                # The hot-list data sits in the page's JS as JSON,
                # between the "hotList" and "guestFeeds" keys
                regex = re.compile('"hotList":(.*?),"guestFeeds":')
                text = regex.search(html).group(1)
                # print(json.loads(text))  # JSON parsed into Python lists/dicts
                for item in json.loads(text):
                    title = item['target']['titleArea']['text']
                    question = item['target']['excerptArea']['text']
                    hot = item['target']['metricsArea']['text']
                    link = item['target']['link']['url']
                    img = item['target']['imageArea']['url']
                    # Fall back to placeholders when a field is empty
                    if not img:
                        img = 'No Img'
                    if not question:
                        question = 'No Abstract'
                    print("Title:{}\nPopular:{}\nQuestion:{}\nLink:{}\nImg:{}".format(title, hot, question, link, img))


    if __name__ == '__main__':
        url = 'https://www.zhihu.com/billboard'
        asyncio.run(getPages(url))
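Direct indexing such as `item['target']['imageArea']['url']` raises `KeyError` if an entry lacks one of the fields. One possible defensive variant is sketched below. The helper names `pick` and `normalize_item` and the sample entry are hypothetical; only the key names come from the code above.

```python
def pick(data, *keys, default=''):
    """Walk nested dicts key by key; return default if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


def normalize_item(item):
    """Flatten one hot-list entry into a plain dict with fallback values."""
    target = item.get('target', {})
    return {
        'title': pick(target, 'titleArea', 'text'),
        'hot': pick(target, 'metricsArea', 'text'),
        'question': pick(target, 'excerptArea', 'text') or 'No Abstract',
        'link': pick(target, 'link', 'url'),
        'img': pick(target, 'imageArea', 'url') or 'No Img',
    }


# Made-up entry that has no excerpt and no image.
sample = {
    'target': {
        'titleArea': {'text': 'Example question'},
        'metricsArea': {'text': '1000'},
        'link': {'url': 'https://www.zhihu.com/question/1'},
    }
}
print(normalize_item(sample))
```

Keeping the parsing in a plain function like this also lets it be checked without any network access; `getPages` would only need to call it on each element of the decoded list.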