Asynchronous, High-Concurrency Batch Fetching of URLs in Python
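Approach: first build the list of URLs to visit, then use asyncio coroutines to request them in batch through aiohttp. The returned response data is parsed with BeautifulSoup to extract the fields we want, and the results are inserted into a MongoDB database.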
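The key to this approach is that `asyncio.gather` runs the coroutines concurrently and hands back one result per coroutine. The following is a minimal sketch of that behaviour only, not the crawler itself. `fake_fetch`, `crawl`, `run_demo` and the `example.invalid` URLs are hypothetical stand-ins, and a sleep replaces the real aiohttp request so the snippet runs without network access.

```python
import asyncio


async def fake_fetch(url, delay):
    # Hypothetical stand-in for an aiohttp request: sleeps instead of doing I/O.
    await asyncio.sleep(delay)
    return "<html>" + url + "</html>"


async def crawl(urls):
    # Earlier URLs get longer delays, so they finish last.
    coros = [fake_fetch(url, 0.01 * (len(urls) - i)) for i, url in enumerate(urls)]
    # gather() runs the coroutines concurrently and returns their results
    # in the order they were passed in, not in the order they completed.
    return await asyncio.gather(*coros)


def run_demo():
    # Placeholder URLs; nothing is actually requested.
    urls = ["https://example.invalid/p{}.html".format(i) for i in range(1, 7)]
    return urls, asyncio.run(crawl(urls))


if __name__ == "__main__":
    urls, pages = run_demo()
    for url, page in zip(urls, pages):
        print(url, "->", page)
```

Because the results keep the input order, the list of pages can be matched to the list of URLs by position, even though the last URL finishes first in this sketch.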
    import asyncio
    import time

    import aiohttp
    import async_timeout
    import pymongo
    from bs4 import BeautifulSoup

    # Fetch URL data asynchronously and concurrently (scraping a movie site).
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    mydb = myclient["python"]
    mycol = mydb["movie"]

    linklist = []
    link_texts = []
    link_nums = []

    msg = "https://www.xxx.com/mdb/film/list/year-1908/o0d0p{}.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }

    urls = [msg.format(i) for i in range(1, 7)]
    print(urls)  # print the list of URLs


    async def fetch(session, url):
        # Give up on a page that takes longer than 10 seconds.
        async with async_timeout.timeout(10):
            async with session.get(url) as response:
                html = await response.text(encoding=None, errors='ignore')
                return response.status, html


    async def main(url):
        # Send the User-Agent defined above with every request.
        async with aiohttp.ClientSession(headers=headers) as session:
            status, html = await fetch(session, url)
            if status == 200:
                return html
            else:
                return ''


    async def crawl():
        # gather() returns a list holding the return value of each task.
        return await asyncio.gather(*[main(url) for url in urls])


    if __name__ == '__main__':
        start = time.time()
        html_list = asyncio.run(crawl())
        print(len(html_list))
        for html in html_list:
            if not html:
                continue  # skip pages that did not return status 200
            soup = BeautifulSoup(html, 'html.parser')
            soup = soup.find('ul', attrs={'class': 'inqList'})
            if soup is None:
                continue  # this page has no result list
            for x in soup.find_all('li'):
                aTag = x.find('a')
                divTag = x.find('div')
                if aTag is None or divTag is None:
                    continue
                link = aTag.get('href')
                divaTag = divTag.find('a')
                divbTag = divTag.find('b')
                link_text = divaTag.string if divaTag else None
                if link_text:
                    link_texts.append(link_text)
                if link:
                    linklist.append(link)
                if divbTag:
                    link_num = divbTag.string
                else:
                    link_num = 0
                link_nums.append(link_num)
                # Note: the key is spelled "socre" throughout.
                mydict = {"name": link_text, "socre": link_num, "url": "https://www.1905.com" + (link or '')}
                mycol.insert_one(mydict)  # insert the record
        print(link_texts)
        print(linklist)
        print(link_nums)
        print(len(link_texts))
        print(len(linklist))
        print(len(link_nums))
        end = time.time()
        print("cost time:", end - start)

Read back the data we inserted into MongoDB:
    import pymongo


    class MongoDb:
        def __init__(self):
            myclient = pymongo.MongoClient("mongodb://localhost:27017/")
            mydb = myclient["python"]
            self.mycol = mydb["movie"]


    mdb = MongoDb()
    # Return only name, socre and url (no _id), sorted by socre in descending order.
    for x in mdb.mycol.find({}, {"_id": 0, "name": 1, "socre": 1, "url": 1}).sort("socre", -1):
        print(x)
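One caveat about the descending sort on `socre`: the crawler stores whatever BeautifulSoup's `.string` returns, which is text. If the site renders scores as strings such as "9.1" (an assumption, since the page markup is not shown), they will be compared as text rather than as numbers. The sketch below uses plain Python sorting to illustrate the difference. `to_score` is a hypothetical helper and the rows are made-up sample data.

```python
def to_score(raw):
    # Hypothetical helper: convert scraped score text to a float, or 0.0 if unusable.
    if raw is None:
        return 0.0
    try:
        return float(str(raw).strip())
    except ValueError:
        return 0.0


# Made-up sample rows, shaped like the documents the crawler inserts.
rows = [
    {"name": "A", "socre": "9.1"},
    {"name": "B", "socre": "10.0"},
    {"name": "C", "socre": None},
]

# Comparing the raw strings puts "9.1" ahead of "10.0".
by_text = sorted((r for r in rows if r["socre"]), key=lambda r: r["socre"], reverse=True)

# Comparing the converted numbers gives the expected ranking.
by_number = sorted(rows, key=lambda r: to_score(r["socre"]), reverse=True)

if __name__ == "__main__":
    print([r["name"] for r in by_text])
    print([r["name"] for r in by_number])
```

Converting the value with something like `to_score` before calling `insert_one` would let the database sort the scores numerically.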