
Scraping Doutula data with multithreaded Python

Published: 2024/4/15


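Scraping meme image data from the Doutula site with multithreaded Python

Techniques used

requests: HTTP request library
re: regular expressions
pyquery: HTML parsing library, a jQuery-style API implemented in Python
threading: threads
queue: queues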
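
As a rough illustration of how the last two items fit together, the sketch below runs a small producer-consumer pipeline using only threading and queue. The names producer, consumer, run_pipeline and SENTINEL, as well as the squaring step, are hypothetical stand-ins for page parsing and image downloading; they are not taken from the script in this post.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker that tells a consumer to stop


def producer(numbers, work_queue):
    # Stand-in for "parse a page and enqueue the image links found on it".
    for n in numbers:
        work_queue.put(n)


def consumer(work_queue, results, lock):
    # Stand-in for "download one image per queue item".
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        with lock:
            results.append(item * item)


def run_pipeline(numbers, n_consumers=3):
    work_queue = Queue()
    results = []
    lock = threading.Lock()
    consumers = [
        threading.Thread(target=consumer, args=(work_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for t in consumers:
        t.start()
    prod = threading.Thread(target=producer, args=(numbers, work_queue))
    prod.start()
    prod.join()
    # Queue one sentinel per consumer, after all real work items.
    for _ in consumers:
        work_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return sorted(results)
```

Putting one sentinel per consumer on the queue after the producer has finished is one common way to shut consumers down without polling Queue.empty(), which can change between the check and the following get().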

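'''Doutula scraper, multithreaded version'''
import os
import re
import threading  # thread class
from queue import Queue, Empty  # thread-safe queue class

import requests
from pyquery import PyQuery as jq

# The HTTP header name is "User-Agent" (with a hyphen)
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}

# Create the project folder
pt = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(pt, "斗圖啦")
if not os.path.exists(path):
    os.mkdir(path)


'''
The producer class inherits from threading.Thread
and overrides the __init__ and run methods.
'''
class Producer(threading.Thread):
    def __init__(self, img_queue, url_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.img_queue = img_queue
        self.url_queue = url_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids blocking forever if another producer
                # takes the last URL between an empty() check and get()
                url = self.url_queue.get_nowait()
            except Empty:
                # No URLs left: exit the loop
                break
            self.parse_page(url)

    # Parse one listing page
    def parse_page(self, url):
        res = requests.get(url, headers=head)
        doc = jq(res.text)
        # Find all <a> tags inside the page content
        items = doc.find(".page-content a").items()
        for a in items:
            title = a.find("p").text()
            src = a.find("img.img-responsive").attr("data-original")
            if not src:
                # Skip links that do not contain an image
                continue
            # Split the path to get the file extension
            pathtype = os.path.splitext(src)[1]
            # Strip special characters with a regular expression
            patitle = re.sub(r'[\.。，\?？\*!！\/~]', "", title)
            filename = patitle + pathtype
            filepath = os.path.join(path, filename)
            # Put the item on the consumers' queue for download
            self.img_queue.put((filepath, src))


'''
The consumer works the same way as the producer.
'''
class Customer(threading.Thread):
    def __init__(self, img_queue, url_queue, *args, **kwargs):
        super(Customer, self).__init__(*args, **kwargs)
        self.img_queue = img_queue
        self.url_queue = url_queue

    def run(self):
        while True:
            try:
                # Take a file path and image link from the queue; time out
                # instead of blocking forever on an empty queue
                filepath, src = self.img_queue.get(timeout=10)
            except Empty:
                if self.url_queue.empty():
                    # No URLs left and no image arrived within the timeout.
                    # This is a heuristic: a producer still parsing a slow
                    # page could enqueue more images after this point.
                    break
                continue
            print('%s: download started, link %s' % (filepath, src))
            # Request the image
            img = requests.get(src)
            # Write to disk; content is the binary data, text is the decoded text
            with open(filepath, "wb") as f:
                f.write(img.content)
            # Alternative: urllib.request.urlretrieve(src, filepath)
            print('%s: download finished' % filepath)


def main():
    # Build the URL queue and the image queue
    url_queue = Queue(100000)
    img_queue = Queue(100000)
    # Build the URLs for pages 1 to 100
    for i in range(1, 101):
        url = "https://www.doutula.com/photo/list/?page=" + str(i)
        url_queue.put(url)  # add to the producers' queue
    threads = []
    # Start 5 producer threads
    for i in range(5):
        t = Producer(img_queue, url_queue)
        t.start()
        threads.append(t)
    # Start 3 consumer threads
    for i in range(3):
        t = Customer(img_queue, url_queue)
        t.start()
        threads.append(t)
    # Wait for every thread so the "finished" message is printed at the end
    for t in threads:
        t.join()


if __name__ == '__main__':
    print("Crawler started---------")
    main()
    print("Crawler finished---------")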
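
The filename-building step in parse_page can be tried in isolation. The sketch below wraps it in a helper; the name build_filename and the example.com URLs are hypothetical and used only for illustration, while the character class is the one passed to re.sub in the script.

```python
import os
import re

# Same character class as the re.sub call in the script.
UNSAFE_CHARS = re.compile(r'[\.。，\?？\*!！\/~]')


def build_filename(title, src):
    # Take the extension from the image URL, e.g. ".jpg".
    ext = os.path.splitext(src)[1]
    # Remove the listed punctuation from the title, then append the extension.
    return UNSAFE_CHARS.sub("", title) + ext


print(build_filename("what?!.", "https://example.com/a/b/cat.jpg"))
```

Note that two different titles can collapse to the same filename after stripping, in which case the later download overwrites the earlier one.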

Reposted from: https://www.cnblogs.com/HiLzd/p/11246116.html
