
Python scraping failures: backfilling missing scraped data, and how to smooth out the pitfalls of Python data scraping...

Published: 2023/12/2 · python · 豆豆


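An amateur's notes on backfilling missing scraped data: a few ways to smooth out the pitfalls of scraping data with Python, some personal lessons learned, and a worked example of filling in missing data. These are a few points for reference only; if anything looks familiar, I must have copied it!

When scraping data with Python, especially from your own computer, you will often run into network latency, or a part-time network admin unplugging and restarting the network. I hit this all the time. A server is still the recommended place to run a scraper.

These are common and controllable scraping failures, and there are quite a few ways to handle them. Backfilling the data lost to such failures is the focus here.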

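Fix 1: set timeout=x

When fetching pages with requests, always set the timeout parameter. A typical value is timeout=5, and 5 s or more is recommended. If your network is poor, or the target server is slow to respond (for example when accessing an overseas server from inside China), use 10 s or more!

Why set timeout=x?

Without it, network latency can leave the program hanging indefinitely: no error is raised, and it simply stalls while waiting on the page. This is especially common in exe programs packaged with pyinstaller!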

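Timeouts

To guard against a server that does not respond in time, most requests to external servers should carry a timeout parameter.

By default, requests does not time out unless a timeout value is set explicitly.

Without a timeout, your code may hang for minutes or longer.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a port on the remote machine (corresponding to the connect() call).

It is good practice to set the connect timeout slightly larger than a multiple of 3 (for example 3.05), because 3 is the default TCP packet retransmission window.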

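With proxy-based scraping we often hit request timeouts: the code just hangs there, raising no error and returning no response.

The usual fix is to add a timeout to the requests.get() call:

req = requests.get(url, headers=headers, proxies=proxies, timeout=5)

If requests still hang for a long time with timeout=5, specify the timeout in finer detail.

After the following change, the problem went away:

req = requests.get(url, headers=headers, proxies=proxies, timeout=(3,7))

timeout limits how long requests waits, and the wait has two parts: the connect time and the read time. timeout=(3,7) means a connect timeout of 3 seconds and a read timeout of 7 seconds. A single value is applied to the connect timeout and the read timeout separately; it is not their sum.

Source: CSDN blogger 「明天依舊可好」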
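To make the two forms of timeout concrete, here is a minimal sketch of how the argument is interpreted. `normalize_timeout` is a hypothetical helper written only for illustration and is not part of requests; it mirrors the documented rule that a single number applies to the connect and read phases separately, while a tuple sets them individually.

```python
def normalize_timeout(timeout):
    """Return (connect, read) the way the requests docs describe `timeout`.

    Hypothetical helper for illustration only; requests handles this internally.
    """
    if timeout is None:
        # No timeout at all: the call may block indefinitely.
        return (None, None)
    if isinstance(timeout, tuple):
        connect, read = timeout  # explicit (connect, read) pair
        return (connect, read)
    # A single number is used for BOTH phases; it is not split between them.
    return (timeout, timeout)


print(normalize_timeout(5))          # (5, 5)
print(normalize_timeout((3.05, 7)))  # (3.05, 7)
```

With requests itself this corresponds to a call such as requests.get(url, timeout=(3.05, 7)). A connect timeout then surfaces as requests.exceptions.ConnectTimeout and a read timeout as requests.exceptions.ReadTimeout, so the two cases can be logged separately.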

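Fix 2: retrying requests on timeout

Retry settings for requests: the error message you are all too familiar with is the read timeout error.

Retrying on timeout cannot eliminate read timeout errors entirely, but it substantially increases how much data you get, keeps an occasional network timeout from costing you a record, and saves you a lot of backfilling later.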

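On a timeout we usually do not give up right away; instead we retry up to three times.

import requests

def gethtml(url):
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
    # All three attempts failed
    return None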
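A common refinement of a fixed retry loop is to wait a little longer after each failure. The following is a minimal sketch of that idea, not something from the original post: `retry_with_backoff` and its parameters are hypothetical names, and the flaky function stands in for a real network call so the example runs without network access.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure.

    Re-raises the last exception if every attempt fails, so the caller can log it.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demonstration with a stand-in for a flaky network call.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, exceptions=(TimeoutError,), sleep=waits.append)
print(result, waits)  # <html>ok</html> [1.0, 2.0]
```

With requests, func could be something like lambda: requests.get(url, timeout=5).text, with exceptions=(requests.exceptions.RequestException,).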

In fact, requests already wraps this up for us (though the code seems to have gotten longer...):

import time
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Retry up to 3 times for both http:// and https:// URLs
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    r = s.get('http://www.google.com.hk', timeout=5)
    print(r.text)
except requests.exceptions.RequestException as e:
    print(e)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

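max_retries is the maximum number of retries. With 3 retries plus the initial request there are 4 attempts in total, so the code above takes 20 seconds to run, not 15:

2020-01-11 15:34:03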

HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))

2020-01-11 15:34:23

Source: 大齡碼農的Python之路
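The 20-second figure follows from simple arithmetic, sketched below. `worst_case_seconds` is a hypothetical helper, and it assumes every attempt runs into the full timeout with no extra delay between attempts. As a caveat, the requests documentation describes an integer max_retries as covering failed DNS lookups, socket connections and connection timeouts, so a read timeout may not be retried this way.

```python
def worst_case_seconds(max_retries, timeout):
    """Upper bound on total wait when every attempt hits the timeout.

    Assumes no backoff delay between attempts.
    """
    attempts = 1 + max_retries  # the initial request plus each retry
    return attempts * timeout


print(worst_case_seconds(3, 5))  # 20
```

This matches the two timestamps in the output, 15:34:03 and 15:34:23.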

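Fix 3: downloading images with urlretrieve()

Fixing incomplete urlretrieve downloads without taking too long

Downloads can fail with ContentTooShortError (urllib.error.ContentTooShortError in Python 3), and re-downloading can take far too long: it often takes several attempts, sometimes more than ten, and occasionally gets stuck in an endless loop, which is far from ideal. To address this, the author uses the socket module to shorten each retry and avoid the endless loop, which improves efficiency.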

The code is as follows:

import socket
import urllib.request

# Set the timeout to 30 s
socket.setdefaulttimeout(30)

# url and image_name are assumed to be defined by the surrounding scraper
# Handle incomplete downloads while avoiding an endless loop
try:
    urllib.request.urlretrieve(url, image_name)
except socket.timeout:
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url, image_name)
            break
        except socket.timeout:
            err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
            print(err_info)
            count += 1
    if count > 5:
        print("downloading picture failed!")

Source: CSDN blogger 「山陰少年」
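The quoted code retries only on socket.timeout, while the problem it describes is ContentTooShortError. A minimal sketch that handles both in one bounded loop might look like this; `download_with_retry` and its `retrieve` parameter are hypothetical names, and the injectable `retrieve` exists only so the function can be exercised without network access.

```python
import os
import socket
import urllib.error
import urllib.request


def download_with_retry(url, filename, attempts=5,
                        retrieve=urllib.request.urlretrieve):
    """Try the download up to `attempts` times; return True on success."""
    for attempt in range(1, attempts + 1):
        try:
            retrieve(url, filename)
            return True
        except (socket.timeout, urllib.error.ContentTooShortError) as exc:
            print(f"attempt {attempt} failed: {exc!r}")
            if os.path.exists(filename):
                os.remove(filename)  # discard the partial file before retrying
    return False


# Typical use (urlretrieve has no timeout argument, so set a global default):
# socket.setdefaulttimeout(30)
# ok = download_with_retry(url, image_name)
```

Because the loop is bounded by attempts, it cannot spin forever, and the boolean result can be written to the failure log.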

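Fix 4: using time.sleep

Python's time.sleep() function suspends the calling thread; the secs argument gives the number of seconds to suspend it for.

Some sites reject requests that arrive too quickly: unless you add a delay of 1-2 s between requests, you will not get any data!

That said, such cases are fairly rare.

Whatever method you use, collecting data reliably comes down to one thing: record the last known state. In other words, your scraper's logging must be thorough!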
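For the 1-2 s pause between requests, one option is to randomize the delay slightly so requests do not arrive at perfectly regular intervals. This is a minimal sketch of that idea; `polite_sleep` is a hypothetical helper, and the jitter is an addition of this example rather than something the text above prescribes.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=2.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds]; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Call between two requests; the returned value can go into the scrape log.
# waited = polite_sleep()
```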

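Appendix:

A complete worked example of backfilling data:

Source code that records exceptions:

# Excerpt from get_img() in the full source below; img_name, img_url, path and ua() are defined there
s = requests.session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))
try:
    print(f">>> Downloading image {img_name} ...")
    r = s.get(img_url, headers=ua(), timeout=15)
    r.raise_for_status()  # treat HTTP error statuses as failures so they are logged too
    with open(f'{path}/{img_name}', 'wb') as f:
        f.write(r.content)
    print(f">>> Image {img_name} downloaded successfully!")
    time.sleep(2)
except requests.exceptions.RequestException as e:
    print(f"Image {img_name} - {img_url} failed to download!")
    # Append a failure record: url,name,path-下載失敗 (download failed)，錯誤代碼 (error code)
    with open(f'{path}/imgspider.txt', 'a+') as f:
        f.write(f'{img_url},{img_name},{path}-下載失敗，錯誤代碼：{e}！\n')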

Image download errors:

Records written to the error log:

https://www.red-dot.org/index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview,1_1_KRELL Automotive.jpg,2019Communication Design/Film & Animation-下載失敗，錯誤代碼：HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview (Caused by ReadTimeoutError("HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out. (read timeout=15)"))！

https://www.red-dot.org/index.php?f=65913&token=8cf9f213e28d0e923e1d7c3ea856210502f57df3&eID=tx_solr_image&size=large&usage=overview,1_2_OLX – Free Delivery.jpg,2019Communication Design/Film & Animation-下載失敗，錯誤代碼：HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out.！

https://www.red-dot.org/index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview,1_3_Dentsu Aegis Network’s Data Training – Data Foundation.jpg,2019Communication Design/Film & Animation-下載失敗，錯誤代碼：HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))！

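Approach to backfilling the data:

Step 1: find the error log files and get their paths

Step 2: open each file and read the relevant records

Step 3: download the images again to fill in the missing data

Key points: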

1. Search for the error log file; here it is imgspider.txt

import os

# Search for files
def search(path, key):
    """
    Search a directory for matching files and return their paths.
    :param path: directory to search
    :param key: keyword to look for in file names
    :return: list of matching file paths
    """
    key_paths = []
    # List the entries (files and folders) in the directory
    allfilelist = os.listdir(path)
    print(allfilelist)
    for filelist in allfilelist:
        if "." not in filelist:
            # No dot in the name: treat the entry as a subdirectory
            filespath = os.path.join(path, filelist)
            files = os.listdir(filespath)
            print(files)
            for file in files:
                if "." not in file:
                    filepath = os.path.join(filespath, file)
                    file_names = os.listdir(filepath)
                    for file_name in file_names:
                        if key in file_name:
                            key_path = os.path.join(filepath, file_name)
                            print(f'Found file at {key_path}')
                            key_paths.append(key_path)
        else:
            # A file directly inside path
            if key in filelist:
                key_path = os.path.join(path, filelist)
                print(f'Found file at {key_path}')
                key_paths.append(key_path)
    return key_paths

This only goes two directory levels deep. It could be rewritten as a recursive function and combined with a GUI to make a simple file search tool!
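As one way to search to any depth, the standard library's os.walk already does the recursion. The sketch below is an alternative to the hand-written loops, not the author's code; `search_recursive` is a hypothetical name.

```python
import os


def search_recursive(root, key):
    """Return paths of all files under `root` whose name contains `key`."""
    matches = []
    # os.walk yields every directory below root, however deeply nested
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if key in name:
                matches.append(os.path.join(dirpath, name))
    return matches


# Example: search_recursive('2019Communication Design', 'imgspider')
```

Because os.walk tells files and directories apart itself, this version does not depend on whether a name contains a dot.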

[Image: file search results]

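2. Processing the image data

The string split function, split

Three pieces of information need to be extracted from each error record:

1. img_url: the image download URL

2. img_name: the image file name

3. path: the directory where the image is stored

for data in datas:
    # Keep only the part before the '-下載失敗' (download failed) marker
    img_data = data.split('-下載失敗')[0]
    img_url = img_data.split(',')[0]
    img_name = img_data.split(',')[1]
    path = img_data.split(',')[2]
    print(img_name, img_url, path)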
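Splitting on every comma breaks as soon as an image name itself contains a comma. A minimal sketch of a more tolerant parser for the same record format follows; `parse_failure_line` and `FAIL_MARKER` are hypothetical names, the sample record is made up for illustration, and the code assumes the URL and the path contain no commas.

```python
FAIL_MARKER = "-下載失敗"  # the "download failed" marker used in the log records


def parse_failure_line(line):
    """Split one log record into (img_url, img_name, path).

    Assumes the URL and the path contain no commas; the name may.
    """
    record, marker, _error = line.partition(FAIL_MARKER)
    if not marker:
        raise ValueError(f"not a failure record: {line!r}")
    img_url, rest = record.split(",", 1)   # URL: everything before the first comma
    img_name, path = rest.rsplit(",", 1)   # path: everything after the last comma
    return img_url, img_name, path


sample = ("https://example.com/a.jpg,1_1_Red, White & Blue.jpg,"
          "2019Communication Design/Posters-下載失敗，錯誤代碼：timeout！\n")
print(parse_failure_line(sample))
```

Here the name "1_1_Red, White & Blue.jpg" stays in one piece, whereas indexing the result of split(',') would cut it in two.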

[Image: backfill results]

Full source code:

# -*- coding: utf-8 -*-
# python3.7
# 20200111 by WeChat: huguo00289

import os
import time
import requests
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter  # transport adapter used to configure retries

# Build the request headers
def ua():
    user_agent = UserAgent()
    headers = {"User-Agent": user_agent.random}
    return headers

# Search for files
def search(path, key):
    """
    Search a directory for matching files and return their paths.
    :param path: directory to search
    :param key: keyword to look for in file names
    :return: list of matching file paths
    """
    key_paths = []
    # List the entries (files and folders) in the directory
    allfilelist = os.listdir(path)
    print(allfilelist)
    for filelist in allfilelist:
        if "." not in filelist:
            # No dot in the name: treat the entry as a subdirectory
            filespath = os.path.join(path, filelist)
            files = os.listdir(filespath)
            print(files)
            for file in files:
                if "." not in file:
                    filepath = os.path.join(filespath, file)
                    file_names = os.listdir(filepath)
                    for file_name in file_names:
                        if key in file_name:
                            key_path = os.path.join(filepath, file_name)
                            print(f'Found file at {key_path}')
                            key_paths.append(key_path)
        else:
            # A file directly inside path
            if key in filelist:
                key_path = os.path.join(path, filelist)
                print(f'Found file at {key_path}')
                key_paths.append(key_path)
    return key_paths

# Collect the paths of the logs that record failed image downloads
def get_pmimgspider():
    img_paths = []
    key = "imgspider"
    categorys = [
        "Advertising", "Annual Reports", "Apps", "Brand Design & Identity", "Brands", "Corporate Design & Identity",
        "Fair Stands", "Film & Animation", "Illustrations", "Interface & User Experience Design",
        "Online", "Packaging Design", "Posters", "Publishing & Print Media", "Retail Design", "Sound Design",
        "Spatial Communication", "Typography", "Red Dot_Junior Award",
    ]
    for category in categorys:
        path = f'2019Communication Design/{category}'
        key_paths = search(path, key)
        img_paths.extend(key_paths)
    print(img_paths)
    return img_paths

# Download an image
def get_img(img_name, img_url, path):
    s = requests.session()
    s.mount('http://', HTTPAdapter(max_retries=3))
    s.mount('https://', HTTPAdapter(max_retries=3))
    try:
        print(f">>> Downloading image {img_name} ...")
        r = s.get(img_url, headers=ua(), timeout=15)
        r.raise_for_status()  # treat HTTP error statuses as failures so they are logged too
        with open(f'{path}/{img_name}', 'wb') as f:
            f.write(r.content)
        print(f">>> Image {img_name} downloaded successfully!")
        time.sleep(2)
    except requests.exceptions.RequestException as e:
        print(f"Image {img_name} - {img_url} failed to download!")
        # Append a failure record: url,name,path-下載失敗 (download failed)，錯誤代碼 (error code)
        with open(f'{path}/imgspider.txt', 'a+') as f:
            f.write(f'{img_url},{img_name},{path}-下載失敗，錯誤代碼：{e}！\n')

# Re-download every image listed in the failure logs
def main2():
    img_paths = get_pmimgspider()
    for img_path in img_paths:
        print(img_path)
        with open(img_path) as f:
            datas = f.readlines()
        print(datas)
        for data in datas:
            # Keep only the part before the '-下載失敗' (download failed) marker
            img_data = data.split('-下載失敗')[0]
            img_url = img_data.split(',')[0]
            img_name = img_data.split(',')[1]
            path = img_data.split(',')[2]
            print(img_name, img_url, path)
            get_img(img_name, img_url, path)

if __name__ == "__main__":
    main2()
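One thing to watch when replaying the log: a download that fails again appends a new record to the same imgspider.txt, so old and new records pile up across runs. A minimal sketch of one way around this is to read the records and empty the file before retrying. `drain_log` is a hypothetical helper, and the utf-8 encoding is an assumption (the source above opens the log with the platform default).

```python
def drain_log(log_path):
    """Read all records from the failure log, then empty the file.

    Records that fail again are expected to be re-appended by the download
    routine, so after a replay the log holds only the still-failing entries.
    """
    with open(log_path, "r+", encoding="utf-8") as f:
        records = [line for line in f if line.strip()]  # skip blank lines
        f.seek(0)
        f.truncate()  # clear the file now that the records are in memory
    return records


# Example: for data in drain_log(img_path): ... parse and re-download ...
```

The trade-off is that the records live only in memory during the replay, so a crash part-way through would lose the entries not yet retried; keeping a backup copy of the log avoids that.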
