
Web Scraping in Practice, Study Notes 2: Network Requests with the urllib Module, Setting Request Headers, Cookies, and Simulated Login

Published: 2024/7/5


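1 The urllib module

1.1 Introduction to the urllib module

Python 3 combines the functionality of the Python 2 urllib and urllib2 modules into a single package named urllib. The Python 3 urllib package contains several submodules:

urllib.request: implements basic HTTP requests.
urllib.error: the exception-handling module; errors raised while sending a network request can be caught and handled through it.
urllib.parse: parses URLs.
urllib.robotparser: parses robots.txt files to determine whether a site's content may be crawled.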
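
As a rough illustration of the two submodules that these notes do not otherwise demonstrate, the sketch below splits a URL with urllib.parse and checks a rule with urllib.robotparser. The URL, the query string, and the robots.txt rules are all made up for the example, and the rules are parsed from text so that nothing is fetched from the network.

```python
import urllib.parse
import urllib.robotparser

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/search?q=python&page=2"

# Split the URL into its components and decode the query string.
parts = urllib.parse.urlparse(url)
query = urllib.parse.parse_qs(parts.query)
print(parts.scheme, parts.netloc, parts.path)
print(query)

# Hypothetical robots.txt rules, parsed from text instead of downloaded.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
robots = urllib.robotparser.RobotFileParser()
robots.parse(robots_lines)

allowed_public = robots.can_fetch("*", "https://www.example.com/search")
allowed_private = robots.can_fetch("*", "https://www.example.com/private/data")
print(allowed_public, allowed_private)
```

For a real site, RobotFileParser.set_url() followed by read() would download the rules instead of parsing them from a list.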

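1.2 Sending network requests with urllib.request.urlopen()

1.2.1 Overview of urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the full URL of the site to visit.
data: defaults to None. This parameter determines the request method: if it is None, the request is a GET; otherwise it is a POST. For a POST request, the form data is usually written as a dict, URL-encoded with urllib.parse.urlencode(), and then converted to bytes before being passed as data.
timeout: the timeout for the request, in seconds.
cafile: a single file containing CA certificates.
capath: a directory containing certificate files.
cadefault: the CA certificate default; this parameter is ignored.
context: an ssl.SSLContext instance describing the SSL options. It replaces cafile, capath, and cadefault, which were deprecated in Python 3.6 and removed in Python 3.13.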
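
The timeout and context parameters can be combined as in the minimal sketch below. The helper name fetch and its default timeout of 5 seconds are assumptions made for this example; ssl.create_default_context() is the standard way to get a context that verifies certificates against the system CA store. The helper is only defined here, not called, so the block itself sends nothing.

```python
import ssl
import urllib.request

# Default context: verifies server certificates and host names.
context = ssl.create_default_context()


def fetch(url, timeout=5.0):
    """Return the response body as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()
```

Calling fetch("https://www.baidu.com/") would then return the raw bytes of the page, raising an exception if the certificate check fails or the timeout expires.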

1.2.2 Sending a GET request

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com/")
print("response:", response)
# Output: response: <http.client.HTTPResponse object at 0x000001AD2793C850>

1.2.3 Getting the status code, response headers, and HTML

import urllib.request

url = "https://www.baidu.com/"
response = urllib.request.urlopen(url=url)

print("Status code:", response.status)
# Output: Status code: 200

print("Response headers:", response.getheaders())
# Output: Response headers: [('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'),
#   ('Content-Length', '227'), ('Content-Type', 'text/html'),
#   ('Date', 'Wed, 09 Mar 2022 10:45:04 GMT'),
#   ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'),
#   ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'),
#   ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'),
#   ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'),
#   ('Set-Cookie', 'BIDUPSID=5C4759402F5A8C38E347A1E6FB8788EF; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
#   ('Set-Cookie', 'PSTM=1646822704; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
#   ('Set-Cookie', 'BAIDUID=5C4759402F5A8C384F12C0C34D5D3B36:FG=1; max-age=31536000; expires=Thu, 09-Mar-23 10:45:04 GMT; domain=.baidu.com; path=/; version=1; comment=bd'),
#   ('Strict-Transport-Security', 'max-age=0'),
#   ('Traceid', '1646822704264784359414774964437731406767'),
#   ('X-Frame-Options', 'sameorigin'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'),
#   ('Connection', 'close')]

print("Selected response header:", response.getheader('Accept-Ranges'))
# Output: Selected response header: bytes

print("HTML of the target page:\n", response.read().decode('utf-8'))
# Output: the contents of the HTML file
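
The example above assumes the page is UTF-8. When that is not certain, one possible approach is to read the charset from the Content-Type response header and fall back to UTF-8. The helper decode_body below is a hypothetical name introduced for this sketch; it relies on response.headers being an email.message.Message subclass, which provides get_content_charset(). A hand-built Message stands in for a real response so the example runs offline.

```python
from email.message import Message


def decode_body(body, headers, fallback="utf-8"):
    """Decode bytes using the charset in Content-Type, else the fallback."""
    charset = headers.get_content_charset() or fallback
    return body.decode(charset, errors="replace")


# Stand-in for response.headers, built by hand for the offline demo.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=gbk"

text = decode_body("你好".encode("gbk"), fake_headers)
print(text)
```

With a live response, the call would look like decode_body(response.read(), response.headers).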

1.2.4 Sending a POST request

By default urlopen() sends a GET request. To send a POST request, set its data parameter. The value must be of type bytes, so convert it with bytes() first.

import urllib.parse
import urllib.request

url = "https://www.baidu.com/"
# Convert the form data to bytes and set the encoding
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf-8')
# Send the request with a timeout of 0.1 s
response = urllib.request.urlopen(url=url, data=data, timeout=0.1)
# Read the HTML and decode it
print(response.read().decode('utf-8'))
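
To see what the data conversion produces, and how the presence of data switches the request method, the following sketch builds the bytes and inspects two Request objects without sending anything. Request is used here only because it exposes get_method(); the form contents are the same as in the example above.

```python
import urllib.parse
import urllib.request

form = {"hello": "python"}

# urlencode() returns a str; encode() turns it into the bytes urlopen() expects.
data = urllib.parse.urlencode(form).encode("utf-8")
print(data)

# No data means GET; supplying data means POST.
get_request = urllib.request.Request("https://www.baidu.com/")
post_request = urllib.request.Request("https://www.baidu.com/", data=data)
print(get_request.get_method())
print(post_request.get_method())
```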

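1.2.5 Handling network timeout exceptions

If a timeout exception is not handled, the crawler stops at that point. In practice, catch the timeout so that the crawler can carry on with the remaining tasks. Taking the request above as an example, set the timeout parameter to 0.1 s, then use try...except to catch the exception and, if it is a timeout, simulate moving on to the next task. A timeout while connecting arrives wrapped in urllib.error.URLError, whereas a timeout while waiting for or reading the response is raised as socket.timeout directly, so both cases are handled.

import socket
import urllib.error
import urllib.request

url = "https://www.baidu.com/"

try:
    response = urllib.request.urlopen(url=url, timeout=0.1)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as error:
    # A timeout while connecting is wrapped in URLError
    if isinstance(error.reason, socket.timeout):
        print("The current task timed out; moving on to the next task")
except socket.timeout:
    # A timeout while reading the response is raised directly
    print("The current task timed out; moving on to the next task")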
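
The idea of moving on to the next task can be sketched as a loop over several URLs. The names is_timeout, fetch, run_tasks, and fake_fetch are hypothetical helpers introduced for this example, and the URLs are placeholders. The fetch function is passed in as a parameter so that the loop can be demonstrated offline with a stand-in that simulates timeouts.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(error):
    """True for a bare timeout or a URLError that wraps one."""
    if isinstance(error, socket.timeout):
        return True
    return (isinstance(error, urllib.error.URLError)
            and isinstance(error.reason, socket.timeout))


def fetch(url, timeout=0.1):
    """Download a page and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def run_tasks(urls, fetch_func=fetch):
    """Fetch each URL in turn, skipping the ones that time out."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch_func(url)
        except (urllib.error.URLError, socket.timeout) as error:
            if is_timeout(error):
                print("Timed out, moving on:", url)
                continue
            raise  # Other errors are not swallowed
    return results


def fake_fetch(url):
    """Offline stand-in that simulates both kinds of timeout."""
    if "slow" in url:
        raise urllib.error.URLError(socket.timeout("timed out"))
    if "stall" in url:
        raise socket.timeout("timed out")
    return "ok"


demo = run_tasks(
    ["http://a.example/", "http://slow.example/",
     "http://stall.example/", "http://b.example/"],
    fake_fetch,
)
print(demo)
```

Calling run_tasks(urls) without the second argument would use the real fetch() and therefore the network.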

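2 Setting request headers

2.1 urllib.request.Request()

urlopen() can issue the most basic requests, but to add headers and other information, construct the request with the Request class.

2.1.1 Constructor signature
It is used as follows:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)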

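2.1.2 Parameters

url: the URL to request.
data: must be of type bytes (a byte stream). A dict can be encoded with urlencode() from the urllib.parse module and then converted to bytes.
headers: a dict of request headers. (1) Headers can be passed directly through the headers parameter when the request is constructed, or added afterwards by calling the add_header() method of the request instance. (2) Request headers are commonly used to imitate a browser: the default User-Agent is Python-urllib, so to imitate Firefox, set User-Agent to Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
origin_req_host: the host name or IP address of the origin of the request.
unverifiable: indicates whether the request is unverifiable in the sense of RFC 2965, that is, whether the user had no option to approve it. Defaults to False and rarely needs to be set.
method: a string naming the HTTP method to use, such as GET, POST, or PUT.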
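
The two ways of supplying headers, and the method parameter, can be tried without sending anything. In the sketch below the URL and the Referer value are placeholders chosen for the example; the User-Agent string is the Firefox one quoted above. Note that Request stores header names capitalized (for example "User-agent"), which is the spelling get_header() expects.

```python
import urllib.request

# Placeholder URL; nothing is sent in this sketch.
url = "https://www.example.com/api"
user_agent = "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"

# Way 1: pass headers when constructing the request.
request = urllib.request.Request(url, headers={"User-Agent": user_agent}, method="PUT")

# Way 2: add a header afterwards.
request.add_header("Referer", "https://www.example.com/")

print(request.get_method())
print(request.get_header("User-agent"))
print(request.header_items())
```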

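2.1.3 Why set request headers

Request headers let a crawler imitate a browser sending a request to the site's backend, which helps avoid some of the server's anti-scraping measures. urlopen() sets no custom request headers of its own, so when a request is sent to a test address, the headers echoed back show the default values.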
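
The default value can be inspected locally, without a test address, by looking at the addheaders attribute of an opener. This is a small sketch rather than part of the original walkthrough; the replacement User-Agent is the Chrome string used later in these notes.

```python
import urllib.request

opener = urllib.request.build_opener()

# Headers that urllib adds to every request made through this opener.
default_headers = list(opener.addheaders)
print(default_headers)

# Replace the default with a browser-like User-Agent for this opener.
opener.addheaders = [(
    "User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
)]
```

Requests made with opener.open(url) would then carry the replacement User-Agent instead of Python-urllib.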

Before setting the request headers, first find a valid set of request headers in a browser. Google Chrome is used as the example here.

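2.1.4 Finding request headers manually

Press F12 to open the developer tools and select the Network tab, then open any web page. Click a request in the request list and look for the request headers under its Headers tab.

2.2 Setting the request headers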

import urllib.parse
import urllib.request

url = "https://www.baidu.com/"  # Request URL
# Request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
# Convert data to bytes and set the encoding
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf-8')
# Create the Request object
url_post = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
# Send the request
response = urllib.request.urlopen(url_post)
# Read the HTML and decode it as UTF-8
print(response.read().decode('utf-8'))

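3 Cookies

A cookie is a marker that the server leaves with the client when it returns a response; the client sends this marker back the next time it visits the server. Typically, after a successful login, some information is stored in the browser's cookies. When the browser visits the site again it sends that information along, and once the server has checked it, the server knows the user is already logged in and can return the logged-in data directly.
When a crawler needs data that is only available after login, it can either simulate the login or obtain the cookie issued after login and send it with later requests, so that the data is fetched as the logged-in user.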
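
One simple way to reuse a cookie obtained elsewhere is to send it in the Cookie request header. In the sketch below the cookie names and values (PHPSESSID, theme) and the URL are invented for illustration, standing in for a string copied from the browser's developer tools. http.cookies.SimpleCookie is used only to show how such a string breaks down into name and value pairs; nothing is sent.

```python
import http.cookies
import urllib.request

# Hypothetical cookie string, as it might be copied from a browser.
cookie_header = "PHPSESSID=abc123; theme=dark"

# Parse the string to inspect the individual cookies.
parsed = http.cookies.SimpleCookie()
parsed.load(cookie_header)
for name, morsel in parsed.items():
    print(name, "=", morsel.value)

# Attach the cookie string to a request through the Cookie header.
request = urllib.request.Request(
    "https://www.example.com/profile",
    headers={"Cookie": cookie_header},
)
print(request.get_header("Cookie"))
```

Passing this request to urllib.request.urlopen() would send the cookies with it. The sections below show the more automatic route through http.cookiejar.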

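3.1 Simulated login

3.1.1 Preparation before login

Target address: site2.rjkflm.com:666

Account: test01test

Password: 123456

3.1.2 Finding the login target URL

This gives the following information: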

Request URL:http://site2.rjkflm.com:666/index/index/login.html

3.1.2 Implementing the simulated login

import urllib.parse
import urllib.request

url = "http://site2.rjkflm.com:666/index/index/chklogin.html"
# Form data, URL-encoded and converted to bytes with the encoding set
data = bytes(urllib.parse.urlencode({'username': 'test01test', 'password': '123456'}), encoding='utf-8')
r = urllib.request.Request(url=url, data=data, method='POST')
response = urllib.request.urlopen(r)  # Send the request
print(response.read().decode('utf-8'))
# Returns: {"status":true,"msg":"登錄成功！"}

3.1.3 Getting cookies

import http.cookiejar
import json
import urllib.parse
import urllib.request

url = "http://site2.rjkflm.com:666/index/index/chklogin.html"
# Form data, URL-encoded and converted to bytes
data = bytes(urllib.parse.urlencode({'username': 'test01test', 'password': '123456'}), encoding='utf-8')

cookie_file = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(cookie_file)  # Create the LWPCookieJar object
# Create the cookie processor
cookie_processor = urllib.request.HTTPCookieProcessor(cookie)
# Create the opener object
opener = urllib.request.build_opener(cookie_processor)
response = opener.open(url, data=data)  # Send the request
result = json.loads(response.read().decode('utf-8'))
# Check the "status" flag instead of comparing the "msg" text; the server
# returns {"status":true,"msg":"登錄成功！"}, so an exact match on a
# differently written message would never succeed
if result['status']:
    cookie.save(ignore_discard=True, ignore_expires=True)  # Save the cookie file

3.1.4 Loading cookies

import urllib.request   # Import the urllib.request module
import http.cookiejar   # Import the http.cookiejar submodule

# URL of the page shown after login
url = 'http://site2.rjkflm.com:666/index/index/index.html'
cookie_file = 'cookie.txt'  # Cookie file

cookie = http.cookiejar.LWPCookieJar()  # Create the LWPCookieJar object
# Read the contents of the cookie file
cookie.load(cookie_file, ignore_expires=True, ignore_discard=True)
# Create the cookie processor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Create the opener object
opener = urllib.request.build_opener(handler)
response = opener.open(url)  # Send the request
print(response.read().decode('utf-8'))  # Print the HTML of the logged-in page
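
The save and load steps can be checked without the test site by writing a hand-made cookie to a temporary file and reading it back. In this sketch the helper make_cookie, the cookie name session_id, its value, and the domain are all invented for illustration; on a real site the server sets the cookie and HTTPCookieProcessor stores it in the jar.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as folder:
    cookie_file = os.path.join(folder, "cookie.txt")

    saved = http.cookiejar.LWPCookieJar(cookie_file)
    saved.set_cookie(make_cookie("session_id", "abc123", "www.example.com"))
    # Session cookies are marked "discard"; ignore_discard=True keeps them.
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(cookie_file, ignore_discard=True, ignore_expires=True)

loaded_cookies = {cookie.name: cookie.value for cookie in loaded}
print(loaded_cookies)
```

This also shows why both examples above pass ignore_discard=True: a login session cookie usually has no expiry date, and without that flag it would be left out of the file or skipped when loading.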
