A Brief Introduction to Python Web Crawlers: Getting Started

Published: 2025/3/12 | python | 豆豆


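1. Types of crawlers:

1.1 General-purpose crawlers: search engines are the typical example. They collect data indiscriminately, extract and store keywords, build an index, and expose a search interface to users.

1.2 Focused crawlers: programs written to crawl data from one specific domain.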

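2. The Robots protocol:

A site publishes a robots.txt file that tells crawlers what they may and may not fetch. It is a gentlemen's agreement and is not backed by law.

Example: http://www.taobao.com/robots.txt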
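As a rough illustration of how a crawler might honor such a file, the sketch below feeds a made-up robots.txt to the standard-library urllib.robotparser and asks whether two URLs may be fetched. The rules, the crawler name MyCrawler and the example.com URLs are all assumptions for the demo, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# 'MyCrawler' is an assumed crawler name; it falls under the '*' rules.
allowed_public = parser.can_fetch('MyCrawler', 'http://example.com/index.html')
allowed_private = parser.can_fetch('MyCrawler', 'http://example.com/private/data.html')

print(allowed_public)   # True
print(allowed_private)  # False
```

Against a live site, one would typically call set_url() with the robots.txt address and then read() instead of parse().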

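3. HTTP requests and responses:

Fetching a web page means accessing it over HTTP. Doing so through a browser is a human action; turning that action into a program gives you a crawler.

Python provides the urllib package for handling URLs and accessing web pages:

urllib.request  # open and read URLs

urllib.error  # exceptions raised by urllib.request

urllib.parse  # parse URLs

urllib.robotparser  # parse robots.txt files

(Python 2 shipped both urllib and urllib2: urllib offered the lower-level interface and urllib2 wrapped it further. Python 3 merged the two into the single standard-library package urllib.)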

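3.1 The urllib.request module

Functions and classes for opening URLs (mostly HTTP), with support for authentication, redirects, cookies and so on.

urlopen function: opens an HTTP connection and fetches the data.

urlopen(url, data=None)

# url is a URL string or a Request object; data is the data to submit. If data is None, a GET request is sent; otherwise a POST request is sent. The return value is an http.client.HTTPResponse object, which is file-like.

# GET: the data travels in the URL, that is, in the request line that opens the HTTP message and not in the body. POST: the data is submitted in the body of the HTTP message. In both cases the data consists of key-value pairs, with parameters joined by '&'; in a GET URL, '?' separates the address from the parameters.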
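The GET/POST rule can be checked without touching the network, because a Request object reports which method it would use. This is a minimal sketch: the /get and /post paths on httpbin.org are assumed endpoints used only to build the objects, and nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({'name': 'danni', 'age': 23})  # 'name=danni&age=23'

# GET: the parameters are appended to the URL after '?'.
get_request = Request('http://httpbin.org/get?' + params)

# POST: the same parameters, encoded to bytes, travel in the body.
post_request = Request('http://httpbin.org/post', data=params.encode('ascii'))

print(get_request.get_method(), get_request.full_url)
# GET http://httpbin.org/get?name=danni&age=23
print(post_request.get_method(), post_request.data)
# POST b'name=danni&age=23'
```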

from urllib.request import urlopen  # opens a URL and returns a file-like response object

response = urlopen('http://www.bing.com')  # data is None, so this is a GET request
print(response.closed)  # False: like a file object, it is not closed automatically

with response:
    print(type(response))  # <class 'http.client.HTTPResponse'>
    print(response.status)  # status code 200 means success
    print(response.reason)  # OK
    print(response._method)  # request method: GET. The leading underscore marks _method as an internal attribute
    print(response.geturl())  # http://cn.bing.com/?setmkt=zh-CN&setmkt=zh-CN, the real URL that 'http://www.bing.com' redirected to
    print(response.info())  # the response headers
    print(response.read())  # the returned page content, as bytes

print(response.closed)  # True: the with statement closes the response, just as it does a file object

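To summarize: urllib.request.urlopen sends an HTTP GET request and the web server returns the page content. The response data is wrapped in a file-like object (response). Its read, readline and readlines methods return the data, the status and reason attributes describe the outcome, and the info method returns the headers.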
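To try the file-like methods without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and reads its response line by line. DemoHandler and its three-line body are hypothetical, and the opener built with an empty ProxyHandler is an assumption made so that any proxy configured in the environment is bypassed; plain urlopen(url) should behave the same when no proxy is set.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves three short lines of text."""

    def do_GET(self):
        body = b'first line\nsecond line\nthird line\n'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

# An empty ProxyHandler ignores proxies configured in the environment.
opener = build_opener(ProxyHandler({}))

with opener.open(url, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.info()['Content-Type']
    first = response.readline()   # one line, as bytes
    rest = response.readlines()   # the remaining lines, as a list of bytes

server.shutdown()
server.server_close()

print(status, reason, content_type)
print(first, rest)
```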

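The code above retrieves the site's response, but calling urlopen with a plain URL string only passes url and data, so there is no way to customize the request headers. In Python's source code the default User-agent value is Python-urllib/3.6 (where 3.6 is the running Python version), so sites with anti-crawler checks can easily recognize it. The fix is to change the User-agent so the request looks like it comes from a browser: copy a real browser's UA value.

Request class: builds a request object whose headers can be modified.

Request(url, data=None, headers={})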

from urllib.request import Request, urlopen
import random

url = 'http://www.bing.com/'  # the URL to visit; it actually redirects (301/302)

# User-Agent values of different browsers. To find one, switch the browser's user agent and read the user-agent value under the Network headers.
ua_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
]

# the UA has to be added to the request headers
ua = random.choice(ua_list)

# request = Request(url, headers={'User-agent': ua})  # does the same as the two lines below
request = Request(url)
request.add_header('User-Agent', ua)
print(type(request))  # <class 'urllib.request.Request'>

response = urlopen(request, timeout=20)  # urlopen accepts either a URL string or a Request object
print(type(response))  # <class 'http.client.HTTPResponse'>

with response:
    print(response.status)  # 200
    print(response.getcode())  # 200
    print(response.reason)  # OK
    print(response.geturl())  # http://cn.bing.com/
    print(response.info())  # header fields such as Cache-Control, Content-Length, Set-Cookie, Connection
    print(response.read())  # the page's HTML

print(request.get_header('User-agent'))  # the disguised browser UA that was chosen, e.g. Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)
print('user-agent'.capitalize())  # User-agent; add_header stores the name capitalized like this, so look it up as 'User-agent'
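The default User-agent and the capitalization rule can both be inspected offline. The sketch below reads the default headers from a freshly built opener and then shows how add_header stores the name; the browser UA string is one of the values from the list above, used purely for illustration, and no request is sent.

```python
from urllib.request import Request, build_opener

# A fresh opener carries the default headers urllib adds to requests.
default_headers = dict(build_opener().addheaders)
default_ua = default_headers['User-agent']  # 'Python-urllib/' plus the Python version

browser_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0'

request = Request('http://www.bing.com/')
request.add_header('User-Agent', browser_ua)  # stored under the key 'User-agent'

print(default_ua)
print(request.header_items())
print(request.get_header('User-agent'))  # the browser UA
print(request.get_header('User-Agent'))  # None: the lookup is case-sensitive
```

When the request is sent, the opener only adds its default User-agent if the request does not already carry one, which is why setting the header on the Request is enough.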

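3.2 The urllib.parse module

Encodes and decodes URLs.

urlencode function: URL encoding. It mainly encodes the parameters sent with a POST or GET request so that, combined with the base URL, they form a valid URL.

urlencode(data)  # data must be a dict or a sequence of two-item tuples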

from urllib.parse import urlencode

u1 = urlencode({'name': 'danni', 'age': 23})
print(u1)  # name=danni&age=23; with GET this string is appended to the URL after '?', with POST it goes in the body

u2 = urlencode({'test': 'http://www.taobao.com?name=danni&age=23'})
print(u2)  # test=http%3A%2F%2Fwww.taobao.com%3Fname%3Ddanni%26age%3D23

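From the output we can see:

1. Slashes, '&', colons, equals signs, question marks and similar characters are all encoded (what follows each % is the hexadecimal value of one byte). Submitted data may itself contain such characters, as in the example above, where they are data and not delimiters. Sent unencoded, they would leave the receiver unable to tell delimiters from data, so these special characters in the data must be URL-encoded to remove the ambiguity.

2. Chinese characters are encoded too: the text is first converted to a byte sequence in the chosen character encoding, and each byte's two-digit hexadecimal value is prefixed with a percent sign.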
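The byte-by-byte rule in point 2 can be reproduced by hand and compared with the library. In this sketch percent_encode is a hypothetical helper written for the comparison; unlike urllib.parse.quote it encodes every byte, including plain ASCII letters, so the two only agree on text such as '中' that has no unreserved characters. The last lines also show urlencode taking a sequence of two-item tuples, with made-up 'tag' values.

```python
from urllib.parse import quote, urlencode


def percent_encode(text, encoding='utf-8'):
    """Hypothetical helper: write every byte of text as %XX."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode(encoding))


manual = percent_encode('中')  # three UTF-8 bytes, each written as %XX
library = quote('中')          # what urllib.parse produces for the same text

# A sequence of two-item tuples keeps repeated keys, which a dict cannot.
pairs = urlencode([('tag', 'a&b'), ('tag', 'c=d')])

print(manual, library)  # %E4%B8%AD %E4%B8%AD
print(pairs)            # tag=a%26b&tag=c%3Dd
```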

unquote function: URL decoding

Search Baidu for the Chinese character “中”: the URL shown in the address bar is https://www.baidu.com/s?wd=中; copied and pasted, it becomes https://www.baidu.com/s?wd=%E4%B8%AD

from urllib.parse import urlencode, unquote

u = urlencode({'wd': '中'})  # urlencode applied to a Chinese character
print(u)  # wd=%E4%B8%AD

url = 'https://www.baidu.com/s?{}'.format(u)
print(url)  # https://www.baidu.com/s?wd=%E4%B8%AD

print('中'.encode('utf-8'))  # b'\xe4\xb8\xad', the UTF-8 bytes of the character
print(unquote(u))  # wd=中, decoded by unquote
print(unquote(url))  # https://www.baidu.com/s?wd=中
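unquote decodes the whole string at once. When the goal is to get the individual parameters back, urlparse and parse_qs from the same module may be more convenient; the short sketch below applies them to the Baidu URL from the example above.

```python
from urllib.parse import parse_qs, urlparse

url = 'https://www.baidu.com/s?wd=%E4%B8%AD'

parts = urlparse(url)          # split the URL into scheme, host, path, query, ...
query = parse_qs(parts.query)  # decode the query; values are lists because a key may repeat

print(parts.netloc, parts.path)  # www.baidu.com /s
print(query)                     # {'wd': ['中']}
```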

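Putting it into practice: suppose we need to query the Bing search engine with a search URL such as http://cn.bing.com/search?q=马哥教育. The program should run a Bing search for the keyword “马哥教育” and save the returned result as an HTML file.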

from urllib.parse import urlencode, unquote
from urllib.request import urlopen, Request

base_url = 'http://cn.bing.com/search'  # http, not https
d = {'q': '马哥教育'}  # note: Bing's query parameter is 'q'
u = urlencode(d)  # q=%E9%A9%AC%E5%93%A5%E6%95%99%E8%82%B2; encoded
url = '{}?{}'.format(base_url, u)  # '?' joins the URL and the query parameters
print(url)  # http://cn.bing.com/search?q=%E9%A9%AC%E5%93%A5%E6%95%99%E8%82%B2
print(unquote(url))  # http://cn.bing.com/search?q=马哥教育; decode it to check for mistakes

# disguise the request as a browser
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
req = Request(url, headers={'User-agent': ua})  # build the HTTP request object
res = urlopen(req)  # pass the request object: open the HTTP connection and fetch the data

with res:
    with open('test_python.html', 'wb') as f:
        f.write(res.read())  # write the returned page content to a new file

print('success')

The task above exercised GET. To test POST, we can use the test site http://httpbin.org/ to check whether the data we post comes back correctly.

from urllib.request import Request, urlopen
from urllib.parse import urlencode
import simplejson  # third-party package (pip install simplejson); the standard-library json module has the same loads()

request = Request('http://httpbin.org/post')  # POST endpoint
request.add_header('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')

data = urlencode({'name': '張三!@#$%^&*(,.)', 'age': '6'})  # URL-encode the POST data that goes in the body; skipping this step is risky
print(data)
print(type(data))  # <class 'str'>

d = data.encode()  # urlopen requires bytes; this is URL-encoded form data, not JSON
res = urlopen(request, data=d)

with res:
    text = res.read()
    print(text)  # includes args, data, files, form, headers and other fields; form holds the posted data
    d = simplejson.loads(text)  # bytes -> dict
    print(d)
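The encode and decode steps around a POST can be rehearsed offline. The sketch below builds the same form body, decodes it the way a server might with parse_qs, and then parses a stand-in JSON response with the standard-library json module. sample_response is a hypothetical payload shaped like an echo of the form; it is not real httpbin output.

```python
import json
from urllib.parse import parse_qs, urlencode

form = {'name': '張三!@#$%^&*(,.)', 'age': '6'}

# str -> bytes: this is what would be passed to urlopen as data=.
body = urlencode(form).encode('ascii')

# Decoding the body recovers the original fields, special characters included.
decoded = {key: values[0] for key, values in parse_qs(body.decode('ascii')).items()}

# Hypothetical response bytes that echo the form back as JSON.
sample_response = json.dumps({'form': decoded}).encode('utf-8')
parsed = json.loads(sample_response)  # json.loads accepts bytes directly

print(body)
print(parsed['form'] == form)  # True
```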

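That wraps up the basic flow of a Python crawler: build a request object disguised as a browser -> open an HTTP connection -> send a POST or GET request -> inspect the response.

In the next article we can start on a simple crawler for the Douban website.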
