网络爬虫 | 基础 (Web Crawlers: The Basics)
Table of Contents
- 1. How crawlers work
- 2. requests
- 2.1 requests.get()
- 2.2 Common attributes of the Response object
- 3. Crawler ethics
- 4. BeautifulSoup
- 5. Examples
- 5.1 Example 1
- 5.2 Example 2
- 5.2.1 Network
- 5.2.2 XHR
- 5.2.3 JSON
- 5.3 Example 3
- 5.3.1 params
- 5.3.2 Request Headers
- 6. Storing data
- 6.1 Storage options
- 6.2 Writing and reading CSV
- 6.3 Writing and reading Excel
- 7. Crawling strategy
- 8. Cookies and sessions
- 8.1 Cookies and how to use them
- 8.2 Sessions and how to use them
- 8.3 Storing cookies
- 8.4 Reading cookies
- 9. selenium
- 10. Coroutines
- 10.1 The gevent library
- 10.2 The queue module
- 10.3 Example
- 11. Scrapy
- To be continued
1. How crawlers work
- First, a crawler can imitate a browser and send requests to a server;
- next, once the server responds, the crawler can parse the returned data on our behalf, just as a browser would;
- then, it can extract the relevant data in bulk according to rules we set, with no manual copying;
- finally, it can store that data locally in bulk.
A crawler's work breaks down into four steps:
- Step 1: Fetch the data. The crawler sends a request to the server for the URL we provide, and the server returns data.
- Step 2: Parse the data. The crawler parses the server's response into a format we can read.
- Step 3: Extract the data. The crawler picks out the pieces of data we actually need.
- Step 4: Store the data. The crawler saves the useful data for later use and analysis.
The sketch right after this list shows how the libraries covered below map onto these four steps.
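A minimal sketch of the four steps using requests, BeautifulSoup, and the csv module (all introduced later in this article); the URL and the CSS class in it are placeholders, not a real target site:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Step 1: fetch -- send a request and receive the server's response
res = requests.get('https://example.com/books')          # placeholder URL

# Step 2: parse -- turn the raw HTML into a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')

# Step 3: extract -- pick out only the pieces we need
titles = [tag.text for tag in soup.find_all('h2', class_='title')]   # hypothetical class name

# Step 4: store -- save the extracted data locally
with open('titles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for title in titles:
        writer.writerow([title])
```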
2. requests
The requests library can download a page's source code, text, images, and even audio for us. "Downloading" is really just sending a request to the server and receiving its response.
2.1 requests.get()
```python
import requests
res = requests.get('URL')
```
- import requests: import the requests library.
- res = requests.get('URL'): requests.get() sends the request and receives the server's response. The server returns a Response object, which we store in the variable res.
2.2 Common attributes of the Response object
res is an object of the requests.models.Response class.
```python
import requests
res = requests.get('URL')
print(type(res))
```
Output:
```
<class 'requests.models.Response'>
```
The four most commonly used attributes of a Response object:

| Attribute | Purpose |
| --- | --- |
| response.status_code | check whether the request succeeded |
| response.content | the response body as binary data |
| response.text | the response body as a string |
| response.encoding | set the encoding used to decode the response |
response.content example:
```python
import requests
res = requests.get('xxx.png')
pic = res.content
photo = open('ppt.jpg', 'wb')
photo.write(pic)
photo.close()
```
response.text example:
```python
import requests
res = requests.get('xxx.md')
novel = res.text
print(novel[:800])
```
res.encoding example:
```python
import requests
res = requests.get('xxx.md')
res.encoding = 'utf-8'
novel = res.text
print(novel[:800])
```
3. Crawler ethics
- The Robots protocol is a widely accepted ethical standard for web crawlers. Its full name is the ==Robots Exclusion Protocol==, and it tells crawlers which pages may be fetched and which may not.
- To view a site's robots protocol, just append ==/robots.txt== to the site's domain.
```
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /

……

User-Agent: *
Disallow: /
```
- Allow means a path may be crawled; Disallow means it must not be. The sketch below shows one way to check this programmatically.
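As an aside not in the original article: Python's standard library ships urllib.robotparser, which reads a robots.txt and answers whether a given user agent may fetch a given URL. The domain below is only a placeholder.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')   # placeholder domain
rp.read()                                      # download and parse robots.txt

# Ask whether a specific user agent may fetch a specific path
print(rp.can_fetch('Baiduspider', 'https://example.com/article'))
print(rp.can_fetch('*', 'https://example.com/product/'))
```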
4. BeautifulSoup
Use BeautifulSoup to parse web pages and extract data from them.
- "Extracting data" means picking the data we need out of everything the page contains.
4.1 Parsing data
How BeautifulSoup parses data:
```python
bs_object = BeautifulSoup(text_to_parse, 'parser_name')
```
Example:
```python
import requests
from bs4 import BeautifulSoup

res = requests.get('xxx.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')
```
- soup's type is <class 'bs4.BeautifulSoup'>, which tells us soup is a BeautifulSoup object.
Note:
response.text and soup print identical-looking content, but they belong to different classes: <class 'str'> versus <class 'bs4.BeautifulSoup'>. The former is a string; the latter is a parsed BeautifulSoup object. They print the same text because printing a BeautifulSoup object calls its __str__ method, so what you see is the return value of __str__.
4.2 Extracting data
- find() and find_all()
find() and find_all() are two methods of a BeautifulSoup object. They match HTML tags and attributes and pull every matching piece of data out of the BeautifulSoup object.

| Method | Purpose | Usage | Example |
| --- | --- | --- | --- |
| find() | returns the first match | BeautifulSoup_object.find(tag, attribute) | soup.find('div', class_='books') |
| find_all() | returns all matches | BeautifulSoup_object.find_all(tag, attribute) | soup.find_all('div', class_='books') |
Example:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')
print(type(item))
print(item)
```
Output: the first <div> element, plus its type, <class 'bs4.element.Tag'>, which tells us it is a Tag object.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')
print(type(items))
print(items)
```
Output: three <div> elements, together forming a list-like structure. Printing the type of items shows <class 'bs4.element.ResultSet'>, a ResultSet object, which is in effect Tag objects stored in a list, so it can be treated like a list.

| Attribute/Method | Purpose |
| --- | --- |
| Tag.find() and Tag.find_all() | extract Tags nested inside a Tag |
| Tag.text | extract the text inside a Tag |
| Tag['attribute_name'] | given an attribute name, extract that attribute's value from the Tag |
Example:
- Scrape the category, link, title, and blurb of every book on the 書苑不太冷 demo site.
```python
import requests
from bs4 import BeautifulSoup

res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all(class_='books')
for item in items:
    kind = item.find('h2')
    title = item.find(class_='title')
    brief = item.find(class_='info')
    print(kind, '\n', title, '\n', brief)
    print(type(kind), type(title), type(brief))
```
- Use Tag.text to pull the text out of a Tag object, and Tag['href'] to pull out the URL.
5. Examples
5.1 Example 1
- Write a loop that extracts every dish name, URL, and ingredient list on the page and stores them in a list shaped like this:
```
[[dish_A, URL_A, ingredients_A], [dish_B, URL_B, ingredients_B], [dish_C, URL_C, ingredients_C]]
```
```python
import requests
from bs4 import BeautifulSoup

res_foods = requests.get('http://www.xiachufang.com/explore/')
bs_foods = BeautifulSoup(res_foods.text, 'html.parser')
list_foods = bs_foods.find_all('div', class_='info pure-u')
list_all = []
for food in list_foods:
    tag_a = food.find('a')
    name = tag_a.text[17:-13]
    URL = 'http://www.xiachufang.com' + tag_a['href']
    tag_p = food.find('p', class_='ing ellipsis')
    ingredients = tag_p.text[1:-1]
    list_all.append([name, URL, ingredients])
print(list_all)
```
5.2 Example 2
```python
import requests
from bs4 import BeautifulSoup

res_music = requests.get('https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6')
bs_music = BeautifulSoup(res_music.text, 'html.parser')
list_music = bs_music.find_all('a', class_='js_song')
for music in list_music:
    print(music['title'])
```
The code above does not work, and the Network panel explains why.
5.2.1 Network
- The Network panel records every request made on the current page.
- On the left of row 0, the red button toggles Network monitoring (highlighted, i.e. on, by default); the grey circle clears the information in the panel.
- On the right, the Preserve log checkbox keeps the request log across navigations. If it is not checked, the log is cleared whenever the page jumps to another URL, so check it when crawling pages that redirect.
- Row 3 filters requests by type. The most commonly used filters are:
- ALL (show everything)
- XHR (objects that transfer data without reloading the page)
- Doc (Document; request 0 is usually here). Occasionally also useful:
- Img (images only) / Media (media files only)
- Other.
The code we just wrote reproduces only one of the page's 52 requests (request 0, to be precise), and that request does not contain the song list.
5.2.2 XHR
- Ajax: the obvious benefit of this technique is that page content can be updated without reloading the whole page.
- While Ajax is at work it creates an XHR (or Fetch) object and uses it to move data between the server and the browser. For our purposes there is no essential difference between XHR and Fetch.
The song list should be in the client_search request under XHR.
- Headers: the request metadata
- Preview: a formatted preview of the response
- Response: the raw response
- Timing: timing information
The Request URL shown under General is the link we should actually request.
The XHR's response is a dictionary at the outermost level: the key data maps to another dictionary; inside it, the key song maps to a dictionary; inside that, the key list maps to a list of 20 elements; each element is a dictionary, and in each of those the key name holds the song title.
```python
import requests

res = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
print(res.text)
```
res.text gives you a string, not a list or dictionary.
5.2.3 JSON
- JSON is a special kind of string, special because it is written using list/dictionary syntax.
- JSON stores information in an organized way.
- The json() method converts the Response object into a list/dictionary.
```python
import requests

res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
json_music = res_music.json()
list_music = json_music['data']['song']['list']
for music in list_music:
    print(music['name'])
```
```python
import requests

res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
json_music = res_music.json()
list_music = json_music['data']['song']['list']
for music in list_music:
    print(music['name'])
    print('所屬專輯:' + music['album']['name'])
    print('播放時長:' + str(music['interval']) + '秒')
    print('播放鏈接:https://y.qq.com/n/yqq/song/' + music['mid'] + '.html\n\n')
```
5.3 Example 3
Query String Parameters are the parameters carried in the URL's query string.
- Goal: crawl several pages of songs.
- Clicking to page 2 and page 3 makes Network load two more XHRs, both named client_search….
- Only one parameter differs between them, so p must be the page number.
- Change the value of p on each pass through the loop.
```python
import requests

for x in range(5):
    res_music = requests.get('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=60997426243444153&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=' + str(x + 1) + '&n=20&w=%E5%91%A8%E6%9D%B0%E4%BC%A6&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
    json_music = res_music.json()
    list_music = json_music['data']['song']['list']
    for music in list_music:
        print(music['name'])
        print('所屬專輯:' + music['album']['name'])
        print('播放時長:' + str(music['interval']) + '秒')
        print('播放鏈接:https://y.qq.com/n/yqq/song/' + music['mid'] + '.html\n\n')
```
5.3.1 params
requests.get() accepts a parameter called params, which lets us pass the query parameters in as a dictionary.
- Copy the contents of Query String Parameters, wrap them in a dictionary, and pass that dictionary to params.
```python
import requests

url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):
    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'sizer.yqq.song_next',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': str(x + 1),
        'n': '20',
        'w': '周杰倫',
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }
    res_music = requests.get(url, params=params)
    json_music = res_music.json()
    list_music = json_music['data']['song']['list']
    for music in list_music:
        print(music['name'])
        print('所屬專輯:' + music['album']['name'])
        print('播放時長:' + str(music['interval']) + '秒')
        print('播放鏈接:https://y.qq.com/n/yqq/song/' + music['mid'] + '.html\n\n')
```
- Crawling the comments
- Finding the comments request: clear the Network panel, then click to the next page of comments and look at which XHR newly appears; that one should be the comments request.
- Paging through the comments reveals:
- two XHR parameters keep changing: pagenum and lasthotcommentid;
- lasthotcommentid is the comment id of the last hot comment on the previous page.
```python
import requests

url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg'
commentid = ''
for x in range(5):
    params = {
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'GB2312',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0',
        'cid': '205360772',
        'reqtype': '2',
        'biztype': '1',
        'topid': '102065756',
        'cmd': '8',
        'needcommentcrit': '0',
        'pagenum': str(x),
        'pagesize': '25',
        'lasthotcommentid': commentid,
        'domain': 'qq.com',
        'ct': '24',
        'cv': '101010 '
    }
    res_comment = requests.get(url, params=params)
    json_comment = res_comment.json()
    list_comment = json_comment['comment']['commentlist']
    for comment in list_comment:
        print(comment['rootcommentcontent'])
    commentid = list_comment[24]['commentid']
```
5.3.2 Request Headers
How does the server tell whether a request comes from a real browser or from a Python crawler?
- Through the Request Headers.
- user-agent records your computer's details and browser version.
- origin and referer record which page the request originally came from.
- The difference is that referer carries somewhat more information than origin.
- If you do not change user-agent, its default value identifies the client as Python, and the server can spot the crawler.
Set origin or referer as needed and write them, together with user-agent, into a headers dictionary:
```python
import requests

url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg'
headers = {
    'origin': 'https://y.qq.com',
    'referer': 'https://y.qq.com/n/yqq/song/004Z8Ihr0JIu5s.html',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
params = {
    'g_tk': '5381',
    'loginUin': '0',
    'hostUin': '0',
    'format': 'json',
    'inCharset': 'utf8',
    'outCharset': 'GB2312',
    'notice': '0',
    'platform': 'yqq.json',
    'needNewCode': '0',
    'cid': '205360772',
    'reqtype': '2',
    'biztype': '1',
    'topid': '102065756',
    'cmd': '8',
    'needcommentcrit': '0',
    'pagenum': 0,
    'pagesize': '25',
    'lasthotcommentid': '',
    'domain': 'qq.com',
    'ct': '24',
    'cv': '101010 '
}
res_music = requests.get(url, headers=headers, params=params)
```
6. Storing data
6.1 Storage options
```python
file = open('test.csv', 'a+')
file.write('美國隊長,鋼鐵俠,蜘蛛俠')
file.close()
```
6.2 Writing and reading CSV
Using a ready-made module is more convenient than reading and writing with open() directly.
```python
import csv

csv_file = open('demo.csv', 'w', newline='', encoding='utf-8')
```
- 'w' is write mode: newly written content overwrites whatever the file contained.
- newline='' prevents the CSV file from ending up with blank lines (double spacing) between rows.
- encoding='utf-8' avoids errors or mojibake caused by encoding problems.
Writing a CSV file:
- use csv.writer() to build a writer object;
- call the writer object's writerow() method;
- writerow() takes a list as its argument.
```python
import csv

csv_file = open('demo.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(csv_file)
writer.writerow(['電影', '豆瓣評分'])
writer.writerow(['銀河護衛隊', '8.0'])
writer.writerow(['復仇者聯盟', '8.1'])
csv_file.close()
```
Reading a CSV file:
```python
import csv

csv_file = open('demo.csv', 'r', newline='', encoding='utf-8')
reader = csv.reader(csv_file)
for row in reader:
    print(row)
```
csv official documentation
6.3 Writing and reading Excel
Writing:
Import openpyxl; create a new workbook with openpyxl.Workbook(); get a worksheet; write into individual cells; to write a whole row into the worksheet, use append(); finally, save the Excel file.
```python
import openpyxl

# Writing a workbook
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'new title'
sheet['A1'] = '漫威宇宙'
rows = [['美國隊長', '鋼鐵俠', '蜘蛛俠'], ['是', '漫威', '宇宙', '經典', '人物']]
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

# Reading the workbook back
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames
print(sheetname)
A1_cell = sheet['A1']
A1_value = A1_cell.value
print(A1_value)
```
openpyxl official documentation
6.3.1 Example
```python
import requests, openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'restaurants'
sheet['A1'] = '歌曲名'
sheet['B1'] = '所屬專輯'
sheet['C1'] = '播放時長'
sheet['D1'] = '播放鏈接'

url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):
    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'sizer.yqq.song_next',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': str(x + 1),
        'n': '20',
        'w': '周杰倫',
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }
    res_music = requests.get(url, params=params)
    json_music = res_music.json()
    list_music = json_music['data']['song']['list']
    for music in list_music:
        name = music['name']
        album = music['album']['name']
        time = music['interval']
        link = 'https://y.qq.com/n/yqq/song/' + str(music['file']['media_mid']) + '.html\n\n'
        sheet.append([name, album, time, link])
        print('歌曲名:' + name + '\n' + '所屬專輯:' + album + '\n' + '播放時長:' + str(time) + '\n' + '播放鏈接:' + link)

wb.save('Jay.xlsx')
```
7. Crawling strategy
8. Cookies and sessions
8.1 Cookies and how to use them
- When you log in to a website, the login page usually offers a "remember me" checkbox. If you tick it, the site logs you in automatically the next time you open it; that is cookies at work.
- When you tick "remember me", the server generates a cookie bound to your account, sends it to your browser, and the browser stores it on your computer. The next time the browser visits the blog with that cookie attached, the server knows which account you are.
8.1.1 Example
Posting a blog comment:
- POST a login request carrying the credentials;
- take the cookies from the login response;
- send the comment request with those cookies attached.
```python
import requests

url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}
data = {
    'log': 'spiderman',
    'pwd': 'crawler334566',
    'wp-submit': '登錄',
    'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
    'testcookie': '1'
}
login_in = requests.post(url, headers=headers, data=data)
cookies = login_in.cookies

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
data_1 = {
    'comment': input('請輸入你想要發表的評論:'),
    'submit': '發表評論',
    'comment_post_ID': '7',
    'comment_parent': '0'
}
comment = requests.post(url_1, headers=headers, data=data_1, cookies=cookies)
print(comment.status_code)
```
8.2 Sessions and how to use them
- A session is the information the server keeps about a particular user for the duration of a browsing session.
- The cookies store the session's encoded identifier, and the session in turn stores information about the cookies.
- The first time a browser visits a page, the server returns a Set-Cookie field, and the browser saves the cookies locally.
- On the next visit the browser sends the cookies along with the request; because they carry the session identifier, the server recognizes the user immediately and returns the session specific to that user.
8.2.1 Example
Posting a blog comment:
```python
import requests

session = requests.session()
url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
data = {
    'log': input('請輸入賬號:'),
    'pwd': input('請輸入密碼:'),
    'wp-submit': '登錄',
    'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
    'testcookie': '1'
}
session.post(url, headers=headers, data=data)

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
data_1 = {
    'comment': input('請輸入你想要發表的評論:'),
    'submit': '發表評論',
    'comment_post_ID': '7',
    'comment_parent': '0'
}
comment = session.post(url_1, headers=headers, data=data_1)
print(comment)
```
8.3 Storing cookies
- Convert the cookies to a dictionary, then serialize that dictionary to a string with the json module.
```python
import requests, json

session = requests.session()
url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
data = {
    'log': input('請輸入你的賬號:'),
    'pwd': input('請輸入你的密碼:'),
    'wp-submit': '登錄',
    'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
    'testcookie': '1'
}
session.post(url, headers=headers, data=data)

cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
print(cookies_dict)
cookies_str = json.dumps(cookies_dict)
print(cookies_str)
f = open('cookies.txt', 'w')
f.write(cookies_str)
f.close()
```
8.4 Reading cookies
- When storing the cookies we converted them to a dictionary and then to a string.
- To read them back, reverse the process: parse the string into a dictionary with the json module, then convert the dictionary back into cookies in their original cookiejar format.
Reading code:
```python
# Assumes the requests/json imports and the session object from the previous snippet.
cookies_txt = open('cookies.txt', 'r')
cookies_dict = json.loads(cookies_txt.read())
cookies = requests.utils.cookiejar_from_dict(cookies_dict)
session.cookies = cookies
```
8.4.1 Example
Posting a blog comment:
- If the program can read the stored cookies, it logs in automatically and posts the comment; if it cannot, it asks for the username and password, logs in, and then posts.
```python
import requests, json

session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}
try:
    cookies_txt = open('cookies.txt', 'r')
    cookies_dict = json.loads(cookies_txt.read())
    cookies = requests.utils.cookiejar_from_dict(cookies_dict)
    session.cookies = cookies
except FileNotFoundError:
    url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
    data = {
        'log': input('請輸入你的賬號:'),
        'pwd': input('請輸入你的密碼:'),
        'wp-submit': '登錄',
        'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
        'testcookie': '1'
    }
    session.post(url, headers=headers, data=data)
    cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
    cookies_str = json.dumps(cookies_dict)
    f = open('cookies.txt', 'w')
    f.write(cookies_str)
    f.close()

url_1 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
data_1 = {
    'comment': input('請輸入你想評論的內容:'),
    'submit': '發表評論',
    'comment_post_ID': '7',
    'comment_parent': '0'
}
comment = session.post(url_1, headers=headers, data=data_1)
print(comment.status_code)
```
- To handle cookie expiry, add a check: if the cookies have expired, fetch new ones.
```python
import requests, json

session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}

def cookies_read():
    cookies_txt = open('cookies.txt', 'r')
    cookies_dict = json.loads(cookies_txt.read())
    cookies = requests.utils.cookiejar_from_dict(cookies_dict)
    return cookies

def sign_in():
    url = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-login.php'
    data = {
        'log': input('請輸入你的賬號'),
        'pwd': input('請輸入你的密碼'),
        'wp-submit': '登錄',
        'redirect_to': 'https://wordpress-edu-3autumn.localprod.forc.work/wp-admin/',
        'testcookie': '1'
    }
    session.post(url, headers=headers, data=data)
    cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
    cookies_str = json.dumps(cookies_dict)
    f = open('cookies.txt', 'w')
    f.write(cookies_str)
    f.close()

def write_message():
    url_2 = 'https://wordpress-edu-3autumn.localprod.forc.work/wp-comments-post.php'
    data_2 = {
        'comment': input('請輸入你要發表的評論:'),
        'submit': '發表評論',
        'comment_post_ID': '7',
        'comment_parent': '0'
    }
    return session.post(url_2, headers=headers, data=data_2)

try:
    session.cookies = cookies_read()
except FileNotFoundError:
    sign_in()
    session.cookies = cookies_read()

num = write_message()
if num.status_code == 200:
    print('成功啦!')
else:
    sign_in()
    session.cookies = cookies_read()
    num = write_message()
```
9. selenium
Official Chinese documentation
```python
from selenium import webdriver

driver = webdriver.Chrome()
```
Browser setup for the course's teaching environment:
```python
from selenium.webdriver.chrome.webdriver import RemoteWebDriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities())
```
```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')
driver.close()
```
- Parsing and extracting data
- What selenium parses and extracts is everything in Elements, whereas BeautifulSoup only parses the response to request 0 in Network.
- Once selenium has opened the page, all of the information has been loaded into Elements.
Example: parse the page and extract the content of one of its elements.
```python
from selenium.webdriver.chrome.webdriver import RemoteWebDriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities())
driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')
time.sleep(2)
label = driver.find_element_by_tag_name('label')
print(label.text)
driver.close()
```
- selenium's extraction methods
- the data they return is a WebElement object,
- whereas BeautifulSoup gives you a Tag object.

| Method | Purpose |
| --- | --- |
| find_element_by_tag_name | select an element by tag name |
| find_element_by_class_name | select an element by its class attribute |
| find_element_by_id | select an element by its id |
| find_element_by_name | select an element by its name attribute |
| find_element_by_link_text | select a link by its full link text |
| find_element_by_partial_link_text | select a link by part of its link text |
Example: get the page source of the 你好,蜘蛛俠! (Hello, Spider-Man!) page.
```python
from selenium.webdriver.chrome.webdriver import RemoteWebDriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities())
driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')
time.sleep(2)
pageSource = driver.page_source
print(type(pageSource))
print(pageSource)
driver.close()
```
- .send_keys(): type text into an element
- .click(): click an element
```python
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://localprod.pandateacher.com/python-manuscript/hello-spiderman/')
time.sleep(2)
teacher = driver.find_element_by_id('teacher')
teacher.send_keys('必須是吳楓呀')
assistant = driver.find_element_by_name('assistant')
assistant.send_keys('都喜歡')
button = driver.find_element_by_class_name('sub')
button.click()
driver.close()
```
9.1 Examples
1. Use selenium to scrape the comments on the QQ Music song 甜甜的.
- Once selenium has opened the page in a browser, the data has already been loaded into Elements.
```python
from selenium.webdriver.chrome.webdriver import RemoteWebDriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities())
driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')
time.sleep(2)

comments = driver.find_element_by_class_name('js_hot_list').find_elements_by_class_name('js_cmt_li')
print(len(comments))
for comment in comments:
    sweet = comment.find_element_by_tag_name('p')
    print('評論:%s\n ---\n' % sweet.text)
driver.close()
```
```python
from selenium.webdriver.chrome.webdriver import RemoteWebDriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = RemoteWebDriver("http://chromedriver.python-class-fos.svc:4444/wd/hub", chrome_options.to_capabilities())
driver.get('https://y.qq.com/n/yqq/song/000xdZuV2LcQ19.html')
time.sleep(2)

button = driver.find_element_by_class_name('js_get_more_hot')
button.click()
time.sleep(2)
pageSource = driver.page_source
```
2. Scheduling and email
- Automatically crawl each day's weather and send the forecast plus clothing advice to your inbox at a fixed time every day; one possible shape for this is sketched below.
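The original article gives no code for this part. The following is only a sketch of one possible approach, assuming the third-party schedule library (pip install schedule) for timing and the standard smtplib/email modules for mail; the weather function, SMTP server, and account details are all placeholders.

```python
import smtplib
import time
from email.header import Header
from email.mime.text import MIMEText

import schedule  # third-party: pip install schedule


def get_weather():
    # Placeholder: a real version would request a weather page or API
    # and parse out today's forecast plus a clothing tip.
    return '今日多雲,15~22℃,建議穿薄外套。'


def send_mail(content):
    # All account details below are placeholders.
    smtp_server = 'smtp.example.com'
    sender = 'me@example.com'
    password = 'app-password'
    receiver = 'you@example.com'

    msg = MIMEText(content, 'plain', 'utf-8')
    msg['Subject'] = Header('每日天氣提醒', 'utf-8')

    server = smtplib.SMTP_SSL(smtp_server, 465)
    server.login(sender, password)
    server.sendmail(sender, [receiver], msg.as_string())
    server.quit()


def job():
    send_mail(get_weather())


schedule.every().day.at('07:30').do(job)   # run once a day at 07:30
while True:
    schedule.run_pending()
    time.sleep(1)
```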
10. Coroutines
- Asynchronous: other tasks can run while one task is still unfinished; they do not block each other.
- Synchronous: one task must finish before the next one can start.
- Multi-coroutine: when a task hits a wait, switch to other tasks first, then come back and continue the original task once the wait is over.
10.1 The gevent library
Example (synchronous baseline):
```python
import requests, time

start = time.time()
url_list = ['https://www.baidu.com/',
            'https://www.sina.com.cn/',
            'http://www.sohu.com/',
            'https://www.qq.com/',
            'https://www.163.com/',
            'http://www.iqiyi.com/',
            'https://www.tmall.com/',
            'http://www.ifeng.com/']
for url in url_list:
    r = requests.get(url)
end = time.time()
print(end - start)   # total elapsed time for the synchronous version
```
- The time module is imported to record when the program starts and ends.
- Now import the monkey module from the gevent library and rewrite the crawl with coroutines:
```python
from gevent import monkey
monkey.patch_all()
import gevent, time, requests

start = time.time()
url_list = ['https://www.baidu.com/',
            'https://www.sina.com.cn/',
            'http://www.sohu.com/',
            'https://www.qq.com/',
            'https://www.163.com/',
            'http://www.iqiyi.com/',
            'https://www.tmall.com/',
            'http://www.ifeng.com/']

def crawler(url):
    r = requests.get(url)
    print(url, time.time() - start, r.status_code)

tasks_list = []
for url in url_list:
    task = gevent.spawn(crawler, url)
    tasks_list.append(task)
gevent.joinall(tasks_list)
end = time.time()
print(end - start)
```
10.2 The queue module
When a multi-coroutine crawler needs to create a large number of tasks, the queue module helps.
```python
from gevent import monkey
monkey.patch_all()
import gevent, time, requests
from gevent.queue import Queue
```
- Part 2: create the queue and put the tasks into it.
```python
start = time.time()
url_list = ['https://www.baidu.com/',
            'https://www.sina.com.cn/',
            'http://www.sohu.com/',
            'https://www.qq.com/',
            'https://www.163.com/',
            'http://www.iqiyi.com/',
            'https://www.tmall.com/',
            'http://www.ifeng.com/']

work = Queue()
for url in url_list:
    work.put_nowait(url)
```
- Part 3: define the crawl function and pull the stored URLs back out of the queue.
```python
def crawler():
    while not work.empty():
        url = work.get_nowait()
        r = requests.get(url)
        print(url, work.qsize(), r.status_code)
```
- Part 4: run the crawler with multiple coroutines to fetch the 8 sites in the queue.
```python
def crawler():
    while not work.empty():
        url = work.get_nowait()
        r = requests.get(url)
        print(url, work.qsize(), r.status_code)

tasks_list = []
for x in range(2):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)
end = time.time()
print(end - start)
```
10.3 Example
Use multiple coroutines to crawl food calorie data from the boohee (薄荷網) site.
- Use for loops to build the URLs of the first 3 pages of the first 3 common food categories, plus the first 3 pages of the 11th category, put these URLs into the queue, and print the queue.
```python
import gevent, requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)

url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1, 4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)
print(work)
```
```python
import gevent, requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)

url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1, 4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)

def crawler():
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
    while not work.empty():
        url = work.get_nowait()
        res = requests.get(url, headers=headers)
        bs_res = bs4.BeautifulSoup(res.text, 'html.parser')
        foods = bs_res.find_all('li', class_='item clearfix')
        for food in foods:
            food_name = food.find_all('a')[1]['title']
            food_url = 'http://www.boohee.com' + food.find_all('a')[1]['href']
            food_calorie = food.find('p').text
            print(food_name)

tasks_list = []
for x in range(5):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)
```
```python
import gevent, requests, bs4, csv
from gevent.queue import Queue
from gevent import monkey
monkey.patch_all()

work = Queue()
url_1 = 'http://www.boohee.com/food/group/{type}?page={page}'
for x in range(1, 4):
    for y in range(1, 4):
        real_url = url_1.format(type=x, page=y)
        work.put_nowait(real_url)

url_2 = 'http://www.boohee.com/food/view_menu?page={page}'
for x in range(1, 4):
    real_url = url_2.format(page=x)
    work.put_nowait(real_url)

def crawler():
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
    while not work.empty():
        url = work.get_nowait()
        res = requests.get(url, headers=headers)
        bs_res = bs4.BeautifulSoup(res.text, 'html.parser')
        foods = bs_res.find_all('li', class_='item clearfix')
        for food in foods:
            food_name = food.find_all('a')[1]['title']
            food_url = 'http://www.boohee.com' + food.find_all('a')[1]['href']
            food_calorie = food.find('p').text
            writer.writerow([food_name, food_calorie, food_url])
            print(food_name)

csv_file = open('boohee.csv', 'w', newline='')
writer = csv.writer(csv_file)
writer.writerow(['食物', '熱量', '鏈接'])

tasks_list = []
for x in range(5):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)
```
11. Scrapy
- Scrapy Engine: the engine that coordinates the other components.
- Scheduler: handles the requests objects the engine sends over (all of the information about a page request: params, data, cookies, request headers, and so on), queues the requested URLs in order, and waits for the engine to come and fetch them (functionally similar to the gevent library's queue module).
- Downloader: handles the requests the engine sends over, fetches the pages, and hands the responses (the crawled content) back to the engine. It corresponds to the "fetch the data" step of the crawler workflow.
- Spiders: the core of the system. They create requests objects and receive the responses the engine passes along (the content the Downloader fetched), then parse them and extract the useful data. They correspond to the "parse the data" and "extract the data" steps.
- Item Pipeline: stores and processes the useful data the Spiders extract. It corresponds to the "store the data" step. A minimal spider sketch follows below.
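The article ends before showing any Scrapy code. As an illustration only (not from the original), here is a minimal spider sketch; the spider name, target URL, and CSS selector are placeholders:

```python
import scrapy


class DoubanSpider(scrapy.Spider):
    # The spider creates the initial requests and parses the responses;
    # the engine, scheduler, and downloader do their work behind the scenes.
    name = 'douban'                                  # placeholder spider name
    start_urls = ['https://example.com/top250']      # placeholder URL

    def parse(self, response):
        # Extract data from the response; each yielded dict is an item
        # that the Item Pipeline can store or process further.
        for title in response.css('div.hd a span::text').extract():   # placeholder selector
            yield {'title': title}
```

Inside a Scrapy project, a spider like this would be run with `scrapy crawl douban`, and the yielded items would flow through the Item Pipeline described above.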
To be continued