III. Scrapy crawler framework: simulated login with scrapy
Simulated login with scrapy

Learning objectives:

- Apply: using the cookies parameter of the request object
- Understand: the role of the start_requests function
- Apply: building and sending POST requests
1. Review: earlier simulated-login methods

1.1 How does the requests module implement simulated login?

- Carry cookies directly when requesting the page
- Find the login url and send a POST request; the session stores the cookie

Both approaches are sketched below.
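As a quick refresher, a minimal sketch of both approaches with requests (the urls, cookie names, and form fields here are placeholders, not a real site):

```python
import requests

# Option 1: carry cookies copied from a logged-in browser session
cookies = {"session_id": "..."}  # placeholder cookie
response = requests.get("https://example.com/profile", cookies=cookies)

# Option 2: find the login url, POST the credentials, let the session store the cookie
session = requests.Session()
session.post("https://example.com/login", data={"username": "...", "password": "..."})
response = session.get("https://example.com/profile")
```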
1.2 How does selenium simulate login?

Find the corresponding input tags, type in the text, and click the login button, as sketched below.
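A minimal sketch with selenium (the url and element locators are placeholders and depend entirely on the target page; this uses the selenium 3 style API that was current when this note was written):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder login page

# find the corresponding input tags, type the text, then click the login button
driver.find_element_by_name("username").send_keys("...")
driver.find_element_by_name("password").send_keys("...")
driver.find_element_by_css_selector("button[type='submit']").click()
```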
1.3 Simulated login in scrapy

- Carry cookies directly
- Find the login url and send a POST request; scrapy stores the cookie
2. Carrying cookies in scrapy to fetch pages that require login

Application scenarios:

- The cookie has a very long expiry time, common on some poorly maintained sites
- All the data can be fetched before the cookie expires
- Working together with another program, e.g. using selenium to log in, saving the resulting cookies locally, and having scrapy read the local cookies before sending requests (see the sketch after this list)
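For the third scenario, a hedged sketch of the hand-off (the file name and url are assumptions): selenium logs in once and persists its cookies, and the spider reads them back in start_requests:

```python
import json
from selenium import webdriver

# one-off login: persist the browser cookies to a local file
driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder: perform the login steps here
with open("cookies.json", "w") as f:
    # get_cookies() returns a list of dicts with 'name' and 'value' keys
    json.dump({c["name"]: c["value"] for c in driver.get_cookies()}, f)
driver.quit()
```

Then, inside the spider:

```python
# inside the spider's start_requests
with open("cookies.json") as f:
    cookies = json.load(f)
yield scrapy.Request(url, cookies=cookies, callback=self.parse)
```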
2.1 Implementation: overriding scrapy's start_requests method

In scrapy, the urls in start_urls are processed by start_requests, whose implementation looks like this:
```python
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
```
Correspondingly, if a url in start_urls can only be accessed after logging in, override the start_requests method and add the cookies there manually.
2.2 Logging in to github by carrying cookies

Test account: noobpythoner zhoudawei123
```python
import scrapy


class Git1Spider(scrapy.Spider):
    name = 'git1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/zep03']

    # the method must be named start_requests (plural), or scrapy will never call it
    def start_requests(self):
        url = self.start_urls[0]
        # cookie string copied from a logged-in browser (truncated here)
        temp = '_octo=GH1.1.838083519.1594559947; _ga=GA1.2.1339438892.1594559990; _gat=1; tz=Asia%2FShanghai; _device_id=4d76e456d7a0c1e69849de2655198d40; has_recent_activity=1; user_session=e6aK8ODfFzCDBmDG72FxcGE17CQ3FiL23o; __Host-user_session_same_site=e6aK8ODfFzCDBmDTZMReW2g3PhRJEG72FxcGE17CQ3FiL23o; logged_in=yes; dotc'
        # split on '; ' so the keys carry no leading spaces;
        # [-1] keeps values that themselves contain '='
        cookies = {data.split('=')[0]: data.split('=')[-1] for data in temp.split('; ')}
        print(cookies)
        yield scrapy.Request(url=url, callback=self.parse, cookies=cookies)

    def parse(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())
```
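Assuming the spider lives in an existing scrapy project, it can be run from the project root with `scrapy crawl git1`; if the cookie is still valid, the printed page title should reflect a logged-in session.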
An equivalent spider, with the cookie string elided:

```python
import scrapy
import re


class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/NoobPythoner']  # a url that requires login

    def start_requests(self):
        cookies_str = '...'  # paste the cookie string copied from the browser
        cookies_dict = {i.split('=')[0]: i.split('=')[1] for i in cookies_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict
        )

    def parse(self, response):
        # login succeeded if the username appears in the response body
        result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())
        print(result_list)
```
Note:

- In scrapy, cookies cannot be placed in headers; when building the request there is a dedicated cookies parameter, which accepts cookies in dict form
- Set the ROBOTS protocol and USER_AGENT in settings (see the snippet below)
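For example, in settings.py (the user agent string here is illustrative):

```python
# settings.py
ROBOTSTXT_OBEY = False  # GitHub's robots.txt would otherwise block these urls
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
```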
3. scrapy.Request發送post請求
我們知道可以通過scrapy.Request()指定method、body參數來發送post請求;但是通常使用scrapy.FormRequest()來發送post請求
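For comparison, a hedged sketch of the raw scrapy.Request form (the url and payload are placeholders); with FormRequest, the body encoding and Content-Type header are handled for you:

```python
import json

import scrapy


class RawPostSpider(scrapy.Spider):
    name = 'raw_post'

    def start_requests(self):
        # build the POST by hand: method, body, and headers are all explicit
        yield scrapy.Request(
            url='https://example.com/api/login',  # placeholder url
            method='POST',
            body=json.dumps({'user': '...', 'password': '...'}),  # e.g. an ajax json payload
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)
```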
3.1 Sending a POST request

Note: scrapy.FormRequest() can send both form and ajax requests; for further reading see https://www.jb51.net/article/146769.htm
3.1.1 Approach

- Find the POST url: click the login button while capturing packets, and locate the url https://github.com/session
- Find the pattern of the request body: analyse the POST request body; every parameter it contains appears in the previous response
- Check whether login succeeded: request the personal home page and check whether it contains the username
3.1.2 Code implementation:
```python
import scrapy


class Git2Spider(scrapy.Spider):
    name = 'git2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        # pull the hidden fields of the login form out of the response
        token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        required_field_name = response.xpath('//*[@id="login"]/form/div[4]/input[6]/@name').extract_first()

        post_data = {
            "commit": "Sign in",
            "authenticity_token": token,
            "ga_id": "1029919665.1594130837",
            "login": "username",    # fill in your account
            "password": "password",  # fill in your password
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "return_to": "",
            required_field_name: "",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret,
        }
        print(post_data)
        yield scrapy.FormRequest(
            url='https://github.com/session',
            callback=self.after_login,
            formdata=post_data,
        )

    def after_login(self, response):
        # request the personal home page to verify the login
        yield scrapy.Request('https://github.com/zep03', callback=self.check_login)

    def check_login(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())
```
A second variant, extracting only authenticity_token, utf8, and commit from the form:

```python
import scrapy
import re


class Login2Spider(scrapy.Spider):
    name = 'login2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "noobpythoner",
                "password": "***",
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # login succeeded if the username appears in the response text
        ret = re.findall(r"noobpythoner|NoobPythoner", response.text)
        print(ret)
```
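As an aside, scrapy also provides scrapy.FormRequest.from_response(), which locates the form in the response and carries over its hidden inputs (authenticity_token and friends) automatically, so only the visible fields need supplying; a sketch of the parse method above rewritten with it:

```python
def parse(self, response):
    # from_response fills in the hidden form fields for us
    yield scrapy.FormRequest.from_response(
        response,
        formdata={"login": "noobpythoner", "password": "***"},
        callback=self.parse_login,
    )
```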
Tip

Setting COOKIES_DEBUG = True in settings.py lets you watch the cookie hand-off in the terminal:
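```python
# settings.py
COOKIES_DEBUG = True  # log the Cookie / Set-Cookie headers of every request and response
```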
Summary

- The urls in start_urls are handed to start_requests; override the start_requests method if necessary
- Logging in by carrying cookies directly: cookies can only be passed via the cookies parameter
- scrapy.Request() can send POST requests (scrapy.FormRequest() is the usual choice)