III. Scrapy crawler framework: simulated login with scrapy
Simulated login with scrapy

Learning objectives:

- Apply: using the cookies parameter of the request object
- Understand: the role of the start_requests function
- Apply: building and sending POST requests
1. Review: earlier simulated-login methods

1.1 How does the requests module implement simulated login?

- Carry cookies directly when requesting the page
- Find the login url and send a POST request; the session stores the cookie

Both approaches are sketched below.
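As a quick refresher, a minimal sketch of both approaches with requests (the urls, cookie names, and form fields here are placeholders, not a real site):

```python
import requests

# Option 1: carry cookies copied from a logged-in browser session
cookies = {"session_id": "..."}  # placeholder cookie
response = requests.get("https://example.com/profile", cookies=cookies)

# Option 2: find the login url, POST the credentials, let the session store the cookie
session = requests.Session()
session.post("https://example.com/login", data={"username": "...", "password": "..."})
response = session.get("https://example.com/profile")
```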
1.2 How does selenium simulate login?

Find the corresponding input tags, type in the text, and click the login button, as sketched below.
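A minimal sketch with selenium (the url and element locators are placeholders and depend entirely on the target page; this uses the selenium 3 style API that was current when this note was written):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder login page

# find the corresponding input tags, type the text, then click the login button
driver.find_element_by_name("username").send_keys("...")
driver.find_element_by_name("password").send_keys("...")
driver.find_element_by_css_selector("button[type='submit']").click()
```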
1.3 Simulated login in scrapy

- Carry cookies directly
- Find the login url and send a POST request; scrapy stores the cookie
2. Carrying cookies in scrapy to fetch pages that require login

Application scenarios:

- The cookie has a very long expiry time, common on some poorly maintained sites
- All the data can be fetched before the cookie expires
- Working together with another program, e.g. using selenium to log in, saving the resulting cookies locally, and having scrapy read the local cookies before sending requests (see the sketch after this list)
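For the third scenario, a hedged sketch of the hand-off (the file name and url are assumptions): selenium logs in once and persists its cookies, and the spider reads them back in start_requests:

```python
import json
from selenium import webdriver

# one-off login: persist the browser cookies to a local file
driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder: perform the login steps here
with open("cookies.json", "w") as f:
    # get_cookies() returns a list of dicts with 'name' and 'value' keys
    json.dump({c["name"]: c["value"] for c in driver.get_cookies()}, f)
driver.quit()
```

Then, inside the spider:

```python
# inside the spider's start_requests
with open("cookies.json") as f:
    cookies = json.load(f)
yield scrapy.Request(url, cookies=cookies, callback=self.parse)
```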
2.1 Implementation: overriding scrapy's start_requests method

In scrapy, the urls in start_urls are processed by start_requests, whose implementation looks like this:
```python
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
```
Correspondingly, if a url in start_urls can only be accessed after logging in, override the start_requests method and add the cookies there manually.
2.2 Logging in to github by carrying cookies

Test account: noobpythoner zhoudawei123
```python
import scrapy


class Git1Spider(scrapy.Spider):
    name = 'git1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/zep03']

    # the method must be named start_requests (plural), or scrapy will never call it
    def start_requests(self):
        url = self.start_urls[0]
        # cookie string copied from a logged-in browser (truncated here)
        temp = '_octo=GH1.1.838083519.1594559947; _ga=GA1.2.1339438892.1594559990; _gat=1; tz=Asia%2FShanghai; _device_id=4d76e456d7a0c1e69849de2655198d40; has_recent_activity=1; user_session=e6aK8ODfFzCDBmDG72FxcGE17CQ3FiL23o; __Host-user_session_same_site=e6aK8ODfFzCDBmDTZMReW2g3PhRJEG72FxcGE17CQ3FiL23o; logged_in=yes; dotc'
        # split on '; ' so the keys carry no leading spaces;
        # [-1] keeps values that themselves contain '='
        cookies = {data.split('=')[0]: data.split('=')[-1] for data in temp.split('; ')}
        print(cookies)
        yield scrapy.Request(url=url, callback=self.parse, cookies=cookies)

    def parse(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())
```
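Assuming the spider lives in an existing scrapy project, it can be run from the project root with `scrapy crawl git1`; if the cookie is still valid, the printed page title should reflect a logged-in session.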
An equivalent spider, with the cookie string elided:

```python
import scrapy
import re


class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/NoobPythoner']  # a url that requires login

    def start_requests(self):
        cookies_str = '...'  # paste the cookie string copied from the browser
        cookies_dict = {i.split('=')[0]: i.split('=')[1] for i in cookies_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict
        )

    def parse(self, response):
        # login succeeded if the username appears in the response body
        result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())
        print(result_list)
```
Note:

- In scrapy, cookies cannot be placed in headers; when building the request there is a dedicated cookies parameter, which accepts cookies in dict form
- Set the ROBOTS protocol and USER_AGENT in settings (see the snippet below)
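For example, in settings.py (the user agent string here is illustrative):

```python
# settings.py
ROBOTSTXT_OBEY = False  # GitHub's robots.txt would otherwise block these urls
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
```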
3. scrapy.Request發送post請求
我們知道可以通過scrapy.Request()指定method、body參數來發送post請求;但是通常使用scrapy.FormRequest()來發送post請求
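For comparison, a hedged sketch of the raw scrapy.Request form (the url and payload are placeholders); with FormRequest, the body encoding and Content-Type header are handled for you:

```python
import json

import scrapy


class RawPostSpider(scrapy.Spider):
    name = 'raw_post'

    def start_requests(self):
        # build the POST by hand: method, body, and headers are all explicit
        yield scrapy.Request(
            url='https://example.com/api/login',  # placeholder url
            method='POST',
            body=json.dumps({'user': '...', 'password': '...'}),  # e.g. an ajax json payload
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)
```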
3.1 Sending a POST request

Note: scrapy.FormRequest() can send both form and ajax requests; for further reading see https://www.jb51.net/article/146769.htm
3.1.1 Approach

- Find the POST url: click the login button while capturing packets, and locate the url https://github.com/session
- Find the pattern of the request body: analyse the POST request body; every parameter it contains appears in the previous response
- Check whether login succeeded: request the personal home page and check whether it contains the username
3.1.2 Code implementation:
```python
import scrapy


class Git2Spider(scrapy.Spider):
    name = 'git2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        # pull the hidden fields of the login form out of the response
        token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        required_field_name = response.xpath('//*[@id="login"]/form/div[4]/input[6]/@name').extract_first()

        post_data = {
            "commit": "Sign in",
            "authenticity_token": token,
            "ga_id": "1029919665.1594130837",
            "login": "username",    # fill in your account
            "password": "password",  # fill in your password
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "return_to": "",
            required_field_name: "",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret,
        }
        print(post_data)
        yield scrapy.FormRequest(
            url='https://github.com/session',
            callback=self.after_login,
            formdata=post_data,
        )

    def after_login(self, response):
        # request the personal home page to verify the login
        yield scrapy.Request('https://github.com/zep03', callback=self.check_login)

    def check_login(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())
```
A second variant, extracting only authenticity_token, utf8, and commit from the form:

```python
import scrapy
import re


class Login2Spider(scrapy.Spider):
    name = 'login2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "noobpythoner",
                "password": "***",
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # login succeeded if the username appears in the response text
        ret = re.findall(r"noobpythoner|NoobPythoner", response.text)
        print(ret)
```
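As an aside, scrapy also provides scrapy.FormRequest.from_response(), which locates the form in the response and carries over its hidden inputs (authenticity_token and friends) automatically, so only the visible fields need supplying; a sketch of the parse method above rewritten with it:

```python
def parse(self, response):
    # from_response fills in the hidden form fields for us
    yield scrapy.FormRequest.from_response(
        response,
        formdata={"login": "noobpythoner", "password": "***"},
        callback=self.parse_login,
    )
```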
Tip

Setting COOKIES_DEBUG = True in settings.py lets you watch the cookie hand-off in the terminal:
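```python
# settings.py
COOKIES_DEBUG = True  # log the Cookie / Set-Cookie headers of every request and response
```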
Summary

- The urls in start_urls are handed to start_requests; override the start_requests method if necessary
- Logging in by carrying cookies directly: cookies can only be passed via the cookies parameter
- scrapy.Request() can send POST requests (scrapy.FormRequest() is the usual choice)