Python full series, the Scrapy crawler: logging in to Zhihu with Scrapy

Published: 2024/7/23

Let's look at the basic way to simulate a login with scrapy:

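Note: we usually debug with the Chrome browser, but Chrome misled me here. It never prompted me for a captcha at login, so I assumed the login did not need one, when in fact a captcha is required. Try several browsers and debug with one that does prompt you for a captcha.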

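1. When we log in, we are prompted for a captcha. Just before the captcha pops up, the browser sends a request. Open that request: its type is login, so it is clearly the captcha request. Even the name of the request should tell you that this is the captcha request, and you can also open the request URL directly to confirm it.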

The captcha is an image, which is painful. How do we deal with that? Don't worry.

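2. The captcha asks us to click the characters that are written upside down. Crawling and anti-crawling are an endless round of mutual torment. The prompt clearly refers to the image above.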

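3. Then I noticed that the captcha request carries three parameters: r is a 13-digit number (the code in this post fills it with the current timestamp in milliseconds), type marks the request as a login captcha, and lang looks suspicious. Let's change it from cn to en.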
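The three parameters can be assembled in isolation before wiring them into a spider. This is only a minimal sketch: the helper name `build_captcha_url` and its `now` argument are hypothetical and exist just to make the function easy to check, and the endpoint is the one used in this post, which may no longer behave this way.

```python
import time
from urllib.parse import urlencode

# Endpoint taken from this post; treat it as historical, not guaranteed to work today.
CAPTCHA_ENDPOINT = "https://www.zhihu.com/captcha.gif"


def build_captcha_url(lang="en", now=None):
    """Build the captcha URL from the three query parameters described above."""
    if now is None:
        now = time.time()
    params = {
        "r": str(int(now * 1000)),  # Unix time in milliseconds
        "type": "login",            # marks the captcha as a login captcha
        "lang": lang,               # "en" instead of the default "cn"
    }
    return "{0}?{1}".format(CAPTCHA_ENDPOINT, urlencode(params))


if __name__ == "__main__":
    print(build_captcha_url())
```

Passing a fixed `now` makes the result reproducible, which is handy when comparing the generated URL with the one seen in the browser's network panel.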

The code is as follows:

import json
import time

import scrapy
from PIL import Image


class ZhihuloginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com/']
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,'
                      ' like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    }

    def parse(self, response):
        # Scrape the actual home page content here
        print(response.text)

    def start_requests(self):
        '''
        1. First build and fetch the captcha that has to be submitted at login
        :return:
        '''
        t = str(int(time.time() * 1000))
        captcha_url = 'https://www.zhihu.com/captcha.gif?r={0}&type=login&lang=en'.format(t)
        return [scrapy.Request(url=captcha_url, headers=self.header, callback=self.parser_captcha)]

    def parser_captcha(self, response):
        '''
        1. Save the captcha fetched by the start_requests request to a local file
        2. Open the downloaded captcha
        3. The captcha is typed in by hand here; a captcha-solving service could be plugged in instead
        :param response:
        :return:
        '''
        # The with block closes the file, so no explicit f.close() is needed
        with open('captcha.jpg', 'wb') as f:
            f.write(response.body)
        try:
            im = Image.open('captcha.jpg')
            im.show()
            im.close()
        except Exception:
            # If the image cannot be displayed, open captcha.jpg manually
            pass
        captcha = input("Enter the captcha> ")
        # Fetch the sign-in page and carry the captcha along in meta
        return scrapy.Request(url='https://www.zhihu.com/#signin', headers=self.header, callback=self.login, meta={
            'captcha': captcha
        })

    def login(self, response):
        xsrf = response.xpath("//input[@name='_xsrf']/@value").extract_first()
        if xsrf is None:
            # Without the _xsrf token the login form cannot be submitted
            self.logger.error("_xsrf token not found on the sign-in page")
            return None
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_data = {
            "_xsrf": xsrf,
            "phone_num": 'your account name',
            "password": 'your account password',
            "captcha": response.meta['captcha']
        }
        return [scrapy.FormRequest(url=post_url, formdata=post_data, headers=self.header, callback=self.check_login)]

    # Check whether the login succeeded
    def check_login(self, response):
        js = json.loads(response.text)
        print(js)
        # '登录成功' is the server's message for "login successful"
        if 'msg' in js and js['msg'] == '登录成功':
            for url in self.start_urls:
                print(url)
                yield scrapy.Request(url=url, headers=self.header, dont_filter=True)
        else:
            print("Login failed, please check!")
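The two parsing steps in login and check_login can also be tried outside of scrapy. The following is a rough sketch using only the standard library rather than scrapy's XPath selectors; the helper names `extract_xsrf` and `is_login_success`, the sample HTML, and the sample JSON body are all hypothetical, and the success message is assumed to be the simplified-Chinese string for "login successful".

```python
import json
from html.parser import HTMLParser


class _XsrfFinder(HTMLParser):
    """Remembers the value of the first <input name="_xsrf"> element."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag == "input" and self.token is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "_xsrf":
                self.token = attr_map.get("value")


def extract_xsrf(html):
    """Return the _xsrf token from the page, or None if it is absent."""
    finder = _XsrfFinder()
    finder.feed(html)
    return finder.token


def is_login_success(body, success_msg="登录成功"):
    """Return True when the JSON body carries the assumed success message."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page
        return False
    return isinstance(data, dict) and data.get("msg") == success_msg


if __name__ == "__main__":
    # Hypothetical inputs, not captured from the real site
    sample_page = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
    print(extract_xsrf(sample_page))
    print(is_login_success('{"msg": "登录成功"}'))
```

Keeping these checks as plain functions makes it possible to test them against saved responses without sending any requests.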
