
[Python Crawler] Setting a Cookie with BeautifulSoup to Get Past Site Blocking and Scrape Mayi Short-Term Rentals (蚂蚁短租)

Published: 2024/6/1


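When writing Python crawlers, we sometimes run into anti-crawling measures such as a site refusing access. For example, when we try to scrape data from Mayi short-term rentals, the site responds with the message "The current access is suspected to be a hacker attack and has been blocked by the site administrator", as shown in the figure below. In that case we need to set a Cookie in order to crawl the site, and this article explains how in detail. Many thanks to my student Chengfeng for the idea; the younger generation keeps pushing ahead!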

1. Site Analysis and Crawler Blocking

When we open Mayi short-term rentals and search for Guiyang, the site returns the result shown in the figure below.
The URL is: http://www.mayi.com/guiyang/1/?map=n

The rental listings are laid out in a regular pattern, as shown in the figure below, and these listings are the information we want to scrape.

Using the browser's "Inspect Element" tool, we can see that each rental listing we want to scrape sits under a <dd></dd> node.

Locating the listing name next, as shown in the figure below, we find it under a <div class="room-detail clearfloat"></div> node.
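The structure just described can be checked offline before touching the site. The following is a minimal sketch in Python 3 (the article's own scripts are Python 2). The HTML fragment, the listing names, and the helper `extract_names` are made up for illustration; only the `<dd>` tag and the `room-detail clearfloat` class come from the page analysis above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described above;
# the real page has far more markup and may have changed since.
SAMPLE_HTML = """
<dl>
  <dd>
    <div class="room-detail clearfloat">
      <p>Listing A</p>
    </div>
  </dd>
  <dd>
    <div class="room-detail clearfloat">
      <p>Listing B</p>
    </div>
  </dd>
</dl>
"""


def extract_names(html):
    """Return the text of the first <p> in every name block under a <dd>."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for dd in soup.find_all("dd"):
        # Matching on the full class string mirrors the lookup used in the article.
        for block in dd.find_all(attrs={"class": "room-detail clearfloat"}):
            p = block.find("p")
            if p is not None:
                names.append(p.get_text().strip())
    return names


print(extract_names(SAMPLE_HTML))
```

Checking that `find("p")` did not return `None` is an addition of this sketch; it keeps the loop from crashing on a `<dd>` that has the class but no paragraph.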

Next, we write a simple BeautifulSoup crawler.
# -*- coding: utf-8 -*-
import urllib
import re
from bs4 import BeautifulSoup
import codecs

url = 'http://www.mayi.com/guiyang/?map=no'
response = urllib.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print soup.title
print soup

# Rental names
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print u'[Rental name]', fname.replace('\n','').strip()

Unfortunately, this fails with an error, which shows that the site's anti-crawling measures are fairly thorough.

2. A BeautifulSoup Crawler That Sets a Cookie

The code that adds request headers is shown below. We give the code and its results first, and then explain how to obtain the Cookie.

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup

# Crawler function
def gydzf(url):
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers={"User-Agent":user_agent}
    request=urllib2.Request(url,headers=headers)
    response=urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Rental name
        for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[Rental name]', fname.replace('\n','').strip()
        # Rental price
        for price in tag.find_all(attrs={"class":"moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub("[￥]+".decode("utf8"), "".decode("utf8"),string)
            fprice = fprice[0:5]
            print u'[Rental price]', fprice.replace('\n','').strip()
            # Rating and review count (read straight from the <ul>)
            fscore = name.find('ul').get_text()
            print u'[Rating / reviews / guests]', fscore.replace('\n','').strip()
            # Link to the detail page
            url_dzf = tag.find(attrs={"target":"_blank"})
            urls = url_dzf.attrs['href']
            print u'[Link]', urls.replace('\n','').strip()
            urlss = 'http://www.mayi.com' + urls + ''
            print urlss

# Main function
if __name__ == '__main__':
    i = 1
    while i<10:
        print u'Page', i
        url = 'http://www.mayi.com/guiyang/' + str(i) + '/?map=no'
        gydzf(url)
        i = i+1
    else:
        print u"Done"

The output is shown below:

Page 1
[Rental name] 大唐東原財富廣場--城市簡約復式民宿
[Rental price] 298
[Rating / reviews / guests] 5.0分·5條評論·二居·可住3人
[Link] /room/851634765
http://www.mayi.com/room/851634765
[Rental name] 大唐東原財富廣場--清新檸檬復式民宿
[Rental price] 568
[Rating / reviews / guests] 2條評論·三居·可住6人
[Link] /room/851634467
http://www.mayi.com/room/851634467
...
Page 9
[Rental name] 【高鐵北站公園旁】美式風情+超大舒適安逸
[Rental price] 366
[Rating / reviews / guests] 3條評論·二居·可住5人
[Link] /room/851018852
http://www.mayi.com/room/851018852
[Rental name] 大營坡（中大國際購物中心附近）北歐小清新三室
[Rental price] 298
[Rating / reviews / guests] 三居·可住6人
[Link] /room/851647045
http://www.mayi.com/room/851647045
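Two small post-processing steps sit behind this output: stripping the currency sign from the price and turning the relative link into a full URL. The sketch below shows one way to do both in Python 3. The helper names `clean_price` and `absolute_link` are hypothetical. Unlike the script above, it strips whitespace with the regular expression instead of slicing the first five characters, and it uses `urljoin` instead of string concatenation.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.mayi.com"  # site root used throughout the article


def clean_price(raw):
    """Strip the fullwidth yen sign and any whitespace from a scraped price."""
    return re.sub(r"[￥\s]+", "", raw)


def absolute_link(href):
    """Resolve a relative href such as /room/851634765 against the site root."""
    return urljoin(BASE_URL, href)


print(clean_price("￥298\n"))
print(absolute_link("/room/851634765"))
```

`urljoin` also copes with hrefs that are already absolute, which plain concatenation would mangle.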

Next we want to fetch the details of each listing, for example: http://www.mayi.com/room/851634467
We need the rental's address and other detailed information.

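The main point here is the method for inspecting the Cookie. Open the page in a browser, right-click and choose "Inspect", then refresh the page. In the "Network" tab, find the page request and click it; the information we need is in the Headers panel that opens.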

The two most commonly used parameters are Cookie and User-Agent, as shown in the figure below:
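To sanity-check the Cookie value copied from the Headers panel, it can help to split it into name/value pairs. The following is a minimal Python 3 sketch; the function `parse_cookie_header` is hypothetical, and the values are placeholders rather than a real session (only the cookie names `mayi_uuid` and `sid` appear in the article's own cookie string).

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed segments
        name, value = pair.split("=", 1)  # values may themselves contain "="
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values, not a real session.
raw_cookie = "mayi_uuid=EXAMPLE_UUID; sid=EXAMPLE_SID"
print(sorted(parse_cookie_header(raw_cookie)))
```

If an expected name is missing from the parsed result, the copied string was probably truncated in the browser.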

Then set these parameters in the Python code and call urllib2.Request() to submit the request. The core code is as follows:

user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/61.0.3163.100 Safari/537.36"
cookie="mediav=%7B%22eid%22%3A%22387123...b3574ef2-21b9-11e8-b39c-1bc4029c43b8"
headers={"User-Agent":user_agent,"Cookie":cookie}
request=urllib2.Request(url,headers=headers)
response=urllib2.urlopen(request)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")

for tag1 in soup.find_all(attrs={"class":"main"}):
    ....
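For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched with `urllib.request`. The helper `build_request` and the placeholder header values are assumptions made for illustration. The sketch only builds the request and inspects its headers, so it never contacts the site.

```python
import urllib.request


def build_request(url, user_agent, cookie):
    """Build a Request carrying the User-Agent and Cookie copied from the browser."""
    headers = {"User-Agent": user_agent, "Cookie": cookie}
    return urllib.request.Request(url, headers=headers)


# Placeholder header values; substitute the ones shown in the browser's Headers panel.
req = build_request(
    "http://www.mayi.com/room/851634467",
    "Mozilla/5.0 (placeholder)",
    "sid=PLACEHOLDER",
)

# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Cookie"))
```

Passing `req` to `urllib.request.urlopen` would then send the request with both headers attached.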

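Note that the Cookie is refreshed every hour, so its value has to be updated by hand; that is, edit the cookie and user_agent variables in the code above. The complete code is as follows: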
# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup
import codecs
import csv

c = open("ycf.csv","wb")  # open for writing
c.write(codecs.BOM_UTF8)
writer = csv.writer(c)
writer.writerow(["Rental name","Address","Price","Rating","Guests","Price per guest"])

# Scrape the detail page
def getInfo(url,fname,fprice,fscore,users):
    # Look up the user_agent and cookie in the browser's developer tools and send
    # them as request headers to get past the anti-crawler check. When running
    # again after some time, update the cookie below from the developer tools.
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    cookie="mediav=%7B%22eid%22%3A%22387123%22eb7; mayi_uuid=1582009990674274976491; sid=42200298656434922.85.130.130"
    headers={"User-Agent":user_agent,"Cookie":cookie}
    request=urllib2.Request(url,headers=headers)
    response=urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    # Rental address
    for tag1 in soup.find_all(attrs={"class":"main"}):
        print u'Rental address:'
        for tag2 in tag1.find_all(attrs={"class":"desWord"}):
            address = tag2.find('p').get_text()
            print address
        # Number of guests
        print u'Guests:'
        for tag4 in tag1.find_all(attrs={"class":"w258"}):
            yy = tag4.find('span').get_text()
            print yy
        fname = fname.encode("utf-8")
        address = address.encode("utf-8")
        fprice = fprice.encode("utf-8")
        fscore = fscore.encode("utf-8")
        fpeople = yy[2:3].encode("utf-8")
        ones = int(float(fprice))/int(float(fpeople))
        # Save to the local CSV file
        writer.writerow([fname,address,fprice,fscore,fpeople,ones])

# Crawler function
def gydzf(url):
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers={"User-Agent":user_agent}
    request=urllib2.Request(url,headers=headers)
    response=urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Rental name
        for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[Rental name]', fname.replace('\n','').strip()
        # Rental price
        for price in tag.find_all(attrs={"class":"moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub("[￥]+".decode("utf8"), "".decode("utf8"),string)
            fprice = fprice[0:5]
            print u'[Rental price]', fprice.replace('\n','').strip()
            # Rating and review count (read straight from the <ul>)
            fscore = name.find('ul').get_text()
            print u'[Rating / reviews / guests]', fscore.replace('\n','').strip()
            # Link to the detail page
            url_dzf = tag.find(attrs={"target":"_blank"})
            urls = url_dzf.attrs['href']
            print u'[Link]', urls.replace('\n','').strip()
            urlss = 'http://www.mayi.com' + urls + ''
            print urlss
            getInfo(urlss,fname,fprice,fscore,user_agent)

# Main function
if __name__ == '__main__':
    i = 0
    while i<33:
        print u'Page', (i+1)
        if(i==0):
            url = 'http://www.mayi.com/guiyang/?map=no'
        if(i>0):
            num = i+2  # page 1 has no number in its URL; later pages are numbered in sequence
            url = 'http://www.mayi.com/guiyang/' + str(num) + '/?map=no'
        gydzf(url)
        i=i+1
    c.close()

The output is as follows and is saved to a local CSV file:
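The CSV step can also be sketched in Python 3, where the BOM is handled by the `utf-8-sig` encoding rather than by writing `codecs.BOM_UTF8` by hand. The column names, the sample row, and the helpers `price_per_guest` and `write_rows` below are hypothetical. The sketch writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Hypothetical column names for illustration.
HEADER = ["name", "address", "price", "rating", "guests", "price_per_guest"]


def price_per_guest(price, guests):
    """Integer price per guest, mirroring the integer division in the script."""
    return int(float(price)) // int(float(guests))


def write_rows(stream, rows):
    """Write the header and one line per (name, address, price, rating, guests) row."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    for name, address, price, rating, guests in rows:
        writer.writerow([name, address, price, rating, guests,
                         price_per_guest(price, guests)])


# For a real file: open("ycf.csv", "w", newline="", encoding="utf-8-sig")
buffer = io.StringIO()
write_rows(buffer, [("Sample listing", "Sample address", "298", "5.0", "3")])
print(buffer.getvalue())
```

The `//` operator is used because `/` on two integers yields a float in Python 3, whereas the Python 2 script relies on integer division.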

You can also try scraping Mayi short-term rentals with Selenium, which should work as well. I hope this article is helpful; please forgive any shortcomings.
(By: Eastmount, 2017-03-07, 7 p.m., http://blog.csdn.net/eastmount/)
