Some pitfalls and lessons from writing crawlers with Python urllib2
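Case 1:
Open http://www.diyiziti.com/Builder, an online calligraphy-font generator.
Type some characters into the input box by hand and convert them. Then, in Chrome, open More Tools > Developer Tools, click Network > Doc, and look at the submitted data at the very bottom of the request.
What you see there is the data submitted with the form. However, the data we actually enter by hand is probably only two fields: FontInfoId and Content.
FontInfoId is the font chosen in the font drop-down list; its value is a number.
Content is the text we type into the input box to be converted.
So I used the following code to crawl this online calligraphy-font site.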
# -*- coding: utf-8 -*-
import urllib
import urllib2

half_url = 'http://www.diyiziti.com/Builder'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}
# Font ids to request, and the font name behind each id
sort = ('99', '104', '100', '103', '82', '105', '113', '392', '374', '384')
font = {'99': '柳公權柳體書法字體', '104': '顏真卿顏體書法字體', '100': '柳公權楷書繁體', '103': '趙孟頫楷書字體', '82': '歐陽詢體書法字體', '105': '褚遂良楷書書法字體', '113': '毛筆字', '392': '北魏楷書字體', '374': '漢儀全唐詩字體', '384': '黃自元楷書'}

def getImg(wd):
    for Sort in sort:
        print '++++++++++', Sort, font[Sort]
        url = '%s/%s' % (half_url, Sort)
        # Only two fields are sent here; Content (wd) is left out
        data = urllib.urlencode({'FontInfoId': Sort, 'FontSize': '75'})
        request = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(request)
        html = response.read()
        print(html)

This produced a string of garbage:
++++++++++ 99 柳公權柳體書法字體
�PNG IHDR ... sRGB ... gAMA ... cHRM ... IDATx^ ... IEND�B`�
(binary output abridged)

The PNG, IHDR and IEND markers show that this "garbage" is the raw bytes of a PNG image printed as text. The request above never sent the text to convert.

[Pitfall 1]: When a crawler submits a form, pass in the complete set of parameters, because only the browser fills in default parameter values for us.
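To make Pitfall 1 harder to repeat, the request can be built by a small helper that refuses to send an incomplete form, and the response can be checked for the PNG signature before it is printed. This is only a sketch under assumptions: it is written for Python 3, where `urllib.request` and `urllib.parse` take the place of `urllib2` and `urllib`; the helper names (`build_form_request`, `is_png`, `save_if_png`) are hypothetical; and `REQUIRED_FIELDS` lists only the field names that appear in this post. The real form may have more fields, which should be copied from Developer Tools.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Assumption: only the field names mentioned in this post.
# Copy the full list from Developer Tools for the real form.
REQUIRED_FIELDS = ("FontInfoId", "FontSize", "Content")


def build_form_request(url, fields, headers, required=REQUIRED_FIELDS):
    """Return a POST Request, refusing to build one with missing fields."""
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError("missing form fields: %s" % ", ".join(missing))
    body = urlencode(fields).encode("utf-8")  # POST data must be bytes
    return Request(url, data=body, headers=headers)


def is_png(payload):
    """True if the response body starts with the PNG signature."""
    return payload.startswith(PNG_SIGNATURE)


def save_if_png(payload, path):
    """Write an image body to disk as binary instead of printing it."""
    if not is_png(payload):
        return False
    with open(path, "wb") as handle:
        handle.write(payload)
    return True
```

A caller would pass the resulting request to `urllib.request.urlopen`, read the body, and hand it to `save_if_png` rather than `print`.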
Case 2:
Crawl http://www.zhenhaotv.com/, another online calligraphy-font generator.
Having learned the lesson from Case 1, I passed in the complete set of parameters, as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://www.zhenhaotv.com'  # typed by hand, with the http scheme
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}  # , 'Referer': 'https://www.zhenhaotv.com/'}
# Font ids to request, and the font name behind each id
sort = ('2', '4', '8', '9', '26', '30', '34')
font = {'2': '方正行楷繁體', '4': '漢儀雪君體繁', '8': '博洋柳體字體', '9': '博洋歐體字體', '26': '騰祥鐵山楷繁', '30': '蘇新詩柳楷簡', '34': '新蒂趙孟頫楷'}

def getImg(wd):
    for Sort in sort:
        print '++++++++++', Sort, font[Sort]
        # All form fields are sent this time
        data = urllib.urlencode({'text': wd, 'font': Sort, 'size': '68', 'color': '#000000', 'bg': '#ffffff', 'list': 'open'})
        request = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(request)
        html = response.read()
        print(html)

Result:
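The page did not seem to change at all. It was identical to what you get when you first type the address into the browser, as if the parameters had never been passed in.
After racking my brain, I happened to copy the URL out of the address bar and paste it back into the crawler, and the door to a new world opened...
[Pitfall 2]: Make sure the URL the crawler requests is the one pasted back from the browser's address bar, so that the protocol and other details are correct.
My mistake was that the URL was not http://www.zhenhaotv.com but https://www.zhenhaotv.com/. The protocol is different.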
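One plausible explanation, stated here as an assumption about the site, is that the http address redirects to https. When urllib2 (and `urllib.request` in Python 3) follows a 301, 302 or 303 redirect for a POST, it reissues the request as a GET without the form body, which would produce exactly the "parameters never arrived" symptom. The sketch below, written for Python 3 with hypothetical helper names, compares the requested URL with the URL the response actually came from and fails loudly when the scheme or host changed.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def redirected_elsewhere(requested_url, final_url):
    """True if the scheme or host differs between the two URLs."""
    asked, got = urlsplit(requested_url), urlsplit(final_url)
    return (asked.scheme, asked.hostname) != (got.scheme, got.hostname)


def post_and_check(url, body, headers):
    """POST body (bytes) to url; raise if a redirect changed scheme or host."""
    response = urlopen(Request(url, data=body, headers=headers))
    final_url = response.geturl()  # URL actually served after any redirects
    if redirected_elsewhere(url, final_url):
        raise RuntimeError(
            "%s was redirected to %s; post to that URL instead" % (url, final_url)
        )
    return response.read()
```

The comparison ignores the path on purpose, so a trailing slash alone does not trigger the error; only a change of protocol or host does.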
These are still only basic crawlers.
Crawling is full of pitfalls and never easy, so tread carefully...