[Python Crawler Case] Xici Free Proxy IPs
Tip
If you have Anaconda installed, open a command line and install one required package first:
pip install progressbar
(On Python 3 you may need the progressbar2 package instead, which also installs as the progressbar module: pip install progressbar2.)
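If you want to confirm the package imports correctly before running the full script, a minimal check might look like this (a quick sketch, not part of the original article):

import time
import progressbar

# Iterate over a dummy range just to confirm a progress bar renders in the terminal.
bar = progressbar.ProgressBar()
for _ in bar(range(20)):
    time.sleep(0.05)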
import requests
import chardet
import random
import time
from bs4 import BeautifulSoup
from telnetlib import Telnet
import progressbar

user_agent = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
]
# Retry counter used by getHtmlWithHeader. The original listing relied on this
# global without initialising it, so it is set to 0 here to make the retry logic work.
_times_count = 0

def getHtmlWithHeader(url):
    """Fetch a page with a random User-Agent, retrying up to 5 times on failure."""
    global _times_count
    try:
        response = requests.get(url, headers={"User-Agent": random.choice(user_agent)})
        # Detect the real encoding so the response text decodes correctly.
        code = chardet.detect(response.content)["encoding"]
        response.encoding = code
        return response.text
    except Exception:
        time.sleep(1)
        _times_count += 1
        if _times_count > 5:
            print("Failed to fetch the IP list, please try again later")
            return
        print("Retrying, attempt", _times_count)
        return getHtmlWithHeader(url)
def getIP(num):
    """Scrape the first `num` listing pages of http://www.xicidaili.com/nn/ and return proxy records."""
    datalist = []
    for num1 in range(num):
        url = 'http://www.xicidaili.com/nn/' + str(num1 + 1)
        html = getHtmlWithHeader(url)
        soup = BeautifulSoup(html, 'html.parser')
        parent = soup.find(id="ip_list")
        lis = parent.find_all('tr')
        lis.pop(0)  # drop the table header row
        print("Scraping IP addresses and related fields")
        for i in lis:
            tds = i.find_all('td')
            ip = tds[1].get_text()  # IP address
            dk = tds[2].get_text()  # port
            nm = tds[4].get_text()  # anonymity level
            ty = tds[5].get_text()  # type: HTTP or HTTPS
            tm = tds[8].get_text()  # survival time
            datalist.append((ip, dk, nm, ty, tm))
    print("Scraped", len(datalist), "records in total\n")
    return datalist
def filtrateIP(datalist):
    """Drop short-lived proxies, then keep only those that answer a Telnet connection test."""
    datalist1 = []
    print('Filtering out proxies with short survival times\n')
    for i in datalist:
        # The survival-time field is in Chinese; entries measured in "分鐘" (minutes) are considered too short-lived.
        if "分鐘" not in i[4]:
            datalist1.append(i)
    print("Filtered out", len(datalist) - len(datalist1), "short-lived records")
    print(len(datalist1), "records remaining\n")
    print('Testing and filtering out unreachable IPs')
    datalist.clear()
    v = 1
    p = progressbar.ProgressBar()
    for i in p(datalist1):
        v += 1
        try:
            # Keep a proxy only if its IP/port accepts a connection within 1 second.
            Telnet(i[0], i[1], timeout=1)
        except Exception:
            pass
        else:
            datalist.append(i)
    print('Filtered out unreachable IPs')
    print("Filtered out", len(datalist1) - len(datalist), "unreachable records")
    print(len(datalist), "records remaining")
    return datalist
def saveIP(datalist):
    """Split proxies into HTTP and HTTPS lists, print them, and write both to a text file."""
    httplist = []
    httpslist = []
    for i in datalist:
        if i[3] == 'HTTP':
            httplist.append('http://' + i[0] + ':' + i[1])
        else:
            httpslist.append('https://' + i[0] + ':' + i[1])
    print("HTTP: " + str(len(httplist)) + " records")
    print(httplist)
    print("")
    print("HTTPS: " + str(len(httpslist)) + " records")
    print(httpslist)
    print("")
    print("Writing to file")
    f = open('ip地址.txt', 'w', encoding="utf-8")  # output file name from the original article ("ip地址" means "IP addresses")
    f.write("HTTP\n")
    f.write(str(httplist) + "\n\n")
    f.write("HTTPS\n")
    f.write(str(httpslist))
    f.close()
def main(num):
    datalist = getIP(num)
    IPlist = filtrateIP(datalist)
    saveIP(IPlist)

if __name__ == '__main__':
    main(10)
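The article stops once the proxies are written to disk. As a hedged sketch of how a scraped entry might then be used (this part is not in the original article, and the proxy address below is a placeholder), requests accepts a proxies mapping per request:

import requests

# Hypothetical entry taken from the saved HTTP list, e.g. the first item of httplist.
proxy = "http://123.123.123.123:8080"  # placeholder address for illustration only

try:
    r = requests.get("http://httpbin.org/ip",
                     proxies={"http": proxy},
                     timeout=5)
    print(r.text)  # should report the proxy's IP if the proxy is working
except requests.RequestException:
    print("Proxy did not respond; try another entry from the list")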
Summary
The script above scrapes proxy listings from xicidaili.com with rotating User-Agent headers, discards entries whose survival time is measured in minutes, keeps only proxies that pass a Telnet connection test, and writes the surviving HTTP and HTTPS proxies to a text file.