Crawling Usable Proxy IPs with Python

Published: 2025/4/5

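A crawler that exceeds a certain request frequency or request count will usually get its public IP banned. To crawl large volumes of data reliably, we typically buy proxy IPs in bulk on Taobao, usually at around 10 yuan for 100,000 IPs per day. Most of those IPs turn out to be dead, though, so you have to keep retrying or validating them yourself. In fact, the sellers scrape these IPs from proxy-listing sites in the first place.

So why not crawl them ourselves? The basic idea is simple:

(1) Find a dedicated proxy IP site and parse the IPs out of its pages

(2) Verify that each IP actually works

(3) Store the working IPs, or expose them as a service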
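
As a small offline illustration of step (1), the sketch below pulls (IP, port) pairs out of an HTML table using only the standard library. The sample markup and the names SAMPLE_HTML, CellCollector and extract_proxies are hypothetical, and the assumption that the IP and port sit in adjacent cells will not hold for every site.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; a real proxy-listing page will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>IP</th><th>Port</th></tr>
  <tr class="odd"><td>CN</td><td>192.0.2.10</td><td>8080</td></tr>
  <tr class=""><td>CN</td><td>198.51.100.20</td><td>3128</td></tr>
</table>
"""

IPV4 = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data.strip()


def extract_proxies(html):
    """Return (ip, port) pairs found in adjacent cells of each row."""
    collector = CellCollector()
    collector.feed(html)
    proxies = []
    for cells in collector.rows:
        # Assumption: the port cell directly follows the IP cell.
        for ip, port in zip(cells, cells[1:]):
            if IPV4.fullmatch(ip) and port.isdigit():
                proxies.append((ip, int(port)))
    return proxies


print(extract_proxies(SAMPLE_HTML))
```

The header row contributes no cells because it uses th rather than td, so only the two data rows yield pairs.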

A demo follows:

import logging
import re
import socket

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.DEBUG)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}


def proxy_spider(page_num):
    """Crawl the first page_num listing pages and check every proxy found."""
    for i in range(page_num):
        url = 'http://www.xicidaili.com/wt/' + str(i + 1)
        r = requests.get(url=url, headers=HEADERS)
        soup = BeautifulSoup(r.text, "html.parser")
        # The empty alternative makes this pattern match any class string.
        datas = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
        for data in datas:
            proxy_contents = data.find_all(name='td')
            # Column positions are specific to this site's table layout.
            ip = str(proxy_contents[1].string)
            port = str(proxy_contents[2].string)
            protocol = str(proxy_contents[5].string)
            wan_proxy_check(ip, port, protocol)


def local_proxy_check(ip, port, protocol):
    """Return True if a TCP connection to ip:port succeeds within 1 second."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.settimeout(1)
        s.connect((ip, int(port)))
        s.shutdown(socket.SHUT_RDWR)
        logging.debug("{} {}".format(ip, port))
        return True
    except (OSError, ValueError):
        logging.debug("-------- {} {}".format(ip, port))
        return False
    finally:
        s.close()


# Ways to look up your public IP on Linux:
#   https://my.oschina.net/epstar/blog/513186
# IP database APIs offered by the big sites (Sina, Sohu, Alibaba):
#   http://zhaoshijie.iteye.com/blog/2205033


def wan_proxy_check(ip, port, protocol):
    """Request an IP-echo page through the proxy and compare the reported IP."""
    proxy = {protocol.lower(): '%s:%s' % (ip, port)}
    try:
        result = requests.get("http://pv.sohu.com/cityjson", headers=HEADERS,
                              proxies=proxy, timeout=1).text.strip("\n")
        wan_ip = re.findall(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", result)[0]
        if wan_ip == ip:
            # The page saw the proxy's IP, so the proxy works.
            logging.info("{} {} {}".format(protocol, wan_ip, port))
            logging.debug("========================")
        else:
            logging.debug(" Proxy bad: {} {}".format(wan_ip, port))
    except Exception as e:
        logging.debug("#### Exception: {}".format(str(e)))


if __name__ == '__main__':
    proxy_spider(1)
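
The demo only logs the proxies that pass the check; step (3), storing them, is left open. One possible sketch, assuming a plain JSON file is an acceptable store, is shown below. The function names to_requests_proxies, save_proxies and load_proxies and the file name are hypothetical, and the mapping assumes the proxy itself is reached over plain HTTP.

```python
import json
import os
import tempfile


def to_requests_proxies(ip, port, protocol):
    """Build a mapping usable as the proxies argument of requests.get.

    The key selects which target URLs go through the proxy. Assumption:
    the proxy itself is reached over plain HTTP.
    """
    return {protocol.lower(): "http://{}:{}".format(ip, port)}


def save_proxies(records, path):
    """Write the list of validated proxy records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


def load_proxies(path):
    """Read the proxy records back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical record, shaped like the (ip, port, protocol) values in the demo.
records = [{"ip": "192.0.2.10", "port": "8080", "protocol": "HTTP"}]
path = os.path.join(tempfile.gettempdir(), "valid_proxies.json")
save_proxies(records, path)
for rec in load_proxies(path):
    print(to_requests_proxies(rec["ip"], rec["port"], rec["protocol"]))
```

A file is enough for a single crawler; turning the list into a shared service would need a real store and periodic re-validation, which this sketch does not attempt.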

References:

[1] Python crawler proxy IP pool (proxy pool)

https://github.com/jhao104/proxy_pool

[2] Python crawler proxy IP pool

http://www.spiderpy.cn/blog/detail/13

[3] python ip proxy tool scrapy crawl. Crawls large numbers of free proxy IPs and extracts the working ones for use

https://github.com/awolfly9/IPProxyTool

Reposted from: https://my.oschina.net/leejun2005/blog/67349
