Chapter 4: Crawling Xici Free Proxy IPs and Using Them in Scrapy
1. Fetching free proxy IPs
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@author: liuyc
@file: crawl_xici_ip.py
@time: 2017/8/21 23:22
@describe: crawl free proxy IPs from xicidaili.com and store them in MySQL
"""
import requests
from scrapy.selector import Selector
from fake_useragent import UserAgent
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", password="pass",
                       db="scrapy", charset="utf8")
cursor = conn.cursor()
ua = UserAgent()


def crawl_xici_ips():
    """Crawl the first pages of the high-anonymity list and save them to MySQL."""
    headers = {"User-Agent": "{0}".format(ua.random)}
    for i in range(1, 3):
        req = requests.get("http://www.xicidaili.com/nn/{0}".format(i),
                           headers=headers)
        selector = Selector(text=req.text)
        trs = selector.css("#ip_list tr")
        lst_ip = []
        for tr in trs[1:]:  # skip the table header row
            all_text = tr.css("td::text").extract()
            lst_ip.append({"ip": all_text[0],
                           "port": all_text[1],
                           "ip_type": all_text[5]})
        for ip_info in lst_ip:
            sql_insert = ("INSERT INTO ip_proxy(ip, port, ip_type) "
                          "VALUES ('{0}', '{1}', '{2}')".format(
                              ip_info["ip"], ip_info["port"], ip_info["ip_type"]))
            cursor.execute(sql_insert)
            conn.commit()


class GetIP(object):
    def get_random_ip(self):
        """Pick a random stored proxy; recurse until a working one is found."""
        random_sql = "SELECT ip, port FROM ip_proxy ORDER BY RAND() LIMIT 1"
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_re = self.judge_ip(ip, port)
            if judge_re:
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()

    def del_ip(self, ip):
        """Delete an invalid IP from the database."""
        del_sql = "DELETE FROM ip_proxy WHERE ip='{0}'".format(ip)
        try:
            cursor.execute(del_sql)
            conn.commit()
        except Exception as e:
            print(e)

    def judge_ip(self, ip, port):
        """Verify the proxy by fetching Baidu through it."""
        baidu_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)
        proxy_dict = {"http": proxy_url}
        try:
            response = requests.get(baidu_url, proxies=proxy_dict)
        except Exception as e:
            print(e)
            self.del_ip(ip)
            print("invalid ip or port")
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print("effective ip:{0}".format(ip))
                return True
            else:
                print("invalid ip or port")
                self.del_ip(ip)
                return False


# if __name__ == "__main__":
#     get_ip_obj = GetIP()
#     get_ip_obj.get_random_ip()
```

2. Randomly rotating free proxy IPs during crawling with the downloader middleware below

```python
class RandomProxyMiddleware(object):
    """Downloader middleware that attaches a random free proxy to each request."""

    def process_request(self, request, spider):
        print("RandomProxyMiddleware".center(30, "*"))
        get_ip_obj = GetIP()
        request.meta["proxy"] = get_ip_obj.get_random_ip()
```
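The post never shows how Scrapy picks this middleware up. In a standard Scrapy project it would be registered in `settings.py` under `DOWNLOADER_MIDDLEWARES`; a minimal sketch follows, where the module path `myproject.middlewares` is an assumption, not something given in the post:

```python
# settings.py -- "myproject.middlewares" is a placeholder module path,
# not taken from the post; adjust it to wherever RandomProxyMiddleware lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 543,
}
```

The integer is the middleware's order: middlewares with lower values have their `process_request` called earlier (closer to the engine).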
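A side note on the inserts above: building SQL with `str.format` breaks on values containing quotes and invites SQL injection. The same insert can be written with driver-level parameter placeholders; the sketch below uses the stdlib `sqlite3` module so it runs without a MySQL server (with MySQLdb the placeholder is `%s` instead of `?`, but the pattern is identical):

```python
import sqlite3

# In-memory stand-in for the MySQL ip_proxy table from the post.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE ip_proxy (ip TEXT, port TEXT, ip_type TEXT)")

ip_info = {"ip": "1.2.3.4", "port": "8080", "ip_type": "HTTP"}
# Placeholders let the driver escape the values instead of str.format.
cursor.execute(
    "INSERT INTO ip_proxy(ip, port, ip_type) VALUES (?, ?, ?)",
    (ip_info["ip"], ip_info["port"], ip_info["ip_type"]),
)
conn.commit()
print(cursor.execute("SELECT ip, port FROM ip_proxy").fetchall())
# -> [('1.2.3.4', '8080')]
```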
Summary

This chapter crawled free high-anonymity proxies from xicidaili.com, stored them in MySQL, validated each candidate against Baidu (deleting dead ones), and plugged the result into Scrapy through a RandomProxyMiddleware that sets request.meta["proxy"] on every outgoing request.