Crawling Taobao MM Images in Bulk with pyspider
Copyright notice: this post may be the blogger's original work; feel free to repost as long as you credit the source. https://blog.csdn.net/Jailman/article/details/79354568
The setup steps are not repeated here; the build mainly relies on fake-useragent, PhantomJS, and a proxy pool.

pyspider crawls fairly intelligently: when an image cannot be fetched, it pauses for a while and then probes again. Combined with fake-useragent, the proxy pool, and PhantomJS, the success rate is above 90%.
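Before the full script, here is a minimal sketch (mine, not the original author's) of how those pieces fit together. It assumes the jhao104/proxy_pool service is running at 127.0.0.1:5010 and that its /get_all/ endpoint returns a literal list of 'host:port' strings; the exact response shape depends on the proxy_pool version. Unlike the script below, which picks one proxy for the whole run, this version rotates the proxy and User-Agent per request via options of self.crawl(), and sets retries so pyspider backs off and re-fetches failures:

```python
import ast
import random

import requests
from fake_useragent import UserAgent
from pyspider.libs.base_handler import BaseHandler

ua = UserAgent()

def pick_proxy(pool_url='http://127.0.0.1:5010/get_all/'):
    """Pick one 'host:port' proxy from the local pool."""
    # literal_eval parses the list without the code-execution risk of eval().
    return random.choice(ast.literal_eval(requests.get(pool_url).text))

class RotatingHandler(BaseHandler):
    crawl_config = {'retries': 5}  # let pyspider back off and retry failed fetches

    def on_start(self):
        # proxy= and headers= are per-request options of self.crawl(),
        # so every request can carry a fresh proxy and User-Agent.
        self.crawl('https://mm.taobao.com/json/request_top_list.htm?page=1',
                   callback=self.index_page,
                   proxy=pick_proxy(),
                   headers={'User-Agent': ua.random})

    def index_page(self, response):
        pass  # parse the listing as in the full script below
```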
The code was adapted from someone else's and tweaked for speed and success rate. The full dataset is on the order of 100 GB, so only crawl it if you have the disk space.
The code is as follows:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-03-25 00:59:45
# Project: taobaomm
# Note: this is Python 2 code (reload/setdefaultencoding do not exist in Python 3).

from pyspider.libs.base_handler import *
from fake_useragent import UserAgent
import ast
import base64
import os
import random
import requests
import sys

reload(sys)
sys.setdefaultencoding('UTF-8')

PAGE_START = 1
PAGE_END = 4301
DIR_PATH = '/root/images/tbmm'


class Handler(BaseHandler):
    # Pick one proxy from the local proxy pool at class-definition time.
    # ast.literal_eval only parses literals, so a malformed response cannot
    # execute arbitrary code the way eval() could.
    r = requests.get(u'http://127.0.0.1:5010/get_all/')
    proxy = random.choice(ast.literal_eval(r.text))
    ua = UserAgent()
    crawl_config = {
        "proxy": proxy,
        "headers": {"User-Agent": ua.random},
    }

    def __init__(self):
        self.base_url = 'https://mm.taobao.com/json/request_top_list.htm?page='
        self.page_num = PAGE_START
        self.total_num = PAGE_END
        self.deal = Deal()

    def on_start(self):
        # Queue every listing page up front; pyspider deduplicates and schedules.
        while self.page_num <= self.total_num:
            url = self.base_url + str(self.page_num)
            self.crawl(url, callback=self.index_page)
            self.page_num += 1

    def index_page(self, response):
        # Each listing links to a model page rendered with JavaScript,
        # hence fetch_type='js' (PhantomJS).
        for each in response.doc('.lady-name').items():
            self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')

    def detail_page(self, response):
        # The model's personal page URL is scheme-relative; prepend https:.
        domain = response.doc('.mm-p-domain-info li > span').text()
        if domain:
            page_url = 'https:' + domain
            self.crawl(page_url, callback=self.domain_page)

    def domain_page(self, response):
        # Base64-encode the model's name so it is safe as a directory name.
        # urlsafe_b64encode avoids '/', which standard base64 may emit and
        # which would break the filesystem path.
        name = base64.urlsafe_b64encode(response.doc('.mm-p-model-info-left-top dd > a').text())
        dir_path = self.deal.mkDir(name)
        brief = response.doc('.mm-aixiu-content').text()
        if dir_path:
            imgs = response.doc('.mm-aixiu-content img').items()
            count = 1
            self.deal.saveBrief(brief, dir_path, name)
            for img in imgs:
                url = img.attr.src
                if url:
                    extension = self.deal.getExtension(url)
                    file_name = name + str(count) + '.' + extension
                    count += 1
                    # Pass the target path through to save_img via `save`.
                    self.crawl(img.attr.src, callback=self.save_img,
                               save={'dir_path': dir_path, 'file_name': file_name})

    def save_img(self, response):
        content = response.content
        dir_path = response.save['dir_path']
        file_name = response.save['file_name']
        file_path = dir_path + '/' + file_name
        self.deal.saveImg(content, file_path)


class Deal:
    """Filesystem helpers: create directories, save images and bios."""

    def __init__(self):
        self.path = DIR_PATH
        if not self.path.endswith('/'):
            self.path += '/'
        if not os.path.exists(self.path):
            os.makedirs(self.path)

    def mkDir(self, path):
        dir_path = self.path + path.strip()
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        return dir_path

    def saveImg(self, content, path):
        with open(path, 'wb') as f:
            f.write(content)

    def saveBrief(self, content, dir_path, name):
        file_name = dir_path + "/" + name + ".txt"
        with open(file_name, "w+") as f:
            f.write(content.encode('utf-8'))

    def getExtension(self, url):
        return url.split('.')[-1]
```
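To try it (a hedged note, assuming a standard pyspider install): install pyspider and fake-useragent, make sure the PhantomJS binary is on your PATH so the fetch_type='js' steps work, start all components with `pyspider all`, then open the webui at http://localhost:5000, create a project, paste the script, and switch it to RUNNING. Images land under /root/images/tbmm, one base64-named directory per model, each with a .txt bio alongside the images.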