19 - Scrapy Crawler Framework: Large File Downloads (06)
Large File Downloads
創(chuàng)建一個爬蟲工程:scrapy startproject proName
進入工程目錄創(chuàng)建爬蟲源文件:scrapy genspider spiderName www.xxx.com
執(zhí)行工程:scrapy crawl spiderName
大文件數(shù)據(jù)是在管道中請求到的
下載管道類是scrapy封裝好的直接調用即可:
from scrapy.pipelines.images import ImagesPipeline # 該管道提供數(shù)據(jù)下載功能(圖片視頻音頻皆可使用該類)
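For genuinely large non-image files (video, audio, archives), the same recipe works with FilesPipeline, the parent class of ImagesPipeline. A minimal sketch, assuming an item with the same 'src' and 'name' fields as in this project; the class name BigFilePipeline is made up for illustration:

import scrapy
from scrapy.pipelines.files import FilesPipeline  # general-purpose media downloader


class BigFilePipeline(FilesPipeline):  # hypothetical name; register it in ITEM_PIPELINES
    def get_media_requests(self, item, info):
        # request the raw file; Scrapy downloads it and persists it under FILES_STORE
        yield scrapy.Request(url=item['src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # store the file under the name carried by the item
        return request.meta['item']['name']

FilesPipeline reads its destination folder from FILES_STORE (the counterpart of the IMAGES_STORE setting below) and stores the bytes exactly as downloaded, without the image re-encoding that ImagesPipeline performs.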
Override three methods of the pipeline class:
def get_media_requests
- issues a request for each image URL
def file_path
- returns the file name to store the image under
def item_completed
- returns the item, handing it to the next pipeline class in line
Add IMAGES_STORE to the configuration file
- IMAGES_STORE = 'dirname'
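Concretely, the two additions this recipe makes to settings.py (both appear in the full file further down) are:

IMAGES_STORE = 'imgLibs'  # destination folder; created automatically if missing
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,  # the custom ImagesPipeline subclass
}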
img.py (the spider source file)
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # image URL
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            # image file name
            img_name = li.xpath('./a[1]/img/@alt').extract_first() + '.jpg'
            item = ImgproItem()
            item['name'] = img_name
            item['src'] = img_src
            yield item

items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py
# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

IMAGES_STORE = 'imgLibs'  # destination folder for downloads (created automatically if it does not exist)

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter

# The default pipeline cannot issue requests for us, so we leave it unused
# class ImgproPipeline:
#     def process_item(self, item, spider):
#         return item

# The pipeline receives the image URL and name from the item, requests the
# image data itself, and persists it to disk
from scrapy.pipelines.images import ImagesPipeline  # provides the download capability
import scrapy


class ImgsproPipeline(ImagesPipeline):
    # issue a request for each image URL
    def get_media_requests(self, item, info):
        print(item)
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # return the file name to store the image under
    def file_path(self, request, response=None, info=None):
        # retrieve the item from the request's meta
        item = request.meta['item']
        filePath = item['name']
        return filePath  # the image is saved as IMAGES_STORE/filePath

    # hand the item to the next pipeline class in line
    def item_completed(self, results, item, info):
        return item

Results: running scrapy crawl img downloads each image into the imgLibs folder, named after its alt text.
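Two optional refinements to the pipeline above, sketched under the assumption of Scrapy 2.4 or newer: since 2.4, file_path also receives the item directly as a keyword argument, so passing it through request meta is unnecessary; and item_completed's results argument is a list of (success, info) tuples that can be used to drop items whose download failed:

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class ImgsproPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # no meta needed: Scrapy >= 2.4 hands the item to file_path itself
        yield scrapy.Request(url=item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        return item['name']

    def item_completed(self, results, item, info):
        # results: [(True, {'url': ..., 'path': ..., 'checksum': ...}), (False, Failure), ...]
        if not any(ok for ok, _ in results):
            raise DropItem('download failed for %s' % item['src'])
        return item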
An overview of Scrapy's settings.py configuration file
# -*- coding: utf-8 -*-

# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# Name of the Scrapy project. It is used to build the default User-Agent and
# for logging, and is filled in automatically by the startproject command.
BOT_NAME = ''

# List of modules where Scrapy looks for spiders. Default: ['xxx.spiders']
SPIDER_MODULES = ['']
# Module where new spiders created by the genspider command go. Default: 'xxx.spiders'
NEWSPIDER_MODULE = ''

# Default User-Agent used when crawling, unless overridden
#USER_AGENT = ''

# If enabled, Scrapy honors robots.txt policies
ROBOTSTXT_OBEY = True

# Maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay for requests to the same website (default: 0). This is how long the
# downloader waits before fetching the next page from the same site; use it to
# throttle the crawl and lighten the load on the server. Fractions such as
# 0.25 are allowed; the unit is seconds.
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

# Only one of the two download-delay scopes below takes effect:
# Maximum number of concurrent requests to a single website
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests to a single IP. If non-zero,
# CONCURRENT_REQUESTS_PER_DOMAIN is ignored and this setting is used instead,
# i.e. the concurrency limit applies per IP rather than per website. It also
# affects DOWNLOAD_DELAY: if non-zero, the download delay is applied per IP
# rather than per website.
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable the Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo1.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo1.pipelines.Demo1Pipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True

# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# Maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60

# Average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
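One closing aside, not covered in the write-up above: any of these project-wide settings can be overridden per spider through the custom_settings class attribute, which is convenient when only one spider in a project downloads media. A minimal sketch:

import scrapy


class ImgSpider(scrapy.Spider):
    name = 'img'
    # per-spider overrides take precedence over the project's settings.py
    custom_settings = {
        'IMAGES_STORE': 'imgLibs',
        'DOWNLOAD_DELAY': 0.25,  # seconds; fractions are allowed
    }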