19 - Scrapy Crawler Framework: Large File Downloads (06)
Large File Downloads
創(chuàng)建一個爬蟲工程:scrapy startproject proName
進入工程目錄創(chuàng)建爬蟲源文件:scrapy genspider spiderName www.xxx.com
執(zhí)行工程:scrapy crawl spiderName
大文件數(shù)據(jù)是在管道中請求到的
下載管道類是scrapy封裝好的直接調用即可:
from scrapy.pipelines.images import ImagesPipeline # 該管道提供數(shù)據(jù)下載功能(圖片視頻音頻皆可使用該類)
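For genuinely large non-image files (video, audio, archives), the same recipe works with FilesPipeline, the parent class of ImagesPipeline. A minimal sketch, assuming an item with the same 'src' and 'name' fields as in this project; the class name BigFilePipeline is made up for illustration:

import scrapy
from scrapy.pipelines.files import FilesPipeline  # general-purpose media downloader


class BigFilePipeline(FilesPipeline):  # hypothetical name; register it in ITEM_PIPELINES
    def get_media_requests(self, item, info):
        # request the raw file; Scrapy downloads it and persists it under FILES_STORE
        yield scrapy.Request(url=item['src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # store the file under the name carried by the item
        return request.meta['item']['name']

FilesPipeline reads its destination folder from FILES_STORE (the counterpart of the IMAGES_STORE setting below) and stores the bytes exactly as downloaded, without the image re-encoding that ImagesPipeline performs.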
Override three methods of the pipeline class:
def get_media_requests
- issues a request for each image URL
def file_path
- returns the file name to store the image under
def item_completed
- returns the item, handing it to the next pipeline class in line
Add IMAGES_STORE to the configuration file
- IMAGES_STORE = 'dirname'
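Concretely, the two additions this recipe makes to settings.py (both appear in the full file further down) are:

IMAGES_STORE = 'imgLibs'  # destination folder; created automatically if missing
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,  # the custom ImagesPipeline subclass
}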
img.py (the spider source file)
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # image URL
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            # image file name
            img_name = li.xpath('./a[1]/img/@alt').extract_first() + '.jpg'
            item = ImgproItem()
            item['name'] = img_name
            item['src'] = img_src
            yield item

items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py
# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

IMAGES_STORE = 'imgLibs'  # destination folder for downloads (created automatically if it does not exist)

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter

# The default pipeline cannot issue requests for us, so we leave it unused
# class ImgproPipeline:
#     def process_item(self, item, spider):
#         return item

# The pipeline receives the image URL and name from the item, requests the
# image data itself, and persists it to disk
from scrapy.pipelines.images import ImagesPipeline  # provides the download capability
import scrapy


class ImgsproPipeline(ImagesPipeline):
    # issue a request for each image URL
    def get_media_requests(self, item, info):
        print(item)
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # return the file name to store the image under
    def file_path(self, request, response=None, info=None):
        # retrieve the item from the request's meta
        item = request.meta['item']
        filePath = item['name']
        return filePath  # the image is saved as IMAGES_STORE/filePath

    # hand the item to the next pipeline class in line
    def item_completed(self, results, item, info):
        return item

Results: running scrapy crawl img downloads each image into the imgLibs folder, named after its alt text.
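Two optional refinements to the pipeline above, sketched under the assumption of Scrapy 2.4 or newer: since 2.4, file_path also receives the item directly as a keyword argument, so passing it through request meta is unnecessary; and item_completed's results argument is a list of (success, info) tuples that can be used to drop items whose download failed:

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class ImgsproPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # no meta needed: Scrapy >= 2.4 hands the item to file_path itself
        yield scrapy.Request(url=item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        return item['name']

    def item_completed(self, results, item, info):
        # results: [(True, {'url': ..., 'path': ..., 'checksum': ...}), (False, Failure), ...]
        if not any(ok for ok, _ in results):
            raise DropItem('download failed for %s' % item['src'])
        return item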
An overview of Scrapy's settings.py configuration file
# -*- coding: utf-8 -*-

# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# Name of the Scrapy project. It is used to build the default User-Agent and
# for logging, and is filled in automatically by the startproject command.
BOT_NAME = ''

# List of modules where Scrapy looks for spiders. Default: ['xxx.spiders']
SPIDER_MODULES = ['']
# Module where new spiders created by the genspider command go. Default: 'xxx.spiders'
NEWSPIDER_MODULE = ''

# Default User-Agent used when crawling, unless overridden
#USER_AGENT = ''

# If enabled, Scrapy honors robots.txt policies
ROBOTSTXT_OBEY = True

# Maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay for requests to the same website (default: 0). This is how long the
# downloader waits before fetching the next page from the same site; use it to
# throttle the crawl and lighten the load on the server. Fractions such as
# 0.25 are allowed; the unit is seconds.
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

# Only one of the two download-delay scopes below takes effect:
# Maximum number of concurrent requests to a single website
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests to a single IP. If non-zero,
# CONCURRENT_REQUESTS_PER_DOMAIN is ignored and this setting is used instead,
# i.e. the concurrency limit applies per IP rather than per website. It also
# affects DOWNLOAD_DELAY: if non-zero, the download delay is applied per IP
# rather than per website.
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable the Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo1.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo1.pipelines.Demo1Pipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True

# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# Maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60

# Average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
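One closing aside, not covered in the write-up above: any of these project-wide settings can be overridden per spider through the custom_settings class attribute, which is convenient when only one spider in a project downloads media. A minimal sketch:

import scrapy


class ImgSpider(scrapy.Spider):
    name = 'img'
    # per-spider overrides take precedence over the project's settings.py
    custom_settings = {
        'IMAGES_STORE': 'imgLibs',
        'DOWNLOAD_DELAY': 0.25,  # seconds; fractions are allowed
    }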