Crawlers and the Scrapy Framework
Table of Contents
- 1 Introduction
- 2 Installation
- 3 Command-Line Tool
- 4 Project Structure and Spider Application Overview
- 5 Spiders
- 6 Selectors
- 7 Items
- 8 Item Pipeline
- 9 Downloader Middleware
- 10 Spider Middleware
- 11 settings.py
- 12 Crawling Amazon Product Data
1 Introduction
Scrapy is an open-source, collaborative framework. It was originally designed for page scraping (more precisely, web scraping) and lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more broadly: for data mining, monitoring, and automated testing, for consuming data returned by APIs (such as Amazon Associates Web Services), and as a general-purpose web crawler.
Scrapy is built on top of Twisted, a popular event-driven Python networking framework, so Scrapy uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.
The data flow in Scrapy is controlled by the execution engine (the original post showed the architecture diagram here; see the architecture page linked below).
Components:
- Engine: controls the data flow between all components of the system and triggers events when certain actions occur. See the data-flow description in the architecture docs linked below for details.
- Scheduler: accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.
- Downloader: downloads page content and returns it to the ENGINE. The downloader is built on Twisted, an efficient asynchronous model.
- Spiders: developer-defined classes that parse responses, extract items, or issue new requests.
- Item Pipeline: processes items after they have been extracted; typical operations are cleaning, validation, and persistence (for example saving to a database).
- Downloader Middleware: sits between the engine and the downloader; it processes requests passed from the ENGINE to the DOWNLOADER and responses passed from the DOWNLOADER back to the ENGINE, and can be used for several kinds of tasks.
- Spider Middleware: sits between the ENGINE and the SPIDERS; its main job is to process spider input (responses) and spider output (requests).
Official docs: https://docs.scrapy.org/en/latest/topics/architecture.html
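To make the division of labour concrete, here is a minimal sketch (not from the original post; the demo site quotes.toscrape.com and the CSS selectors are assumptions) of the only two pieces you normally write yourself: a spider and a pipeline. Everything between them, scheduling, downloading, and routing, is handled by the engine.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Spider: yields Requests (routed through the scheduler and downloader)
    and items (routed to the item pipeline)."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # assumed demo site

    def parse(self, response):
        # Items flow from the spider to the item pipeline.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}
        # New Requests flow back to the engine, then to the scheduler.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

class PrintPipeline(object):
    """Item pipeline: receives every item the spider yields.
    Enabled via ITEM_PIPELINES in settings.py."""
    def process_item(self, item, spider):
        print(item)
        return item
```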
2 Installation
```
# Windows
1. pip3 install wheel        # after this, packages can be installed from .whl files
                             # wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
3. pip3 install lxml
4. pip3 install pyopenssl
5. download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
6. download the Twisted wheel:   http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
7. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
8. pip3 install scrapy

# Linux
1. pip3 install scrapy
```

3 Command-Line Tool
```
#1 view help
scrapy -h
scrapy <command> -h

#2 there are two kinds of commands: Project-only commands must be run inside a project
#  directory, Global commands can be run anywhere

Global commands:
    startproject   # create a project:  scrapy startproject <project_name>
    genspider      # create a spider:   scrapy genspider <spider_name> <url>
    settings       # run inside a project directory to see that project's settings
    runspider      # run a standalone python spider file, no project needed
    shell          # scrapy shell <url>  interactive debugging, e.g. to check selector expressions
    fetch          # fetch a single page independently of any project; also shows request headers
    view           # download the page and open it in a browser, handy for spotting ajax-loaded data
    version        # scrapy version shows scrapy's version; scrapy version -v also shows dependency versions

Project-only commands:
    crawl          # run a spider; requires a project and ROBOTSTXT_OBEY = False in the settings
    check          # check the project for syntax errors
    list           # list the spiders in the project
    edit           # open a spider in an editor, rarely used
    parse          # scrapy parse <url> --callback <callback>  verify that a callback behaves as expected
    bench          # scrapy bench  run a quick benchmark

#3 official docs: https://docs.scrapy.org/en/latest/topics/commands.html
```

Example usage:

```
#1 global commands: run them outside any project directory so no project settings interfere
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX        # inside a project directory this shows that project's settings

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com   # if parts of the page are missing, those parts are loaded via ajax

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version      # scrapy's version
scrapy version -v   # dependency versions

#2 project-only commands: run them inside the project directory
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench
```
4 Project Structure and Spider Application Overview
```
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
```

File descriptions:
- scrapy.cfg: the project's main configuration, used when deploying Scrapy; crawler-related settings live in settings.py.
- items.py: the data-storage templates for structured data, similar to Django's Model.
- pipelines.py: data-processing behaviour, for example persisting the structured data.
- settings.py: configuration such as recursion depth, concurrency, download delay, and so on. Note: setting names must be UPPERCASE or they are silently ignored; e.g. USER_AGENT = 'xxxx'.
- spiders: the spider directory; create your spider files and write the crawling rules here.
Note: spider files are usually named after the domain of the site being crawled.
By default spiders are launched from the command line. To run one from PyCharm, create an entrypoint script:

```python
# entrypoint.py, created in the project directory
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])
```

About Windows console encoding: if the output is garbled, re-wrap stdout with a suitable encoding:

```python
import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
```

5 Spiders
1. Introduction
#1 Spiders are classes that define which site (or group of sites) will be crawled, including how to perform the crawl and how to extract structured data from the pages.
#2 In other words, a spider is where you customise the crawling and parsing behaviour for a particular site or group of sites.

2. A spider repeatedly does the following
#1 Generate the initial Requests for the first URLs and name a callback for them. The first requests are defined in start_requests(), which by default builds Requests from the URLs in start_urls, with parse as the default callback. The callback fires automatically once the response has been downloaded.
#2 In the callback, parse the response and return a value. The return value can be one of four things: a dict containing the parsed data, an Item object, a new Request (which in turn needs its own callback), or an iterable containing Items and/or Requests.
#3 In the callback, parse the page content, usually with Scrapy's built-in Selectors, though you can just as well use BeautifulSoup, lxml, or whatever else you prefer.
#4 Finally, returned Items are typically persisted to a database through the Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline) or exported to files through Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports).

3. Spiders provides five classes
```
#1 scrapy.spiders.Spider      # scrapy.Spider is an alias for scrapy.spiders.Spider
#2 scrapy.spiders.CrawlSpider
#3 scrapy.spiders.XMLFeedSpider
#4 scrapy.spiders.CSVFeedSpider
#5 scrapy.spiders.SitemapSpider
```

4. Importing and using them
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

class AmazonSpider(scrapy.Spider):   # custom class inheriting one of the spider base classes
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass
```

5. class scrapy.spiders.Spider
This is the simplest spider class; every other spider class (including the ones you define yourself) must inherit from it.
It provides no special functionality, only a default start_requests() method that reads URLs from start_urls, sends requests for them, and uses parse as the default callback.
Spider attributes and methods in detail:

```python
class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    custom_settings = {
        'BOT_NAME': 'Egon_Spider_Amazon',
        'REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
        }
    }

    def parse(self, response):
        pass
```

```
#1  name = 'amazon'
    The spider name. Scrapy uses it to locate the spider, so it is required and must be unique
    (in Python 2 this must be ASCII only).

#2  allowed_domains = ['www.amazon.cn']
    The domains the spider may crawl. If OffsiteMiddleware is enabled (it is by default),
    domains outside this list, and their subdomains, will not be crawled.
    To crawl https://www.example.com/1.html, add 'example.com' to the list.

#3  start_urls = ['http://www.amazon.cn/']
    If no URLs are specified explicitly, the first requests are generated from this list.

#4  custom_settings
    A dict of settings that override the project-level settings while this spider runs.
    It must be defined as a class attribute, because settings are loaded before the class is instantiated.

#5  settings
    self.settings['SETTING_NAME'] gives access to the values from settings.py;
    values defined in custom_settings still take precedence.

#6  logger
    A logger named after the spider, e.g.
    self.logger.debug('=============>%s' % self.settings['BOT_NAME'])

#7  crawler (for reference)
    This attribute is set by the from_crawler() class method.

#8  from_crawler(crawler, *args, **kwargs) (for reference)
    You probably won't need to override this directly because the default implementation
    acts as a proxy to the __init__() method, calling it with the given arguments args
    and named arguments kwargs.

#9  start_requests()
    Generates the first Requests and must return an iterable. Scrapy calls it exactly once,
    when the spider is opened. By default it yields Request(url, dont_filter=True) for every
    url in start_urls (see the deduplication rules below for the dont_filter parameter).
    Override it to change how the crawl starts, for example to begin with a POST request:

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

#10 parse(response)
    The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.

#11 log(message[, level, component]) (for reference)
    Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility.
    For more information see Logging from Spiders.

#12 closed(reason)
    Called automatically when the spider closes.
```

Deduplication rules: URLs should be deduplicated across all spiders, so that once any spider has crawled a URL no other spider crawls it again. Some ways to implement this:

```python
# Method 1: keep visited URLs in a class attribute
# 1. add a class attribute:
visited = set()
# 2. check it in the parse callback:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......
    self.visited.add(response.url)

# Method 1, improved: URLs can be long, so store a hash of the URL instead
def parse(self, response):
    url = md5(response.request.url)
    if url in self.visited:
        return None
    .......
    self.visited.add(url)


# Method 2: Scrapy's built-in deduplication
# settings.py:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'      # the default filter, kept in memory
DUPEFILTER_DEBUG = False
JOBDIR = "path where crawl state is saved, e.g. /root/"   # seen requests end up in /root/requests.seen
# RFPDupeFilter is the default filter; pass Request(..., dont_filter=False) to use it.
# dont_filter=True tells Scrapy not to deduplicate that URL.


# Method 3: a custom filter modelled on RFPDupeFilter
# (from scrapy.dupefilter import RFPDupeFilter -- read the source and mimic BaseDupeFilter)

# Step 1: create dup.py in the project directory
class UrlFilter(object):
    def __init__(self):
        self.visited = set()   # or keep them in a database

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)

    def open(self):            # can return deferred
        pass

    def close(self, reason):   # can return a deferred
        pass

    def log(self, request, spider):   # log that a request has been filtered
        pass

# Step 2: settings.py
DUPEFILTER_CLASS = '<project_name>.dup.UrlFilter'

# Where it is used (source): from scrapy.core.scheduler import Scheduler
# see Scheduler.enqueue_request, which calls self.df.request_seen(request)
```

Examples:

```python
# Example 1
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)


# Example 2: one callback returning multiple Requests and Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)


# Example 3: specify the start URLs directly in start_requests(); start_urls is then unused
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
```

Passing arguments: sometimes a spider needs arguments from the command line, for example the initial URL or a search keyword:

```python
# command line
# scrapy crawl myspider -a category=electronics

# the arguments are received in __init__
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
    # ...

# Note: all arguments arrive as strings; for structured data use something like json.loads
```

6. Other generic spiders: https://docs.scrapy.org/en/latest/topics/spiders.html#generic-spiders
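The generic spiders are only linked above. As a rough illustration (a hypothetical example, not from the original post; the demo site and selectors are assumptions), a CrawlSpider uses Rule and LinkExtractor to follow links automatically instead of yielding Requests by hand:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'mycrawl'
    allowed_domains = ['quotes.toscrape.com']      # assumed demo site
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        # Follow pagination links; CrawlSpider re-applies the rules on every followed page.
        Rule(LinkExtractor(restrict_css='li.next'), follow=True),
        # Parse author pages with a dedicated callback (do not override parse in a CrawlSpider).
        Rule(LinkExtractor(allow=r'/author/'), callback='parse_author'),
    )

    def parse_author(self, response):
        yield {
            'name': response.css('h3.author-title::text').extract_first(),
            'born': response.css('span.author-born-date::text').extract_first(),
        }
```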
6 Selectors
Covered below: // vs /, text, extract/extract_first, attributes, nested queries, default values, attribute filters, fuzzy attribute matching, regular expressions, relative xpath, and xpath with variables.

```python
# response.selector.css() and response.selector.xpath() can be shortened to
# response.css() and response.xpath()

#1  // vs /
response.xpath('//body/a')
response.css('div a::text')

>>> response.xpath('//body/a')      # the leading // searches the whole document; /a means direct children of body
[]
>>> response.xpath('//body//a')     # //a after body means all descendants of body
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2  text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3  extract and extract_first: pull the content out of Selector objects
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4  attributes: prefix the attribute name with @ in xpath
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#5  nested queries
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#6  default values
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#7  filter by attribute
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[href="image3.html"]::text').extract()

#8  fuzzy attribute matching
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#9  regular expressions
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#10 relative xpath
>>> res = response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img')   # this scans the whole document again
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#11 xpath with variables
>>> response.xpath('//div[@id=$xxx]/a/text()', xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id', yyy=5).extract_first()   # the id of the div containing 5 <a> tags
'images'
```

https://docs.scrapy.org/en/latest/topics/selectors.html
7 Items
https://docs.scrapy.org/en/latest/topics/items.html
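The post leaves this section as a bare link. As a hedged sketch (the item name and fields are assumptions, not from the post), an Item is declared by subclassing scrapy.Item and is then filled in like a dict inside a spider callback:

```python
import scrapy

class BookItem(scrapy.Item):
    # Each Field() declares one attribute the spider is allowed to fill in.
    title = scrapy.Field()
    price = scrapy.Field()

# Typical use inside a spider callback:
#   item = BookItem()
#   item['title'] = response.xpath('//h1/text()').extract_first()
#   item['price'] = response.css('p.price::text').extract_first()
#   yield item
# Assigning to an undeclared key, e.g. item['author'] = ..., raises KeyError,
# which is what distinguishes Items from plain dicts.
```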
8 Item Pipeline
You can write multiple Pipeline classes:
1. If a higher-priority pipeline's process_item returns a value (or None), that value is automatically passed to the next pipeline's process_item.
2. If you want only the first pipeline to run, make its process_item raise DropItem().
3. You can check spider.name == '<spider name>' to control which spiders use which pipelines.

Example:

```python
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    @classmethod
    def from_crawler(cls, crawler):
        """Scrapy first checks (via getattr) whether we defined from_crawler;
        if so, it is used to instantiate the pipeline."""
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """Run once, when the spider starts."""
        print('000000')

    def close_spider(self, spider):
        """Run once, when the spider closes."""
        print('111111')

    def process_item(self, item, spider):
        # process and persist the item
        # returning it means later pipelines keep processing it
        return item
        # to drop the item so that later pipelines never see it:
        # raise DropItem()
```

A custom pipeline persisting to MongoDB:

```python
# 1. settings.py
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"
DB = "amazon"
TABLE = "goods"

ITEM_PIPELINES = {
    'Amazon.pipelines.CustomPipeline': 200,
}

# 2. pipelines.py
from pymongo import MongoClient

class CustomPipeline(object):
    def __init__(self, host, port, user, pwd, db, table):
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd
        self.db = db
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        """Scrapy first checks (via getattr) whether we defined from_crawler;
        if so, it is used to instantiate the pipeline."""
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST, PORT, USER, PWD, DB, TABLE)

    def open_spider(self, spider):
        """Run once, when the spider starts."""
        self.client = MongoClient('mongodb://%s:%s@%s:%s' % (self.user, self.pwd, self.host, self.port))

    def close_spider(self, spider):
        """Run once, when the spider closes."""
        self.client.close()

    def process_item(self, item, spider):
        # persist the item
        self.client[self.db][self.table].save(dict(item))
```

https://docs.scrapy.org/en/latest/topics/item-pipeline.html
9 Downloader Middleware
```python
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request passing through the downloader middleware chain.
        :param request:
        :param spider:
        :return:
            None: continue to the next middleware and download the request
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and hand the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, after the download has finished.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed to the next middleware's process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or process_request (a downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: pass the exception on to the next middleware
            Response object: stop calling further process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
```

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
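The post shows the middleware class but not how to enable it. Assuming a project named myproject, a sketch of the registration in settings.py would look like this (the module path and the priority number are illustrative):

```python
# settings.py (hypothetical project name "myproject")
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DownMiddleware1': 543,   # lower numbers run earlier on requests
    # set a built-in middleware's value to None to disable it
}
```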
A downloader middleware that attaches a proxy (from a local proxy-pool service) to every request:

```python
import requests

class DownMiddleware1(object):
    @staticmethod
    def get_proxy():
        return requests.get("http://127.0.0.1:5010/get/").text

    @staticmethod
    def delete_proxy(proxy):
        requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

    def process_request(self, request, spider):
        """
        Called for every request passing through the downloader middleware chain.
        :return:
            None: continue to the next middleware and download the request
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and hand the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        if not hasattr(DownMiddleware1, 'proxy_addr'):
            DownMiddleware1.proxy_addr = self.get_proxy()
        request.meta['download_timeout'] = 5
        request.meta["proxy"] = "http://" + self.proxy_addr
        print('meta', request.meta)
        # after too many retries or too much depth on the same proxy, rotate to a new one
        if request.meta.get('depth') == 10 or request.meta.get('retry_times') == 2:
            request.meta['depth'] = 0
            request.meta['retry_times'] = 0
            self.delete_proxy(self.proxy_addr)
            DownMiddleware1.proxy_addr = self.get_proxy()
            request.meta["proxy"] = "http://" + self.proxy_addr
            print('============>', request.meta)
            return request
        return None
```

10 Spider Middleware
```python
class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        """
        Called after the download finishes, before the response is handed to parse.
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with the results the spider returns.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable of Request and/or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when an exception is raised.
        :param response:
        :param exception:
        :param spider:
        :return:
            None: pass the exception on to the next middleware
            an iterable of Response or Item objects: handed to the scheduler or the pipelines
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts, with its start requests.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests
```

https://docs.scrapy.org/en/latest/topics/spider-middleware.html
11 settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Bot name
BOT_NAME = 'step8_king'

# 2. Spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. robots.txt handling
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay, in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Concurrency per domain (the download delay is also applied per domain)
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
# and the download delay is applied per IP instead
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Cookie support (cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console can inspect and control a running crawler
#    connect with: telnet <ip> <port>, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Item pipelines
# ITEM_PIPELINES = {
#     'step8_king.pipelines.JsonPipeline': 700,
#     'step8_king.pipelines.FilePipeline': 500,
# }

# 12. Custom extensions, driven by signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. Maximum crawl depth (the current depth is available via meta); 0 means unlimited
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 = depth first, LIFO (default); 1 = breadth first, FIFO

# LIFO, depth first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# FIFO, breadth first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. The auto-throttle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    1. take the minimum delay:  DOWNLOAD_DELAY
    2. take the maximum delay:  AUTOTHROTTLE_MAX_DELAY
    3. set the initial delay:   AUTOTHROTTLE_START_DELAY
    4. when a request finishes, measure its latency, i.e. the time between connecting
       and receiving the response headers
    5. use AUTOTHROTTLE_TARGET_CONCURRENCY to update the delay:
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
"""

# enable auto throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. HTTP caching
    Caches requests and responses that have already been fetched so they can be reused later.
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# enable the cache
# HTTPCACHE_ENABLED = True

# cache policy: cache every request; later identical requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: honour HTTP caching headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiration, in seconds
# HTTPCACHE_EXPIRATION_SECS = 0
# cache directory
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. Proxies
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option one: the default HttpProxyMiddleware reads the environment variables
    {
        http_proxy:  http://root:woshiniba@192.168.11.11:9999/
        https_proxy: http://192.168.11.11:9999/
    }

    Option two: a custom downloader middleware

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
        'step8_king.middlewares.ProxyMiddleware': 500,
    }
"""

"""
20. HTTPS
    There are two cases when crawling https sites:
    1. the site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. the site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # pKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )

    Related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. Spider middleware (see section 10 above for the full, commented class)
    class SpiderMiddleware(object):

        def process_spider_input(self, response, spider):
            # called after the download finishes, before the response is handed to parse
            pass

        def process_spider_output(self, response, result, spider):
            # called with the results the spider returns;
            # must return an iterable of Request and/or Item objects
            return result

        def process_spider_exception(self, response, exception, spider):
            # called on exceptions; return None to pass the exception on, or an iterable of
            # Response/Item objects to hand to the scheduler or the pipelines
            return None

        def process_start_requests(self, start_requests, spider):
            # called when the spider starts; must return an iterable of Request objects
            return start_requests

    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. Downloader middleware (see section 9 above for the full, commented class)
    class DownMiddleware1(object):

        def process_request(self, request, spider):
            # return None to continue, a Response to skip downloading, a Request to reschedule,
            # or raise IgnoreRequest to jump to process_exception
            pass

        def process_response(self, request, response, spider):
            # return a Response to pass on, a Request to reschedule, or raise IgnoreRequest
            # to invoke Request.errback
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            # called when the download handler or process_request raises an exception;
            # return None to pass the exception on, a Response to stop further process_exception
            # calls, or a Request to reschedule the download
            return None

    Default downloader middlewares:
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'step8_king.middlewares.DownMiddleware1': 100,
#     'step8_king.middlewares.DownMiddleware2': 500,
# }
```

12 Crawling Amazon Product Data
1. Create the project and the spider:

```
scrapy startproject Amazon
cd Amazon
scrapy genspider spider_goods www.amazon.cn
```

2. settings.py:

```python
ROBOTSTXT_OBEY = False

# request headers
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://www.amazon.cn/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
}

# uncomment / enable the HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

3. items.py:

```python
class GoodsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # product name
    goods_name = scrapy.Field()
    # price
    goods_price = scrapy.Field()
    # delivery method
    delivery_method = scrapy.Field()
```

4. spider_goods.py:

```python
# -*- coding: utf-8 -*-
import scrapy

from Amazon.items import GoodsItem
from scrapy.http import Request
from urllib.parse import urlencode

class SpiderGoodsSpider(scrapy.Spider):
    name = 'spider_goods'
    allowed_domains = ['www.amazon.cn']
    # start_urls = ['http://www.amazon.cn/']

    def __init__(self, keyword=None, *args, **kwargs):
        super(SpiderGoodsSpider, self).__init__(*args, **kwargs)
        self.keyword = keyword

    def start_requests(self):
        url = 'https://www.amazon.cn/s/ref=nb_sb_noss_1?'
        paramas = {
            '__mk_zh_CN': '亞馬遜網(wǎng)站',
            'url': 'search - alias = aps',
            'field-keywords': self.keyword
        }
        url = url + urlencode(paramas, encoding='utf-8')
        yield Request(url, callback=self.parse_index)

    def parse_index(self, response):
        print('parsing index page: %s' % response.url)

        urls = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        for url in urls:
            yield Request(url, callback=self.parse_detail)

        next_url = response.urljoin(response.xpath('//*[@id="pagnNextLink"]/@href').extract_first())
        print('next page url:', next_url)
        yield Request(next_url, callback=self.parse_index)

    def parse_detail(self, response):
        print('parsing detail page: %s' % response.url)

        item = GoodsItem()
        # product name
        item['goods_name'] = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()
        # price
        item['goods_price'] = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first().strip()
        # delivery method
        item['delivery_method'] = ''.join(response.xpath('//*[@id="ddmMerchantMessage"]//text()').extract())
        return item
```

5. Custom pipelines:

```python
# mysqlpipelines/sql.py
import pymysql
import settings

MYSQL_HOST = settings.MYSQL_HOST
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_USER = settings.MYSQL_USER
MYSQL_PWD = settings.MYSQL_PWD
MYSQL_DB = settings.MYSQL_DB

conn = pymysql.connect(
    host=MYSQL_HOST,
    port=int(MYSQL_PORT),
    user=MYSQL_USER,
    password=MYSQL_PWD,
    db=MYSQL_DB,
    charset='utf8'
)
cursor = conn.cursor()

class Mysql(object):
    @staticmethod
    def insert_tables_goods(goods_name, goods_price, deliver_mode):
        sql = 'insert into goods(goods_name,goods_price,delivery_method) values(%s,%s,%s)'
        cursor.execute(sql, args=(goods_name, goods_price, deliver_mode))
        conn.commit()

    @staticmethod
    def is_repeat(goods_name):
        sql = 'select count(1) from goods where goods_name=%s'
        cursor.execute(sql, args=(goods_name,))
        if cursor.fetchone()[0] >= 1:
            return True

if __name__ == '__main__':
    cursor.execute('select * from goods;')
    print(cursor.fetchall())
```

```python
# mysqlpipelines/pipelines.py
from Amazon.mysqlpipelines.sql import Mysql

class AmazonPipeline(object):
    def process_item(self, item, spider):
        goods_name = item['goods_name']
        goods_price = item['goods_price']
        delivery_mode = item['delivery_method']
        if not Mysql.is_repeat(goods_name):
            Mysql.insert_tables_goods(goods_name, goods_price, delivery_mode)
```

6. Create the database and the table:

```sql
create database amazon charset utf8;
create table goods(
    id int primary key auto_increment,
    goods_name char(30),
    goods_price char(20),
    delivery_method varchar(50)
);
```

7. settings.py:

```python
MYSQL_HOST = 'localhost'
MYSQL_PORT = '3306'
MYSQL_USER = 'root'
MYSQL_PWD = '123'
MYSQL_DB = 'amazon'

# the number is the priority (any value from 1 to 1000; lower numbers run first)
ITEM_PIPELINES = {
    'Amazon.mysqlpipelines.pipelines.AmazonPipeline': 1,
}
```

8. Create entrypoint.py in the project directory:

```python
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'spider_goods', '-a', 'keyword=iphone8'])
```

Source download: https://pan.baidu.com/s/1boCEBT1
Reposted from: https://www.cnblogs.com/yifugui/p/8336096.html