Part 4. The Scrapy crawler framework: using Scrapy pipelines

Published: 2024/7/5


Using Scrapy pipelines

Learning objective:

Learn how to use Scrapy pipelines (pipelines.py).

We covered the basics of pipelines in the earlier section on getting started with Scrapy; this section looks at them in more depth.

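1. Common pipeline methods:

process_item(self, item, spider):

Required in every pipeline class.
Processes each item the spider yields.
Must return the item, otherwise later pipelines do not receive it.

open_spider(self, spider): runs exactly once, when the spider starts.

close_spider(self, spider): runs exactly once, when the spider closes.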
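The order in which the three hooks fire can be sketched without installing Scrapy, because a pipeline is an ordinary Python class. In the sketch below, CollectPipeline, run_spider and the stub spider object are hypothetical names used only for illustration; in a real project Scrapy itself makes these calls.

```python
from types import SimpleNamespace


class CollectPipeline:
    """Hypothetical pipeline that records the order in which hooks fire."""

    def open_spider(self, spider):
        # Called once, before any item arrives
        self.events = ["open"]

    def process_item(self, item, spider):
        # Called once per item; return the item so later pipelines see it
        self.events.append(f"item:{item['title']}")
        return item

    def close_spider(self, spider):
        # Called once, after the last item
        self.events.append("close")


def run_spider(pipeline, spider, items):
    """Stand-in for Scrapy's call sequence (an assumption for this demo)."""
    pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return results


spider = SimpleNamespace(name="job")  # stub exposing only the name attribute
pipeline = CollectPipeline()
results = run_spider(pipeline, spider, [{"title": "a"}, {"title": "b"}])
print(pipeline.events)
```

The recorded events show open_spider first, one process_item call per item, and close_spider last.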

2. Editing the pipeline file

Continue building out the wangyi spider by filling in pipelines.py.

My code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

from pymongo import MongoClient


class WangyiPipeline:
    # def __init__(self):
    #     self.file = open('wangyi.json', 'w')

    def open_spider(self, spider):
        if spider.name == 'job':
            self.file = open('wangyi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'job':
            # Convert the item object to a dict
            item = dict(item)
            # Serialize the dict to a JSON string
            str_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(str_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'job':
            self.file.close()


class WangyiSimplePipeline:
    # def __init__(self):
    #     self.file = open('wangyi.json', 'w')

    def open_spider(self, spider):
        if spider.name == 'job_simple':
            self.file = open('wangyiSimple.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'job_simple':
            # Convert the item object to a dict
            item = dict(item)
            # Serialize the dict to a JSON string
            str_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(str_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'job_simple':
            self.file.close()


class MongoPipeline(object):
    def open_spider(self, spider):
        self.client = MongoClient('127.0.0.1', 27017)
        self.db = self.client['itcast']
        self.col = self.db['wangyi']

    def process_item(self, item, spider):
        # Convert the item object to a dict
        data = dict(item)
        # Write the data to the database
        # (insert_one replaces Collection.insert, which PyMongo 4 removed)
        self.col.insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()


class WangyiFilePipeline(object):
    def open_spider(self, spider):  # runs once when the spider starts
        if spider.name == 'itcast':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # runs once when the spider closes
        if spider.name == 'itcast':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # Without this return, the next lower-priority pipeline does not get the item
        return item


class WangyiMongoPipeline(object):
    def open_spider(self, spider):  # runs once when the spider starts
        # isinstance() could also be used here to tell spider classes apart
        if spider.name == 'itcast':
            # Create the MongoClient and keep it so it can be closed later
            self.client = MongoClient(host='127.0.0.1', port=27017)
            # Handle for the teachers collection in the itcast database
            self.collection = self.client.itcast.teachers

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # The document must be a dict, so convert the item before inserting
            self.collection.insert_one(dict(item))
        # Without this return, the next lower-priority pipeline does not get the item
        return item

    def close_spider(self, spider):
        if spider.name == 'itcast':
            self.client.close()

3. Enabling the pipelines

Enable the pipelines in settings.py:

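......
# Each key is a dotted path and must match your own package and class names
# (pipelines.py above defines WangyiFilePipeline and WangyiMongoPipeline).
ITEM_PIPELINES = {
    'myspider.pipelines.ItcastFilePipeline': 400,   # 400 is the weight (priority)
    'myspider.pipelines.ItcastMongoPipeline': 500,  # the smaller the value, the earlier it runs
}
......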
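To see what the weights imply, the following sketch sorts a hypothetical ITEM_PIPELINES dictionary by value. The dotted paths are placeholders and execution_order is an illustrative helper, not a Scrapy function; it only mirrors the rule that smaller values run first.

```python
# Hypothetical ITEM_PIPELINES value; the dotted paths are placeholders
ITEM_PIPELINES = {
    "myspider.pipelines.SaveToMongo": 500,
    "myspider.pipelines.CleanFields": 300,
    "myspider.pipelines.SaveToFile": 400,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending weight."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
for path in order:
    print(path)
```

The order in which the entries are written in the dictionary does not matter; only the numeric values do.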

Remember to start the MongoDB service first: sudo service mongodb start
Then inspect the stored data from the MongoDB shell: mongo

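Question: settings.py can enable several pipelines at once. Why would you need more than one?

Different pipelines can handle data from different spiders, told apart by the spider.name attribute.

Different pipelines can apply different processing steps to the data of one or more spiders, for example one pipeline cleans the data and another saves it.

A single pipeline class can also handle data from several spiders, again told apart by spider.name.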
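The points above can be sketched as two small stages: one that cleans every item, and one that branches on spider.name. CleanPipeline, TagBySpiderPipeline, the process helper and the "source" field are hypothetical; the spider names job and job_simple are the ones used in pipelines.py.

```python
from types import SimpleNamespace


class CleanPipeline:
    """Hypothetical cleaning stage shared by every spider."""

    def process_item(self, item, spider):
        # Strip surrounding whitespace from every string field
        return {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }


class TagBySpiderPipeline:
    """Hypothetical stage that treats items differently per spider name."""

    def process_item(self, item, spider):
        if spider.name == "job":
            item["source"] = "full listing"
        elif spider.name == "job_simple":
            item["source"] = "simple listing"
        return item  # items from any other spider pass through unchanged


def process(item, spider):
    # Stand-in for Scrapy passing the item through both stages in order
    for stage in (CleanPipeline(), TagBySpiderPipeline()):
        item = stage.process_item(item, spider)
    return item


job_item = process({"name": "  engineer \n"}, SimpleNamespace(name="job"))
other_item = process({"name": " analyst"}, SimpleNamespace(name="other"))
print(job_item)
print(other_item)
```

Keeping cleaning and tagging in separate classes means either stage can be switched off in settings.py without touching the other.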

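4. Notes on using pipelines

A pipeline must be enabled in settings.py before it takes effect.

In the ITEM_PIPELINES setting, each key is the pipeline's location (its import path, so the class can live anywhere in the project), and each value indicates its distance from the engine: the closer it is, the sooner data passes through it, so pipelines with smaller values run first.

When there are several pipelines, process_item must return the item; otherwise the next pipeline receives None.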
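A simplified model of the chain shows why the return matters. This is not Scrapy's actual implementation: run_chain is a hypothetical helper that hands each stage's return value to the next stage, and all class names are illustrative.

```python
class ForgetsReturn:
    def process_item(self, item, spider):
        item["seen"] = True  # changes the item but has no return statement


class ReturnsItem:
    def process_item(self, item, spider):
        item["seen"] = True
        return item


class Recorder:
    """Stores whatever the previous stage handed over."""

    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_chain(stages, item, spider=None):
    # Each stage receives the return value of the stage before it
    for stage in stages:
        item = stage.process_item(item, spider)
    return item


bad, good = Recorder(), Recorder()
run_chain([ForgetsReturn(), bad], {"title": "a"})
run_chain([ReturnsItem(), good], {"title": "a"})
print(bad.received)
print(good.received)
```

The recorder placed after ForgetsReturn stores None, while the one placed after ReturnsItem stores the updated item.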

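Every pipeline must define process_item; without it the pipeline cannot receive or process items.

process_item receives item and spider, where spider is the spider that yielded the item.

open_spider(spider): runs once when the spider starts.

close_spider(spider): runs once when the spider closes.

These two methods are commonly used for database interaction: open the connection when the spider starts and close it when the spider finishes.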
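The connect-once, disconnect-once pattern can be tried without a running database server. The sketch below swaps MongoDB for the standard-library sqlite3 module purely so that it runs anywhere; SqlitePipeline, the jobs table and the title field are hypothetical.

```python
import sqlite3
from types import SimpleNamespace


class SqlitePipeline:
    """Hypothetical pipeline; sqlite3 stands in for MongoDB in this sketch."""

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE jobs (title TEXT)")

    def process_item(self, item, spider):
        # Reuse the open connection for every item
        self.conn.execute("INSERT INTO jobs (title) VALUES (?)", (item["title"],))
        return item

    def close_spider(self, spider):
        # Commit and disconnect once when the spider finishes
        self.conn.commit()
        self.conn.close()


spider = SimpleNamespace(name="job")  # stub exposing only the name attribute
pipeline = SqlitePipeline()
pipeline.open_spider(spider)
for title in ("engineer", "analyst"):
    pipeline.process_item({"title": title}, spider)
row_count = pipeline.conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
pipeline.close_spider(spider)
print(row_count)
```

Opening the connection in open_spider rather than in process_item avoids reconnecting for every item, which is the same reason the MongoDB pipelines create their MongoClient there.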

Summary

Pipelines clean and save data, and you can define several pipelines to handle different tasks. Three methods are involved:
- process_item(self, item, spider): processes each item
- open_spider(self, spider): runs exactly once, when the spider starts
- close_spider(self, spider): runs exactly once, when the spider closes
