
Scrapy Tutorial (10): Pipelines and Databases

Published: 2023/12/10

Scrapy hands the scraped data to item pipelines for processing. The pipelines live in the project's pipelines.py file.

Pipeline workflow

I. Define the item

An item describes the data structure: it declares which fields the data contains.

    import scrapy

    class TianqiItem(scrapy.Item):
        # define the fields for your item here like:
        city = scrapy.Field()  # city
        date = scrapy.Field()  # date
        hour = scrapy.Field()  # hour
        day = scrapy.Field()   # daytime

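The pattern is fixed and should not be changed arbitrarily: every field is a class attribute assigned scrapy.Field(). Note that the class contains no return statement.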

II. Build the item in the spider

The spider must organize the scraped data according to the item's structure.

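    # inside the spider callback; year, month, day and hour are assumed to be set earlier
    item = TianqiItem()
    item['city'] = response.xpath('//a[@id="lastBread"]//text()').extract_first()[:-4]
    item['date'] = '%s-%s-%s' % (year, month, day)
    item['hour'] = hour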

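Note that the callback must end by handing the item back, with return item or yield item.

A callback can also return several items at once, for example return item, item2 (Scrapy iterates over the returned tuple). Every item then passes through every enabled pipeline; Scrapy does not route items by key. A pipeline that should act only on items with a certain key has to check for that key itself, because reading a key the item does not have raises KeyError.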
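A pipeline that should act only on items carrying a given key can test for it explicitly. The following is a minimal sketch, not part of the original project: the class name CityOnlyPipeline and the use of plain dict items are assumptions made for illustration (Scrapy also accepts dicts as items).

```python
class CityOnlyPipeline:
    """Hypothetical pipeline that only handles items carrying a 'city' field."""

    def __init__(self):
        self.seen_cities = []

    def process_item(self, item, spider):
        # Items without the field are passed on unchanged for later pipelines.
        if 'city' not in item:
            return item
        self.seen_cities.append(item['city'])
        return item


pipeline = CityOnlyPipeline()
weather = {'city': 'Beijing', 'date': '2019-5-24', 'hour': '08'}  # sample data
proxy = {'ip': '127.0.0.1', 'port': 8080}                         # sample data
for it in (weather, proxy):
    pipeline.process_item(it, spider=None)
print(pipeline.seen_cities)
```

Both items reach the pipeline, but only the one with a city key is recorded; the other is returned untouched so that later pipelines still see it.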

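III. Process the item in a pipeline

1. Items produced by the spider enter the pipeline automatically.

2. The pipeline checks whether the incoming data is an item, that is, one of the types defined in items.py.

3. If it is an item, processing continues; otherwise it is ignored.

4. Return the item. This is mandatory: the returned item flows into the next pipeline, or tells the engine that processing is finished. If process_item returns nothing, the next pipeline receives None instead of the item.

5. Move on to the next item.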
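The importance of returning the item can be seen in a toy model of the chain. This is a simplified sketch, not Scrapy's actual (asynchronous) implementation; run_pipelines, AddSource and ForgetsReturn are hypothetical names.

```python
def run_pipelines(item, pipelines, spider=None):
    """Toy model: feed each pipeline the value returned by the previous one."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


class AddSource:
    def process_item(self, item, spider):
        item['source'] = 'tianqi'
        return item  # the returned item goes on to the next pipeline


class ForgetsReturn:
    def process_item(self, item, spider):
        item['checked'] = True  # no return: the next stage receives None


good = run_pipelines({'city': 'Beijing'}, [AddSource()])
bad = run_pipelines({'city': 'Beijing'}, [AddSource(), ForgetsReturn()])
print(good)
print(bad)
```

In this model the second run ends with None, because ForgetsReturn never hands the item back.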

    import redis

    class TianqiPipeline(object):
        def __init__(self):
            # text append mode, because process_item writes str, not bytes
            self.f = open('save.txt', 'a', encoding='utf-8')

        def process_item(self, item, spider):
            print(item)
            self.f.write(str(dict(item)) + '\n')  # one item per line
            return item

        def close_spider(self, spider):
            self.f.close()

    class RedisPipeline(object):
        def open_spider(self, spider):
            host = spider.settings.get('REDIS_HOST', 'localhost')
            port = spider.settings.get('REDIS_PORT', 6379)
            db = spider.settings.get('REDIS_DB_INDEX', 0)
            self.redis_con = redis.StrictRedis(host=host, port=port, db=db)

        def process_item(self, item, spider):
            # one hash per city:date:hour key; hset(mapping=...) replaces the deprecated hmset
            key = '%s:%s:%s' % (item['city'], item['date'], item['hour'])
            self.redis_con.hset(key, mapping=dict(item))
            return item

        def close_spider(self, spider):
            self.redis_con.connection_pool.disconnect()

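Code explanation

Required method: process_item, which processes the data.

Optional method: __init__, which runs once, when the pipeline is instantiated at spider startup.

Optional method: open_spider, called before crawling starts.

Optional method: close_spider, called after crawling has finished.

Optional method: from_crawler, a class method used to read the configuration.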

A MongoDB pipeline uses all of the methods above.

from_crawler runs first and reads the configuration; open_spider then creates the database connection.
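The sequence described here can be sketched as follows. This is an illustrative sketch, not the original article's listing: the setting names MONGO_URI and MONGO_DATABASE, their default values, and the collection name are assumptions, and pymongo must be installed before open_spider is called.

```python
class MongoPipeline:
    collection_name = 'tianqi_items'  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline; read the config from settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'tianqi'),
        )

    def open_spider(self, spider):
        # Imported here so the class can be loaded even without pymongo.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Store the item as one document, then pass it on.
        self.db[self.collection_name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Keeping the connection out of __init__ means that merely constructing the pipeline opens no network connection; the connection lives exactly as long as the spider runs.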

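IV. Enable the pipelines

Pipelines are enabled in settings.py. The integer assigned to each pipeline sets the order: pipelines with lower values run first.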

    ITEM_PIPELINES = {
        'tianqi.pipelines.TianqiPipeline': 300,
        'tianqi.pipelines.RedisPipeline': 301,
    }

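A limitation

Configured this way, the pipelines apply to every spider in the project.

We could check inside each pipeline which spider is running, using the spider argument or the keys present in the item, but that approach is repetitive.

A better approach is to set the custom_settings attribute on the spider class.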
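The spider-based check might look like the sketch below. The class name TianqiOnlyPipeline and the spider name 'tianqi' are assumptions made for illustration.

```python
class TianqiOnlyPipeline:
    """Hypothetical pipeline that acts only for the spider named 'tianqi'."""

    def __init__(self):
        self.handled = []

    def process_item(self, item, spider):
        # Scrapy passes the running spider in; its name identifies it.
        if spider.name != 'tianqi':
            return item  # items from other spiders pass through untouched
        self.handled.append(dict(item))
        return item
```

Every shared pipeline would need the same guard, which is why this gets repetitive as the number of spiders grows.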

    # class attribute
    custom_settings = {
        'ITEM_PIPELINES': {
            'getProxy.pipelines.GetproxyPipeline': 300,
        }
    }

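Database storage

A pipeline can implement any form of storage, including files and databases.

It can write to all kinds of databases, such as SQLite, MySQL, MongoDB and Redis.

The examples above cover Redis and MongoDB storage. The others are much the same, and I will add them when I get the chance.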
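To illustrate how similar the other back ends are, here is a minimal SQLite sketch built on the standard-library sqlite3 module. It is an assumption-laden example rather than part of the original project: the class name, the SQLITE_PATH setting, the default file name and the table layout are all hypothetical.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline storing weather items in a SQLite table."""

    def __init__(self, db_path='tianqi.db'):
        self.db_path = db_path

    @classmethod
    def from_crawler(cls, crawler):
        # SQLITE_PATH is an assumed setting name, not a Scrapy built-in.
        return cls(db_path=crawler.settings.get('SQLITE_PATH', 'tianqi.db'))

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS weather (city TEXT, date TEXT, hour TEXT)'
        )

    def process_item(self, item, spider):
        # Parameterized query: values are bound, never formatted into the SQL.
        self.conn.execute(
            'INSERT INTO weather (city, date, hour) VALUES (?, ?, ?)',
            (item.get('city'), item.get('date'), item.get('hour')),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

The shape matches the Redis pipeline: open the connection in open_spider, write in process_item, return the item, and release the connection in close_spider.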

Reposted from: https://www.cnblogs.com/yanshw/p/10919052.html
