15 - Web scraping: backing up data to databases with Scrapy pipelines (02)

Published: 2024/9/15 | Database | 豆豆


Pipeline-based data backup

Store the scraped data in several different storage targets.
Keep one copy in a local file and one copy each in MySQL and Redis.
Each pipeline class handles one form of persistent storage, so writing the data to several targets requires several pipeline classes.
Create a Scrapy project: scrapy startproject proName
Enter the project directory and create the spider source file: scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
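The one-class-per-storage idea can be tried out without Scrapy installed. In the toy runner below, `run_pipelines`, `FilePipeline` and `MemoryPipeline` are hypothetical names made up for illustration. The sketch only imitates how the open_spider, process_item and close_spider hooks are called and how each pipeline hands the item to the next one; it is not Scrapy's actual implementation.

```python
import io


class FilePipeline:
    """Writes each item as a 'title:content' line to a text stream."""

    def __init__(self, stream):
        self.stream = stream

    def open_spider(self, spider):
        self.stream.write("# start\n")  # runs once, before the first item

    def process_item(self, item, spider):
        self.stream.write(f"{item['title']}:{item['content']}\n")
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.stream.write("# end\n")  # runs once, after the last item


class MemoryPipeline:
    """Stands in for a database pipeline by collecting rows in a list."""

    def __init__(self):
        self.rows = []

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.rows.append((item["title"], item["content"]))
        return item

    def close_spider(self, spider):
        pass


def run_pipelines(pipelines, items, spider=None):
    """Toy runner: open once, pass every item through every pipeline, close once."""
    for pipeline in pipelines:
        pipeline.open_spider(spider)
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    for pipeline in pipelines:
        pipeline.close_spider(spider)


stream = io.StringIO()
memory = MemoryPipeline()
run_pipelines(
    [FilePipeline(stream), memory],
    [{"title": "t1", "content": "c1"}, {"title": "t2", "content": "c2"}],
)
print(stream.getvalue())
print(memory.rows)
```

If a process_item method in this sketch forgot to return the item, the next pipeline would receive None, which is why every pipeline ends with return item.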

Create the MySQL database

Create the database table
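The table definition itself is not shown in the text. The pipeline code in this post inserts two values per row into a table named duanziwang, so a two-column table (title, content) is assumed here. As a rough sketch, the example below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; the column types and the `save` helper are assumptions. With pymysql the placeholder would be %s instead of ?.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL database "gemoumou"
conn = sqlite3.connect(":memory:")

# Assumed layout: one text column per value the pipeline inserts
conn.execute("CREATE TABLE duanziwang (title TEXT, content TEXT)")


def save(conn, title, content):
    """Insert one row with a parameterized statement; roll back on failure."""
    try:
        # Values are passed separately from the SQL text, so quotes in the
        # scraped strings cannot break the statement
        conn.execute("INSERT INTO duanziwang VALUES (?, ?)", (title, content))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


save(conn, 'He said "hi"', "it's fine")
rows = conn.execute("SELECT title, content FROM duanziwang").fetchall()
print(rows)
```

Passing the values as parameters rather than formatting them into the SQL string matters for scraped text, which often contains quotation marks.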

Modify the pipelines.py file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql  # MySQL client library
from redis import Redis


class GemoumouPipeline:
    fp = None

    # Override the two hook methods
    def open_spider(self, spider):
        print("open_spider(): runs only once, when the spider starts")
        self.fp = open("基于管道存储.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        print("close_spider(): runs only once, when the spider finishes")
        self.fp.close()  # close the file

    # Receives item objects one at a time, so it is called once per item.
    # The item parameter is the item that was received.
    def process_item(self, item, spider):
        # print(item)  # item behaves like a dict
        # Write the item to the text file
        self.fp.write(item["title"] + ":" + item["content"] + "\n")
        return item  # pass the item on to the next pipeline class to be executed


# Store the data in MySQL
class MysqlPipeline(object):
    conn = None    # connection object
    cursor = None  # cursor object

    def open_spider(self, spider):
        # Create the connection object
        self.conn = pymysql.Connect(host="127.0.0.1", port=3306, user="root",
                                    password="root", db="gemoumou", charset="utf8")
        print(self.conn)  # print the connection object

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()  # create a cursor
        # Parameterized insert: the driver escapes the values, so quotes in
        # the scraped text cannot break the statement
        sql = "insert into duanziwang VALUES (%s, %s)"
        # Transaction handling
        try:
            self.cursor.execute(sql, (item["title"], item["content"]))  # run the SQL statement
            self.conn.commit()  # commit if nothing went wrong
        except Exception as e:
            print(e)  # print the exception if something went wrong
            self.conn.rollback()
        return item

    def close_spider(self, spider):  # close the cursor and the connection
        if self.cursor is not None:  # no cursor exists if no item was processed
            self.cursor.close()
        self.conn.close()


# # Write the data to Redis
# class RedisPipeline(object):
#     conn = None
#     def open_spider(self, spider):
#         self.conn = Redis(host="127.0.0.1", port=6379)
#         print(self.conn)
#     def process_item(self, item, spider):
#         self.conn.lpush("duanziwang", item)
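The Redis pipeline in the listing above is commented out. One likely obstacle is that an item is a dict-like object, while redis-py's lpush, as far as we know, accepts only bytes, strings and numbers, so the item has to be serialized first. Below is a minimal sketch under that assumption. `JsonRedisPipeline` and `FakeRedis` are hypothetical names, and the fake client only mimics lpush so that the example runs without a Redis server. In a real project the client would be created with Redis(host="127.0.0.1", port=6379).

```python
import json


class FakeRedis:
    """Hypothetical stand-in for redis.Redis that only implements lpush."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Mimic the client's refusal of values such as dicts
        if not isinstance(value, (bytes, str, int, float)):
            raise TypeError("value must be bytes, str, int or float")
        self.lists.setdefault(key, []).insert(0, value)  # push to the head
        return len(self.lists[key])


class JsonRedisPipeline:
    """Serializes each item to JSON before pushing it onto a Redis list."""

    def __init__(self, client):
        self.conn = client

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the stored string
        payload = json.dumps(dict(item), ensure_ascii=False)
        self.conn.lpush("duanziwang", payload)
        return item  # keep the chain going for any later pipeline


client = FakeRedis()
pipeline = JsonRedisPipeline(client)
pipeline.process_item({"title": "t1", "content": "c1"}, spider=None)
stored = json.loads(client.lists["duanziwang"][0])
print(stored)
```

Reading the data back is then a matter of calling json.loads on each list element.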

Modify the settings.py configuration file

Define the pipeline priorities

# Scrapy settings for gemoumou project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gemoumou'

SPIDER_MODULES = ['gemoumou.spiders']
NEWSPIDER_MODULE = 'gemoumou.spiders'
LOG_LEVEL = 'ERROR'  # log level to output (error messages only)

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Spoof the User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Obey robots.txt rules
# Set to False so that robots.txt is not obeyed
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'gemoumou.middlewares.GemoumouSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'gemoumou.middlewares.GemoumouDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 300 is the pipeline class's priority; the smaller the number, the higher the priority
    # Higher-priority pipelines run first
    'gemoumou.pipelines.GemoumouPipeline': 300,
    'gemoumou.pipelines.MysqlPipeline': 301,
    # 'gemoumou.pipelines.RedisPipeline': 302,
}  # enable the pipelines

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
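The priority rule in ITEM_PIPELINES (smaller number runs first) can be illustrated with plain Python. The helper `execution_order` below is a hypothetical name used only for this sketch; it shows the ordering that the numbers imply, not how Scrapy builds its pipeline chain internally. The Redis entry is included here as if it had been enabled.

```python
# Same entries as in settings.py, deliberately written out of order
ITEM_PIPELINES = {
    "gemoumou.pipelines.RedisPipeline": 302,
    "gemoumou.pipelines.GemoumouPipeline": 300,
    "gemoumou.pipelines.MysqlPipeline": 301,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by ascending priority number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

The position of an entry in the dict does not matter; only the number does, so the text-file pipeline (300) sees each item before the MySQL pipeline (301).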

Spider source file

import scrapy
from gemoumou.items import GemoumouItem


class GpcSpider(scrapy.Spider):
    # Name of the spider; the unique identifier of this source file
    name = 'gpc'
    # allowed_domains lists the permitted domains and restricts which URLs in
    # start_urls may be requested
    # allowed_domains = ['www.xxx.com']  # usually commented out
    # start_urls is the list of start URLs; it may only contain URLs
    # Scrapy sends a GET request to every URL in the list
    start_urls = ['https://duanziwang.com/category/经典段子/']

    # Data parsing
    # parse is called once per request made from start_urls
    # response: the response object returned by the server
    # Pipeline-based persistent storage
    def parse(self, response):
        # Parse the title and the content
        article_list = response.xpath('//*[@id="35087"]')
        for article in article_list:
            # xpath() returns a list of Selector objects rather than strings, so it
            # behaves differently from xpath in etree; the string we want is stored
            # in each Selector's data attribute
            # extract() takes out the data value
            # extract_first() takes out the data value of the first Selector in the list
            # The leading "." keeps each query relative to the current article
            title = article.xpath(".//div[@class='post-head']/h1[@class='post-title']/a/text()").extract_first()
            content = article.xpath(".//div[@class='post-content']/p/text()").extract_first()
            # Instantiate an item object and store the parsed data in it
            item = GemoumouItem()
            # Fields must be set with item['name'], not with item.name
            item['title'] = title
            item['content'] = content
            # Submit the item to the pipelines
            yield item
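When looping over a list of matched nodes, each inner query should be relative to the current node (a path starting with ".//"). In Scrapy selectors, a path that starts with "//" searches the whole document again, even when it is called on a sub-selector, so every loop iteration would return the same first match. The sketch below shows the relative form with the standard-library xml.etree.ElementTree instead of Scrapy so that it runs anywhere. The sample markup is invented for illustration and only loosely follows the class names used in the spider.

```python
import xml.etree.ElementTree as ET

# Invented sample page with two articles (well-formed so ElementTree can parse it)
PAGE = """
<html><body>
  <article id="1">
    <div class="post-head"><h1 class="post-title"><a>first title</a></h1></div>
    <div class="post-content"><p>first body</p></div>
  </article>
  <article id="2">
    <div class="post-head"><h1 class="post-title"><a>second title</a></h1></div>
    <div class="post-content"><p>second body</p></div>
  </article>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
items = []
for article in root.findall(".//article"):
    # ".//" restricts the search to descendants of the current article
    title = article.findtext(".//div[@class='post-head']/h1[@class='post-title']/a")
    content = article.findtext(".//div[@class='post-content']/p")
    items.append({"title": title, "content": content})

print(items)
```

Each article yields its own title and body because both lookups start from the article node rather than from the document root.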

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GemoumouItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Define one field for each value extracted during parsing
    title = scrapy.Field()
    content = scrapy.Field()

Run: scrapy crawl gpc

Local txt file

Database
