16 - Web crawling with Scrapy: crawling a whole site by sending requests manually (03)
Crawling a whole site with Scrapy by sending requests manually
- yield scrapy.Request(url, callback) issues a GET request
- callback specifies the parse function used to parse the response data
- yield scrapy.FormRequest(url, callback, formdata) issues a POST request
- formdata: a dict holding the request parameters
- Why are the URLs in the start_urls list automatically sent as GET requests?
- Because each URL in the list is actually turned into a GET request by the parent-class method start_requests
- How can the URLs in start_urls be sent as POST requests by default? Override start_requests in the spider, as sketched below
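A minimal sketch of overriding start_requests so that the URLs in start_urls are sent as POST requests by default (the spider name, URL, and formdata values below are placeholders, not taken from the original post):

```python
import scrapy


class PostDemoSpider(scrapy.Spider):
    name = 'post_demo'                         # hypothetical spider name
    start_urls = ['https://www.xxx.com/post']  # placeholder POST endpoint

    # Overriding start_requests replaces the parent-class behaviour that
    # would otherwise send a plain GET request for every URL in start_urls.
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request; formdata is a dict of request parameters
            yield scrapy.FormRequest(url=url, callback=self.parse,
                                     formdata={'kw': 'value'})  # placeholder params

    def parse(self, response):
        print(response.text)
```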
開(kāi)始
創(chuàng)建一個(gè)爬蟲(chóng)工程:scrapy startproject proName
進(jìn)入工程目錄創(chuàng)建爬蟲(chóng)源文件:scrapy genspider spiderName www.xxx.com
執(zhí)行工程:scrapy crawl spiderName
Configure the pipelines.py file
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class GpcPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
```
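The pipeline above only prints each item to verify the data flow. As a hypothetical extension (not part of the original post), the items could instead be written to a local text file by adding open_spider/close_spider hooks:

```python
class GpcFilePipeline:
    """Hypothetical pipeline that writes every item to duanzi.txt."""

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one "title: content" line per item, then pass the item on
        self.fp.write(f"{item['title']}: {item['content']}\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.fp.close()
```

If used, this class would also have to be registered in ITEM_PIPELINES in settings.py, instead of or alongside GpcPipeline.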
Configure the items.py file

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GpcItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```

Configure the settings.py file
```python
# Scrapy settings for gpc project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gpc'

SPIDER_MODULES = ['gpc.spiders']
NEWSPIDER_MODULE = 'gpc.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set a UA to disguise the crawler
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Only output error-level log messages
LOG_LEVEL = 'ERROR'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'gpc.pipelines.GpcPipeline': 300,
}

# The remaining settings generated by the project template (concurrency,
# download delay, cookies, middlewares, AutoThrottle, HTTP caching, ...)
# are left commented out at their defaults.
```

Spider source file: una.py
```python
import scrapy
from gpc.items import GpcItem


class UnaSpider(scrapy.Spider):
    name = 'una'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://duanziwang.com/category/經典段子/1/']
    # generic URL template for the other pages
    url = 'https://duanziwang.com/category/經典段子/%d/'
    page_num = 2

    # crawl the data of every page of the joke site
    def parse(self, response):
        # parse out the title and the content
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for article in article_list:
            # The parsed result is not plain string data, so the usage differs from xpath in lxml's etree:
            # the list returned by xpath() holds Selector objects, and the string we want
            # is stored in each Selector's data attribute.
            # extract() pulls the data value out of every Selector in the list;
            # extract_first() pulls the data value out of the first Selector only.
            title = article.xpath("./div[1]/h1/a/text()").extract_first()
            content = article.xpath("./div[2]/p/text()").extract_first()
            # instantiate an item object and store the parsed data in it
            item = GpcItem()
            # fields cannot be accessed with item.xxx attribute syntax
            item['title'] = title
            item['content'] = content
            yield item

        if self.page_num < 5:  # condition that ends the recursion
            new_url = format(self.url % self.page_num)  # full URL of the next page
            self.page_num += 1
            # manually send a GET request for the next page's URL;
            # the callback hands the response of this recursive request back to parse
            yield scrapy.Request(url=new_url, callback=self.parse)
```

Result display

(screenshot of the printed items from the original post omitted)
Summary

By defining a generic URL template and recursively yielding scrapy.Request(url, callback=self.parse) inside parse, every page of the site is crawled through manually issued GET requests; scrapy.FormRequest is the manual counterpart for POST requests.