Scraping Data with the Scrapy Framework

Published: 2024/1/1

I am very curious about web crawlers, so I dug up some more material and carried on learning.

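Contents

1. Introduction to the Scrapy framework
2. Scraping data from the web
- 2.1 Scraping a single page
- 2.2 Scraping multiple pages
3. A few handy tricks
4. Summary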

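1. Introduction to the Scrapy framework

"If you own a sports car, would you still walk?" That line comes from one of Li Gang's books. Here Scrapy is the sports car, and Python's built-in urllib and re modules are the walking. Both will get you to the destination, but most of us pick the car because it is faster and more convenient. Put simply, Scrapy is a professional, efficient crawling framework (and it is far from the only one). It uses the Twisted package for network communication and the lxml and cssselect packages to extract the useful information from HTML pages. It also manages concurrent requests for you; because it is built on Twisted, it does this with asynchronous, event-driven I/O rather than one thread per request. In short, it makes building a crawler very convenient.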

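2. Scraping data from the web

2.1 Scraping a single page

1) Create a Scrapy project. The command below generates a project folder that contains, among other files, items.py, pipelines.py, settings.py and a spiders directory.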

scrapy startproject SpiderData

2) Before actually scraping anything, we can use the shell debugging tool to check whether the target URL has any mechanism that blocks crawlers.

scrapy shell https://www.zhipin.com/c100010000-p120106/

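The response status code is 403, which suggests the site has some anti-crawler mechanism switched on. To be able to scrape it we need to disguise ourselves, that is, set the User-Agent of our requests. I use Firefox, so I take a Firefox-style value as the example: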

scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/c100010000-p120106/

Now the status code is 200, so the request is accepted and we can go on to scrape the data.
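
To make the User-Agent idea concrete outside of Scrapy, here is a minimal sketch that uses only the standard library. It builds a request object with a browser-like header but does not send anything over the network. The helper name build_request is hypothetical, and the User-Agent string is simply the Firefox value used later in settings.py.

```python
from urllib.request import Request

# The URL and User-Agent are the ones used elsewhere in this article.
URL = "https://movie.douban.com/top250?start="
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
    "Gecko/20100101 Firefox/77.0"
)


def build_request(url: str, user_agent: str) -> Request:
    """Return a request object that carries a browser-like User-Agent header.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, BROWSER_UA)
# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Passing -s USER_AGENT=... to scrapy shell has a similar effect: the header sent with each request then looks like it came from a browser rather than from a crawler.
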
3) Use XPath expressions or CSS selectors to extract the information we want.

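A side note: XPath is a language designed for selecting elements, attributes and text in XML documents, and it works equally well on HTML. It is a much better fit for HTML than regular expressions. It is worth studying systematically, either online or from a book.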
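
As a rough illustration of how XPath addresses elements, attributes and text, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (Scrapy's selectors support far more). The markup is a made-up, simplified snippet, not the real structure of any site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup loosely modelled on a movie list page.
SAMPLE = """
<ol>
  <li><div class="item">
    <div class="pic"><a href="https://example.com/movie/1">poster</a></div>
    <div class="info">
      <span class="title">Movie A</span>
      <span class="rating_num">9.7</span>
    </div>
  </div></li>
  <li><div class="item">
    <div class="pic"><a href="https://example.com/movie/2">poster</a></div>
    <div class="info">
      <span class="title">Movie B</span>
      <span class="rating_num">9.6</span>
    </div>
  </div></li>
</ol>
""".strip()

root = ET.fromstring(SAMPLE)
movies = []
# ".//div[@class='item']" selects every <div class="item"> below the root.
for item in root.findall(".//div[@class='item']"):
    movies.append({
        # text of the first matching <span class="title"> inside this item
        "movieName": item.findtext(".//span[@class='title']"),
        # value of the href attribute of the link inside <div class="pic">
        "movieLink": item.find("./div[@class='pic']/a").get("href"),
        "score": item.findtext(".//span[@class='rating_num']"),
    })

for movie in movies:
    print(movie)
```

The pattern is the same one the spider uses later: select each item container first, then run relative expressions (starting with "./" or ".//") inside it.
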

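However, I still ran into a problem: when I used the view(response) command to look at the page, the page that opened was not the job listing I expected.

The response kept being that same page, so XPath queries in scrapy shell always returned empty results. I could not work out why, so I switched to a different URL, and this time there was no problem.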

scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://movie.douban.com/top250?start='

4) Once the debugging is done, we can write the crawler inside the Scrapy project.
items.py:

import scrapy


class SpiderdataItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie title
    movieName = scrapy.Field()
    # link to the movie page
    movieLink = scrapy.Field()
    # Douban rating
    score = scrapy.Field()
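
The point of items.py is to declare up front which fields a scraped record may have. As a hedged sketch of the same idea without Scrapy, here is a plain dataclass with the same three fields. The class name MovieItem is hypothetical. As far as I know, Scrapy 2.2 and later also accept dataclass instances as items, but check that against the documentation of your version before relying on it.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MovieItem:
    """Hypothetical dataclass counterpart of SpiderdataItem."""
    movieName: Optional[str] = None   # movie title
    movieLink: Optional[str] = None   # link to the movie page
    score: Optional[str] = None       # rating, kept as scraped text


item = MovieItem(
    movieName="Movie A",
    movieLink="https://example.com/movie/1",
    score="9.7",
)
print(asdict(item))
```
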

Create a doubanMovie.py file in the spiders folder by running the following from inside the project directory:

scrapy genspider doubanMovie "movie.douban.com"

doubanMovie.py:

# -*- coding: utf-8 -*-
import scrapy
from SpiderData.items import SpiderdataItem


class DoubanmovieSpider(scrapy.Spider):
    name = 'doubanMovie'  # name of the spider
    allowed_domains = ['movie.douban.com']  # domains the spider may crawl
    start_urls = ['https://movie.douban.com/top250?start=']  # list of start pages

    # this method is responsible for parsing the response
    def parse(self, response):
        for movieInfo in response.xpath('//div[@class="item"]'):
            item = SpiderdataItem()
            # extract the movie title
            item['movieName'] = movieInfo.xpath('./div[@class="info"]/div/a/span[@class="title"]/text()').extract_first()
            item['movieLink'] = movieInfo.xpath('./div[@class="pic"]/a/@href').extract_first()
            # extract_first() returns None instead of raising when nothing matches
            item['score'] = movieInfo.xpath('./div/div[@class="bd"]/div/span[@class="rating_num"]/text()').extract_first()
            yield item  # yield turns parse() into a generator

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class SpiderdataPipeline(object):
    def process_item(self, item, spider):
        # simply print the extracted data
        print("Movie title:", item['movieName'], end=' ')
        print("Movie link:", item['movieLink'], end=' ')
        print("Douban rating:", item['score'])
        return item  # pass the item on to any later pipeline
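
Printing is fine for a first test, but a pipeline is also the natural place to save the data. The following is a minimal sketch, not part of the original project: a hypothetical JsonLinesPipeline that writes one JSON object per line. The class name and the default file name movies.jl are assumptions. The open_spider, close_spider and process_item hooks are the ones Scrapy calls on a pipeline, with the same signatures as in the article's pipeline. The last few lines drive the class by hand, outside Scrapy, just to show the call order.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="movies.jl"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline


# Drive the pipeline by hand (outside Scrapy) to show the call order.
path = os.path.join(tempfile.mkdtemp(), "movies.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"movieName": "Movie A", "score": "9.7"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Inside a real project the class would live in pipelines.py and be registered in ITEM_PIPELINES in the same way as SpiderdataPipeline.
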

Modify settings.py accordingly, as shown below. The two entries that differ from the generated template are DEFAULT_REQUEST_HEADERS (which adds the User-Agent) and ITEM_PIPELINES (which enables the pipeline).
settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for SpiderData project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'SpiderData'

SPIDER_MODULES = ['SpiderData.spiders']
NEWSPIDER_MODULE = 'SpiderData.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'SpiderData (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # add a user agent to the headers so that requests look like they come from a browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'SpiderData.middlewares.SpiderdataSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'SpiderData.middlewares.SpiderdataDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'SpiderData.pipelines.SpiderdataPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5) Run the crawler.

scrapy crawl doubanMovie

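2.2 Scraping multiple pages

Scraping multiple pages works the same way as scraping a single page: we take the URL of the next page (here built from a page counter) and put it through the process from 2.1 again, as shown below.
doubanMovie.py:

# -*- coding: utf-8 -*-
import scrapy
from SpiderData.items import SpiderdataItem


class DoubanmovieSpider(scrapy.Spider):
    name = 'doubanMovie'  # name of the spider
    allowed_domains = ['movie.douban.com']  # domains the spider may crawl
    start_urls = ['https://movie.douban.com/top250?start=']  # list of start pages
    count = 0   # number of extra pages requested so far
    length = 0  # number of items scraped so far

    # this method is responsible for parsing the response
    def parse(self, response):
        for movieInfo in response.xpath('//div[@class="item"]'):
            item = SpiderdataItem()
            item['movieName'] = movieInfo.xpath('./div[@class="info"]/div/a/span[@class="title"]/text()').extract_first()
            item['movieLink'] = movieInfo.xpath('./div[@class="pic"]/a/@href').extract_first()
            item['score'] = movieInfo.xpath('./div/div[@class="bd"]/div/span[@class="rating_num"]/text()').extract_first()
            self.length += 1
            yield item
        # after the items of this page, request the next page (offsets 25, 50, ..., 225);
        # 250 movies at 25 per page leaves 9 pages after the first one
        if self.count < 9:
            self.count += 1
            # yield a new request and name parse() as its callback, so the pages are collected in a loop
            yield scrapy.Request("https://movie.douban.com/top250?start=" + str(self.count * 25), callback=self.parse)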
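
The page counter above boils down to simple arithmetic on the start offset. A small sketch of that arithmetic, using only plain Python: the helper name page_urls is hypothetical, and the total of 250 entries is an assumption taken from the list's name.

```python
BASE_URL = "https://movie.douban.com/top250?start="
PAGE_SIZE = 25   # movies per page, as implied by count * 25 in the spider
TOTAL = 250      # assumed size of the list ("Top 250")


def page_urls(base=BASE_URL, page_size=PAGE_SIZE, total=TOTAL):
    """Yield one URL per page, with offsets 0, 25, ..., total - page_size."""
    for offset in range(0, total, page_size):
        yield f"{base}{offset}"


urls = list(page_urls())
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

One possible design choice, if the page URLs are known in advance like this, is to assign the whole list to start_urls and leave parse() without the counter. The counter-based version is closer to the general case where the next URL is only discovered while parsing.
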

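3. A few handy tricks

1) For a beginner like me, Scrapy is still fairly hard to use, so I sometimes need a debugger to track down mistakes, and this trick makes that possible. In the project created by Scrapy, create a main.py and paste in the code below; the crawler can then be run and debugged from an IDE such as PyCharm.

#-*- coding=utf-8 -*-
#@Time : 2020/7/15 17:57
#@Author : wangjy
#@File : main.py
#@Software : PyCharm
from scrapy.cmdline import execute
import os
import sys

if __name__ == '__main__':
    # make the project directory importable
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    # I dislike cluttered output, so all the log output produced while debugging goes to all.log
    execute(['scrapy', 'crawl', 'doubanMovie', '-s', 'LOG_FILE=all.log'])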

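2) XPath expressions come up all the time, so it helps to test them in the browser's console first (for example with the $x("...") helper available in the Firefox and Chrome consoles) and only then move them into the code.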

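4. Summary

After these few days of study I have a basic understanding of the Scrapy crawling framework. In the process above there are really only three places where code has to be written: items.py, which defines your own data structure; doubanMovie.py, which implements the spider; and pipelines.py, which receives and processes the item objects. For now, though, I know what to do without really knowing why it works (generators are one example), so I still need to dig deeper later.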
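
Since generators are named above as the part that is still unclear, here is a toy sketch of what yield does. The function is only a stand-in for a spider's parse() method and has nothing to do with Scrapy itself. The point is that calling it runs no code; the body runs step by step as the caller asks for values.

```python
def parse(rows):
    """Toy stand-in for a spider's parse(): yields one item per row."""
    for row in rows:
        print("parsing", row)
        yield {"movieName": row}


results = parse(["Movie A", "Movie B"])  # nothing has run yet
first = next(results)   # runs the body up to the first yield
rest = list(results)    # runs the remainder of the loop
print(first, rest)
```

Scrapy consumes the generator returned by parse() in a similar step-by-step fashion, which is why a single method can hand back both items and further requests.
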

References: 《瘋狂Python講義》, 《精通Python爬蟲框架Scrapy》
