CrawlSpider Explained

From: https://blog.csdn.net/weixin_37947156/article/details/75604163
CrawlSpider is the spider commonly used for crawling sites whose pages follow regular patterns. It builds on Spider and adds a few attributes of its own:
- rules: a collection of Rule objects used to match links on the target site and filter out the noise.
- parse_start_url: handles the responses to the start URLs; it must return an Item or a Request (a minimal sketch of overriding it follows this list).
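Here is a minimal sketch, assuming a hypothetical spider name and XPath, of overriding parse_start_url so the start-URL responses themselves yield data:

```python
import scrapy
from scrapy.spiders import CrawlSpider

class StartUrlSpider(CrawlSpider):
    name = 'starturl.example'  # hypothetical name, not from the article
    start_urls = ['http://www.example.com']

    def parse_start_url(self, response):
        # Whatever is returned (or yielded) here is treated like a normal
        # callback result: items and/or Requests.
        yield {'title': response.xpath('//title/text()').extract_first()}
```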
Because rules is a collection of Rule objects, Rule itself deserves an introduction. It takes several parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity.
The link_extractor can be one you define yourself, or an instance of the built-in LinkExtractor class, whose main parameters are:
- allow: URLs matching the given regular expression(s) are extracted; if empty, all links match.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
- allow_domains: domains from which links will be extracted.
- deny_domains: domains from which links will never be extracted.
- restrict_xpaths: XPath expressions that work together with allow to restrict which regions of the page links are extracted from; there is also a similar restrict_css (a standalone sketch follows this list).
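To see these parameters in isolation, here is a small sketch that runs a LinkExtractor against a hand-built response; the HTML body and URLs are invented for illustration:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'''
<div id="content">
  <a href="/category.php?id=1">Category</a>
  <a href="/subsection.php?id=2">Subsection</a>
</div>
<div id="footer"><a href="/category.php?id=9">Hidden</a></div>
'''
response = HtmlResponse(url='http://www.example.com', body=html, encoding='utf-8')

le = LinkExtractor(
    allow=(r'category\.php',),               # keep URLs matching this regex
    deny=(r'subsection\.php',),              # drop URLs matching this regex
    restrict_xpaths='//div[@id="content"]',  # only extract inside this region
)
for link in le.extract_links(response):
    print(link.url, link.text)
# -> http://www.example.com/category.php?id=1 Category
```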
Below is the example from the official documentation. Starting from the source code, I will walk through some common questions:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
```
Question: How does CrawlSpider work?
Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and each response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider already defines parse to dispatch the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
_parse_response then behaves differently depending on whether a callback is present, on follow, and on self._follow_links.
In turn, _requests_to_follow asks the link_extractor (the LinkExtractor we passed in) for the links it finds in the page (link_extractor.extract_links(response)), post-processes the URLs with process_links (user-defined), and issues a Request for each matching link; each Request is run through process_request (also user-defined) before it is yielded. A sketch of both hooks follows.
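Here is a minimal sketch, with an invented spider name and hook names, of how process_links and process_request plug into a Rule; in the Scrapy version shown here, process_request receives only the Request:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HookSpider(CrawlSpider):
    name = 'hooks.example'  # hypothetical
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php',)),
             callback='parse_item',
             process_links='drop_ads',        # receives the extracted link list
             process_request='tag_request'),  # receives each built Request
    )

    def drop_ads(self, links):
        # Filter the Link objects before Requests are built; illustrative only.
        return [l for l in links if 'ad' not in l.url]

    def tag_request(self, request):
        # Modify (or replace) the Request before it is yielded.
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        self.logger.info('item page: %s', response.url)
```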
Question: How does CrawlSpider get its rules?
In its __init__ method, the CrawlSpider class calls _compile_rules, which shallow-copies each Rule in rules and resolves its callback, link processor (process_links), and request processor (process_request); string names are turned into bound methods:
```python
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
```
So how is Rule defined?
```python
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
```
So the LinkExtractor we pass in is stored as link_extractor.
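The follow default in __init__ above is worth confirming; a quick sketch:

```python
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

r1 = Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item')
r2 = Rule(LinkExtractor(allow=(r'category\.php',)))
print(r1.follow)  # False -- a callback was given, so follow defaults to False
print(r2.follow)  # True  -- no callback, so follow defaults to True
```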
Question: Requests with a callback are handled by the specified function; which function handles those without one?
From the walkthrough above, _parse_response handles the responses that do have a callback:

```python
cb_res = callback(response, **cb_kwargs) or ()
```
Meanwhile, _requests_to_follow sets self._response_downloaded as the callback when it issues Requests for the URLs matched in the page:

```python
r = Request(url=link.url, callback=self._response_downloaded)
```

_response_downloaded then looks up the Rule recorded in response.meta['rule'] and feeds the response back into _parse_response with that rule's callback, so responses from rules without a callback simply have their links followed.
How to simulate a login in CrawlSpider
Because CrawlSpider, like Spider, uses start_requests to issue its initial requests, the following code (borrowed from Andrew_liu) shows how to simulate a login:
```python
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector

# Replace the original start_requests; its callback is post_login
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the returned page; it is required for the
    # form submission to succeed
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is a helper provided by Scrapy for POSTing forms.
    # After a successful login the after_login callback is invoked.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '1527927373@qq.com',
                                          'password': '321324jia',
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# make_requests_from_url calls parse, which hooks back into CrawlSpider's parse
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
```
That covers the theory.
The source code of scrapy.spiders.CrawlSpider
""" This modules implements the CrawlSpider which is the recommended spider to use for scraping typical web sites that requires crawling pages. See documentation in docs/topics/spiders.rst """import copy import sixfrom scrapy.http import Request, HtmlResponse from scrapy.utils.spider import iterate_spider_output from scrapy.spiders import Spiderdef identity(x):return xclass Rule(object):def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):self.link_extractor = link_extractorself.callback = callbackself.cb_kwargs = cb_kwargs or {}self.process_links = process_linksself.process_request = process_requestif follow is None:self.follow = False if callback else Trueelse:self.follow = followclass CrawlSpider(Spider):rules = ()def __init__(self, *a, **kw):super(CrawlSpider, self).__init__(*a, **kw)self._compile_rules()def parse(self, response):return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)def parse_start_url(self, response):return []def process_results(self, response, results):return resultsdef _requests_to_follow(self, response):if not isinstance(response, HtmlResponse):returnseen = set()for n, rule in enumerate(self._rules):links = [lnk for lnk in rule.link_extractor.extract_links(response)if lnk not in seen]if links and rule.process_links:links = rule.process_links(links)for link in links:seen.add(link)r = Request(url=link.url, callback=self._response_downloaded)r.meta.update(rule=n, link_text=link.text)yield rule.process_request(r)def _response_downloaded(self, response):rule = self._rules[response.meta['rule']]return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)def _parse_response(self, response, callback, cb_kwargs, follow=True):if callback:cb_res = callback(response, **cb_kwargs) or ()cb_res = self.process_results(response, cb_res)for requests_or_item in iterate_spider_output(cb_res):yield requests_or_itemif follow and self._follow_links:for request_or_item in self._requests_to_follow(response):yield request_or_itemdef _compile_rules(self):def get_method(method):if callable(method):return methodelif isinstance(method, six.string_types):return getattr(self, method, None)self._rules = [copy.copy(r) for r in self.rules]for rule in self._rules:rule.callback = get_method(rule.callback)rule.process_links = get_method(rule.process_links)rule.process_request = get_method(rule.process_request)@classmethoddef from_crawler(cls, crawler, *args, **kwargs):spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)spider._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)return spiderdef set_crawler(self, crawler):super(CrawlSpider, self).set_crawler(crawler)self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)?