Python Web Crawler (14): Using Scrapy to Build a Crawler Framework
Purpose and significance
A crawler framework can reduce the workload and improve efficiency. Scrapy is a convenient, easy-to-use, and easily extensible framework.
This article uses the Scrapy framework to crawl the article content of my own blog as a worked example.
Notes
Source for learning and reference: https://book.douban.com/subject/27061630/.
Creating a Scrapy project
First, make sure Scrapy is actually installed. On Windows, run pip install scrapy and wait for Scrapy and all of its dependencies to finish installing, then type scrapy in cmd to verify.
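A quick sanity check in cmd might look like this (scrapy version simply prints the installed version, so a clean run confirms the installation):

pip install scrapy
scrapy version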
Create the project with scrapy startproject myTestProject, which generates the project skeleton.
A brief overview
Among the generated files, scrapy.cfg is the deployment configuration; inside the inner myTestProject package, items.py defines the containers for scraped data, pipelines.py post-processes collected items, settings.py holds the project-wide configuration, and the spiders directory is where the spider modules live.
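For reference, the skeleton generated by scrapy startproject typically looks roughly like this (middlewares.py only appears in newer Scrapy versions):

myTestProject/
    scrapy.cfg            # deployment configuration
    myTestProject/
        __init__.py
        items.py          # Item definitions for scraped data
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item post-processing
        settings.py       # project-wide settings
        spiders/
            __init__.py   # user-defined spiders are added here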
Creating the spider module: downloading
User-defined spider modules are placed under the path ./myTestProject/spiders, each defining name, start_urls, and parse().
For example, create the file CnblogSpider.py in the spiders directory and fill in the following:
import scrapy

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ["https://www.cnblogs.com/bai2018/default.html?page=1"]

    def parse(self, response):
        pass

In cmd, switch to ./myTestProject/myTestProject and run scrapy crawl cnblogs (the value of name) as a test; check that no errors are reported and that the response code is 200. The response parameter of parse is what you use to parse and read the fetched data.
Enhancing the spider module: parsing
Add the parsing logic under the parse method of the CnblogsSpider class; the parsing itself is done with methods such as xpath, css, extract, and re.
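As a quick illustration of those APIs (not part of the project code; the selectors here are hypothetical), inside parse one can write:

links = response.xpath("//a/@href").extract()           # all matches, as a list of strings
first = response.css("a::attr(href)").extract_first()   # first match (or None), via a CSS selector
digits = response.xpath("//a/text()").re(r'\d+')        # regex applied to the selected text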
After inspecting the page elements in the browser's developer tools, the code becomes the following:
import scrapy

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ["https://www.cnblogs.com/bai2018/"]

    def parse(self, response):
        # each blog post on the page lives inside a node with class="day"
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()
            title = paper.xpath(".//*[@class='postTitle']/a/text()").extract()
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()
            print(url, title, time, content)

This finds the parts of the page whose class is day, extracts each field from within them, and finally prints the results for testing.
From the correct directory, run scrapy crawl cnblogs in cmd to test, and check whether the printed output in the log is what you expect.
Enhancing the spider module: wrapping the data
The point of wrapping the data is to store it; Scrapy provides the Item class for exactly this need.
The items.py file in the generated project is where the Item classes for stored data are defined.
Modify the MytestprojectItem class in items.py to the following:
import scrapy

class MytestprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()

Then modify CnblogsSpider.py to the following:
import scrapy
from myTestProject.items import MytestprojectItem

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ["https://www.cnblogs.com/bai2018/"]

    def parse(self, response):
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()
            title = paper.xpath(".//*[@class='postTitle']/a/text()").extract()
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()
            # wrap the extracted fields in an Item and hand it back to the framework
            item = MytestprojectItem(url=url, title=title, time=time, content=content)
            yield item

The extracted content is wrapped into an Item object and submitted with the yield keyword.
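Items behave much like dicts, which is what makes the storage step later on straightforward; a minimal illustration (not project code):

item = MytestprojectItem(title=["a title"])
item['url'] = ["https://www.cnblogs.com/bai2018/"]   # dict-style field assignment
print(dict(item))                                    # plain dict, as the pipeline will later use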
Enhancing the spider module: pagination
Sometimes you simply need to turn the page to fetch more data and parse it too.
Modify CnblogsSpider.py to the following:
import scrapy
from scrapy import Selector
from myTestProject.items import MytestprojectItem

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/bai2018/"]

    def parse(self, response):
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()
            title = paper.xpath(".//*[@class='postTitle']/a/text()").extract()
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()
            item = MytestprojectItem(url=url, title=title, time=time, content=content)
            yield item
        # look for the "next page" (下一页) link and schedule it with the same callback
        next_page = Selector(response).re(u'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)

As for Scrapy's selectors, xpath and css can be used directly on the response parameter of parse in CnblogsSpider, as response.xpath or response.css.
The more general form is Selector(response).xxx; for regular expressions that is Selector(response).re.
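As a small illustration of the standalone form (the text= variant is convenient for experimenting outside a spider; the HTML below is made up):

from scrapy import Selector

sel = Selector(text='<a href="https://example.com/page2">下一页</a>')
print(sel.re(r'<a href="(\S*)">下一页</a>'))   # ['https://example.com/page2']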
For an explanation of the yield keyword, see: https://blog.csdn.net/mieleizhi0522/article/details/82142856
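In short, yield turns a function into a generator that produces values one at a time, which is how the spider hands items and requests back to the framework lazily. A toy example:

def count_pages():
    # execution pauses at each yield and resumes on the next iteration
    for i in range(1, 4):
        yield i

for page in count_pages():
    print(page)   # prints 1, 2, 3, one value per loop iteration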
Enhancing the spider module: storage
When Items are collected in a Spider, they are passed on to the Item Pipeline.
Modify pipelines.py to the following:
import json
from scrapy.exceptions import DropItem

class MytestprojectPipeline(object):
    def __init__(self):
        self.file = open('papers.json', 'wb')

    def process_item(self, item, spider):
        if item['title']:
            # serialize each item as one JSON object per line
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line.encode())
            return item
        else:
            raise DropItem("Missing title in %s" % item)

This reimplements process_item, which receives each item together with the spider that collected it; papers.json is created, the item is converted to a dict, and stored as a line of JSON.
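One detail the pipeline above leaves open is that the file handle is never closed. Scrapy pipelines may define a close_spider hook for exactly that; a minimal addition to MytestprojectPipeline would be:

    def close_spider(self, spider):
        # called by Scrapy once the spider finishes; release the file handle
        self.file.close()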
Also, switch the pipeline on as the generated comments suggest: in settings.py, enable the ITEM_PIPELINES setting as follows:
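ITEM_PIPELINES = {
    'myTestProject.pipelines.MytestprojectPipeline': 300,
}

The key is the import path of the pipeline class defined above; the number (0-1000) sets the order in which enabled pipelines run.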
Then run scrapy crawl cnblogs in cmd.
Alternatively, running scrapy crawl cnblogs -o papers.csv stores the results as a CSV file instead.
The encoding then needs correcting: reopen the CSV file in Notepad, save it again with the corrected encoding, and view it.
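A less manual alternative (assuming a reasonably recent Scrapy version) is to fix the export encoding once in settings.py; utf-8-sig writes a BOM so Windows tools detect UTF-8:

# in settings.py: applies to feed exports such as the -o papers.csv output
FEED_EXPORT_ENCODING = 'utf-8-sig'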
Enhancing the spider module: downloading and saving images
Configure settings.py:
ITEM_PIPELINES = {
    'myTestProject.pipelines.MytestprojectPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './/cnblogs'           # directory where downloaded images are stored
IMAGES_URLS_FIELD = 'cimage_urls'     # item field holding the image URLs to download
IMAGES_RESULT_FIELD = 'cimages'       # item field where the download results are written
IMAGES_EXPIRES = 30                   # skip re-downloading images fetched within 30 days
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

Modify items.py to the following:
import scrapy

class MytestprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    cimage_urls = scrapy.Field()   # image URLs collected by the spider
    cimages = scrapy.Field()       # download results filled in by the ImagesPipeline

Modify CnblogsSpider.py to the following:
import scrapy
from scrapy import Selector
from myTestProject.items import MytestprojectItem

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/bai2018/"]

    def parse(self, response):
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()[0]
            title = paper.xpath(".//*[@class='postTitle']/a/text()").extract()
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()
            item = MytestprojectItem(url=url, title=title, time=time, content=content)
            # visit the post itself to collect its image URLs, carrying the item along
            request = scrapy.Request(url=url, callback=self.parse_body)
            request.meta['item'] = item
            yield request
        next_page = Selector(response).re(u'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)

    def parse_body(self, response):
        item = response.meta['item']
        body = response.xpath(".//*[@class='postBody']")
        item['cimage_urls'] = body.xpath('.//img//@src').extract()
        yield item

In short, those are the three files to modify. If images fail to download even though everything appears configured correctly, the cause is likely in settings.py, which then needs revising (note also that the ImagesPipeline requires the Pillow library to be installed).
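For reference, once the ImagesPipeline has run, the IMAGES_RESULT_FIELD ('cimages' here) holds one dict per downloaded image; the file name below is made up:

# illustration only: each entry has roughly the keys 'url', 'path' and 'checksum',
# where 'path' is relative to IMAGES_STORE
for image in item['cimages']:
    print(image['path'])   # e.g. full/0a79c461a4062ac383dc4fade7bc09f1.jpg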
Launching the crawler
Build a main entry point that passes in the initialization settings and imports the required class. For example:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myTestProject.spiders.CnblogSpider import CnblogsSpider

if __name__ == '__main__':
    # run the crawler in-process using the project's settings.py
    process = CrawlerProcess(get_project_settings())
    process.crawl('cnblogs')   # the spider's name; passing CnblogsSpider directly also works
    process.start()

Correction
The spider above wraps whole lists of URLs and titles into a single item per day block, but a day block can contain several posts; the corrected version below iterates over the lists index by index and emits one item per post:
import scrapy
from scrapy import Selector
from cnblogSpider.items import CnblogspiderItem

class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["https://www.cnblogs.com/bai2018/"]

    def parse(self, response):
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            # extract the fields as parallel lists: one entry per post in this day block
            urls = paper.xpath(".//*[@class='postTitle']/a/@href").extract()
            titles = paper.xpath(".//*[@class='postTitle']/a/text()").extract()
            times = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()
            contents = paper.xpath(".//*[@class='postCon']/div/text()").extract()
            for i in range(len(urls)):
                url = urls[i]
                title = titles[i]
                time = times[0]   # the day block carries a single date
                content = contents[i]
                item = CnblogspiderItem(url=url, title=title, time=time, content=content)
                request = scrapy.Request(url=url, callback=self.parse_body)
                request.meta['item'] = item
                yield request
        next_page = Selector(response).re(u'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)

    def parse_body(self, response):
        item = response.meta['item']
        body = response.xpath(".//*[@class='postBody']")
        item['cimage_urls'] = body.xpath('.//img//@src').extract()
        yield item
Reposted from: https://www.cnblogs.com/bai2018/p/11255185.html