
Scrapy basics: a static web page example

Published: 2024/1/1


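1. Basic Scrapy crawler workflow:

    1. Create a new Scrapy project

    > scrapy startproject <project name>    # creates a folder named after the project in the current directory, containing the following files:

        scrapy.cfg: the project configuration file

        items.py: defines the data structure to extract; the fields we want to scrape are declared here

        pipelines.py: pipeline definitions; a pipeline saves the items returned by the spider, for example to a file or a database

        settings.py: the crawler settings, including how many requests are sent, the delay between them, and so on

        spiders: the directory that holds the spiders, i.e. the crawler's main programs. Each spider needs at least three elements. name: the spider's identifier. parse(): the method called to parse each page downloaded from start_urls; it returns either the next pages to crawl or the scraped items. start_urls: a list of URLs where the spider starts crawling.

    2. Define the Item, i.e. the elements you want to extract from the page

    3. Implement a Spider class that crawls the URLs and extracts the Items

    4. Implement an Item Pipeline class that stores the Items

Note:

    > scrapy startproject huxiu2    # creates a folder named huxiu2 in the current directory; its settings, items, pipelines and other files make up the crawler framework

    Your own spider code goes in the projectname/projectname/spiders directory.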
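As a rough illustration of the kind of options settings.py holds (request counts and delays, as described in the file list above), here is a small fragment. The setting names are standard Scrapy settings, but the values are assumptions chosen for the example, not taken from the original project.

```python
# settings.py (fragment): illustrative values only, not from the original project
BOT_NAME = "huxiu2"

# Base delay in seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# How many times a failed request is retried, in addition to the first attempt.
RETRY_TIMES = 3

# Upper bound on the number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 8
```

Lower concurrency and a longer delay make the crawl slower but gentler on the target site.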

2. Configure the items file

Goal: scrape the title, link, and summary of each entry in the Huxiu news list.

import scrapy


class Huxiu2Item(scrapy.Item):
    # Define the fields for your item here, like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    posttime = scrapy.Field()

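Note: title, link, and the other fields are the entries stored for each record, i.e. the information to be scraped.

    The advantage of keeping the output code in a separate file from the crawling code is that you can later use Scrapy's other useful components and helper classes. The benefit may not be obvious yet; for now, just follow the framework.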

3. Write the first spider

Essentials: if you use items.py, import it first; the spider must subclass scrapy.Spider and define name, start_urls, and parse.

import scrapy
from huxiu2.items import Huxiu2Item  # the whole crawler project is named huxiu2


class Huxiu2Spider(scrapy.Spider):
    name = "huxi2"
    allowed_domains = ['huxiu.com']  # only this domain and its subdomains may be crawled
    start_urls = ['http://www.huxiu.com/index.php']

    def parse(self, response):
        for sel in response.xpath('//div[@class ="mod-info-flow"]/div/div[@class="mob-ctt index-article-list-yh"]'):
            item = Huxiu2Item()
            item['title'] = sel.xpath('h2/a/text()').extract_first()
            item['link'] = sel.xpath('h2/a/@href').extract_first()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class ="mob-sub"]/text()').extract_first()
            print(item['title'], item['link'], item['desc'])
# Result: nothing could be crawled, probably because headers need to be set.
# The corrected version follows.
import scrapy
from huxiu2.items import Huxiu2Item


class Huxiu2Spider(scrapy.Spider):
    name = "huxiu"
    # allowed_domains = ['huxiu.com']  # only this domain and its subdomains may be crawled
    # start_urls = ['http://www.huxiu.com']  # not needed once start_requests is overridden

    def start_requests(self):
        # start_urls exists to issue the first requests; by default start_requests
        # sends them without any extra arguments. To scrape this site the request
        # must carry extra parameters, so we define the method ourselves.
        # scrapy.Request accepts: url, callback=None, method='GET', headers=None,
        # body=None, cookies=None, meta=None, encoding='utf-8', priority=0,
        # dont_filter=False, errback=None, flags=None. Note that cookies and headers
        # are separate parameters; the remaining ones are left for later.
        url = 'http://www.huxiu.com/index.php'
        head = {"User-Agent": "info"}
        cookie = {'key': 'value'}  # look these up with F12 in the browser, copy them, and convert to a dict
        yield scrapy.Request(url=url, cookies=cookie, callback=self.parse, headers=head)

    def parse(self, response):
        print(">>>>> hehe")
        for i in response.xpath('//div[@class="mod-b mod-art clearfix "]'):
            item = Huxiu2Item()
            item['title'] = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/text()').extract_first()
            item['link'] = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/@href').extract_first()
            item['desc'] = i.xpath('.//div[@class="mob-sub"]/text()').extract_first()
            print(item['title'], item['link'], item['desc'])
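
The cookie copied from the browser's developer tools is usually a single string of the form name=value; name=value, while scrapy.Request expects a dict. A minimal sketch of that conversion, using only the standard library, might look like this. The helper name cookie_header_to_dict and the sample cookie string are made up for illustration.

```python
def cookie_header_to_dict(header):
    """Turn 'k1=v1; k2=v2' (as copied from dev tools) into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, so values that contain '=' stay intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Hypothetical cookie string; real names and values come from your own browser session.
raw = "session_id=abc123; theme=dark; token=a=b"
cookie = cookie_header_to_dict(raw)
print(cookie)
```

The resulting dict can then be passed as the cookies argument of scrapy.Request.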

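4. Follow the links into each article to get more information

    In items.py we defined posttime = scrapy.Field(), but the publication time is only available once the article itself has been opened, so we also have to process the second-level URLs.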

import scrapy
from huxiu2.items import Huxiu2Item  # the whole crawler project is named huxiu2


class Huxiu2Spider(scrapy.Spider):
    name = "huxi3"
    allowed_domains = ['huxiu.com']  # only this domain and its subdomains may be crawled
    start_urls = ['http://www.huxiu.com']  # equivalent to issuing the first request

    def parse(self, response):
        for sel in response.xpath('//div[@class ="mod-info-flow"]/div/div[@class="mob-ctt index-article-list-yh"]'):
            item = Huxiu2Item()  # instantiate an item to hold the data
            item['title'] = sel.xpath('h2/a/text()').extract_first()
            item['link'] = sel.xpath('h2/a/@href').extract_first()
            url = response.urljoin(item['link'])  # build the absolute URL that opens each article
            item['desc'] = sel.xpath('div[@class ="mob-sub"]/text()').extract_first()
            print(item['title'], item['link'], item['desc'])
            # Start the second-level request for url. callback names the method that
            # handles the response; meta carries the item along so title and desc are kept.
            yield scrapy.Request(url, callback=self.parse_artical, meta={'item': item})

    def parse_artical(self, response):
        item = response.meta['item']  # the item created in parse()
        item['posttime'] = response.xpath('//span[@class="article-time pull-left"]/text()').extract_first()
        # Update item['link'] on the second level: the link scraped earlier was incomplete,
        # and response.url is guaranteed to be a URL that actually opens.
        item['link'] = response.url
        print(item['posttime'], item['link'])
        yield item
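
Scrapy documents response.urljoin as a wrapper over the standard library's urllib.parse.urljoin, with the response's own URL as the base. The sketch below shows the same joining step outside Scrapy; the article paths are hypothetical and only stand in for whatever href the list page actually contains.

```python
from urllib.parse import urljoin

page_url = "http://www.huxiu.com"            # the list page the spider starts from
relative_link = "/article/123456.html"       # hypothetical href scraped from an <a> tag

# Combine the page URL with the relative href to get a URL that can be requested.
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "http://www.huxiu.com/article/654321.html")
print(already_absolute)
```

This is why the scraped link on its own cannot be opened, while the joined url can.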

5. Export the scraped data

Method 1: print the results directly: > scrapy crawl huxi3

Method 2: export the results to a JSON file: > scrapy crawl huxiu -o items.json

Method 3: save to a database

    Saving to a database can be done directly inside the spider, or through pipelines.

    Writing it directly in the spider:

import pymysql
import scrapy
from huxiu2.items import Huxiu2Item  # the whole crawler project is named huxiu2


class Huxiu2Spider(scrapy.Spider):
    name = "huxiu"

    def start_requests(self):
        url = 'http://www.huxiu.com/index.php'
        head = {"User-Agent": "xxx"}
        cookie = {"xxx": "yyy"}
        yield scrapy.Request(url=url, cookies=cookie, callback=self.parse, headers=head)

    def parse(self, response):
        # Connect to the database first.
        db = pymysql.connect(host="localhost", user="root", password="xxxx", database="xxxx")
        cursor = db.cursor()
        # DEFAULT CHARSET sets the encoding of the table.
        sql = """create table huxiu(`title` varchar(50),`link` varchar(50),`desc` varchar(50))DEFAULT CHARSET=utf8"""
        try:
            cursor.execute(sql)  # create the new table
            db.commit()
        except pymysql.MySQLError:
            db.rollback()
            print("error first try")
        for i in response.xpath('//div[@class="mod-b mod-art clearfix "]'):
            item = Huxiu2Item()
            # In Python 3 extract_first() already returns str, so the values need no .encode("utf-8").
            item['title'] = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/text()').extract_first()
            tmplink = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/@href').extract_first()
            item['link'] = response.urljoin(tmplink)
            item['desc'] = i.xpath('.//div[@class="mob-sub"]/text()').extract_first()
            # print('level 1', item['title'], item['link'], item['desc'])
            print(item['title'], "result")
            # The values are passed as query parameters; %s is pymysql's placeholder.
            sql2 = """INSERT INTO huxiu VALUES (%s, %s, %s)"""
            try:
                cursor.execute(sql2, (item['title'], item['link'], item['desc']))
                db.commit()
            except pymysql.MySQLError:
                db.rollback()
                print("error of second try")
        db.close()
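
Passing the scraped values to the database driver as query parameters, rather than formatting them into the SQL string, keeps a title that contains quotes from breaking the statement. The sketch below shows the idea with sqlite3 from the standard library standing in for MySQL, so it runs without a server. This is a substitution for illustration: sqlite3 uses ? as its placeholder where pymysql uses %s, and the rows are made-up sample data.

```python
import sqlite3


def save_rows(conn, rows):
    """Create the table if needed and insert (title, link, desc) tuples."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS huxiu (title TEXT, link TEXT, "desc" TEXT)'
    )
    # Placeholders let the driver handle quoting of the values.
    conn.executemany("INSERT INTO huxiu VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
# Hypothetical rows; the first title contains quotes that would break string-built SQL.
rows = [
    ("It's a 'quoted' title", "http://www.huxiu.com/article/1.html", "summary one"),
    ("Plain title", "http://www.huxiu.com/article/2.html", "summary two"),
]
save_rows(conn, rows)
stored = conn.execute("SELECT title FROM huxiu ORDER BY rowid").fetchall()
print(stored)
conn.close()
```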

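    Using a pipeline:

    Note: to use a pipeline,

        first, the spider must end with yield item; only then are items produced, otherwise nothing reaches the pipeline;

        second, the pipeline must be enabled in settings, in this format: ITEM_PIPELINES = {'huxiu2.pipelines.Huxiu2PipelinesJson': 300,}, where the dictionary key is the import path of your custom pipeline class;

        finally, note that a pipeline class usually has three methods, which run at the start, in the middle (once per item), and at the end.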
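
The number attached to each entry in ITEM_PIPELINES decides the order in which pipelines run: items pass through lower values first, and the values are conventionally kept in the 0 to 1000 range. The fragment below sketches this with a second pipeline, Huxiu2PipelinesMysql, which is a hypothetical name added only to show the ordering.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    'huxiu2.pipelines.Huxiu2PipelinesJson': 300,
    'huxiu2.pipelines.Huxiu2PipelinesMysql': 400,  # hypothetical second pipeline, for illustration only
}

# Not part of settings.py: list the pipelines in the order they would run.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values every item is written by the JSON pipeline before it reaches the second one.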

Spider code:

import scrapy
from huxiu2.items import Huxiu2Item  # the whole crawler project is named huxiu2


class Huxiu2Spider(scrapy.Spider):
    name = "huxiu"

    def start_requests(self):  # custom request method
        url = 'http://www.huxiu.com/index.php'
        head = {"User-Agent": "yyy"}
        cookie = {"xxx": "yyy"}
        yield scrapy.Request(url=url, cookies=cookie, callback=self.parse, headers=head)

    def parse(self, response):  # custom extraction method
        for i in response.xpath('//div[@class="mod-b mod-art clearfix "]'):
            item = Huxiu2Item()
            item['title'] = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/text()').extract_first()
            tmplink = i.xpath('./div[@class="mob-ctt index-article-list-yh"]/h2/a/@href').extract_first()
            item['link'] = response.urljoin(tmplink)
            item['desc'] = i.xpath('.//div[@class="mob-sub"]/text()').extract_first()
            # Printed to confirm that scraping works and to see how the pipeline behaves.
            print('level 1', item['title'], item['link'], item['desc'])
            yield item  # yield item is essential
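
The parse method selects each article block first and then uses relative paths inside it: ./div/... looks at direct children of the block, while .//div[...] searches anywhere beneath it. The sketch below reproduces that pattern on a tiny hand-written page using xml.etree.ElementTree from the standard library as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath and needs well-formed markup, and the sample markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup that mimics the class names used above.
SAMPLE = """
<html><body>
  <div class="mod-b mod-art clearfix ">
    <div class="mob-ctt index-article-list-yh">
      <h2><a href="/article/1.html">First title</a></h2>
      <div class="mob-sub">First summary</div>
    </div>
  </div>
  <div class="mod-b mod-art clearfix ">
    <div class="mob-ctt index-article-list-yh">
      <h2><a href="/article/2.html">Second title</a></h2>
    </div>
  </div>
</body></html>
"""


def extract_rows(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # The class value is matched as an exact string, trailing space included.
    for block in root.findall('.//div[@class="mod-b mod-art clearfix "]'):
        link = block.find('./div[@class="mob-ctt index-article-list-yh"]/h2/a')
        summary = block.find('.//div[@class="mob-sub"]')
        rows.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            "desc": summary.text if summary is not None else None,  # None when missing
        })
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

The second block has no summary, so its desc is None, much as extract_first() returns None in Scrapy when nothing matches.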

Pipeline code:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class Huxiu2PipelinesJson(object):
    def __init__(self):
        print('runs at the start')  # printed before any useful data arrives
        # Creates Huxiu.json in the current directory.
        self.file = open('Huxiu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Printed after every item, which shows that items are written to the
        # JSON file one at a time rather than all together at the end.
        print('process_item running')
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'  # convert the Python data to JSON text
        self.file.write(content)
        return item

    def close_spider(self, spider):
        print('runs at the end')  # printed after all the useful data has been processed
        self.file.close()
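
A pipeline is an ordinary Python class, so its three stages can be tried out without running a crawl. The sketch below calls the hooks by hand in the order Scrapy would. It is a simplified variant, not the pipeline above: the class name JsonLinesPipeline and the path argument to the constructor are assumptions made for the demo (in a real project Scrapy instantiates the class itself), and the sample items are invented.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Same shape as a Scrapy pipeline: open, process each item, close."""

    def __init__(self, path):
        self.path = path  # demo-only argument; see the note above
        self.file = None

    def open_spider(self, spider):
        # Runs once, before the first item.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs once per item; ensure_ascii=False keeps non-ASCII text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Runs once, after the last item.
        self.file.close()


# Drive the three hooks by hand, in the order Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "Huxiu.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in [{"title": "标题一", "link": "http://www.huxiu.com/article/1.html"},
             {"title": "标题二", "link": "http://www.huxiu.com/article/2.html"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines), json.loads(lines[0])["title"])
```

Each line of the output file is one JSON object, which is why the file can be written item by item.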

Reposted from: https://www.cnblogs.com/0-lingdu/p/9532350.html
