
Scrapy Crawler (8): Getting Started with scrapy-splash

Published: 2023/12/15


2018-03-17 16:16:36 | 劍與星辰 | Tags: scrapy, splash, crawler

Category: scrapy

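Introduction to scrapy-splash

In the earlier posts we saw how powerful Scrapy is. It does have a limitation, though: Scrapy has no JavaScript engine, so it can only crawl static pages and cannot crawl dynamic pages generated by JavaScript. Since most modern websites use JavaScript to enrich their functionality, this is a real shortcoming.
So can we still happily use Scrapy to crawl dynamic pages? Is there a workaround? Yes: use the scrapy-splash module!
scrapy-splash is built mainly on Splash, a JavaScript rendering service. Splash is a lightweight browser with an HTTP API, implemented in Python using Twisted and QT. The Twisted QT reactor makes the service asynchronous, so it can take advantage of WebKit's concurrency. Splash's features are: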

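Process multiple web pages in parallel
Get HTML results and/or render pages to screenshots
Turn off image loading or use Adblock Plus rules to make rendering faster
Execute JavaScript in the context of the page
Write Lua browsing scripts
Develop Splash Lua scripts in Splash-Jupyter Notebooks
Get detailed rendering information in HAR format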
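
As a rough illustration of the HTTP API behind these features, the sketch below builds a request URL for Splash's render.html endpoint using only the standard library. It assumes a Splash instance at http://localhost:8050 (the address used later in this post). The helper name build_render_url is hypothetical; url, wait and images are parameters of the render.html endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SPLASH_URL = "http://localhost:8050"  # assumption: a local Splash instance


def build_render_url(page_url, wait=0.5, load_images=False):
    """Return a render.html URL asking Splash to render page_url.

    wait        -- seconds to wait after the page loads, so scripts can run
    load_images -- False maps to images=0, which skips image downloads
    """
    params = {
        "url": page_url,
        "wait": wait,
        "images": 1 if load_images else 0,
    }
    return f"{SPLASH_URL}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # Fetching this URL (in a browser or with an HTTP client) should return
    # the rendered HTML, provided Splash is running.
    print(build_render_url("https://www.baidu.com", wait=2))
```

Turning images off corresponds to the "make rendering faster" item in the list above.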

Installing scrapy-splash

Thanks to the features above, Splash works well with Scrapy and crawls efficiently.
If that introduction has made you curious about scrapy-splash, here is how to install it:
1. Install the scrapy-splash module:

pip3 install scrapy-splash

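2. scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run in Docker, so Docker must be installed. The install command differs between systems; on the author's CentOS 7 machine it is: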

sudo yum install docker

After installing Docker, run 'docker -v' to verify that the installation succeeded.

3. Start the Docker service and pull the Splash image:

sudo service docker start
sudo docker pull scrapinghub/splash

[Screenshot: output of the pull command]

4. Start the container:

sudo docker run -p 8050:8050 scrapinghub/splash

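Splash is now running on port 8050 (HTTP) of the local machine. Open 'localhost:8050' in a browser to see the Splash page. [Screenshot: Splash web page at localhost:8050]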
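
Before pointing Scrapy at Splash, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using only the standard library. The function name is_port_open is hypothetical, and the check only proves that a TCP connection succeeds, not that the listener is Splash.

```python
import socket


def is_port_open(host="localhost", port=8050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (refused, timeout, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"localhost:8050 is {state}")
```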

On this page we can run Lua scripts, which is very helpful when developing the Lua scripts used with scrapy-splash. That completes the installation of scrapy-splash.
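
Lua scripts can also be sent to Splash without the web page, by POSTing a JSON body to the execute endpoint. The sketch below only builds that body. The helper name build_execute_payload and the tiny Lua script are illustrative assumptions; lua_source is the field Splash reads the script from, and the other fields become available in Lua as args.url and args.wait.

```python
import json

# A minimal Lua script: load the page, wait, return the rendered HTML.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return splash:html()
end
"""


def build_execute_payload(url, wait=1.0, script=LUA_SCRIPT):
    """Return the JSON body for a POST to Splash's /execute endpoint."""
    return json.dumps({"lua_source": script, "url": url, "wait": wait})


if __name__ == "__main__":
    print(build_execute_payload("https://www.baidu.com", wait=5))
```

The resulting string would be posted to http://localhost:8050/execute with the header Content-Type: application/json, assuming Splash is running at that address.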

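A scrapy-splash example

Having installed scrapy-splash, it would be a shame not to try it out, so here is a simple example: looking up information about a mobile phone number on Baidu. For instance, if we type the phone number '159********' into the Baidu search box and search, Baidu shows information about that number. [Screenshot: Baidu result for the phone number]

We will use scrapy-splash to simulate these steps and extract the phone number information.

1. Create a Scrapy project named phone.
2. Configure settings.py as follows: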

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

For details on these settings, see https://pypi.python.org/pypi/scrapy-splash.
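
The numbers in DOWNLOADER_MIDDLEWARES are priorities, so their relative order matters. As a hedged sanity check, the sketch below verifies that a settings dictionary keeps the same relative order as the configuration above: SplashCookiesMiddleware before SplashMiddleware, and SplashMiddleware before HttpCompressionMiddleware. The function name check_downloader_middlewares is hypothetical, and the rules simply mirror this post's values rather than an official validation.

```python
SPLASH_COOKIES = "scrapy_splash.SplashCookiesMiddleware"
SPLASH = "scrapy_splash.SplashMiddleware"
COMPRESSION = (
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware"
)


def check_downloader_middlewares(middlewares):
    """Return a list of problems found in a DOWNLOADER_MIDDLEWARES dict."""
    problems = [
        f"missing: {name}"
        for name in (SPLASH_COOKIES, SPLASH, COMPRESSION)
        if name not in middlewares
    ]
    if problems:
        return problems
    # Lower numbers sit closer to the engine, higher numbers closer to the
    # downloader; the checks below compare only relative order.
    if not middlewares[SPLASH_COOKIES] < middlewares[SPLASH]:
        problems.append("SplashCookiesMiddleware should precede SplashMiddleware")
    if not middlewares[SPLASH] < middlewares[COMPRESSION]:
        problems.append("SplashMiddleware should precede HttpCompressionMiddleware")
    return problems


if __name__ == "__main__":
    settings = {SPLASH_COOKIES: 723, SPLASH: 725, COMPRESSION: 810}
    print(check_downloader_middlewares(settings) or "order looks consistent")
```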

3. Create the spider file phoneSpider.py with the following code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy_splash import SplashRequest

# Splash Lua script: open the page, type the phone number into the search
# box, click the search button, and return the rendered HTML.
script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    -- Quote the phone number so it is inserted as a JavaScript string.
    local js = string.format(
        "document.querySelector('#kw').value='%s';document.querySelector('#su').click()",
        args.phone)
    splash:evaljs(js)
    assert(splash:wait(args.wait))
    return splash:html()
end
"""


class phoneSpider(Spider):
    name = 'phone'
    allowed_domains = ['www.baidu.com']
    url = 'https://www.baidu.com'

    # Send the start request through Splash's execute endpoint.
    def start_requests(self):
        yield SplashRequest(self.url, callback=self.parse, endpoint='execute',
                            args={'lua_source': script, 'phone': '159*******', 'wait': 5})

    # Parse the rendered HTML and print the phone number information.
    def parse(self, response):
        info = response.css('div.op_mobilephone_r.c-gap-bottom-small').xpath('span/text()').extract()
        print('=' * 40)
        print(''.join(info))
        print('=' * 40)
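
The phone number is spliced into a JavaScript snippet as text, so quoting matters: a value that is not a valid JavaScript literal would break the snippet. One alternative, sketched below as an assumption rather than the author's approach, is to build the snippet on the Python side and let json.dumps produce a correctly escaped string literal. The helper name build_search_js is hypothetical; the selectors #kw and #su are the ones used in the spider.

```python
import json


def build_search_js(phone):
    """Return JavaScript that fills Baidu's search box and clicks search.

    json.dumps turns the phone number into a double-quoted, escaped string
    literal, which is also a valid JavaScript string literal.
    """
    quoted = json.dumps(str(phone))
    return (
        f"document.querySelector('#kw').value={quoted};"
        "document.querySelector('#su').click()"
    )


if __name__ == "__main__":
    print(build_search_js("159*******"))
```

The resulting string could then be passed to the Lua script as an extra argument and run there, instead of being formatted inside Lua.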

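4. Run the crawler with 'scrapy crawl phone'. [Screenshot: crawler output]

That concludes the example. The project is available on GitHub: https://github.com/percent4/phoneSpider. If you have any questions, feel free to leave a comment below.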

Reposted from: https://my.oschina.net/u/3367404/blog/2088091
