
Python | XPath Hands-On Practice

Published: 2025/3/15


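I. Preface

Today I'll show how to start your own spider from cmd and from PyCharm, give a basic introduction to XPath, and use XPath to scrape the basic information of a single article on Jobbole (伯乐在线).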

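II. Introduction to XPath

1. XPath according to Wikipedia

XPath, the XML Path Language, is a language for locating parts of an XML document. It is based on the tree structure of XML and provides the ability to find nodes in that tree. XPath was originally proposed as a common syntax model shared between XPointer and XSL, but developers quickly adopted it as a small query language.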

2. My own take on XPath

1. XPath uses path expressions to navigate XML and HTML documents (it is reportedly faster and more efficient than bs4)
2. XPath includes a standard function library
3. XPath is a W3C standard

3. Basic XPath syntax
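
The most common pieces of XPath syntax can be tried out with a minimal sketch like the one below. It runs on a made-up document and uses the standard library's `xml.etree.ElementTree`, which implements only a subset of XPath; the markup, ids, and class names are assumptions chosen for illustration. Scrapy's `response.xpath()` accepts the same expressions and also supports steps such as `text()` that ElementTree does not.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document standing in for a real page (hypothetical markup).
DOC = """
<html>
  <body>
    <div id="header"><a href="/home">Home</a></div>
    <div id="post-1" class="post">
      <h1>First post</h1>
      <p class="meta">2018/08/08</p>
      <p>Body text</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# name/name : step through direct children only
divs = root.findall("./body/div")
# //        : match descendants at any depth
all_p = root.findall(".//p")
# [n]       : select by position (1-based)
second_div = root.find("./body/div[2]")
# [@a='v']  : select by attribute value
meta = root.find(".//p[@class='meta']")
# *         : match any element name
by_id = root.find(".//*[@id='post-1']")

print(len(divs), len(all_p))
print(second_div.get("id"))
print(meta.text)
print(by_id.find("h1").text)
```
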

III. Read the code: learn, type, and take notes as you go

1. Starting our Scrapy project's spider, jobbole, from cmd

(1) Activate the virtual environment (see the previous post for how to set it up)

C:\Users\82055\Desktop>workon spiderenv

(2) Change to the project directory

(spiderenv) C:\Users\82055\Desktop>H:
(spiderenv) H:\env\spiderenv>cd H:\spider_project\spider_bole_blog\spider_bole_blog

(3) Run the spider command (format: scrapy crawl <name of the spider>)

(spiderenv) H:\spider_project\spider_bole_blog\spider_bole_blog>scrapy crawl jobbole

(4) On Windows, you may see the following error

  <module>
    from twisted.internet import _win32stdio
  File "h:\env\spiderenv\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'

(5) Fix: install the pypiwin32 module (here from the Douban mirror)

# inside the virtual environment
pip install -i https://pypi.douban.com/simple pypiwin32

(6) Run the spider command again

(spiderenv) H:\spider_project\spider_bole_blog\spider_bole_blog>scrapy crawl jobbole
2018-08-23 23:42:01 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: spider_bole_blog)
···
2018-08-23 23:42:04 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-23 23:42:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 440,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 21919,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 23, 15, 42, 4, 695188),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 23, 15, 42, 2, 770906)}
2018-08-23 23:42:04 [scrapy.core.engine] INFO: Spider closed (finished)

2. Starting our Scrapy project's spider, jobbole, from `PyCharm`

(1) Open the project and create a main.py in the project root to use for debugging.

(2) Put the following in main.py

'''
author : 極簡XksA
date : 2018.8.22
goal : debugging module
'''
import sys
import os
# Import the function that runs Scrapy command lines
from scrapy.cmdline import execute

# Add the project directory to sys.path.
# Option 1: hard-code the path. This is not portable: if two people keep the
# project in different places, the path has to be edited by hand on each machine.
# sys.path.append(r"H:\spider_project\spider_bole_blog\spider_bole_blog")
# Option 2: compute it from this file's location, so the code keeps working when moved.
# print(os.path.dirname(os.path.abspath(__file__)))
# result : H:\spider_project\spider_bole_blog\spider_bole_blog
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Run the spider command
execute(['scrapy', 'crawl', 'jobbole'])

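(3) Edit settings.py and change ROBOTSTXT_OBEY to False. In the generated file it is set to True (or the line may be commented out). The comment above it, "Obey robots.txt rules", means the spider follows the site's robots.txt, so requests for URLs that robots.txt disallows are filtered out before they are sent. For this exercise the value needs to be False.

# Around lines 21-22; ROBOTSTXT_OBEY is True here by default
# Change it to False, as follows:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False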
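
To make "filtered out by robots.txt" concrete, here is a small sketch using the standard library's `urllib.robotparser`. It is not how Scrapy implements the check internally; the rules and the `example.com` URLs are made up for illustration, and nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string instead of fetched.
RULES = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# A crawler that obeys robots.txt asks this question before every request.
allowed = parser.can_fetch("*", "http://example.com/114256/")
blocked = parser.can_fetch("*", "http://example.com/private/page")

print(allowed)  # True: no rule matches this path
print(blocked)  # False: the path falls under Disallow: /private/
```

With ROBOTSTXT_OBEY = True, a request in the second situation would be dropped rather than sent.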

(4) Run main.py directly; the result is the same as running from cmd above.

(5) Set a breakpoint in the parse function in jobbole.py, then run main.py in Debug mode

Setting the breakpoint:

[Figure: setting the breakpoint]

Debug result:

[Figure: analysis of the debug result]

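3. Writing the `parse` function in `jobbole.py` and using XPath to get page content

(1) To keep things simple, I picked an arbitrary article, "Linux 内核 Git 历史记录中，最大最奇怪的提交信息是这样的" (roughly, "The biggest, strangest commit message in the Linux kernel's Git history").

(2) Change start_urls to http://blog.jobbole.com/114256/ so the spider starts crawling from this article.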

start_urls = ['http://blog.jobbole.com/114256/']

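(3) Inspect the page to find the XPath of the article title

[Figure: page analysis]

In Firefox, press F12 to open the developer tools, click the element picker icon at the left of the Inspector, and hover over the title; the Inspector jumps to the title's position in the source.

As the figure shows, the title sits inside html, then body, then the first div, then that div's third div, then its first div, then its first div, and finally the h1 tag, so the XPath is: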

/html/body/div[1]/div[3]/div[1]/div[1]/h1

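Does that look complicated? Don't be discouraged; the analysis is simple enough, and there is an even easier way to get the XPath: once you've found the content you want in the Inspector, right-click it and copy its XPath directly.

[Figure: copying the XPath from the page]

(4) Modify the parse function in jobbole.py and print the article title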

# Scrapy's response has an xpath() method that can be called directly;
# it returns a list of Selector objects (a SelectorList)
# extract() returns the matched data as a list of strings
def parse(self, response):
    # XPath copied from Firefox
    re01_selector = response.xpath('/html/body/div[1]/div[3]/div[1]/div[1]/h1/text()')
    # XPath copied from Chrome
    re02_selector = response.xpath('//*[@id="post-114256"]/div[1]/h1/text()')
    # extract() returns a list, so take the first item before concatenating
    re01_title = re01_selector.extract()[0]
    re02_title = re02_selector.extract()[0]
    print('xpath result: ' + str(re01_selector))
    print('title via Firefox XPath: ' + re01_title)
    print('title via Chrome XPath: ' + re02_title)

Output:

# The Selector returned by xpath() carries both the XPath and the matched data
xpath result: [<Selector xpath='/html/body/div[1]/div[3]/div[1]/div[1]/h1/text()' data='Linux 内核 Git 历史记录中，最大最奇怪的提交信息是这样的'>]
title via Firefox XPath: Linux 内核 Git 历史记录中，最大最奇怪的提交信息是这样的
title via Chrome XPath: Linux 内核 Git 历史记录中，最大最奇怪的提交信息是这样的

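As you can see, the XPaths obtained from Firefox and Chrome differ, but they return the same thing using different syntax. The point is that there is more than one way to write an XPath: a given piece of content may have two or more valid XPaths, so use whichever you find easiest to understand.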
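
The same idea can be checked offline with a minimal sketch. The page below is a hypothetical stand-in that only mimics the nesting described earlier, and it uses the standard library's `xml.etree.ElementTree` (a subset of XPath) rather than Scrapy, so the paths start from the root element with `.` and the title is read from `.text` instead of `text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: body > div[1] > div[3] > div[1] > div[1] > h1
PAGE = """
<html>
  <body>
    <div>
      <div>nav</div>
      <div>banner</div>
      <div>
        <div id="post-114256">
          <div><h1>Sample title</h1></div>
        </div>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Positional path from the root, in the style Firefox copies
by_position = root.find("./body/div[1]/div[3]/div[1]/div[1]/h1")
# Path anchored on an id attribute, in the style Chrome copies
by_id = root.find(".//*[@id='post-114256']/div[1]/h1")

# Both expressions land on the very same element
print(by_position is by_id)
print(by_id.text)
```
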

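(5) Let's go on to extract other data (and review XPath usage along the way)
For fast, efficient debugging, here is a method I recommend:

# in the virtual environment in cmd, run: scrapy shell <URL you want to debug>
scrapy shell http://blog.jobbole.com/114256/

This keeps the fetched response available in the cmd session, so you can debug right there instead of rerunning the spider in PyCharm, and hitting the page again, for every value you test, which is very inefficient.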

Getting the article's publish date

>>> data_r = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()')
>>> data_r.extract()
['\r\n\r\n 2018/08/08 · ', '\r\n \r\n \r\n\r\n \r\n · ', ', ', '\r\n \r\n']
>>> data_r.extract()[0].strip()
'2018/08/08 ·'
>>> data_str = data_r.extract()[0].strip()
>>> data_str.strip()
'2018/08/08 ·'
>>> data_str.replace('·','').strip()
'2018/08/08'
# data_selector = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()')
# data_str = data_selector.extract()[0].strip()
# data_time = data_str.replace('·','').strip()
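
The cleaning steps from this session can be wrapped in a small function that runs without Scrapy. This is only a sketch: `clean_publish_date` is a hypothetical helper name, and converting the result to a `date` object with `datetime.strptime` is an extra step that the original code does not take.

```python
from datetime import datetime

def clean_publish_date(raw):
    """Drop the '·' separator and surrounding whitespace, then parse YYYY/MM/DD."""
    text = raw.replace("·", "").strip()
    return datetime.strptime(text, "%Y/%m/%d").date()

# The list that data_r.extract() returned in the shell session
raw_nodes = ['\r\n\r\n 2018/08/08 · ', '\r\n \r\n \r\n\r\n \r\n · ', ', ', '\r\n \r\n']

publish_date = clean_publish_date(raw_nodes[0])
print(publish_date.isoformat())  # 2018-08-08
```

Parsing into a real date makes a malformed value fail loudly with a ValueError instead of passing through as an odd string.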

Getting the article's like, bookmark, and comment counts

# Likes
>>> praise_number = response.xpath('//h10[@id="114256votetotal"]/text()')
>>> praise_number
[<Selector xpath='//h10[@id="114256votetotal"]/text()' data='1'>]
>>> praise_number = int(praise_number.extract()[0])
>>> praise_number
1
# praise_number = int(response.xpath('//h10[@id="114256votetotal"]/text()').extract()[0])

# Bookmarks
>>> collection_number = response.xpath('//span[@data-book-type="1"]/text()')
>>> collection_number
[<Selector xpath='//span[@data-book-type="1"]/text()' data=' 1 收藏'>]
>>> collection_number.extract()
[' 1 收藏']
>>> collection_word = collection_number.extract()[0]
>>> import re
>>> reg_str = r'(\d+)'  # raw string; \d+ captures the whole run of digits
>>> re.findall(reg_str,collection_word)
['1']
>>> collection_number = int(re.findall(reg_str,collection_word)[0])
>>> collection_number
1
# collection_str = response.xpath('//span[@data-book-type="1"]/text()').extract()[0]
# reg_str = r'(\d+)'
# collection_number = int(re.findall(reg_str,collection_str)[0])

# Comments
>>> comment_number = response.xpath('//span[@class="btn-bluet-bigger href-style hide-on-480"]/text()')
>>> comment_number
[<Selector xpath='//span[@class="btn-bluet-bigger href-style hide-on-480"]/text()' data=' 評論'>]
>>> comment_number.extract()[0]
' 評論'
# The article I picked is fairly new and has no comments yet.
# If it had any, the count would be extracted the same way as the bookmark count (regex match).
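
One detail of the regex step is worth a closer look: a pattern with a greedy `.*` in front of `(\d+)` captures only the last digit of a multi-digit count, and a label with no number at all (like the comment text here) gives an empty match list, so indexing `[0]` would raise an error. A minimal sketch of a safer helper follows; `extract_count` and its `default` parameter are hypothetical names, not part of the original code.

```python
import re

GREEDY = r'.*(\d+).*'   # the leading .* swallows every digit except the last
SIMPLE = r'\d+'         # matches the first full run of digits

def extract_count(text, default=0):
    """Return the first integer found in text, or default if there is none."""
    match = re.search(SIMPLE, text)
    return int(match.group()) if match else default

greedy_result = re.findall(GREEDY, ' 12 收藏')
print(greedy_result)               # ['2']  (the '1' was eaten by the leading .*)
print(extract_count(' 12 收藏'))   # 12
print(extract_count(' 評論'))      # 0  (no digits, so the default is returned)
```
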

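The above is my testing in cmd. You can see that I mostly used XPaths of the form //span[@data-book-type="1"] rather than the kind copied from Firefox, for two reasons:

1. In terms of form, this kind of XPath is clearly nicer; at the very least it is much shorter (for deeply nested data, a `Firefox`-style path longer than 100 characters would not be surprising).
2. In terms of reliability, this form matches more accurately. If a page contains data loaded by JS (which adds its own div, p, a and similar tags), a `Firefox`-style positional XPath may point at the wrong node or at nothing.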
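
The second reason can be reproduced with a small offline experiment. The sketch below builds the same hypothetical page twice, once with an extra element injected at the top of the body, and compares a positional path with an attribute-anchored one. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy, and all markup and names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def build_page(extra_banner):
    """Return a parsed page; optionally inject one more div before the content."""
    banner = "<div>banner injected later</div>" if extra_banner else ""
    markup = (
        "<html><body>"
        f"{banner}"
        "<div><div id='post-1'><h1>Sample title</h1></div></div>"
        "</body></html>"
    )
    return ET.fromstring(markup)

POSITIONAL = "./body/div[1]/div[1]/h1"   # depends on the exact element order
ANCHORED = ".//div[@id='post-1']/h1"     # depends only on the id attribute

def title(root, path):
    node = root.find(path)
    return None if node is None else node.text

before = build_page(extra_banner=False)
after = build_page(extra_banner=True)

print(title(before, POSITIONAL), title(before, ANCHORED))
print(title(after, POSITIONAL), title(after, ANCHORED))
```

On the unchanged page both paths find the title; once the extra div is present, the positional path returns None while the anchored path still works.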

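Suggestions:
(1) If you are serious about learning this, memorize and practice the XPath syntax from Part II of this article;
(2) When scraping pages, prefer Chrome for inspecting them.

4. The current code in `jobbole.py` and its output

Code: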

# -*- coding: utf-8 -*-
import scrapy
import re


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/114256/']

    def parse(self, response):
        # Title (XPath copied from Chrome)
        re01_selector = response.xpath('//*[@id="post-114256"]/div[1]/h1/text()')
        re01_title = re01_selector.extract()
        # Publish date
        data_selector = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()')
        data_str = data_selector.extract()[0].strip()
        data_time = data_str.replace('·', '').strip()
        # Likes
        praise_number = int(response.xpath('//h10[@id="114256votetotal"]/text()').extract()[0])
        # Bookmarks
        collection_str = response.xpath('//span[@data-book-type="1"]/text()').extract()[0]
        reg_str = r'(\d+)'  # captures the whole run of digits
        collection_number = int(re.findall(reg_str, collection_str)[0])
        print("Title: " + re01_title[0])
        print("Publish date: " + data_time)
        print("Likes: " + str(praise_number))
        print("Bookmarks: " + str(collection_number))

Output:

Title: Linux 内核 Git 历史记录中，最大最奇怪的提交信息是这样的
Publish date: 2018/08/08
Likes: 1
Bookmarks: 2

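IV. Afterword

After this installment you can probably feel the pull of web scraping. So far we have only scraped the title and other basic data from a single page; the important part is learning how to start the spider from cmd and PyCharm, and learning XPath. Next time I'll walk through CSS selectors so we can see which one is easier to use.

Originally published: 2018-09-07

Author: XksA

This article comes from "Python专栏" (Python Column), a partner of the Yunqi Community; follow "Python专栏" for related information.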
