
XPath syntax in Python, parsing HTML/XML with the lxml library, and CSS selectors

Published: 2024/7/23


The lxml.etree Tutorial: https://lxml.de/tutorial.html

Parsing XML in Python 3: https://www.cnblogs.com/deadwood-2016/p/8116863.html

Microsoft documentation: XPath syntax and XPath functions
W3school XPath tutorial: http://www.w3school.com.cn/xpath/
Runoob XPath tutorial: http://www.runoob.com/xpath/xpath-tutorial.html
Jianshu, advanced XPath usage: https://www.jianshu.com/p/1575db75670f
Advanced XPath usage in 30 hands-on examples: https://www.sohu.com/a/211716225_236714
A ten-minute introduction to common XPath terms and expressions: http://www.bazhuayu.com/blog/2014091

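Preface

XPath, the XML Path Language, is a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language).
XPath is based on the tree structure of XML and provides a way to locate nodes in that tree. It also works on HTML.
XPath is a small query language.
In Python, the lxml library supports XPath syntax and is one of the more efficient ways to parse documents.

The lxml usage shown here follows the official lxml documentation: http://lxml.de/index.html
The XPath syntax follows w3school: http://www.w3school.com.cn/xpath/index.asp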

Installation

Installing the lxml library for Python

pip3 install lxml

How to use XPath in Python

step 1: install the lxml library
step 2: from lxml import etree    # etree is short for ElementTree
step 3: selector = etree.HTML(page_source)
step 4: selector.xpath(xpath_expression)

Example of using XPath with lxml:

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    # @File : test.py

    from lxml import etree

    html = '''
    <div id="content">
        <ul id="useful">
            <li>useful info 1</li>
            <li>useful info 2</li>
            <li>useful info 3</li>
        </ul>
        <ul id="useless">
            <li>useless info 1</li>
            <li>useless info 2</li>
            <li>useless info 3</li>
        </ul>
    </div>
    <div id="url">
        <a href="http://cighao.com">Chen Hao's blog</a>
        <a href="http://cighao.com.photo" title="Chen Hao's album">click to open</a>
    </div>
    '''


    def test():
        selector = etree.HTML(html)

        # Extract the text of the useful <li> items
        content = selector.xpath('//ul[@id="useful"]/li/text()')
        for each in content:
            print(each)

        # Extract attributes of the <a> elements
        link = selector.xpath('//a/@href')
        for each in link:
            print(each)

        title = selector.xpath('//a/@title')
        for each in title:
            print(each)


    if __name__ == '__main__':
        test()

Example 2:

    #!/usr/bin/env python3
    # Author: veelion

    import requests
    import lxml.html
    from pprint import pprint


    def parse(li):
        item = dict()
        a_text_1 = li.xpath('.//text()')[0]
        a_text_2 = li.xpath('.//a')[0].text
        a_href = li.xpath('.//a')[0].get('href')
        item['a_text_1'] = a_text_1
        item['a_text_2'] = a_text_2
        item['a_href'] = a_href
        return item


    def main():
        url = 'https://www.sina.com.cn/'
        headers = {'User-Agent': 'Firefox'}
        resp = requests.get(url, headers=headers)
        html = resp.content.decode('utf8')
        doc = lxml.html.fromstring(html)
        xp = '//div[@class="main-nav"]//li'
        lis = doc.xpath(xp)
        print('lis:', len(lis))
        channel_list = [parse(li) for li in lis]
        print('channel_list:', len(channel_list))
        # pprint(channel_list[0])
        for channel in channel_list:
            print(channel)


    if __name__ == '__main__':
        main()

Example 1 of using CSS selectors with lxml:

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    # @File : test_2.py

    import lxml.html
    from urllib.request import urlopen


    def scrape(html):
        tree = lxml.html.fromstring(html)
        # cssselect() needs the separate cssselect package (pip3 install cssselect)
        td = tree.cssselect('tr#places_neighbours__row > td.w2p_fw')[0]
        area = td.text_content()
        return area


    if __name__ == '__main__':
        r_html = urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
        print(scrape(r_html))

Example 2 of using CSS selectors with lxml:

    # -*- coding: utf-8 -*-

    import re
    import lxml.html
    from link_crawler import link_crawler

    FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld',
              'currency_code', 'currency_name', 'phone', 'postal_code_format',
              'postal_code_regex', 'languages', 'neighbours')


    def scrape_callback(url, html):
        # Only country pages (URLs containing /view/) carry the data table
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = [tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
                   for field in FIELDS]
            print(url, row)


    if __name__ == '__main__':
        link_crawler('http://example.webscraping.com/', '/(index|view)', scrape_callback=scrape_callback)

link_crawler.py

    import re
    import time
    import urllib.error
    import urllib.parse
    import urllib.request
    import urllib.robotparser
    from datetime import datetime


    def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None,
                     user_agent='wswp', proxy=None, num_retries=1, scrape_callback=None):
        """Crawl from the given seed URL following links matched by link_regex"""
        # the queue of URLs that still need to be crawled
        crawl_queue = [seed_url]
        # the URLs that have been seen and at what depth
        seen = {seed_url: 0}
        # track how many URLs have been downloaded
        num_urls = 0
        rp = get_robots(seed_url)
        throttle = Throttle(delay)
        headers = headers or {}
        if user_agent:
            headers['User-agent'] = user_agent

        while crawl_queue:
            url = crawl_queue.pop()
            depth = seen[url]
            # check url passes robots.txt restrictions
            if rp.can_fetch(user_agent, url):
                throttle.wait(url)
                html = download(url, headers, proxy=proxy, num_retries=num_retries)
                links = []
                if scrape_callback:
                    links.extend(scrape_callback(url, html) or [])

                if depth != max_depth:
                    # can still crawl further
                    if link_regex:
                        # filter for links matching our regular expression
                        links.extend(link for link in get_links(html) if re.match(link_regex, link))

                    for link in links:
                        link = normalize(seed_url, link)
                        # check whether already crawled this link
                        if link not in seen:
                            seen[link] = depth + 1
                            # check link is within same domain
                            if same_domain(seed_url, link):
                                # success! add this new link to queue
                                crawl_queue.append(link)

                # check whether have reached downloaded maximum
                num_urls += 1
                if num_urls == max_urls:
                    break
            else:
                print('Blocked by robots.txt:', url)


    class Throttle:
        """Throttle downloading by sleeping between requests to same domain"""

        def __init__(self, delay):
            # amount of delay between downloads for each domain
            self.delay = delay
            # timestamp of when a domain was last accessed
            self.domains = {}

        def wait(self, url):
            """Delay if have accessed this domain recently"""
            domain = urllib.parse.urlsplit(url).netloc
            last_accessed = self.domains.get(domain)
            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
                if sleep_secs > 0:
                    time.sleep(sleep_secs)
            self.domains[domain] = datetime.now()


    def download(url, headers, proxy, num_retries, data=None):
        print('Downloading:', url)
        request = urllib.request.Request(url, data, headers)
        opener = urllib.request.build_opener()
        if proxy:
            proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
            opener.add_handler(urllib.request.ProxyHandler(proxy_params))
        try:
            response = opener.open(request)
            # decode to text so that the regex in get_links can search it
            html = response.read().decode('utf-8', errors='replace')
        except urllib.error.URLError as e:
            print('Download error:', e.reason)
            html = ''
            if hasattr(e, 'code'):
                if num_retries > 0 and 500 <= e.code < 600:
                    # retry 5XX HTTP errors
                    html = download(url, headers, proxy, num_retries - 1, data)
        return html


    def normalize(seed_url, link):
        """Normalize this URL by removing hash and adding domain"""
        link, _ = urllib.parse.urldefrag(link)  # remove hash to avoid duplicates
        return urllib.parse.urljoin(seed_url, link)


    def same_domain(url1, url2):
        """Return True if both URLs belong to same domain"""
        return urllib.parse.urlparse(url1).netloc == urllib.parse.urlparse(url2).netloc


    def get_robots(url):
        """Initialize robots parser for this domain"""
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urllib.parse.urljoin(url, '/robots.txt'))
        rp.read()
        return rp


    def get_links(html):
        """Return a list of links from html"""
        # a regular expression to extract all links from the webpage
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        # list of all links from the webpage
        return webpage_regex.findall(html)


    if __name__ == '__main__':
        link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1,
                     user_agent='BadCrawler')
        link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1,
                     max_depth=1, user_agent='GoodCrawler')

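Parsing XML in Python 3

Source: https://www.cnblogs.com/deadwood-2016/p/8116863.html

Parsing XML documents with XPath in Python: http://www.jingfengshuo.com/archives/1414.html

When it comes to XML parsing, Python lives up to its "batteries included" principle. The standard library ships so many packages and tools for processing XML that newcomers to Python often do not know which one to choose.

This article looks at the ways to parse XML files in Python and uses the ElementTree module, which the author recommends, to demonstrate concrete usage and scenarios.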

I. What is XML?

XML stands for Extensible Markup Language, and the markup is the key part. You create content and then mark it up with delimiting tags, so that each word, phrase, or block becomes identifiable, classifiable information.

Markup languages evolved from early proprietary corporate and government formats into the Standard Generalized Markup Language (SGML) and the Hypertext Markup Language (HTML), and eventually into XML. XML has the following characteristics.

XML is designed to transport data, not to display it.
XML tags are not predefined. You define your own tags.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

Today XML plays a role on the Web comparable to that of HTML, the long-standing cornerstone of the Web. XML is widely used to exchange data between applications and is increasingly popular for storing and describing information. Knowing how to parse XML files is therefore important for web development.

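II. Which Python packages can parse XML?

The Python standard library provides six packages for processing XML.

xml.dom

xml.dom implements the DOM API defined by the W3C. Use this package if you are used to the DOM API or are required to use it. Note that the package contains several modules whose performance differs.

A DOM parser must hold the whole tree built from the XML file in memory before any processing starts, so its memory usage depends entirely on the size of the input.

xml.dom.minidom

xml.dom.minidom is a minimal implementation of the DOM API. It is much simpler than the full DOM, and the package is much smaller. Readers who are not familiar with the DOM should consider the xml.etree.ElementTree module instead. According to the author of lxml, minidom is inconvenient to use, inefficient, and error-prone.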
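As a rough illustration of the DOM style described above, the following sketch parses a small document with xml.dom.minidom and reads an attribute from each element. The sample XML and the helper name branch_names are made up for this example.

```python
from xml.dom import minidom

# Made-up sample document for illustration only.
XML_TEXT = '<doc><branch name="release01">xml,sgml</branch><branch name="invalid"/></doc>'


def branch_names(xml_text):
    """Return the name attribute of every <branch> element, in document order."""
    # parseString builds the complete DOM tree in memory before returning.
    dom = minidom.parseString(xml_text)
    return [node.getAttribute("name") for node in dom.getElementsByTagName("branch")]


if __name__ == "__main__":
    print(branch_names(XML_TEXT))
```

The whole tree exists in memory once parseString returns, which is the memory cost of the DOM approach mentioned above.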

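xml.dom.pulldom

Unlike the other modules, xml.dom.pulldom provides a "pull parser". The basic idea is to pull events from the XML stream and then process them. Like SAX, it uses an event-driven processing model, but with a pull parser the user explicitly pulls events from the XML stream and iterates over them until processing is finished or an error occurs.

Pull parsing is a more recent trend in XML processing. Earlier popular XML parsing frameworks such as SAX and DOM are push-based, meaning that control over the parsing process lies with the parser.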
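A minimal sketch of the pull model, assuming a made-up sample document and a hypothetical helper named pulled_branch_names: the calling code drives the loop and decides which events to act on.

```python
from xml.dom import pulldom

# Made-up sample document for illustration only.
XML_TEXT = '<doc><branch name="release01">xml,sgml</branch><branch name="invalid"/></doc>'


def pulled_branch_names(xml_text):
    """Pull events from the stream and collect the names of <branch> elements."""
    names = []
    # The for loop pulls one (event, node) pair at a time from the parser.
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "branch":
            names.append(node.getAttribute("name"))
    return names


if __name__ == "__main__":
    print(pulled_branch_names(XML_TEXT))
```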

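xml.sax

The xml.sax module implements the SAX API, trading convenience for speed and lower memory usage. SAX stands for Simple API for XML; it is not a standard proposed by the W3C. It is event-driven and does not need to read the whole document at once: reading the document is the parsing process. Event-driven here means a style of programming based on callbacks.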
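The callback mechanism can be sketched as follows. The handler class BranchHandler, the helper sax_branch_names, and the sample XML are hypothetical names chosen for this example; the parser calls startElement each time it meets an opening tag.

```python
import xml.sax

# Made-up sample document for illustration only.
XML_TEXT = '<doc><branch name="release01">xml,sgml</branch><branch name="invalid"/></doc>'


class BranchHandler(xml.sax.ContentHandler):
    """Collects the name attribute of each <branch> element as it is parsed."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # Called by the parser (push model) for every opening tag.
        if name == "branch":
            self.names.append(attrs.get("name"))


def sax_branch_names(xml_text):
    handler = BranchHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


if __name__ == "__main__":
    print(sax_branch_names(XML_TEXT))
```

No tree is built here; anything the program needs later has to be stored by the handler itself.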

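xml.parsers.expat

xml.parsers.expat provides a direct, low-level API to the expat parser, which is written in C. The expat interface resembles SAX in that it is based on event callbacks, but it is not standardized and applies only to the expat library.

expat is a stream-oriented parser. You register callback (handler) functions with the parser and then start feeding it the document. When the parser recognizes the relevant place in the document, it calls the corresponding handler, if one has been registered. The document can be fed to the parser in several chunks, so it is loaded into memory piece by piece. This is why expat can parse very large files.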
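The chunked feeding described above might look like the sketch below. The helper expat_branch_names and the sample XML are assumptions for this example; splitting the input in two halves merely imitates reading a large file piece by piece.

```python
import xml.parsers.expat

# Made-up sample document for illustration only.
XML_TEXT = '<doc><branch name="release01">xml,sgml</branch><branch name="invalid"/></doc>'


def expat_branch_names(xml_text):
    """Feed the document to expat in two chunks and collect <branch> names."""
    names = []

    def start_element(name, attrs):
        # attrs is a plain dict of attribute names to values.
        if name == "branch":
            names.append(attrs.get("name"))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element

    half = len(xml_text) // 2
    parser.Parse(xml_text[:half], False)  # more data will follow
    parser.Parse(xml_text[half:], True)   # final chunk
    return names


if __name__ == "__main__":
    print(expat_branch_names(XML_TEXT))
```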

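xml.etree.ElementTree (ET for short below)

The xml.etree.ElementTree module provides a lightweight, Pythonic API together with an efficient C implementation, xml.etree.cElementTree. Compared with DOM, ET is faster and its API is more direct and convenient. Compared with SAX, the ET.iterparse function also offers on-demand parsing and does not read the whole document into memory at once. ET's performance is roughly comparable to that of the SAX module, but its API is higher level and easier to use.

The author recommends ET as the first choice for XML parsing in Python, unless you have special requirements that call for another module.

These XML parsing APIs were not invented by Python; they were borrowed from or brought in directly from other languages. expat, for example, is a library written in C for parsing XML documents. SAX was originally developed in Java by David Megginson. DOM gives access to, and allows modification of, a document's content and structure in a platform- and language-independent way, so it can be used from any programming language.

Below, we use the ElementTree module as an example to show how to parse XML in Python.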

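III. Parsing XML with ElementTree

Historically the Python standard library provided two implementations of ET: the pure-Python xml.etree.ElementTree and the faster C implementation xml.etree.cElementTree. On those older versions you should always prefer the C implementation, because it is much faster and uses much less memory. If the Python version you use might lack the accelerator module that cElementTree needs, you can import the module like this:

    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET

This is a common import pattern when an API has several implementations. In practice, importing the first module directly will most likely work. Note that from Python 3.3 onward this pattern is no longer needed, because the ElementTree module automatically uses the C accelerator when it is available and falls back to the Python implementation otherwise. On Python 3.3+ you only need import xml.etree.ElementTree. The cElementTree module was deprecated in Python 3.3 and removed in Python 3.9.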

1. Parsing an XML document into a tree

Let us start with the basics. XML is a structured, hierarchical data format, and the data structure that represents it best is a tree. ET provides two objects: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. Interaction with the whole document (reading, writing, finding elements) usually happens at the ElementTree level. Work on a single XML element and its children happens at the Element level. The examples below show the main usage.

We use the following XML document, saved as doc1.xml, as demonstration data:

    <?xml version="1.0"?>
    <doc>
        <branch name="codingpy.com" hash="1cdf045c">
            text,source
        </branch>
        <branch name="release01" hash="f200013e">
            <sub-branch name="subrelease01">
                xml,sgml
            </sub-branch>
        </branch>
        <branch name="invalid">
        </branch>
    </doc>

Next, we load the document and parse it:

    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.ElementTree(file='doc1.xml')

Then we get the root element:

    >>> tree.getroot()
    <Element 'doc' at 0x11eb780>

As mentioned, the root element is an Element object. Let us look at its tag and attributes:

    >>> root = tree.getroot()
    >>> root.tag, root.attrib
    ('doc', {})

The root element has no attributes. Like any other Element object, it supports iteration over its direct children:

    >>> for child_of_root in root:
    ...     print(child_of_root.tag, child_of_root.attrib)
    ...
    branch {'name': 'codingpy.com', 'hash': '1cdf045c'}
    branch {'name': 'release01', 'hash': 'f200013e'}
    branch {'name': 'invalid'}

We can also access a particular child by index:

    >>> root[0].tag, root[0].text
    ('branch', '\n        text,source\n    ')

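2. Finding elements

The examples above make it clear that all elements in the tree can be reached by simple recursion (for each element, visit all of its children recursively). Because this is such a common task, ET provides some shortcuts.

Element objects have an iter method that performs a depth-first traversal (DFS) of all elements below a given element. ElementTree objects have the same method. The simplest way to visit every element in an XML document is:

    >>> for elem in tree.iter():
    ...     print(elem.tag, elem.attrib)
    ...
    doc {}
    branch {'name': 'codingpy.com', 'hash': '1cdf045c'}
    branch {'name': 'release01', 'hash': 'f200013e'}
    sub-branch {'name': 'subrelease01'}
    branch {'name': 'invalid'}

On this basis we can traverse the tree in any way we like, visiting every element and picking out the attributes we are interested in. ET makes this even simpler: the iter method accepts a tag name and then visits only the elements with that tag:

    >>> for elem in tree.iter(tag='branch'):
    ...     print(elem.tag, elem.attrib)
    ...
    branch {'name': 'codingpy.com', 'hash': '1cdf045c'}
    branch {'name': 'release01', 'hash': 'f200013e'}
    branch {'name': 'invalid'}

3. Finding elements with XPath

Finding elements with XPath is more convenient still. Element objects have several find methods that accept an XPath path as argument (ElementTree supports a limited subset of XPath). find returns the first matching child element, findall returns all matching child elements as a list, and iterfind returns an iterator over all matching elements. ElementTree objects have the same methods, and their search starts at the root node.

Here is an example of finding elements with XPath:

    >>> for elem in tree.iterfind('branch/sub-branch'):
    ...     print(elem.tag, elem.attrib)
    ...
    sub-branch {'name': 'subrelease01'}

The code above returns all elements with the tag sub-branch that sit directly under a branch element. Next we find all branch elements whose name attribute has a given value:

    >>> for elem in tree.iterfind('branch[@name="release01"]'):
    ...     print(elem.tag, elem.attrib)
    ...
    branch {'name': 'release01', 'hash': 'f200013e'}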

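4. Building XML documents

With ET it is easy to build an XML document and save it to a file. The write method of the ElementTree object does this.

There are two main scenarios. In the first, you read an XML document, modify it, and write the changes back. In the second, you create a new XML document from scratch.

To modify a document, adjust its Element objects. Consider the following example:

    >>> root = tree.getroot()
    >>> del root[2]
    >>> root[0].set('foo', 'bar')
    >>> for subelem in root:
    ...     print(subelem.tag, subelem.attrib)
    ...
    branch {'name': 'codingpy.com', 'hash': '1cdf045c', 'foo': 'bar'}
    branch {'name': 'release01', 'hash': 'f200013e'}

In the code above we deleted the third child of the root element and added a new attribute to the first child. The tree can then be written out again. In Python 3, pass encoding='unicode' when writing to a text stream such as sys.stdout. The resulting XML document looks like this:

    >>> import sys
    >>> tree.write(sys.stdout, encoding='unicode')
    <doc>
        <branch name="codingpy.com" hash="1cdf045c" foo="bar">
            text,source
        </branch>
        <branch name="release01" hash="f200013e">
            <sub-branch name="subrelease01">
                xml,sgml
            </sub-branch>
        </branch>
        </doc>

Note that in older Python versions the attribute order in the output could differ from the original document, because ET sorted the attributes when serializing. Since Python 3.8 the order given by the user is preserved. XML itself does not care about attribute order.

Building a complete document from scratch is also easy. The ET module provides a SubElement factory function that makes creating elements simple:

    >>> a = ET.Element('elem')
    >>> c = ET.SubElement(a, 'child1')
    >>> c.text = "some text"
    >>> d = ET.SubElement(a, 'child2')
    >>> b = ET.Element('elem_b')
    >>> root = ET.Element('root')
    >>> root.extend((a, b))
    >>> tree = ET.ElementTree(root)
    >>> tree.write(sys.stdout, encoding='unicode')
    <root><elem><child1>some text</child1><child2 /></elem><elem_b /></root>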

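5. Parsing XML streams with iterparse

XML documents are often large, and reading one into memory in full can cause problems during parsing. This is one of the reasons the SAX API is recommended over DOM for such documents.

As discussed above, ET can load an XML document as an in-memory tree and then process it. When parsing large files, does this not run into the same high memory consumption as DOM? It does. To address this, ET provides a special SAX-like tool, iterparse, which parses XML incrementally.

Next, the author shows how to use iterparse and compares it with the standard tree parsing approach. We use an automatically generated XML document. Here is the beginning of the document:

    <?xml version="1.0" standalone="yes"?>
    <site>
        <regions>
            <africa>
                <item id="item0">
                    <location>United States</location>
                    <quantity>1</quantity>
                    <name>duteous nine eighteen </name>
                    <payment>Creditcard</payment>
                    <description>
                        <parlist>
    [...]

Let us count how many location elements with the text value Zimbabwe appear in the document. Here is the standard approach using ET.parse:

    import sys
    import xml.etree.ElementTree as ET

    tree = ET.parse(sys.argv[2])

    count = 0
    for elem in tree.iter(tag='location'):
        if elem.text == 'Zimbabwe':
            count += 1

    print(count)

The code above loads all elements into memory and then examines them one by one. When parsing an XML document of about 100 MB, the peak memory usage of the Python process running this script was about 560 MB, and the total run time was 2.9 seconds.

Note that we do not actually need to load the whole tree into memory. We only need to detect the location elements whose text has the target value; all other data can be discarded. This is where the iterparse method comes in:

    import sys
    import xml.etree.ElementTree as ET

    count = 0
    for event, elem in ET.iterparse(sys.argv[2]):
        if event == 'end':
            if elem.tag == 'location' and elem.text == 'Zimbabwe':
                count += 1
        elem.clear()  # discard the element

    print(count)

The for loop above iterates over iterparse events. It first checks whether the event is end, then whether the element's tag is location and whether its text matches the target value. Calling elem.clear() is essential: iterparse still builds a tree, only incrementally. Discarding the elements that are no longer needed amounts to discarding the whole tree and releases the memory allocated for it.

When the same file is parsed with this script, peak memory usage is only 7 MB and the run time is 2.5 seconds. The speed-up comes from traversing the tree only once, while it is being built. The standard approach with parse first builds the whole tree and then traverses it again to find the wanted elements.

The performance of iterparse is comparable to SAX, but its API is more useful: iterparse builds the tree incrementally, whereas with SAX you have to build the tree yourself.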
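The iterparse pattern can be tried without a large input file. The sketch below is a self-contained variant that assumes a tiny made-up document held in memory; the function name count_locations and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for the large generated document.
SAMPLE = (
    b"<site>"
    b"<item><location>Zimbabwe</location></item>"
    b"<item><location>United States</location></item>"
    b"<item><location>Zimbabwe</location></item>"
    b"</site>"
)


def count_locations(source, wanted):
    """Count <location> elements whose text equals `wanted`, clearing elements as we go."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "location" and elem.text == wanted:
            count += 1
        # Drop the element's text, attributes and children once it has been inspected.
        elem.clear()
    return count


if __name__ == "__main__":
    print(count_locations(io.BytesIO(SAMPLE), "Zimbabwe"))
```

Passing a file-like object instead of a file name keeps the example runnable anywhere; with a real file, the path can be passed directly to ET.iterparse.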

IV. Usage example (etree in lxml):

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    # @File : test_1.py

    """
    Element is the core class for XML processing. An Element object can be thought of
    as an XML node, and most XML node handling revolves around this class.
    This part covers three topics: operations on nodes, on node attributes,
    and on the text inside nodes.
    """

    from lxml import etree
    import lxml.html as HTML

    # 1. Create an element
    root = etree.Element('root')
    print(root, root.tag)

    # 2. Add child nodes
    child1 = etree.SubElement(root, 'child1')
    child2 = etree.SubElement(root, 'child2')

    # 3. Remove a child node
    # root.remove(child2)

    # 4. Remove all child nodes
    # root.clear()

    # 5. Work with the children as a list
    print(len(root))
    print(root.index(child1))  # index of the child
    root.insert(0, etree.Element('child3'))  # insert at a position
    root.append(etree.Element('child4'))  # append at the end

    # 6. Get the parent node
    print(child1.getparent().tag)
    # print(root[0].getparent().tag)  # get a child by index, then its parent
    # --- the above are node operations ---

    # 7. Create attributes
    # root.set('hello', 'dahu')  # set(attribute name, attribute value)
    # root.set('hi', 'qing')

    # 8. Read attributes
    # print(root.get('hello'))  # get method
    # print(root.keys(), root.values(), root.items())  # works like a dict
    # print(root.attrib)  # the dict that holds the node's attributes
    # --- the above are attribute operations ---

    # 9. The text and tail attributes
    # root.text = 'Hello, World!'
    # print(root.text)

    # 10. Combining text and tail
    html = etree.Element('html')
    html.text = 'html.text'
    body = etree.SubElement(html, 'body')
    body.text = 'wo ai ni'
    child = etree.SubElement(body, 'child')
    # A node without text is serialized as a single self-closing tag;
    # with text, both the opening and closing tags appear.
    child.text = 'child.text'
    child.tail = 'tails'  # tail is the text that follows the closing tag
    print(etree.tostring(html))
    # print(etree.tostring(html, method='text'))  # only text and tail, run together

    # 11. XPath
    # print(html.xpath('string()'))  # same as above: only the text and tail content
    print(html.xpath('//text()'))  # better: each text value is a separate list item
    tt = html.xpath('//text()')
    print(tt[0].getparent().tag)  # go from a text result back to the node that owns it
    # for i in tt:
    #     print(i, i.getparent().tag, '\t', end='')

    # 12. Check the kind of text
    print(tt[0].is_text, tt[-1].is_tail)  # ordinary text or tail text
    # --- the above are text operations ---

    # 13. Parse a string with fromstring
    xml_data = '<html>html.text<body>wo ai ni<child>child.text</child>tails</body></html>'
    root1 = etree.fromstring(xml_data)
    # print(root1.tag)
    # print(etree.tostring(root1))

    # 14. Parse a string with XML, essentially the same as fromstring
    root2 = etree.XML(xml_data)
    print(etree.tostring(root2))

    # 15. Parse a file. etree.parse accepts:
    #   - a file name/path
    #   - a file object
    #   - a file-like object
    #   - a URL using the HTTP or FTP protocol
    tree = etree.parse('text.html')  # parse the file into an element tree
    root3 = tree.getroot()  # root node of the element tree
    print(etree.tostring(root3, pretty_print=True))

    parser = etree.XMLParser(remove_blank_text=True)  # drop whitespace-only text between tags
    root = etree.XML("<root> <a/> <b> </b> </root>", parser)
    print(etree.tostring(root))

    # 16. Parse with HTML
    xml_data1 = '<root>data</root>'
    root4 = etree.HTML(xml_data1)
    print(etree.tostring(root4))  # HTML() adds <html> and <body> if they are missing
    # For HTML that needs to be completed, this also works:
    with open("quotes-1.html", 'r', encoding='utf-8') as f:
        a = HTML.document_fromstring(f.read())
        for i in a.xpath('//div[@class="quote"]/span[@class="text"]/text()'):
            print(i)

    # 17. Output as XML
    print(etree.tostring(root))
    print(etree.tostring(root, xml_declaration=True, pretty_print=True, encoding='utf-8'))  # XML declaration and encoding
    # --- the above are file I/O operations ---

    # 18. find and findall
    root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
    print(root.findall('a')[0].text)  # findall returns a list
    print(root.find('.//a').text)  # find returns the first matching element
    print(root.find('a').text)
    print([b.text for b in root.findall('.//a')])  # combined with a list comprehension
    print(root.findall('.//a[@x]')[0].tag)  # query by attribute
    # --- the above are search operations ---

    print(etree.iselement(root))
    a_elem = root[0]
    print(a_elem[0] is a_elem[1].getprevious())  # order among sibling nodes
    print(a_elem[1] is a_elem[0].getnext())

    # Other: traversing the element tree
    root = etree.Element("root")
    etree.SubElement(root, "child").text = "Child 1"
    etree.SubElement(root, "child").text = "Child 2"
    etree.SubElement(root, "another").text = "Child 3"
    etree.SubElement(root[0], "childson").text = "son 1"
    # for i in root.iter():         # depth-first traversal
    # for i in root.iter('child'):  # iterate over matching tags only
    #     print(i.tag, i.text)
    # print(etree.tostring(root, pretty_print=True))

Simple creation and traversal

    from lxml import etree

    # Create the root
    root = etree.Element('root')
    # Add a child element and give it an attribute
    root.append(etree.Element('child', interesting='sss'))
    # Another way to add a child element
    body = etree.SubElement(root, 'body')
    body.text = 'TEXT'  # set the text
    body.set('class', 'dd')  # set an attribute
    # Print the whole node
    print(etree.tostring(root, encoding='UTF-8', pretty_print=True))

    # Create a tree with child nodes, text, an entity and a comment
    root = etree.Element('root')
    etree.SubElement(root, 'child').text = 'Child 1'
    etree.SubElement(root, 'child').text = 'Child 2'
    etree.SubElement(root, 'child').text = 'Child 3'
    root.append(etree.Entity('#234'))
    root.append(etree.Comment('some comment'))  # add a comment
    # Add a br to the third child
    br = etree.SubElement(root[2], 'br')
    br.tail = 'TAIL'
    for element in root.iter():  # root.iter(tag=etree.Element) would visit Elements only
        if isinstance(element.tag, str):
            print('%s - %s' % (element.tag, element.text))
        else:
            print('SPECIAL: %s - %s' % (element, element.text))

Parsing HTML / XML

    # Import the modules first
    from lxml import etree
    from io import StringIO, BytesIO

    # The HTML parser repairs and completes broken tags
    broken_html = '<html><head><title>test<body><h1 class="hh">page title</h3>'
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(broken_html), parser)  # or simply: html = etree.HTML(broken_html)
    print(etree.tostring(tree, pretty_print=True, method="html"))

    # Get the h1 elements with XPath
    h1 = tree.xpath('//h1')  # returns a list
    # Tag of the first match
    print(h1[0].tag)
    # class attribute of the first match
    print(h1[0].get('class'))
    # Text content of the first match
    print(h1[0].text)
    # Lists of all attribute keys and values
    print(h1[0].keys(), h1[0].values())

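Miscellaneous

Using lxml with Python 3.5
Question 1: given an XML file, how do you parse it?
Question 2: after parsing, how do you find and locate a particular tag?
Question 3: once a tag is located, how do you work with it, for example to read its attributes or text content?
Before starting, import the module. The commonly used XML processing features of the library all live in lxml.etree.
Import the module: from lxml import etree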

The Element class

Element is the core class for XML processing. An Element object can be thought of as an XML node, and most XML node handling revolves around this class.
This part covers three topics: operations on nodes, on node attributes, and on the text inside nodes.

Node operations

1. Creating an Element object
Call Element directly; the argument is the node name.

    >>> root = etree.Element('root')
    >>> print(root)
    <Element root at 0x...>

2. Getting the node name
Use the tag attribute to get the name of the node.

    >>> print(root.tag)
    root

3. Printing the XML content
Use the tostring function to output the XML content (more on this later); the argument is an Element object.

    >>> print(etree.tostring(root))
    b'<root/>'

4. Adding child nodes
Use SubElement to create a child node. The first argument is the parent node (an Element object), the second is the name of the child node.

    >>> child1 = etree.SubElement(root, 'child1')
    >>> child2 = etree.SubElement(root, 'child2')
    >>> child3 = etree.SubElement(root, 'child3')

5. Removing child nodes
Use remove to delete a given node; the argument is an Element object. The clear method removes all child nodes.

    >>> root.remove(child1)  # remove the given child node
    >>> print(etree.tostring(root))
    b'<root><child2/><child3/></root>'
    >>> root.clear()  # remove all child nodes
    >>> print(etree.tostring(root))
    b'<root/>'

6. Working with child nodes as a list
The children of an Element object can be treated as a list. The examples below assume that root still has the three children created in step 4.

    >>> child = root[0]  # access by index
    >>> print(child.tag)
    child1
    >>> print(len(root))  # number of children
    3
    >>> root.index(child2)  # index of a child
    1
    >>> for child in root:  # iteration
    ...     print(child.tag)
    child1
    child2
    child3
    >>> root.insert(0, etree.Element('child0'))  # insertion
    >>> start = root[:1]  # slicing
    >>> end = root[-1:]
    >>> print(start[0].tag)
    child0
    >>> print(end[0].tag)
    child3
    >>> root.append(etree.Element('child4'))  # append at the end
    >>> print(etree.tostring(root))
    b'<root><child0/><child1/><child2/><child3/><child4/></root>'

The two removal methods mentioned earlier, remove and clear, also behave much like their list counterparts.

7. Getting the parent node
Use the getparent method to get the parent node.

    >>> print(child1.getparent().tag)
    root

Attribute operations

Attributes are stored as key-value pairs, like a dictionary.

1. Creating attributes

Attributes can be created together with the Element object by passing them as keyword arguments after the node name:

    >>> root = etree.Element('root', interesting='totally')
    >>> print(etree.tostring(root))
    b'<root interesting="totally"/>'

The set method adds an attribute to an existing Element object. Its two arguments are the attribute name and the attribute value:

    >>> root.set('hello', 'Huhu')
    >>> print(etree.tostring(root))
    b'<root interesting="totally" hello="Huhu"/>'

2. Reading attributes

The get method returns the value of one attribute:

    >>> print(root.get('interesting'))
    totally

The keys method returns all attribute names:

    >>> sorted(root.keys())
    ['hello', 'interesting']

The items method returns all key-value pairs:

    >>> for name, value in sorted(root.items()):
    ...     print('%s = %r' % (name, value))
    hello = 'Huhu'
    interesting = 'totally'

The attrib property gives access to all attributes and their values as a dictionary-like object:

    >>> attributes = root.attrib
    >>> print(attributes)
    {'interesting': 'totally', 'hello': 'Huhu'}

    >>> attributes['good'] = 'Bye'  # changes to the dictionary affect the node
    >>> print(root.get('good'))
    Bye

Text operations

With tags and their attributes covered, what remains is the text inside tags.
Text content can be accessed through the text and tail attributes or through XPath.

1. The text and tail attributes

In most cases the text of a tag is accessed through the text attribute of the Element.

    >>> root = etree.Element('root')
    >>> root.text = 'Hello, World!'
    >>> print(root.text)
    Hello, World!
    >>> print(etree.tostring(root))
    b'<root>Hello, World!</root>'

The Element class also provides a tail attribute, which holds the text that follows an element, for example the text after a self-closing tag.

    >>> html = etree.Element('html')
    >>> body = etree.SubElement(html, 'body')
    >>> body.text = 'Text'
    >>> print(etree.tostring(html))
    b'<html><body>Text</body></html>'

    >>> br = etree.SubElement(body, 'br')
    >>> print(etree.tostring(html))
    b'<html><body>Text<br/></body></html>'

tail appends text only after that tag:

    >>> br.tail = 'Tail'
    >>> print(etree.tostring(br))
    b'<br/>Tail'

    >>> print(etree.tostring(html))
    b'<html><body>Text<br/>Tail</body></html>'

Passing the method argument to tostring filters out the tags and outputs all of the text:

    >>> print(etree.tostring(html, method='text'))
    b'TextTail'

2. XPath

Option 1: filter out the tags and return the text

    >>> print(html.xpath('string()'))
    TextTail

Option 2: return a list, split at the tags

    >>> print(html.xpath('//text()'))
    ['Text', 'Tail']

Each item in the list returned by option 2 carries information about the node it belongs to and about the kind of text it is:

    >>> texts = html.xpath('//text()')
    >>> print(texts[0])
    Text

Owning node

    >>> parent = texts[0].getparent()
    >>> print(parent.tag)
    body

    >>> print(texts[1], texts[1].getparent().tag)
    Tail br

Kind of text: ordinary text or tail text

    >>> print(texts[0].is_text)
    True
    >>> print(texts[1].is_text)
    False
    >>> print(texts[1].is_tail)
    True

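Parsing and output
This answers question 1.

This part describes how to parse XML into Element objects and how to output Element objects as XML.

1. Parsing

The three commonly used parsing functions are fromstring, XML and HTML. All of them take a string as argument.

    >>> xml_data = '<root>data</root>'

The fromstring function

    >>> root1 = etree.fromstring(xml_data)
    >>> print(root1.tag)
    root
    >>> print(etree.tostring(root1))
    b'<root>data</root>'

The XML function, essentially the same as fromstring

    >>> root2 = etree.XML(xml_data)
    >>> print(root2.tag)
    root
    >>> print(etree.tostring(root2))
    b'<root>data</root>'

The HTML function, which adds the <html> and <body> tags if they are missing

    >>> root3 = etree.HTML(xml_data)
    >>> print(root3.tag)
    html
    >>> print(etree.tostring(root3))
    b'<html><body><root>data</root></body></html>'

2. Output

Output is done with the tostring function used throughout. Two more arguments are worth mentioning: xml_declaration adds the XML declaration, and encoding sets the encoding.

    >>> root = etree.XML('<root><a><b/></a></root>')
    >>> print(etree.tostring(root))
    b'<root><a><b/></a></root>'

XML declaration

    >>> print(etree.tostring(root, xml_declaration=True))
    b"<?xml version='1.0' encoding='ASCII'?>\n<root><a><b/></a></root>"

Setting the encoding

    >>> print(etree.tostring(root, encoding='iso-8859-1'))
    b"<?xml version='1.0' encoding='iso-8859-1'?>\n<root><a><b/></a></root>"

The search examples below use the same document as the findall example earlier:

    >>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

Finding the first b tag

    >>> print(root.find('b'))
    None
    >>> print(root.find('a').tag)
    a

Finding all b tags; the result is a list of Element objects

    >>> [b.tag for b in root.findall('.//b')]
    ['b', 'b']

Querying by attribute

    >>> print(root.findall('.//a[@x]')[0].tag)
    a
    >>> print(root.findall('.//a[@y]'))
    []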

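XPath syntax

XPath is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

XPath has seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes. The root of the tree is called the document node or root node.

In XPath the root node is written as "/". A path that starts with "//" selects matching nodes anywhere in the document.

Common XPath rules

Expression	Description
nodename	Selects the child elements of the current node that have this name
/	Selects direct children of the current node; at the start of a path, selects from the root node
//	Selects descendants of the current node at any depth; at the start of a path, selects from anywhere in the document
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes
*	Wildcard; matches any element node regardless of its name
@*	Selects all attributes
[@attrib]	Selects all elements that have the given attribute
[@attrib='value']	Selects all elements whose given attribute has the given value
[tag]	Selects all elements that have a direct child element named tag
[tag='text']	Selects all elements that have a child element named tag whose text content is text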

Parsing nodes from a text string (etree repairs broken HTML nodes)

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    # @File : test.py

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0"><a href="link5.html">a attribute</a>
        </ul>
    </div>
    '''

    html = etree.HTML(text)  # build an object that can be queried with XPath
    result = etree.tostring(html, encoding='utf-8')  # serialize the parsed tree
    print(type(html))
    print(type(result))
    print(result.decode('utf-8'))

    # Result (whitespace between tags omitted here); note the added
    # <html>, <body> and the closing </li> that was missing in the input:
    # <class 'lxml.etree._Element'>
    # <class 'bytes'>
    # <html><body><div><ul><li class="item-0"><a href="link1.html">first item</a></li><li class="item-1"><a href="link2.html">second item</a></li><li class="item-0"><a href="link5.html">a attribute</a></li></ul></div></body></html>

Parsing an HTML file

    from lxml import etree

    # HTMLParser repairs the HTML read from the file, for example by adding missing tags
    html = etree.parse('test.html', etree.HTMLParser())
    result = etree.tostring(html)  # serialize to bytes
    # result = etree.tostringlist(html)  # serialize to a list
    print(type(html))
    print(type(result))
    print(result)

Relationships between nodes

(1) Parent

Each element and attribute has one parent. In the following example, the book element is the parent of the title, author, year and price elements:

    <book>
        <title>Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>

(2) Children

Element nodes may have zero, one or more children. In the following example, the title, author, year and price elements are all children of the book element:

    <book>
        <title>Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>

(3) Siblings

Nodes that have the same parent. In the following example, the title, author, year and price elements are all siblings:

    <book>
        <title>Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>

(4) Ancestors

A node's parent, the parent's parent, and so on. In the following example, the ancestors of the title element are the book element and the bookstore element:

    <bookstore>
        <book>
            <title>Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
        </book>
    </bookstore>

(5) Descendants

A node's children, the children's children, and so on. In the following example, the descendants of bookstore are the book, title, author, year and price elements:

    <bookstore>
        <book>
            <title>Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
        </book>
    </bookstore>
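The five relationships above can be explored from Python with lxml's element methods. This is a minimal sketch that assumes lxml is installed; the helper name describe_relations is made up for the example, and the sample document is the bookstore XML from the text.

```python
from lxml import etree

# The bookstore sample from the text, written on one logical line to avoid whitespace nodes.
BOOKSTORE = (
    "<bookstore><book>"
    "<title>Harry Potter</title><author>J K. Rowling</author>"
    "<year>2005</year><price>29.99</price>"
    "</book></bookstore>"
)


def describe_relations(xml_text):
    """Collect the relatives of the <title> element as lists of tag names."""
    root = etree.fromstring(xml_text)
    title = root.find("book/title")
    return {
        "parent": title.getparent().tag,
        "children_of_parent": [child.tag for child in title.getparent()],
        "following_siblings": [el.tag for el in title.itersiblings()],
        "ancestors": [el.tag for el in title.iterancestors()],
        "descendants_of_root": [el.tag for el in root.iterdescendants()],
    }


if __name__ == "__main__":
    for relation, tags in describe_relations(BOOKSTORE).items():
        print(relation, tags)
```

itersiblings() yields only the siblings that follow the element by default, and iterancestors() starts with the nearest ancestor.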

Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, that is, a sequence of steps.

The most useful path expressions are the ones listed in the table of common XPath rules above.

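Examples

The table below lists some path expressions and what they select:

Path expression	Result
bookstore	Selects all bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they are.
//@lang	Selects all attributes named lang.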
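
To see how relative and absolute paths differ in practice, here is a small sketch, assuming lxml and the two-book bookstore document used later in this article. The context node is the bookstore root element itself, so the relative path `bookstore` finds nothing.

```python
from lxml import etree

xml = b"""<bookstore>
<book><title lang="eng">Harry Potter</title><price>29.99</price></book>
<book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)  # context node: the <bookstore> element

relative_self = root.xpath('bookstore')        # no <bookstore> child of <bookstore>
absolute_root = root.xpath('/bookstore')       # the root element itself
child_books = root.xpath('book')               # children of the context node
all_books = root.xpath('//book')               # anywhere in the document
nested_prices = root.xpath('/bookstore//price')
langs = root.xpath('//@lang')                  # attribute values come back as strings

print(len(relative_self), len(absolute_root))  # 0 1
print(len(child_books), len(all_books))        # 2 2
print(len(nested_prices))                      # 2
print(langs)                                   # ['eng', 'eng']
```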

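Predicates

A predicate is used to find a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

Examples

The table below lists some path expressions with predicates and what they select:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.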

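Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

Examples

The table below lists some path expressions and what they select:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.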

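Selecting several paths

Use the "|" operator in a path expression to select several paths at once.

Examples

The table below lists some path expressions and what they select:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document.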

XPath operators

The operators that can be used in XPath expressions are listed below (source: http://www.w3school.com.cn/xpath/xpath_operators.asp):

Operator	Description	Example	Result
|	Union of two node-sets	//book | //cd	A node-set containing all book and all cd elements
+	Addition	6 + 4	10
-	Subtraction	6 - 4	2
*	Multiplication	6 * 4	24
div	Division	8 div 4	2
=	Equal	price=9.80	true if price is 9.80; false if price is 9.90.
!=	Not equal	price!=9.80	true if price is 9.90; false if price is 9.80.
<	Less than	price<9.80	true if price is 9.00; false if price is 9.90.
<=	Less than or equal	price<=9.80	true if price is 9.00; false if price is 9.90.
>	Greater than	price>9.80	true if price is 9.90; false if price is 9.80.
>=	Greater than or equal	price>=9.80	true if price is 9.90; false if price is 9.70.
or	Logical or	price=9.80 or price=9.70	true if price is 9.80; false if price is 9.50.
and	Logical and	price>9.00 and price<9.90	true if price is 9.80; false if price is 8.50.
mod	Remainder of a division	5 mod 2	1
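
The operators can be tried directly from Python. This is a rough sketch, assuming lxml; the small price document is made up for illustration. Note that lxml returns XPath numbers as Python floats and XPath booleans as Python bools.

```python
from lxml import etree

# Hypothetical sample data, only for exercising the operators.
root = etree.fromstring(
    b"<bookstore>"
    b"<book><price>9.80</price></book>"
    b"<book><price>9.90</price></book>"
    b"<cd><price>8.50</price></cd>"
    b"</bookstore>"
)

total = root.xpath('6 + 4')                        # arithmetic yields a float
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')
union_size = root.xpath('count(//book | //cd)')    # size of the union node-set
exact = root.xpath('//book[price=9.80]')           # comparison inside a predicate
in_range = root.xpath('//*[price>9.00 and price<9.90]/price/text()')
any_match = root.xpath('//book/price = 9.80')      # true if any node matches

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(union_size)                  # 3.0
print(len(exact), in_range)        # 1 ['9.80']
print(any_match)                   # True
```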

Advanced examples of XPath functions:

1. contains() and and
    //div[starts-with(@id,'res')]//table[1]//tr//td[2]//a//span[contains(.,'_Test') and contains(.,'KPI')]
    //div[contains(@id,'in')] selects the div nodes whose id contains 'in'.
2. text()
    The text of a node is not an attribute, for example <a class="baidu" href="http://www.baidu.com">baidu</a>,
    so use text() to match on it: //a[text()='baidu']
    //span[@id='idHeaderTitleCell' and contains(text(),'QuickStart')]
3. last()
    Covered above.
4. starts-with()
    //div[starts-with(@id,'in')] selects the div nodes whose id attribute starts with 'in'.
    //div[starts-with(@id,'res')]//table//tr//td[2]//table//tr//td//a//span[contains(.,'Developer Tutorial')]
5. not() negates a condition. It is usually combined with functions that return true or false,
    such as contains() and starts-with(). One special case is worth noting:
    to match input nodes that have an id attribute, write //input[@id];
    to match input nodes that do not have an id attribute, write //input[not(@id)].
    //input[@name='identity' and not(contains(@class,'a'))] matches the input nodes whose name is identity and whose class value does not contain a.
6. descendant
    //div[starts-with(@id,'res')]//table[1]//tr//td[2]//a//span[contains(.,'QuickStart')]/../../../descendant::img
7. ancestor
    //div[starts-with(@id,'res')]//table[1]//tr//td[2]//a//span[contains(.,'QuickStart')]/ancestor::div[starts-with(@id,'res')]//table[2]//descendant::a[2]
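
The functions in the list above can be checked against a small page. The sketch below assumes lxml; the HTML fragment is invented for illustration and only loosely mirrors the expressions above.

```python
from lxml import etree

# Hypothetical fragment: ids, names and classes chosen to exercise each function.
html = etree.HTML('''
<div id="result-1">
  <input id="user" name="identity" class="b">
  <input name="identity" class="a">
  <a class="baidu" href="http://www.baidu.com">baidu</a>
</div>
<div id="index"><span>Report_Test KPI</span></div>
''')

starts = html.xpath('//div[starts-with(@id, "res")]/@id')
contains_in = html.xpath('//div[contains(@id, "in")]/@id')
by_text = html.xpath('//a[text()="baidu"]/@href')
no_id = html.xpath('//input[not(@id)]/@class')
combined = html.xpath('//input[@name="identity" and not(contains(@class, "a"))]/@id')
both_words = html.xpath('//span[contains(., "_Test") and contains(., "KPI")]/text()')

print(starts)       # ['result-1']
print(contains_in)  # ['index']
print(by_text)      # ['http://www.baidu.com']
print(no_id)        # ['a']
print(combined)     # ['user']
print(both_words)   # ['Report_Test KPI']
```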

Sample code:

# -*- coding: utf-8 -*-
# @Author :
# @File : douban_api.py
# @Software: PyCharm
# @description : XXX

import re
import json
import datetime
import requests
from lxml import etree
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


class DBSpider(object):

    def __init__(self):
        # Must be a plain dict: a trailing comma after the closing brace would make it a tuple
        self.custom_headers = {
            'Host': 'movie.douban.com',
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,'
                      'image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        }
        # self.proxies = {
        #     'http': '127.0.0.1:8080',
        #     'https': '127.0.0.1:8080'
        # }
        self.s = requests.session()
        self.s.verify = False
        self.s.headers = self.custom_headers
        # self.s.proxies = self.proxies

    def __del__(self):
        pass

    @staticmethod
    def _info_field(response, label):
        # Text of the info row whose <span> label contains `label`; '' if the row is missing
        parts = response.xpath(f'//div[@class="info"]//span[contains(text(), "{label}")]/../text()')
        return ''.join(parts).replace('\n', '').replace(':', '').strip()

    def api_artists_info(self, artists_id=None):
        ret_val = None
        if artists_id:
            url = f'https://movie.douban.com/celebrity/{artists_id}/'
            try:
                r = self.s.get(url)
                if 200 == r.status_code:
                    response = etree.HTML(r.text)
                    artists_name = response.xpath('//h1/text()')
                    artists_name = artists_name[0] if len(artists_name) else ''
                    chinese_name = re.findall(r'([\u4e00-\u9fa5·]+)', artists_name)
                    chinese_name = chinese_name[0] if len(chinese_name) else ''
                    english_name = artists_name.replace(chinese_name, '').strip()
                    pic = response.xpath('//div[@id="headline"]//div[@class="pic"]//img/@src')
                    pic = pic[0] if len(pic) else ''
                    # The labels must match the text on the page (gender, zodiac sign,
                    # date, birthplace, occupation)
                    sex = self._info_field(response, '性别')
                    constellation = self._info_field(response, '星座')
                    birthday = self._info_field(response, '日期')
                    place = self._info_field(response, '出生地')
                    occupation = self._info_field(response, '职业')
                    desc = ''.join(response.xpath('//span[@class="all hidden"]/text()'))
                    artists_info = dict(
                        artistsId=artists_id,
                        homePage=f'https://movie.douban.com/celebrity/{artists_id}',
                        sex=sex,
                        constellation=constellation,
                        chineseName=chinese_name,
                        foreignName=english_name,
                        posterAddre=pic,
                        # posterAddreOSS=Images.imgages_data(pic, 'movie/douban'),
                        birthDate=birthday,
                        birthAddre=place,
                        desc=desc,
                        occupation=occupation,
                        showCount='',
                        fetchTime=str(datetime.datetime.now()),
                    )
                    # print(json.dumps(artists_info, ensure_ascii=False, indent=4))
                    ret_val = artists_info
                else:
                    print(f'status code : {r.status_code}')
            except BaseException as e:
                print(e)
        return ret_val

    def test(self):
        pass


if __name__ == '__main__':
    douban = DBSpider()
    # temp_uid = '1044707'
    # temp_uid = '1386515'
    # temp_uid = '1052358'
    temp_uid = '1052357'
    user_info = douban.api_artists_info(temp_uid)
    print(json.dumps(user_info, ensure_ascii=False, indent=4))

Advanced XPath usage

Scrapy in practice 2, extracting values with the built-in xpath, re and css: https://www.cnblogs.com/regit/p/9629263.html

span tags whose class attribute contains the string selectable: //span[contains(@class, 'selectable')]

Counting seats on Maoyan
        //div[@class='seats-wrapper']/div/span[contains(@class,'seat') and not(contains(@class,'empty'))]
    The original notes give the following as an equivalent form. It is not strictly equivalent: the inner //span[...] is evaluated against the whole document and contains() only looks at the first node of that set, so the first form is the safer one.
        //div[@class='seats-wrapper']/div//span[not(contains(//span[contains(@class, 'seat')]/@class, 'empty'))]

More expressions from real projects:

./@data-val
//div[contains(@class, "show-list") and @data-index="{0}"]     ({0} is a Python format placeholder filled in before the query runs)
.//div[@class="show-date"]//span[contains(@class, "date-item")]/text()
.//div[contains(@class, "plist-container")][1]//tbody//tr     (XPath indexes start at 1)
substring-before(substring-after(//script[contains(text(), '/apps/feedlist')]/text(), 'html":"'), '"})')
//div[text()="hello"]/p/text()
//a[@class="movie-name"][1]/text()
string(//a[@class="movie-name"][1])
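
The last two expressions above look similar but return different things: text() yields only the direct text nodes as a list, while string() concatenates all descendant text into one string. A minimal sketch, assuming lxml and an invented fragment:

```python
from lxml import etree

# Hypothetical fragment: the first link has nested markup, the second does not.
html = etree.HTML(
    '<div><a class="movie-name">First <b>Movie</b></a>'
    '<a class="movie-name">Second</a></div>'
)

direct_text = html.xpath('//a[@class="movie-name"][1]/text()')
full_text = html.xpath('string(//a[@class="movie-name"][1])')
# Parentheses index into the whole result set rather than per parent.
second = html.xpath('(//a[@class="movie-name"])[2]/text()')

print(direct_text)  # ['First ']
print(full_text)    # First Movie
print(second)       # ['Second']
```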

1. Getting an attribute of the parent node
    First select the a node whose href attribute is link4.html, then get its parent, then get the parent's class attribute:
    result1 = response.xpath('//a[@href="link4.html"]/../@class')
    The parent can also be reached with parent:: :
    result2 = response.xpath('//a[@href="link4.html"]/parent::*/@class')
    Note: //a matches every a node in the HTML, each with its own href; the [] is an attribute test that keeps only the a whose href is link4.html.
2. Getting the text inside a node
    Get the text of the a nodes inside li nodes whose class is item-0:
    result3 = response.xpath('//li[@class="item-0"]/a/text()')
    Result: ['first item', 'fifth item']
3. Getting attributes
    Get the href attribute of every a node directly under an li node:
    result4 = response.xpath('//li/a/@href')
    Result: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
4. Selecting by position
    result = response.xpath('//li[1]/a/text()')    # the first li node
    result = response.xpath('//li[last()]/a/text()')    # the last li node
    result = response.xpath('//li[position()<3]/a/text()')    # li nodes at positions below 3, i.e. 1 and 2
    result = response.xpath('//li[last()-2]/a/text()')    # the third li node from the end
5. Selecting along an axis
    1) All ancestors of the first li node: html, body, div and ul
                result = response.xpath('//li[1]/ancestor::*')
    2) The <div> ancestor of the first li node
                result = response.xpath('//li[1]/ancestor::div')
    3) All attribute values of the first li node
                result = response.xpath('//li[1]/attribute::*')
    4) Among the direct children of the first li node, the a node whose href is link1.html
                result = response.xpath('//li[1]/child::a[@href="link1.html"]')
    5) Among all descendants of the first li node, only the span nodes
                result = response.xpath('//li[1]/descendant::span')
    6) The following axis covers every node after the current one. The * matches them all, but the index keeps only the second, which is the <a> node inside the second <li> node
                result = response.xpath('//li[1]/following::*[2]')
    7) following-sibling covers the siblings after the current node, here all later <li> nodes
                result = response.xpath('//li[1]/following-sibling::*')
6. Matching an attribute with several values
                <li class="li li-first"><a href="link.html">first item</a></li>
                result5 = response.xpath('//li[@class="li"]/a/text()')
    This returns nothing, because the class attribute of the li node holds two values, li and li-first, so an exact attribute match fails. Use contains() instead:
                result5 = response.xpath('//li[contains(@class, "li")]/a/text()')
    The first argument of contains() is the attribute and the second is the value to look for; the node matches as long as the attribute value contains that string.
7. Matching several attributes, shown here with plain lxml rather than a framework
    Sometimes one attribute is not enough to pin down a node and several must match at once. Join the conditions with and:
    from lxml import etree
    text = '''
    <li class = "li li-first" name="item"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    result6 = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
    print(result6)
    The li node has both a class and a name attribute; the two conditions are joined with and inside the square brackets.
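
One caveat with contains(@class, "li") is that it is a substring test, so it also matches classes such as list. A common workaround, sketched below under the assumption that lxml is installed, pads the attribute with spaces and matches a whole token. The second li element is invented to show the difference.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="li li-first" name="item"><a href="link.html">first item</a></li>'
    '<li class="list" name="item"><a href="other.html">other item</a></li>'
    '</ul>'
)

exact = html.xpath('//li[@class="li"]/a/text()')            # exact match fails
loose = html.xpath('//li[contains(@class, "li")]/a/text()')  # substring match
# Whole-token match: " li " must appear in " <class value> ".
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)
both = html.xpath('//li[contains(@class, "li-first") and @name="item"]/a/text()')

print(exact)  # []
print(loose)  # ['first item', 'other item']
print(token)  # ['first item']
print(both)   # ['first item']
```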

XPath study notes

1. Locating by the node's own attributes or text
    //td[text()='Data Import']
    //div[contains(@class,'cux-rightArrowIcon-on')]
    //a[text()='馬上注冊']
    //input[@type='radio' and @value='1']     several conditions
    //span[@name='bruce'][text()='bruce1'][1]     several conditions
    //span[@id='bruce1' or text()='bruce2']     either condition, may match several nodes
    //span[text()='bruce1' and text()='bruce2']     both conditions must hold
2. Locating through the parent node
    //div[@class='x-grid-col-name x-grid-cell-inner']/div
    //div[@id='dynamicGridTestInstanceformclearuxformdiv']/div
    //div[@id='test']/input
3. Locating through a child node
    //div[div[@id='navigation']]
    //div[div[@name='listType']]
    //div[p[@name='testname']]
4. Mixed
    //div[div[@name='listType']]//img
    //td[a//font[contains(text(),'seleleium2從零開始視屏')]]//input[@type='checkbox']
5. More advanced
    //input[@id='123']/following-sibling::input     later sibling nodes
    //input[@id='123']/preceding-sibling::span     earlier sibling nodes
    //input[starts-with(@id,'123')]     id starts with the given string
    //span[not(contains(text(),'xpath'))]     span whose text does not contain xpath
6. Indexes
    //div/input[2]
    //div[@id='position']/span[3]
    //div[@id='position']/span[position()=3]
    //div[@id='position']/span[position()>3]
    //div[@id='position']/span[position()<3]
    //div[@id='position']/span[last()]
    //div[@id='position']/span[last()-1]
7. substring tests, using <div data-for="result" id="swfEveryCookieWrap"></div>
    //*[substring(@id,4,5)='Every']/@id     5 characters of the attribute starting at position 4
    //*[substring(@id,4)='EveryCookieWrap']     from position 4 to the end of the attribute
    //*[substring-before(@id,'C')='swfEvery']/@id     the part of the attribute before 'C'
    //*[substring-after(@id,'C')='ookieWrap']/@id     the part of the attribute after 'C'
8. The * wildcard
    //span[@*='bruce']
    //*[@name='bruce']
9. Axes
    //div[span[text()='+++current node']]/parent::div     parent node
    //div[span[text()='+++current node']]/ancestor::div     ancestor nodes
10. Grandchild nodes
    //div[span[text()='current note']]/descendant::div/span[text()='123']
    //div[span[text()='current note']]//div/span[text()='123']     both expressions mean the same
11. following and preceding (example page: https://www.baidu.com/s?wd=xpath)
    //span[@class="fk fk_cur"]/../following::a     all a nodes after it
    //span[@class="fk fk_cur"]/../preceding::a[1]     the nearest a node before it

Extracting text spread across several tags

When writing a crawler, XPath is the usual way to extract data. For markup like this:

<div id="test1">Hello, everyone!</div>

extraction is straightforward. Assuming the page source is in selector (a Scrapy selector):

data = selector.xpath('//div[@id="test1"]/text()').extract()[0]

puts "Hello, everyone!" into data. But what about this markup?

<div id="test2">Hey there, <font color=red>what is your WeChat ID?</font></div>

With

data = selector.xpath('//div[@id="test2"]/text()').extract()[0]

only "Hey there, " is extracted, and with

data = selector.xpath('//div[@id="test2"]/font/text()').extract()[0]

only "what is your WeChat ID?" is extracted, whereas the goal is the whole sentence "Hey there, what is your WeChat ID?".

<div id="test3">Green dragon on my left, <span id="tiger">white tiger on my right, <ul>vermilion bird above, <li>black tortoise below.</li></ul>an old ox in the middle, </span>a dragon head on my chest.</div>

The inner tags are not even fixed. Given a hundred fragments like these, what is the quickest XPath expression that extracts the full text? Use string(.). Taking the third fragment as the example:

data = selector.xpath('//div[@id="test3"]')
info = data.xpath('string(.)').extract()[0]

This extracts the whole sentence, from "Green dragon on my left" to "a dragon head on my chest.", and assigns it to info.
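
The snippets above use Scrapy's selector API (extract()). The same idea works with plain lxml, where xpath('string(.)') returns a str directly. A minimal sketch, assuming lxml; the fragment is an English stand-in for the test2 example.

```python
from lxml import etree

html = etree.HTML(
    '<div id="test2">Hello, <font color="red">what is <b>your</b> name?</font></div>'
)
div = html.xpath('//div[@id="test2"]')[0]

direct = div.xpath('text()')           # only the div's own text nodes
whole = div.xpath('string(.)')         # all descendant text, concatenated
via_iter = ''.join(div.itertext())     # same result through the ElementTree API

print(direct)    # ['Hello, ']
print(whole)     # Hello, what is your name?
print(via_iter)  # Hello, what is your name?
```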

Sample XML document

<?xml version="1.0" encoding="utf8"?>
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

Selecting nodes

The basic path expressions are listed below. Keep in mind that an XPath expression is always evaluated relative to some node; the initial current node is usually the root, much like changing directories on Linux.

Expression	Description
nodename	Selects the child elements named nodename under the matched node.
/	At the start of a path, selection starts from the root node.
//	Selects among the descendants of the matched node, regardless of where the target is.
.	Selects the current node.
..	Selects the parent element of the current node.
@	Selects attributes.

>>> from lxml import etree
>>> # bytes input: lxml rejects a str that carries an encoding declaration
>>> xml = b"""<?xml version="1.0" encoding="utf8"?>
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

# Get the root node
>>> root = etree.fromstring(xml)
>>> print(root)
<Element bookstore at 0x2c9cc88>

# Select all book child elements
>>> root.xpath('book')
[<Element book at 0x2d88878>, <Element book at 0x2d888c8>]

# Select the root node bookstore
>>> root.xpath('/bookstore')
[<Element bookstore at 0x2c9cc88>]

# Select the title children of all book children
>>> root.xpath('book/title')
[<Element title at 0x2d88878>, <Element title at 0x2d888c8>]

# Starting from the root, select the title elements among its descendants
>>> root.xpath('//title')
[<Element title at 0x2d88878>, <Element title at 0x2d888c8>]

# Starting from the book children, select the price elements among their descendants
>>> root.xpath('book//price')
[<Element price at 0x2ca20a8>, <Element price at 0x2d88738>]

# Starting from the root, select the lang attribute values among its descendants
>>> root.xpath('//@lang')
['eng', 'eng']

Predicates

A predicate finds a specific node or the nodes that satisfy some condition. The predicate expression goes inside square brackets.

# Select the first book child of bookstore
>>> root.xpath('/bookstore/book[1]')
[<Element book at 0x2ca20a8>]

# Select the last book child of bookstore
>>> root.xpath('/bookstore/book[last()]')
[<Element book at 0x2d88878>]

# Select the second-to-last book child of bookstore
>>> root.xpath('/bookstore/book[last()-1]')
[<Element book at 0x2ca20a8>]

# Select the first two book children of bookstore
>>> root.xpath('/bookstore/book[position()<3]')
[<Element book at 0x2ca20a8>, <Element book at 0x2d88878>]

# Starting from the root, select the title descendants that have a lang attribute
>>> root.xpath('//title[@lang]')
[<Element title at 0x2d888c8>, <Element title at 0x2d88738>]

# Starting from the root, select the title descendants whose lang attribute is eng
>>> root.xpath("//title[@lang='eng']")
[<Element title at 0x2d888c8>, <Element title at 0x2d88738>]

# Select the book children of bookstore whose price child is greater than 35
>>> root.xpath("/bookstore/book[price>35.00]")
[<Element book at 0x2ca20a8>]

# Select the title child of each book child of bookstore whose price child is greater than 35
>>> root.xpath("/bookstore/book[price>35.00]/title")
[<Element title at 0x2d888c8>]

Wildcards

Wildcard	Description
*	Matches any element.
@*	Matches any attribute.
node()	Matches a node of any kind.

# Select all child elements of bookstore
>>> root.xpath('/bookstore/*')
[<Element book at 0x2d888c8>, <Element book at 0x2ca20a8>]

# Select every element in the document
>>> root.xpath('//*')
[<Element bookstore at 0x2c9cc88>, <Element book at 0x2d888c8>, <Element title at 0x2d88738>, <Element price at 0x2d88878>, <Element book at 0x2ca20a8>, <Element title at 0x2d88940>, <Element price at 0x2d88a08>]

# Select every title element that has at least one attribute
>>> root.xpath('//title[@*]')
[<Element title at 0x2d88738>, <Element title at 0x2d88940>]

# Select all child nodes of the current node. '\n ' is a text node.
>>> root.xpath('node()')
['\n ', <Element book at 0x2d888c8>, '\n ', <Element book at 0x2d88878>, '\n']

# Select every node in the document: elements and text nodes (attributes are not included)
>>> root.xpath('//node()')
[<Element bookstore at 0x2c9cc88>, '\n ', <Element book at 0x2d888c8>, '\n ', <Element title at 0x2d88738>, 'Harry Potter', '\n ', <Element price at 0x2d88940>, '29.99', '\n ', '\n ', <Element book at 0x2d88878>, '\n ', <Element title at 0x2ca20a8>, 'Learning XML', '\n ', <Element price at 0x2d88a08>, '39.95', '\n ', '\n']

Selecting with "or"

With the "|" operator you can select the union of several paths.

# Select the title or price elements of all book elements
>>> root.xpath('//book/title|//book/price')
[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]

# Select all title or price elements
>>> root.xpath('//title|//price')
[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]

# Select the title children of book, or any price element
>>> root.xpath('/bookstore/book/title|//price')
[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]

Using lxml

We start by parsing some HTML. Here is a small example that shows the basic usage.

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result)

We import etree from lxml, build a tree with etree.HTML, and print the serialized result.

This also shows a very practical feature of lxml: it repairs HTML automatically. Note that the closing tag of the last li was removed on purpose, so the markup is not well-formed. lxml is built on libxml2 and inherits its ability to fix broken HTML.

The output therefore looks like this:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

Besides closing the li tag, lxml also added the body and html tags.

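Reading from a file

Besides parsing a string directly, lxml can read from a file. Create a file named hello.html with this content:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Use the parse method to read the file.

from lxml import etree

html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)

This prints the document in the same way. Note that etree.parse uses the XML parser by default, so the file must be well-formed and no html or body wrapper is added; pass etree.HTMLParser() as the second argument to get the same repair behaviour as etree.HTML.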

Testing XPath on examples

lxml, a parsing library for python3: http://www.cnblogs.com/zhangxinqi/p/9210211.html

We continue with the previous program.

(1) Getting all <li> tags

from lxml import etree

html = etree.parse('hello.html')
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))

Running it shows that etree.parse returns an ElementTree, and that calling xpath returns a list of 5 <li> elements, each of type Element.

To get every node, use //*. The result is again a list in which each item is an Element, and it contains all nodes of the document.

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())
result = html.xpath('//*')  # // selects descendants, * matches every element
print(type(html))
print(type(result))
print(result)

# To get the li nodes, put the node name after //
html.xpath('//li')  # all li nodes among the descendants

(2) Getting child nodes

Use / to find the children of an element and // to find its descendants. To select every a node that is a direct child of an li node:

# //li selects all li nodes; appending /a selects the a nodes that are their direct children
result = html.xpath('//li/a')

(3) Getting the parent node

/ and // go down to children and descendants. To go up to the parent, use .. or the parent:: axis.

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">第一個</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text, etree.HTMLParser())
result = html.xpath('//a[@href="link2.html"]/../@class')
result1 = html.xpath('//a[@href="link2.html"]/parent::*/@class')
print(result)
print(result1)

'''
['item-1']
['item-1']
'''

(4) Matching attributes

While selecting, the @ symbol can be used to filter on attributes. For example, to select the li node whose class is item-0:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">第一個</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text, etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

# Get the class of every <li> tag
result = html.xpath('//li/@class')
print(result)

(5) Getting text

Use the XPath text() node test to get the text inside a node.

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">第一個</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text, etree.HTMLParser())
result = html.xpath('//li[@class="item-1"]/a/text()')  # text directly inside the a node
result1 = html.xpath('//li[@class="item-1"]//text()')  # text of every descendant of the li
print(result)
print(result1)

(6) Getting attributes

Use the @ symbol to get a node's attribute. For example, to get the href attribute of every a node under every li node:

result = html.xpath('//li/a/@href')  # href of the a children
result = html.xpath('//li//@href')  # href of any descendant of li

(7) Matching an attribute with several values

When an attribute holds several values, use the contains() function to match one of them.

from lxml import etree

text1 = '''
<div>
    <ul>
         <li class="aaa item-0"><a href="link1.html">第一個</a></li>
         <li class="bbb item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text1, etree.HTMLParser())
result = html.xpath('//li[@class="aaa"]/a/text()')
result1 = html.xpath('//li[contains(@class,"aaa")]/a/text()')
print(result)
print(result1)

# The exact match finds nothing; contains() finds the node
# []
# ['第一個']

(8) Matching several attributes

Another common case is a node that can only be identified by several attributes together. Match them all at once by joining the conditions with the and operator:

from lxml import etree

text1 = '''
<div>
    <ul>
         <li class="aaa" name="item"><a href="link1.html">第一個</a></li>
         <li class="aaa" name="fore"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text1, etree.HTMLParser())
result = html.xpath('//li[@class="aaa" and @name="fore"]/a/text()')
result1 = html.xpath('//li[contains(@class,"aaa") and @name="fore"]/a/text()')
print(result)
print(result1)

# ['second item']
# ['second item']

(9) Selecting by position

Sometimes a condition matches several nodes but only one of them is wanted, such as the second or the last. Put an index or a position function in square brackets to pick a node by its order.

The functions used for this are last() and position(). XPath provides many more functions for node access, numbers, strings, logic, nodes and sequences; see http://www.w3school.com.cn/xpath/xpath_functions.asp for details.
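
The positional predicates can be sketched as follows, assuming lxml; the four-item list is a made-up sample in the style of the other examples.

```python
from lxml import etree

text1 = '''
<div><ul>
<li class="aaa" name="item"><a href="link1.html">first</a></li>
<li class="aaa" name="item"><a href="link2.html">second</a></li>
<li class="aaa" name="item"><a href="link3.html">third</a></li>
<li class="aaa" name="item"><a href="link4.html">fourth</a></li>
</ul></div>
'''
html = etree.HTML(text1)

first = html.xpath('//li[1]/a/text()')                 # XPath positions start at 1
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position()<3]/a/text()')
third_from_end = html.xpath('//li[last()-2]/a/text()')

print(first)           # ['first']
print(last)            # ['fourth']
print(first_two)       # ['first', 'second']
print(third_from_end)  # ['second']
```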

(10) Selecting along an axis

XPath provides many axes for selecting nodes, including children, siblings, parents and ancestors. For example:

from lxml import etree

text1 = '''
<div>
    <ul>
         <li class="aaa" name="item"><a href="link1.html">第一個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第二個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第三個</a></li>
         <li class="aaa" name="item"><a href="link1.html">第四個</a></li>
     </ul>
 </div>
'''

html = etree.HTML(text1, etree.HTMLParser())
result = html.xpath('//li[1]/ancestor::*')  # all ancestors
result1 = html.xpath('//li[1]/ancestor::div')  # the div ancestor
result2 = html.xpath('//li[1]/attribute::*')  # all attribute values
result3 = html.xpath('//li[1]/child::*')  # all direct children
result4 = html.xpath('//li[1]/descendant::a')  # the a nodes among all descendants
result5 = html.xpath('//li[1]/following::*')  # every node after the current one
result6 = html.xpath('//li[1]/following-sibling::*')  # the siblings after the current node

# Values of result to result6, in order:
# [<Element html at 0x3ca6b960c8>, <Element body at 0x3ca6b96088>, <Element div at 0x3ca6b96188>, <Element ul at 0x3ca6b961c8>]
# [<Element div at 0x3ca6b96188>]
# ['aaa', 'item']
# [<Element a at 0x3ca6b96248>]
# [<Element a at 0x3ca6b96248>]
# [<Element li at 0x3ca6b96308>, <Element a at 0x3ca6b96348>, <Element li at 0x3ca6b96388>, <Element a at 0x3ca6b963c8>, <Element li at 0x3ca6b96408>, <Element a at 0x3ca6b96488>]
# [<Element li at 0x3ca6b96308>, <Element li at 0x3ca6b96388>, <Element li at 0x3ca6b96408>]

# Get the <a> tags under <li> whose href is link1.html
result = html.xpath('//li/a[@href="link1.html"]')
print(result)

# Get all <span> tags under <li> (at any depth, hence //)
result = html.xpath('//li//span')

# Get every class attribute below <li>/<a>, not counting the class of <li> itself
result = html.xpath('//li/a//@class')
print(result)

# Get the href of the <a> in the last <li>
result = html.xpath('//li[last()]/a/@href')
print(result)

# Get the text of the second-to-last element
result = html.xpath('//li[last()-1]/a')
print(result[0].text)

# Get the tag name of the element whose class is bold
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)

The examples above use XPath axes. For more on axes see http://www.w3school.com.cn/xpath/xpath_axes.asp

Case study: scraping the top 20 programming languages of the TIOBE index

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Author :
# @File : test_1.py
# @Software : PyCharm
# @description : XXX

import requests
from requests.exceptions import RequestException
from lxml import etree
from lxml.etree import ParseError
import json


def one_to_page(html):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    }
    try:
        response = requests.get(html, headers=headers)
        body = response.text  # page content
        try:
            html = etree.HTML(body, etree.HTMLParser())  # parse the HTML text
            # text of every cell in the top-20 table
            result = html.xpath('//table[contains(@class,"table-top20")]/tbody/tr//text()')
            for i in range(20):
                yield result[i * 5:(i + 1) * 5]  # one row of five cells at a time
        except ParseError as e:
            print(e.position)
    except RequestException as e:
        print('request is error!', e)


def write_file(data):  # turn each row into a dict, append it to a file and print it
    for i in data:
        sul = {
            '2018年6月排行': i[0],
            '2017年6排行': i[1],
            '開發語言': i[2],
            '評級': i[3],
            '變化率': i[4]
        }
        with open('test.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(sul, ensure_ascii=False) + '\n')  # serialize before writing
        print(sul)


def main():
    url = 'https://www.tiobe.com/tiobe-index/'
    data = one_to_page(url)
    write_file(data)


if __name__ == '__main__':
    main()

'''
Output (the keys are: June 2018 rank, June 2017 rank, language, rating, change):
{'2018年6月排行': '1', '2017年6排行': '1', '開發語言': 'Java', '評級': '15.932%', '變化率': '+2.66%'}
{'2018年6月排行': '2', '2017年6排行': '2', '開發語言': 'C', '評級': '14.282%', '變化率': '+4.12%'}
{'2018年6月排行': '3', '2017年6排行': '4', '開發語言': 'Python', '評級': '8.376%', '變化率': '+4.60%'}
{'2018年6月排行': '4', '2017年6排行': '3', '開發語言': 'C++', '評級': '7.562%', '變化率': '+2.84%'}
{'2018年6月排行': '5', '2017年6排行': '7', '開發語言': 'Visual Basic .NET', '評級': '7.127%', '變化率': '+4.66%'}
{'2018年6月排行': '6', '2017年6排行': '5', '開發語言': 'C#', '評級': '3.455%', '變化率': '+0.63%'}
{'2018年6月排行': '7', '2017年6排行': '6', '開發語言': 'JavaScript', '評級': '3.063%', '變化率': '+0.59%'}
{'2018年6月排行': '8', '2017年6排行': '9', '開發語言': 'PHP', '評級': '2.442%', '變化率': '+0.85%'}
{'2018年6月排行': '9', '2017年6排行': '-', '開發語言': 'SQL', '評級': '2.184%', '變化率': '+2.18%'}
{'2018年6月排行': '10', '2017年6排行': '12', '開發語言': 'Objective-C', '評級': '1.477%', '變化率': '-0.02%'}
{'2018年6月排行': '11', '2017年6排行': '16', '開發語言': 'Delphi/Object Pascal', '評級': '1.396%', '變化率': '+0.00%'}
{'2018年6月排行': '12', '2017年6排行': '13', '開發語言': 'Assembly language', '評級': '1.371%', '變化率': '-0.10%'}
{'2018年6月排行': '13', '2017年6排行': '10', '開發語言': 'MATLAB', '評級': '1.283%', '變化率': '-0.29%'}
{'2018年6月排行': '14', '2017年6排行': '11', '開發語言': 'Swift', '評級': '1.220%', '變化率': '-0.35%'}
{'2018年6月排行': '15', '2017年6排行': '17', '開發語言': 'Go', '評級': '1.189%', '變化率': '-0.20%'}
{'2018年6月排行': '16', '2017年6排行': '8', '開發語言': 'R', '評級': '1.111%', '變化率': '-0.80%'}
{'2018年6月排行': '17', '2017年6排行': '15', '開發語言': 'Ruby', '評級': '1.109%', '變化率': '-0.32%'}
{'2018年6月排行': '18', '2017年6排行': '14', '開發語言': 'Perl', '評級': '1.013%', '變化率': '-0.42%'}
{'2018年6月排行': '19', '2017年6排行': '20', '開發語言': 'Visual Basic', '評級': '0.979%', '變化率': '-0.37%'}
{'2018年6月排行': '20', '2017年6排行': '19', '開發語言': 'PL/SQL', '評級': '0.844%', '變化率': '-0.52%'}
'''

Case study: parsing gushiwen.org and printing the URLs of the Shijing (Classic of Poetry) pieces

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Author :
# @File : shijing.py
# @Software : PyCharm
# @description : XXX

import json
import traceback
import requests
from lxml import etree

"""
step1: install the lxml library.
step2: from lxml import etree
step3: selector = etree.HTML(page source)
step4: selector.xpath(an XPath expression)
"""


def parse():
    url = 'https://www.gushiwen.org/guwen/shijing.aspx'
    r = requests.get(url)
    if r.status_code == 200:
        selector = etree.HTML(r.text)
        s_all_type_content = selector.xpath('//div[@class="sons"]/div[@class="typecont"]')
        print(len(s_all_type_content))

        article_list = list()
        for s_type_content in s_all_type_content:
            # section title, then every link inside that section
            book_m1 = s_type_content.xpath('.//strong/text()')[0]
            s_all_links = s_type_content.xpath('.//span/a')
            article_dict = dict()
            for s_link in s_all_links:
                link_name = s_link.xpath('./text()')[0]
                try:
                    link_href = s_link.xpath('./@href')[0]
                except IndexError:
                    link_href = None  # the link has no href attribute
                article_dict[link_name] = link_href
            temp = dict()
            temp[book_m1] = article_dict
            article_list.append(temp)
        print(json.dumps(article_list, ensure_ascii=False, indent=4))
    else:
        print(r.status_code)


if __name__ == '__main__':
    parse()

CSS 選擇器——cssSelector 定位方式詳解

CSS 選擇器參考手冊：http://www.w3school.com.cn/cssref/css_selectors.asp
CSS 選擇器：http://www.runoob.com/cssref/css-selectors.html

Selenium之CSS Selector定位詳解：https://www.bbsmax.com/A/MyJxLGE1Jn/

css selector

CSS選擇器用于選擇你想要的元素的樣式的模式。

"CSS"列表示在CSS版本的屬性定義（CSS1，CSS2，或?qū)SS3）。

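Selector	Example	Example description	CSS

.class	.intro	Selects all elements with class="intro"	1
#id	#firstname	Selects the element with id="firstname"	1
*	*	Selects all elements	2
element	p	Selects all <p> elements	1
element,element	div,p	Selects all <div> elements and all <p> elements	1
element element	div p	Selects all <p> elements inside <div> elements	1
element>element	div>p	Selects all <p> elements whose parent is a <div> element	2
element+element	div+p	Selects every <p> element that immediately follows a <div> element	2
[attribute]	[target]	Selects all elements that have a target attribute	2
[attribute=value]	[target=_blank]	Selects all elements with target="_blank"	2
[attribute~=value]	[title~=flower]	Selects all elements whose title attribute contains the word "flower"	2
[attribute|=language]	[lang|=en]	Selects all elements whose lang attribute is "en" or starts with "en-"	2
:link	a:link	Selects all unvisited links	1
:visited	a:visited	Selects all visited links	1
:active	a:active	Selects the active link	1
:hover	a:hover	Selects the link the mouse is hovering over	1
:focus	input:focus	Selects the input element that has focus	2
:first-letter	p:first-letter	Selects the first letter of every <p> element	1
:first-line	p:first-line	Selects the first line of every <p> element	1
:first-child	p:first-child	Selects every <p> element that is the first child of its parent	2
:before	p:before	Inserts content before the content of every <p> element	2
:after	p:after	Inserts content after the content of every <p> element	2
:lang(language)	p:lang(it)	Selects every <p> element whose lang attribute value starts with "it"	2
element1~element2	p~ul	Selects every <ul> element that comes after a <p> element with the same parent	3
[attribute^=value]	a[src^="https"]	Selects every <a> element whose src attribute value starts with "https"	3
[attribute$=value]	a[src$=".pdf"]	Selects every <a> element whose src attribute value ends with ".pdf"	3
[attribute*=value]	a[src*="runoob"]	Selects every <a> element whose src attribute value contains the substring "runoob"	3
:first-of-type	p:first-of-type	Selects every <p> element that is the first <p> of its parent	3
:last-of-type	p:last-of-type	Selects every <p> element that is the last <p> of its parent	3
:only-of-type	p:only-of-type	Selects every <p> element that is the only <p> of its parent	3
:only-child	p:only-child	Selects every <p> element that is the only child of its parent	3
:nth-child(n)	p:nth-child(2)	Selects every <p> element that is the second child of its parent	3
:nth-last-child(n)	p:nth-last-child(2)	Selects every <p> element that is the second child of its parent, counting from the last child	3
:nth-of-type(n)	p:nth-of-type(2)	Selects every <p> element that is the second <p> of its parent	3
:nth-last-of-type(n)	p:nth-last-of-type(2)	Selects every <p> element that is the second <p> of its parent, counting from the last child	3
:last-child	p:last-child	Selects every <p> element that is the last child of its parent	3
:root	:root	Selects the root element of the document	3
:empty	p:empty	Selects every <p> element that has no children (text nodes count as children)	3
:target	#news:target	Selects the currently active #news element (the clicked URL contains that anchor name)	3
:enabled	input:enabled	Selects every enabled input element	3
:disabled	input:disabled	Selects every disabled input element	3
:checked	input:checked	Selects every checked input element	3
:not(selector)	:not(p)	Selects every element that is not a <p> element	3
::selection	::selection	Matches the part of an element that the user has selected or highlighted	3
:out-of-range	:out-of-range	Matches input elements whose value is outside the specified range	3
:in-range	:in-range	Matches input elements whose value is inside the specified range	3
:read-write	:read-write	Matches elements that are both readable and writable	3
:read-only	:read-only	Matches elements that have the "readonly" attribute set	3
:optional	:optional	Matches optional input elements	3
:required	:required	Matches elements that have the "required" attribute set	3
:valid	:valid	Matches elements whose input value is valid	3
:invalid	:invalid	Matches elements whose input value is invalid	3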
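The rows for :nth-child(n) and :nth-of-type(n) in the table above are easy to confuse. The sketch below restates both rules in plain Python over a single parent element, using the standard library's xml.etree.ElementTree. The PARENT markup and the helper names nth_child and nth_of_type are made up for illustration; a real selector engine applies the same rule under every parent in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an <h1>, a <p>, a <span>, then another <p>.
PARENT = ET.fromstring("<div><h1>t</h1><p>a</p><span>s</span><p>b</p></div>")


def nth_child(parent, tag, n):
    """tag:nth-child(n): the n-th child (1-based), but only if it has that tag."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """tag:nth-of-type(n): the n-th child (1-based) among children with that tag."""
    same_type = [child for child in parent if child.tag == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


# p:nth-child(2) is the first <p>, because it is the second child overall.
print(nth_child(PARENT, "p", 2).text)
# p:nth-of-type(2) is the second <p>, regardless of other element types.
print(nth_of_type(PARENT, "p", 2).text)
# p:nth-child(3) matches nothing here: the third child is a <span>.
print(nth_child(PARENT, "p", 3))
```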

Common CSS selector syntax:

1. Locate by tag name (the selector can match a group of elements; find_element_by_css_selector returns the first match)
        find_element_by_css_selector("div")

2. Locate by id attribute (note: an id is written with #)
        find_element_by_css_selector("#eleid")
        find_element_by_css_selector("div#eleid")

3. Locate by class attribute (note: a class is written with .)

        Two forms: with or without a tag name in front. Without the tag name, the dot is still required.
        find_element_by_css_selector('.class_value')
        find_element_by_css_selector("div.eleclass")
        find_element_by_css_selector('tag_name.class_value')

        Some class values are long and contain spaces. The spaces separate several class names, so the value cannot be copied into the selector as-is.
        Replace each space with a dot; a tag_name in front is optional.
        driver.find_element_by_css_selector('div.panel.panel-email').click()
        # <p class="important warning">This paragraph is a very important warning.</p>
        driver.find_element_by_css_selector('.important')
        driver.find_element_by_css_selector('.important.warning')

4. Locate by attribute
        Two forms: with or without a tag name in front.
        find_element_by_css_selector("[attri_name='attri_value']")
        find_element_by_css_selector("input[type='password']").send_keys('password')
        find_element_by_css_selector("[type='password']").send_keys('password')
    4.1 Exact match:
        find_element_by_css_selector("div[name=elename]")  # attribute=value, exact value match
        find_element_by_css_selector("a[href]")  # attribute presence: an <a> element that has an href attribute

    Note: if a class attribute value contains spaces, replace the spaces with dots.
    4.2 Partial match
        find_element_by_css_selector("div[name^=elename]")  # value starts with elename
        find_element_by_css_selector("div[name$=name2]")  # value ends with name2
        find_element_by_css_selector("div[name*=name1]")  # value contains name1 anywhere
    4.3 Multiple attributes
        find_element_by_css_selector("div[type='eletype'][value='elevalue']")  # several attributes at once
        find_element_by_css_selector("div.eleclass[name='namevalue']")  # div with class eleclass and name namevalue
        find_element_by_css_selector("div[name='elename'][type='eletype']:nth-of-type(1)")  # div with name elename and type eletype that is the first div of its parent

5. Child elements (A>B)
        find_element_by_css_selector("div#eleid>input")  # input children of the div with id eleid
        find_element_by_css_selector("div#eleid>input:nth-of-type(4)")  # the 4th input child of the div with id eleid
        find_element_by_css_selector("div#eleid>:nth-child(1)")  # the first child of the div with id eleid

6. Descendant elements (A B), and the sibling combinators + and ~
        find_element_by_css_selector("div#eleid input")  # input elements at any depth below the div with id eleid
        find_element_by_css_selector("div#eleid>input:nth-of-type(4)+label")  # the label immediately following the 4th input child of the div with id eleid
        find_element_by_css_selector("div#eleid>input:nth-of-type(4)~label")  # all label siblings that come after the 4th input child of the div with id eleid

7. Negation (:not)
        find_element_by_css_selector("div#eleid>*:not(input)")  # children of the div with id eleid that are not input elements
        find_element_by_css_selector("div:not([type='eletype'])")  # div elements whose type is not eletype

8. Contains
        Note: :contains() is not part of standard CSS, and a browser's native selector engine rejects it. It only works in engines that add it as an extension (for example jQuery's Sizzle). To match on text in a real browser, an XPath expression such as //li[contains(text(), 'Goa')] is the usual alternative.
        find_element_by_css_selector("li:contains('Goa')")  # <li>Goat</li>
        find_element_by_css_selector("li:not(:contains('Goa'))")  # <li>Cat</li>

9. By index
        Note: :nth() is likewise a non-standard extension. Standard CSS offers the 1-based :nth-child(n) and :nth-of-type(n).
        find_element_by_css_selector("li:nth(5)")
        find_element_by_css_selector("li:nth-of-type(5)")  # standard CSS: the 5th li of its parent

10. Path style
        Two forms: with or without a tag name in front. The hierarchy is written with the greater-than sign ">".
        find_element_by_css_selector("form#loginForm>ul>input[type='password']").send_keys('password')
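Items 3 and 4 of the list above describe two mechanical rules: spaces in a class attribute become dots, and an attribute test is written as [name='value']. As a small sketch of those rules, the helpers below build the selector strings. class_selector and attribute_selector are hypothetical names introduced here, not part of Selenium, and the attribute helper deliberately refuses values that would need escaping.

```python
def class_selector(class_attr, tag=""):
    """Turn a class attribute value such as "panel panel-email" into a CSS
    selector such as ".panel.panel-email", optionally prefixed with a tag."""
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute has no class names")
    return tag + "".join("." + name for name in names)


def attribute_selector(name, value, tag=""):
    """Build tag[name='value'] for an exact attribute match."""
    if "'" in value:
        # Escaping rules are out of scope for this sketch.
        raise ValueError("this sketch does not escape single quotes")
    return f"{tag}[{name}='{value}']"


print(class_selector("panel panel-email", tag="div"))
print(class_selector("important warning"))
print(attribute_selector("type", "password", tag="input"))
```

In current Selenium 4 releases the find_element_by_* helpers shown in the list are no longer available (they were removed in 4.3.0); the same selectors are passed as driver.find_element(By.CSS_SELECTOR, selector), with By imported from selenium.webdriver.common.by.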

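Advanced:

Basic CSS selectors

The most commonly used CSS selectors are:

Selector	Description	Example
*	Universal selector; selects all elements	*
<type>	Selects elements of a given type; supports the basic HTML tags	h1
.<class>	Selects elements that have a given class	.class1
<type>.<class>	Intersection of a type and a class (selectors written back to back mean intersection)	h1.class1
#<id>	Selects the element with a given id attribute value	#id1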

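Attribute selectors

Besides the basic core selectors, there are attribute selectors that match on attributes:

Selector	Description	Example
[attr]	Elements that define the attr attribute, even if it has no value	[placeholder]
[attr="val"]	Elements whose attr attribute equals val	[placeholder="Enter a keyword"]
[attr^="val"]	Elements whose attr attribute starts with val	[placeholder^="Enter"]
[attr$="val"]	Elements whose attr attribute ends with val	[placeholder$="keyword"]
[attr*="val"]	Elements whose attr attribute contains val	[placeholder*="a key"]
[attr~="val"]	Elements whose attr attribute is a whitespace-separated list of values, one of which equals val	[placeholder~="keyword"]
[attr|="val"]	Elements whose attr attribute equals val or starts with val followed by a hyphen	[placeholder|="keyword"]

        <p class="important warning">This paragraph is a very important warning.</p>
        Selenium example: (By.CSS_SELECTOR, 'p[class="important warning"]')
        With =, the attribute value must match in full; for the element above, p[class='important'] locates nothing.
        Word match: (By.CSS_SELECTOR, 'p[class~="important"]'). The value given to ~= must be a single word without spaces.
        Substring matching and the hyphen-prefix match:
        [class^="def"]: selects all elements whose class attribute value starts with "def"
        [class$="def"]: selects all elements whose class attribute value ends with "def"
        [class*="def"]: selects all elements whose class attribute value contains the substring "def"
        [class|="def"]: selects elements whose class attribute value equals "def" or starts with "def-" (the hyphen-prefix match)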
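The attribute operators above differ only in how the attribute value is compared with val. A minimal sketch of those comparison rules as one plain Python function follows. attr_matches is a hypothetical name, and Python's str.split() is used as an approximation of CSS whitespace splitting.

```python
def attr_matches(actual, op, val):
    """Return True if an attribute value `actual` satisfies [attr<op>"val"].

    `actual` is None when the element does not have the attribute at all.
    """
    if actual is None:
        return False
    if op == "=":
        return actual == val
    if op == "~=":
        # val must be one non-empty word; it is compared with each
        # whitespace-separated word of the attribute value.
        if not val or any(ch.isspace() for ch in val):
            return False
        return val in actual.split()
    if op == "|=":
        # Exactly val, or val immediately followed by a hyphen.
        return actual == val or actual.startswith(val + "-")
    if op in ("^=", "$=", "*="):
        if val == "":
            return False  # an empty val matches nothing for these operators
        if op == "^=":
            return actual.startswith(val)
        if op == "$=":
            return actual.endswith(val)
        return val in actual
    raise ValueError(f"unknown operator: {op!r}")


CLASS = "important warning"
print(attr_matches(CLASS, "=", "important"))           # full value required
print(attr_matches(CLASS, "~=", "important"))          # one word of the list
print(attr_matches(CLASS, "~=", "important warning"))  # val with a space never matches
print(attr_matches("en-US", "|=", "en"))
```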

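Relational selectors

Some selectors are based on the hierarchical relationship between elements; these are called relational selectors.

Selector	Description	Example
<selector> <selector>	The second selector's element is a descendant of the first selector's element; selects what the second selector matches	.class1 h1
<selector> > <selector>	The second selector's element is a direct child of the first selector's element; selects what the second selector matches	.class1 > *
<selector> + <selector>	Both elements share a parent; selects the element matching the second selector that immediately follows the first	.class1 + [lang]
<selector> ~ <selector>	Both elements share a parent; selects every element matching the second selector that comes after the first	.class1 ~ [lang]

        Selecting the descendants of an element:
        Selenium example: (By.CSS_SELECTOR, 'div button')
        All button elements among the descendants of a div, no matter how deeply they are nested.

        Selecting the children of an element:
        Selenium example: (By.CSS_SELECTOR, 'div > button')
        All button elements that are direct children of a div (the spaces around > are optional).

        When an element is hard to locate but its elder sibling stands out, the sibling can serve as a landmark, so this can be called the "younger-brother selector".
        It selects the younger brother of an element (the elder comes first in the document, the younger after it):
        Selenium example: (By.CSS_SELECTOR, 'button + li')
        button and li share the same parent and are adjacent; this selects the li that immediately follows the button.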
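To make the four combinators concrete, the sketch below re-implements each one by hand for plain tag names over a tiny tree, using the standard library's xml.etree.ElementTree. The DOC markup, the id values, and the function names are all invented for this illustration; a real engine accepts arbitrary selectors on both sides of the combinator.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree: div > (button#b1, li#l1, p > button#b2, li#l2)
DOC = ET.fromstring(
    "<div><button id='b1'/><li id='l1'/><p><button id='b2'/></p><li id='l2'/></div>"
)


def descendants(root, tag):
    """'root tag': matching elements at any depth below root."""
    return [e for e in root.iter(tag) if e is not root]


def children(root, tag):
    """'root > tag': matching direct children only."""
    return [c for c in root if c.tag == tag]


def adjacent_siblings(root, first, second):
    """'first + second': a `second` element immediately after a `first` sibling."""
    found = []
    for parent in root.iter():
        kids = list(parent)
        for prev, cur in zip(kids, kids[1:]):
            if prev.tag == first and cur.tag == second:
                found.append(cur)
    return found


def general_siblings(root, first, second):
    """'first ~ second': every `second` element after a `first` sibling."""
    found = []
    for parent in root.iter():
        seen_first = False
        for kid in parent:
            if seen_first and kid.tag == second:
                found.append(kid)
            if kid.tag == first:
                seen_first = True
    return found


def ids(elements):
    return [e.get("id") for e in elements]


print(ids(descendants(DOC, "button")))              # div button
print(ids(children(DOC, "button")))                 # div > button
print(ids(adjacent_siblings(DOC, "button", "li")))  # button + li
print(ids(general_siblings(DOC, "button", "li")))   # button ~ li
```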

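Union and negation selectors

The union selector and the negation selector express OR and NOT relationships. (AND is expressed by writing selectors back to back, as in h1.class1.)

Selector	Description	Example
<selector>,<selector>	Elements that match the first selector or the second selector	h1, h2
:not(<selector>)	Elements that do not match the selector	:not(html)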
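The two rows above correspond to a logical OR and a logical NOT over the set of elements. As a rough sketch restricted to plain tag names, the helpers below walk a small tree with the standard library's xml.etree.ElementTree. DOC, select_union, and select_not are hypothetical names used only here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration.
DOC = ET.fromstring("<html><body><h1>a</h1><h2>b</h2><p>c</p></body></html>")


def select_union(root, *tags):
    """'h1, h2': elements whose tag is any of the listed ones, in document order."""
    wanted = set(tags)
    return [e for e in root.iter() if e.tag in wanted]


def select_not(root, tag):
    """':not(tag)': every element (root included) whose tag differs from `tag`."""
    return [e for e in root.iter() if e.tag != tag]


print([e.tag for e in select_union(DOC, "h1", "h2")])
print([e.tag for e in select_not(DOC, "html")])
```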

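Pseudo-element and pseudo-class selectors

CSS selectors also include pseudo-elements and pseudo-classes.

:active	The element being activated (for example, while it is being clicked)
:checked	Elements in the checked state
:default	Elements that are the default choice among a group of related elements
:disabled	Elements in the disabled state
:empty	Elements that have no content at all
:enabled	Elements in the enabled state
:first-child	Elements that are the first child of their parent
:first-letter	The first letter of the text
:first-line	The first line of the text
:focus	The element that has focus
:hover	The element the mouse is hovering over
:in-range	Elements whose value is within the allowed range
:out-of-range	Elements whose value is outside the allowed range
:lang(<language>)	Elements whose language is <language>
:last-child	Elements that are the last child of their parent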
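The structural pseudo-classes in this list (:first-child, :last-child, :empty) depend only on the shape of the tree, so they can be sketched without a browser. The example below uses the standard library's xml.etree.ElementTree. ITEMS and the helper names are made up for illustration, and state-based pseudo-classes such as :hover or :focus cannot be modeled this way because they depend on live browser state.

```python
import xml.etree.ElementTree as ET

# Hypothetical list with an empty middle item.
ITEMS = ET.fromstring("<ul><li>one</li><li></li><li>three</li></ul>")


def first_child(parent):
    """The child that :first-child would match under `parent`, or None."""
    children = list(parent)
    return children[0] if children else None


def last_child(parent):
    """The child that :last-child would match under `parent`, or None."""
    children = list(parent)
    return children[-1] if children else None


def is_empty(element):
    """:empty holds when there are no child elements and no text at all."""
    return len(element) == 0 and not element.text


print(first_child(ITEMS).text)
print(last_child(ITEMS).text)
print([is_empty(li) for li in ITEMS])
```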

