Parsing HTML with lxml.etree in Python

Published: 2023/12/31 python 28 豆豆

Original article: https://www.cnblogs.com/yoyoketang/p/9661273.html
Element and ElementTree reference: https://blog.csdn.net/hellocsz/article/details/79780654
lxml.etree tutorial (in English): https://lxml.de/tutorial.html

Installing lxml

Install the lxml library with pip:

$ pip install lxml

Use pip show lxml to check the installed version:

$ pip show lxml
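
The version can also be checked from inside Python. The following is a minimal sketch, assuming lxml is already installed; it reads the version tuples that lxml.etree exposes for lxml itself and for the libxml2 library it was built against.

```python
from lxml import etree

# Version of the lxml package itself, as a tuple of integers.
print("lxml:", etree.LXML_VERSION)

# Version of the underlying libxml2 C library that does the parsing.
print("libxml2:", etree.LIBXML_VERSION)
```

This is handy when a script needs to confirm at runtime which build it is running against.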

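Parsing HTML

The etree.HTML function parses HTML text into an element tree and automatically repairs malformed HTML along the way.
To print the parsed HTML, use etree.tostring. Passing encoding="utf-8" lets the Chinese text in the HTML come out correctly, and pretty_print=True produces readable, line-broken output.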

# coding:utf-8
from lxml import etree

htmldemo = '''
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
yoyoketang
<p class="yoyo">這里是我的微信公眾號：yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧！</p>
...
'''

# Parse the HTML content with etree.HTML
demo = etree.HTML(htmldemo)
# tostring returns bytes; decode them to print the parsed content as a str
t = etree.tostring(demo, encoding="utf-8", pretty_print=True)
print(t.decode("utf-8"))

Output

<html><head><meta charset="UTF-8"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>yoyo ketang</title>
</head><body>
yoyoketang
<p class="yoyo">這里是我的微信公眾號：yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧！</p>
...
</body>
</html>
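
To see the automatic repair more directly, here is a small sketch using a deliberately broken fragment. The fragment is made up for illustration: it has no html or body wrapper and never closes its p and b tags.

```python
from lxml import etree

# Hypothetical malformed input: missing <html>/<body>, unclosed <p> and <b>.
broken_html = '<p class="yoyo">hello <b>world'

root = etree.HTML(broken_html)

# encoding="unicode" makes tostring return a str instead of bytes.
repaired = etree.tostring(root, encoding="unicode")
print(repaired)

# The parser has supplied the missing wrapper elements.
print(root.tag, root[0].tag)
```

The serialized result contains the html and body wrappers and the closing tags that the input lacked, which is why scraped pages with sloppy markup can usually still be queried.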

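Parsing an XML string

A downloaded page arrives as a string. Use etree.fromstring(str) to build an etree._Element object (the root node of the tree), and etree.tostring(t) to serialize it back to a string, which is a byte string by default.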

>>> from lxml import etree
>>> xml_string = '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
>>> root = etree.fromstring(xml_string.encode('utf-8'))  # passing a byte string is safest
>>> etree.tostring(root)  # returns a byte string by default; non-ASCII text is escaped
b'<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>&#20013;&#25991;</bar><baz/></root>'
>>> print(etree.tostring(root, pretty_print=True).decode('utf-8'))  # decode to get a str
<root>
  <foo id="foo-id" class="foo zoo">Foo</foo>
  <bar>&#20013;&#25991;</bar>
  <baz/>
</root>

>>> type(root)
<class 'lxml.etree._Element'>
# Note that the childless baz node is serialized as a self-closing tag.
# type(root) shows that fromstring returns an _Element object, i.e. the root node of the whole XML tree.
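
The encoding argument of etree.tostring decides both the return type and how non-ASCII text is written. The sketch below compares three common choices on a shortened version of the XML above; the variable names are only illustrative.

```python
from lxml import etree

root = etree.fromstring('<root><bar>中文</bar><baz></baz></root>'.encode("utf-8"))

# Default: ASCII bytes, so non-ASCII characters become numeric character references.
ascii_bytes = etree.tostring(root)

# encoding="utf-8": still bytes, but the Chinese text is stored as UTF-8.
utf8_bytes = etree.tostring(root, encoding="utf-8")

# encoding="unicode": returns a Python str directly, no decode step needed.
text = etree.tostring(root, encoding="unicode")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
print(text)
```

If the goal is simply to print readable text, encoding="unicode" saves the extra decode call.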

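XPath usage examples

The point of using an HTML parser is to extract particular element attributes and text from a page. Let's see how to find what we want with as little code as possible.
For example, to get the text “這里是我的微信公眾號：yoyoketang”: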

# coding:utf-8
from lxml import etree

htmldemo = ''' copy the HTML content from above '''

# Parse the HTML content with etree.HTML
demo = etree.HTML(htmldemo)
nodes = demo.xpath('//p[@class="yoyo"]')
# Get the text
t = nodes[0].text
print(t)

Output:

這里是我的微信公眾號：yoyoketang

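In terms of code size, three short lines are enough to find the content, which is simpler and more concise than doing the same thing with BeautifulSoup.

nodes is a list returned by the XPath query, and it contains every element that matches the expression. A for loop shows the details of each match: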
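
One detail worth knowing: the .text attribute only holds the text that comes before the element's first child. The sketch below, on a made-up one-line fragment, shows two ways to collect all the text inside an element, itertext() and the XPath string() function.

```python
from lxml import etree

# Hypothetical fragment with text before, inside, and after a child element.
html = '<p class="yoyo">intro <a href="#">link</a> tail text</p>'
p = etree.HTML(html).xpath('//p[@class="yoyo"]')[0]

# .text stops at the first child element.
direct = p.text

# itertext() walks the element and all descendants in document order.
full = "".join(p.itertext())

# string(.) is the XPath way to get the same concatenated text.
via_xpath = p.xpath("string(.)")

print(repr(direct))
print(repr(full))
print(repr(via_xpath))
```

In the example from this article, nodes[0].text is enough because the wanted sentence sits before the first a tag.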

# coding:utf-8
from lxml import etree

htmldemo = ''' copy the HTML content from above '''

# Parse the HTML content with etree.HTML
demo = etree.HTML(htmldemo)
nodes = demo.xpath('//p[@class="yoyo"]')
print(nodes)  # a list object

for i in nodes:
    # Print the matched element
    print(etree.tostring(i, encoding="utf-8", pretty_print=True).decode("utf-8"))
    # Element attributes, in dictionary form
    print(i.attrib)

Output

[<Element p at 0x2bcd388>]
<p class="yoyo">這里是我的微信公眾號：yoyoketang
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler教程</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python筆記</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium文檔</a>;
快來關注吧！</p>
{'class': 'yoyo'}
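
XPath can also return attribute values and text directly, without looping over elements first. The following sketch uses a shortened, made-up version of the HTML (the example.com URLs are placeholders) to pull out the href values and link labels.

```python
from lxml import etree

# Simplified stand-in for the article's HTML; URLs are placeholders.
html = '''
<p class="yoyo">intro text
<a href="http://example.com/a" class="sister" id="link1">first</a>,
<a href="http://example.com/b" class="sister" id="link2">second</a>
</p>
'''
demo = etree.HTML(html)

# /@href selects the attribute value itself, returned as a list of strings.
hrefs = demo.xpath('//p[@class="yoyo"]/a/@href')

# /text() selects the text nodes of each matching element.
labels = demo.xpath('//p[@class="yoyo"]/a/text()')

# On a single element, .get() reads one attribute and .attrib behaves like a dict.
first = demo.xpath('//a[@id="link1"]')[0]
print(hrefs)
print(labels)
print(first.get("class"), dict(first.attrib))
```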

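Searching within a result

The XPath expression //p[@class="yoyo"] selects the element with class="yoyo" together with all of its child nodes. To locate one of those children, run a second XPath query on the element you already have. Start the second expression with a dot (.//) so that it is evaluated relative to that element; an expression starting with // always searches from the document root, even when called on a child element.

nodes = demo.xpath('//p[@class="yoyo"]')
# The leading dot keeps the search inside nodes[0]
t1 = nodes[0].xpath('.//a[@id="link2"]')
print(t1[0].text)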

Output

python筆記
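
The difference between an absolute and a relative path in a second-level search is easy to miss, because both give the same result when the target only exists once in the page. The sketch below uses a made-up fragment with a matching link both inside and outside the p element to make the difference visible.

```python
from lxml import etree

# Hypothetical page: one matching link outside the <p>, one inside it.
html = '''
<div id="outside"><a class="sister">outside link</a></div>
<p class="yoyo"><a class="sister">inside link</a></p>
'''
demo = etree.HTML(html)
p = demo.xpath('//p[@class="yoyo"]')[0]

# "//" starts from the document root, even though xpath() is called on p.
absolute = [a.text for a in p.xpath('//a[@class="sister"]')]

# ".//" restricts the search to descendants of p.
relative = [a.text for a in p.xpath('.//a[@class="sister"]')]

print(absolute)
print(relative)
```

The absolute query returns both links, while the relative one returns only the link inside the p element.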
