Traversing the Document Tree in a Python Crawler

Published: 2025/3/20 python 10 豆豆

1. Direct children: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the tag's direct children as a list.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .contents returns a list, so it can be printed or indexed directly
print(soup.head.contents)
print(soup.head.contents[0])

Output

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
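The head element above has a single child, so it does not show that .contents also includes text nodes. The following is a minimal sketch of that case. The snippet string is a hypothetical fragment rather than the article's HTML, and the built-in "html.parser" is swapped in for lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; not the article's document.
snippet = "<p>Hello <b>big</b> world</p>"

# Assumption: the standard-library parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")

# .contents holds every direct child, text nodes as well as tags.
p_children = soup.p.contents
print(p_children)
print(len(p_children))

# Collect the class name of each direct child to see the mix of node types.
child_types = [type(node).__name__ for node in p_children]
print(child_types)
```

Because .contents is an ordinary list, indexing, slicing and len() all work on it directly.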

.children

.children does not return a list, but we can iterate over it to get all the direct children.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .children is an iterator object, not a list
print(soup.head.children)

# Iterate to get every direct child
for child in soup.head.children:
    print(child)

Output

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
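When the markup contains line breaks, iterating over .children also yields the whitespace text nodes between tags. The sketch below shows one common way to keep only the Tag children. The snippet string and variable names are hypothetical, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment for illustration; newlines sit between the <li> tags.
snippet = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"

# Assumption: the standard-library parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")

# Every direct child, including the "\n" text nodes.
all_children = list(soup.ul.children)
print(len(all_children))

# Keep only the children that are tags.
tag_children = [child for child in soup.ul.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
tag_texts = [str(child.string) for child in tag_children]
print(tag_names, tag_texts)
```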

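2. All descendants: the .descendants attribute

The .contents and .children attributes described above cover only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, we need to iterate over it to get the contents.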

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .descendants is a generator object
print(soup.head.descendants)

# Iterate to get every descendant, at any depth
for child in soup.head.descendants:
    print(child)

Output

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
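To make the difference between direct children and descendants concrete, here is a small sketch that counts both on the same element. The snippet string and variable names are hypothetical, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment for illustration: <b> is nested two levels under <div>.
snippet = "<div><p>Hi <b>there</b></p></div>"

# Assumption: the standard-library parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")

# Direct children of <div>: only the <p> tag.
direct = list(soup.div.children)

# Descendants of <div>: visited depth-first, in document order.
nested = list(soup.div.descendants)

# Describe each descendant by tag name, or by its text for string nodes.
described = [node.name if isinstance(node, Tag) else str(node) for node in nested]
print(len(direct), len(nested))
print(described)
```

The text inside a tag counts as a descendant in its own right, which is why the title example above prints both the title tag and its string.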

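3. Node content: the .string attribute

If a Tag has only one child and that child is a NavigableString, .string returns that child. If a Tag has exactly one child and that child is another Tag, .string can still be used, and the result is the same as the .string of that only child.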

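Put simply: if a tag contains only text and no other tags, .string returns that text. If a tag contains exactly one tag and nothing else, .string returns the text inside that inner tag. If a tag has more than one child, .string is None. For example: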

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""

# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# head has a single child (title), so both lines print the title text
print(soup.head.string)
print(soup.head.title.string)

Output

The Dormouse's story
The Dormouse's story
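The example above covers only the cases where .string succeeds. The sketch below adds a tag with several children, where .string is None, and shows .strings and get_text() as ways to collect the text anyway. The snippet string and variable names are hypothetical, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; <p> has three children, <b> has one.
snippet = "<div><p>Hello <b>big</b> world</p></div>"

# Assumption: the standard-library parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")

# <b> has a single text child, so .string returns it.
single = soup.b.string

# <p> has several children, so .string cannot pick one and is None.
multiple = soup.p.string

# .strings yields every text node under the tag; get_text() joins them.
pieces = [str(s) for s in soup.p.strings]
joined = soup.p.get_text()

print(single, multiple)
print(pieces, joined)
```

Checking for None before using the result of .string avoids errors on tags with mixed content.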
