Python Web Scraping: Traversing the Document Tree
1. Direct child nodes: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the tag's direct child nodes as a list.
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .contents returns a list, so it can be printed whole or indexed
print(soup.head.contents)
print(soup.head.contents[0])

Output:

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>

.children
.children does not return a list. It returns an iterator, but we can loop over it to get all of the direct child nodes.
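The relationship between the two attributes can be sketched as follows. This is a minimal illustration, not part of the original article: the `snippet` string is a made-up fragment, and it uses the built-in "html.parser" instead of lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only, not the article's full document.
snippet = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'

# Assumption: "html.parser" is used here instead of lxml to avoid an extra install.
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .contents is a real list, so len() and indexing work directly.
contents = p.contents
print(len(contents))  # 4: two text nodes and two <a> tags
print(contents[1])    # the first <a> tag

# .children walks the same direct children lazily; wrap it in list() to materialize it.
children = list(p.children)

# Both attributes expose the very same node objects, in the same order.
same_nodes = len(contents) == len(children) and all(
    a is b for a, b in zip(contents, children)
)
print(same_nodes)
```

Note that text between tags counts as a child node too, which is why the list holds four items rather than two.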
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .children is a list iterator object, not a list
print(soup.head.children)

# Loop over it to get every direct child node
for child in soup.head.children:
    print(child)

Output:

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>

2. All descendant nodes: the .descendants attribute
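The .contents and .children attributes described above contain only a Tag's direct child nodes. The .descendants attribute iterates recursively over all of a Tag's descendants: its children, their children, and so on. As with .children, we have to loop over it to get at its contents.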
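To make the difference between direct children and descendants concrete, here is a small sketch. The `snippet` fragment is hypothetical, and "html.parser" is used in place of lxml purely so the example needs no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment for illustration only.
snippet = "<div><p>Hello <b>world</b></p></div>"

# Assumption: "html.parser" is used here instead of lxml to avoid an extra install.
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

direct = list(div.children)     # only the nodes one level down
nested = list(div.descendants)  # every node below <div>, in document order

print(len(direct))  # 1: just the <p> tag
print(len(nested))  # 4: <p>, "Hello ", <b>, "world"

# Descendants mix Tag objects and text nodes, so label each one.
for node in nested:
    kind = "tag" if isinstance(node, Tag) else "text"
    print(kind, repr(str(node)))
```

The traversal is depth-first: a tag is yielded before the nodes it contains.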
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .descendants is a generator object
print(soup.head.descendants)

# Loop over it to get every descendant node
for child in soup.head.descendants:
    print(child)

Output:

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story

3. Node content: the .string attribute
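If a Tag has only one child node and that child is a NavigableString, .string returns that child. If a Tag has exactly one child node and that child is another Tag, .string can still be used: the result is the same as the .string of that only child.

Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one other tag, .string also returns the text inside that one. For example: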
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# <head> has a single child, <title>, so both calls return the same text
print(soup.head.string)
print(soup.head.title.string)

Output:

The Dormouse's story
The Dormouse's story
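The two cases described for .string can be sketched alongside a third case the text does not spell out: when a tag has more than one child node, .string is None. The `snippet` fragment is hypothetical, and "html.parser" stands in for lxml only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
snippet = "<div><p><b>only child</b></p><p>one <i>two</i></p></div>"

# Assumption: "html.parser" is used here instead of lxml to avoid an extra install.
soup = BeautifulSoup(snippet, "html.parser")
first_p, second_p = soup.find_all("p")

# Single chain p > b > text: .string follows the only child down to the text.
single = first_p.string

# Two child nodes (a text node and an <i> tag): .string cannot pick one, so it is None.
multiple = second_p.string

print(single)    # only child
print(multiple)  # None
```

Checking for None before using the result avoids an unexpected AttributeError when the tag turns out to hold several children.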