當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

爬虫--BeautifulSoup使用

發布時間：2025/3/15 编程问答 9 豆豆

生活随笔收集整理的這篇文章主要介紹了爬虫--BeautifulSoup使用小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

解析庫

解析器使用方法優勢劣勢

Python標準庫	BeautifulSoup(markup, "html.parser")	Python的內置標準庫、執行速度適中、文檔容錯能力強	Python 2.7.3 or 3.2.2)前的版本中文容錯能力差
lxml HTML 解析器	BeautifulSoup(markup, "lxml")	速度快、文檔容錯能力強	需要安裝C語言庫
lxml XML 解析器	BeautifulSoup(markup, "xml")	速度快、唯一支持XML的解析器	需要安裝C語言庫
html5lib	BeautifulSoup(markup, "html5lib")	最好的容錯性、以瀏覽器的方式解析文檔、生成HTML5格式的文檔	速度慢、不依賴外部擴展

基本使用

html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.prettify()) print(soup.title.string)# 輸出 <html><head><title>The Dormouse's story</title></head><body><p class="title" name="dromouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"></a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p><p class="story">...</p></body> </html> The Dormouse's story

標簽選擇器

選擇元素

html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.title) print(type(soup.title)) print(soup.head) print(soup.p)# 輸出 <title>The Dormouse's story</title> <class 'bs4.element.Tag'> <head><title>The Dormouse's story</title></head> <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

獲取名稱

?

獲取屬性

獲取內容

html = """ <html><head><title>The Dormouse's story</title></head> <body> <p clss="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.string)# 輸出 The Dormouse's story

嵌套選擇

子節點和子孫節點

父節點和祖先節點

html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.a.parent)# 輸出 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(list(enumerate(soup.a.parents)))# 輸出 [(0, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>), (1, <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body>), (2, <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>), (3, <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>)]

兄弟節點

標準選擇器

find_all( name , attrs , recursive , text , **kwargs )

可根據標簽名、屬性、內容查找文檔

name

html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all('ul')) print(type(soup.find_all('ul')[0]))# 輸出 [<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul>] <class 'bs4.element.Tag'> html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for ul in soup.find_all('ul'):print(ul.find_all('li'))# 輸出 [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs

html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(attrs={'id': 'list-1'})) print(soup.find_all(attrs={'name': 'elements'}))# 輸出 [<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>] [<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]

text

find( name , attrs , recursive , text , **kwargs )

find返回單個元素，find_all返回所有元素

find_parents() find_parent()

find_parents()返回所有祖先節點，find_parent()返回直接父節點。

find_next_siblings() find_next_sibling()

find_next_siblings()返回后面所有兄弟節點，find_next_sibling()返回后面第一個兄弟節點。

find_previous_siblings() find_previous_sibling()

find_previous_siblings()返回前面所有兄弟節點，find_previous_sibling()返回前面第一個兄弟節點。

find_all_next() find_next()

find_all_next()返回節點后所有符合條件的節點, find_next()返回第一個符合條件的節點

find_all_previous() 和 find_previous()

find_all_previous()返回節點后所有符合條件的節點, find_previous()返回第一個符合條件的節點

CSS選擇器

通過select()直接傳入CSS選擇器即可完成選擇：

. 代表類名? ? ? ? # 代表id? ? ? ?啥也不寫是標簽名

html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.select('.panel .panel-heading')) print(soup.select('ul li')) print(soup.select('#list-2 .element')) print(type(soup.select('ul')[0]))# 輸出 [<div class="panel-heading"> <h4>Hello</h4> </div>] [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>] <class 'bs4.element.Tag'> html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for ul in soup.select('ul'):print(ul.select('li'))# 輸出 [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>]

獲取屬性

獲取內容

總結

推薦使用lxml解析庫，必要時使用html.parser
標簽選擇篩選功能弱但是速度快
建議使用find()、find_all() 查詢匹配單個結果或者多個結果
如果對CSS選擇器熟悉建議使用select()
記住常用的獲取屬性和文本值的方法

轉載于:https://www.cnblogs.com/panlei3707/p/8745097.html

總結

以上是生活随笔為你收集整理的爬虫--BeautifulSoup使用的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： Sping（一）——IOC/DI
下一篇： GUI编程tkinter模块常用参数(p