Python BeautifulSoup

Published: 2023/12/20 python 28 豆豆

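Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.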

This documentation has been translated into other languages by Beautiful Soup users. Translations are available in Chinese, Japanese (external link), and Korean (external link).

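Getting help

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.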

Quick Start

Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
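The sample markup itself appears to have been lost from this copy of the article. The block below is a reconstruction, inferred from the outputs printed in the rest of the document (the title, the two kinds of paragraphs, and the three "sister" links). The exact whitespace is an assumption.

```python
# Reconstructed "three sisters" sample document (layout inferred from the
# outputs shown later in this article; treat whitespace as an assumption).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

The later examples refer to this string as html_doc and to the parsed result as soup.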

Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # [u'title']

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.

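Installing Beautiful Soup

If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4 (for Python 2)

$ apt-get install python3-bs4 (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you're using Python 3).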

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

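If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the ImportError "No module named HTMLParser", your problem is that you're running the Python 2 version of the code under Python 3.

If you get the ImportError "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.

If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package: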

$ python3 setup.py install

or by manually running Python's 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4
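When chasing one of the installation problems above, it can help to first confirm which interpreter is running and which copy of bs4 it actually imports. The following is a minimal diagnostic sketch, not part of Beautiful Soup's own tooling; it only assumes that the package is importable.

```python
import sys

import bs4

# Which Python is running this script?
print("Python:", ".".join(str(part) for part in sys.version_info[:3]))

# Which Beautiful Soup was imported, and from where on disk?
print("bs4 version:", bs4.__version__)
print("bs4 location:", bs4.__file__)
```

If the reported location points at an unpacked tarball or an unexpected site-packages directory, that is a hint that a stale copy is shadowing the one you meant to install.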

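Installing a parser

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands: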

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

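Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands: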

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

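This table summarizes the advantages and disadvantages of each parser library:

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml's XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only currently supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency

If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.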
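To see how the parsers disagree on invalid markup, you can feed the same broken fragment to each one and compare the results. This is a small sketch: the fragment is an arbitrary example, and since lxml and html5lib are optional installs, the loop simply skips any parser that is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A stray closing tag makes this fragment invalid HTML.
invalid_markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(invalid_markup, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, rendered in results.items():
    print(parser, "->", rendered if rendered is not None else "(not installed)")
```

Naming the parser explicitly, as this loop does, also keeps the resulting tree the same from one machine to the next.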

Making the soup

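To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp)

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    BeautifulSoup("Sacr&eacute; bleu!")
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.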

Tag

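A Tag object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name

Every tag has a name, accessible as .name:

    tag.name
    # u'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

    tag.name = "blockquote"
    tag
    # <blockquote id="boldest">Extremely bold</blockquote>

Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

    tag['id']
    # u'boldest'

You can access that dictionary directly as .attrs:

    tag.attrs
    # {u'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

    tag['id'] = 'verybold'
    tag['another-attribute'] = 1
    tag
    # <blockquote another-attribute="1" id="verybold">Extremely bold</blockquote>

    del tag['id']
    del tag['another-attribute']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['id']
    # KeyError: 'id'
    print(tag.get('id'))
    # None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

You can use get_attribute_list to get a value that's always a list, whether or not it's a multi-valued attribute:

    id_soup.p.get_attribute_list('id')
    # ["my id"]

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'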
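The attribute rules above can be checked in one self-contained run. This sketch uses the standard-library html.parser and a made-up class name ("highlight") purely for illustration; it shows the list form of class, the untouched id string, get_attribute_list, and the consolidation that happens when the tag is rendered.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

class_values = list(p["class"])           # multi-valued attribute: a list
id_value = p["id"]                        # not multi-valued: left as one string
id_as_list = p.get_attribute_list("id")   # always a list, whatever the attribute

# "highlight" is a hypothetical extra class, added only for this example.
p["class"] = class_values + ["highlight"]
rendered = str(p)                         # list values are joined with spaces

print(class_values, repr(id_value), id_as_list)
print(rendered)
```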

NavigableString

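A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

You can't edit a string in place, but you can replace one string with another, using replace_with():

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.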
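The unicode() built-in exists only in Python 2. Under Python 3 the equivalent conversion is str(). The sketch below assumes Python 3 and the html.parser backend, and shows that the converted value is a plain string with no link back to the parse tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string   # still attached to the parse tree
plain = str(navigable)      # Python 3 counterpart of unicode(tag.string)

print(type(navigable).__name__)     # the Beautiful Soup string class
print(type(plain) is str)           # a normal Python string
print(navigable.parent is soup.b)   # the NavigableString knows its parent
print(hasattr(plain, "parent"))     # the plain copy does not
```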

BeautifulSoup

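The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":

    soup.name
    # u'[document]'

Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment:

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

    comment
    # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser?-->
    # </b>

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here's an example that replaces the comment with a CDATA block:

    from bs4 import CData
    cdata = CData("A CDATA block")
    comment.replace_with(cdata)

    print(soup.b.prettify())
    # <b>
    #  <![CDATA[A CDATA block]]>
    # </b>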

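Navigating the tree

The examples in this section use the "three sisters" HTML document from the Quick Start, parsed into a BeautifulSoup object named soup. I'll use it to show you how to move from one part of a document to another.

Going down

Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]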

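.contents and .children

A tag's children are available in a list called .contents:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # [u'The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

    len(soup.contents)
    # 1
    soup.contents[0].name
    # u'html'

A string does not have .contents, because it can't contain anything:

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

.descendants

The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 25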
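The difference between children and descendants is easy to verify on a tiny fragment. This sketch parses just the <head> element with html.parser (an assumption made to keep the example self-contained) and counts both.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = list(head_tag.children)         # only the <title> tag
everything = list(head_tag.descendants)  # the <title> tag and the string inside it

print(len(direct))
print(len(everything))
print([type(node).__name__ for node in everything])
```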

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # u'The Dormouse's story'

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

.strings and stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

    for string in soup.strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'
    # u'\n'

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

    for string in soup.stripped_strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u"The Dormouse's story"
    # u'Once upon a time there were three little sisters; and their names were'
    # u'Elsie'
    # u','
    # u'Lacie'
    # u'and'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
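A common use of .stripped_strings is flattening a tag into one line of text. The sketch below uses a shortened, made-up variant of the story paragraph and joins the stripped pieces with a space; get_text() with a separator and strip=True is expected to give the same result.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the story paragraph (an assumption for this example).
snippet = """<p class="story">Once upon a time there were
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>.</p>"""
soup = BeautifulSoup(snippet, "html.parser")

pieces = list(soup.stripped_strings)   # whitespace-only strings are dropped
one_line = " ".join(pieces)

# get_text() can do the stripping and joining in a single call.
same_line = soup.get_text(" ", strip=True)

print(pieces)
print(one_line)
```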

Going up

Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag:

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

    print(soup.parent)
    # None

.parents

You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document to the very top of the document:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        print(parent.name)
    # p
    # body
    # html
    # [document]

Going sideways

Consider a simple document like this:

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
    print(sibling_soup.prettify())
    # <html>
    #  <body>
    #   <a>
    #    <b>
    #     text1
    #    </b>
    #    <c>
    #     text2
    #    </c>
    #   </a>
    #  </body>
    # </html>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a .next_sibling, but no .previous_sibling, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings "text1" and "text2" are not siblings, because they don't have the same parent:

    sibling_soup.b.string
    # u'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it's a string: the comma and newline that separate the first <a> tag from the second:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # u',\n'

The second <a> tag is actually the .next_sibling of the comma:

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
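When the text between tags is not interesting, chaining .next_sibling calls gets fragile. One alternative, sketched here on a trimmed, made-up version of the links, is the find_next_sibling() search method, which skips over strings and returns the next sibling that matches a tag name.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the three links (an assumption for this example).
snippet = ('<p><a id="link1">Elsie</a>,\n'
           '<a id="link2">Lacie</a> and\n'
           '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(snippet, "html.parser")

first_link = soup.a
raw_next = first_link.next_sibling             # the comma-and-newline string
next_tag = first_link.find_next_sibling("a")   # the next <a> tag, strings skipped

print(repr(raw_next))
print(next_tag["id"])
```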

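.next_siblings and .previous_siblings

You can iterate over a tag's siblings with .next_siblings or .previous_siblings:

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # u',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # u';\nand they lived at the bottom of a well.'

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # u' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # u'Once upon a time there were three little sisters; and their names were\n'

Going back and forth

Take a look at the beginning of the "three sisters" document:

    <html><head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # u';\nand they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it's the word "Tillie":

    last_a_tag.next_element
    # u'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

    last_a_tag.previous_element
    # u' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # <p class="story">...</p>
    # u'...'
    # u'\n'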
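The contrast between "next on the same level" and "next in parse order" can be reproduced on a one-line fragment. This sketch uses a shortened, made-up sentence and html.parser, and checks that the element after the tag's own text is the tag's sibling.

```python
from bs4 import BeautifulSoup

# One-line stand-in for the end of the story paragraph (an assumption).
snippet = '<p>and <a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(snippet, "html.parser")

last_a_tag = soup.find("a", id="link3")
sibling = last_a_tag.next_sibling   # same level of the tree: rest of the sentence
element = last_a_tag.next_element   # parse order: the text inside the <a> tag

print(repr(sibling))
print(repr(element))
# Stepping once more in parse order lands on the sibling string.
print(element.next_element is sibling)
```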

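Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they're all very similar. I'm going to spend a lot of time explaining the two most popular methods: find() and find_all(). The other methods take almost exactly the same arguments, so I'll just cover them briefly.

Once again, I'll be using the "three sisters" document from the Quick Start as an example, parsed into soup.

By passing in a filter to a method like find_all(), you can zoom in on the parts of the document you're interested in.

Kinds of filters

Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.

A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

This code finds all the tags whose names contain the letter "t":

    for tag in soup.find_all(re.compile("t")):
        print(tag.name)
    # html
    # title

A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]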

True

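The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p

A function

If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

Here's a function that returns True if a tag defines the "class" attribute but doesn't define the "id" attribute:

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you'll pick up all the <p> tags:

    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <title>, because those tags don't define "class".

If you pass in a function to filter on a specific attribute like href, the argument passed into the function will be the attribute value, not the whole tag. Here's a function that finds all <a> tags whose href attribute does not match a regular expression:

    def not_lacie(href):
        return href and not re.compile("lacie").search(href)

    soup.find_all(href=not_lacie)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The function can be as complicated as you need it to be. Here's a function that returns True if a tag is surrounded by string objects:

    from bs4 import NavigableString

    def surrounded_by_strings(tag):
        return (isinstance(tag.next_element, NavigableString)
                and isinstance(tag.previous_element, NavigableString))

    for tag in soup.find_all(surrounded_by_strings):
        print(tag.name)
    # p
    # a
    # a
    # a
    # p

Now we're ready to look at the search methods in detail.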
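The five kinds of filter can be compared side by side on one small document. This is a sketch under stated assumptions: the markup is a cut-down version of the sample document with a single link, the parser is html.parser, and names() is a hypothetical helper used only to keep the output short.

```python
import re

from bs4 import BeautifulSoup

# Cut-down sample document with one link (an assumption for this example).
snippet = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a></p></body></html>"""
soup = BeautifulSoup(snippet, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("b"))                    # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))        # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))               # any name in the list
by_true = names(soup.find_all(True))                     # every tag, no strings
by_function = names(soup.find_all(has_class_but_no_id))  # custom predicate

for label, result in [("string", by_string), ("regex", by_regex),
                      ("list", by_list), ("True", by_true),
                      ("function", by_function)]:
    print(label, result)
```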

find_all()

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.find_all("p", "title")
    # [<p class="title"><b>The Dormouse's story</b></p>]

    soup.find_all("a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find_all(id="link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    import re
    soup.find(string=re.compile("sisters"))
    # u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().

The name argument

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.

This is the simplest usage:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.

The keyword arguments

Any argument that's not recognized will be turned into a filter on one of a tag's attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's "id" attribute:

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag's "href" attribute:

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

This code finds all tags whose id attribute has a value, regardless of what the value is:

    soup.find_all(id=True)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one keyword argument:

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]

You can't use a keyword argument to search for HTML's "name" attribute, because Beautiful Soup uses the name argument to contain the name of the tag itself. Instead, you can give a value to "name" in the attrs argument:

    name_soup = BeautifulSoup('<input name="email"/>')
    name_soup.find_all(name="email")
    # []
    name_soup.find_all(attrs={"name": "email"})
    # [<input name="email"/>]

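Searching by CSS class

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

    soup.find_all("a", class_="sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

    soup.find_all(class_=re.compile("itl"))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    def has_six_characters(css_class):
        return css_class is not None and len(css_class) == 6

    soup.find_all(class_=has_six_characters)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes:

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.find_all("p", class_="strikeout")
    # [<p class="body strikeout"></p>]

    css_soup.find_all("p", class_="body")
    # [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

    css_soup.find_all("p", class_="body strikeout")
    # [<p class="body strikeout"></p>]

But searching for variants of the string value won't work:

    css_soup.find_all("p", class_="strikeout body")
    # []

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

    css_soup.select("p.strikeout.body")
    # [<p class="body strikeout"></p>]

In older versions of Beautiful Soup, which don't have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for "class" is the string (or regular expression, or whatever) you want to search for:

    soup.find_all("a", attrs={"class": "sister"})
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]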
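The class-matching rules are easier to keep straight when the four cases are counted against the same markup. This sketch assumes html.parser and a two-paragraph fragment invented for the comparison; the select() call additionally assumes SoupSieve is installed, as it is when Beautiful Soup comes from pip.

```python
from bs4 import BeautifulSoup

# Two paragraphs: one with both classes, one with only "body" (made-up markup).
css_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="body"></p>', "html.parser")

any_class = css_soup.find_all("p", class_="body")             # matches either paragraph
exact = css_soup.find_all("p", class_="body strikeout")       # exact attribute string
reordered = css_soup.find_all("p", class_="strikeout body")   # different order: no match
both = css_soup.select("p.strikeout.body")                    # CSS selector, order-free

print(len(any_class), len(exact), len(reordered), len(both))
```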

The string argument

With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

    soup.find_all(string="Elsie")
    # [u'Elsie']

    soup.find_all(string=["Tillie", "Elsie", "Lacie"])
    # [u'Elsie', u'Lacie', u'Tillie']

    soup.find_all(string=re.compile("Dormouse"))
    # [u"The Dormouse's story", u"The Dormouse's story"]

    def is_the_only_string_within_a_tag(s):
        """Return True if this string is the only child of its parent tag."""
        return (s == s.parent.string)

    soup.find_all(string=is_the_only_string_within_a_tag)
    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is "Elsie":

    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

    soup.find_all("a", text="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code only finds the first two:

    soup.find_all("a", limit=2)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:

    soup.html.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.html.find_all("title", recursive=False)
    # []

Here's that part of the document:

    <html><head><title>The Dormouse's story</title></head>
    ...

The <title> tag is beneath the <html> tag, but it's not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it's allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag's immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, string, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn't be very useful.
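A practical use of recursive=False is collecting only the top-level items of a container while ignoring nested ones. The markup in this sketch is invented for the purpose (a <div> wrapping a second paragraph) and parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Made-up document: one paragraph directly in <body>, one nested in a <div>.
markup = ("<html><head><title>The Dormouse's story</title></head>"
          "<body><p>one</p><div><p>two</p></div></body></html>")
soup = BeautifulSoup(markup, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only

all_paragraphs = soup.body.find_all("p")
direct_paragraphs = soup.body.find_all("p", recursive=False)

print(len(deep), len(shallow))
print([p.get_text() for p in all_paragraphs])
print([p.get_text() for p in direct_paragraphs])
```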

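Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it's the same as calling find_all() on that object. These two lines of code are equivalent:

    soup.find_all("a")
    soup("a")

These two lines are also equivalent:

    soup.title.find_all(string=True)
    soup.title(string=True)

find()

Signature: find(name, attrs, recursive, string, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all(), you can use the find() method. These two lines of code are nearly equivalent:

    soup.find_all('title', limit=1)
    # [<title>The Dormouse's story</title>]

    soup.find('title')
    # <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None:

    print(soup.find("nosuchtag"))
    # None

Remember the soup.head.title trick from Navigating using tag names? That trick works by repeatedly calling find():

    soup.head.title
    # <title>The Dormouse's story</title>

    soup.find("head").find("title")
    # <title>The Dormouse's story</title>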
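Because find() returns None when nothing matches, chaining straight off its result can raise AttributeError. One way to guard against that is a small wrapper; text_of_first() below is a hypothetical helper written for this example, not part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser")


def text_of_first(root, tag_name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    found = root.find(tag_name)
    if found is None:   # find() signals "no match" with None
        return default
    return found.get_text()


title_text = text_of_first(soup, "title")
missing_text = text_of_first(soup, "nosuchtag", default="(none)")

print(title_text)
print(missing_text)
```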

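find_parents() and find_parent()

Signature: find_parents(name, attrs, string, limit, **kwargs)

Signature: find_parent(name, attrs, string, **kwargs)

I spent a lot of time above covering find_all() and find(). The Beautiful Soup API defines ten other methods for searching the tree, but don't be afraid. Five of these methods are basically the same as find_all(), and the other five are basically the same as find(). The only differences are in what parts of the tree they search.

First let's consider find_parents() and find_parent(). Remember that find_all() and find() work their way down the tree, looking at a tag's descendants. These methods do the opposite: they work their way up the tree, looking at a tag's (or a string's) parents. Let's try them out, starting from a string buried deep in the "three sisters" document:

    a_string = soup.find(string="Lacie")
    a_string
    # u'Lacie'

    a_string.find_parents("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    a_string.find_parent("p")
    # <p class="story">Once upon a time there were three little sisters; and their names were
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    #  and they lived at the bottom of a well.</p>

    a_string.find_parents("p", class_="title")
    # []

One of the three <a> tags is the direct parent of the string in question, so our search finds it. One of the three <p> tags is an indirect parent of the string, and our search finds that as well. There's a <p> tag with the CSS class "title" somewhere in the document, but it's not one of this string's parents, so we can't find it with find_parents().

You may have made the connection between find_parent() and find_parents(), and the .parent and .parents attributes mentioned earlier. The connection is very strong. These search methods actually use .parents to iterate over all the parents, and check each one against the provided filter to see if it matches.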
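The relationship between find_parents() and .parents can be illustrated by writing the filter by hand. This is only a rough approximation for the simplest case (matching on tag name); parents_named() is a hypothetical helper, and the markup is a reduced, made-up version of the story paragraph.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the story paragraph (an assumption for this example).
snippet = '<html><body><p class="story">Once <a id="link2">Lacie</a></p></body></html>'
soup = BeautifulSoup(snippet, "html.parser")
a_string = soup.find(string="Lacie")


def parents_named(element, name):
    """Hypothetical helper: walk .parents and keep tags with the given name."""
    return [parent for parent in element.parents if parent.name == name]


manual = parents_named(a_string, "p")
built_in = a_string.find_parents("p")
chain = [parent.name for parent in a_string.parents]

print(chain)
print(manual == built_in)
```

The built-in method also accepts attribute, string, and limit filters, which this name-only sketch leaves out.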

find_next_siblings() and find_next_sibling()

Signature: find_next_siblings(name, attrs, string, limit, **kwargs)

Signature: find_next_sibling(name, attrs, string, **kwargs)

These methods use .next_siblings to iterate over the rest of an element's siblings in the tree. The find_next_siblings() method returns all the siblings that match, and find_next_sibling() only returns the first one:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_next_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_next_sibling("p")
    # <p class="story">...</p>

find_previous_siblings() and find_previous_sibling()

Signature: find_previous_siblings(name, attrs, string, limit, **kwargs)

Signature: find_previous_sibling(name, attrs, string, **kwargs)

These methods use .previous_siblings to iterate over an element's siblings that precede it in the tree. The find_previous_siblings() method returns all the siblings that match, and find_previous_sibling() only returns the first one:

    last_link = soup.find("a", id="link3")
    last_link
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_link.find_previous_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_previous_sibling("p")
    # <p class="title"><b>The Dormouse's story</b></p>

find_all_next() and find_next()

Signature: find_all_next(name, attrs, string, limit, **kwargs)

Signature: find_next(name, attrs, string, **kwargs)

These methods use .next_elements to iterate over whatever tags and strings come after an element in the document. The find_all_next() method returns all matches, and find_next() only returns the first match:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_next(string=True)
    # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

    first_link.find_next("p")
    # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was contained within the <a> tag we started from. In the second example, the last <p> tag in the document showed up, even though it's not in the same part of the tree as the <a> tag we started from. For these methods, all that matters is that an element matches the filter and shows up later in the document than the starting element.

find_all_previous() and find_previous()

Signature: find_all_previous(name, attrs, string, limit, **kwargs)

Signature: find_previous(name, attrs, string, **kwargs)

These methods use .previous_elements to iterate over the tags and strings that came before an element in the document. The find_all_previous() method returns all matches, and find_previous() only returns the first match:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_previous("p")
    # [<p class="story">Once upon a time there were three little sisters; ...</p>,
    #  <p class="title"><b>The Dormouse's story</b></p>]

    first_link.find_previous("title")
    # <title>The Dormouse's story</title>

The call to find_all_previous("p") found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the <p> tag that contains the <a> tag we started with. This shouldn't be too surprising: we're looking at all the tags that show up earlier in the document than the one we started with. A <p> tag that contains an <a> tag must have shown up before the <a> tag it contains.

CSS選擇器

從4.7.0版開始，Beautiful Soup通過?SoupSieve?項目。如果你安裝了 Beautiful Soup?pip?同時安裝了soupsieve，因此您不必做任何額外的事情。

BeautifulSoup?有一個?.select()?方法，該方法使用soupsieve對已分析的文檔運行CSS選擇器，并返回所有匹配的元素。?Tag?有一個類似的方法，它針對單個標記的內容運行CSS選擇器。

（早期版本的 Beautiful Soup 也有?.select()?方法，但只支持最常用的CSS選擇器。）

SoupSieve?documentation?列出當前支持的所有CSS選擇器，但以下是一些基本信息：

您可以找到標簽：

soup.select("title") # [<title>The Dormouse's story</title>]soup.select("p:nth-of-type(3)") # [...]

在其他標記下查找標記：：

soup.select("body a") # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("html head title") # [<title>The Dormouse's story</title>]

查找標簽?directly?其他標簽下方：

soup.select("head > title") # [<title>The Dormouse's story</title>]soup.select("p > a") # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("p > a:nth-of-type(2)") # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]soup.select("p > #link1") # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]soup.select("body > a") # []

查找標簽的同級：

soup.select("#link1 ~ .sister") # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("#link1 + .sister") # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

按CSS類查找標記：：

soup.select(".sister") # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]soup.select("[class~=sister]") # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

There's also a method called select_one(), which finds only the first tag that matches a selector:

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you've parsed XML that defines namespaces, you can use them in CSS selectors:

from bs4 import BeautifulSoup
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
 <ns1:child>I'm in namespace 1</ns1:child>
 <ns2:child>I'm in namespace 2</ns2:child>
</tag> """
soup = BeautifulSoup(xml, "xml")

soup.select("child")
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]

soup.select("ns1|child")
# [<ns1:child>I'm in namespace 1</ns1:child>]

When handling a CSS selector that uses namespaces, Beautiful Soup uses the namespace abbreviations it found when parsing the document. You can override this by passing in your own dictionary of abbreviations:

namespaces = dict(first="http://namespace1/", second="http://namespace2/")
soup.select("second|child", namespaces=namespaces)
# [<ns2:child>I'm in namespace 2</ns2:child>]

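All this CSS selector stuff is a convenience for people who already know the CSS selector syntax. You can do all of this with the Beautiful Soup API. And if CSS selectors are all you need, you should parse the document with lxml: it's a lot faster. But this lets you combine CSS selectors with the Beautiful Soup API.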
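As a rough illustration of combining the two, the sketch below selects elements with a CSS selector and then reads them with the ordinary Tag API. The shortened markup and the variable names are made up for this example, and html.parser is used only so that nothing beyond Beautiful Soup itself needs to be installed.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the "three sisters" document (illustrative only).
html_doc = """
<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Step 1: pick elements with a CSS selector.
links = soup.select("p.story > a.sister")

# Step 2: work with each result through the regular Tag API.
hrefs = [a["href"] for a in links]
names = [a.get_text() for a in links]

print(hrefs)
print(names)
```

The selector does the locating and the Tag API does the extracting, so neither has to be contorted to do the other's job.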

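Modifying the tree

Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.

Changing tag names and attributes

I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>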

Modifying .string

If you set a tag's .string attribute, the tag's contents are replaced with the string you give:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.
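The warning is easy to check for yourself. The following is a small sketch, using the same link markup with an explicit html.parser (chosen here only to avoid extra dependencies), that looks for the nested <i> tag before and after the assignment. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# The <a> tag starts out with a nested <i> tag.
had_i_before = tag.i is not None

# Assigning to .string replaces *all* of the tag's contents.
tag.string = "New link text."

# The nested <i> tag is gone along with the old text.
has_i_after = tag.i is not None
result = str(tag)

print(had_i_before, has_i_after)
print(result)
```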

append()

You can add to a tag's contents with Tag.append(). It works just like calling .append() on a Python list:

soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']

extend()

Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which works just like calling .extend() on a Python list:

soup = BeautifulSoup("<a>Soup</a>")
soup.a.extend(["'s", " ", "on"])

soup
# <html><head></head><body><a>Soup's on</a></body></html>
soup.a.contents
# [u'Soup', u"'s", u' ', u'on']

NavigableString() and .new_tag()

If you need to add a string to a document, no problem: you can pass a Python string in to append(), or you can call the NavigableString constructor:

from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']

If you want to create a comment or some other subclass of NavigableString, just call the constructor:

from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag():

soup = BeautifulSoup("<b></b>")
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.

insert()

Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the end of its parent's .contents. It'll be inserted at whatever numeric position you say. It works just like .insert() on a Python list:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts tags or strings immediately before something else in the parse tree:

soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>

The insert_after() method inserts tags or strings immediately following something else in the parse tree:

div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
soup.b
# <b><i>Don't</i> you <div>ever</div>stop</b>
soup.b.contents
# [<i>Don't</i>, u' you ', <div>ever</div>, u'stop']

clear()

Tag.clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

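extract()

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)
# None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract on a child of the element you extracted:

my_string = i_tag.string.extract()
my_string
# u'example.com'

print(my_string.parent)
# None
i_tag
# <i></i>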

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to </a>

replace_with()

PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.
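To make the return value concrete, here is a small sketch that keeps the replaced tag and re-attaches it elsewhere. The extra empty <p> tag and the variable names are made up for the example, and html.parser is named explicitly so the output does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a><p></p>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the element it removed from the tree.
old_tag = soup.a.i.replace_with(new_tag)
detached = old_tag.parent is None

# The removed element can be put back somewhere else.
soup.p.append(old_tag)

result = str(soup)
print(result)
```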

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's good for stripping out markup:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.
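A quick sketch of what that return value looks like in practice, assuming html.parser and the same link markup: the returned tag is the original <i> tag, now empty, because its contents stayed behind in the tree.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# unwrap() moves the children of <i> up into <a> and returns the <i> tag.
removed = soup.a.i.unwrap()

remaining = str(soup.a)
leftover = str(removed)

print(remaining)
print(leftover)
```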

Output

Pretty-printing

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

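Non-pretty printing

If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or on a Tag within it:

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The str() function returns a string encoded in UTF-8. See Encodings for other options. (This describes Python 2. Under Python 3 there is no unicode() builtin, and str() returns a Unicode string.)

You can also call encode() to get a bytestring, and decode() to get Unicode.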
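A minimal sketch of the two calls under Python 3, using a one-line document invented for the example and html.parser as the parser: encode() produces bytes in the encoding you name, and decode() produces a text string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

# encode() returns a bytestring in the requested encoding.
as_bytes = soup.p.encode("utf-8")

# decode() returns a Unicode (text) string.
as_text = soup.p.decode()

print(as_bytes)
print(as_text)
```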

Output formatters

If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they'll be converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back:

str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML:

soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes six possible values for formatter.

The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>

If you pass in formatter="html5", it's the same as formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br":

br_soup = BeautifulSoup("<br>")

print(br_soup.encode(formatter="html"))
# <html><body><br/></body></html>

print(br_soup.encode(formatter="html5"))
# <html><body><br></body></html>

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

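Finally, if you pass in a function for formatter, Beautiful Soup will call that function once for every string and attribute value in the document. You can do whatever you want in this function. Here's a formatter that converts strings to uppercase and does absolutely nothing else:

def uppercase(text):
    return text.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>

print(link_soup.a.prettify(formatter=uppercase))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
#  A LINK
# </a>

If you're writing your own function, you should know about the EntitySubstitution class in the bs4.dammit module. This class implements Beautiful Soup's standard formatters as class methods: the "html" formatter is EntitySubstitution.substitute_html, and the "minimal" formatter is EntitySubstitution.substitute_xml. You can use these functions to simulate formatter="html" or formatter="minimal", but then do something extra.

Here's an example that replaces Unicode characters with HTML entities whenever possible, but also converts all strings to uppercase:

from bs4.dammit import EntitySubstitution

def uppercase_and_substitute_html_entities(text):
    return EntitySubstitution.substitute_html(text.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>

One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting. Beautiful Soup will call the formatter method, just in case you've written a custom method that counts all the strings in the document or something, but it will ignore the return value:

from bs4.element import CData
soup = BeautifulSoup("<a></a>")
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="xml"))
# <a>
#  <![CDATA[one < three]]>
# </a>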

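get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# u'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']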
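As one example of processing the text yourself, the sketch below joins the stripped pieces with single spaces. The joining rule is just an illustration, not something the library prescribes, and html.parser is named so the example runs without extra packages.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Whitespace-only strings are skipped; the rest are stripped at both ends.
pieces = list(soup.stripped_strings)

# Decide for yourself how the pieces should be recombined.
sentence = " ".join(pieces)

print(pieces)
print(sentence)
```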

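Specifying the parser to use

If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it'll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you'd like the markup parsed.

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser. You can override this by specifying one of the following:

What type of markup you want to parse. Currently supported are "html", "xml", and "html5".
The name of the parser library you want to use. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

The section Installing a parser contrasts the supported parsers.

If you don't have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for "lxml" won't work either.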

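Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML:

BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>

Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here's the same document parsed as XML (running this requires that you have lxml installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag:

BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly formed, different parsers will give different results. Here's a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

Here's the same document parsed using html5lib:

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Here's the same document parsed with Python's built-in HTML parser:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the BeautifulSoup constructor. That will reduce the chances that your users parse a document differently from the way you parse it.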
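A minimal sketch of pinning the parser, assuming Python's built-in html.parser is good enough for your documents (it needs no extra installation). The invalid one-line document and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Invalid markup: a stray closing </p> with no matching opening tag.
broken = "<a></p>"

# Naming the parser makes the result independent of whichever optional
# parser libraries happen to be installed on the machine.
soup = BeautifulSoup(broken, "html.parser")
rendered = str(soup)

print(rendered)
```

With the parser named, every machine that runs the script builds the same tree from the same input, whatever else is installed.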

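Encodings

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'

It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

soup.original_encoding
# 'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.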

Here's a document written in ISO-8859-8. The document is so short that Unicode, Dammit can't get a lock on it, and misidentifies it as ISO-8859-7:

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'

We can fix this by passing in the correct from_encoding:

soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso8859-8'

If you don't know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as exclude_encodings:

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'WINDOWS-1255'

Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8, so it's close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)

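In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original: some data was lost. If a document contains �, but .contains_replacement_characters is False, you'll know that the � was there originally (as it is in this paragraph) and doesn't stand in for missing data.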

Output encoding

When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn't in UTF-8 to begin with. Here's a document written in the Latin-1 encoding:

markup = b'''
 <html>
  <head>
   <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
  </head>
  <body>
   <p>Sacr\xe9 bleu!</p>
  </body>
 </html>
'''

soup = BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>

Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding into prettify():

print(soup.prettify("latin-1"))
# <html>
#  <head>
#   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
# ...

You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string:

soup.p.encode("latin-1")
# '<p>Sacr\xe9 bleu!</p>'

soup.p.encode("utf-8")
# '<p>Sacr\xc3\xa9 bleu!</p>'

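Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Here's a document that includes the Unicode character SNOWMAN:

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for those encodings:

print(tag.encode("utf-8"))
# <b>☃</b>

print(tag.encode("latin-1"))
# <b>&#9731;</b>

print(tag.encode("ascii"))
# <b>&#9731;</b>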

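Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode:

from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install the chardet or cchardet Python libraries. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't use.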

Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

markup = b"I just \x93love\x94 Microsoft Word\x92s smart quotes"

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes'

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes'

You can also convert Microsoft smart quotes to ASCII quotes:

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'I just "love" Microsoft Word\'s smart quotes'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'I just \u201clove\u201d Microsoft Word\u2019s smart quotes'

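Inconsistent encodings

Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8. Here's a simple example:

snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

print(doc)
# ☃☃☃�I like snowmen!�

print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives you gibberish. Fortunately, UnicodeDammit.detwingle() will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:

new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.

Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into the BeautifulSoup or UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole document is Windows-1252, and the document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.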
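The ordering described above can be sketched as follows, under Python 3. The <p> wrapper, the variable names, and the choice of html.parser are assumptions made for the example. from_encoding="utf-8" is passed because, after detwingle(), the bytes are known to be pure UTF-8.

```python
from bs4 import BeautifulSoup, UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# A byte string that mixes UTF-8 (snowmen) with Windows-1252 (quotes).
doc = b"<p>" + snowmen.encode("utf8") + quote.encode("windows-1252") + b"</p>"

# Step 1: normalize the bytes to pure UTF-8 first.
clean = UnicodeDammit.detwingle(doc)

# Step 2: only then hand the data to BeautifulSoup.
soup = BeautifulSoup(clean, "html.parser", from_encoding="utf-8")
text = soup.p.get_text()

print(text)
```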

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.

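Comparing objects for equality

Beautiful Soup says that two NavigableString or Tag objects are equal when they represent the same HTML or XML markup. In this example, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>pizza</b>":

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)
# True

print(first_b.previous_element == second_b.previous_element)
# False

If you want to see whether two variables refer to exactly the same object, use is:

print(first_b is second_b)
# False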

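Copying Beautiful Soup objects

You can use copy.copy() to create a copy of any Tag or NavigableString:

import copy
p_copy = copy.copy(soup.p)
print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

The copy is considered equal to the original, since it represents the same markup as the original, but it's not the same object:

print(soup.p == p_copy)
# True

print(soup.p is p_copy)
# False

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it:

print(p_copy.parent)
# None

This is because two different Tag objects can't occupy the same space at the same time.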
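One practical consequence, sketched below with a shortened version of the pizza markup and html.parser: because the copy is a separate, detached tree, editing it leaves the original untouched. The variable names are illustrative.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I want <b>pizza</b></p>", "html.parser")

# The copy has its own child tags, separate from the original's.
p_copy = copy.copy(soup.p)
p_copy.b.string = "salad"

original = str(soup.p)
edited = str(p_copy)

print(original)
print(edited)
```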

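Parsing only part of a document

Let's say you want to use Beautiful Soup to look at a document's <a> tags. It's a waste of time and memory to parse the entire document and then go over it again looking for <a> tags. It would be much faster to ignore everything that wasn't an <a> tag in the first place. The SoupStrainer class lets you choose which parts of an incoming document are parsed. You just create a SoupStrainer and pass it in to the BeautifulSoup constructor as the parse_only argument.

(Note that this feature won't work if you're using the html5lib parser. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)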

SoupStrainer

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:

from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

I'm going to bring back the "three sisters" document one more time, and we'll see what the document looks like when it's parsed with these three SoupStrainer objects:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#

You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn't terribly useful, but I thought I'd mention it:

soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']

Troubleshooting

diagnose()

If you're having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you're missing a parser that Beautiful Soup could be using:

from bs4.diagnose import diagnose
with open("bad.html") as fp:
    data = fp.read()
diagnose(data)

# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...

Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.

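Errors when parsing a document

There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It's because Beautiful Soup doesn't include any parsing code. Instead, it relies on external parsers. If one parser isn't working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.

The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.

The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib.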

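Version mismatch problems

SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = u'[document]') - Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
ImportError: No module named HTMLParser - Caused by running the Python 2 version of Beautiful Soup under Python 3.
ImportError: No module named html.parser - Caused by running the Python 3 version of Beautiful Soup under Python 2.
ImportError: No module named BeautifulSoup - Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
ImportError: No module named bs4 - Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.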

Parsing XML

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in "xml" as the second argument to the BeautifulSoup constructor:

soup = BeautifulSoup(markup, "xml")

You'll need to have lxml installed.

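Other parser problems

If your script works on one computer but not another, or in one virtual environment but not another, or outside the virtual environment but not inside, it's probably because the two environments have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.
Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML.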

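Miscellaneous

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn't know how to display. (See this page on the Python wiki for help.) Second, when you're writing to a file and you pass in a Unicode character that's not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
KeyError: [attr] - Caused by accessing tag['attr'] when the tag in question doesn't define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class'. Use tag.get('attr') if you're not sure attr is defined, just as you would with a Python dictionary.
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings: a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
AttributeError: 'NoneType' object has no attribute 'foo' - This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn't find anything, so it returned None, instead of returning a tag or a string. You need to figure out why your find() call isn't returning anything.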
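The last three errors can usually be avoided with a few defensive habits. The sketch below shows one way to do that, using a tiny made-up document and html.parser. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">no href here</a>', "html.parser")

# KeyError: use .get() when an attribute may be missing.
tag = soup.find("a")
href = tag.get("href")          # None instead of raising KeyError

# 'NoneType' AttributeError: check what find() returned before using it.
missing = soup.find("table")    # None, because there is no <table>
table_id = missing.get("id") if missing is not None else None

# 'ResultSet' AttributeError: iterate over what find_all() returns.
ids = [t.get("id") for t in soup.find_all("a")]

print(href, table_id, ids)
```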

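Improving performance

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.

That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the cchardet library.

Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.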
