
Parsing Web Pages with BeautifulSoup

Published: 2025/4/14


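Introduction and Installation

Beautiful Soup is an HTML/XML parser; its main job is parsing HTML/XML documents and extracting data from them.
Parsing HTML with BeautifulSoup is fairly simple and the API is very friendly. It supports CSS selectors and the HTML parser in the Python standard library, and it also supports lxml's XML parser.
Beautiful Soup 3 is no longer being developed, so current projects should use Beautiful Soup 4. Install it with pip:
pip install beautifulsoup4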
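As a rough illustration of the parser choice mentioned above, the sketch below builds a soup with the standard-library parser and, if available, with lxml. The `snippet` string and the `make_soup` helper are made up for this example and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup, used only for this illustration.
snippet = "<p>Hello, <b>soup</b></p>"

# "html.parser" ships with Python, so nothing extra has to be installed.
soup = BeautifulSoup(snippet, "html.parser")
print(soup.b.get_text())  # soup


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser (here lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


print(make_soup(snippet).p.get_text())  # Hello, soup
```

Naming the parser explicitly keeps results the same from one machine to another, since different parsers can repair broken markup differently.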

The Four Kinds of Objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:

Tag
NavigableString
BeautifulSoup
Comment
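A minimal sketch of where each of the four kinds shows up, assuming a one-line piece of markup invented for this example:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, some text, and a comment.
markup = "<p class='title'><b>The Dormouse's story</b><!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                  # an element in the tree
text = soup.b.string          # the text inside <b>
comment = soup.p.contents[1]  # second child of <p>: the comment node

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# Comment is a special kind of NavigableString.
print(isinstance(comment, NavigableString))  # True
```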

1. Tag

Put simply, a Tag is one of the tags in the HTML.

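# Assumes soup was parsed from markup whose first <p> is the one shown in the output below.
print(soup.p)        # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
print(type(soup.p))  # <class 'bs4.element.Tag'>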

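Writing soup followed by a tag name is an easy way to get that tag and its contents; note that it returns only the first tag with that name. The resulting objects are of type bs4.element.Tag.
A Tag has two important attributes: name and attrs.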

print(soup.name)        # [document]  (the soup object itself is special: its name is [document])
print(soup.head.name)   # head  (for any other tag, the value is the tag's own name)
print(soup.p.attrs)     # {'class': ['title'], 'name': 'dromouse'}
print(soup.p['class'])  # ['title']
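Since attrs behaves like a dictionary, attributes can also be read safely, changed, and deleted. The sketch below assumes a single `<p>` tag matching the output shown above; it is only an illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup, chosen to match the attrs output shown in the text.
markup = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

# get() returns None for a missing attribute instead of raising KeyError.
print(p.get("class"))  # ['title']
print(p.get("id"))     # None

# Attributes can be modified and removed like dictionary entries.
p["class"] = "heading"
del p["name"]
print(p.attrs)  # {'class': 'heading'}
```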

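2. NavigableString

A NavigableString is, simply put, a string that you can navigate from: it holds the text inside a tag and still knows its place in the tree.
For example: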

print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>
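A short sketch of what "navigable" means in practice, using markup invented for this example: the string keeps a link to its parent tag, can be converted to a plain str, and .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has several children, <b> has exactly one.
markup = "<p><b>The Dormouse's story</b> and <i>more</i></p>"
soup = BeautifulSoup(markup, "html.parser")

title = soup.b.string
print(title.parent.name)  # b

# Convert to a plain str when the link back to the tree is not needed.
plain = str(title)
print(type(plain).__name__)  # str

# .string is None when the tag does not contain exactly one child.
print(soup.p.string)  # None
```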

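Searching the Document

Beautiful Soup defines many search methods. This section focuses on two of them, find() and find_all(); the other methods take similar arguments and are used in a similar way.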

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise Beautiful Soup picks one itself and emits a warning.
soup = BeautifulSoup(html_doc, 'html.parser')
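To make the difference between the two methods concrete, here is a small sketch on a shortened fragment of the sample document (the `links_html` name is made up for this example): find() returns the first match or None, while find_all() always returns a list.

```python
from bs4 import BeautifulSoup

# A shortened fragment of the sample document, for illustration only.
links_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(links_html, "html.parser")

first = soup.find("a")          # first match only
all_links = soup.find_all("a")  # every match, as a list
print(first["id"])      # link1
print(len(all_links))   # 3

# When nothing matches, find() gives None and find_all() an empty list.
print(soup.find("table"))      # None
print(soup.find_all("table"))  # []

# limit stops the search early.
two = soup.find_all("a", limit=2)
print([a["id"] for a in two])  # ['link1', 'link2']
```

Because find() can return None, it is worth checking the result before accessing attributes on it.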

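Methods such as find_all let you locate the parts of the document you want.
Before looking at find_all itself, here are the kinds of filters it accepts.

Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly.
For example: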

# Find all <b> tags.
soup.find_all('b')  # [<b>The Dormouse's story</b>]
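A quick sketch of what "exact match" implies, on markup invented for this example: the string 'b' selects `<b>` but not `<body>`, even though both names start with the same letter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with both a <body> and a <b> tag.
markup = "<html><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# A string filter compares against the whole tag name.
names = [tag.name for tag in soup.find_all("b")]
print(names)  # ['b']
```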

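Regular Expressions

find_all also accepts a compiled regular expression. Beautiful Soup tests each tag name with the pattern's search() method, so the pattern can match anywhere in the name unless it is anchored with ^ or $.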

import re

# Match tags whose names start with b
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# body
# b

# Match tags whose names contain t
for tag in soup.find_all(re.compile('t')):
    print(tag.name)
# html
# title
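Regular expressions can filter attribute values as well as tag names. The sketch below assumes two invented links and selects the ones whose href starts with https.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical links, for illustration only.
markup = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="https://example.org/lacie">Lacie</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# Passing a pattern as a keyword argument filters on that attribute's value.
secure = soup.find_all("a", href=re.compile(r"^https://"))
print([a.get_text() for a in secure])  # ['Lacie']
```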

Lists

find_all also accepts a list, in which case Beautiful Soup returns everything that matches any item in the list.

# Find all <a> and <b> tags
for tag in soup.find_all(['a', 'b']):
    print(tag.name)
# b
# a
# a
# a
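One typical use of a list filter is collecting headings of several levels in document order. This is only a sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical page outline.
markup = "<h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p>"
soup = BeautifulSoup(markup, "html.parser")

# Results come back in the order the tags appear in the document.
headings = soup.find_all(["h1", "h2"])
print([h.get_text() for h in headings])  # ['Title', 'Section']
```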

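Functions

If none of these filters fits, you can define your own function. It takes a single element as its argument and returns True when the element matches.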

# Match tags that have a class attribute but no id attribute.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print([tag.name for tag in soup.find_all(has_class_but_no_id)])  # ['p', 'p', 'p']
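A function filter is handy when the condition combines several checks. The sketch below uses invented links and a hypothetical `is_offsite_link` function to pick out anchors that point away from example.com.

```python
from bs4 import BeautifulSoup

# Hypothetical links, for illustration only.
markup = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="http://other.example/page">Other</a>'
    '<a name="anchor">No href</a>'
)
soup = BeautifulSoup(markup, "html.parser")


def is_offsite_link(tag):
    """Return True for <a> tags whose href does not start with example.com."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and not tag["href"].startswith("http://example.com/")
    )


offsite = soup.find_all(is_offsite_link)
print([a.get_text() for a in offsite])  # ['Other']
```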

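CSS Selectors

CSS selectors are another way to search, and they achieve much the same results as find_all.

In CSS, a tag name is written with no prefix, a class name is prefixed with ., and an id is prefixed with #.
We can filter elements the same way here using soup.select(), which returns a list.

1. Finding by tag name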

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]

2. Finding by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. Finding by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
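When only one element is expected, as with an id lookup, select_one() may be more convenient than indexing into the list from select(). A small sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links sharing a class.
markup = (
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# select_one() returns the first matching element, or None if nothing matches.
link = soup.select_one("#link2")
print(link.get_text())              # Lacie
print(soup.select_one("#missing"))  # None

# select() always returns a list, even for a single match.
print(len(soup.select(".sister")))  # 2
```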

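4. Combined selectors

Combined selectors work the same way as combining tag names, class names, and ids when writing a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space.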

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# To match only direct children, separate the parts with >
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
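The difference between a space and > only shows up when elements are nested more than one level deep. The sketch below uses made-up markup with one link directly under a div and another wrapped in a p.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: 'inner' is a grandchild of <div>, 'outer' a direct child.
markup = "<div><p><a id='inner'>x</a></p><a id='outer'>y</a></div>"
soup = BeautifulSoup(markup, "html.parser")

# A space matches descendants at any depth.
descendants = [a["id"] for a in soup.select("div a")]
print(descendants)  # ['inner', 'outer']

# > matches direct children only.
children = [a["id"] for a in soup.select("div > a")]
print(children)  # ['outer']
```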

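5. Finding by attribute

A selector can also include attributes, written inside square brackets. The attribute and the tag refer to the same node, so there must be no space between them; otherwise nothing will match.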

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Attribute selectors can be combined with the forms above:
# parts on different nodes are separated by a space, parts on the same node are not.
print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
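Beyond exact values, the standard CSS attribute operators can test for the mere presence of an attribute or match the end of its value. The following sketch, on invented markup, also shows the stray-space pitfall described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links with href, one without.
markup = (
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '<a id="nolink">No link</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# [href] matches any <a> that has the attribute, whatever its value.
with_href = [a["id"] for a in soup.select("a[href]")]
print(with_href)  # ['link1', 'link3']

# $= matches the end of the attribute value.
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]
print(ends_with)  # ['link3']

# With a space, the selector looks for a descendant of <a>, so nothing matches.
with_space = soup.select('a [class="sister"]')
print(with_space)  # []
```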

6. Getting the content

In every case above, select returns its results as a list, so you can iterate over them and call get_text() on each element to get its text.

soup = BeautifulSoup(html_doc, 'lxml')  # the 'lxml' parser requires the lxml package
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
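get_text() also takes a separator and a strip flag, which helps when the markup carries extra whitespace. A minimal sketch on an invented list:

```python
from bs4 import BeautifulSoup

# Hypothetical list with padding around each name.
markup = "<ul><li> Elsie </li><li> Lacie </li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims whitespace from each piece of text.
names = [li.get_text(strip=True) for li in soup.select("li")]
print(names)  # ['Elsie', 'Lacie']

# The first argument is the separator placed between pieces of text.
joined = soup.ul.get_text(", ", strip=True)
print(joined)  # Elsie, Lacie
```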

===============================================================================================

Find by descending through tags: soup.select("body a")
Find the direct children of a tag: soup.select("head > title")
Find by CSS class name: soup.select(".sister")
Find by a tag's id: soup.select("#link1")
Find by whether an attribute exists: soup.select('a[href]')
Find by an attribute's value: soup.select('a[href="http://example.com/elsie"]')

Reposted from: https://www.cnblogs.com/pythoner6833/p/8960785.html
