Python third-party module beautifulsoup (bs4): parsing HTML

Published: 2023/12/20

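Simply put, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official documentation describes it as follows:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.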

Installation

pip3 install beautifulsoup4

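Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser (html.parser). The lxml parser is more capable and faster, so installing it is recommended.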

pip3 install lxml

Another option is html5lib, a parser written in pure Python that parses pages the same way a browser does. It can be installed with:

pip3 install html5lib

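Parser comparison

BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured tree. You can then use the methods it provides to locate specific elements quickly, which makes finding elements in HTML or XML simple.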
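As a minimal sketch of the parse-then-search flow described above, the following parses a short string and looks up one element. The markup is a made-up sample, and the parser is Python's built-in "html.parser" (an assumption chosen so that nothing beyond beautifulsoup4 has to be installed); "lxml" or "html5lib" can be passed in the same position once those packages are available.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
markup = "<p>Hello, <b>world</b></p>"

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" and "html5lib" need their packages installed first.
soup = BeautifulSoup(markup, "html.parser")

first_bold = soup.find("b")
print(first_bold.name)        # the tag name
print(first_bold.get_text())  # the text inside the tag
print(soup.prettify())        # the parsed tree, re-indented
```

Naming the parser explicitly keeps the result the same across machines, because different parsers can build different trees from the same broken markup.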

Usage

# Simple example
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
<div class="title"><b>The Dormouse's story總共</b><h1>f</h1></div>
<div class="story">Once upon a time there were three little sisters; and their names were
<a class="sister0" id="link1">Els<span>f</span>ie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")
tag1 = soup.find(name='a')       # find the first <a> tag
tag2 = soup.find_all(name='a')   # find all <a> tags
tag3 = soup.select('#link2')     # find the tag(s) with id="link2"

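find(name, attrs, recursive, text, **kwargs)
find_all(name, attrs, recursive, text, limit, **kwargs)

import re

# name: matches every tag whose name is name; string objects are ignored.
soup.find_all('title')

# Keyword arguments: a keyword argument that is not one of the built-in argument names
# is treated as a filter on the tag attribute of that name.
soup.find_all(id='link2')

# Several keyword arguments can be combined to filter on multiple attributes at once:
soup.find_all(href=re.compile('elsie'), id='link1')

# Some attributes cannot be used as keyword arguments, for example HTML5 data-* attributes,
# but they can be searched by passing a dict as the attrs argument of find_all().
# (data_soup is a soup built from markup that contains a data-foo attribute.)
data_soup.find_all(attrs={'data-foo': 'value'})

# recursive: to search only a tag's direct children, pass recursive=False.
# limit: limits the number of results returned.
# text: searches the string content of the document. Like name, it accepts a string,
# a regular expression, or a list. Since Beautiful Soup 4.4.0 this argument is called string.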
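The arguments listed above can be tried side by side. This is a sketch on made-up markup, not the article's html_doc, using "html.parser" and the string= spelling of the text argument (which assumes Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
markup = """
<div id="box" data-foo="value">
  <a id="link1" href="http://example.com/elsie">Elsie</a>
  <p><a id="link2" href="http://example.com/lacie">Lacie</a></p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
box = soup.find("div")

by_id = soup.find_all(id="link2")                             # keyword -> attribute filter
by_two = soup.find_all(href=re.compile("elsie"), id="link1")  # two attributes at once
by_data = soup.find_all(attrs={"data-foo": "value"})          # data-* goes through attrs
direct = box.find_all("a", recursive=False)                   # direct children of <div> only
limited = soup.find_all("a", limit=1)                         # stop after the first match
by_text = soup.find_all(string=["Elsie", "Lacie"])            # returns strings, not tags

print(by_id, by_two, by_data, direct, limited, by_text, sep="\n")
```

Note that recursive=False skips link2 because that tag sits inside a p element rather than directly under the div.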

Usage examples:

text: the text content

print(soup.text)
print(soup.get_text())
# Both forms return all the text in the document

name: the tag name

tag = soup.find('a')
name = tag.name      # get the tag name
print(name)
tag.name = 'span'    # set the tag name
print(soup)

attrs: the tag attributes

tag = soup.find('a')
attrs = tag.attrs            # get the attributes as a dict
print(attrs)
tag.attrs = {'ik': 123}      # replace all attributes
tag.attrs['id'] = 'iiiii'    # set a single attribute
print(soup)
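Attributes can also be read and written with dictionary-style access on the tag itself. The sketch below uses a made-up one-tag document and "html.parser" (both assumptions for illustration) and shows that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup(
    '<a class="sister big" id="link1" href="/x">Elsie</a>', "html.parser"
)
tag = soup.find("a")

classes = tag["class"]   # multi-valued attribute, returned as a list
tag["id"] = "renamed"    # same effect as tag.attrs["id"] = ...
del tag["href"]          # remove one attribute
tag.name = "span"        # rename the tag in place

print(classes)
print(soup)
```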

children: all direct child nodes

body = soup.find('body')
v = body.children

descendants: all descendant nodes, at any depth

body = soup.find('body')
v = body.descendants
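The difference between children and descendants is easiest to see on a small tree. This sketch uses made-up markup and "html.parser" (assumptions for illustration). Both properties are iterators that yield text nodes as well as tags, so the example filters with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<body><div><a>x</a></div><p>y</p></body>", "html.parser")
body = soup.find("body")

# children: one level down only
child_tags = [n.name for n in body.children if isinstance(n, Tag)]

# descendants: every level, in document order
descendant_tags = [n.name for n in body.descendants if isinstance(n, Tag)]
descendant_strings = [str(n) for n in body.descendants if not isinstance(n, Tag)]

print(child_tags)
print(descendant_tags)
print(descendant_strings)
```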

clear: removes everything inside a tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

decompose: removes the tag and everything inside it

body = soup.find('body')
body.decompose()
print(soup)

extract: removes the tag and everything inside it from the tree, and returns the removed tag

body = soup.find('body')
v = body.extract()
print(soup)
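clear, decompose, and extract are easy to confuse, so here is a sketch that applies one of each to three sibling tags. The markup is made up and the parser is "html.parser" (assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
markup = '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(markup, "html.parser")

soup.find(id="a").clear()              # empties the tag but keeps <p id="a">
soup.find(id="b").decompose()          # removes the tag and destroys it
removed = soup.find(id="c").extract()  # removes the tag and hands it back

print(soup)
print(removed)
```

extract is the one to use when the removed subtree is still needed, for example to insert it somewhere else.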

decode: converts to a string (including the current tag); decode_contents does the same without the current tag

body = soup.find('body')
v = body.decode()
v = body.decode_contents()
print(v)

encode: converts to bytes (including the current tag); encode_contents does the same without the current tag

body = soup.find('body')
v = body.encode()
v = body.encode_contents()
print(v)
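The four conversion methods differ only in return type and in whether the outer tag is included. A sketch on made-up markup with "html.parser" (assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<div><b>hi</b></div>", "html.parser")
div = soup.find("div")

outer_text = div.decode()            # str, includes the <div> itself
inner_text = div.decode_contents()   # str, only what is inside the <div>
outer_bytes = div.encode()           # bytes, UTF-8 by default
inner_bytes = div.encode_contents()  # bytes, contents only

print(outer_text, inner_text, outer_bytes, inner_bytes, sep="\n")
```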

find: returns the first matching tag

tag = soup.find('a')
print(tag)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tag)

find_all: returns all matching tags

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)

# ####### Lists #######
# v = soup.find_all(name=['a','div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))

# v = soup.find_all(id=['link1','link2'])
# print(v)

# v = soup.find_all(href=['link1','link2'])
# print(v)

# ####### Regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### Filtering with a function #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ####### get: read a tag attribute #######
# tag = soup.find('a')
# v = tag.get('id')
# print(v)
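The list, regular-expression, and function filters can be run as live code rather than left commented out. This sketch uses a cut-down version of the sample document and "html.parser" (assumptions for illustration). The function name has_class_and_id is illustrative.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
markup = """
<div class="story">
  <a class="sister0" id="link1">Elsie</a>
  <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
  <p class="story">...</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

by_list = soup.find_all(name=["a", "p"])                # any of several tag names
by_regex = soup.find_all(class_=re.compile("^sister"))  # class matching a pattern


def has_class_and_id(tag):
    """Return True for tags that carry both a class and an id attribute."""
    return tag.has_attr("class") and tag.has_attr("id")


by_func = soup.find_all(has_class_and_id)

first_a = soup.find("a")
first_id = first_a.get("id")      # attribute value
first_href = first_a.get("href")  # None, because the first <a> has no href

print(by_list, by_regex, by_func, first_id, first_href, sep="\n")
```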

has_attr: checks whether the tag has the given attribute

tag = soup.find('a')
v = tag.has_attr('id')
print(v)

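get_text: returns the text inside the tag

tag = soup.find('a')
v = tag.get_text()   # an optional first argument is the separator placed between text pieces, not an attribute name
print(v)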
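The first positional argument of get_text is a separator, which is easy to mistake for an attribute name. A sketch on made-up markup with "html.parser" (assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<a id='link1'>Els<span>f</span>ie</a>", "html.parser")
tag = soup.find("a")

plain = tag.get_text()      # text pieces joined with nothing between them
joined = tag.get_text("|")  # text pieces joined with the separator

# strip=True trims each piece before joining.
spaced = BeautifulSoup("<p> a <b> b </b></p>", "html.parser")
tidy = spaced.get_text(" ", strip=True)

print(plain, joined, tidy, sep="\n")
```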

index: returns the position of a child within its parent tag

tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):
    print(i, v)

is_empty_element: whether the tag is an empty (void) or self-closing element

That is, whether it is one of the following tags: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')
v = tag.is_empty_element
print(v)
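A tag with no children is not automatically an "empty element". The sketch below contrasts a br with a p that happens to have no content, using made-up markup and "html.parser" (assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<div>ad<br/>sf<p></p></div>", "html.parser")

br_is_empty = soup.find("br").is_empty_element  # void element
p_is_empty = soup.find("p").is_empty_element    # no children, but not a void element

print(br_is_empty, p_is_empty)
```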

Elements related to the current tag

soup.next
soup.next_element
soup.next_elements
soup.next_sibling
soup.next_siblings

tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings

tag.parent
tag.parents

Searching among a tag's related elements

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# These take the same arguments as find_all
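The plain navigation attributes return whatever node comes next, which is often a text node, while the find_* variants skip ahead to a matching tag. A sketch on made-up markup with "html.parser" (assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
markup = (
    "<div><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a>"
    " and <a id='link3'>Tillie</a></div>"
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="link1")

raw_sibling = first.next_sibling          # the text node ", ", not a tag
next_tag = first.find_next_sibling("a")   # skips text, returns the next <a>
later_ids = [t["id"] for t in first.find_next_siblings("a")]
parent_name = first.find_parent("div").name
prev_of_last = soup.find(id="link3").find_previous_sibling("a")

print(repr(raw_sibling), next_tag, later_ids, parent_name, prev_of_last, sep="\n")
```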

select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

from bs4.element import Tag

# Note: _candidate_generator is a private argument of older Beautiful Soup releases.
# It is not part of the soupsieve-based select() in 4.7.0 and later, which may reject or ignore it.
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

# The same search, limited to one result
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
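A few of the selectors above, run against markup where every link sits inside a p so that the child and sibling combinators have something to match. The markup is adapted from the official "three sisters" sample and parsed with "html.parser" (assumptions for illustration). The sketch also assumes a Beautiful Soup release whose select() accepts limit.

```python
from bs4 import BeautifulSoup

# Sample markup adapted for illustration; not identical to the article's html_doc.
markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(markup, "html.parser")

title = soup.select_one("head > title").get_text()                # first match only
second = [a["id"] for a in soup.select("p > a:nth-of-type(2)")]   # second <a> child of <p>
after_first = [a["id"] for a in soup.select("#link1 ~ .sister")]  # all later siblings
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]     # the very next sibling
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]   # href suffix match
limited = soup.select("a", limit=1)                               # at most one result

print(title, second, after_first, adjacent, ends_with, limited, sep="\n")
```

select always returns a list, even for an id selector, while select_one returns a single tag or None.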

Tag contents

# tag = soup.find('span')
# print(tag.string)            # get
# tag.string = 'new content'   # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings     # recursively collects the text of everything inside the tag
# print(v)
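The string attribute only has a value when a tag has exactly one child, which is why reading it on a container such as body usually gives None. A sketch on made-up markup with "html.parser" (assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<div><span>old</span> <b> bold </b></div>", "html.parser")

span = soup.find("span")
before = str(span.string)    # the single child string
span.string = "new content"  # replaces everything inside the tag

div = soup.find("div")
div_string = div.string      # None, because <div> has several children

# stripped_strings is a generator: whitespace-only pieces are skipped
# and the rest are trimmed.
texts = list(div.stripped_strings)

print(before, div_string, texts, sep="\n")
```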

append: appends a tag inside the current tag

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am a newcomer'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

insert: inserts a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am a newcomer'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

insert_after, insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am a newcomer'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

replace_with: replaces the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = 'I am a newcomer'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

wrap: wraps the current tag in the given tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = 'I am a newcomer'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

unwrap: removes the current tag but keeps its contents

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)
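The tree-editing methods covered above (append, insert, wrap, unwrap, replace_with) can be chained on one small document. This sketch creates new tags with soup.new_tag(), the documented factory method, rather than constructing Tag objects directly. That is a substitution for the article's approach, and the markup and "html.parser" are likewise assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<div><a>link</a></div>", "html.parser")
div = soup.find("div")

new_i = soup.new_tag("i", id="it")
new_i.string = "new"
div.append(new_i)               # added as the last child of <div>

first = soup.new_tag("b")
first.string = "first"
div.insert(0, first)            # added at position 0 inside <div>

a = soup.find("a")
a.wrap(soup.new_tag("span"))    # <a> now sits inside a new <span>
a.unwrap()                      # <a> is removed, its text stays in <span>

swapped = soup.new_tag("em")
swapped.string = "swapped"
new_i.replace_with(swapped)     # <i> is replaced by <em>

print(soup)
```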

For more options, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
