[Python Crawler] Learning BeautifulSoup

Published: 2023/12/19


Beautiful Soup

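Basic introduction

Beautiful Soup is an HTML/XML parsing library, used mainly to parse HTML/XML documents and extract data from them.

It works in the style of the HTML DOM: it loads the whole document and builds a complete tree of Python objects for it, so its time and memory costs are noticeably higher than using lxml (XPath) directly.

Beautiful Soup makes parsing HTML simple and its API is very friendly. It supports CSS selectors, and it can use the HTML parser in the Python standard library as well as the lxml HTML and XML parsers.

Although BeautifulSoup4 is easy to pick up, its matching speed is well below that of regular expressions and XPath. In the author's view it is therefore generally not the first choice, and regular expressions are recommended instead.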

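Installation and usage

Install Beautiful Soup: pip install beautifulsoup4

Import it in code: from bs4 import BeautifulSoup

Create a Beautiful Soup object: soup = BeautifulSoup(html, 'html.parser')

Here html is the document to parse and 'html.parser' selects the parser. The second argument can name the type of markup to parse (currently "html", "xml", and "html5") or a specific parser (currently "lxml", "html5lib", and "html.parser").

If you only want to parse an HTML document, you can create the BeautifulSoup object from the document alone. Beautiful Soup then picks a parser automatically, in the priority order lxml, html5lib, html.parser. Because the result then depends on which parsers happen to be installed, naming the parser explicitly is the safer habit.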

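The main parsers, with their advantages and disadvantages:

Python standard library: BeautifulSoup(markup, "html.parser"). Bundled with Python, decent speed; slower than lxml and less lenient than html5lib.

lxml HTML parser: BeautifulSoup(markup, "lxml"). Very fast; requires an external C library.

lxml XML parser: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml"). Very fast and the only XML parser currently supported; requires an external C library.

html5lib: BeautifulSoup(markup, "html5lib"). Extremely lenient, parses pages the way a browser does and produces valid HTML5; very slow, and requires an external Python package.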

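If the parser you ask for is not installed, do not count on a silent fallback: recent bs4 releases raise bs4.FeatureNotFound in that case. At present only lxml can parse XML documents, so without the lxml library installed no parsed object can be obtained for an XML document, whether or not lxml is requested explicitly.

If an HTML or XML document is malformed, different parsers may return different results. See the section on differences between parsers in the official documentation for details.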
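The parser choice described above can be sketched as follows. This is a minimal illustration, not a recommended pattern from the official documentation: the helper name make_soup and the sample markup are hypothetical, and it assumes that a missing parser raises bs4.FeatureNotFound, as recent bs4 releases do.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for this illustration.
html = "<p class='title'><b>Hello</b></p>"


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Try the preferred parser; fall back to the bundled one if it is missing."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed, so use the standard-library parser.
        return BeautifulSoup(markup, fallback)


soup = make_soup(html)
print(soup.b.string)
```

Naming the fallback explicitly keeps the behaviour the same on every machine, at the cost of slightly different trees for malformed input depending on which parser ends up being used.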

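Basic elements of the Beautiful Soup library

How to think about the library: Beautiful Soup parses, traverses, and maintains a "tag tree" that corresponds to the full content of one HTML/XML document.

Basic elements of the BeautifulSoup class:

Tag: a tag, the most basic unit of information, opened with <name> and closed with </name>;

Name: the name of the tag, e.g. the name of <p>…</p> is 'p'; access: <tag>.name;

Attributes: the attributes of the tag, organized as a dictionary; access: <tag>.attrs;

NavigableString: the non-attribute string inside a tag, i.e. the text between <name> and </name>; access: <tag>.string;

Comment: the comment part of the string inside a tag, a special kind of NavigableString object.

Let's put this into practice.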

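# Import the bs4 library
from bs4 import BeautifulSoup
import requests  # used to fetch the page

r = requests.get('https://python123.io/ws/demo.html')  # demo URL
demo = r.text  # the fetched page text
demo

# Output (attributes of the second link abbreviated as ...)
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a ...>Advanced Python</a>.</p>\r\n</body></html>'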

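# Parse the HTML page
soup = BeautifulSoup(demo, 'html.parser')  # the fetched page text; the parser bs4 should use

# Print the parsed HTML with indentation that reflects the nesting
print(soup.prettify())

Output (attributes of the second link abbreviated as ...):

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a ...>
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>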

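1) Tag: any tag can be reached with soup.<tag>.

When the HTML document contains several tags with the same name, soup.<tag> returns the first one.

soup.a  # access the <a> tag
# <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>

soup.title
# <title>This is a python demo page</title>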

2) Tag name: every tag has its own name, obtained with <tag>.name. It is a string.

soup.a.name, soup.a.parent.name, soup.p.parent.name

# ('a', 'p', 'body')

3) Tag attributes: a tag can have zero or more attributes, stored as a dictionary in <tag>.attrs.

tag = soup.a
print(tag.attrs)
# Two ways to read an attribute value
print(tag.attrs['class'], tag['class'])
print(type(tag.attrs))

# Output
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1'] ['py1']
<class 'dict'>
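A small sketch of attribute access on a self-contained tag may help here. The markup and the URL in it are made up for the illustration; the point is that class is multi-valued and comes back as a list, while .get() avoids a KeyError for attributes that are absent.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this illustration.
html = '<a href="http://example.com/course" class="py1 external" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

classes = tag["class"]           # multi-valued attribute: a list of strings
link_id = tag["id"]              # single-valued attribute: a plain string
title = tag.get("title")         # missing attribute: None instead of KeyError
has_href = tag.has_attr("href")  # membership test without reading the value

print(classes, link_id, title, has_href)
```

Using tag.get() is usually the safer choice when scraping real pages, where an expected attribute may be missing on some tags.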

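4) NavigableString: the non-attribute string inside a tag, read with <tag>.string. .string can reach through several levels of nesting, as long as each level has exactly one child.

print(soup.a.string)
print(type(soup.a.string))

# Output
Basic Python
<class 'bs4.element.NavigableString'>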
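The single-child condition on .string is easy to trip over, so here is a minimal sketch on made-up markup. It contrasts .string with get_text(), which joins all the text under a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has a single chain of children, <div> has mixed content.
soup = BeautifulSoup("<p><b>only child</b></p><div>one <i>two</i></div>", "html.parser")

single = soup.p.string      # reaches through <b> because each level has one child
mixed = soup.div.string     # None: <div> has two children, so .string is ambiguous
text = soup.div.get_text()  # concatenates every string under <div>

print(repr(single), repr(mixed), repr(text))
```

When a tag may contain nested markup, get_text() is generally the more robust way to read its text.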

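5) Comment: the comment part of the string inside a tag. A Comment is a special type of NavigableString object, written in the markup as <!-- ... -->.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup2 = BeautifulSoup(markup, 'html.parser')
comment = soup2.b.string
print(type(comment))
# <class 'bs4.element.Comment'>
print(comment)
# Hey, buddy. Want to buy a used parser?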
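Because a Comment prints like ordinary text, it can leak into extracted strings. The following sketch, on made-up markup, shows one way to tell the two apart with isinstance.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup mixing a comment with visible text.
soup = BeautifulSoup("<p><!--hidden note-->visible text</p>", "html.parser")

# Keep only the children of <p> that are not comments.
visible = [node for node in soup.p.contents if not isinstance(node, Comment)]

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(visible, comments)
```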

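6) .prettify() adds line breaks ('\n') and indentation around each tag and its content, so the output shows the nesting.

.prettify() also works on a single tag: <tag>.prettify()

print(soup.a.prettify())

# Output
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
 Basic Python
</a>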

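7) bs4 converts any HTML input to Unicode and writes its output as UTF-8 by default.

Python 3.x strings are Unicode by default, so Chinese text parses without trouble.

newsoup = BeautifulSoup('中文', 'html.parser')
print(newsoup.prettify())

# Output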

中文

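Traversing HTML content with bs4

Basic HTML structure: nested <tag>…</tag> pairs define containment and form a tree of tags.

Downward traversal of the tag tree:

.contents: list of child nodes; all direct children of the tag are stored in a list

.children: iterator over the child nodes; like .contents, used to loop over the direct children

.descendants: iterator over all descendant nodes, used in loops

Upward traversal of the tag tree:

.parent: the parent tag of the node

.parents: iterator over the ancestor tags of the node, used in loops

Sibling traversal of the tag tree:

.next_sibling: the next sibling node in HTML text order

.previous_sibling: the previous sibling node in HTML text order

.next_siblings: iterator over all following sibling nodes in HTML text order

.previous_siblings: iterator over all preceding sibling nodes in HTML text order

Moving backward and forward through the tag tree:

.next_element: the next object parsed during parsing (a string or a tag)

.previous_element: the previous object parsed during parsing (a string or a tag)

.next_elements: iterator over all objects parsed afterwards (strings or tags)

.previous_elements: iterator over all objects parsed before (strings or tags)

See the "Navigating the tree" section of the official documentation for details.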
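The traversal attributes listed above can be tried on a tiny document before moving to the demo page. The markup below is made up for the illustration and needs no network access.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one paragraph holding text and two links.
html = "<html><body><p>intro <a id='x'>first</a> and <a id='y'>second</a>.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Downward: <p> has five children (text, tag, text, tag, text).
child_count = len(soup.p.contents)

# Upward: names of all ancestors of the first link.
ancestors = [tag.name for tag in first.parents]

# Sideways: the immediate sibling is a string, not a tag.
following = first.next_sibling
next_tag = first.find_next_sibling("a")

# Forward in parse order: the first thing parsed after <a> is its own text.
inner = first.next_element

print(child_count, ancestors, repr(following), next_tag["id"], repr(inner))
```

Note that .next_sibling and .next_element answer different questions: the first stays at the same level of the tree, the second follows the order in which the parser encountered things.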

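Downward traversal of the tag tree

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.contents)  # the children of the whole tag tree

# Output (attributes of the <a> tags abbreviated as ...)
[<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>
</body></html>]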

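print(soup.body.contents)  # the nodes directly under the body tag

# Output (attributes of the <a> tags abbreviated as ...)
['\n',
 <p class="title"><b>The demo python introduces several python courses.</b></p>,
 '\n',
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>,
 '\n']

print(soup.head)  # the head tag

# Output
<head><title>This is a python demo page</title></head>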

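for child in soup.body.children:  # loop over the direct children
    print(child)

# Output (blank lines from the '\n' strings omitted, <a> attributes abbreviated as ...)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>

for child in soup.body.descendants:  # loop over all descendants
    print(child)

# Output (blank lines omitted, <a> attributes abbreviated as ...)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a>
Basic Python
 and
<a ...>Advanced Python</a>
Advanced Python
.

As the output shows, .descendants walks the tree depth-first.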
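Both .children and .descendants also yield the whitespace strings between tags, which is why the loops above print blank lines. A minimal sketch of filtering them out, on made-up markup, is:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body with newlines between the paragraphs.
html = "<body>\n<p>one</p>\n<p>two <b>bold</b></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag objects; the '\n' strings are NavigableString instances.
child_tags = [node.name for node in soup.body.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_tags)       # direct children only
print(descendant_tags)  # depth-first order, so <b> follows the second <p>
```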

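Upward traversal of the tag tree

print((soup.title.parent, type(soup.html.parent), soup.parent))
# (<head><title>This is a python demo page</title></head>, <class 'bs4.BeautifulSoup'>, None)

The parent of title is the head tag, the parent of the top-level node of the document is the BeautifulSoup object, and the .parent of the BeautifulSoup object is None.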

for parent in soup.a.parents:  # walk up through the ancestors
    if parent is None:
        print('parent:', parent)
    else:
        print('parent name:', parent.name)

# Output
parent name: p
parent name: body
parent name: html
parent name: [document]

The ancestor chain of the a tag is a -> p -> body -> html -> [document]. The last entry appears because soup.name is '[document]'.

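Sibling traversal of the tag tree

Note: sibling traversal only works under these conditions:

Siblings are nodes that share the same parent node.

The text inside a tag also counts as a node.

Recall the structure of the document:

print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a ...>
    Basic Python
   </a>
   and
   <a ...>
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

The string "Python is a wonderful...", the first a tag, the string " and ", the second a tag, and the string "." are all siblings whose parent is the p tag. Therefore: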

print(soup.a.next_sibling)  # the node right after the a tag (a string here)
#  and

print(soup.a.next_sibling.next_sibling)  # the node after that
# <a ...>Advanced Python</a>

print(soup.a.previous_sibling)  # the node right before the a tag
# Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

print(soup.a.previous_sibling.previous_sibling)  # the node before that
# None

for sibling in soup.a.next_siblings:  # loop over the following siblings
    print(sibling)

# Output
 and
<a ...>Advanced Python</a>
.

for sibling in soup.a.previous_siblings:  # loop over the preceding siblings
    print(sibling)

# Output
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

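Moving backward and forward

Consider this document:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

The HTML parser turns this string into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup provides tools for reconstructing the parser's initial pass over the document.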

soup.a.next_element, soup.a.previous_element

# Output
('Basic Python',
 'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n')

for element in soup.a.next_elements:
    print(element)

# Output (blank lines omitted, <a> attributes abbreviated as ...)
Basic Python
 and
<a ...>Advanced Python</a>
Advanced Python
.

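Searching HTML content with bs4

<tag>.find_all(name, attrs, recursive, string, **kwargs)

name: filter on the tag name. It accepts any kind of filter: a string, a regular expression, a list, a function, or True. True matches every tag.

attrs: filter on tag attribute values; the attribute to match can be named explicitly.

recursive: whether to search all descendants; defaults to True.

string: filter on the string content between <tag> and </tag>.

Shorthand: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..).

Related methods:

<tag>.find(): searches and returns only the first result; same parameters as .find_all()

<tag>.find_parents(): searches the ancestor nodes and returns a list; same parameters as .find_all()

<tag>.find_parent(): returns one result from the ancestor nodes; same parameters as .find()

<tag>.find_next_siblings(): searches the following sibling nodes and returns a list; same parameters as .find_all()

<tag>.find_next_sibling(): returns one result from the following sibling nodes; same parameters as .find()

<tag>.find_previous_siblings(): searches the preceding sibling nodes and returns a list; same parameters as .find_all()

<tag>.find_previous_sibling(): returns one result from the preceding sibling nodes; same parameters as .find()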
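The differences between these methods are easiest to see side by side. The sketch below uses a small made-up document, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two <p> directly under <div>, one nested inside <span>.
html = '<div><p class="a">1</p><p class="b">2</p><span><p class="a">3</p></span></div>'
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                         # every <p>, at any depth
first_p = soup.find("p")                           # only the first match
direct = soup.div.find_all("p", recursive=False)   # direct children of <div> only
missing = soup.find("table")                       # no match: find() returns None
shorthand = soup("p")                              # same as soup.find_all("p")
parent = soup.find("p", "a").find_parent("div")    # search upward from a match

print(len(all_p), first_p.string, len(direct), missing, len(shorthand), parent.name)
```

Since find() returns None when nothing matches, chaining calls such as soup.find('tbody').find_all('tr') fails with an AttributeError on pages where the first tag is absent, so a None check is worth adding in real scrapers.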

import requests

from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')

demo = r.text

soup = BeautifulSoup(demo,'html.parser')

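First, the kinds of filters.

String

# name: filter on the tag name
soup.find_all('a')

# Output (attributes abbreviated as ...)
[<a ...>Basic Python</a>,
 <a ...>Advanced Python</a>]

# attrs: filter on attribute values; a bare string as the second argument is matched against the CSS class
soup.find_all("p", "course")

# Output (attributes of the <a> tags abbreviated as ...)
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>]

# recursive: whether to search all descendants; defaults to True
soup.find_all('p', recursive=False)
# [] (the only direct child of soup is <html>, so no <p> is found)

# string: filter on the text between <tag> and </tag>
soup.find_all(string="Basic Python")  # only an exact match is found
# ['Basic Python']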

Regular expression

import re

# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Find all tags whose names contain "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

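List

# Find all <a> tags and <p> tags in the document
soup.find_all(['a', 'p'])

# Output (attributes of the <a> tags abbreviated as ...)
[<p class="title"><b>The demo python introduces several python courses.</b></p>,
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>,
 <a ...>Basic Python</a>,
 <a ...>Advanced Python</a>]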

True

True matches any tag. The code below finds every tag in the document but returns no string nodes.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a

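Function

A filter function takes a single element as its argument. It returns True if the element matches and should be included, and False otherwise.

The function below returns True when the element has a class attribute but no id attribute. Passing it to find_all() returns all the <p> tags:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# Output (attributes of the <a> tags abbreviated as ...)
[<p class="title"><b>The demo python introduces several python courses.</b></p>,
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a ...>Basic Python</a> and <a ...>Advanced Python</a>.</p>]

Note that the <a> tags appear here only as part of the <p> tag. They are not returned on their own.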

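When a function is used to filter on a particular attribute, its argument is the value of that attribute, not the tag. The example below finds the a tags whose href attribute does not match the given regular expression.

def not_lacie(href):
    return href and not re.compile("BIT-268001").search(href)

soup.find_all(href=not_lacie)

# Output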

[]

Find the tags that have a string both before and after them

from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
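The two kinds of function filter (one receiving the tag, one receiving an attribute value) can be compared on a self-contained document. The markup, URLs, and the helper name not_first_course below are made up for this sketch.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document modelled loosely on the demo page.
html = (
    '<p class="title"><b>Heading</b></p>'
    '<p class="course">See <a href="http://example.com/a-1" id="link1">A</a> and '
    '<a href="http://example.com/b-2" id="link2">B</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Tag-level filter: receives the whole Tag object."""
    return tag.has_attr("class") and not tag.has_attr("id")


def not_first_course(href):
    """Attribute-level filter: receives the href value, or None if absent."""
    return bool(href) and not re.search(r"a-1", href)


tag_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]
other_links = [a["id"] for a in soup.find_all(href=not_first_course)]

print(tag_names, other_links)
```

The bool(href) guard matters because the attribute filter is also called for tags that have no href at all, in which case the value passed in is None.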

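Keyword arguments

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name, for example the "id" or "href" attribute. The value can be a string, a regular expression, a list, a function, or True.

You can also search by CSS class with the class_ keyword (class is a reserved word in Python), for example to find tags whose class contains "tit". It accepts the same kinds of filter: a string, a regular expression, a function, or True.

import re

soup.find_all(id="link")  # only an exact match is found
# []

soup.find_all(class_=re.compile("tit"))
# [<p class="title"><b>The demo python introduces several python courses.</b></p>]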
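A short sketch of keyword filters on made-up markup shows why id="link" finds nothing while a regular expression does, and how the attrs dictionary expresses the same search.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with one titled paragraph and two links.
html = '<p class="title">T</p><a id="link1" href="/x">X</a><a id="link2" href="/y">Y</a>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(id="link")                                     # string filter: exact match only
by_regex = [a["id"] for a in soup.find_all(id=re.compile("^link"))]  # regex filter: prefix match
by_class = [p.string for p in soup.find_all(class_=re.compile("tit"))]
by_attrs = [a["href"] for a in soup.find_all(attrs={"id": "link2"})]  # same idea via attrs

print(exact, by_regex, by_class, by_attrs)
```

The attrs form is useful for attribute names that cannot be written as Python keywords, such as names containing a hyphen.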

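Hands-on: targeted scraping of a Chinese university ranking

Approach:

1. Fetch the university ranking page from the network.

2. Extract the information in the page into a suitable data structure (a two-dimensional list): rank, university name, total score.

3. Use that data structure to display and print the results.

# Import libraries
import requests
from bs4 import BeautifulSoup
import bs4

1. Fetch the university ranking page from the network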

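def gethtml(url):
    try:
        res = requests.get(url, timeout=30)
        # raise_for_status() raises an exception for an error status code,
        # so failures jump to the except branch instead of returning an error page.
        res.raise_for_status()
        # res.encoding: the encoding guessed from the HTTP headers.
        # res.apparent_encoding: the encoding inferred from the page content.
        # Whatever the headers declare, use the encoding detected from the
        # content so that the text decodes correctly.
        res.encoding = res.apparent_encoding
        content = res.text
        return content
    except requests.RequestException:
        return ""

content = gethtml('http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html')
soup = BeautifulSoup(content, 'html.parser')
soup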

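2. Extract the information in the page into a suitable data structure (a two-dimensional list)

View the page source, then locate the tags that hold the content to scrape.

Use the bs4 search methods to extract the required fields: rank, university name, total score.

The page source shows that the rank, university name, total score, and other data sit in a table. tbody is the body of the table, and each tr tag holds one row of the table and is a child of tbody. To get the data, we therefore need to parse every tr tag.

Based on what was covered above, we can first find the table body tbody (the page has only one table), then find all the tr child tags under tbody, and then parse the rank, university name, and total score from the children of each tr.

Method 1: use find and find_all: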

need_list = []
for tr in soup.find('tbody').find_all('tr'):
    tds = tr('td')  # same as tr.find_all('td')
    need_list.append([tds[0].string, tds[1].string, tds[3].string])
need_list

Method 2: iterate over .children, checking the type of each child, because the children include text nodes of type bs4.element.NavigableString:

need_list = []
for tr in soup.find('tbody').children:
    if isinstance(tr, bs4.element.Tag):
        # alternatively: tds = list(tr.children)
        tds = tr('td')  # same as tr.find_all('td')
        need_list.append([tds[0].string, tds[1].string, tds[3].string])
need_list
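Both extraction methods can be checked offline against a small inline table. The rows below are invented placeholders, not data from the ranking site, and the column layout (rank, name, region, score) is an assumption based on the indexes 0, 1, and 3 used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with the assumed column layout: rank, name, region, score.
html = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>Region X</td><td>94.6</td></tr>
<tr><td>2</td><td>University B</td><td>Region Y</td><td>76.5</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# Method 1: find_all returns only Tag objects, so no type check is needed.
rows_find_all = []
for tr in tbody.find_all("tr"):
    tds = tr("td")
    rows_find_all.append([tds[0].string, tds[1].string, tds[3].string])

# Method 2: .children also yields the '\n' strings between rows, so filter by type.
rows_children = []
for tr in tbody.children:
    if isinstance(tr, Tag):
        tds = tr("td")
        rows_children.append([tds[0].string, tds[1].string, tds[3].string])

print(rows_find_all)
print(rows_children)
```

Both loops produce the same rows. On a live page it is worth checking that tbody is not None first, since gethtml returns an empty string when the request fails.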

3. Use the data structure to display and print the results

# References: https://www.cnblogs.com/zhz-8919/p/9767357.html
# https://python3-cookbook.readthedocs.io/zh_CN/latest/c02/p13_aligning_text_strings.html

def printUnivList(ulist, num):
    tplt = "{0:{3}^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

printUnivList(need_list, 30)

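When printing with .format, you can set the output width of a field: an integer after ':' guarantees that the field is at least that wide. This is useful for tidying up tables.

However, when printing several Chinese strings, not every string has the same width. When a Chinese string is shorter than the field, the default padding is the ASCII space, and because ASCII spaces and Chinese characters have different display widths, the columns do not line up.

Fix: pad with the full-width Chinese space when the string is too short. Its code point is chr(12288), that is, U+3000.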
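The padding trick can be isolated in a small helper. The name center_cjk is made up for this sketch; it uses the same nested format fields as the template above, with the fill character and width passed in as arguments.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, the ideographic (full-width) space


def center_cjk(text, width=10):
    """Center text in a field of `width` characters, padded with full-width spaces."""
    # {1} supplies the fill character and {2} the field width inside the format spec.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


padded = center_cjk("清华大学")
print(repr(padded))
print(len(padded))
```

This aligns columns only when the content itself is made of full-width characters. Mixed ASCII and Chinese text in the same column will still drift, because the width is counted in characters rather than display cells.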
