The Beautiful Soup Library

Published: 2023/12/18

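Beautiful Soup: the name literally means "delicious soup"

An excellent third-party Python library

It parses HTML and XML documents and extracts the information they contain

Beautiful Soup takes whatever HTML or XML markup you hand it, parses it into a tree, and lets you pull data out of that tree

How it works: it treats any document you give it as a pot of soup, then cooks that soup

1. Installation:

pip3 install beautifulsoup4

An HTML page is information wrapped in tags, and the tags are delimited by angle brackets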

>>> import requests
>>> r=requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text

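>>> from bs4 import BeautifulSoup  # bs4 is the short name of the beautifulsoup4 package; this imports the BeautifulSoup class from it

# The soup variable represents the demo page after parsing

>>> soup = BeautifulSoup(demo,"html.parser")  # first argument: the HTML for BeautifulSoup to parse (a literal markup string or any variable holding markup); second argument: the parser used to cook the soup (here html.parser, which parses demo as HTML)

>>> print(soup.prettify())

Beautiful Soup has successfully parsed the demo page we gave it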
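The same steps can be tried without network access. The sketch below is only an illustration: DEMO is a trimmed local copy of the demo page's markup (an assumption, not fetched from the URL above), and find_all is a standard Beautiful Soup method that the walkthrough has not introduced.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed local copy of the demo page, so no HTTP request is needed.
DEMO = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

# Parse the markup with the parser from Python's standard library.
soup = BeautifulSoup(DEMO, "html.parser")

title_text = soup.title.string          # text inside <title>
link_count = len(soup.find_all("a"))    # number of <a> tags in the page

print(title_text)
print(link_count)
```

Working from a local string keeps the example repeatable even if the remote page changes or is unavailable.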

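2. Basic elements of the Beautiful Soup library

Importing the Beautiful Soup library

The Beautiful Soup library is also known as the beautifulsoup4 library or the bs4 library

from bs4 import BeautifulSoup (imports the BeautifulSoup class from bs4)

import bs4 (imports the whole package, which is useful when you need to check the type of a value against the types bs4 defines)

The library parses HTML and XML documents. Each document corresponds one-to-one to a tag tree, and the BeautifulSoup class turns that tag tree (which starts out as a plain string) into a BeautifulSoup object, a type that represents the whole tree. In practice the three can be treated as equivalent: HTML document <----------> tag tree <----------> BeautifulSoup object

The BeautifulSoup class turns the tag tree into a variable, so working with that variable is working with the tag tree

Put simply, a BeautifulSoup object stands for the entire content of one HTML/XML document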

Parsers for the Beautiful Soup library

Parser                     Usage                                Requirement

bs4's HTML parser          BeautifulSoup(mk,'html.parser')      install the bs4 library

lxml's HTML parser         BeautifulSoup(mk,'lxml')             pip install lxml

lxml's XML parser          BeautifulSoup(mk,'xml')              pip install lxml

html5lib's parser          BeautifulSoup(mk,'html5lib')         pip install html5lib
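Because lxml and html5lib are optional installs, code that prefers one of them may want a fallback. The following is a minimal sketch under that assumption; make_soup is a hypothetical helper name, while FeatureNotFound is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>soup</b></p>")
bold_text = soup.b.string
print(bold_text)
```

Different parsers can build slightly different trees from the same broken markup, so a fallback like this is best kept to well-formed input.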

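Basic elements of the BeautifulSoup class

Element            Description

Tag                A tag, the most basic unit of organization; opened with <> and closed with </>

Name               The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name

Attributes         The tag's attributes, organized as a dictionary. Access: <tag>.attrs; a tag with no attributes yields an empty dictionary

NavigableString    The non-attribute string inside a tag, that is, the text between <> and </>. Access: <tag>.string

Comment            A comment inside a tag's string; a special string type (a subclass of NavigableString)

Look at the page title: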

>>> soup.title
<title>This is a python demo page</title>

>>> tag=soup.a  # the page has several a tags; soup.a returns only the first one
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> soup.a.name  # the name of the a tag, as a string
'a'

>>> soup.a.parent.name  # the name of the a tag's parent tag
'p'

>>> soup.a.parent.parent.name
'body'

>>> tag=soup.a
>>> tag.attrs  # the tag's attributes, as a dictionary
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
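A short sketch of the usual ways to read attributes may help here. It assumes a one-tag markup string copied from the first link of the demo page; indexing a tag and calling .get on it are both standard Tag operations.

```python
from bs4 import BeautifulSoup

# Assumption: a single-tag snippet taken from the demo page's first link.
html = '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]           # shorthand for tag.attrs["href"]; raises KeyError if absent
classes = tag.get("class")   # class is multi-valued, so the value is a list
missing = tag.get("title")   # .get returns None when the attribute does not exist

print(href)
print(classes)
print(missing)
```

Using .get avoids an exception when a page omits an attribute on some tags.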

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string  # the b tag is not printed: .string reaches through a chain of single-child tags to the text inside
'The demo python introduces several python courses.'
>>> type(soup.p.string)  # a type defined by the bs4 library
<class 'bs4.element.NavigableString'>

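>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")  # <!-- marks the start of a comment
>>> newsoup.b.string  # .string drops the comment markers, so check the type if comment text should be ignored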
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
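Since .string hides the comment markers, a type check is the way to tell the two cases apart. The sketch below illustrates that check; visible_string is a hypothetical helper name, and Comment is the class bs4 defines in bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_string(tag):
    """Hypothetical helper: return tag.string as str, or None for comments."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_string(newsoup.b)   # the only child of <b> is a comment
p_text = visible_string(newsoup.p)   # ordinary text

print(b_text)
print(p_text)
```

The check has to test for Comment specifically, because Comment is itself a subclass of NavigableString.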

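3. Traversing HTML content with the bs4 library

Basic HTML structure

Downward traversal:

Attribute        Description

.contents        List of child nodes: all of <tag>'s children stored in a list

.children        Iterator over the child nodes; like .contents, used to loop over the children

.descendants     Iterator over all descendant nodes (children, grandchildren, and so on), used for looping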

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents  # a tag's children include string nodes as well as tag nodes; a newline such as '\n' is also a child node of body
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

for child in soup.body.children:
    print(child)

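To make the difference between .children and .descendants concrete, here is a small sketch. It assumes a simplified body (not the full demo page) and filters out string nodes with isinstance so that only tags are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a simplified body with the same shape as the demo page.
html = "<body>\n<p><b>first</b></p>\n<p>second <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .children stops at direct children; the newline strings are skipped by the filter.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Both attributes are iterators, so wrap them in list() if the nodes are needed more than once.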

Upward traversal:

Attribute        Description

.parent          The node's parent tag

.parents         Iterator over the node's ancestor tags, used to loop over the ancestors

>>> soup = BeautifulSoup(demo,'html.parser')
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]

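# Iterating over a tag's ancestors eventually reaches the soup object itself, whose name is [document]. The soup object has no parent (its .parent is None), and None has no .name, so the loop guards against None before printing a name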
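The ancestor chain can also be collected in one expression. This sketch assumes a tiny nested document rather than the demo page, and shows where the chain ends.

```python
from bs4 import BeautifulSoup

# Assumption: a minimal document with one link nested inside a paragraph.
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk upward from the first <a> tag and record each ancestor's name.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the last ancestor; it has no parent of its own.
top_parent = soup.parent

print(ancestor_names)
print(top_parent)
```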

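Sibling traversal:

Attribute              Description

.next_sibling          The next sibling node in HTML document order

.previous_sibling      The previous sibling node in HTML document order

.next_siblings         Iterator over all following sibling nodes in HTML document order

.previous_siblings     Iterator over all preceding sibling nodes in HTML document order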

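# The next sibling of the first a tag is the string ' and '. The tree is organized by tags, but the NavigableString objects between tags are nodes as well, so any node's siblings and children may be NavigableString objects. Do not assume that sibling traversal always returns a tag
>>> soup.a.next_sibling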
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling  # nothing is printed: there is no earlier sibling, so the result is None
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

# Iterate over the following siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
# Iterate over the preceding siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

>>>
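Because siblings can be strings, code that only wants tags usually filters by type. A minimal sketch, assuming a one-paragraph snippet shaped like the demo page's second paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened paragraph with the same two links as the demo page.
html = '<p>Intro text <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Every following sibling, strings included.
sibling_types = [type(s).__name__ for s in first_link.next_siblings]

# Only the siblings that are tags.
next_ids = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]

print(sibling_types)
print(next_ids)
```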

4. HTML formatting and encoding with the bs4 library

>>> soup.prettify()  # a newline \n is added after every tag
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())  # every tag and its content is shown on its own line
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>>

The prettify method adds line breaks around the tags and content of the HTML text; it can also be called on an individual tag

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

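The bs4 library converts every HTML file or string it reads to Unicode and, by default, writes output as UTF-8. UTF-8 is an internationally used standard encoding that handles Chinese and other non-English text well. Because Python 3 strings are Unicode and its default encoding is UTF-8, parsing such text with bs4 poses no special difficulty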

>>> soup = BeautifulSoup("<p>中文</p>","html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
>>>
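When the input is bytes in a known non-UTF-8 encoding, the encoding can be stated explicitly. The sketch below assumes GBK-encoded input purely for illustration; from_encoding is a BeautifulSoup constructor argument, and Tag.encode serializes a tag back to bytes.

```python
from bs4 import BeautifulSoup

# Assumption: the markup arrives as GBK-encoded bytes, e.g. read from a legacy file.
raw_bytes = "<p>中文</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

text = soup.p.string                  # Unicode text, independent of the input encoding
utf8_bytes = soup.p.encode("utf-8")   # the tag serialized as UTF-8 bytes

print(len(text))
print(utf8_bytes)
```

If from_encoding is left out, Beautiful Soup tries to detect the encoding itself, which is usually fine for long documents but can guess wrong on short snippets.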

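Summary: BeautifulSoup is a library for parsing HTML and XML documents. Import the BeautifulSoup class with from bs4 import BeautifulSoup, pass it the markup together with the name of a parser, and the result is a BeautifulSoup object that you use to extract information and traverse the document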

Reposted from: https://www.cnblogs.com/suitcases/p/11200898.html
