爬虫--BeautifulSoup使用
生活随笔
收集整理的這篇文章主要介紹了
爬虫--BeautifulSoup使用
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
解析庫
| Python標準庫 | BeautifulSoup(markup, "html.parser") | Python的內置標準庫、執行速度適中 、文檔容錯能力強 | Python 2.7.3 or 3.2.2)前的版本中文容錯能力差 |
| lxml HTML 解析器 | BeautifulSoup(markup, "lxml") | 速度快、文檔容錯能力強 | 需要安裝C語言庫 |
| lxml XML 解析器 | BeautifulSoup(markup, "xml") | 速度快、唯一支持XML的解析器 | 需要安裝C語言庫 |
| html5lib | BeautifulSoup(markup, "html5lib") | 最好的容錯性、以瀏覽器的方式解析文檔、生成HTML5格式的文檔 | 速度慢、不依賴外部擴展 |
?
基本使用
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.prettify()) print(soup.title.string)# 輸出 <html><head><title>The Dormouse's story</title></head><body><p class="title" name="dromouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p><p class="story">...</p></body> </html> The Dormouse's story?
?
標簽選擇器
選擇元素
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.title) print(type(soup.title)) print(soup.head) print(soup.p)# 輸出 <title>The Dormouse's story</title> <class 'bs4.element.Tag'> <head><title>The Dormouse's story</title></head> <p class="title" name="dromouse"><b>The Dormouse's story</b></p>?
?
獲取名稱
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.title.name)# 輸出 title?
?
獲取屬性
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.attrs['name']) print(soup.p['name'])# 輸出 dromouse dromouse?
?
獲取內容
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p clss="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.string)# 輸出 The Dormouse's story?
?
嵌套選擇
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.head.title.string)# 輸出 The Dormouse's story?
?
子節點和子孫節點
html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.contents)# 輸出 ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']?
html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.children) for i, child in enumerate(soup.p.children):print(i, child)# 輸出 <list_iterator object at 0x1064f7dd8> 0 Once upon a time there were three little sisters; and their names were1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> 2 3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 4 and5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 6 and they lived at the bottom of a well. html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.p.descendants) for i, child in enumerate(soup.p.descendants):print(i, child)# 輸出 <generator object descendants at 0x10650e678> 0 Once upon a time there were three little sisters; and their names were1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> 2 3 <span>Elsie</span> 4 Elsie 5 6 7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 8 Lacie 9 and10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 11 Tillie 12 and they lived at the bottom of a well.?
?
父節點和祖先節點
html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.a.parent)# 輸出 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(list(enumerate(soup.a.parents)))# 輸出 [(0, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p>), (1, <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body>), (2, <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>), (3, <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>and they lived at the bottom of a well.</p> <p class="story">...</p> </body></html>)]?
?
兄弟節點
html = """ <html><head><title>The Dormouse's story</title></head><body><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(list(enumerate(soup.a.next_siblings))) print(list(enumerate(soup.a.previous_siblings)))# 輸出 [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')] [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]?
?
標準選擇器
find_all( name , attrs , recursive , text , **kwargs )
可根據標簽名、屬性、內容查找文檔
name
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all('ul')) print(type(soup.find_all('ul')[0]))# 輸出 [<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul>] <class 'bs4.element.Tag'> html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for ul in soup.find_all('ul'):print(ul.find_all('li'))# 輸出 [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>]?
attrs
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(attrs={'id': 'list-1'})) print(soup.find_all(attrs={'name': 'elements'}))# 輸出 [<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>] [<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]?
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(id='list-1')) print(soup.find_all(class_='element'))# 輸出 [<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>] [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]?
?
text
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(text='Foo'))# 輸出 ['Foo', 'Foo']?
?
find( name , attrs , recursive , text , **kwargs )
find返回單個元素,find_all返回所有元素
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find('ul')) print(type(soup.find('ul'))) print(soup.find('page'))# 輸出 <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <class 'bs4.element.Tag'> None?
?
find_parents() find_parent()
find_parents()返回所有祖先節點,find_parent()返回直接父節點。
find_next_siblings() find_next_sibling()
find_next_siblings()返回后面所有兄弟節點,find_next_sibling()返回后面第一個兄弟節點。
find_previous_siblings() find_previous_sibling()
find_previous_siblings()返回前面所有兄弟節點,find_previous_sibling()返回前面第一個兄弟節點。
find_all_next() find_next()
find_all_next()返回節點后所有符合條件的節點, find_next()返回第一個符合條件的節點
find_all_previous() 和 find_previous()
find_all_previous()返回節點后所有符合條件的節點, find_previous()返回第一個符合條件的節點
CSS選擇器
通過select()直接傳入CSS選擇器即可完成選擇:
. 代表類名? ? ? ? # 代表id? ? ? ?啥也不寫是 標簽名
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.select('.panel .panel-heading')) print(soup.select('ul li')) print(soup.select('#list-2 .element')) print(type(soup.select('ul')[0]))# 輸出 [<div class="panel-heading"> <h4>Hello</h4> </div>] [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>] <class 'bs4.element.Tag'> html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for ul in soup.select('ul'):print(ul.select('li'))# 輸出 [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>]?
?
獲取屬性
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for ul in soup.select('ul'):print(ul['id'])print(ul.attrs['id'])# 輸出 list-1 list-1 list-2 list-2?
?
獲取內容
html=''' <div class="panel"><div class="panel-heading"><h4>Hello</h4></div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul></div> </div> ''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') for li in soup.select('li'):print(li.get_text())# 輸出 Foo Bar Jay Foo Bar?
?
總結
- 推薦使用lxml解析庫,必要時使用html.parser
- 標簽選擇篩選功能弱但是速度快
- 建議使用find()、find_all() 查詢匹配單個結果或者多個結果
- 如果對CSS選擇器熟悉建議使用select()
- 記住常用的獲取屬性和文本值的方法
?
轉載于:https://www.cnblogs.com/panlei3707/p/8745097.html
總結
以上是生活随笔為你收集整理的爬虫--BeautifulSoup使用的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: Sping(一)——IOC/DI
- 下一篇: GUI编程tkinter模块常用参数(p