Web Scraping Notes: pyquery Explained
pyquery
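pyquery is a powerful and flexible library for parsing web pages. If you find regular expressions too tedious to write, or BeautifulSoup's syntax too hard to remember, and you already know jQuery's syntax, then PyQuery is a natural choice.

Initialization

1. Initializing from a string

    from pyquery import PyQuery as pq

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)     # create a PyQuery object from the HTML string
    print(doc('li'))   # pass in a CSS selector

The call doc('li') takes a CSS selector. To select by tag, use the tag name directly; to select by id, prefix the id with #; to select by class, prefix the class name with a dot (.).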
2. Initializing from a URL

    from pyquery import PyQuery as pq

    doc = pq(url='https://www.2345.com/?38001')   # pass in a URL; pyquery fetches the page
    print(doc('head'))

3. Initializing from a file

    from pyquery import PyQuery as pq

    doc = pq(filename='demo.html')   # parse a local HTML file
    print(doc('li'))

Basic CSS selectors

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    print(doc('#container .list li'))   # id, class, tag name

In doc('#container .list li'), .list does not have to be a direct child of #container. Any ancestor-descendant relationship is enough, as long as the parts are separated by spaces. Without a space the parts are combined into one condition that a single element must satisfy in full: a selector such as a.b requires the same element to match both a and b, and implies no hierarchy between them.
Finding child elements

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)

    print('Child elements')
    lis = items.children()            # all direct children
    print(type(lis))
    print(lis)

    print('Child elements matching a selector')
    lis = items.children('.active')   # direct children with class "active"
    print(lis)

items = doc('.list') returns a PyQuery object, and further lookup methods can be called on it, such as find (searches all descendants) and children (searches direct children only).
Finding parent elements

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')
    parents = items.parents()   # the parent and every ancestor above it
    parent = items.parent()     # the direct parent only
    print('Parent and ancestors')
    print(parents)
    print('Direct parent')
    print(parent)

Finding sibling elements

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.list .item-0.active')
    print('All siblings')
    print(li.siblings())
    print('Siblings matching a selector')
    print(li.siblings('.active'))

Iterating over individual elements

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    lis = doc('li').items()   # generator that yields each match as a PyQuery object
    print(type(lis))
    for li in lis:
        print(li)

Getting information
Getting attributes

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))   # read the link target
    print(a.attr.href)      # same attribute, attribute-style access

Getting text

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    # No space in ".item-0.active": the element must have both classes.
    # The space before "a" expresses hierarchy: an <a> inside that element.
    a = doc('.item-0.active a')
    print(a)
    print(a.text())   # get the text content

Getting HTML

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())   # inner HTML of the matched element

DOM manipulation
addClass, removeClass

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')   # no space between the classes: both must match
    print(li)
    li.removeClass('active')
    print(li)
    li.addClass('active')
    print(li)

attr, css

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')        # set an attribute
    print(li)
    li.css('font-size', '14px')    # set an inline style
    print(li)

remove

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''

    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()   # drop the <p> so only the outer text remains
    print(wrap.text())

Other DOM methods
http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('li:first-child')        # the first <li>
    print(li)
    li = doc('li:last-child')         # the last <li>
    print(li)
    li = doc('li:nth-child(2)')       # the second <li>
    print(li)
    li = doc('li:gt(2)')              # <li> elements with a zero-based index greater than 2
    print(li)
    li = doc('li:nth-child(2n)')      # <li> elements in even positions
    print(li)
    li = doc('li:contains(second)')   # <li> elements whose text contains "second"
    print(li)

Author: 電氣-余登武. Writing this took real effort, so if you found the article useful, please leave a like.