Web scraping: using XPath to scrape short reviews of Douban books

Published: 2023/12/16 | 豆豆

Installing XPath support:

1. Install with pip:    $ pip install lxml

2. Or download the whl file and install it:    $ pip install "filename"
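To confirm that either install method worked, one quick check is to import lxml and print its version. This is only a minimal sketch; the variable name version is chosen here for illustration.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
version = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", version)
```

If the import fails with ModuleNotFoundError, the package was probably installed into a different Python environment than the one running the script.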

Using XPath

Import lxml -> parse the page into an XML tree -> locate the data

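from lxml import etree

html = "<p>hello</p>"  # the HTML text of the page, not its URL
s = etree.HTML(html)  # parse the text into an element tree
print(s.xpath("//p/text()"))  # query the tree with an XPath expression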

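1. Use text() to get the text content of a node.

2. Use comment() to get comments.

3. Use @xx to get any other attribute, for example:

@href
@src
@value

4. To get all the text under a tag (including text inside its child tags), use string().

5. starts-with matches strings that begin with the given text.

6. contains matches the given text at any position.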
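The six points above can be tried out on a small piece of HTML before touching a real site. The fragment below is made up purely for illustration (the ids, classes and links are not from Douban), so treat it as a rough sketch rather than a reference.

```python
from lxml import etree

# Hypothetical HTML fragment, invented for this example only
html = """
<div id="box">
  <!-- a hidden note -->
  <a href="/book/1" class="item-link">First <b>book</b></a>
  <a href="/author/2" class="author-link">Some author</a>
</div>
"""

s = etree.HTML(html)

# 1. text(): only the text nodes directly inside the matched tag
direct_text = s.xpath('//a[@class="item-link"]/text()')

# 2. comment(): comment nodes under the div
comments = s.xpath('//div[@id="box"]/comment()')

# 3. @href: attribute values
hrefs = s.xpath('//a/@href')

# 4. string(): all text under the tag, child tags included
full_text = s.xpath('string(//a[@class="item-link"])')

# 5. starts-with(): the attribute begins with the given text
book_links = s.xpath('//a[starts-with(@href, "/book")]/@href')

# 6. contains(): the attribute contains the given text anywhere
author_names = s.xpath('//a[contains(@class, "author")]/text()')

print(direct_text, full_text, hrefs, book_links, author_names)
print(comments[0].text)
```

Note the difference between points 1 and 4: text() skips the word inside the nested b tag, while string() includes it.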

Now let's use XPath to scrape Douban Books.

Analyzing the site:

The target is the short reviews of a book on Douban Books. Site address: Douban Books short reviews

Open the browser's developer tools, press Ctrl+Shift+C, then click the first review.

The browser jumps to the element we clicked. Right-click it and choose Copy -> Copy XPath.

Result: //*[@id="comments"]/ul/li[1]/div[2]/p/span

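import requests
from lxml import etree

url = "https://book.douban.com/subject/25924253/comments/"
resp = requests.get(url).text  # download the page HTML
s = etree.HTML(resp)  # parse it into an element tree
# text() returns the text of the first review
print(s.xpath('//*[@id="comments"]/ul/li[1]/div[2]/p/span/text()'))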

Next we scrape all the short reviews on this page. First copy the XPath of the first few reviews and look for a pattern.

First review: //*[@id="comments"]/ul/li[1]/div[2]/p/span

Second review: //*[@id="comments"]/ul/li[2]/div[2]/p/span

Third review: //*[@id="comments"]/ul/li[3]/div[2]/p/span

It is easy to see that the index in li[] goes up by one with each review, so dropping the index and writing //*[@id="comments"]/ul/li/div[2]/p/span matches every review on the page.
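The effect of dropping the index can be checked offline. The markup below is a hypothetical stand-in that only imitates the comments/ul/li/div[2]/p/span nesting described above; the real Douban page contains more than this.

```python
from lxml import etree

# Hypothetical markup imitating the nesting of the review list
html = """
<div id="comments">
  <ul>
    <li><div>meta</div><div><p><span>review one</span></p></div></li>
    <li><div>meta</div><div><p><span>review two</span></p></div></li>
    <li><div>meta</div><div><p><span>review three</span></p></div></li>
  </ul>
</div>
"""

s = etree.HTML(html)

# li[1] selects only the first list item
first_only = s.xpath('//*[@id="comments"]/ul/li[1]/div[2]/p/span/text()')

# li without an index selects every list item
all_reviews = s.xpath('//*[@id="comments"]/ul/li/div[2]/p/span/text()')

print(first_only)
print(all_reviews)
```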

import requests
from lxml import etree

url = "https://book.douban.com/subject/25924253/comments/"
resp = requests.get(url).text
s = etree.HTML(resp)
# li without an index matches every review on the page
print(s.xpath('//*[@id="comments"]/ul/li/div[2]/p/span/text()'))

Easy, isn't it? What about scraping several pages of reviews? Click "next page" and watch how the URL changes.

Page 1 URL: https://book.douban.com/subject/25924253/comments/

Page 2 URL: https://book.douban.com/subject/25924253/comments/hot?p=2

Page 3 URL: https://book.douban.com/subject/25924253/comments/hot?p=3

The value of p at the end decides which page is shown, which makes this straightforward.

import requests
from lxml import etree

# pages 1 to 9; p selects the page number
for i in range(1, 10):
    url = "https://book.douban.com/subject/25924253/comments/hot?p={}".format(i)
    resp = requests.get(url).text
    s = etree.HTML(resp)
    print(s.xpath('//*[@id="comments"]/ul/li/div[2]/p/span/text()'))

That is all it takes to scrape the reviews. They could also be written to a txt file, which is not covered here.
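For readers who do want the txt file, here is one possible way to do it. This is only a sketch: save_comments, sample and out_file are names invented for this example, and the sample strings are placeholders standing in for the list that s.xpath(...) returns.

```python
import tempfile
from pathlib import Path


def save_comments(comments, path):
    """Write one non-empty comment per line (UTF-8) and return how many were written."""
    cleaned = [c.strip() for c in comments if c.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Placeholder data; in the scraper this would be the list returned by s.xpath(...)
sample = ["  很好看  ", "", "Worth reading"]

# Hypothetical output location in the system temp directory
out_file = Path(tempfile.gettempdir()) / "douban_comments.txt"
count = save_comments(sample, out_file)
print(count, "comments written to", out_file)
```

write_text replaces the file on every call, so when scraping several pages it is simpler to collect all the reviews into one list first and save once at the end.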
