
Scraping Zhihu article titles and comments with a Python crawler

Published: 2023/12/2 python 29 豆豆


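Purpose: study notes.

2. First, let's try to scrape the comments of a single article. Searching the page's response turns up no comment text, which means the comments are loaded dynamically.

3. Now clear the captured requests, collapse the comments, and expand them again.

4. After that, select the XHR filter. You can see that opening the comments sent three requests.

5. Click the request with "comments" in its name and search its response: the comment text is there, and the response is JSON, so this is the request that loads the comments.

The request URL is shown in the screenshot above. Let's not worry yet about how the URL is put together, and keep going.

6. Next, open json.cn and paste in the JSON copied from the response.

7. Look at the JSON: each object holds everything about one comment, such as the commenter and the comment text. We need code that pulls those fields out.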
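As a rough illustration of that structure, the sketch below parses a small hand-written payload. The sample values are invented; only the key names (`data`, `id`, `content`, `author` > `member` > `name`, `paging` > `is_end`) are the ones this post's code reads, and a real response carries many more fields. `extract_comments` is a name introduced here for illustration.

```python
import json

# Hand-written sample mimicking the shape described above.
# The values are made up for illustration; real responses have more fields.
SAMPLE = """
{
  "paging": {"is_end": false},
  "data": [
    {"id": 1, "content": "<p>first comment</p>",
     "author": {"member": {"name": "alice"}}},
    {"id": 2, "content": "second comment",
     "author": {"member": {"name": "bob"}}}
  ]
}
"""

def extract_comments(payload):
    """Return (id, author name, content) for each comment object."""
    return [
        (item["id"], item["author"]["member"]["name"], item["content"])
        for item in payload["data"]
    ]

payload = json.loads(SAMPLE)
rows = extract_comments(payload)
for row in rows:
    print(row)
```

Each entry of `data` is one comment, so a single loop (or comprehension) over that list is enough to collect the fields of interest.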

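8. Now we know the request URL: url=https://www.zhihu.com/api/v4/articles/258812959/root_comments?order=normal&limit=20&offset=20&status=open

Request method: GET (sent here with the requests library)

Now we can start writing code to fetch the information.

Code: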

import requests
import json

url = 'https://www.zhihu.com/api/v4/articles/258812959/root_comments?order=normal&limit=20&offset=20&status=open'

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36",
    "referer": "https://www.zhihu.com/"
}

# Fetch one page of comments and parse the JSON body.
res = requests.get(url, headers=headers).content.decode('utf-8')
jsonfile = json.loads(res)

# paging.is_end is True when this is the last page of comments.
is_end = jsonfile['paging']['is_end']
print(is_end)

# Each object in 'data' is one comment.
for data in jsonfile['data']:
    comment_id = data['id']  # avoid shadowing the built-in id()
    content = data['content']
    author = data['author']['member']['name']
    print(comment_id, content, author)

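Printed output: [screenshot]

9. At this point we have printed the first page of comments for the first topic on the first page of Zhihu. Next, let's work out how to fetch all of that topic's comments.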

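10. Clicking through to the second page of comments gives the request url=https://www.zhihu.com/api/v4/answers/1307614528/root_comments?order=normal&limit=20&offset=20&status=open

Compare it with the first page's url1=https://www.zhihu.com/api/v4/answers/1307614528/root_comments?order=normal&limit=20&offset=0&status=open

The offset changed from 0 to 20, and checking later pages confirms that offset grows by 20 with each page.

So when does adding 20 stop? Go to the last page and look at its JSON:

is_end is true there, so we can use a while loop and break out of it once is_end is True.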
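The paging rule can be tried without touching the network. The following is a minimal sketch, not the code used in this post: `page_url`, `iter_comments`, and the `fetch` callback are hypothetical names, and the fake three-page API exists only to exercise the loop. It assumes each response carries a `data` list and a `paging.is_end` flag.

```python
from urllib.parse import urlencode

API = "https://www.zhihu.com/api/v4/answers/{}/root_comments"
PAGE_SIZE = 20  # the limit value used in the URLs of this post

def page_url(item_id, offset):
    """Build the comment URL for one page; offset moves in steps of 20."""
    query = urlencode({"order": "normal", "limit": PAGE_SIZE,
                       "offset": offset, "status": "open"})
    return API.format(item_id) + "?" + query

def iter_comments(item_id, fetch):
    """Yield comment objects page by page until paging.is_end is true.

    fetch is any callable that takes a URL and returns the parsed JSON
    (a hypothetical hook, so the loop can run without network access).
    """
    offset = 0
    while True:
        page = fetch(page_url(item_id, offset))
        yield from page["data"]
        if page["paging"]["is_end"]:
            break
        offset += PAGE_SIZE

# Fake API with 45 comments spread over three pages, for illustration only.
def fake_fetch(url):
    offset = int(url.split("offset=")[1].split("&")[0])
    ids = list(range(offset, min(offset + PAGE_SIZE, 45)))
    return {"data": [{"id": i} for i in ids],
            "paging": {"is_end": offset + PAGE_SIZE >= 45}}

comments = list(iter_comments("1307614528", fake_fetch))
print(len(comments))
print(page_url("1307614528", 20))
```

Passing the fetching function in as an argument keeps the stop condition separate from the HTTP call, so the loop can be checked against fake pages before it is pointed at the real API.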

11. Code:

import requests
import json
import re

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36",
    "referer": "https://www.zhihu.com/"
}

# Comment content is HTML; this pattern matches opening and closing tags.
comp = re.compile(r'</?\w+[^>]*>')

i = 0
while True:
    url = 'https://www.zhihu.com/api/v4/articles/258812959/root_comments?order=normal&limit=20&offset={}&status=open'.format(i)
    i += 20
    res = requests.get(url=url, headers=headers).content.decode('utf-8')
    jsonfile = json.loads(res)
    is_end = jsonfile['paging']['is_end']
    print(is_end)
    for data in jsonfile['data']:
        # Strip the HTML tags from the comment text.
        content = comp.sub("", data['content'])
        author = data['author']['member']['name']
        print("Nickname---" + author, "Comment: " + content)
    # Stop once the API reports the last page.
    if is_end:
        break
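The tag-stripping step is easy to check in isolation. This is a small sketch, assuming comment text arrives as an HTML fragment; the sample string is invented and `strip_tags` is a name introduced here. The pattern only removes tags and leaves entities such as `&amp;` untouched.

```python
import re

# Matches an opening or closing tag such as <p>, </p>, or <a href="...">.
TAG = re.compile(r'</?\w+[^>]*>')

def strip_tags(fragment):
    """Remove HTML tags from a comment fragment, keeping the text."""
    return TAG.sub("", fragment)

sample = '<p>nice <a href="https://www.zhihu.com/">post</a></p>'
print(strip_tags(sample))  # prints: nice post
```

Because the pattern requires a word character right after `<` or `</`, plain text like `1 < 2` is left alone.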

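12. Comparing the comment URLs shows that only the run of digits before root_comments differs, so that number is what selects the topic.

13. Next, look for the connection from the topic side. The information can be matched in the response, so I decided to write code straight away to extract it.

Code: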

import requests
from lxml import etree

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36",
    # Cookie copied from a logged-in browser session.
    "cookie": '''_zap=3935ec64-2d91-4666-903c-a641b2510b18; d_c0="AOBcn_xhAxKPToowOy9HNL3DgDozDHDt63I=|1602244658"; capsion_ticket="2|1:0|10:1604449780|14:capsion_ticket|44:YjAwODdlMjcxNzY4NGYwODlmNjgxMzYyNWFkZDJlYTI=|6312ca79725710f1810a97a1fe3c4bbd6d16d02f769891626a67124cba7dd1f9"; z_c0="2|1:0|10:1604449808|4:z_c0|92:Mi4xVnlXQUNBQUFBQUFBNEZ5Zl9HRURFaVlBQUFCZ0FsVk5FRVNQWUFDOUlsV1pKa2hZUTdvc1U5Z1cxbTluajk5UW5n|dc09f94f0b3e78d3d80f6d18da39109a38b3357154e275642bec5e4afa4c825b"; tst=r; q_c1=097e8b52467b4017a4f27f26dd8622c2|1604625864000|1604625864000; _xsrf=88ef577a-c34c-49a1-8a13-faf8cd85c55a; KLBRSID=4843ceb2c0de43091e0ff7c22eadca8c|1605003647|1604996383''',
    "referer": "https://www.zhihu.com/"
}

url1 = 'https://www.zhihu.com/'
res = requests.get(url1, headers=headers).text
html = etree.HTML(res)

# Each recommended item on the home page sits in one of these cards.
divs = html.xpath('''//div[@class="Card TopstoryItem TopstoryItem--old TopstoryItem-isRecommend"]''')
for div in divs:
    title = div.xpath('.//h2//a[@target="_blank"]/text()')[0]
    link = div.xpath('.//h2//a[@target="_blank"]/@href')[0]
    # The last path segment of the link is the numeric ID.
    link_num = link.split('/')[-1]
    print(link_num)

Output: [screenshot]

14. As the output above shows, link_num is exactly the number that selects the topic.
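What `link.split('/')[-1]` does is easiest to see on a couple of links. The hrefs below are hypothetical examples in the style of Zhihu links, not values captured from the page, and `extract_id` is a name introduced here for illustration. The `rstrip('/')` is an extra guard that the post's code does not have.

```python
def extract_id(link):
    """Return the last path segment of a link, used here as the numeric ID."""
    # rstrip guards against a trailing slash, which would otherwise
    # make the last segment an empty string.
    return link.rstrip('/').split('/')[-1]

# Hypothetical hrefs, for illustration only.
links = [
    "//zhuanlan.zhihu.com/p/258812959",
    "/question/12345678/answer/1307614528",
]
for link in links:
    print(extract_id(link))
```

The URLs in this post use both `articles/` and `answers/` before the ID, so which endpoint an ID belongs to may depend on the kind of link it came from.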

So let's write the code:

import requests
import json
from lxml import etree

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36",
    # As noted at the end of this post, a "cookie" header is also needed here.
    "referer": "https://www.zhihu.com/"
}

url1 = 'https://www.zhihu.com/'
res = requests.get(url1, headers=headers).text
html = etree.HTML(res)

divs = html.xpath('''//div[@class="Card TopstoryItem TopstoryItem--old TopstoryItem-isRecommend"]''')
for div in divs:
    title = div.xpath('.//h2//a[@target="_blank"]/text()')[0]
    link = div.xpath('.//h2//a[@target="_blank"]/@href')[0]
    link_num = link.split('/')[-1]
    i = 0
    print(f'......................................... Title: {title} ...........................................................')
    # Page through this topic's comments, 20 at a time.
    while True:
        url2 = 'https://www.zhihu.com/api/v4/answers/{}/root_comments?order=normal&limit=20&offset={}&status=open'.format(link_num, i)
        i += 20
        # Integer division, so the page number prints as 1, 2, 3 rather than 1.0, 2.0.
        print(f'Printing page {i // 20} .....................................')
        res = requests.get(url2, headers=headers).content.decode('utf-8')
        jsonfile = json.loads(res)
        is_end = jsonfile['paging']['is_end']
        for data in jsonfile['data']:
            content = data['content']
            author = data['author']['member']['name']
            print(f'{author} commented: {content}')
        # Stop once the API reports the last page.
        if is_end:
            break

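Screenshot of the output: [screenshot]

This completes scraping every comment on every topic of the first page.

While writing this I found that Zhihu's topics are also loaded dynamically, with no pagination. Much of the data arrives as JSON, and a cookie has to be sent for the scraping to work.

Finally: the code is far from polished, but writing it at least strengthened my understanding of crawlers. Some places still need exception handling.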
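One possible shape for that exception handling, as a sketch rather than a tested fix: `fetch_json` and `safe_author` are hypothetical helpers, and `get` stands in for whatever performs the request (for example a thin wrapper around `requests.get(...).json()`), so the example runs without network access.

```python
import time

def fetch_json(get, url, retries=3, delay=0.0):
    """Call get(url), retrying on any exception; return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except Exception as exc:  # broad on purpose for a sketch; narrow it in real code
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay)
    return None

def safe_author(comment):
    """Read author.member.name, falling back when a key is missing."""
    try:
        return comment["author"]["member"]["name"]
    except (KeyError, TypeError):
        return "(unknown)"

# Fake getter that fails twice and then succeeds, for illustration only.
calls = {"n": 0}

def flaky_get(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")
    return {"data": [{"author": {"member": {"name": "alice"}}},
                     {"author": None}],
            "paging": {"is_end": True}}

page = fetch_json(flaky_get, "https://www.zhihu.com/api/v4/answers/1307614528/root_comments")
names = [safe_author(c) for c in page["data"]]
print(names)
```

In a real crawler the caller would also need to handle the `None` return, for instance by skipping that topic instead of indexing into the result.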
