Scraping Dangdang Product Reviews with Python

Published: 2024/3/7 python 29 豆豆

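This case study uses the reviews of one shoe listing as its example.

Goal:

By scraping product reviews from Dangdang, this article shows how to combine jsonpath with regular expressions to extract the target data.

What the code does:

Given the number of pages to crawl, it downloads and saves the reviews on each page together with the corresponding user nicknames.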

1. Find the target URL

2. Inspect the response

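The response preview in the browser's developer tools shows that the review content sits under the html node of the JSON response.

Expanding Show more on that node reveals the review text.

So how do we extract the review content?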

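1. Since the data under the html node is itself HTML, can we extract the content with XPath? In my tests, no: after some debugging I could not get XPath to extract it.

2. We can use jsonpath to pull out the entire HTML string, then use regular expressions to extract the reviews and user nicknames from it. After debugging, this approach works.

3. With the extraction method settled, we still need to find the URL pattern for pagination.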

The URLs of the first three review pages are:

http://product.dangdang.com/index.php?r=comment%2Flist&productId=1412059069&categoryPath=58.65.03.03.00.00&mainProductId=1412059069&mediumId=21&pageIndex=1&sortType=1&filterType=1&isSystem=1&tagId=0&tagFilterCount=0&template=cloth

http://product.dangdang.com/index.php?r=comment%2Flist&productId=1412059069&categoryPath=58.65.03.03.00.00&mainProductId=1412059069&mediumId=21&pageIndex=2&sortType=1&filterType=1&isSystem=1&tagId=0&tagFilterCount=0&template=cloth&long_or_short=short

http://product.dangdang.com/index.php?r=comment%2Flist&productId=1412059069&categoryPath=58.65.03.03.00.00&mainProductId=1412059069&mediumId=21&pageIndex=3&sortType=1&filterType=1&isSystem=1&tagId=0&tagFilterCount=0&template=cloth&long_or_short=short

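Comparing them, the only meaningful difference between pages is the pageIndex parameter: 1 for the first page, 2 for the second, and so on. The long_or_short=short parameter has no effect on the result, so it is left out.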
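The pagination rule can be sketched with the standard library. This is only an illustration: the helper name build_comment_url and the constant BASE_URL are made up for this example, and the query parameters are copied from the URLs listed earlier.

```python
from urllib.parse import urlencode

# Hypothetical names for illustration; the parameter values come from the URLs above.
BASE_URL = "http://product.dangdang.com/index.php"


def build_comment_url(page, product_id="1412059069"):
    """Return the review-list URL for a 1-based page number."""
    params = {
        "r": "comment/list",  # urlencode turns the slash into %2F
        "productId": product_id,
        "categoryPath": "58.65.03.03.00.00",
        "mainProductId": product_id,
        "mediumId": 21,
        "pageIndex": page,  # the only value that changes between pages
        "sortType": 1,
        "filterType": 1,
        "isSystem": 1,
        "tagId": 0,
        "tagFilterCount": 0,
        "template": "cloth",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Build and print the first three page URLs.
urls = [build_comment_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Building the query from a dict keeps pageIndex as the single moving part and avoids hand-editing a long URL string.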

With the analysis done, here is the code:

import requests
import jsonpath
import re
import json

if __name__ == '__main__':
    # Ask how many pages to crawl
    pages = int(input('Number of pages to crawl: '))
    for page in range(1, pages + 1):
        # Target URL; only pageIndex changes between pages
        url = f'http://product.dangdang.com/index.php?r=comment%2Flist&productId=1412059069&categoryPath=58.65.03.03.00.00&mainProductId=1412059069&mediumId=21&pageIndex={page}&sortType=1&filterType=1&isSystem=1&tagId=0&tagFilterCount=0&template=cloth'
        # Request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
            'Referer': 'http://product.dangdang.com/1412059069.html',
            'Cookie': 'from=460-5-biaoti; order_follow_source=P-460-5-bi%7C%231%7C%23www.baidu.com%252Fother.php%253Fsc.060000jRtGgkBd47ECAxHUxBlqwLkfBJsl8lSLtmm9Zl27Qa_kZyOm2Qg_lyRgkRd4vKD9uWt%7C%230-%7C-; ddscreen=2; __permanent_id=20210304204636997189494350346254347; __visit_id=20210304204637001245338343220621735; __out_refer=1614861997%7C!%7Cwww.baidu.com%7C!%7C%25E5%25BD%2593%25E5%25BD%2593%25E7%25BD%2591; __ddc_15d_f=1614861997%7C!%7C_utm_brand_id%3D11106; dest_area=country_id%3D9000%26province_id%3D111%26city_id%3D0%26district_id%3D0%26town_id%3D0; pos_0_end=1614862009963; __ddc_1d=1614862062%7C!%7C_utm_brand_id%3D11106; __ddc_24h=1614862062%7C!%7C_utm_brand_id%3D11106; __ddc_15d=1614862062%7C!%7C_utm_brand_id%3D11106; pos_9_end=1614862078563; ad_ids=4343831%2C3554365%7C%233%2C2; secret_key=f097eea219c17c155499399cb471dd5a; pos_1_start=1614863547245; pos_1_end=1614863547264; __rpm=%7Cp_1412059069.029..1614863548625; __trace_id=20210304211706253212636290464425201'
        }
        # Send the request and get the response
        response = requests.get(url, headers=headers)
        # The response body is JSON
        py_data = response.json()
        # Pull the HTML string out of the JSON with jsonpath
        html_data = jsonpath.jsonpath(py_data, '$..html')[0]
        # Extract the reviews and user nicknames from the HTML with regular expressions
        comment_list = re.findall(r'<div class="describe_detail">\s*<span>(.*?)</span>\s*</div>', html_data)
        nickname_list = re.findall(r'alt="(.*?)"', html_data)
        # Pair each nickname with its review (zip avoids reusing the outer loop variable)
        for nickname, comment in zip(nickname_list, comment_list):
            dict_ = {nickname: comment}
            # Serialize the dict as JSON, keeping Chinese characters readable
            json_data = json.dumps(dict_, ensure_ascii=False) + ',\n'
            # Append the record to a local file
            with open('當當網商品評價.json', 'a', encoding='utf-8') as f:
                f.write(json_data)
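To try the extraction step without touching the network, the following sketch runs the same two regular expressions on a made-up payload. Everything about the sample is an assumption: the nesting, the markup, and the nicknames are invented, and the real response may look different. The helper find_key is a hypothetical standard-library stand-in for jsonpath's '$..html' query, used here so the example has no third-party dependency.

```python
import re


def find_key(node, key):
    """Collect every value stored under `key` at any depth (similar to jsonpath '$..key')."""
    found = []
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


# Invented sample payload; the real response structure is an assumption here.
sample = {
    "data": {
        "list": {
            "html": (
                '<img src="a.png" alt="user_a">'
                '<div class="describe_detail"> <span>Fits well</span> </div>'
                '<img src="b.png" alt="user_b">'
                '<div class="describe_detail">\n<span>Good value</span>\n</div>'
            )
        }
    }
}

# Step 1: pull the HTML string out of the JSON.
html_data = find_key(sample, "html")[0]

# Step 2: apply the same patterns as the article's code.
comments = re.findall(
    r'<div class="describe_detail">\s*<span>(.*?)</span>\s*</div>', html_data
)
nicknames = re.findall(r'alt="(.*?)"', html_data)

# Step 3: pair nicknames with reviews; zip stops at the shorter list.
pairs = list(zip(nicknames, comments))
print(pairs)
```

One caveat worth checking against a real response: the alt pattern matches every alt attribute in the HTML, so if a page contains other images, the nickname list may not line up with the review list.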

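I crawled two pages, and each page yielded 10 entries.

In the output file, each line holds one {"nickname": "review"} object followed by a comma.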
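Because every record written by the script ends with a trailing comma, the file as a whole cannot be loaded with json.load. One possible alternative, sketched below, is to write one JSON object per line (often called JSON Lines) and parse the file line by line. The function names, field names, and file name are hypothetical, and the sample pairs are invented.

```python
import json
import os
import tempfile


def append_reviews(path, pairs):
    """Append (nickname, comment) pairs to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for nickname, comment in pairs:
            record = {"nickname": nickname, "comment": comment}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_reviews(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with invented data, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "reviews.jsonl")
append_reviews(demo_path, [("user_a", "Fits well"), ("user_b", "Good value")])
reviews = load_reviews(demo_path)
print(len(reviews), reviews[0]["nickname"])
```

Opening the file once per page, rather than once per record, also cuts down on repeated file opens during a long crawl.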
