Scraping the Xuexi Qiangguo (学习强国) website

Published: 2024/8/26

Requirements

Scrape the news data shown on the page https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/018d244441062d8916dd472a4c6a0a0b.html.

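Project analysis

1 First, request the page URL and check whether the data shown in the browser is present in the HTML of the response.

2 The HTML of the response does not contain the page data, so the news data on Xuexi Qiangguo is loaded dynamically. The packet-capture tool also shows that it is not an AJAX request (no AJAX packet carrying the data was captured), which leaves JavaScript as the only way the data can be produced.

3 Use the capture tool built into Google Chrome to find which JS request returns the data: open the developer tools and refresh the page.

4 Inspect the details of that response.

5 The URLs of the detail pages can be taken from the same response.

6 URL analysis. The data file sits in the same directory as the page, and its name is the page's file name prefixed with data, with .js in place of .html:

Detail page: https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/e43e220633a65f9b6d8b53712cba9caa.html

Detail page data: https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/datae43e220633a65f9b6d8b53712cba9caa.js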
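The URL pattern from step 6 can be sketched as a small helper. This is only a minimal sketch: it assumes that the pattern seen in the URL pairs in this article (same directory, `data` prefix, `.js` instead of `.html`) holds for every page, and the function name `page_url_to_data_url` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def page_url_to_data_url(page_url):
    """Map .../<dir>/<name>.html to .../<dir>/data<name>.js.

    Assumption: the pattern observed in this article holds in general.
    """
    parts = urlsplit(page_url)
    # Split the path into the directory and the last segment (the file name).
    directory, _, filename = parts.path.rpartition('/')
    if not filename.endswith('.html'):
        raise ValueError(f'not an .html page URL: {page_url}')
    data_name = 'data' + filename[:-len('.html')] + '.js'
    return urlunsplit(parts._replace(path=directory + '/' + data_name))


detail_page = 'https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/e43e220633a65f9b6d8b53712cba9caa.html'
print(page_url_to_data_url(detail_page))
```

Only the last path segment is rewritten. A plain string replacement of `/e` with `/datae` would also change a directory name that starts with `e`, which is the case for the detail URL used here.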

Full crawl: code implementation

import requests
import re
import json
from lxml import etree

class Spider:
    def __init__(self, headers, url,fp=None):
        self.headers = headers
        self.url=url
        self.fp=fp

    def open_file(self):
        self.fp = open('學習強國2.txt', 'w', encoding='utf8')

    def get_data(self):
        return requests.get(url=self.url, headers=self.headers).text

    def parse_home_data(self):
        ex = '"static_page_url":"(.*?)"'
        home_data=self.get_data()
        return re.findall(ex,home_data)

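    def parse_detail_data(self):
        detail_urls = self.parse_home_data()
        print(detail_urls)
        for i, url in enumerate(detail_urls, start=1):
            # Gotcha: a response may be a "<title>系統維護中</title>" (system under
            # maintenance) page instead of data; such entries fail to parse and are skipped.
            try:
                # Detail data URL: same directory, "data" + file name, .html -> .js.
                # Only the last path segment is rewritten, so a directory name that
                # starts with "e" is left untouched.
                base, name = url.rsplit('/', 1)
                self.url = base + '/data' + re.sub(r'\.html$', '.js', name)
                detail_data = self.get_data()
                # Drop the "globalCache = " prefix and the final character so that
                # only the JSON object remains.
                detail_data = detail_data.replace('globalCache = ', '')[:-1]
                dic_data = json.loads(detail_data)
                # Get the key of the first key-value pair in the dictionary.
                first = list(dic_data.keys())[0]
                title = dic_data[first]['detail']['frst_name']
                content_html = dic_data[first]['detail']['content_list'][0]['content']
                tree = etree.HTML(content_html)
                content_list = tree.xpath('.//p/text()')
            except Exception as e:
                print(e)
                continue
            self.fp.write(f'第{i}章' + title + '\n' + ''.join(content_list) + '\n\n')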

    def close_file(self):
        self.fp.close()

    def run(self):
        self.open_file()
        self.parse_detail_data()
        self.close_file()

if __name__ == '__main__':
    headers = {
        'Host': 'www.xuexi.cn',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    url = 'https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/data018d244441062d8916dd472a4c6a0a0b.js'
    spider=Spider(url=url,headers=headers)
    spider.run()
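The parsing step inside `parse_detail_data` can be tried offline, without touching the site. The following is a minimal sketch under stated assumptions: the sample payload is made up and only mirrors the keys used in the spider (`detail`, `frst_name`, `content_list`, `content`), the names `parse_global_cache`, `extract_article` and `_ParagraphText` are hypothetical, and the standard-library `html.parser` is swapped in for `lxml` so that the sketch has no third-party dependencies.

```python
import json
import re
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collect text that appears inside <p> elements.

    Unlike the XPath './/p/text()', this also keeps text nested in child tags of <p>.
    """

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def parse_global_cache(js_text):
    """Turn 'globalCache = {...};' into a dict (assumes exactly that wrapper)."""
    body = re.sub(r'^\s*globalCache\s*=\s*', '', js_text).rstrip().rstrip(';')
    return json.loads(body)


def extract_article(data):
    """Return (title, text) from the first entry, following the keys used in the spider."""
    first = next(iter(data))
    detail = data[first]['detail']
    parser = _ParagraphText()
    parser.feed(detail['content_list'][0]['content'])
    return detail['frst_name'], ''.join(parser.chunks)


# Hypothetical sample payload, NOT real data from xuexi.cn.
sample = ('globalCache = {"fp_abc": {"detail": {"frst_name": "Sample title", '
          '"content_list": [{"content": "<p>First.</p><div>skip</div><p>Second.</p>"}]}}};')
title, text = extract_article(parse_global_cache(sample))
print(title)
print(text)
```

Stripping the wrapper with a regular expression and `rstrip(';')` is a little more forgiving than slicing off the last character, because it still works when the payload ends with whitespace or has no trailing semicolon.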

Result:

64 articles in total.
