Reading web pages with python3: using python3 + selenium to get the links of all static resource files a page loads

Published: 2023/12/19 python 17 豆豆


Software versions:

python 3.7.2

selenium 3.141.0

pycharm 2018.3.5

The implementation is as follows; here is the code:

import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Turn on Chrome's performance log so that get_log('performance') has network events to return.
# Newer chromedriver releases expect the key 'goog:loggingPrefs' instead.
d = DesiredCapabilities.CHROME.copy()
d['loggingPrefs'] = {'performance': 'ALL'}

chrome_options = Options()
# Run Chrome headless
chrome_options.add_argument('--headless')
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
# Start the browser maximized
chrome_options.add_argument('--start-maximized')

# Replace this with the path to your own chromedriver
browser = webdriver.Chrome("D://googleDever//chromedriver.exe", options=chrome_options, desired_capabilities=d)
browser.set_page_load_timeout(150)
browser.get("https://www.xxx.com")

# List that collects the static resource links
urls = []

# Walk the performance log and keep the valid static resource links
for log in browser.get_log('performance'):
    if 'message' not in log:
        continue
    log_entry = json.loads(log['message'])
    try:
        params = log_entry['message']['params']
        url = params['request']['url']
        # Skip base64 references that start with data: and the Document request for the page itself
        if not url.startswith('data:') and 'Document' not in params['type']:
            urls.append(url)
    except KeyError:
        # Log entries that are not network requests lack these keys
        continue

print(urls)
browser.quit()

The printed result is the list of static resource file links loaded while the page was rendered:

['http://www.xxx.com/aaa.js', 'http://www.xxx.com/css.css']
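The filtering step can also be tried without a browser. The following is a minimal sketch, not the article's own code: `extract_resource_urls` and `_entry` are hypothetical helper names, and the sample entries are synthetic data shaped like what `browser.get_log('performance')` returns. It applies the same two filters (skip `data:` URIs and `Document` requests) and, as an added assumption, drops duplicate URLs.

```python
import json


def extract_resource_urls(entries):
    """Return de-duplicated request URLs from performance-log entries."""
    urls = []
    seen = set()
    for entry in entries:
        if 'message' not in entry:
            continue
        message = json.loads(entry['message']).get('message', {})
        params = message.get('params', {})
        url = params.get('request', {}).get('url')
        if url is None:
            # Not a request event, so there is no URL to record
            continue
        if url.startswith('data:') or 'Document' in params.get('type', ''):
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def _entry(url, resource_type):
    # Build a synthetic log entry (hypothetical test data, not real browser output)
    payload = {'message': {'method': 'Network.requestWillBeSent',
                           'params': {'request': {'url': url},
                                      'type': resource_type}}}
    return {'message': json.dumps(payload)}


sample = [
    _entry('http://www.xxx.com/', 'Document'),
    _entry('http://www.xxx.com/aaa.js', 'Script'),
    _entry('data:image/png;base64,AAAA', 'Image'),
    _entry('http://www.xxx.com/css.css', 'Stylesheet'),
    _entry('http://www.xxx.com/aaa.js', 'Script'),
    {'level': 'INFO'},
]

print(extract_resource_urls(sample))
```

With a live session, the same function could be called as `extract_resource_urls(browser.get_log('performance'))`.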

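The code above uses selenium to collect the links of the static resource files that the page requests while it loads. Once you have the links, you can download the resources with another tool.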
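The article does not say which tool to use for the download. As one possible approach, here is a minimal sketch based only on Python's standard library; `local_name`, `download_all` and the `resources` directory are hypothetical names chosen for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_name(url, index):
    """Derive a file name from the URL path; fall back to a numbered name."""
    name = os.path.basename(urlparse(url).path)
    return name or 'resource_{}'.format(index)


def download_all(urls, target_dir='resources'):
    """Download every URL into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        path = os.path.join(target_dir, local_name(url, index))
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved


# Example call (requires network access), using the list built earlier:
# download_all(urls)
```

Two URLs that end in the same file name would overwrite each other in this sketch, so a real script would probably need a more careful naming scheme.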

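Supplementary note: in IDEA, import sys and import requests are reported as errors in Python code

File -> Project Structure

Project -> SDK -> New -> OK

Set the compile parameters (mainly, set and check that the Python SDK is correct)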
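To check from inside the project which interpreter is actually being used, a short script like the one below can help. It is only a sketch: `module_available` is a hypothetical helper name, and whether `requests` is found depends on the packages installed for the selected SDK.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(name) is not None


# Path of the interpreter that runs this script; it should match the SDK chosen in the IDE
print(sys.executable)
print('sys importable:', module_available('sys'))
print('requests importable:', module_available('requests'))
```

If the printed interpreter path is not the SDK you selected, or `requests` is reported as missing, the SDK setting or the installed packages are the first things to check.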
