11 - Selenium Browser Automation

Published: 2024/9/15 | HTML | 29 豆豆


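Selenium

- Concept:

Selenium is an automation framework for web applications.
Automation: with Selenium we can write programs that operate a web page in a browser the way a person would, for example clicking buttons and typing into text boxes, and that read information back from the page. Typical targets include 12306 ticket information, job listings on recruitment sites, stock prices on finance sites, and dragging slider captchas. The program can then analyze and process what it collected.
How Selenium automation works: the script calls the Selenium client library, the library sends commands to a browser driver, and the driver controls the browser and passes the results back.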

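- Installing selenium
- pip install selenium
- Installing a browser driver
- A browser driver is specific to one browser, so each browser needs its own driver. Among the current mainstream browsers, Chrome's support for Selenium automation is the more mature.
We use Chrome as the example. The driver can be downloaded from:
- https://chromedriver.storage.googleapis.com/index.html
- Note: the index at this URL covers drivers up to Chrome 114; drivers for newer Chrome versions are published on the Chrome for Testing pages instead.

For example, if the installed Chrome is version 72, you would normally download the driver from the directory whose name starts with 72.
- Note: the closer the driver and browser versions are, the better, but a small difference (such as 72 versus 73) usually causes no problems.
For example, unzip the driver into d:\webdrivers, so that the Chrome driver path is d:\webdrivers\chromedriver.exe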
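The version-matching rule of thumb above can be sketched in a few lines. This is only an illustration: the helper names `major_version` and `versions_compatible`, the tolerance of one major version, and the sample version strings are all assumptions, not part of Selenium or ChromeDriver.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_compatible(browser: str, driver: str, tolerance: int = 1) -> bool:
    """True when the major versions differ by at most `tolerance`.

    The default tolerance of 1 mirrors the "72 versus 73 is usually fine"
    remark; it is an assumption, not an official compatibility rule.
    """
    return abs(major_version(browser) - major_version(driver)) <= tolerance


# Hypothetical version strings, for illustration only
print(versions_compatible("72.0.3626.121", "72.0.3626.69"))  # True
print(versions_compatible("72.0.3626.121", "73.0.3683.20"))  # True
print(versions_compatible("72.0.3626.121", "80.0.3987.16"))  # False
```

A check like this could run before the browser is started, so that a mismatch is reported as a clear message rather than as a driver start-up failure.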

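# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Create the WebDriver object and point it at the Chrome driver.
# Selenium 4 takes the driver path through a Service object.
wd = webdriver.Chrome(service=Service(r'd:\webdrivers\chromedriver.exe'))

# get() makes the browser open the given URL
wd.get('https://www.baidu.com')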

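If the driver executable is placed in the Python installation directory (or in any other directory on PATH), the driver path does not need to be specified. Selenium 4.6 and later also bundle Selenium Manager, which can fetch a matching driver automatically when none is found.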
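Whether the driver can be found without an explicit path comes down to a PATH lookup. The sketch below uses only the standard library; the helper name `find_driver` is hypothetical, and the executable name `chromedriver` is assumed.

```python
import shutil
from typing import Optional


def find_driver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver is not on PATH; pass its location explicitly")
else:
    print("chromedriver found at", path)
```

On Windows, `shutil.which` also tries the usual executable extensions, so the name can be given without `.exe`.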

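How selenium relates to web scraping

1. It conveniently captures dynamically loaded data of any kind (whatever is visible on the page can be retrieved)

2. It can simulate a login on jd
- Elements can be located with XPath expressions, or with CSS selectors (by id, class, and so on)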

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()  # create the WebDriver object for Chrome
wd.get('https://www.jd.com')  # open the given URL in the browser

search = wd.find_element(By.XPATH, '//*[@id="key"]')  # locate the search box
search.send_keys('macbook pro')  # type the search text
# click() returns None, so its result is not assigned
wd.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button').click()  # click the search button
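To make the "XPath or CSS" choice concrete, here is a small sketch that builds equivalent locators for the same element. The helper names are hypothetical. The strategy strings `"xpath"` and `"css selector"` are the values behind Selenium's `By.XPATH` and `By.CSS_SELECTOR`, written out so the snippet runs without a browser.

```python
from typing import Tuple

Locator = Tuple[str, str]  # (strategy, value), the argument pair find_element expects


def by_id_xpath(element_id: str) -> Locator:
    """XPath locator matching any element with the given id."""
    return ("xpath", f'//*[@id="{element_id}"]')


def by_id_css(element_id: str) -> Locator:
    """CSS locator for the same element."""
    return ("css selector", f"#{element_id}")


def by_class_css(class_name: str) -> Locator:
    """CSS locator matching elements that carry the given class."""
    return ("css selector", f".{class_name}")


print(by_id_xpath("key"))  # ('xpath', '//*[@id="key"]')
print(by_id_css("key"))    # ('css selector', '#key')
```

With a live driver, either tuple can be unpacked into the call, for example `wd.find_element(*by_id_css("key"))`. Both forms should find the JD search box used above.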

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

wd = webdriver.Chrome()  # create the WebDriver object for Chrome
wd.get('https://www.jd.com')  # open the given URL in the browser
wd.implicitly_wait(5)  # implicit wait: element lookups retry for up to 5 seconds

search = wd.find_element(By.XPATH, '//*[@id="key"]')  # locate the search box
search.send_keys('macbook pro')  # type the search text
wd.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button').click()  # click the search button
time.sleep(2)  # wait 2 seconds before continuing

# Scroll down on the results page by executing JavaScript (JS injection)
wd.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)  # pause 2 seconds so the scroll is visible
wd.quit()  # close the browser
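The fixed `time.sleep` calls above either wait too long or not long enough. An alternative is to poll for a condition until a deadline. The sketch below shows that idea in plain Python. `wait_until` is a hypothetical helper, not a Selenium function; Selenium's own `WebDriverWait` in `selenium.webdriver.support.ui` offers the same kind of explicit wait.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 5.0,
               poll: float = 0.2) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Demonstration without a browser: the condition succeeds on the third call
attempts = {"n": 0}


def ready():
    attempts["n"] += 1
    return "loaded" if attempts["n"] >= 3 else None


print(wait_until(ready, timeout=2, poll=0.01))  # loaded
```

With a driver, the condition could be `lambda: wd.find_elements("css selector", ".gl-item")`, since `find_elements` returns an empty list while nothing matches.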

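Scraper demo

3. Scraping JD product data with selenium (pagination does not work in this example)
- A closer look at the JD page shows that it is generated dynamically in two halves: the first half of the results is shown initially, and the second half appears after you scroll down
- Each time you scroll past the first half, a new s_new.php?.. request is issued; note how the page parameter of the request changes
- It follows that one page on the site is actually made up of 2 values of page, which explains the error
- When the page has just loaded, it contains only page: 1 and the page frame has only just rendered, so the pagination block sits directly below page: 1. By the time selenium selects the pagination block, the page has scrolled down, Ajax has loaded page: 2, and the page has changed, so the previously selected element is no longer valid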
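One way to work around the two-phase loading described above is to keep scrolling until the number of product items stops growing, and only then look up the pagination link. The sketch below assumes the `.gl-item` class used elsewhere in this article. `scroll_until_stable` is a hypothetical helper that accepts any object with `execute_script` and `find_elements` methods, such as a Selenium WebDriver.

```python
import time


def scroll_until_stable(driver, item_css: str = ".gl-item",
                        pause: float = 1.0, max_rounds: int = 10) -> int:
    """Scroll to the bottom until the count of matching items stops changing.

    Returns the last item count that was observed.
    """
    previous = -1
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the Ajax request time to finish
        count = len(driver.find_elements("css selector", item_css))
        if count == previous:
            break  # nothing new was loaded in this round
        previous = count
    return previous
```

Calling `scroll_until_stable(driver)` before `find_element` for the next-page link means the link is located after the second half has loaded, so the reference should not go stale for the reason given above. Whether this is enough on the live site is untested here.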

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # keyboard key constants
import time


# 1. Simulate a user visiting the site
def spider(url, keyword):
    driver = webdriver.Chrome()  # start the browser
    driver.get(url)
    driver.maximize_window()  # maximize the window
    driver.implicitly_wait(5)  # implicit wait: element lookups retry for up to 5 seconds
    try:
        input_tag = driver.find_element(By.ID, 'key')  # locate the search box
        input_tag.send_keys(keyword)  # type the keyword
        input_tag.send_keys(Keys.ENTER)  # press Enter
        time.sleep(5)  # wait 5 seconds
        driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')  # scroll to the bottom
        get_goods(driver)
    finally:
        driver.quit()  # runs whether or not an exception occurred


# 2. Locate and extract the product data
def get_goods(driver):
    try:
        goods = driver.find_elements(By.CLASS_NAME, 'gl-item')
        for good in goods:  # name, link, price, review count
            detail_url = good.find_element(By.TAG_NAME, 'a').get_attribute('href')
            p_name = good.find_element(By.CSS_SELECTOR, '.p-name em').text.replace('\n', '')  # name
            price = good.find_element(By.CSS_SELECTOR, '.p-price i').text  # price
            p_commit = good.find_element(By.CSS_SELECTOR, '.p-commit a').text  # review count
            msg = 'Product: %s\nLink: %s\nPrice: %s\nReviews: %s\n' % (p_name, detail_url, price, p_commit)
            print(msg)
            with open('jd.txt', 'a', encoding='utf-8') as jdf:
                jdf.write(msg)
        print("Done writing")
    except Exception as exc:
        print('Scraping failed:', exc)  # report the error instead of hiding it

    # 3. Fetch more data (pagination); this is the step that fails on JD,
    #    for the reason described above
    try:
        # The original post was written in Simplified Chinese, so the link text is assumed to be '下一页'
        button = driver.find_element(By.LINK_TEXT, '下一页')
        button.click()
        time.sleep(5)
        get_goods(driver)  # scrape the next page
    except Exception:
        pass


if __name__ == '__main__':  # standard entry-point check
    spider("https://www.jd.com/", keyword='macbook pro')

Scraping numbers with automatic pagination

import time
from selenium import webdriver
from selenium.webdriver.common.by import By


# Start the browser and open the site
def spider(url):
    try:
        wd = webdriver.Chrome()
        wd.get(url)  # open the site
        wd.implicitly_wait(5)  # implicit wait
        phones1(wd)
    except Exception:
        pass


def phones1(wd):
    try:
        phones = wd.find_elements(By.CLASS_NAME, 'r-left')
        for phone in phones:
            yzc = phone.text
            print(yzc)
            with open('yzc2.txt', 'a', encoding="utf-8") as yzcf:
                yzcf.write(yzc + "\n")
        print("Done writing")
    except Exception:
        pass
    try:
        # Link text assumed to be Simplified Chinese, as in the original post
        wd.find_element(By.LINK_TEXT, '下一页').click()
        time.sleep(2)
        phones1(wd)  # scrape the next page
    except Exception:
        pass  # no next-page link: stop


if __name__ == '__main__':
    spider('https://ketangsadas.aboatedu.com/question/comQuestionIndex')
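The script above pages by calling itself recursively, so a very long listing could hit Python's recursion limit. A loop with a page cap is one alternative. In the sketch below, `scrape_pages` and `click_next` are hypothetical helpers, and the default link text `'下一页'` is an assumption about the page. `click_next` works with any object that has a `find_elements` method, such as a WebDriver.

```python
from typing import Callable, Iterable, List


def scrape_pages(read_page: Callable[[], Iterable[str]],
                 go_next: Callable[[], bool],
                 max_pages: int = 50) -> List[str]:
    """Collect items page by page until go_next() reports no further page."""
    items: List[str] = []
    for _ in range(max_pages):
        items.extend(read_page())
        if not go_next():
            break
    return items


def click_next(wd, link_text: str = "下一页") -> bool:
    """Click the next-page link if present; return False when there is none."""
    links = wd.find_elements("link text", link_text)  # empty list when absent
    if not links:
        return False
    links[0].click()
    return True
```

With a live driver this could be wired up as `scrape_pages(lambda: [e.text for e in wd.find_elements("class name", "r-left")], lambda: click_next(wd))`, with a short wait added inside the second callable if the page needs time to load.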

Checking whether a phone number is registered

Note: for reference only. Do not attempt brute-force attacks; get too good at scraping and you will be eating prison meals.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()  # start the browser
wd.get('https://ketandsadasg.aboatedu.com/login/forget?method=phone')  # open the site
wd.implicitly_wait(5)  # implicit wait of up to 5 seconds


def a(phones1):
    element = wd.find_element(By.ID, 'mobile')  # locate the input box
    element.send_keys(phones1)  # type the phone number
    wd.find_element(By.ID, 'nextStep').click()  # click "next" to see whether the number is registered
    # time.sleep(1)  # one-second pause
    try:
        elements = wd.find_element(By.CSS_SELECTOR, '#mobileTips')
        register = elements.text
        print(register)
        wd.find_element(By.ID, 'mobile').clear()  # clear the input box
        with open('result.txt', 'a', encoding="utf-8") as yzcf:
            yzcf.write((phones1 + register).replace('\n', ' ') + "\n")
        print("Done writing")
    except Exception:
        pass


def b():
    with open("phones.txt", mode="r", encoding="utf-8") as phone:  # open the phone number file
        for phones1 in phone:  # loop over the numbers
            phones1 = phones1.strip()  # drop the trailing newline
            time.sleep(1)
            print(phones1)
            a(phones1)


if __name__ == '__main__':
    b()

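Action Chains

Action chain: a series of consecutive actions (such as a drag)

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
import time

url = "https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable"
bro = webdriver.Chrome()
bro.get(url)
time.sleep(1)

# Locating an element with the find functions fails if the element sits inside an iframe.
# Solution: switch into the frame with switch_to first.
bro.switch_to.frame("iframeResult")  # enter the iframe
div_tag = bro.find_element(By.XPATH, '//*[@id="draggable"]')

# Drag div_tag
action = ActionChains(bro)
action.click_and_hold(div_tag).perform()  # press and hold the mouse button on the element
for i in range(6):
    # perform() executes the queued actions immediately
    action.move_by_offset(10, 15).perform()  # move 10 px in x and 15 px in y
    time.sleep(0.5)
action.release().perform()  # release only takes effect once perform() is called
bro.quit()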
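The loop above moves by a fixed (10, 15) six times. When the total drag distance is what is known, the per-step offsets can be computed instead. The sketch below is a hypothetical helper, `split_offset`, that divides a total offset into integer steps which add up exactly to the requested distance.

```python
from typing import List, Tuple


def split_offset(total_x: int, total_y: int, steps: int) -> List[Tuple[int, int]]:
    """Split a drag of (total_x, total_y) pixels into `steps` integer moves
    whose sum is exactly the requested total."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    moves = []
    done_x = done_y = 0
    for i in range(1, steps + 1):
        # Position the pointer should have reached after step i
        target_x = round(total_x * i / steps)
        target_y = round(total_y * i / steps)
        moves.append((target_x - done_x, target_y - done_y))
        done_x, done_y = target_x, target_y
    return moves


print(split_offset(60, 90, 6))
# [(10, 15), (10, 15), (10, 15), (10, 15), (10, 15), (10, 15)]
```

Each pair can then be fed to the chain, for example `for dx, dy in split_offset(60, 90, 6): action.move_by_offset(dx, dy).perform()`.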

Headless browser

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # controls how Chrome is launched
from selenium.webdriver.chrome.service import Service

# Create an options object that makes Chrome start without a window
chrome_options = Options()
# --headless turns on headless mode; --disable-gpu is commonly added alongside it
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Location of the Chrome driver
path = r'C:\python\chromedriver.exe'

# Create the browser object (Selenium 4 takes service= and options=)
browser = webdriver.Chrome(service=Service(path), options=chrome_options)

# Open the page
url = 'http://www.jd.com/'
browser.get(url)
time.sleep(3)

# Take a screenshot
browser.save_screenshot('baid.png')
print(browser.page_source)
browser.quit()
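The command-line switches can be collected in one place so that headless mode is easy to toggle. The sketch below is one possible arrangement: the helper names and the default window size of 1920x1080 are assumptions. A window size is often set explicitly in headless mode so that screenshots do not come out at a small default size. `make_options` imports selenium only when it is called, so the argument list itself can be built and inspected without it.

```python
from typing import List, Optional, Tuple


def chrome_arguments(headless: bool = True,
                     window_size: Optional[Tuple[int, int]] = (1920, 1080)) -> List[str]:
    """Return the Chrome command-line switches for the requested setup."""
    args: List[str] = []
    if headless:
        args.append("--headless")
        args.append("--disable-gpu")
    if window_size is not None:
        args.append("--window-size=%d,%d" % window_size)
    return args


def make_options(arguments: List[str]):
    """Build a Chrome Options object (requires the selenium package)."""
    from selenium.webdriver.chrome.options import Options  # imported lazily
    options = Options()
    for arg in arguments:
        options.add_argument(arg)
    return options


print(chrome_arguments())
# ['--headless', '--disable-gpu', '--window-size=1920,1080']
```

The result would be passed along as `webdriver.Chrome(options=make_options(chrome_arguments()))`.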

Selenium WebDriver: basic operations

Selenium WebDriver can navigate back and forward, refresh and maximize the window, get the window position, set the window size, and get the page title, the page source, and the current URL.

from selenium import webdriver
from selenium.webdriver.ie.service import Service

driver = webdriver.Ie(service=Service("e:\\IEDriverServer"))  # open the browser
driver.get("http://wenku.baidu.com")  # open the URL
driver.back()  # go back
driver.forward()  # go forward
driver.refresh()  # reload the page

driver.set_page_load_timeout(2)  # page-load timeout in seconds; stop waiting after that
try:  # catch the timeout exception
    driver.get("http://www.sohu.com")
except Exception as e:
    print(e)  # Message: Timed out waiting for page to load.

driver.maximize_window()  # maximize the window
driver.get_window_position()  # window coordinates, e.g. {'y': -8, 'x': 1672}
driver.name  # which browser is in use, e.g. 'internet explorer'
driver.set_window_position(y=200, x=400)  # set the window coordinates
# y runs top to bottom, with y=0 at the top of the screen; x runs left to right, with x=0 at the left edge.
# A window that is not on the current screen can have negative coordinates.
# Setting coordinates has no effect while the window is maximized.
driver.get_window_position()['x']  # x position, e.g. 2335
driver.get_window_position()['y']  # y position, e.g. 98

driver.get_window_size()  # window size, e.g. {'width': 160, 'height': 32}
driver.get_window_size()['width']  # window width, e.g. 160
driver.get_window_size()['height']  # window height, e.g. 32
driver.set_window_size(100, 200)  # set the window size (width, height)

print(driver.title)  # page title, useful for asserting that the right page opened; prints 搜狐
assert "搜狐" == driver.title  # passes when the title is correct
# assert "搜狐2" == driver.title  # a wrong title raises AssertionError

driver.page_source  # page source; a property (no parentheses) that returns a str
# WebDriver returns the source after the page's JavaScript has run, so it includes dynamic data, but it is relatively slow.
# requests, covered earlier, fetches source quickly but only suits static pages and cannot capture JavaScript-generated content.
# Having the source is very useful because it can be processed freely.

driver = webdriver.Ie(service=Service("e:\\IEDriverServer"))
driver.get("http://www.iciba.com")
driver.page_source[:50]  # the first 50 characters of the source (not line 50)
# '<html><head><style></style><avalon class="avalonHi'
"热门词汇" in driver.page_source  # whether the given text occurs in the source; True in the original session
html = driver.page_source.encode("gbk", "ignore")  # encode as GBK bytes; "ignore" skips characters GBK cannot represent
with open("e:\\1.html", "wb") as fp:  # bytes must be written in binary mode
    fp.write(html)

driver.current_url  # URL of the current page, e.g. 'http://www.iciba.com/'
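Saving the page source and checking it for a piece of text do not need a browser once the source string is in hand. The sketch below writes the text as UTF-8 rather than GBK, which avoids dropping characters. `save_page_source` and `contains_text` are hypothetical helper names, and the sample HTML is made up for the demonstration.

```python
from pathlib import Path


def save_page_source(html: str, path: str) -> int:
    """Write the page source as UTF-8 text; return the number of characters written."""
    return Path(path).write_text(html, encoding="utf-8")


def contains_text(html: str, needle: str) -> bool:
    """True when `needle` occurs anywhere in the page source."""
    return needle in html


sample = "<html><head><title>demo</title></head><body>hello</body></html>"
print(contains_text(sample, "hello"))  # True
print(contains_text(sample, "world"))  # False
```

With a driver, the call would be `save_page_source(driver.page_source, "page.html")`.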

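Key points

1. Selecting elements by tag name, id, and class
2. Selecting elements with CSS
3. Switching between nested frame elements and between windows; a frame or iframe element contains another embedded HTML document
4. Working with select boxes in selenium
5. More techniques
6. XPath selectors and more
Further selenium material: https://download.csdn.net/download/qq_37978800/12715808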
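For the frame-switching point in the list above, a common slip is to stay inside the frame after the work is done, so later lookups in the main document fail. A context manager keeps the switch and the switch back together. The sketch below is a hypothetical helper, `inside_frame`; it relies only on `driver.switch_to.frame(...)` and `driver.switch_to.default_content()`, both of which a Selenium WebDriver provides.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block, then return
    to the top-level document, even if the block raises."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()
```

Usage would look like `with inside_frame(bro, "iframeResult"): bro.find_element("xpath", '//*[@id="draggable"]')`, after which lookups apply to the outer page again.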
