[Python Crawler] Scraping a Partially Refreshing Website with Selenium (URL Stays Fixed)

Published: 2024/6/1


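When scraping websites you will often run into partial dynamic refresh: when you click "next page" or a specific page number, the data refreshes, but the URL in the address bar never changes. How can the data on such a site be scraped? On the site shown in the figure below, clicking "page 5" leaves the URL unchanged, so the traditional approach of building each page's link by string concatenation does not work. This article addresses that problem.

The article uses Selenium to scrape a partially refreshing site: it locates the "next page" button, clicks it automatically, and then scrapes the content of each page in turn. I hope it helps, especially those of you facing the same problem. If there are errors or shortcomings in the article, please bear with me~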

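I. Scraping the First Page with Selenium

First, we try to scrape the content of the first page with Selenium. Right-click in the browser and choose "Inspect" to see the corresponding HTML source, as shown in the figure below. Each row of project information sits in a <tr>...</tr> under the <table class="table table-hover"> node.

Next, expand one of the <tr>...</tr> nodes to look at its source in detail, as shown in the figure below. It contains the announcement title, the publication date, and the project location. To grab the announcement title, locate the <div class="div_title text_view"> node, then read the title text and the hyperlink.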
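The row structure described above can be tried out offline. The following is a minimal sketch, not the author's code: it uses the standard-library xml.etree.ElementTree on a hand-written HTML fragment. The fragment, the example.com links, and the function name parse_rows are all made up for illustration, and it assumes the title cell is the first <td> of each row, with the date and the place in the second and third. It reads each <tr> as one record instead of collecting separate lists.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mirrors the structure described above.
SAMPLE = """
<html><body>
<table class="table table-hover"><tbody>
  <tr>
    <td><div class="div_title text_view"><a href="http://www.example.com/view/?id=1">Notice A</a></div></td>
    <td>2017-12-22</td>
    <td>Unknown</td>
  </tr>
  <tr>
    <td><div class="div_title text_view"><a href="http://www.example.com/view/?id=2">Notice B</a></div></td>
    <td>2017-12-21</td>
    <td>Sample County</td>
  </tr>
</tbody></table>
</body></html>
"""


def parse_rows(html):
    """Return one dict per <tr> of the table (title, url, time, place)."""
    root = ET.fromstring(html.strip())
    records = []
    for tr in root.findall(".//table[@class='table table-hover']/tbody/tr"):
        # The link inside the title div carries both the text and the href.
        link = tr.find("td/div[@class='div_title text_view']/a")
        records.append({
            "title": link.text.strip(),
            "url": link.get("href"),
            "time": tr.find("td[2]").text.strip(),   # second cell: date
            "place": tr.find("td[3]").text.strip(),  # third cell: location
        })
    return records


records = parse_rows(SAMPLE)
for record in records:
    print(record)
```

Real pages are rarely well-formed XML, so this only illustrates the locator logic; in the browser the same paths are evaluated by Selenium.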

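The complete code is as follows:

# coding=utf-8
# Python 3 with Selenium 4 (the find_elements_by_xpath helpers were removed in Selenium 4.3)
from selenium import webdriver
from selenium.webdriver.common.by import By

print("start")
# Open the Firefox browser
driver = webdriver.Firefox()

# Visit the site (placeholder URL, replace it with your own)
url = 'http://www.xxxx.com/'
print(url)
driver.get(url)

# Locate the title nodes and print their text
content = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
for u in content:
    print(u.text)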

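The output is shown in the figure below:

PS: For site-security reasons I do not give the real address here and only present the core idea of the crawler. The code below also omits the address, but the approach is the same. Please substitute your own partially refreshing URL for testing.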

II. Scraping a Partially Refreshing Site with Selenium

Next we want to scrape the content of page 2. The steps in code are:
    1. Create the driver: driver = webdriver.Firefox()
    2. Visit the URL: driver.get(url)
    3. Locate the nodes on the first page and scrape them: driver.find_elements(By.XPATH, ...)
    4. Locate the "next page" button and call click() to jump
    5. Scrape the content of page 2: driver.find_elements(By.XPATH, ...)
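The five steps above can be sketched as one loop. This is a minimal illustration under stated assumptions rather than the author's program: crawl_titles, StubElement and StubDriver are hypothetical names, the stub only imitates a site whose URL stays fixed while the content changes, and the locator strategy is passed as the plain string "xpath", which is the value of Selenium's By.XPATH constant.

```python
import time

TITLE_XPATH = "//div[@class='div_title text_view']"
NEXT_XPATH = "//ul[@class='pagination']/li[9]/a"


def crawl_titles(driver, url, pages, pause=5.0):
    """Return one list of title strings per page (steps 2 to 5 above)."""
    driver.get(url)                                   # step 2
    results = []
    for page in range(pages):
        elements = driver.find_elements("xpath", TITLE_XPATH)   # steps 3 and 5
        results.append([element.text for element in elements])
        if page < pages - 1:
            driver.find_element("xpath", NEXT_XPATH).click()    # step 4
            time.sleep(pause)                         # give the partial refresh time to finish
    return results


class StubElement:
    """Stand-in for a WebElement: has .text and .click()."""

    def __init__(self, text="", on_click=None):
        self.text = text
        self._on_click = on_click

    def click(self):
        if self._on_click is not None:
            self._on_click()


class StubDriver:
    """Stand-in for a browser: the URL never changes, only the visible page does."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0
        self.current_url = None

    def get(self, url):
        self.current_url = url

    def find_elements(self, by, value):
        return [StubElement(text) for text in self.pages[self.index]]

    def find_element(self, by, value):
        return StubElement("next", on_click=self._advance)

    def _advance(self):
        self.index += 1


# Dry run without a browser.
stub = StubDriver([["notice 1", "notice 2"], ["notice 3", "notice 4"]])
scraped = crawl_titles(stub, "http://www.xxxx.com/", pages=2, pause=0)
print(scraped)
```

With a real browser, step 1 is driver = webdriver.Firefox(), and that driver is passed in place of the stub.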

The core step is to locate the "next page" button and have Selenium click it automatically, which triggers the jump; after that, page 2 can be scraped. The source of the "next page" button is shown in the figure below:

The "next page" button always sits in the ninth <li>...</li>, so the core code is:
nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
nextPage.click()

The complete code is as follows:

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

print("start")
driver = webdriver.Firefox()

url = 'http://www.XXXX.com/'
print(url)
driver.get(url)

# Project titles
titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
for u in titles:
    print(u.text)

# Hyperlinks
urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
for u in urls:
    print(u.get_attribute("href"))

# Publication dates
times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
for u in times:
    print(u.text)

# Locations
places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")
for u in places:
    print(u.text)

# Click "next page"
nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
print(nextPage.text)
nextPage.click()
time.sleep(5)

# Scrape the data on page 2
content = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
for u in content:
    print(u.text)

The output is shown below. The second page was scraped successfully as well, and the author scraped the announcement title, hyperlink, publication date, and location.

>>>
start
http://www.xxxx.com/
觀山湖區依法治國普法教育基地（施工）中標候選人公示
興義市2017年農村公路生命防護工程一期招標公告
安龍縣市政廣場地下停車場10kV線路遷改、10kV臨時用電、10kV電纜敷設及400V電纜敷設工程施工公開競爭性談判公告
劍河縣小香雞種苗孵化場建設項目（場坪工程）中標公示
安龍縣棲鳳生態濕地走廊建設項目（原冰河步道A、B段）10kV線路、400V線路、220V線路及變壓器遷改工程施工招標
鎮寧自治縣2017年簡嘎鄉農村飲水安全鞏固提升工程(施工)招標廢標公示
S313線安龍縣城至普坪段道路改擴建工程勘察招標公告
S313線安龍縣城至普坪段道路改擴建工程勘察招標公告
貴州中煙工業有限責任公司2018物資公開招標-卷煙紙中標候選人公示
冊亨縣者樓河納福新區河段生態治理項目（上游一標段）初步設計 招標公告
http://www.gzzbw.cn/historydata/view/?id=116163
http://www.gzzbw.cn/historydata/view/?id=114995
http://www.gzzbw.cn/historydata/view/?id=115720
http://www.gzzbw.cn/historydata/view/?id=116006
http://www.gzzbw.cn/historydata/view/?id=115719
http://www.gzzbw.cn/historydata/view/?id=115643
http://www.gzzbw.cn/historydata/view/?id=114966
http://www.gzzbw.cn/historydata/view/?id=114965
http://www.gzzbw.cn/historydata/view/?id=115400
http://www.gzzbw.cn/historydata/view/?id=116031
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
2017-12-22
未知
興義市
安龍縣
未知
安龍縣
未知
安龍縣
安龍縣
未知
冊亨縣
下一頁
冊亨縣丫他鎮板其村埃近1～2組村莊綜合整治項目
冊亨縣者樓河納福新區河段生態治理項目（上游一標段）勘察招標公告
惠水縣撤并建制村通硬化路施工總承包中標候選人公示
冊亨縣丫他鎮板街村村莊綜合整治項目施工招標 招標公告
鎮寧自治縣農村環境整治工程項目（環翠街道辦事處）施工（三標段）(二次）（項目名稱）交易結果公示
丫他鎮生態移民附屬設施建設項目
劍河縣城市管理辦公室的劍河縣仰阿莎主題文化廣場護坡綠化工程中標公示
冊亨縣者樓河納福新區河段生態治理項目（上游一標段）施工圖設計
冊亨縣2017年巖架城市棚戶區改造項目配套基礎設施建設項目中標公示
數字甕安地理空間框架建設項目
>>>

Firefox jumps to page 2 successfully. If you now add a loop, you can jump through many pages and scrape each of them; see section III.
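The fixed time.sleep(5) after the click works, but it either wastes time or turns out too short on a slow connection. Because the URL never changes, one alternative is to poll until the first title on the page differs from the one seen before the click. The following is a minimal sketch of that idea under stated assumptions: wait_until and first_title_changed are hypothetical helper names, and the locator strategy is passed as the string "xpath" (the value of Selenium's By.XPATH). Selenium ships WebDriverWait for this kind of polling; the plain-Python helper is used here only so that the sketch runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


def first_title_changed(driver, old_text):
    """Build a condition that is true once the first title differs from old_text."""
    def condition():
        elements = driver.find_elements("xpath", "//div[@class='div_title text_view']")
        return bool(elements) and elements[0].text != old_text
    return condition


# Dry run without a browser: the condition becomes true on the third poll.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


result = wait_until(ready, timeout=2.0, poll=0.01)
print(result, calls["n"])
```

In the crawler this would be used as: remember the text of the first title, click "next page", then call wait_until(first_title_changed(driver, old_text)) in place of the sleep.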

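III. Scraping the Detail Pages with Selenium

The code above collected the detail-page hyperlink (URL) of every announcement row. I originally planned to scrape the detail pages with BeautifulSoup, but the requests were blocked. The detail page is shown in the figure below: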

So here the author defines a second Selenium Firefox driver to scrape the detail pages. The complete code is as follows:

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

print("start")
# Open two Firefox windows: driver walks the list pages, driver2 opens the detail pages
driver = webdriver.Firefox()
driver2 = webdriver.Firefox()

url = 'http://www.xxxx.com/'
print(url)
driver.get(url)

# Scrape the list page currently shown in driver, including each detail page
def crawl_current_page():
    # Project titles
    titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
    # Hyperlinks
    urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
    # Detail-page text, fetched with the second browser
    num = []
    for u in urls:
        href = u.get_attribute("href")
        driver2.get(href)
        con = driver2.find_element(By.XPATH, "//div[@class='col-xs-9']")
        num.append(con.text)
    # Publication dates
    times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
    # Locations
    places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")
    # Print all results, one complete record at a time
    print(len(num))
    i = 0
    while i < len(num):
        print(titles[i].text)
        print(urls[i].get_attribute("href"))
        print(times[i].text)
        print(places[i].text)
        print(num[i])
        print("")
        i = i + 1

# First page
crawl_current_page()

# Click "next page" five times and scrape each new page
j = 0
while j < 5:
    nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
    print(nextPage.text)
    nextPage.click()
    time.sleep(5)
    crawl_current_page()
    j = j + 1

Note that the author uses a while loop to print one complete tender record at a time:

print(len(num))
i = 0
while i < len(num):
    print(titles[i].text)
    print(urls[i].get_attribute("href"))
    print(times[i].text)
    print(places[i].text)
    print(num[i])
    print("")
    i = i + 1
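The index-based loop assumes that all five lists have the same length; if one XPath matches fewer nodes than the others, the loop stops with an IndexError partway through a page. As a hedged alternative, the lists can first be combined into one record per row. In the sketch below, build_records is a hypothetical helper and the sample values are made up; in the crawler the arguments would be the .text and href values already extracted from the elements.

```python
def build_records(titles, urls, times, places, details):
    """Combine five parallel lists into a list of dicts, one per announcement."""
    columns = (titles, urls, times, places, details)
    lengths = [len(column) for column in columns]
    if len(set(lengths)) != 1:
        # Fail early with a clear message instead of an IndexError mid-loop.
        raise ValueError("column lengths differ: %s" % lengths)
    keys = ("title", "url", "time", "place", "content")
    return [dict(zip(keys, row)) for row in zip(*columns)]


# Made-up sample data standing in for the scraped strings.
records = build_records(
    ["Notice A", "Notice B"],
    ["http://www.example.com/view/?id=1", "http://www.example.com/view/?id=2"],
    ["2017-12-22", "2017-12-21"],
    ["Unknown", "Sample County"],
    ["detail text A", "detail text B"],
)
for record in records:
    print(record["title"], record["url"], record["time"], record["place"])
    print(record["content"])
    print("")
```

Keeping each row in one dict also makes it straightforward to pass a record on to a storage function later.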

The output is shown in the figure below:

One complete record looks like this:

觀山湖區依法治國普法教育基地（施工）中標候選人公示
http://www.gzzbw.cn/historydata/view/?id=116163
2017-12-22
未知
觀山湖區依法治國普法教育基地（施工）中標候選人公示
來源: 貴州百利工程建設咨詢有限公司
發布時間: 2017-12-22
根據法律、法規、規章和招標文件的規定，觀山湖區司法局、貴陽觀山湖投資（集團）旅游文化產業發展有限公司（代建）的觀山湖區依法治國普法教育基地（施工）（項目編號：BLZB01201744）已于2017年 12月22日進行談判，根據談判小組出具的競爭性談判報告，現公示下列內容：
第一中標候選人：貴州鴻友誠建筑安裝有限公司
中標價：1930000.00（元）
工期：57日歷天
第二中標候選人：貴州隆瑞建設有限公司
中標價：1940000.00（元）
工期：60日歷天
第三中標候選人：鳳岡縣建筑工程有限責任總公司
中標價：1953285.00（元）
工期：60日歷天
中標結果公示至2017年12月25日。
招標 人：貴陽觀山湖投資（集團）有限公司
招標代理人：貴州百利工程建設咨詢有限公司
2017年12月22日

Finally, readers can use the MySQLdb library to store the scraped content locally. Also, if the content you want requires choosing a period first, for example the data for 2015, add a time.sleep(5) pause before the crawl starts, click 2015 or type a keyword by hand during the pause, and then let the crawl run. I also suggest using Selenium for the automatic page jumps and BeautifulSoup for the detail pages.

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import MySQLdb

# Store one record in the database
# Parameters: announcement title, URL, publication date, location, content
def DatabaseInfo(title, url, fbtime, fbplace, content):
    conn = None
    try:
        conn = MySQLdb.connect(host='localhost', user='root',
                               passwd='123456', port=3306, db='20180426ztb')
        cur = conn.cursor()  # database cursor
        # Avoids: UnicodeEncodeError: 'latin-1' codec can't encode character
        conn.set_character_set('utf8')
        cur.execute('SET NAMES utf8;')
        cur.execute('SET CHARACTER SET utf8;')
        cur.execute('SET character_set_connection=utf8;')
        # SQL statement
        sql = '''insert into ztb (title, url, fbtime, fbplace, content)
                 values(%s, %s, %s, %s, %s);'''
        cur.execute(sql, (title, url, fbtime, fbplace, content))
        conn.commit()
        cur.close()
        print('Database insert succeeded')
    # Exception handling
    except MySQLdb.Error as e:
        print("Mysql Error %d: %s" % (e.args[0], e.args[1]))
    finally:
        # Close the connection only if it was actually opened
        if conn is not None:
            conn.close()

print("start")
# Open two Firefox windows: driver walks the list pages, driver2 opens the detail pages
driver = webdriver.Firefox()
driver2 = webdriver.Firefox()

url = 'http://www.gzzbw.cn/historydata/'
print(url)
driver.get(url)

# Scrape the list page currently shown in driver and store every record
def crawl_current_page():
    # Project titles
    titles = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']")
    # Hyperlinks
    urls = driver.find_elements(By.XPATH, "//div[@class='div_title text_view']/a")
    # Detail-page text, fetched with the second browser
    num = []
    for u in urls:
        href = u.get_attribute("href")
        driver2.get(href)
        con = driver2.find_element(By.XPATH, "//div[@class='col-xs-9']")
        num.append(con.text)
    # Publication dates
    times = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[2]")
    # Locations
    places = driver.find_elements(By.XPATH, "//table[@class='table table-hover']/tbody/tr/td[3]")
    # Print all results and write each record to the database
    print(len(num))
    i = 0
    while i < len(num):
        print(titles[i].text)
        print(urls[i].get_attribute("href"))
        print(times[i].text)
        print(places[i].text)
        print(num[i])
        print("")
        DatabaseInfo(titles[i].text, urls[i].get_attribute("href"),
                     times[i].text, places[i].text, num[i])
        i = i + 1

# First page
crawl_current_page()

# Click "next page" 100 times and scrape each new page
j = 0
while j < 100:
    nextPage = driver.find_element(By.XPATH, "//ul[@class='pagination']/li[9]/a")
    print(nextPage.text)
    nextPage.click()
    time.sleep(5)
    crawl_current_page()
    print("Pages scraped:", (j + 2))
    j = j + 1

The scraped records end up in the ztb table of the database.
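For readers who want to try the storage step without a MySQL server, the same idea can be sketched with the standard-library sqlite3 module. This is a swapped-in technique, not the author's MySQLdb code: the table reuses the column names of ztb, while open_store, save_record, the in-memory database, and the primary key on url (added here so that re-running the crawl does not duplicate rows) are assumptions made for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the ztb table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ztb (
               title   TEXT,
               url     TEXT PRIMARY KEY,
               fbtime  TEXT,
               fbplace TEXT,
               content TEXT
           )"""
    )
    return conn


def save_record(conn, title, url, fbtime, fbplace, content):
    """Insert one row; return True if stored, False if the URL was already present."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR IGNORE INTO ztb (title, url, fbtime, fbplace, content) "
            "VALUES (?, ?, ?, ?, ?)",
            (title, url, fbtime, fbplace, content),
        )
    return conn.total_changes > before


# Made-up sample row; in the crawler these values come from the scraped elements.
conn = open_store()
row = ("Notice A", "http://www.example.com/view/?id=1", "2017-12-22", "Unknown", "detail text")
first = save_record(conn, *row)
again = save_record(conn, *row)
count = conn.execute("SELECT COUNT(*) FROM ztb").fetchone()[0]
print(first, again, count)
```

As in the MySQLdb version, the values are passed as query parameters rather than formatted into the SQL string, which keeps quotes inside announcement text from breaking the statement.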

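Finally, I hope this article helps you, especially if you need to scrape a partially refreshing site.
If there are errors or shortcomings in the article, please bear with me~

(By: Eastmount, 2018-04-26, 11:30 a.m., http://blog.csdn.net/eastmount/)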
