Scraping Baidu Wenku Content (Selenium + BeautifulSoup)

Published: 2023/12/18

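Extracting Baidu Wenku content:

1. Opening the page with Selenium
2. Simulating a click on "Continue reading"
3. Simulating a click on "Load more"
4. Parsing the page content
5. Full code
6. References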

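1. Opening the page with Selenium

To drive a browser with Selenium, first install selenium and download the driver that matches your browser. For details, see: "Using selenium to call QQ Browser (based on the Chrome browser)". I use QQ Browser here together with the Chrome driver; QQ Browser is essentially not much different from Chrome.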

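# coding: utf-8
# Note: this is the Selenium 3 API. Selenium 4 removed the positional driver
# path and the find_element_by_* helpers used later in this article.
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
# Pose as an iPhone so that the mobile page is served; its HTML exposes the tags we want to scrape
options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3')
options.binary_location = r'D:\QQ游覽器\QQBrowser\QQBrowser.exe'
driver = webdriver.Chrome(r'D:\anaconda_64anzhuang\chromedriver.exe', options=options)
driver.get('https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html')  # load the page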
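The "pose as an iPhone" step works because the User-Agent is just an HTTP request header. As a minimal illustration, the sketch below builds (but never sends) a request carrying the same header using the standard library's urllib. This is not how the article fetches the page: a plain HTTP fetch would not run the page's JavaScript, which is why Selenium is used. The names MOBILE_UA and URL are made up for this example.

```python
from urllib.request import Request

# Same mobile User-Agent string the article passes to the browser
MOBILE_UA = ('Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) '
             'AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 '
             'Mobile/9A334 Safari/7534.48.3')
URL = 'https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html'

# Build a request object with the header set; nothing is sent over the network
request = Request(URL, headers={'User-Agent': MOBILE_UA})

# urllib stores header names in capitalized form, hence 'User-agent'
print(request.get_header('User-agent'))
```

With Selenium the header is set once on the browser through the --user-agent argument, so every request the page makes carries it.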

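2. Simulating a click on "Continue reading"

The document is too long to be shown all at once and the page is rendered dynamically, so the remaining pages have to be loaded by simulating mouse clicks. For the usage of find_element_by_xpath(), see: "Selenium 6: Several ways to use find_element_by_xpath()".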

# Scroll down to where "Continue reading" appears: select the div elements whose class is foldpagewg-root
foldpagewg = driver.find_elements_by_xpath("//div[@class='foldpagewg-root']")
# https://blog.csdn.net/u012941152/article/details/83011110 foldpagewg-text-con
driver.execute_script('arguments[0].scrollIntoView();', foldpagewg[-1])
# Simulate a click on "Continue reading"
continue_read = driver.find_element_by_xpath("//div[@class='foldpagewg-icon']")
continue_read.click()

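3. Simulating a click on "Load more"

The page has a lot of content, so after clicking "Continue reading" we also have to simulate a click on "Load more". At this step you may be able to locate the element with find_elements_by_xpath() and still be unable to click() it; see: "Extending click in selenium".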

time.sleep(5)  # wait 5 seconds; otherwise the page may not have finished loading after the previous click
pagerwg = driver.find_elements_by_xpath("//div[@class='pagerwg-root']")
# Scroll the element into view
driver.execute_script('arguments[0].scrollIntoView();', pagerwg[-1])
# Simulate a click on "Load more"
more_text = driver.find_element_by_xpath("//span[@class='pagerwg-arrow-lower']")
# more_text.click() does not work here; I am not sure why.
driver.execute_script("arguments[0].click()", more_text)
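The fixed time.sleep(5) above either waits longer than needed or not long enough. One alternative is to poll until the element shows up. The sketch below is a generic, standard-library-only polling helper, not part of the original article; wait_for and fake_lookup are hypothetical names, and the fake lookup stands in for a real driver call.

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Stand-in for a page whose element only appears on the third lookup
attempts = []

def fake_lookup():
    attempts.append(1)
    return ["element"] if len(attempts) >= 3 else []

found = wait_for(fake_lookup, timeout=2.0, poll=0.01)
print(found, len(attempts))
```

With the driver from this article, the condition could be a lambda that calls driver.find_elements_by_xpath("//div[@class='pagerwg-root']"), since that call returns an empty list until the element exists. Selenium also ships its own explicit-wait class, WebDriverWait, which serves the same purpose.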

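4. Parsing the page content

The page content is parsed with BeautifulSoup and re. The approach is to find the p tags in the page first and then extract the content of the span tags inside them. Note that the text is extracted via .string, which can be None; that case gets only simple handling here. For other cases, see: "Fixing None when extracting tag content with BeautifulSoup's string attribute".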

from bs4 import BeautifulSoup
import re

html = driver.page_source  # page source after the clicks above
soup1 = BeautifulSoup(html, 'lxml')
# Raw string so that \d reaches the regex engine unchanged
result_p = soup1.find_all("p", class_=re.compile(r"rtcscls\d*_r_\d* rtcscls\d*_p_\d* fszl_show"))
# Open the output file once instead of once per span
with open("6.txt", "a+", encoding="utf-8") as file:
    for p in result_p:
        for span in p.find_all("span"):
            # .string can be None; this simple check may miss some content
            if span.string is not None:
                file.write(span.string)
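BeautifulSoup's .string returns None when a tag has more than one child, for example a span that mixes plain text with a nested tag, which is why the simple check above can skip content. To show what collecting all of the text involves, here is a dependency-free sketch that uses Python's built-in html.parser instead of BeautifulSoup. SpanTextCollector, span_text and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser

class SpanTextCollector(HTMLParser):
    """Collect every piece of text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.chunks = []  # text fragments seen inside spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one span, nested tags included
        if self.depth > 0:
            self.chunks.append(data)

def span_text(html):
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# The first span has three children (text, <b>, text), so .string would be None for it
sample = ('<p class="rtcscls1_r_1 rtcscls1_p_1 fszl_show">'
          '<span>Hello, <b>Baidu</b> Wenku</span><span>!</span></p>')
print(span_text(sample))
```

If you stay with BeautifulSoup, its get_text() method gathers nested text in a similar way and can replace .string where nothing should be skipped.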

5. Full code

# coding: utf-8
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

options = webdriver.ChromeOptions()
# Pose as an iPhone
options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3')
options.binary_location = r'D:\QQ游覽器\QQBrowser\QQBrowser.exe'
driver = webdriver.Chrome(r'D:\anaconda_64anzhuang\chromedriver.exe', options=options)
driver.get('https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html')  # load the page

foldpagewg = driver.find_elements_by_xpath("//div[@class='foldpagewg-root']")
driver.execute_script('arguments[0].scrollIntoView();', foldpagewg[-1])
# Simulate a click on "Continue reading"
continue_read = driver.find_element_by_xpath("//div[@class='foldpagewg-icon']")
continue_read.click()

time.sleep(5)  # wait 5 seconds; otherwise the page may not have finished loading after the previous click
pagerwg = driver.find_elements_by_xpath("//div[@class='pagerwg-root']")
# Scroll the element into view
driver.execute_script('arguments[0].scrollIntoView();', pagerwg[-1])
# Simulate a click on "Load more"
more_text = driver.find_element_by_xpath("//span[@class='pagerwg-arrow-lower']")
# more_text.click() does not work here; I am not sure why.
driver.execute_script("arguments[0].click()", more_text)

html = driver.page_source  # page source after the clicks above
soup1 = BeautifulSoup(html, 'lxml')
result_p = soup1.find_all("p", class_=re.compile(r"rtcscls\d*_r_\d* rtcscls\d*_p_\d* fszl_show"))
with open("6.txt", "a+", encoding="utf-8") as file:
    for p in result_p:
        for span in p.find_all("span"):
            if span.string is not None:
                file.write(span.string)

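6. References

This article draws mainly on the following articles:
Scraping Baidu Wenku content with Python (1)
Scraping Baidu Wenku content with Python (2): automatically clicking "preview full text" and scraping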
