Python crawler: downloading the images in Zhihu answers

Published: 2025/3/15 python 29 豆豆

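1. First, install a Python environment on your computer

I provide code for both Python 2.7 and 3.6, but this article uses only Python 3.6 as the example.

Once installation is finished, open your terminal and run the following commands:

Set up a Python 3.6 environment dedicated to the crawler. If you already have Python 3.6 installed, skip this step and go straight to installing the required Python libraries.

Check the Python version

python --version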
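The same check can be made from inside a script, so the crawler stops early on an interpreter that is too old. This is only a minimal sketch: the helper name `is_supported` and the 3.6 floor (the version this article targets) are illustrative assumptions, not part of the original code.

```python
import sys

MIN_VERSION = (3, 6)  # assumption: the version this article targets


def is_supported(version_info=None):
    """Return True if the given (or running) interpreter is at least MIN_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    # Print the running version and whether it meets the floor.
    print(sys.version.split()[0], "supported:", is_supported())
```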

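2. selenium

The Zhihu front end is built with React, so page content appears step by step as the user scrolls and clicks. To collect a large number of images, we use the selenium library to simulate the user's scrolling and clicking in the browser.

Install selenium with pip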

pip install -U selenium

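Once it is installed, I suggest opening the link above and reading how selenium is used. In short, to run selenium we need to install a Chrome driver. After downloading it, Mac users can copy it straight to /usr/bin or /usr/local/bin, or put it somewhere else and add that location to the path. The same applies to Windows users.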

chromedriver and Chrome versions and download links

Chrome: download link

Firefox: download link

Safari: download link
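Before starting the browser, it can help to confirm that the driver really is reachable through the path. The sketch below uses only the standard library; the helper name `find_driver` is hypothetical, and `chromedriver` is assumed to be the executable name of the downloaded driver.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver not found on PATH")
    else:
        print("chromedriver found at", path)
```

If this prints that the driver was not found, copy the driver into one of the directories mentioned above or add its folder to the path.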

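When crawling, we often find that pages have been minified, with indentation and whitespace stripped, so the page structure is hard to read. This is where the BeautifulSoup library comes in: we use it to give the HTML file a readable structure.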

pip install beautifulsoup4

Section 2 – Code walkthrough

from selenium import webdriver
import time
import urllib.request
from bs4 import BeautifulSoup
import html.parser

2.1 Decide on the target URL

Inside the main function, open the Chrome driver and then load the url

def main():
    # ********* Open chrome driver and type the website that you want to view ***********************
    driver = webdriver.Chrome()  # open the browser
    # The page whose images you want to download
    driver.get("https://www.zhihu.com/question/35242408")  # "What is it like to own a rich collection of memes?"

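2.2 Simulate scrolling and clicking

Inside main we define a function that runs repeatedly to do the scrolling and clicking. First, driver.execute_script performs the scroll. On inspection, the bottom of a Zhihu question page has a "view more answers" button, so we select it with driver.find_element and a CSS selector (older selenium releases also offered driver.find_element_by_css_selector) and click it. Here we only crawl five pages of images. In practice, five pages, roughly 100 answers, often already yield around 1000 images.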

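    # ****************** Scroll to the bottom, and click the "view more" button *********
    def execute_times(times):
        for i in range(times):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
            time.sleep(2)  # wait for the page to load
            try:
                # find_element(by, value) works in selenium 3 and 4; the
                # find_element_by_* helpers were removed in selenium 4.3.
                # "css selector" is the value of By.CSS_SELECTOR.
                driver.find_element("css selector", "button.QuestionMainAction").click()  # click "load more" at the bottom
                print("page" + str(i))  # print the page number
                time.sleep(1)  # wait for the page to load
            except Exception:
                break  # no button found (or click failed): stop early

    execute_times(5)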
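A fixed number of rounds can waste time once the page has nothing more to load. As an alternative to the fixed count used above (not the author's method), the sketch below keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; it assumes only that `driver` has an `execute_script` method, as selenium's WebDriver does.

```python
import time


def scroll_until_stable(driver, max_rounds=5, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more answers
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # height unchanged: nothing new was loaded
        last_height = new_height
        loaded += 1
    return loaded
```

It would be called as `scroll_until_stable(driver, max_rounds=5)` in place of `execute_times(5)`. Note that it does not click the "view more answers" button, so it only fits pages that load on scroll alone.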

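2.3 Structure the HTML page and save it

Each time we crawl a page, the first thing to do is store the page's HTML. To make it easy to read by eye, we use BeautifulSoup to turn the minified HTML into a structured file and save it.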

    # **************** Prettify the html file and store raw data file *****************************************
    result_raw = driver.page_source  # the original page HTML
    result_soup = BeautifulSoup(result_raw, 'html.parser')
    result_bf = result_soup.prettify()  # structured version of the original HTML

    # The folders in the output path must be created beforehand.
    # The with block closes the file, so no explicit close() is needed.
    with open("./output/rawfile/raw_result.txt", 'w', encoding='utf-8') as girls:
        girls.write(result_bf)
    print("Store raw data successfully!!!")
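Since open() fails when the target folders are missing, the folders can also be created from the script itself. A minimal sketch, assuming the ./output/rawfile and ./output/image layout used in this walkthrough; the helper name `ensure_output_dirs` is hypothetical.

```python
import os


def ensure_output_dirs(base="./output"):
    """Create the rawfile/ and image/ folders under `base` if they are missing."""
    paths = [os.path.join(base, "rawfile"), os.path.join(base, "image")]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return paths
```

Calling `ensure_output_dirs()` once at the start of main would make the later open() and download calls safe on a fresh checkout.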

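2.4 Scrape the <noscript> nodes in the Zhihu answers

Before crawling any page, the first thing to do is to observe its structure and tailor the approach to it. A page usually contains many images: this Zhihu page, for example, has plenty of user avatars as well as the images inserted into answers. We do not want the avatars, only the pictures inside the answers, so we cannot simply grab every <img> on the page. A closer look shows that the answer images sit inside <noscript> nodes and that their markup is escaped (encoded as HTML entities), so it has to be decoded with html.unescape (the original code reached the same function as html.parser.unescape).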

    # **************** Find all <noscript> nodes and store them *****************************************
    # The folders in the output path must be created beforehand.
    with open("./output/rawfile/noscript_meta.txt", 'w', encoding='utf-8') as noscript_meta:
        noscript_nodes = result_soup.find_all('noscript')  # find all <noscript> nodes
        noscript_inner_all = ""
        for noscript in noscript_nodes:
            noscript_inner = noscript.get_text()  # inner content of the <noscript> node
            noscript_inner_all += noscript_inner + "\n"
        # html.unescape is the documented function; 'import html.parser' also makes it available.
        noscript_all = html.unescape(noscript_inner_all)  # decode the inner content, then store it
        noscript_meta.write(noscript_all)
    print("Store noscript meta data successfully!!!")
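To see what the decoding step does in isolation, the sketch below unescapes a small piece of escaped markup and pulls out the img sources. It uses the standard library's HTMLParser as a stand-in for BeautifulSoup so that it runs without extra packages. The class and function names are hypothetical, and the sample URL is made up.

```python
import html
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def extract_img_sources(escaped_markup):
    """Unescape HTML entities, then return the img src values found."""
    collector = ImgSrcCollector()
    collector.feed(html.unescape(escaped_markup))
    collector.close()
    return collector.sources


sample = '&lt;img src="https://example.com/a.jpg"&gt;'  # hypothetical URL
print(html.unescape(sample))
print(extract_img_sources(sample))
```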

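2.5 Download the images

With all the img nodes in hand, downloading the images is easy: urllib.request.urlretrieve takes care of each one. I also did a little cleanup here, storing every url separately and labelling it with a sequence number. You can skip that step and download directly if you prefer.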

    # **************** Store meta data of imgs *****************************************
    img_soup = BeautifulSoup(noscript_all, 'html.parser')
    img_nodes = img_soup.find_all('img')
    with open("./output/rawfile/img_meta.txt", 'w', encoding='utf-8') as img_meta:
        count = 0
        for img in img_nodes:
            if img.get('src') is not None:
                img_url = img.get('src')
                line = str(count) + "\t" + img_url + "\n"
                img_meta.write(line)  # record "<index>\t<url>"
                urllib.request.urlretrieve(img_url, "./output/image/" + str(count) + ".jpg")  # download the images one by one
                count += 1
    print("Store meta data and images successfully!!!")
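The loop above saves every file with a .jpg extension regardless of what the url points to. One possible refinement, sketched here under assumptions, is to take the extension from the url and to skip sources that are not plain http(s) links. The helper names `image_filename` and `is_downloadable` and the example urls are hypothetical.

```python
import os
from urllib.parse import urlparse


def image_filename(count, img_url, default_ext=".jpg"):
    """Build '<count><ext>', taking the extension from the url path when it has one."""
    path = urlparse(img_url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    if not ext:
        ext = default_ext
    return str(count) + ext


def is_downloadable(img_url):
    """Return True only for http and https urls."""
    return urlparse(img_url).scheme in ("http", "https")
```

Inside the loop, the target path would then become `"./output/image/" + image_filename(count, img_url)`, guarded by `if is_downloadable(img_url):`.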

And don't forget the final part:

if __name__ == '__main__':
    main()

The full source code follows

from selenium import webdriver
import time
import urllib.request
from bs4 import BeautifulSoup
import html.parser


def main():
    # ********* Open chrome driver and type the website that you want to view ***********************
    driver = webdriver.Chrome()  # open the browser
    # The page whose images you want to download
    driver.get("https://www.zhihu.com/question/35242408")  # "What is it like to own a rich collection of memes?"

    # ****************** Scroll to the bottom, and click the "view more" button *********
    def execute_times(times):
        for i in range(times):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            try:
                # find_element(by, value) works in selenium 3 and 4; the
                # find_element_by_* helpers were removed in selenium 4.3.
                driver.find_element("css selector", "button.QuestionMainAction").click()
                print("page" + str(i))
                time.sleep(1)
            except Exception:
                break

    execute_times(7)

    # **************** Prettify the html file and store raw data file *****************************************
    result_raw = driver.page_source  # the original page HTML
    result_soup = BeautifulSoup(result_raw, 'html.parser')
    result_bf = result_soup.prettify()  # structured version of the original HTML
    # The folders in the output path must be created beforehand.
    with open("D:/newworld/output/rawfile/raw_result.txt", 'w', encoding='utf-8') as girls:
        girls.write(result_bf)
    print("Store raw data successfully!!!")

    # **************** Find all <noscript> nodes and store them *****************************************
    with open("D:/newworld/output/rawfile/noscript_meta.txt", 'w', encoding='utf-8') as noscript_meta:
        noscript_nodes = result_soup.find_all('noscript')  # find all <noscript> nodes
        noscript_inner_all = ""
        for noscript in noscript_nodes:
            noscript_inner = noscript.get_text()  # inner content of the <noscript> node
            noscript_inner_all += noscript_inner + "\n"
        noscript_all = html.unescape(noscript_inner_all)  # decode the inner content, then store it
        noscript_meta.write(noscript_all)
    print("Store noscript meta data successfully!!!")

    # **************** Store meta data of imgs *****************************************
    img_soup = BeautifulSoup(noscript_all, 'html.parser')
    img_nodes = img_soup.find_all('img')
    with open("D:/newworld/output/rawfile/img_meta.txt", 'w', encoding='utf-8') as img_meta:
        count = 0
        for img in img_nodes:
            if img.get('src') is not None:
                img_url = img.get('src')
                line = str(count) + "\t" + img_url + "\n"
                img_meta.write(line)
                urllib.request.urlretrieve(img_url, "D:/newworld/output/zhihumeinv/" + str(count) + ".jpg")  # download the images one by one
                count += 1
    print("Store meta data and images successfully!!!")


if __name__ == '__main__':
    main()
