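[Python Web Scraping in Practice] Batch-Downloading Videos from a Website
Preface
I have been learning vue.js lately and found a site with a lot of video tutorials, but its online player does not support changing the playback speed, so I decided to batch-download the videos to my machine with a Python scraper.
Installing Dependencies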
pip3 install requests
Test Case
Including the preface, there are 16 videos in total. We will batch-download them to local disk with a Python scraper.
https://learning.dcloud.io/#/?vid=0
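Getting the Direct Links
First we need the direct download link for each video. Right-click the page and choose Inspect, and the video's direct link is visible right in the element tree.
Now look at the page source: the direct link is gone, and the place where it appeared holds a JS script instead.
If we request the URL directly with the requests library, we get that page source, which contains no direct link, so we need a different approach. Why was the spot where the link should be replaced by JS?
Once you have written enough scrapers you will recognize this as dynamic loading: the direct links are stored in a JS file, and each time the page loads, a script injects them into the HTML.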
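A quick way to confirm that a page is loaded dynamically is to search the raw response text for video links. The following is a minimal sketch; the function name `find_mp4_links` and both sample strings are hypothetical stand-ins, not content taken from the real site.

```python
import re

# Matches absolute http(s) URLs ending in .mp4; a deliberately simple pattern.
MP4_URL = re.compile(r'https?://[^\s"\'<>]+?\.mp4')


def find_mp4_links(html):
    """Return every .mp4 URL that appears literally in the given text."""
    return MP4_URL.findall(html)


# Hypothetical static source: the player is injected by a script, so no link appears.
static_html = '<div id="app"></div><script src="/static/js/app.js"></script>'
# Hypothetical rendered DOM, as seen in the browser's inspector.
rendered_html = '<video src="https://example.com/videos/intro.mp4"></video>'

print(find_mp4_links(static_html))    # []
print(find_mp4_links(rendered_html))  # ['https://example.com/videos/intro.mp4']
```

If the text returned by `requests.get(url).text` yields an empty list while the inspector shows a link, the link is most likely being added by a script.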
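Open the Network tab and filter by JS, which shows 3 JS files. Start by checking whether the first one contains the direct links. Searching for a video title finds its direct link straight away, and it turns out that all the direct links are stored in a variable named lesson_list.
lesson_list holds the name and direct link of every video. For consistency, the preface is renamed to lesson 0 (第0節) here.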
# lesson_list.py
lesson_list = [{
"name": "第0節 vue.js介紹",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52d32740-aecd-11ea-b244-a9f5e5565f30.mp4",
"ask": "77367"
}, {
"name": "第1節 安裝與部署",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52dd6070-aecd-11ea-b43d-2358b31b6ce6.mp4",
"ask": "77369"
}, {
"name": "第2節 創建第一個vue應用",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52f3cea0-aecd-11ea-b997-9918a5dda011.mp4",
"ask": "77370"
}, {
"name": "第3節 數據與方法",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52eec590-aecd-11ea-b244-a9f5e5565f30.mp4",
"ask": "77372"
}, {
"name": "第4節 生命周期",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52e63a10-aecd-11ea-b43d-2358b31b6ce6.mp4",
"ask": "77373"
}, {
"name": "第5節 模板語法-插值",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/52e72470-aecd-11ea-b997-9918a5dda011.mp4",
"ask": "77375"
}, {
"name": "第6節 模板語法-指令",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/98c18710-aecd-11ea-b43d-2358b31b6ce6.mp4",
"ask": "77376"
}, {
"name": "第7節 class與style綁定",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/4fe81fd0-aece-11ea-b997-9918a5dda011.mp4",
"ask": "77377"
}, {
"name": "第8節 條件渲染",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/98bad050-aecd-11ea-b680-7980c8a877b8.mp4",
"ask": "77378"
}, {
"name": "第9節 列表渲染",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/5da98c30-aece-11ea-b244-a9f5e5565f30.mp4",
"ask": "77380"
}, {
"name": "第10節 事件綁定",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/98bd6860-aecd-11ea-8bd0-2998ac5bbf7e.mp4",
"ask": "77381"
}, {
"name": "第11節 表單輸入綁定",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/656e12b0-aece-11ea-a30b-e311646dfaf2.mp4",
"ask": "77382"
}, {
"name": "第12節 組件基礎",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/98a06a80-aecd-11ea-8bd0-2998ac5bbf7e.mp4",
"ask": "77383"
}, {
"name": "第13節 組件注冊",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/98ed7910-aecd-11ea-b997-9918a5dda011.mp4",
"ask": "78520"
}, {
"name": "第14節 單文件組件",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/79db90b0-aece-11ea-8a36-ebb87efcf8c0.mp4",
"ask": "78521"
}, {
"name": "第15節 免終端開發vue應用",
"url": "https://vkceyugu.cdn.bspapp.com/VKCEYUGU-learning-vue/7e3b8f70-aece-11ea-8ff1-d5dcf8779628.mp4",
"ask": "81004"
}]
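The list above can simply be copied out of the JS file by hand. As an alternative, here is a rough sketch of extracting it programmatically. It assumes the array literal in the JS source happens to be valid JSON (double-quoted keys, no nested arrays), which may not hold for minified code; `extract_lesson_list` and the sample string with its example.com URLs are hypothetical.

```python
import json
import re


def extract_lesson_list(js_text):
    """Find `lesson_list = [...]` (or `lesson_list: [...]`) and parse it as JSON."""
    # Lazy match up to the first closing bracket; assumes no nested arrays.
    match = re.search(r'lesson_list\s*[:=]\s*(\[.*?\])', js_text, re.S)
    if match is None:
        raise ValueError('lesson_list not found')
    return json.loads(match.group(1))


# Hypothetical fragment standing in for the real JS file.
sample_js = (
    'var a=1;var lesson_list=['
    '{"name":"第0節 vue.js介紹","url":"https://example.com/0.mp4","ask":"77367"},'
    '{"name":"第1節 安裝與部署","url":"https://example.com/1.mp4","ask":"77369"}'
    '];'
)

lessons = extract_lesson_list(sample_js)
print(len(lessons))       # 2
print(lessons[0]['url'])  # https://example.com/0.mp4
```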
Batch Download
Here a for loop walks through every download link and hands each one to a multithreaded downloader I wrote earlier.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
import time

from requests import get, head

from lesson_list import lesson_list


class downloader:
    def __init__(self, url, num, name):
        self.url = url
        self.num = num        # number of worker threads
        self.name = name      # output file name
        self.getsize = 0      # bytes written so far, shared by all threads
        self.lock = Lock()    # guards getsize
        # A HEAD request returns the file size without downloading the body.
        r = head(self.url, allow_redirects=True)
        self.size = int(r.headers['Content-Length'])

    def down(self, start, end, chunk_size=10240):
        # The Range header is inclusive on both ends: bytes=start-end.
        headers = {'Range': f'bytes={start}-{end}'}
        r = get(self.url, headers=headers, stream=True)
        r.raise_for_status()
        with open(self.name, 'rb+') as f:
            f.seek(start)
            for chunk in r.iter_content(chunk_size):
                f.write(chunk)
                # Count the bytes actually received; the last chunk may be short.
                with self.lock:
                    self.getsize += len(chunk)

    def main(self):
        start_time = time.time()
        # Pre-allocate the file so each thread can seek to its own offset.
        with open(self.name, 'wb') as f:
            f.truncate(self.size)
        tp = ThreadPoolExecutor(max_workers=self.num)
        futures = []
        start = 0
        for i in range(self.num):
            # Integer arithmetic; the last range ends at size - 1.
            end = (i + 1) * self.size // self.num - 1
            futures.append(tp.submit(self.down, start, end))
            start = end + 1
        # Report progress once per second until every worker has finished.
        while not all(future.done() for future in futures):
            last = self.getsize
            time.sleep(1)
            curr = self.getsize
            process = curr / self.size * 100
            down = (curr - last) / 1024
            if down > 1024:
                speed = f'{down/1024:6.2f}MB/s'
            else:
                speed = f'{down:6.2f}KB/s'
            # '\r' returns to the start of the line so the status is overwritten.
            print(f'\rprocess: {process:6.2f}% | speed: {speed}', end='')
        tp.shutdown()
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
        print(f'\rprocess: {100.00:6}% | speed: 00.00KB/s', end=' | ')
        end_time = time.time()
        total_time = end_time - start_time
        average_speed = self.size / total_time / 1024 / 1024
        print(f'total-time: {total_time:.0f}s | average-speed: {average_speed:.2f}MB/s')


if __name__ == '__main__':
    for lesson in lesson_list:
        url = lesson['url']
        name = lesson['name']
        down = downloader(url, 8, name + '.mp4')
        down.main()
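The downloader gives each thread its own byte range of the file. Because the HTTP Range header is inclusive on both ends, the ranges should tile 0 to size - 1 exactly, with no gaps or overlaps. The splitting step in isolation might look like the sketch below; `split_ranges` is a hypothetical helper name, and the sketch assumes the file has at least as many bytes as there are parts.

```python
def split_ranges(size, parts):
    """Split bytes 0..size-1 into `parts` contiguous, inclusive (start, end) ranges."""
    ranges = []
    start = 0
    for i in range(parts):
        # Integer arithmetic avoids float rounding; the last range ends at size - 1.
        end = (i + 1) * size // parts - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(split_ranges(100, 4))  # [(0, 24), (25, 49), (50, 74), (75, 99)]
print(split_ranges(10, 3))   # [(0, 2), (3, 5), (6, 9)]

# Each pair maps directly onto a request header.
for start, end in split_ranges(10, 3):
    print({'Range': f'bytes={start}-{end}'})
```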
Printed Output
16 videos, 339 MB in total, finished downloading in 56 s.
process: 100.0% | speed: 00.00KB/s | total-time: 2s | average-speed: 2.47MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 6.62MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 3.72MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 7.72MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 5.85MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 7s | average-speed: 7.01MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 4.65MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 6.69MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 5.88MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 5.01MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 6.60MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 6.20MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 5.96MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 2s | average-speed: 4.64MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 3s | average-speed: 6.02MB/s
process: 100.0% | speed: 00.00KB/s | total-time: 4s | average-speed: 6.80MB/s
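The totals quoted above can be cross-checked from the output lines themselves. This is only a rough check: each total-time is rounded to a whole second, so the size estimate will not match the real figure exactly. The `runs` list is copied from the 16 lines above.

```python
# (total-time in seconds, average-speed in MB/s) for each of the 16 output lines.
runs = [
    (2, 2.47), (3, 6.62), (3, 3.72), (4, 7.72),
    (4, 5.85), (7, 7.01), (3, 4.65), (4, 6.69),
    (3, 5.88), (4, 5.01), (3, 6.60), (4, 6.20),
    (3, 5.96), (2, 4.64), (3, 6.02), (4, 6.80),
]

total_time = sum(t for t, _ in runs)
# Size is approximately time * average speed, summed over all files.
estimated_mb = sum(t * s for t, s in runs)

print(total_time)           # 56
print(round(estimated_mb))  # 335
```

The summed time matches the 56 s figure, and the estimate of about 335 MB is close to the stated 339 MB given the rounding.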
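Summary and Outlook
Sometimes the direct links to videos or images do not need to be scraped at all: they may be sitting in one of the JS files the page loads. If they can be found directly, why bother scraping? For the download itself, multithreading is well worth using, because several parallel connections can often saturate the available bandwidth and reach full download speed.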
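Multithreaded downloading only works when the server honors byte-range requests. One possible pre-check, sketched below, inspects the headers of a HEAD response. `can_split` is a hypothetical helper, and it is only a heuristic: some servers support ranges without advertising `Accept-Ranges`.

```python
def can_split(headers):
    """Heuristic: True if the response headers suggest range requests will work."""
    # Header names are case-insensitive; normalise so a plain dict works too.
    lowered = {k.lower(): v for k, v in headers.items()}
    accepts = lowered.get('accept-ranges', '').strip().lower() == 'bytes'
    has_length = lowered.get('content-length', '').strip().isdigit()
    return accepts and has_length


# Hypothetical header sets for illustration.
print(can_split({'Accept-Ranges': 'bytes', 'Content-Length': '1048576'}))  # True
print(can_split({'Accept-Ranges': 'none', 'Content-Length': '1048576'}))   # False
print(can_split({'Content-Type': 'video/mp4'}))                            # False
```

With requests, the argument could be `head(url, allow_redirects=True).headers`. When the check fails, falling back to a single-threaded download is the safer choice.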
References
https://blog.csdn.net/qq_42951560/article/details/108785802