Scraping TED Talk Videos with Python (Code)
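Environment: Windows + Python 3.6 + PyCharm (optional)
Python libraries/modules used: requests, bs4, os, random, you-get
Prerequisites: basic use of requests, BeautifulSoup's find_all(), os.system("cmd command"), you-get
Scraping steps:
1. I routinely use an IP proxy pool for crawlers. Some sites have no anti-scraping measures, but using a proxy does no harm. Wrapping the proxy pool in its own module makes it callable from any script.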
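As a rough illustration of what such a module has to hand back: requests accepts a proxies dictionary that maps a URL scheme to a proxy URL. The sketch below is not the author's code. `to_proxies` is a hypothetical helper, and the addresses are placeholder values from a documentation range, so nothing here contacts a real proxy.

```python
import random


def to_proxies(ip_ports, rng=random):
    """Pick one 'ip:port' entry and build a requests-style proxies dict.

    Hypothetical helper for illustration; not part of requests or of the
    original post.
    """
    if not ip_ports:
        raise ValueError("proxy pool is empty")
    proxy_url = 'http://' + rng.choice(ip_ports)
    # requests selects the proxy by the scheme of the target URL, so both
    # keys are set here to cover http:// and https:// pages.
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder addresses (TEST-NET-1 documentation range), not real proxies.
pool = ['192.0.2.10:8080', '192.0.2.11:3128']
proxies = to_proxies(pool)
print(proxies)
```

The resulting dictionary would then be passed along with a request, for example requests.get(url, proxies=proxies).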
Here is the code: get_ip.py
import random

import requests
from bs4 import BeautifulSoup

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}


def get_ip_list():
    # Scrape a large batch of IPs directly from proxy site 1
    url = 'http://www.xicidaili.com/nn/'  # IP proxy site
    response = requests.get(url, headers=head).text
    bs = BeautifulSoup(response, 'html.parser')
    ips = bs.find_all('tr')
    ip_list = []
    # Skip the header row, then read the IP and port columns of each row
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list


def get_random_ip():
    # Pick one random IP address from the pool for the caller
    ips_list = get_ip_list()
    proxy_list = []
    for ip in ips_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies

2. Now let's scrape TED
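(1) Analyze the TED site. I'll simply state the URL pattern here:
    TED home page: https://www.ted.com/
    TED talk list page: https://www.ted.com/talks?page=1, where the trailing page=1 means the first page of the list, page=2 the second, and so on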
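The page pattern above can be sketched as a small generator of list-page URLs. `page_urls` is a hypothetical name, and the sketch assumes only what is stated above: the list lives at https://www.ted.com/talks and the page number goes in the page query parameter.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ted.com/talks'


def page_urls(count, base=BASE_URL):
    """Return the URLs of list pages 1..count (hypothetical helper)."""
    return [base + '?' + urlencode({'page': i}) for i in range(1, count + 1)]


for url in page_urls(3):
    print(url)
```

Building the query string with urlencode avoids splitting the URL on '=' by hand, which would misbehave if the URL ever carried a second parameter.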
(2) Here is the code: get_TED.py
import os

import requests
from bs4 import BeautifulSoup

from get_ip import get_random_ip

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}
# Fetch one random proxy up front. get_random_ip() only sets the 'http' key,
# so requests applies it to http:// URLs only.
proxies = get_random_ip()
path = r'F:\TED'


def get_TED(url, count):
    page_part = url.split('=')
    for i in range(1, count + 1):
        url_ted = page_part[0] + '=' + str(i)
        # Send head as HTTP headers and pass the proxy through proxies=
        response = requests.get(url_ted, headers=head, proxies=proxies)
        html = response.text
        bs = BeautifulSoup(html, 'html.parser')
        talks_list = bs.find_all('div', attrs={'class': 'media__message'})
        for j in range(len(talks_list)):
            ted_a = talks_list[j].find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
            if not ted_a:
                continue  # no talk link in this block, skip it
            ted_url = 'https://www.ted.com' + ted_a[0]['href']
            print("TED talk title: " + ted_a[0].text)
            # Hand the talk page to you-get, which downloads the video into path
            os.system(r'you-get -o {} {}'.format(path, ted_url))


if __name__ == '__main__':
    url = 'https://www.ted.com/talks?page=1'
    count = int(input("Number of pages to download (36 TED talks per page): "))
    get_TED(url, count)

The code is available on my GitHub: https://github.com/goodloving/python
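A side note on the download step: os.system hands a formatted string to the shell, which breaks if the output path contains spaces. One alternative, sketched here under the assumption that the you-get command is installed and on PATH, is to pass an argument list to subprocess.run. `build_command` and `download` are hypothetical names, and the only you-get option relied on is the -o output directory already used above.

```python
import subprocess


def build_command(output_dir, talk_url):
    """Build the you-get argument list (hypothetical helper)."""
    # Same option as in the os.system call: -o sets the output directory.
    return ['you-get', '-o', output_dir, talk_url]


def download(output_dir, talk_url):
    """Run you-get and return its exit code; assumes you-get is on PATH."""
    # No shell is involved, so paths with spaces need no manual quoting.
    result = subprocess.run(build_command(output_dir, talk_url))
    return result.returncode


# The talk URL below is a placeholder, not a real talk.
print(build_command(r'F:\TED talks', 'https://www.ted.com/talks/example_talk'))
```

Checking the value returned by download also makes it possible to notice a failed download, which os.system's return value is easy to overlook.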