Scraping the Bilibili New Anime Ranking with a Python Crawler

Published: 2024/4/11 python 21 豆豆

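This post uses the requests and BeautifulSoup modules.
First, open the Bilibili new anime ranking page and inspect its elements with the browser's developer tools.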

Inspecting the page shows that each anime's information is stored in an li tag with the class rank-item.

All the specific details sit inside that tag, in a div with the class info.

Once we know where the data lives, we can start writing code.
First set a proxy and a User-Agent, then download the page and return its content as text.

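import requests


def get_html(url):
    # A dict keeps one value per key: of the three "http" proxies the
    # original listed (36.25.243.51, 39.137.95.70, 59.56.28.199), only the
    # last one ever took effect. requests also selects a proxy by the URL's
    # scheme, so this entry only applies to http:// URLs.
    proxies = {"http": "59.56.28.199"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/79.0.3945.130 Safari/537.36'
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    return response.text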
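
Because a Python dict keeps only one value per key, rotating through several proxies needs a list to pick from. The following is a minimal sketch of that idea, not the original author's method. The addresses are placeholders from the documentation-only 203.0.113.0/24 range with a made-up port, and the helper name pick_proxy is hypothetical (it is not part of requests):

```python
import random

# Placeholder proxy addresses (documentation-only range), not real servers.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(pool, rng=random):
    """Return a requests-style proxies dict that uses one randomly chosen proxy."""
    address = rng.choice(pool)
    # requests picks the entry whose key matches the URL scheme,
    # so register the same proxy for both http:// and https:// URLs.
    return {"http": address, "https": address}


proxies = pick_proxy(PROXY_POOL)
print(proxies)
```

The returned dict can be passed as the proxies argument, as in requests.get(url, headers=headers, proxies=proxies).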

The requests module simplifies this step considerably; it is much more concise than the urllib code I used before.

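Next we parse the downloaded data as HTML with the beautifulsoup4 module and keep the result in a soup object. We then call the find_all method to get every li tag with the class rank-item, pull the fields we need out of each one, and add them all to the data list.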

import bs4


def get_datas(text):
    soup = bs4.BeautifulSoup(text, "html.parser")
    data = []
    # Each ranking entry is an <li class="rank-item">.
    animes = soup.find_all("li", class_="rank-item")
    for anime in animes:
        info_link = anime.find('div', 'info').a
        title = info_link.string
        link = info_link['href']
        rank = anime.find('div', 'num').string
        updata = anime.find('div', 'pgc-info').string  # episode/update info
        # The three data-box spans are used as plays, comments, favorites.
        boxes = anime.find_all('span', class_='data-box')
        play, view, fav = boxes[0].text, boxes[1].text, boxes[2].text
        # Seven fields per anime, appended to one flat list.
        data.extend([rank, title, updata, play, view, fav, link])
    return data
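
To see what these lookups return without touching the network, the same calls can be tried on a small hand-written fragment. This is only a sketch: the HTML below is an assumption that mimics the structure described above (li.rank-item, div.num, div.info, div.pgc-info, span.data-box), the real page markup may differ, and the values and example.com link are invented.

```python
import bs4

# Hand-written sample that imitates the structure described in the post.
SAMPLE_HTML = """
<ul>
  <li class="rank-item">
    <div class="num">1</div>
    <div class="info">
      <a href="https://example.com/anime-a">Anime A</a>
      <div class="pgc-info">Episode 12</div>
      <span class="data-box">1200</span>
      <span class="data-box">34</span>
      <span class="data-box">560</span>
    </div>
  </li>
</ul>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for anime in soup.find_all("li", class_="rank-item"):
    link_tag = anime.find("div", "info").a
    boxes = [span.get_text(strip=True)
             for span in anime.find_all("span", class_="data-box")]
    # One dict per anime keeps the fields together.
    records.append({
        "rank": anime.find("div", "num").get_text(strip=True),
        "title": link_tag.get_text(strip=True),
        "link": link_tag["href"],
        "update": anime.find("div", "pgc-info").get_text(strip=True),
        "play": boxes[0],
        "view": boxes[1],
        "fav": boxes[2],
    })

print(records)
```

The sketch uses get_text(strip=True) rather than .string because .string is None when a tag contains more than one child, and a None value would later break ''.join.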

The result is one long, messy flat list, so we need to split it into groups of seven elements.

def Slicing(iterable, n):
    # Repeat one shared iterator n times so zip draws n consecutive items per tuple.
    return zip(*[iter(iterable)] * n)

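This may be a little hard to follow. iter() creates an iterator over the sequence, and n is the length of each group. [iter(iterable)] * n is a list that holds the same iterator n times, so each time zip builds a tuple it pulls n consecutive items from that single iterator. The result is an iterator of n-element tuples; leftover items that cannot fill a complete group are dropped.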
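
The grouping behavior is easiest to see on a tiny list. A minimal sketch with made-up values and a group size of 3 instead of 7; the zip_longest variant is an addition for comparison, not something the original code uses:

```python
from itertools import zip_longest


def slicing(iterable, n):
    """Group items into n-tuples by reusing one shared iterator n times."""
    return zip(*[iter(iterable)] * n)


flat = ["1", "Anime A", "Episode 12", "2", "Anime B", "Episode 3", "leftover"]

# zip stops as soon as the shared iterator cannot fill a whole tuple,
# so the trailing "leftover" item is dropped.
groups = list(slicing(flat, 3))
print(groups)

# zip_longest pads the last group instead of dropping it.
padded = list(zip_longest(*[iter(flat)] * 3, fillvalue=""))
print(padded)
```

For the crawler this dropping is harmless, since get_datas always appends exactly seven fields per anime.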

def main():
    url = "https://www.bilibili.com/ranking/bangumi/13/0/3?spm_id_from=333.851.b_62696c695f7265706f72745f616e696d65.51"
    text = get_html(url)
    datas = get_datas(text)
    # Mode 'a' appends, so running the script again adds to the existing file.
    with open('Bilibili新番榜排行前五十.txt', 'a', encoding="utf-8") as file:
        for rank, title, updata, play, view, fav, link in Slicing(datas, 7):
            # Labels: rank, title, episodes, plays, comments, favorites, link.
            file.write(''.join(['排名： ', rank, ' 標題： ', title, ' 集數： ', updata,
                                ' 觀看數： ', play, ' 評論數: ', view, ' 喜歡數: ', fav,
                                ' 鏈接：', link, '\n']))


if __name__ == "__main__":
    main()

Finally, we just save the data to a file encoded as UTF-8, using ''.join to concatenate the labels and list elements into a single string for each line.
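
The write step can be tried on its own with invented rows. A minimal sketch, assuming English labels, a temporary output file, and a hypothetical helper format_row; none of these names come from the original code:

```python
import os
import tempfile

# Invented sample rows: rank, title, episodes, plays, comments, favorites, link.
rows = [
    ("1", "Anime A", "Episode 12", "1200", "34", "560", "https://example.com/anime-a"),
    ("2", "Anime B", "Episode 3", "900", "21", "410", "https://example.com/anime-b"),
]
labels = ["Rank: ", " Title: ", " Episodes: ", " Plays: ",
          " Comments: ", " Favorites: ", " Link: "]


def format_row(row):
    """Pair each label with its value and end the line with a newline."""
    return "".join(label + value for label, value in zip(labels, row)) + "\n"


path = os.path.join(tempfile.mkdtemp(), "ranking.txt")
# encoding="utf-8" controls how the text is encoded when written to disk.
with open(path, "w", encoding="utf-8") as file:
    for row in rows:
        file.write(format_row(row))

with open(path, encoding="utf-8") as file:
    print(file.read())
```

Mode "w" is used here so repeated runs overwrite the file; the original uses "a", which appends to it instead.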

Complete code

import requests
import bs4


def get_html(url):
    # A dict keeps one value per key: of the three "http" proxies the
    # original listed (36.25.243.51, 39.137.95.70, 59.56.28.199), only the
    # last one ever took effect. requests also selects a proxy by the URL's
    # scheme, so this entry only applies to http:// URLs.
    proxies = {"http": "59.56.28.199"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/79.0.3945.130 Safari/537.36'
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    return response.text


def get_datas(text):
    soup = bs4.BeautifulSoup(text, "html.parser")
    data = []
    # Each ranking entry is an <li class="rank-item">.
    animes = soup.find_all("li", class_="rank-item")
    for anime in animes:
        info_link = anime.find('div', 'info').a
        title = info_link.string
        link = info_link['href']
        rank = anime.find('div', 'num').string
        updata = anime.find('div', 'pgc-info').string  # episode/update info
        # The three data-box spans are used as plays, comments, favorites.
        boxes = anime.find_all('span', class_='data-box')
        play, view, fav = boxes[0].text, boxes[1].text, boxes[2].text
        # Seven fields per anime, appended to one flat list.
        data.extend([rank, title, updata, play, view, fav, link])
    return data


def Slicing(iterable, n):
    # Repeat one shared iterator n times so zip draws n consecutive items per tuple.
    return zip(*[iter(iterable)] * n)


def main():
    url = "https://www.bilibili.com/ranking/bangumi/13/0/3?spm_id_from=333.851.b_62696c695f7265706f72745f616e696d65.51"
    text = get_html(url)
    datas = get_datas(text)
    # Mode 'a' appends, so running the script again adds to the existing file.
    with open('Bilibili新番榜排行前五十.txt', 'a', encoding="utf-8") as file:
        for rank, title, updata, play, view, fav, link in Slicing(datas, 7):
            # Labels: rank, title, episodes, plays, comments, favorites, link.
            file.write(''.join(['排名： ', rank, ' 標題： ', title, ' 集數： ', updata,
                                ' 觀看數： ', play, ' 評論數: ', view, ' 喜歡數: ', fav,
                                ' 鏈接：', link, '\n']))


if __name__ == "__main__":
    main()
