Maoyan Movie Crawler and Data Analysis

Published: 2023/12/14

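Because of the pandemic I have been staying at home. This post records an assignment: crawling Maoyan movie data, then analyzing and visualizing the scraped data.

Maoyan Movie Crawler

The crawler uses the requests and lxml libraries to scrape the Maoyan TOP100 board at https://maoyan.com/board/4

The scraped fields are: movie title, leading actors, release date and region, Maoyan score, genre, and running time.

The data is saved as a .csv file with the header: title (movie title), author (leading actors), pub_time (release date and region), star (Maoyan score), style (genre), long_time (running time in minutes).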
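To make the file layout concrete, here is a minimal sketch that writes the header and one row with the standard csv module and reads it back. The row values are invented placeholders, not scraped data, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

FIELDS = ['title', 'author', 'pub_time', 'star', 'style', 'long_time']

# Placeholder row: every value here is invented for illustration only.
sample_row = {
    'title': 'Example Film',
    'author': 'Actor A,Actor B',
    'pub_time': '2000-01-01(Country X)',
    'star': '9.0',
    'style': 'Drama',
    'long_time': '120',
}

buffer = io.StringIO(newline='')  # stands in for the real output file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(sample_row)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back: the comma inside `author` survives because csv quotes that field.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]['author'])
```

Because the actor list itself contains commas, letting the csv module handle quoting is safer than joining fields by hand.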

import csv

import requests
from lxml import etree

headers = {  # request headers
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}


def get_url(url):
    """Scrape one page of the TOP100 board (10 movies per page)."""
    res = requests.get(url, headers=headers)  # send the request
    # print(res.text)
    html = etree.HTML(res.text)  # parse the page source
    infos = html.xpath('//dl[@class="board-wrapper"]/dd')  # the 10 movies on this page
    for info in infos:
        title = info.xpath('div/div/div[1]/p[1]/a/text()')[0]  # movie title
        # str.strip('主演：') removes a set of characters rather than a prefix,
        # so the labels are removed explicitly with replace().
        # The labels are written in Simplified Chinese to match the site's text.
        author = info.xpath('div/div/div[1]/p[2]/text()')[0].strip().replace('主演：', '')  # leading actors
        pub_time = info.xpath('div/div/div[1]/p[3]/text()')[0].strip().replace('上映时间：', '')  # release date and region
        star_1 = info.xpath('div/div/div[2]/p/i[1]/text()')[0]  # score, integer part
        star_2 = info.xpath('div/div/div[2]/p/i[2]/text()')[0]  # score, fractional part
        star = star_1 + star_2  # full score
        movie_url = 'https://maoyan.com' + info.xpath('div/div/div[1]/p[1]/a/@href')[0]  # detail page
        # print(title, author, pub_time, star, movie_url)
        get_info(movie_url, title, author, pub_time, star)  # scrape the detail page
    print('Saved!')


def get_info(url, title, author, pub_time, star):
    """Scrape genre and running time from a movie's detail page and write one CSV row."""
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    style = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[1]/text()')[0]  # genre
    long_time = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()')[0].split('/')[1].strip().strip('分钟')  # running time in minutes
    print(title, author, pub_time, star, style, long_time)
    writer.writerow([title, author, pub_time, star, style, long_time])  # write the row


if __name__ == '__main__':
    # The with-block closes the file even if a request fails part-way through.
    with open('F://maoyan.csv', 'w', newline='', encoding='utf-8') as fp:  # output file
        writer = csv.writer(fp)  # module-level name, used by get_info()
        writer.writerow(['title', 'author', 'pub_time', 'star', 'style', 'long_time'])  # header row
        urls = ['https://maoyan.com/board/4?offset={}'.format(str(i)) for i in range(0, 100, 10)]  # one URL per page
        for url in urls:
            get_url(url)
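One detail worth illustrating when trimming page labels such as '主演：': str.strip(chars) treats its argument as a set of characters to remove from both ends, not as a prefix. The sketch below compares it with a prefix-aware helper. The helper name remove_label and the sample text are hypothetical; '演员甲' and '演员乙' are placeholder names, not scraped data.

```python
def remove_label(text: str, label: str) -> str:
    """Trim whitespace, then drop `label` only if it is a true prefix."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    return text


# Invented sample text; '演员甲' and '演员乙' are placeholder names.
raw = '\n    主演：演员甲,演员乙\n  '

print(remove_label(raw, '主演：'))   # label removed, names intact
print(raw.strip().strip('主演：'))   # also eats the leading '演' of the first name
```

The second call loses a character because '演' belongs to the character set passed to strip(). On Python 3.9 and later, str.removeprefix() does the same job as the helper.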

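Data Analysis

The analysis and visualization are based on pandas and matplotlib.

The analysis covers basic statistics of the 100 movies, how many times each actor appears in them, the distributions of release years, release months, and countries, the scores of the top 20 movies, and the distributions of genres and running times.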

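Number of appearances per actor in the 100 movies
(1) Read the movie data.
(2) Take the leading-actor column (author) and concatenate its values into one string.
(3) Split the string on "," and count how often each actor's name appears (there are many ways to do this; here collections.Counter is used).
(4) Select the six actors with the most appearances. Build the horizontal-axis data author and the vertical-axis data count.
(5) Draw a bar chart with matplotlib's pyplot and set title, xlabel, and ylabel.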

import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese actor names

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

# Join every row's actor list with ','. This works for any number of rows
# and leaves no trailing separator, so no hard-coded row count is needed.
s = ','.join(datas['author'])
# print(s)
authors = s.split(',')
# print(authors)
c = Counter(authors)
# print(c)
items = c.most_common(6)  # the six most frequent actors
print(items)

author = []
count = []
for item in items:
    author.append(item[0])
    count.append(item[1])
# print(author)
# print(count)

plt.bar(author, count, color='orange')
plt.title('Six actors with the most appearances')
plt.xlabel('Actor')
plt.ylabel('Appearances')
plt.show()

Result: [figure: bar chart of the six actors with the most appearances]
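The counting step can also be tried in isolation on a few invented cells. This is a minimal sketch, assuming each cell holds comma-separated names; unlike a plain split on the joined string, it also trims stray spaces and skips empty names, which are precautions added here rather than part of the original script.

```python
from collections import Counter

# Invented stand-ins for the `author` column; real cells hold comma-separated actor names.
author_cells = ['Actor A,Actor B', 'Actor B, Actor C', 'Actor A,Actor B,']

counts = Counter(
    name.strip()
    for cell in author_cells
    for name in cell.split(',')
    if name.strip()  # skip empty pieces such as the one after a trailing comma
)

top = counts.most_common(2)
print(top)

# Unzip into the two lists a bar chart needs.
names, freqs = zip(*top)
print(names, freqs)
```

most_common(n) returns (name, count) pairs sorted by descending count, so unzipping it yields the horizontal-axis and vertical-axis data directly.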

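Distribution of release years
(1) Read the movie data.
(2) Take the release date and region column (pub_time).
(3) Split each value on "-": the first element is the year and the second is the month. Store them as the new columns year and month.
(4) Use groupby to group the year column by year and count the movies in each group.
(5) Convert the index of the result to a list for the horizontal axis and the values to a list for the vertical axis.
(6) Draw a line chart with matplotlib's pyplot and set title, xlabel, and ylabel.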

import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

# Split pub_time on '-': element 0 is the year, element 1 is the month.
datas['year'] = datas['pub_time'].str.split('-').str[0]
datas['month'] = datas['pub_time'].str.split('-').str[1]

year = datas.groupby('year')['year'].count()  # movies per year
month = datas.groupby('month')['month'].count()  # movies per month
# print(list(year.index))
# print(list(year))
# print(month)

plt.figure(figsize=(20, 8), dpi=80)

plt.plot(list(year.index), list(year))
plt.title('Distribution of release years')
plt.xlabel('Year')
plt.ylabel('Number of movies')
plt.grid(alpha=0.4)
plt.show()

Result: [figure: line chart of the number of movies per release year]
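The year and month extraction can be sketched in plain Python with a regular expression, which also tolerates values that carry no month. The sample strings and the helper name parse_year_month are hypothetical, and the assumed layout 'YYYY-MM-DD(region)' is inferred from the splitting steps above rather than checked against the site.

```python
import re
from collections import Counter

# Four digits for the year, optionally followed by '-' and a two-digit month.
DATE_RE = re.compile(r'^(\d{4})(?:-(\d{2}))?')


def parse_year_month(pub_time):
    """Return (year, month) as strings; month is None when absent."""
    match = DATE_RE.match(pub_time.strip())
    if match is None:
        return None, None
    return match.group(1), match.group(2)


# Invented sample values, including one without a month and one unparseable entry.
samples = ['1993-07-26(Country X)', '1994-09-10', '1993', 'unknown']
parsed = [parse_year_month(s) for s in samples]

year_counts = Counter(year for year, _ in parsed if year is not None)
print(sorted(year_counts.items()))
```

Sorting the (year, count) pairs gives the same ordering as the groupby result, because four-digit year strings sort chronologically.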

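Distribution of countries
(1) Read the movie data.
(2) Define a function get_country() that splits the string and extracts the country part. 中国香港 (Hong Kong, China) is mapped to 中国 (China), 法国戛纳 (Cannes, France) is mapped to 法国 (France), and a value without a region in parentheses is counted as 中国.
(3) Take the release date and region column (pub_time).
(4) Apply get_country to each value and store the result as the new column country.
(5) Use groupby to group the country column by country and count the movies in each group.
(6) Convert the index of the result to a list for the labels and the values to a list for the slice sizes.
(7) Draw a pie chart with matplotlib's pyplot and set the title.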

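import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese country names

datas = pd.read_csv('maoyan.csv', encoding='utf-8')


def get_country(s):
    """Extract the country from a pub_time value such as 'YYYY-MM-DD(country)'."""
    # Country names are written in Simplified Chinese to match the scraped text.
    country = s.split('(')
    if len(country) == 1:  # no region in parentheses
        return '中国'
    temp = country[1].strip(')')
    if temp == '中国香港':
        return '中国'
    elif temp == '法国戛纳':
        return '法国'
    return temp


datas['country'] = datas['pub_time'].map(get_country)
# print(datas['country'])

country = datas.groupby('country')['country'].count()  # movies per country
# print(country)
# print(list(country))
# print(list(country.index))

# Pull out the second slice; sizing the list from the data avoids a length mismatch.
explods = [0.2 if i == 1 else 0 for i in range(len(country))]

plt.pie(list(country), labels=list(country.index), autopct='%1.1f%%', explode=explods)
plt.title('Distribution of countries')
plt.show()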

Result: [figure: pie chart of the number of movies per country]
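As a variation on the country extraction, the sketch below uses a regular expression and a lookup table, so further merges only require a new dictionary entry. It assumes the region appears in ASCII parentheses and that the scraped text is Simplified Chinese; the names COUNTRY_RE and MERGE and the sample strings are hypothetical.

```python
import re

COUNTRY_RE = re.compile(r'\(([^)]*)\)')  # text inside the first pair of parentheses

# Regions folded into their country, mirroring the two special cases described above.
MERGE = {'中国香港': '中国', '法国戛纳': '法国'}


def get_country(pub_time, default='中国'):
    """Return the country part of pub_time, or `default` when none is given."""
    match = COUNTRY_RE.search(pub_time)
    if match is None:
        return default
    region = match.group(1).strip()
    return MERGE.get(region, region)


# Invented sample values.
for value in ['2000-01-01(中国香港)', '2000-01-01(美国)', '2000-01-01']:
    print(value, '->', get_country(value))
```

Keeping the merges in a dictionary separates the data (which regions to fold) from the parsing logic, which is convenient if more regions turn up in the scraped file.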

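Distribution of running times
(1) Read the movie data.
(2) Take the running-time column (long_time).
(3) Group the values into bins that are 10 minutes wide.
(4) Draw a histogram with matplotlib's pyplot and set title, xlabel, and ylabel.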

import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

long_time = list(datas['long_time'])
# print(long_time)

d = 10  # bin width in minutes
bins = range(min(long_time), max(long_time) + d, d)  # bin edges

plt.hist(long_time, bins, density=True)

plt.xticks(bins)
plt.grid(alpha=0.4)

plt.title('Distribution of running times')
plt.xlabel('Running time (minutes)')
# With density=True the bar heights are densities: bar height times bin width is the proportion.
plt.ylabel('Density')

plt.show()

Result: [figure: histogram of running times in 10-minute bins]
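To show what the 10-minute binning computes, here is a plain-Python sketch on invented running times. It also prints the density that a histogram drawn with density=True would show for each bin, namely count / (n * bin width). Edge handling is simplified: every bin here is half-open, whereas matplotlib closes its last bin on the right, so a value exactly equal to the final edge could land in a different bin.

```python
from collections import Counter

durations = [95, 101, 109, 110, 118, 125, 142]  # invented running times in minutes
width = 10                                      # bin width in minutes
start = min(durations)

# Bin k covers the half-open interval [start + k*width, start + (k+1)*width).
bin_counts = Counter((d - start) // width for d in durations)

densities = {}
for k in sorted(bin_counts):
    low = start + k * width
    count = bin_counts[k]
    densities[k] = count / (len(durations) * width)
    print(f'[{low}, {low + width}): count={count}, density={densities[k]:.4f}')

# Density times bin width is the proportion of movies in a bin, so the proportions sum to 1.
total_proportion = sum(density * width for density in densities.values())
print(total_proportion)
```

This is why the bar heights in a density histogram with 10-minute bins are one tenth of the proportions of movies in each bin.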

[More analysis]
https://github.com/Tcrushes/crawler-analysis
