Bike-Sharing Dataset: A Bike-Sharing Data Visualization Report

Published: 2024/4/11 · Programming Q&A · 28 · 豆豆


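1.1 Project description

A bike-sharing system is a way of renting bicycles in which membership registration, rental, and return are all handled automatically through a network of stations across a city. With such a system, people can rent a bike at one location as needed and return it at their destination. To serve and maintain the platform better, developers and operators study the data the system records, mine it for useful information, and turn the findings into concrete improvements. The data these systems generate is attractive for analysis because trip duration, departure location, arrival location, and elapsed time are all recorded explicitly.

In this competition, participants combine historical usage patterns with weather data to predict bike rental demand for the Capital Bikeshare program in Washington, D.C.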

1.2 Variable description

datetime (date and time) - hourly date + timestamp

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

workingday - whether the day is neither a weekend nor holiday

weather (weather level) -

1. Clear, few clouds, partly cloudy

2. Mist + cloudy, mist + broken clouds, mist + few clouds, mist

3. Light snow, light rain + thunderstorm + scattered clouds, light rain + clouds

4. Heavy rain + hail + thunderstorm + mist, snow + fog

temp (temperature) - temperature in Celsius

atemp (apparent temperature) - "feels like" temperature in Celsius

humidity - relative humidity

windspeed - wind speed

casual (casual rentals) - number of non-registered user rentals initiated

registered (member rentals) - number of registered user rentals initiated

count (total rentals) - number of total rentals

2. Data preparation

2.1 Checking for missing values

%matplotlib inline
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv')
# Check the training set for missing values
train.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB

# Check the test set for missing values
test = pd.read_csv('test.csv')
test.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB

The dataset has no missing values, but the absence of missing values does not mean the absence of anomalies.

2.2 Merging the training and test sets (so that preprocessing can be applied to both at once)

# Configure fonts so that Chinese characters render correctly in plots
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['font.serif'] = ['SimHei']
import seaborn as sns
sns.set_style("darkgrid", {"font.sans-serif": ['simhei', 'Arial']})

# Stack the training and test rows, then count missing values per column
dataset = pd.concat([train, test], axis=0)
dataset.isnull().sum()

Output

atemp            0
casual        6493
count         6493
datetime         0
holiday          0
humidity         0
registered    6493
season           0
temp             0
weather          0
windspeed        0
workingday       0
dtype: int64

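The three target variables have missing values, and the number of missing entries equals the size of the test set. The data is therefore complete: the only values missing are the targets that have to be predicted. Values more than 3 standard deviations away from the sample mean are regarded as outliers; this article leaves outliers unhandled for now.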
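The 3-standard-deviation rule mentioned above can be sketched as follows. This is a minimal illustration on synthetic numbers, not the article's own processing: the helper name `flag_three_sigma` and the generated data are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def flag_three_sigma(values: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where a value lies more than
    3 standard deviations from the mean (NaN values are never flagged)."""
    mean = values.mean()
    std = values.std()
    return (values - mean).abs() > 3 * std

# Synthetic stand-in for a rental-count column: 200 ordinary values plus one spike.
rng = np.random.default_rng(0)
counts = pd.Series(rng.normal(loc=100, scale=10, size=200))
counts.loc[200] = 1000.0  # hypothetical extreme value

mask = flag_three_sigma(counts)
print(int(mask.sum()), "value(s) flagged")
```

Applied to the real data, the same mask could be computed on the training rows only, since the target columns are empty for the test rows.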

Converting the numerically coded features into fields that are easier to read and visualize

# season and weather features
dataset['season'] = dataset['season'].map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'})
dataset['weather'] = dataset['weather'].map({1: 'Good', 2: 'Normal', 3: 'Bad', 4: 'Very Bad'})

This turns the numerically coded season and weather features into string labels.
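One thing worth checking after a dictionary-based `map` is whether every code was covered, because pandas returns NaN for keys that are missing from the dictionary. The sketch below uses a made-up column in which the code 9 is deliberately left out of the mapping.

```python
import pandas as pd

# Hypothetical coded column; the code 9 is deliberately absent from the mapping.
codes = pd.Series([1, 2, 3, 4, 9])
labels = {1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'}

mapped = codes.map(labels)

# Series.map returns NaN for keys missing from the dict, so count them.
unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("unmapped codes:", unmapped)
```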

3 Feature derivation

To understand the data better, before the analysis we need to judge whether valuable new features can be extracted or constructed from the features the data already has. These are called derived features.

In the data analyzed here, features can be derived from the datetime column. It holds both date and time information, from which we can extract the year, month, day, and hour.

# Feature derivation
dataset['datetime'] = pd.to_datetime(dataset['datetime'])  # convert the datetime column to datetime type
dataset['year'] = dataset['datetime'].dt.year    # extract the year
dataset['month'] = dataset['datetime'].dt.month  # extract the month
dataset['day'] = dataset['datetime'].dt.day      # extract the day of the month
dataset['hour'] = dataset['datetime'].dt.hour    # extract the hour

# Drop the column that is no longer needed
dataset.drop('datetime', axis=1, inplace=True)

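4 Visual analysis

Tools such as Matplotlib and Seaborn are used to understand the data visually and to analyze how the features relate to the label.

4.1 Overall distribution of the label

count (the total number of rentals) is the label in this analysis, so the visualizations center on it. The kernel density plot of count shows a right-skewed distribution, concentrated mainly in the 0 to 400 range.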
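Right skew can also be checked numerically rather than read off a density plot. The following is a sketch on synthetic data: the log-normal draw is an assumption chosen only because it has a long right tail, and it stands in for the real count column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for the count column (an assumption;
# a log-normal draw is used only because it has a long right tail).
rng = np.random.default_rng(42)
count = pd.Series(rng.lognormal(mean=4.5, sigma=1.0, size=5000))

skewness = count.skew()            # > 0 indicates a right-skewed distribution
mean, median = count.mean(), count.median()

print(f"skewness={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
```

A positive skewness together with a mean above the median is the numeric counterpart of the long right tail seen in a density plot.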

4.2 Demand trends over time

By month

The data spans two years, so the monthly demand trend is plotted separately for each year.

# Select the 2011 and 2012 riding data and sum it by month
Month_tendency_2011 = dataset[dataset.year == 2011].groupby('month')[['casual', 'registered', 'count']].sum()
Month_tendency_2012 = dataset[dataset.year == 2012].groupby('month')[['casual', 'registered', 'count']].sum()

# Draw the plots
plt.style.use('ggplot')
fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(12, 15))
plt.subplots_adjust(hspace=0.3)

Month_tendency_2011.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=ax1)
ax1.set_title('Demand trend, 2011', fontsize=15)
ax1.grid(linestyle='--', alpha=0.8)
ax1.set_ylim(0, 150000)
ax1.set_xlabel('Month', fontsize=13)
ax1.set_ylabel('Count', fontsize=13)

Month_tendency_2012.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=ax2)
ax2.set_title('Demand trend, 2012', fontsize=15)
ax2.grid(linestyle='--', alpha=0.8)
ax2.set_ylim(0, 150000)
ax2.set_xlabel('Month', fontsize=13)
ax2.set_ylabel('Count', fontsize=13)
sns.despine(left=True)

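The plot above shows the result. Overall, riding demand grew substantially from 2011 to 2012. Total demand and registered-user demand changed the most, while non-registered demand rose only modestly. This suggests that after a year of operation more and more riders had registered on the platform, so registered usage grew markedly. By month, June to September is the peak season and December to February is the off season.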
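The year-over-year comparison described above can be put into numbers with a grouped sum and a ratio. The figures in this sketch are invented for illustration and do not come from the dataset; only the column names follow the article.

```python
import pandas as pd

# Hypothetical hourly records; the numbers are made up for illustration.
df = pd.DataFrame({
    'year':       [2011, 2011, 2011, 2012, 2012, 2012],
    'casual':     [10, 20, 30, 15, 25, 35],
    'registered': [50, 60, 70, 100, 120, 140],
})
df['count'] = df['casual'] + df['registered']

yearly = df.groupby('year')[['casual', 'registered', 'count']].sum()

# Relative growth of 2012 over 2011 for each user group.
growth = yearly.loc[2012] / yearly.loc[2011] - 1
print(growth.round(3))
```

With these made-up numbers the registered group doubles while the casual group grows by a quarter, which is the kind of gap the plots suggest.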

By hour

Visualizing riding demand across the 24 hours of a day reveals some interesting patterns:

# Compute the mean riding demand for each hour
Hour_tendency = dataset.groupby('hour')[['casual', 'registered', 'count']].mean()

# Draw the plot
Hour_tendency.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', figsize=(12, 6))
plt.title('Hourly demand trend', fontsize=15)
plt.grid(linestyle='--', alpha=0.8)
plt.ylim(0, 550)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Count', fontsize=13)
sns.despine(left=True)

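Looking at demand over 24 hours, there are two peaks, at 6 to 7 AM and at 5 to 6 PM, plus a smaller peak at 12 PM (noon). This matches commuting patterns. The pattern is more pronounced for registered users, while demand from non-registered users changes more gradually.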
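Peaks in an hourly profile can be located programmatically by comparing each hour with its two neighbours. The profile below is invented to resemble a commuting pattern and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical mean rentals per hour (index = hour of day); values are made up.
hourly_mean = pd.Series(
    [20, 10, 8, 5, 6, 40, 200, 360, 220, 170, 180, 200,
     250, 245, 240, 250, 310, 460, 430, 310, 230, 170, 130, 90],
    index=range(24),
)

# An hour is a local peak if its value exceeds both neighbours.
is_peak = (hourly_mean > hourly_mean.shift(1)) & (hourly_mean > hourly_mean.shift(-1))
peaks = hourly_mean[is_peak].sort_values(ascending=False)
print(peaks)
```

The first and last hours are never reported as peaks here because they have only one neighbour in the series.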

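By hour, working days versus non-working days

plt.style.use('ggplot')
plt.figure(figsize=(12, 6))
# ci=None turns off the error bars (newer seaborn versions prefer errorbar=None)
sns.pointplot(x='hour', y='count', hue='workingday', data=dataset, ci=None, palette='Spectral')
sns.despine(left=True)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.grid(linestyle='--', alpha=0.5)
plt.title('Demand trend: working days vs. non-working days', fontsize=15)  # set the title

Working days and non-working days show sharply contrasting demand patterns. On non-working days bike sharing wakes up slowly, and the demand peak falls between 12 PM and 3 PM. In addition, demand in the early hours from midnight to 4 AM is higher than on working days.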

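By hour, across seasons

plt.figure(figsize=(12, 6))
sns.pointplot(x='hour', y='count', hue='season', data=dataset, ci=None, palette='Spectral')
sns.despine(left=True)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.grid(linestyle='--', alpha=0.5)
plt.title('Demand trend by season', fontsize=15)

The 24-hour demand trend has the same shape in all four seasons. Notably, spring is lower overall than summer, fall, and winter, and fall, the season with the most comfortable weather, has the highest overall demand. Why is spring demand lower than in the other seasons? The question deserves thought, but with the limited information in the data no reasonable explanation was found. This also shows the importance of business context: only a deep understanding of how the business works allows accurate judgments.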
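One check that may help with the open question above is to list which calendar months each season code covers in the data. The sketch below runs on made-up rows in which the season code follows calendar quarters; that assignment is purely an assumption for illustration and should be verified against the real dataset. If code 1 turned out to cover the coldest months of the year, lower demand for that code would be less surprising.

```python
import pandas as pd

# Hypothetical rows: one per month, with a season code assigned by calendar
# quarter (an assumption for illustration, not taken from the source data).
df = pd.DataFrame({'month': range(1, 13)})
df['season'] = (df['month'] - 1) // 3 + 1

# List the months that fall under each season code.
months_per_season = df.groupby('season')['month'].apply(list)
print(months_per_season)
```

On the article's merged dataset the same `groupby('season')['month']` call should work unchanged, since both columns exist after the feature derivation step.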

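4.3 Weather

External conditions such as weather and temperature affect riding demand.

Riding demand under different weather conditions

# Aggregate the data by weather level and day of the month
Weather_Demand = dataset.groupby(['weather', 'day'])[['count']].sum()
Weather_Demand.reset_index(inplace=True)

# Scatter of riding demand for each weather level
plt.figure(figsize=(12, 6))
sns.stripplot(x='weather', y='count', data=Weather_Demand, palette='Set1', jitter=True, alpha=0.6)
sns.despine(left=True)
plt.xlabel('Weather', fontsize=13)  # the x axis is weather, not season
plt.ylabel('Count', fontsize=13)
plt.title('Demand distribution by weather', fontsize=15)

As the plot shows, the worse the weather, the lower the riding demand.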

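# Hourly demand trend for each weather level
plt.figure(figsize=(12, 6))
sns.pointplot(x='hour', y='count', hue='weather', data=dataset, ci=None, palette='Spectral')
sns.despine(left=True)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.grid(linestyle='--', alpha=0.5)
plt.title('Demand trend by weather', fontsize=15)

The overall trend confirms that the worse the weather, the lower the riding demand. The extreme weather level does not form a trend line, because extreme conditions are rare.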

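Temperature and apparent temperature

Temperature and apparent temperature should be positively correlated. First, look at their joint distribution:

# Relationship between temperature and apparent temperature
plt.figure(figsize=(10, 8))
# fill= replaces the older shade= argument (seaborn >= 0.11);
# the default thresh leaves the lowest density level unshaded
sns.kdeplot(data=dataset, x='temp', y='atemp', fill=True, cut=10, cmap='YlGnBu', cbar=True)
sns.despine(left=True)
plt.grid(linestyle='--', alpha=0.4)
plt.xlim(0, 50)
plt.ylim(0, 50)
plt.xlabel('Temperature', fontsize=13)
plt.ylabel('Apparent temperature', fontsize=13)
plt.title('Temperature vs. apparent temperature', fontsize=15)

The plot shows that the two are positively correlated. It also shows where the records are most concentrated, which is the darkest region: an air temperature of 27℃ to 28℃ with an apparent temperature of about 31℃. Note that this density counts hourly records, so it reflects how often these conditions occur rather than the number of rentals directly.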
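The positive relationship can also be expressed as a single number with the Pearson correlation coefficient. The sketch below uses synthetic columns in which apparent temperature is modelled as temperature plus noise; that model is an assumption made only to produce two related series.

```python
import numpy as np
import pandas as pd

# Synthetic temperatures: atemp is modelled as temp plus noise (an assumption
# made only to get two positively related columns).
rng = np.random.default_rng(1)
temp = rng.uniform(0, 40, size=1000)
atemp = 1.1 * temp + rng.normal(0, 2, size=1000)
df = pd.DataFrame({'temp': temp, 'atemp': atemp})

r = df['temp'].corr(df['atemp'])   # Pearson correlation by default
print(f"Pearson r = {r:.3f}")
```

On the real data the equivalent call would be `dataset['temp'].corr(dataset['atemp'])`; a value close to 1 would support dropping one of the two columns later on.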

# Relationship between temperature and wind speed
plt.figure(figsize=(10, 8))
sns.kdeplot(data=dataset, x='temp', y='windspeed', fill=True, cut=10, cmap='YlGnBu', cbar=True)
sns.despine(left=True)
plt.grid(linestyle='--', alpha=0.4)
plt.xlim(0, 50)
plt.ylim(-10, 40)
plt.xlabel('Temperature', fontsize=13)
plt.ylabel('Wind speed', fontsize=13)
plt.title('Temperature vs. wind speed', fontsize=15)

The wind speed distribution has a gap, which on first inspection may indicate anomalous data; it is left unhandled here. The records are most concentrated at an air temperature of 27℃ to 28℃ and a wind speed of 8 to 9.
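A gap like the one described above can be located by sorting the distinct values and looking at the spacing between neighbours. The readings in this sketch are invented (many zeros and nothing between 0 and 6) and only illustrate the technique.

```python
import pandas as pd

# Hypothetical wind-speed readings with many zeros and nothing between 0 and 6.
wind = pd.Series([0.0, 0.0, 0.0, 6.0, 7.0, 7.0, 9.0, 11.0, 13.0, 0.0, 15.0])

levels = pd.Series(sorted(wind.unique()))   # distinct values, ascending
gaps = levels.diff()                         # spacing between neighbouring levels

largest_gap_end = levels[gaps.idxmax()]      # value right after the widest gap
zero_share = (wind == 0).mean()              # fraction of readings equal to 0

print(f"widest gap ends at {largest_gap_end}, share of zeros = {zero_share:.2f}")
```

A large share of exact zeros followed by a wide gap would be consistent with the suspicion that some readings are placeholders rather than measurements, though that remains to be confirmed on the real column.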

Temperature and humidity

# Relationship between temperature and humidity
plt.figure(figsize=(10, 8))
sns.kdeplot(data=dataset, x='temp', y='humidity', fill=True, cut=10, cmap='YlGnBu', cbar=True)
sns.despine(left=True)
plt.grid(linestyle='--', alpha=0.4)
plt.xlim(0, 40)
plt.ylim(0, 110)
plt.xlabel('Temperature', fontsize=13)
plt.ylabel('Humidity', fontsize=13)
plt.title('Temperature vs. humidity', fontsize=15)

The records are most concentrated at an air temperature of 27℃ to 28℃ and a humidity of 80 to 90.

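Weather factors and riding demand

Here temperature and wind speed are treated as the weather factors for analyzing demand (apparent temperature is left out, with air temperature standing in for it). The data is drawn as a scatter plot in which the size and color of each point encode the number of rentals; since the data is not aggregated, each point is one hourly record.

plt.figure(figsize=(12, 6))
plt.scatter(x='temp', y='windspeed', s=dataset['count'] / 2, c='count', cmap='RdBu_r',
            edgecolors='black', linestyle='--', linewidth=0.2, alpha=0.6, data=dataset)
plt.title('Weather factors and riding demand', fontsize=15)
plt.xlabel('Temperature', fontsize=13)
plt.ylabel('Wind speed', fontsize=13)
sns.despine(left=True)

Demand is concentrated at temperatures of 20℃ to 35℃ and wind speeds below 40. People prefer to ride in warm weather. Demand is also high in hot weather, possibly because cycling is more flexible and convenient than public transport.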

Taking temperature and humidity as the weather factors:

plt.figure(figsize=(12, 6))
plt.scatter(x='temp', y='humidity', s=dataset['count'] / 2, c='count', cmap='RdBu_r',
            edgecolors='black', linestyle='--', linewidth=0.2, alpha=0.6, data=dataset)
plt.title('Weather factors and riding demand', fontsize=15)
plt.xlabel('Temperature', fontsize=13)
plt.ylabel('Humidity', fontsize=13)
sns.despine(left=True)

Demand is concentrated at temperatures of 20℃ to 35℃ and humidity of 20 to 80.

4.4 Registered and non-registered users

How can the platform grow its user base? By acquiring new users, keeping them active, and converting them. Next, let's look at how registered and non-registered users differ.

# Derived feature: difference in demand between registered and non-registered users
dataset['dif'] = dataset['registered'] - dataset['casual']

fig, axes = plt.subplots(2, 2, figsize=(20, 8))
plt.subplots_adjust(hspace=0.3, wspace=0.1)

# Panel 1: difference by month
Month_Dif = dataset.groupby('month')[['casual', 'registered']].mean()
Month_Dif.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=axes[0, 0])
axes[0, 0].set_title('Demand by month', fontsize=15)
axes[0, 0].grid(linestyle='--', alpha=0.8)
axes[0, 0].set_xlabel('Month', fontsize=13)
axes[0, 0].set_ylabel('Count', fontsize=13)

# Panel 2: difference by hour
H1 = dataset.groupby('hour')[['casual', 'registered']].mean()
H1.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=axes[0, 1])
axes[0, 1].set_title('Demand by hour', fontsize=15)
axes[0, 1].grid(linestyle='--', alpha=0.8)
axes[0, 1].set_xlabel('Hour', fontsize=13)
axes[0, 1].set_ylabel('Count', fontsize=13)

# Panel 3: working days
H2_1 = dataset[dataset.workingday == 1].groupby('hour')[['casual', 'registered']].mean()  # working days
H2_0 = dataset[dataset.workingday == 0].groupby('hour')[['casual', 'registered']].mean()  # non-working days
H2_1.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=axes[1, 0])
axes[1, 0].set_title('Demand by hour on working days', fontsize=15)
axes[1, 0].grid(linestyle='--', alpha=0.8)
axes[1, 0].set_xlabel('Hour', fontsize=13)
axes[1, 0].set_ylabel('Count', fontsize=13)

# Panel 4: non-working days
H2_0.plot(kind='line', linestyle='--', linewidth=2, colormap='Set1', ax=axes[1, 1])
axes[1, 1].set_title('Demand by hour on non-working days', fontsize=15)
axes[1, 1].grid(linestyle='--', alpha=0.8)
axes[1, 1].set_xlabel('Hour', fontsize=13)
axes[1, 1].set_ylabel('Count', fontsize=13)
sns.despine(left=True)

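The figure above combines the differences between registered and non-registered users across several factors:

The first panel shows the difference by month: the overall trends are the same, with registered users far above non-registered users.

The second panel shows the difference by hour: registered users follow a clear hourly pattern, while non-registered users peak only between 12 PM and 5 PM. The overall difference is large.

The third panel shows working days: registered users follow a clear hourly pattern, while the trend for non-registered users is flat.

The fourth panel shows non-working days: the gap between the two groups is smaller than in the other panels and the trends match. Notably, registered users are more active in the early hours.

Overall, demand from registered users is far higher than from non-registered users, and registered users follow a clear usage pattern, whereas non-registered users are comparatively less affected by other factors.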
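The comparison between the two user groups can be summarized as the share of rentals made by registered users on working and non-working days. The numbers below are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical hourly rows; the numbers are made up for illustration.
df = pd.DataFrame({
    'workingday': [1, 1, 1, 0, 0, 0],
    'casual':     [5, 10, 15, 40, 60, 50],
    'registered': [95, 190, 185, 60, 90, 100],
})

totals = df.groupby('workingday')[['casual', 'registered']].sum()

# Share of rentals made by registered users on working vs. non-working days.
registered_share = totals['registered'] / totals.sum(axis=1)
print(registered_share.round(3))
```

A markedly lower share on non-working days would match the fourth panel, where the gap between the two groups narrows.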
