Kaggle Bike Sharing Data Analysis and Prediction (Random Forest)

Published: 2024/7/5

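Contents

1. Data Collection
- 1.1 Project Description
- 1.2 Data Contents and Variables
2. Data Processing
- 2.1 Importing the Data
- 2.2 Handling Missing Values
- 2.3 Handling Outliers in the Label (count)
- 2.4 Handling Outliers in Other Columns
- 2.5 Processing Datetime Data
3. Data Analysis
- 3.1 Descriptive Analysis
- 3.2 Exploratory Analysis
  - 3.2.1 Overall Analysis
  - 3.2.2 Correlation Analysis
  - 3.2.3 Analysis of Influencing Factors
    - 3.2.3.1 Effect of Hour of Day on Rentals
    - 3.2.3.2 Effect of Temperature on Rentals
    - 3.2.3.3 Effect of Humidity on Rentals
    - 3.2.3.4 Effect of Year and Month on Rentals
    - 3.2.3.5 Effect of Season on Ridership
    - 3.2.3.6 Effect of Weather on Ridership
    - 3.2.3.7 Effect of Wind Speed on Ridership
    - 3.2.3.8 Effect of Date on Ridership
- 3.3 Predictive Analysis
  - 3.3.1 Selecting Features
  - 3.3.2 Separating the Training and Test Sets
  - 3.3.3 Dropping Redundant Features
  - 3.3.4 Choosing and Training the Model
  - 3.3.5 Predicting on the Test Set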

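1. Data Collection

1.1 Project Description

A bike sharing system is a way of renting bicycles in which membership registration, rental, and return are all handled automatically through a network of stations across a city. With such a system, people can rent a bike at one location and return it at their destination as needed. In this competition, participants combine historical usage patterns with weather data to forecast rental demand for the Capital Bikeshare program in Washington, D.C.

1.2 Data Contents and Variables

The competition provides hourly rental data spanning two years, including weather and date information. The training set consists of the first 19 days of each month; the test set covers the 20th through the end of each month.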

2. Data Processing

2.1 Importing the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', palette='tab10')

train = pd.read_csv(r'D:\A\Data\ufo\train.csv', encoding='utf-8')
train.info()
test = pd.read_csv(r'D:\A\Data\ufo\test.csv', encoding='utf-8')
test.info()

2.2 Handling Missing Values

# Visualize missing values
import missingno as msno
msno.matrix(train, figsize=(12, 5))
msno.matrix(test, figsize=(12, 5))

This dataset has no missing values, so no missing-value handling is needed.

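2.3 Handling Outliers in the Label (count)

# Descriptive statistics of the training set
train.describe().T

Starting with the numeric columns, the rental count (count) varies widely. Let's look at its density distribution:

# Density distribution of the rental count
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
fig.set_size_inches(6, 5)
sns.distplot(train['count'])
ax.set(xlabel='count', title='Distribution of count')

The distribution is heavily skewed with a long tail, so we want to trim that tail. As a first attempt, drop the rows that lie more than 3 standard deviations from the mean and see whether that is enough.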

# Keep only rows whose count is within 3 standard deviations of the mean
train_WithoutOutliers = train[np.abs(train['count'] - train['count'].mean()) <= (3 * train['count'].std())]
print(train_WithoutOutliers.shape)
train_WithoutOutliers['count'].describe()

The difference from the unfiltered data is not obvious, so compare the two visually:

fig = plt.figure()
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
fig.set_size_inches(12, 5)
sns.distplot(train_WithoutOutliers['count'], ax=ax1)
sns.distplot(train['count'], ax=ax2)
ax1.set(xlabel='count', title='Distribution of count without outliers')
ax2.set(xlabel='count', title='Distribution of count')

The data still varies a lot. We would like the target to be more stable, since a widely spread target makes overfitting more likely, so we apply a log transform to stabilize it.

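yLabels = train_WithoutOutliers['count']
yLabels_log = np.log(yLabels)
sns.distplot(yLabels_log)

After the log transform the distribution is more even and the spread is smaller, which makes this label better suited to training the model. Next we process the remaining numeric columns. Because those columns appear in both datasets, we first merge the two datasets to simplify processing.

Bike_data = pd.concat([train_WithoutOutliers, test], ignore_index=True)
# Check the size of the combined dataset
Bike_data.shape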
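Returning briefly to the log transform of count: the effect can be checked numerically rather than only by eye. The sketch below is a minimal illustration on synthetic, right-skewed data (a lognormal sample standing in for hourly counts, not the Kaggle data). It compares skewness before and after `np.log` and confirms that `np.exp` restores the original scale, which is the inverse applied later when writing predictions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for hourly rental counts (assumption: not the real data)
counts = pd.Series(rng.lognormal(mean=4.5, sigma=1.0, size=5000)).round().clip(lower=1)

log_counts = np.log(counts)      # same transform as yLabels_log
restored = np.exp(log_counts)    # inverse transform back to the count scale

skew_before = counts.skew()
skew_after = log_counts.skew()
print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

A skewness much closer to zero after the transform is what "more even" means in practice here.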

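2.4 Handling Outliers in Other Columns

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.distplot(Bike_data['temp'], ax=axes[0, 0])
sns.distplot(Bike_data['atemp'], ax=axes[0, 1])
sns.distplot(Bike_data['humidity'], ax=axes[1, 0])
sns.distplot(Bike_data['windspeed'], ax=axes[1, 1])
axes[0, 0].set(xlabel='temp', title='Distribution of temp')
axes[0, 1].set(xlabel='atemp', title='Distribution of atemp')
axes[1, 0].set(xlabel='humidity', title='Distribution of humidity')
axes[1, 1].set(xlabel='windspeed', title='Distribution of windspeed')

These distributions reveal a few problems. For example, why are there so many wind speed values of 0? The descriptive statistics also show a gap with no values between 1 and 6. This suggests that the data originally had missing wind speeds that were filled with 0. Because these zeros would interfere with prediction, we use a random forest to fill them in, based on features such as year, month, season, temperature, and humidity. Before filling, look at the descriptive statistics of the non-zero data.

Bike_data[Bike_data['windspeed'] != 0]['windspeed'].describe()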

from sklearn.ensemble import RandomForestRegressor

# Note: the 'year' and 'month' columns used below are created in section 2.5, so run that step first
Bike_data["windspeed_rfr"] = Bike_data["windspeed"]
# Split the data into rows with wind speed equal to 0 and rows with non-zero wind speed
dataWind0 = Bike_data[Bike_data["windspeed_rfr"] == 0].copy()
dataWindNot0 = Bike_data[Bike_data["windspeed_rfr"] != 0]
# Choose the model
rfModel_wind = RandomForestRegressor(n_estimators=1000, random_state=42)
# Choose the features
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Fit the model on the rows with non-zero wind speed
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])
# Predict wind speed for the zero rows
wind0Values = rfModel_wind.predict(X=dataWind0[windColumns])
# Write the predictions into the zero rows
dataWind0.loc[:, "windspeed_rfr"] = wind0Values
# Recombine the two parts (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
Bike_data = pd.concat([dataWindNot0, dataWind0])
Bike_data.reset_index(drop=True, inplace=True)
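The same fill-in idea can be tried in isolation. The following is a minimal sketch on synthetic data, with a made-up relationship between temperature and wind speed and hypothetical column names. It fits a random forest on the rows with a known wind speed and predicts the rows recorded as 0. It uses a boolean mask instead of splitting and re-concatenating, so the row order stays unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500

# Synthetic features (assumption: illustrative only, not the Kaggle columns)
df = pd.DataFrame({
    "temp": rng.uniform(0, 40, n),
    "humidity": rng.uniform(20, 100, n),
})
# Hypothetical relationship between temperature and wind speed
df["windspeed"] = 6 + 0.3 * df["temp"] + rng.normal(0, 1, n)
# Pretend about 20% of the readings were recorded as 0
df.loc[rng.random(n) < 0.2, "windspeed"] = 0.0

features = ["temp", "humidity"]
known = df["windspeed"] != 0  # rows with a usable wind speed

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(df.loc[known, features], df.loc[known, "windspeed"])

# Fill only the zero rows; all other rows keep their measured value
df["windspeed_filled"] = df["windspeed"]
df.loc[~known, "windspeed_filled"] = model.predict(df.loc[~known, features])
```

Because the mask approach never reorders rows, no index reset is needed afterwards.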

The density distributions after the random forest fill:

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.distplot(Bike_data['temp'], ax=axes[0, 0])
sns.distplot(Bike_data['atemp'], ax=axes[0, 1])
sns.distplot(Bike_data['humidity'], ax=axes[1, 0])
sns.distplot(Bike_data['windspeed_rfr'], ax=axes[1, 1])
axes[0, 0].set(xlabel='temp', title='Distribution of temp')
axes[0, 1].set(xlabel='atemp', title='Distribution of atemp')
axes[1, 0].set(xlabel='humidity', title='Distribution of humidity')
axes[1, 1].set(xlabel='windspeed', title='Distribution of windspeed')

2.5 Processing Datetime Data

from datetime import datetime

# This code assumes slash-separated datetime strings such as '2011/1/1 0:00'.
# The file as downloaded from Kaggle uses '-' instead; in that case change the separator and the format string.
Bike_data['date'] = Bike_data.datetime.apply(lambda c: c.split()[0])
Bike_data['hour'] = Bike_data.datetime.apply(lambda c: c.split()[1].split(':')[0]).astype('int')
Bike_data['year'] = Bike_data.datetime.apply(lambda c: c.split()[0].split('/')[0]).astype('int')
Bike_data['month'] = Bike_data.datetime.apply(lambda c: c.split()[0].split('/')[1]).astype('int')
Bike_data['weekday'] = Bike_data.date.apply(lambda c: datetime.strptime(c, '%Y/%m/%d').isoweekday())
Bike_data.head()
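As an alternative to splitting strings by hand, the same columns can be derived with `pd.to_datetime` and the `.dt` accessor. This is only a sketch on two made-up timestamps, assuming the slash-separated format used above. The `+ 1` makes `dayofweek` (Monday=0) match `isoweekday()` (Monday=1).

```python
import pandas as pd

# Two sample timestamps (assumption: slash-separated, zero-padded)
sample = pd.DataFrame({"datetime": ["2011/01/01 00:00", "2012/12/19 23:00"]})

ts = pd.to_datetime(sample["datetime"], format="%Y/%m/%d %H:%M")
sample["date"] = ts.dt.strftime("%Y/%m/%d")
sample["hour"] = ts.dt.hour
sample["year"] = ts.dt.year
sample["month"] = ts.dt.month
sample["weekday"] = ts.dt.dayofweek + 1  # Monday=1 ... Sunday=7, like isoweekday()

print(sample)
```

Parsing once and reading the components from the parsed column avoids repeating the string handling for every field.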

3. Data Analysis

3.1 Descriptive Analysis

train.describe().T

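Temperature, feels-like temperature, relative humidity, and wind speed are all roughly symmetric, while casual users, registered users, and the total count are all right-skewed.

# Columns 5 to 11 are temp, atemp, humidity, windspeed, casual, registered, count
for i in range(5, 12):
    name = train.columns[i]
    print('{0}: skewness {1}, kurtosis {2}'.format(name, train[name].skew(), train[name].kurt()))

temp, atemp, and humidity are only slightly skewed, windspeed is moderately skewed, and casual, registered, and count are highly skewed.
temp, atemp, and humidity have flat-topped (platykurtic) distributions, while windspeed, casual, registered, and count have peaked (leptokurtic) distributions.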
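The labels "slight", "moderate", and "high" skew can be made explicit with a small helper. The thresholds below (0.5 and 1) are a common rule of thumb and an assumption here; the article does not state which cut-offs it used. Note that pandas' `kurt()` returns excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

def describe_shape(series: pd.Series) -> str:
    """Label a series by skewness level and peakedness (rule-of-thumb thresholds)."""
    skew = series.skew()
    kurt = series.kurt()  # excess kurtosis: about 0 for a normal distribution
    if abs(skew) < 0.5:
        level = "low"
    elif abs(skew) <= 1:
        level = "moderate"
    else:
        level = "high"
    peak = "leptokurtic" if kurt > 0 else "platykurtic"
    return f"{level} skew, {peak}"

symmetric = pd.Series(range(100))               # evenly spread values
long_tail = pd.Series([1] * 90 + [100] * 10)    # a few large values form a right tail

print(describe_shape(symmetric))
print(describe_shape(long_tail))
```

Applied to the training set, the same helper could be called on each of the columns printed in the loop above.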

3.2 Exploratory Analysis

3.2.1 Overall Analysis

sns.pairplot(Bike_data,
             x_vars=['holiday', 'workingday', 'weather', 'season', 'weekday', 'hour', 'windspeed_rfr', 'humidity', 'temp', 'atemp'],
             y_vars=['casual', 'registered', 'count'],
             plot_kws={'alpha': 0.1})

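Roughly speaking:

Members ride more on working days and less on holidays, while casual users show the opposite pattern;

Ridership is lower overall in the first quarter;

Rentals decrease as the weather level rises (that is, as the weather gets worse);

The hour of day has a clear effect: members show two peaks, while non-members show a single bell-shaped peak;

Rentals decrease as wind speed increases;

Temperature and humidity affect non-members strongly and members only slightly.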

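3.2.2 Correlation Analysis

Look at the correlation between each feature and the total hourly rentals (count).

# numeric_only skips string columns such as datetime and date (pandas >= 1.5)
correlation = Bike_data.corr(numeric_only=True)
mask = np.array(correlation)
# Zero out the lower triangle of the mask so that only the lower triangle is drawn
mask[np.tril_indices_from(mask)] = False
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.heatmap(correlation, mask=mask, vmax=.8, square=True, annot=True)
plt.show()

count is strongly positively correlated with registered and casual, with correlation coefficients of 0.97 and 0.7 respectively. Since count = casual + registered, this is expected. count is positively correlated with temp (0.39); generally, people are reluctant to cycle when it is too cold. count is negatively correlated with humidity; very humid weather is unsuitable for cycling, although humidity should be considered together with temperature. windspeed seems to have little effect on rentals (0.1), but extreme wind is probably rare, and wind speed fluctuating within a normal range should have little effect.

Ranking the features by the strength of their influence on rentals gives: hour > temperature > humidity > year > month > season > weather level > wind speed > day of week > working day > holiday.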
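A ranking like the one above can be read off the correlation matrix directly instead of from the heatmap. The sketch below uses synthetic data with made-up column names and effect sizes. It takes the `count` column of the correlation matrix, drops the self-correlation, and sorts by absolute value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic features with deliberately different strengths (assumption: illustrative only)
demo = pd.DataFrame({
    "hour_effect": rng.normal(size=n),
    "temp": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["count"] = 3 * demo["hour_effect"] + 1 * demo["temp"] + rng.normal(size=n)

ranking = (
    demo.corr()["count"]   # correlation of every column with count
    .drop("count")         # remove the trivial self-correlation
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Keep in mind that a Pearson coefficient only captures linear association, so a feature with a strongly non-linear effect, such as hour of day, can rank lower than its real influence.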

3.2.3 Analysis of Influencing Factors

3.2.3.1 Effect of Hour of Day on Rentals

# Average rentals per hour on working days
workingday_df = Bike_data[Bike_data['workingday'] == 1]
workingday_df = workingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

# Average rentals per hour on non-working days
nworkingday_df = Bike_data[Bike_data['workingday'] == 0]
nworkingday_df = nworkingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

fig, axes = plt.subplots(1, 2, sharey=True)
workingday_df.plot(figsize=(15, 5), title='The average number of rentals initiated per hour in the working day', ax=axes[0])
nworkingday_df.plot(figsize=(15, 5), title='The average number of rentals initiated per hour in the nonworkdays', ax=axes[1])

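The comparison shows:

On working days, registered users have two peaks at commuting hours, plus a smaller peak around noon, presumably from people going out for lunch;

Casual users show a much flatter curve, peaking around 17:00;

Registered users rent far more bikes than casual users.

On non-working days, rentals over the day follow a roughly bell-shaped curve, peaking around 14:00 with a trough around 4:00, and are fairly evenly distributed.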

3.2.3.2 Effect of Temperature on Rentals

First look at the temperature trend.

# Hourly data is too noisy to plot directly, so aggregate by day and take the daily median temperature
temp_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'temp': 'median'})
# Drop rows with missing values so that the line plot has no breaks
temp_df.dropna(axis=0, how='any', inplace=True)
# Daily values still fluctuate a lot, so also take the monthly median of the daily values
temp_month = temp_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'temp': 'median'})
# Convert the daily dates to datetime
temp_df['date'] = pd.to_datetime(temp_df['date'])
# Build a date column for the monthly data (the minimum weekday, 1 to 7, stands in for the day of the month)
temp_month.rename(columns={'weekday': 'day'}, inplace=True)
temp_month['date'] = pd.to_datetime(temp_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Line plot of temperature over time
plt.plot(temp_df['date'], temp_df['temp'], linewidth=1.3, label='Daily average')
ax.set_title('Change trend of average temperature per day in two years')
plt.plot(temp_month['date'], temp_month['temp'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

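Temperature follows the same seasonal pattern in both years, peaking in July and bottoming out in January. Next, look at how the average hourly rental count changes with temperature.

# Average rentals at each temperature
temp_rentals = Bike_data.groupby(['temp'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
temp_rentals.plot(title='The average number of rentals initiated per hour changes with the temperature')

Rentals generally rise with temperature, but start to fall once the temperature exceeds 35°C, and reach their lowest point at around 4°C.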

3.2.3.3 Effect of Humidity on Rentals

First look at the humidity trend:

humidity_df = Bike_data.groupby('date', as_index=False).agg({'humidity': 'mean'})
humidity_df['date'] = pd.to_datetime(humidity_df['date'])
# Use the date as a time index
humidity_df = humidity_df.set_index('date')

humidity_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'humidity': 'mean'})
humidity_month.rename(columns={'weekday': 'day'}, inplace=True)
humidity_month['date'] = pd.to_datetime(humidity_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(humidity_df.index, humidity_df['humidity'], linewidth=1.3, label='Daily average')
plt.plot(humidity_month['date'], humidity_month['humidity'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average humidity per day in two years')

Humidity does not vary much; it mostly hovers around 60, with a peak of 80 within this dataset.

humidity_rentals = Bike_data.groupby(['humidity'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
humidity_rentals.plot(title='Average number of rentals initiated per hour in different humidity')

Rentals climb rapidly to a peak at a humidity of around 20 and then decline slowly.

3.2.3.4 Effect of Year and Month on Rentals

Look at how total rentals change over the two years.

# Hourly data is too noisy to plot directly, so aggregate by day
count_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum'})
# The test set has no rental data, which would break the line plot, so drop rows with missing values
count_df.dropna(axis=0, how='any', inplace=True)
# Daily totals still fluctuate a lot, so also take the monthly mean of the daily totals
count_month = count_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
# Convert the daily dates to datetime
count_df['date'] = pd.to_datetime(count_df['date'])
# Build a date column for the monthly data (the minimum weekday, 1 to 7, stands in for the day of the month)
count_month.rename(columns={'weekday': 'day'}, inplace=True)
count_month['date'] = pd.to_datetime(count_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Line plot of total rentals (count) over time
plt.plot(count_df['date'], count_df['count'], linewidth=1.3, label='Daily total')
ax.set_title('Change trend of average number of rentals initiated per day in two years')
plt.plot(count_month['date'], count_month['count'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

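We can see that:

Rentals grew overall in 2012 compared with 2011;

Rentals fluctuate noticeably by month;

The data fluctuates sharply from September to December 2011 and from March to September 2012;

There are many local troughs.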

3.2.3.5 Effect of Season on Ridership

The plot by year and month shows many local troughs, so we aggregate rentals by season, taking the mean of the daily totals, and also look at how temperature changes by season.

day_df = Bike_data.groupby('date').agg({'year': 'mean', 'season': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'temp': 'mean', 'atemp': 'mean'})
season_df = day_df.groupby(['year', 'season'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
season_df.plot(figsize=(18, 6), title='The trend of average number of rentals initiated per day changes with season')

temp_df = day_df.groupby(['year', 'season'], as_index=True).agg({'temp': 'mean', 'atemp': 'mean'})
temp_df.plot(figsize=(18, 6), title='The trend of average temperature per day changes with season')

Both casual and registered users peak in autumn, while spring has the lowest usage.

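3.2.3.6 Effect of Weather on Ridership

Different weather levels occur with different frequency; for example, very bad weather (4) is rare. So first check the number of records for each weather level, then take the hourly average rentals by weather level.

count_weather = Bike_data.groupby('weather')
count_weather[['casual', 'registered', 'count']].count()

weather_df = Bike_data.groupby('weather', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weather_df.plot.bar(stacked=True, title='Average number of rentals initiated per hour in different weather')

Something looks unreasonable here: rentals at weather level 4 are not low, and registered rentals are even higher than the level 2 average, even though level 4 should have the fewest. Print the weather level 4 rows to find the reason: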

Bike_data[Bike_data['weather']==4]

Inspection shows that this data was recorded during the commuting rush hour, so it is an outlier and not representative.
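One way to avoid being misled by a group mean like this is to report the group size next to it. The sketch below uses a tiny made-up table (the numbers are illustrative, not taken from the dataset). It computes the mean and the number of records per weather level and keeps only levels with at least two records; that threshold is an arbitrary assumption.

```python
import pandas as pd

# Made-up hourly records (assumption: illustrative values only)
demo = pd.DataFrame({
    "weather": [1, 1, 1, 2, 2, 2, 3, 3, 4],
    "count":   [200, 180, 220, 150, 170, 160, 80, 90, 164],
})

# Mean rentals and number of records for each weather level
summary = demo.groupby("weather")["count"].agg(["mean", "size"])

# Keep only levels backed by at least 2 records (hypothetical threshold)
reliable = summary[summary["size"] >= 2]

print(summary)
print(reliable)
```

With the size column visible, a mean that rests on a single rush-hour record is easy to spot before it is interpreted.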

3.2.3.7 Effect of Wind Speed on Ridership

The wind speed trend over the two years:

windspeed_df = Bike_data.groupby('date', as_index=False).agg({'windspeed_rfr': 'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])
# Use the date as a time index
windspeed_df = windspeed_df.set_index('date')

windspeed_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'windspeed_rfr': 'mean'})
windspeed_month.rename(columns={'weekday': 'day'}, inplace=True)
windspeed_month['date'] = pd.to_datetime(windspeed_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(windspeed_df.index, windspeed_df['windspeed_rfr'], linewidth=1.3, label='Daily average')
plt.plot(windspeed_month['date'], windspeed_month['windspeed_rfr'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average windspeed per day in two years')

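Wind speed fluctuated strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Because very high wind speeds are rare, averages would be distorted, so we take the maximum rental count at each wind speed.

windspeed_rentals = Bike_data.groupby(['windspeed'], as_index=True).agg({'casual': 'max', 'registered': 'max', 'count': 'max'})
windspeed_rentals.plot(title='Max number of rentals initiated per hour in different windspeed')

Rentals fall as wind speed rises, dropping noticeably above a wind speed of 30, but there is a rebound at around 40. Print the data to find the reason for the rebound:

df2 = Bike_data[Bike_data['windspeed'] > 40]
df2 = df2[df2['count'] > 400]
df2

This record was also generated during the commuting rush hour, so it too is an outlier and not representative.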

3.2.3.8 Effect of Date on Ridership

Records on the same date share the same working-day flag, day of week, year, and so on, so we sum the rental data by day and take the mean of the other date-related columns.

day_df = Bike_data.groupby(['date'], as_index=False).agg({'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'workingday': 'mean', 'weekday': 'mean', 'holiday': 'mean', 'year': 'mean'})
day_df.head()

# Average daily rentals by user type
number_pei = day_df[['casual', 'registered']].mean()
number_pei

plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.05, radius=1)
plt.title('Casual or registered in the total lease')

Working days
Because the numbers of working days and non-working days differ, we take the daily average of rentals for working and non-working days, and likewise the daily average for each day of the week.

workingday_df = day_df.groupby(['workingday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8, 6))
plt.subplots_adjust(hspace=0.5, wspace=0.2)  # spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)  # align the subplot axes

# Stacked bar chart: average daily rentals on non-working and working days
plt.subplot2grid((2, 2), (1, 0), rowspan=2)
width = 0.3  # bar width
p1 = plt.bar(workingday_df.index, workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index, workingday_df['registered'], width, bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0, 1], ('nonworking day', 'working day'), rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

# Pie chart: user mix on non-working days
plt.subplot2grid((2, 2), (0, 0))
plt.pie(workingday_df_0, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.axis('equal')
plt.title('nonworking day')

# Pie chart: user mix on working days
plt.subplot2grid((2, 2), (0, 1))
plt.pie(workingday_df_1, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.title('working day')
plt.axis('equal')

weekday_df = day_df.groupby(['weekday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weekday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by weekday')

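Comparing the charts:

On working days, registered users ride more and casual users ride less;

On weekends, registered rentals fall and casual rentals rise.

Holidays
Holidays make up a very small share of the year, so first check how many holidays there are in each year:

holiday_coun = day_df.groupby('year', as_index=True).agg({'holiday': 'sum'})
holiday_coun

Holidays account for a very small share of the days in a year, so we take the daily average for holidays and non-holidays.

holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
holiday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by holiday or not')

On holidays, both registered and casual users rent more than on non-holidays, which is in line with expectations.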

3.3 Predictive Analysis

3.3.1 Selecting Features

Based on the observations above, we use 11 features: hour, temperature (temp), humidity, year, month, season, weather level (weather), wind speed (windspeed_rfr), day of week (weekday), working day (workingday), and holiday. Because CART decision trees split each node in two, the multi-category variables are one-hot encoded into several binary columns.

dummies_month = pd.get_dummies(Bike_data['month'], prefix='month')
dummies_season = pd.get_dummies(Bike_data['season'], prefix='season')
dummies_weather = pd.get_dummies(Bike_data['weather'], prefix='weather')
dummies_year = pd.get_dummies(Bike_data['year'], prefix='year')
# Join the 4 new DataFrames to the original table
Bike_data = pd.concat([Bike_data, dummies_month, dummies_season, dummies_weather, dummies_year], axis=1)
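To see what the encoding produces, here is a minimal sketch on a five-row made-up table. It uses the `columns=` argument of `pd.get_dummies`, a slightly different call from the one above, which encodes several columns in one step and removes the original columns at the same time.

```python
import pandas as pd

# Small made-up table with two categorical columns
demo = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "weather": [1, 1, 2, 3, 1],
})

# One binary column per category value; the original columns are replaced
encoded = pd.get_dummies(demo, columns=["season", "weather"], prefix=["season", "weather"])

print(encoded.columns.tolist())
```

With this variant the original categorical columns no longer need to be dropped by hand afterwards.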

3.3.2 Separating the Training and Test Sets

# Rows with a count belong to the training set; rows without one belong to the test set
dataTrain = Bike_data[pd.notnull(Bike_data['count'])]
dataTest = Bike_data[~pd.notnull(Bike_data['count'])].sort_values(by=['datetime'])
datetimecol = dataTest['datetime']
yLabels = dataTrain['count']
yLabels_log = np.log(yLabels)

3.3.3 Dropping Redundant Features

dropFeatures = ['casual', 'count', 'datetime', 'date', 'registered', 'windspeed', 'atemp', 'month', 'season', 'weather', 'year']
dataTrain = dataTrain.drop(dropFeatures, axis=1)
dataTest = dataTest.drop(dropFeatures, axis=1)

3.3.4 Choosing and Training the Model

rfModel = RandomForestRegressor(n_estimators=1000, random_state=42)
rfModel.fit(dataTrain, yLabels_log)
# Predictions on the training set itself (in log space)
preds = rfModel.predict(X=dataTrain)
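Predictions on the training set say little about how the model generalizes. A held-out split gives a more honest estimate. The sketch below is one possible check, not part of the original workflow. It runs on synthetic data and scores the hold-out predictions with RMSLE, which, as far as I know, is the metric this Kaggle competition uses. The data, the split ratio, and the `rmsle` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and a positive, count-like target (assumption: illustrative only)
X = rng.uniform(0, 1, size=(600, 3))
y = np.round(100 + 200 * X[:, 0] + 20 * rng.normal(size=600))

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Train on log(count), as in the article, then map predictions back with exp
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_tr, np.log(y_tr))
pred_val = np.exp(model.predict(X_val))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical helper)."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

score = rmsle(y_val, pred_val)
baseline = rmsle(y_val, np.full_like(y_val, y_tr.mean()))  # always predict the training mean
print(f"hold-out RMSLE: {score:.3f} (mean baseline: {baseline:.3f})")
```

Comparing against a trivial baseline, such as always predicting the mean, makes the score easier to interpret.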

3.3.5 Predicting on the Test Set

predsTest = rfModel.predict(X=dataTest)
# Convert the log-space predictions back to counts with exp before saving
submission = pd.DataFrame({'datetime': datetimecol, 'count': [max(0, x) for x in np.exp(predsTest)]})
submission.to_csv(r'D:\A\Data\ufo\bike_predictions.csv', index=False)