US Traffic Accident Analysis (2017) (Project Practice 5)
Table of Contents
- 1. Project overview
- 2. Data processing (for analysis only; the preprocessing for modeling comes later)
- 3. Data visualization
- 4. Predicting accident severity with XGBoost
  - 4.1 Preprocessing before modeling
1. Project overview
Project goal: practice data analysis.
Data source: Kaggle.
Source code, dataset, and field descriptions (Baidu Cloud link):
Link: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Goals of this analysis:
- Basic exploration and visualization: which states have the most accidents, when accidents tend to happen, and what the weather was like at the time, giving an overall picture of US traffic accidents in 2017.
- Use XGBoost to predict accident severity and see which factors are most related to it.
2. Data processing (for analysis only; the preprocessing for modeling comes later)
The original dataset (US_Accidents_Dec19.csv) has 49 columns and roughly 3 million rows covering accidents from 2016 to 2019. Given hardware and time constraints, only the 2017 accidents are analyzed here (see the source file for details).

```python
# Extract the 2017 records
import pandas as pd

data = pd.read_csv('./US_Accidents_Dec19.csv')
datacopy = data.copy()
datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
datacopy['year'] = datacopy['Start_Time'].apply(lambda x: x.year)
data1 = datacopy[datacopy['year'] == 2017]
data1.to_csv('./USaccident2017.csv')
```

The analysis below is done on USaccident2017.csv.
Import the packages used in the analysis
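The import cell itself isn't visible in this excerpt; a minimal sketch of the packages the rest of the post relies on (the exact list and the timestamp parsing are assumptions):

```python
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Read the 2017 subset produced above and parse the timestamps
# (later cells call .month / .hour on Start_Time, so it must be datetime)
data = pd.read_csv('./USaccident2017.csv')
data['Start_Time'] = pd.to_datetime(data['Start_Time'])
data.head()
```

First rows of the 2017 data: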
| | ID | Source | TMC | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Description | Number | Street | Side | City | County | State | Zipcode | Country | Timezone | Airport_Code | Weather_Timestamp | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Direction | Wind_Speed(mph) | Precipitation(in) | Weather_Condition | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9206 | A-9207 | MapQuest | 201.0 | 3 | 2017-01-01 00:17:36 | 2017-01-01 00:47:12 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Accident on I-80 Westbound at Exit 15 Cutting ... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
| 9207 | A-9208 | MapQuest | 201.0 | 3 | 2017-01-01 00:26:08 | 2017-01-01 01:16:06 | 37.878185 | -122.307175 | NaN | NaN | 0.01 | Accident on I-580 Southbound at Exit 12 I-80 I... | NaN | I-580 W | R | Berkeley | Alameda | CA | 94710 | US | US/Pacific | KOAK | 2017-01-01 00:53:00 | 51.1 | NaN | 83.0 | 29.97 | 10.0 | West | 11.5 | NaN | Overcast | False | False | True | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9208 | A-9209 | MapQuest | 201.0 | 2 | 2017-01-01 00:53:41 | 2017-01-01 01:22:35 | 38.014820 | -121.640579 | NaN | NaN | 0.00 | Accident on Taylor Rd Southbound at Bethel Isl... | 2998.0 | Taylor Ln | R | Oakley | Contra Costa | CA | 94561 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9209 | A-9210 | MapQuest | 241.0 | 3 | 2017-01-01 01:18:51 | 2017-01-01 01:48:01 | 37.912056 | -122.323982 | NaN | NaN | 0.01 | Lane blocked and queueing traffic due to accid... | NaN | Bayview Ave | R | Richmond | Contra Costa | CA | 94804 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9210 | A-9211 | MapQuest | 222.0 | 3 | 2017-01-01 01:20:12 | 2017-01-01 01:49:47 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Queueing traffic due to accident on I-80 Westb... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
Field descriptions (in Chinese):
https://www.jianshu.com/p/9e597dc8ae71
```python
# Check missing values
data.isnull().sum()[data.isnull().sum() != 0]

# Handle missing values
# Columns with no impact or not analyzed: drop them
deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code',
              'Weather_Timestamp', 'Wind_Chill(F)', 'Civil_Twilight',
              'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
data1 = data.drop(deletelist, axis=1)

# Drop rows with missing values in these columns
data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])

# Fill temperature, humidity, pressure and visibility with the column mean
data1['Temperature(F)'] = data1['Temperature(F)'].fillna(data1['Temperature(F)'].mean())
data1['Humidity(%)'] = data1['Humidity(%)'].fillna(data1['Humidity(%)'].mean())
data1['Pressure(in)'] = data1['Pressure(in)'].fillna(data1['Pressure(in)'].mean())
data1['Visibility(mi)'] = data1['Visibility(mi)'].fillna(data1['Visibility(mi)'].mean())

# Fill wind speed by nearest-value interpolation
data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')

# Fill weather condition and wind direction with the mode
# (.mode() returns a Series, so take its first value)
data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])

# Missing precipitation is treated as 0
data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)

# Wind direction: merge labels that mean the same thing
occupation = {"CALM": "Calm", "N": "North", "S": "South",
              "W": "West", "E": "East", "VAR": "Variable"}
f = lambda x: occupation.get(x, x)  # look up the mapping, keep the value if not found
data1['Wind_Direction'] = data1['Wind_Direction'].map(f)

# Finally reset the index, since some rows were dropped
data1.index = range(len(data1))
```

3. Data visualization
```python
# Which states have the most accidents
a = (
    Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
    .add_xaxis(data1['State'].value_counts().index.tolist())
    .add_yaxis('Accidents per state', data1['State'].value_counts().tolist(), color='#499C9F')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
a.render_notebook()
```
The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York) and NC (North Carolina), all of which are relatively developed and densely populated.
By hour of day there are clear spikes at the morning and evening rush hours, when traffic is heaviest.
Noticeably more accidents occur in the second half of the year than in the first, perhaps because the later months have more holidays and heavier workloads; a sketch of how these two charts can be rebuilt follows.
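The hourly and monthly charts themselves are not included in this excerpt; a minimal sketch of how the counts could be rebuilt with pandas and pyecharts (chart titles and variable names are my own):

```python
# Accidents by hour of day and by month of 2017
hours = pd.to_datetime(data1['Start_Time']).dt.hour.value_counts().sort_index()
months = pd.to_datetime(data1['Start_Time']).dt.month.value_counts().sort_index()

# Run each chart in its own notebook cell so both render
hour_bar = (
    Bar()
    .add_xaxis([str(h) for h in hours.index])
    .add_yaxis('Accidents per hour', hours.tolist())
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
hour_bar.render_notebook()

month_bar = (
    Bar()
    .add_xaxis([str(m) for m in months.index])
    .add_yaxis('Accidents per month', months.tolist())
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
month_bar.render_notebook()
```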
Most accidents happened in clear weather, but the next three or four most common conditions are overcast or cloudy, so weather does contribute to some extent.
Most accidents happened in daytime, away from junctions, with no precipitation, and with no traffic signal, give-way sign, or speed bump nearby.
Visibility was good for the majority of accidents, between 6.5 and 12 miles (1 mile is roughly 1.6 km).
The map shows that most accidents occur in the developed coastal regions; a sketch of the location plot follows.
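The map itself isn't reproduced here; a minimal sketch of the underlying plot using a plain matplotlib scatter of the accident coordinates (the original post may have used a different mapping tool):

```python
import matplotlib.pyplot as plt

# Scatter of accident locations; dense clusters appear along the coasts
plt.figure(figsize=(10, 6))
plt.scatter(data1['Start_Lng'], data1['Start_Lat'], s=0.5, alpha=0.2)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('US accident locations, 2017')
plt.show()
```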
4. Predicting accident severity with XGBoost
4.1 Preprocessing before modeling
```python
dataX = data1.copy()

# Extract month and hour from the start time
dataX['month'] = dataX['Start_Time'].apply(lambda x: x.month)
dataX['hour'] = dataX['Start_Time'].apply(lambda x: x.hour)

# Drop features that are not useful for modeling
deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street',
               'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction']
dataX = dataX.drop(deletelist2, axis=1)

# Convert False to 0 and True to 1
list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
         'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
m = lambda x: 1 if x else 0
for i in list3:
    dataX[i] = dataX[i].apply(m)

# Sunrise_Sunset also has to be numeric before scaling; judging from the preview below
# it was mapped Day -> 1, Night -> 0 (this line is inferred, not shown in the original post)
dataX['Sunrise_Sunset'] = dataX['Sunrise_Sunset'].map({'Day': 1, 'Night': 0})

# Severity 1 is extremely rare and badly unbalanced, so drop those rows
dataX = dataX.drop(index=dataX.loc[dataX['Severity'] == 1].index)
dataX.index = range(len(dataX))
dataX.head()
```

| Severity | Start_Lat | Start_Lng | Distance(mi) | Temperature(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) | Weather_Condition | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | month | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 37.925392 | -122.320595 | 0.01 | 44.1 | 79.0 | 29.91 | 10.0 | 5.8 | 0.0 | Partly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 37.878185 | -122.307175 | 0.01 | 51.1 | 83.0 | 29.97 | 10.0 | 11.5 | 0.0 | Overcast | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 38.014820 | -121.640579 | 0.00 | 44.1 | 79.0 | 29.91 | 10.0 | 5.8 | 0.0 | Partly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 37.912056 | -122.323982 | 0.01 | 44.1 | 82.0 | 29.95 | 9.0 | 3.5 | 0.0 | Mostly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 37.925392 | -122.320595 | 0.01 | 44.1 | 82.0 | 29.95 | 9.0 | 3.5 | 0.0 | Mostly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
After one-hot encoding, the Weather_Condition column expands into a (716669, 78) matrix.
That many sparse dummy columns makes overfitting more likely, so PCA is used to compress them into 5 components (Xw1); a sketch of this step follows, and the first rows of the result are shown below.
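The encoding and PCA code isn't shown in this excerpt; a minimal sketch, assuming `pd.get_dummies` for the one-hot step and scikit-learn's `PCA` (the 5-component count is inferred from the 5-column output below and the final `(716669, 30)` shape; the `X` / `y` / `Xw1` names are the ones used later in the post):

```python
from sklearn.decomposition import PCA

# Target and features (Weather_Condition is handled separately)
y = dataX['Severity']
X = dataX.drop(['Severity', 'Weather_Condition'], axis=1)

# One-hot encode Weather_Condition: shape (716669, 78)
weather_dummies = pd.get_dummies(dataX['Weather_Condition'])

# Compress the 78 sparse dummy columns into 5 principal components
pca = PCA(n_components=5)
Xw1 = pd.DataFrame(pca.fit_transform(weather_dummies))
Xw1.head()
```

First rows of Xw1: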
| 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| -0.365071 | -0.105292 | 0.643741 | -0.661940 | -0.141977 |
| -0.544094 | 0.715639 | -0.290545 | -0.018498 | -0.060453 |
| -0.365071 | -0.105292 | 0.643741 | -0.661940 | -0.141977 |
| -0.482130 | -0.685853 | -0.468909 | -0.024822 | -0.071162 |
| -0.482130 | -0.685853 | -0.468909 | -0.024822 | -0.071162 |
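The scaling step for X is not included in this excerpt. The values in the next table (for example, hour 0 maps to -1.5 and the binary flags to 0 or 1) look like a median/IQR transform, so here is a minimal sketch assuming scikit-learn's `RobustScaler`; the scaler actually used may differ:

```python
from sklearn.preprocessing import RobustScaler

# Scale the 25 numeric features in X (robust median/IQR scaling is an assumption)
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X.head()
```

First rows of the scaled X: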
| Start_Lat | Start_Lng | Distance(mi) | Temperature(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | month | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.287847 | -0.995895 | 0.5 | -0.849372 | 0.371429 | -0.454545 | 0.0 | -0.189655 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.281328 | -0.995506 | 0.5 | -0.556485 | 0.485714 | -0.181818 | 0.0 | 0.793103 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.300197 | -0.976155 | 0.0 | -0.849372 | 0.371429 | -0.454545 | 0.0 | -0.189655 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.286006 | -0.995993 | 0.5 | -0.849372 | 0.457143 | -0.272727 | -1.0 | -0.586207 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.375 |
| 0.287847 | -0.995895 | 0.5 | -0.849372 | 0.457143 | -0.272727 | -1.0 | -0.586207 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.166667 | -1.375 |
Concatenate X and Xw1
```python
X1 = pd.concat([X, Xw1], axis=1)
X1.shape  # (716669, 30)

# xgboost expects class labels in the range 0 .. num_class-1,
# so map severity 2/3/4 to 0/1/2
def f(x):
    if x == 2:
        return 0
    elif x == 3:
        return 1
    else:
        return 2

y1 = y.apply(f)
y1.value_counts()
```

```
0    461657
1    230899
2     24113
Name: Severity, dtype: int64
```
Modeling with XGBoost
```python
param1 = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, required with multi:softmax
    'gamma': 0.1,                  # controls pruning; larger is more conservative (typically 0.1 or 0.2)
    'max_depth': 12,               # tree depth; deeper trees overfit more easily
    'lambda': 2,                   # L2 regularization on weights; larger means less overfitting
    'subsample': 0.7,              # row subsampling for each tree
    'colsample_bytree': 0.7,       # column subsampling for each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 = no training log output (0 is more informative)
    'eta': 0.007,                  # learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)

bst1 = xgb.train(param1, xg_train)
pred1 = bst1.predict(xg_test)
print(accuracy_score(y_test, pred1))  # 0.7609220422230595 (accuracy)
```
Check feature importance
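The importance plot itself isn't reproduced in this excerpt; a minimal sketch using xgboost's built-in helpers (how the original post displayed it is not shown):

```python
import matplotlib.pyplot as plt

# Importance scores (by default, the number of times a feature is used to split)
print(bst1.get_score(importance_type='weight'))

# Or as a bar chart of the top features
xgb.plot_importance(bst1, max_num_features=20)
plt.show()
```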
Hyperparameter tuning (work in progress)
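This part of the post is still being written. As a hedged starting point (not the author's procedure), xgboost's built-in cross-validation can help pick the number of boosting rounds for `param1` before tuning the other parameters:

```python
# 3-fold CV on the training DMatrix to choose the number of boosting rounds
cv_res = xgb.cv(
    param1,
    xg_train,
    num_boost_round=500,
    nfold=3,
    metrics='merror',            # multi-class classification error
    early_stopping_rounds=20,
    seed=1000,
)
print(cv_res.tail())
print('suggested num_boost_round:', len(cv_res))
```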
Summary