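Introduction to Machine Learning: the KNN Algorithm
The KNN algorithm needs no training phase, but prediction is time-consuming because every query is compared against all of the training data. It is suitable for multi-class classification.
1. Compute the Euclidean distance, over the chosen features, between the single test sample and every training sample.
2. Sort the distances in ascending order and take the first n (the n nearest neighbors).
3. Take the mean of those n neighbors' y values (or, for classification, their most common class) as the prediction for the test sample.
4. Compute the RMSE from the actual and predicted values of the test data to judge how well the method works.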
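The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name knn_predict, its parameters, and the toy numbers are assumptions made up for this example and are not taken from the listings data.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, classify=False):
    """Predict for one query point from its k nearest training points (hypothetical helper)."""
    # Step 1: Euclidean distance from the query to every training row
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Step 2: ascending sort, keep the indices of the k smallest distances
    nearest = np.argsort(dists, kind="stable")[:k]
    neighbours = train_y[nearest]
    # Step 3: majority class for classification, mean for regression
    if classify:
        return Counter(neighbours.tolist()).most_common(1)[0][0]
    return float(neighbours.mean())

# Toy data, made up for illustration: one feature, numeric target
train_X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
train_y = np.array([100.0, 110.0, 120.0, 300.0, 310.0])
test_X = np.array([[2.0], [10.0]])
test_y = np.array([105.0, 290.0])

preds = np.array([knn_predict(train_X, train_y, q, k=3) for q in test_X])
# Step 4: RMSE between the predictions and the actual values
rmse = float(np.sqrt(((preds - test_y) ** 2).mean()))
print(preds, rmse)
```

Passing classify=True with an array of class labels as train_y switches step 3 from the mean to a majority vote.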
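The demos below use Airbnb listing features and prices:
Demo 1: start with the accommodates attribute and a single new data point; the distance used is the absolute difference.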
import pandas as pd
import numpy as np

df_listings = pd.read_csv('listings.csv')
# Select a subset of the columns
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'price',
            'minimum_nights', 'maximum_nights', 'number_of_reviews']
# Keep only those columns (copy so that adding a column below is safe)
df_listings = df_listings[features].copy()
# Work with accommodates only for now
new_accomodates = 3  # a new listing that accommodates 3 people
df_listings['distance'] = np.abs(df_listings['accommodates'] - new_accomodates)
# value_counts() counts the rows per distance; sort_index() orders them by distance
print(df_listings.distance.value_counts().sort_index())
# Shuffle the rows so that ties in distance are broken randomly
df_listings = df_listings.sample(frac=1, random_state=0)
# Sort by distance, nearest first
df_listings = df_listings.sort_values('distance')
print(df_listings.price.head())
# Prices are strings such as "$150", so strip "$" and "," and convert to float
df_listings['price'] = df_listings['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Take the 5 nearest listings and average their prices
price_mean_5 = df_listings['price'].iloc[:5].mean()
print(price_mean_5)

Demo 2: split the listings into a training set and a test set, and test with a single feature
df_listings = df_listings.drop('distance', axis=1)
# Demo 1 left the rows sorted by distance, so reshuffle before splitting
df_listings = df_listings.sample(frac=1, random_state=0)
# Split the data
train_df = df_listings[:2792].copy()
test_df = df_listings[2792:].copy()

# Prediction function
def predict_price(test_content, feature_name):
    temp_df = train_df.copy()  # copy so that train_df itself is not modified
    temp_df['distance'] = np.abs(test_content - temp_df[feature_name])
    # Sort by distance, nearest first
    temp_df = temp_df.sort_values('distance')
    price_mean_5 = temp_df.price.iloc[:5].mean()
    return price_mean_5

cols = ['accommodates']
# .apply passes each value to the function; feature_name is forwarded as a keyword argument
test_df['predict_price'] = test_df[cols[0]].apply(predict_price, feature_name='accommodates')
print(test_df['predict_price'])
# Compute the RMSE
mse = ((test_df['predict_price'] - test_df['price']) ** 2).mean()
rmse = mse ** (1 / 2)
print(rmse)

# Compare the other features, one at a time
for feature in ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']:
    test_df['predict_price'] = test_df[feature].apply(predict_price, feature_name=feature)
    print(test_df['predict_price'])
    # Compute the RMSE
    mse = ((test_df['predict_price'] - test_df['price']) ** 2).mean()
    rmse = mse ** (1 / 2)
    print('where{}:{}'.format(feature, rmse))

Demo 3: building on the above, add data standardization (z-score), which means subtracting the mean and then dividing by the standard deviation. Multiple features are also introduced.
Packages used:
from sklearn.metrics import mean_squared_error, for computing the mean squared error
from scipy.spatial import distance, for computing the Euclidean distance
from sklearn.preprocessing import StandardScaler, for standardization
import pandas as pd
from scipy.spatial import distance
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

df_listings = pd.read_csv('listings.csv')
# Select a subset of the columns
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'price',
            'minimum_nights', 'maximum_nights', 'number_of_reviews']
df_listings = df_listings[features].copy()
# Clean the price column: strip "$" and "," and convert to float
df_listings['price'] = df_listings['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Drop rows with missing values
df_listings = df_listings.dropna()
# Standardize every selected column; this includes price,
# so the RMSE below is in standardized units rather than dollars
df_listings[features] = StandardScaler().fit_transform(df_listings[features])
# Split the data
train_df = df_listings[:2792].copy()
test_df = df_listings[2792:].copy()

# Multi-feature prediction function using Euclidean distance
def predict_price(new_content, feature_name):
    temp_df = train_df.copy()
    # cdist returns an (n, 1) array; ravel() flattens it to one distance per row
    temp_df['distance'] = distance.cdist(temp_df[feature_name], [new_content[feature_name]]).ravel()
    temp_df = temp_df.sort_values('distance')
    price_mean_5 = temp_df.price.iloc[:5].mean()
    return price_mean_5

# Use two of the features
cols = ['accommodates', 'bathrooms']
test_df['predict_price'] = test_df.apply(predict_price, feature_name=cols, axis=1)
mse = mean_squared_error(test_df['price'], test_df['predict_price'])
rmse = mse ** (1 / 2)
print(rmse)
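The two building blocks of Demo 3, standardization and Euclidean distance, can be checked on a tiny array. The matrix X below is made up for illustration and is not part of the listings data; the sketch only shows that subtracting the mean and dividing by the standard deviation matches what StandardScaler produces, and what shape cdist returns.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# z-score by hand: subtract the column mean, divide by the column std.
# StandardScaler uses the population std (ddof=0), which is also NumPy's default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))

# cdist returns a 2-D array of shape (n_rows, n_queries);
# ravel() flattens the single-query case into one distance per row.
query = scaled[0]
dists = distance.cdist(scaled, [query]).ravel()
print(dists)
```

After scaling, both columns contribute equally to the distance, which is the reason for standardizing before combining several features.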
Demo 4: use the KNN implementation that ships with sklearn
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

df_listings = pd.read_csv('listings.csv')
# Select a subset of the columns
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'price',
            'minimum_nights', 'maximum_nights', 'number_of_reviews']
df_listings = df_listings[features].copy()
# Clean the price column: strip "$" and "," and convert to float
df_listings['price'] = df_listings['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Drop rows with missing values
df_listings = df_listings.dropna()
# Standardize, then split the data
df_listings[features] = StandardScaler().fit_transform(df_listings[features])
train_df = df_listings[:2792].copy()
test_df = df_listings[2792:].copy()
print(test_df.head())

cols = ['accommodates', 'bathrooms']
# Instantiate a KNN regressor; n_neighbors sets the value of k
knn = KNeighborsRegressor(n_neighbors=10)
# Fit the model
knn.fit(train_df[cols], train_df['price'])
# Predict on the test set
test_df['predict_price'] = knn.predict(test_df[cols])
# Compute the MSE and RMSE
mse = mean_squared_error(test_df['price'], test_df['predict_price'])
rmse = mse ** (1 / 2)
print(rmse)

# Use all of the features for comparison
cols = ['accommodates', 'bedrooms', 'bathrooms', 'beds',
        'minimum_nights', 'maximum_nights', 'number_of_reviews']
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(train_df[cols], train_df['price'])
test_df['predict_price'] = knn.predict(test_df[cols])
mse = mean_squared_error(test_df['price'], test_df['predict_price'])
rmse = mse ** (1 / 2)
print(rmse)
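Since n_neighbors controls k, one way to see its effect is to loop over a few values and compare the RMSE. The sketch below uses synthetic data generated with NumPy as a stand-in, because the real listings.csv is not assumed to be available here; the candidate values of k and the data-generating formula are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Fit one regressor per candidate k and record its RMSE on the held-out rows
results = {}
for k in (1, 3, 5, 10, 20):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    mse = mean_squared_error(y_test, knn.predict(X_test))
    results[k] = mse ** 0.5

for k, rmse in results.items():
    print(f"k={k}: rmse={rmse:.3f}")
```

Which k comes out best depends on the data, and a more careful setup would pick k on a separate validation split rather than on the test set.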
Reposted from: https://www.cnblogs.com/my-love-is-python/p/10255019.html