Kaggle Tabular Playground Series - Jan 2022 Study Notes 2 (Linear Regression with Time Series)
Previously we did a simple exploratory analysis of TPSJAN22 (see: Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)). Now we try to train a model with time-series features and linear regression.
Competition: Tabular Playground Series - Jan 2022
Reference: TPSJAN22-03 Linear Model
```python
import pandas as pd
import numpy as np
import pickle
import math
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import dateutil.easter as easter
from matplotlib.ticker import MaxNLocator
from datetime import datetime, date, timedelta
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LinearRegression, HuberRegressor, Ridge, Lasso

original_train_df = pd.read_csv('../datas/train.csv')
original_test_df = pd.read_csv('../datas/test.csv')
gdp_df = pd.read_csv('../datas/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv')
gdp_df.set_index('year', inplace=True)

for df in [original_train_df, original_test_df]:
    df['date'] = pd.to_datetime(df.date)

original_train_df.head(2)
```
TPSJAN22 is scored with SMAPE, which Scikit-learn does not provide, so we first define the loss function ourselves.
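A minimal definition, consistent with the competition formula and with how `smape_loss` is called later in this post (element-wise values that get averaged with `np.mean`):

```python
def smape_loss(y_true, y_pred):
    """Element-wise SMAPE in percent: 200 * |pred - true| / (|true| + |pred|)."""
    return 200 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))
```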
Next, we engineer the features according to the hypotheses from our earlier data analysis.
```python
def engineer(df):
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    # Add GDP information plus weekly seasonal indicators
    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encode product, country and store
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    # Fourier features: two sin/cos pairs per product for the yearly seasonality
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        new_df[f'sin{k}'] = np.sin(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'cos{k}'] = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Hat']
    return new_df
```

Let's look at the engineered features:
```python
train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)

test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

train_df
list(train_df)
```

```
['gdp', 'wd2', 'wd3', 'wd4', 'wd5', 'wd6', 'wd7', 'Finland', 'Norway', 'KaggleRama',
 'Kaggle Mug', 'Kaggle Hat', 'sin1', 'cos1', 'mug_sin1', 'mug_cos1', 'hat_sin1', 'hat_cos1',
 'sin2', 'cos2', 'mug_sin2', 'mug_cos2', 'hat_sin2', 'hat_cos2', 'date', 'num_sold']
```
Now we can train the model.
```python
def fit_model(X_tr):
    # Preprocess the data
    X_tr_f = X_tr[features]
    preproc = StandardScaler()
    X_tr_f = preproc.fit_transform(X_tr_f)
    y_tr = X_tr.num_sold.values.reshape(-1, 1)

    model = LinearRegression()
    # Scikit-learn has no SMAPE loss, so we take the log of the target and fit with
    # the default least-squares loss, which approximates optimizing SMAPE.
    # For details see:
    # https://www.kaggle.com/code/ambrosm/tpsjan22-03-linear-model/notebook#Training-the-simple-model-(without-holidays)
    # and this discussion:
    # https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298473
    model.fit(X_tr_f, np.log(y_tr).ravel())
    return preproc, model

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
# The model predicts log(num_sold), so exponentiate to recover the actual predictions
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))
train_pred_df
```
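Why does the log trick work? A first-order argument, along the lines of the referenced notebook: write the prediction as $p = a e^{\delta}$ for target $a$; then

$$\mathrm{SMAPE} = \frac{200\,|p - a|}{p + a} = \frac{200\,|e^{\delta} - 1|}{e^{\delta} + 1} \approx 100\,|\delta| = 100\,|\log p - \log a|,$$

so for predictions in the right ballpark, minimizing an error on $\log(\text{num\_sold})$ approximately minimizes SMAPE.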
Next, let's take Finland and look at the residuals between predictions and targets for each product in each store.
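Here is one possible implementation of the two plotting helpers used throughout this post, `plot_all_residuals` and `plot_around`; this is a sketch matching their call signatures, not necessarily the original code. `residuals` is assumed to be a Series of SMAPE-style errors indexed by date:

```python
def plot_all_residuals(residuals):
    """Scatter the daily residuals over the whole training period."""
    plt.figure(figsize=(20, 4))
    plt.scatter(residuals.index, residuals, s=3)
    plt.axhline(0, color='k', linewidth=0.5)
    plt.ylabel('SMAPE contribution')
    plt.show()

def plot_around(residuals, month, day, n_days):
    """Overlay, for every year, the residuals in a window of n_days starting at (month, day)."""
    plt.figure(figsize=(12, 4))
    for year in range(2015, 2020):
        start = pd.Timestamp(year, month, day)
        window = residuals.loc[start:start + pd.Timedelta(days=n_days)]
        plt.plot(np.arange(len(window)), window.values, label=str(year))
    plt.axhline(0, color='k', linewidth=0.5)
    plt.legend()
    plt.ylabel('SMAPE contribution')
    plt.show()
```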
The residual distribution turns out to be similar for every store and product, so we can simply look at the daily residuals across all Finnish products.
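With the helpers above, these residuals can be computed the same way the notebook does later, grouping by date and expressing the error in SMAPE units:

```python
# Daily residuals for Finland, in SMAPE units
by_date = train_pred_df[train_pred_df.country == 'Finland'].groupby(train_pred_df['date'])
residuals = (by_date.pred.mean() - by_date.num_sold.mean()) / (by_date.pred.mean() + by_date.num_sold.mean()) * 200
plot_all_residuals(residuals)
```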
There are large errors every January and December, which matches what we suspected during the earlier feature analysis. We probably need to extract the dates at the turn of the year as features. Zooming in on the December and January errors:
Between December 24 and January 4 the residuals swing wildly, so we mark the dates from December 24 to January 4 of each year as features and retrain the model.
```python
# Feature engineering
def engineer(df):
    """Return a new dataframe with the engineered features"""
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encoding (no need to encode the last categories)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    # Seasonal variations (Fourier series)
    # The three products have different seasonal patterns
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        temp_sin = np.sin(dayofyear / 365 * 2 * math.pi * k)
        temp_cos = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = temp_sin * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = temp_cos * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = temp_sin * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = temp_cos * new_df['Kaggle Hat']

    # End-of-year indicators (Dec 24 - Jan 4)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(24, 32)}),
        pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(1, 5)})
    ], axis=1)
    return new_df

train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)
test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))

by_date = train_pred_df[train_pred_df.country == 'Finland'].groupby(train_pred_df['date'])
residuals = (by_date.pred.mean() - by_date.num_sold.mean()) / (by_date.pred.mean() + by_date.num_sold.mean()) * 200

plot_all_residuals(residuals)
plot_around(residuals, 1, 1, 30)
```
Between January 4 and 15 there still seem to be noticeable errors, and early December shows errors as well, so we first extend the January feature range to the 15th.
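The only change needed is the range of the `jan` dummies (the complete revised `engineer()` appears further below):

```python
# In engineer(): widen the January window from range(1, 5) to range(1, 16)
pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
              for d in range(1, 16)})
```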
The errors at the end of December and the beginning of January are now handled much better. But the scatter plot above still shows clear errors in April, May, June, July, November and December of each year. Let's zoom in on those intervals.
April shows a similar fluctuation every year, but in some years it arrives earlier and in others later; the end of June and the beginning of November behave the same way. Starting from May 1 there is a fluctuation lasting roughly ten days, and early December shows a similar pattern. Recalling what we saw at the turn of the year, we can infer that these error spikes are probably related to public holidays. The official holiday list for the three countries, shared by a Kaggle user, offers some clues:
Holidays_Finland_Norway_Sweden_2015-2019
January has New Year's Day on a fixed date; April has Easter, whose date moves from year to year; May has International Workers' Day on a fixed date; and so on. Comparing the holiday information with the residuals above, we revise our feature engineering:
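Note that Easter's moving date does not need to be hard-coded: `dateutil` computes it directly, and the revised code below relies on exactly this:

```python
# Easter Sunday moves from year to year; dateutil computes it directly
for year in range(2015, 2020):
    print(year, easter.easter(year))
# 2015 2015-04-05
# 2016 2016-03-27
# 2017 2017-04-16
# 2018 2018-04-01
# 2019 2019-04-21
```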
```python
# Feature engineering
def engineer(df):
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encoding (no need to encode the last categories)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        temp_sin = np.sin(dayofyear / 365 * 2 * math.pi * k)
        temp_cos = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = temp_sin * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = temp_cos * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = temp_sin * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = temp_cos * new_df['Kaggle Hat']

    # Christmas and New Year
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(24, 32)}),
        pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(1, 16)}),
        pd.DataFrame({f"n-dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(24, 32)}),
        pd.DataFrame({f"n-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(1, 10)}),
        pd.DataFrame({f"s-dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in range(24, 32)}),
        pd.DataFrame({f"s-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in range(1, 16)})
    ], axis=1)

    # May (Labour Day; extra days for Norway)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d)
                      for d in list(range(1, 11))}),  # + list(range(17, 25))
        pd.DataFrame({f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in list(range(16, 28))})
    ], axis=1)

    # Early June (Sweden)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"june{d}": (df.date.dt.month == 6) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in list(range(7, 15))}),
    ], axis=1)

    # Midsummer Day
    midsummer_day_date = df.date.dt.year.map({
        2015: pd.Timestamp('2015-06-20'),
        2016: pd.Timestamp('2016-06-25'),
        2017: pd.Timestamp('2017-06-24'),
        2018: pd.Timestamp('2018-06-23'),
        2019: pd.Timestamp('2019-06-22')})
    new_df = pd.concat([new_df,
        pd.DataFrame({f"midsummer_day{d}": (df.date - midsummer_day_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                      for d in list(range(-2, 11))})
    ], axis=1)

    # First Sunday of November
    sun_nov_date = df.date.dt.year.map({
        2015: pd.Timestamp('2015-11-1'),
        2016: pd.Timestamp('2016-11-6'),
        2017: pd.Timestamp('2017-11-5'),
        2018: pd.Timestamp('2018-11-4'),
        2019: pd.Timestamp('2019-11-3')})
    new_df = pd.concat([new_df,
        pd.DataFrame({f"sun_nov{d}": (df.date - sun_nov_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                      for d in list(range(-1, 10))})
    ], axis=1)

    # First half of December (Independence Day of Finland, 6th of December)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in list(range(6, 14))})
    ], axis=1)

    # Easter
    easter_date = df.date.apply(lambda date: pd.Timestamp(easter.easter(date.year)))
    new_df = pd.concat([new_df,
        pd.DataFrame({f"easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Finland')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})
    ], axis=1)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"n-easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Norway')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})
    ], axis=1)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"s-easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Sweden')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})
    ], axis=1)
    return new_df

train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)
test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))

by_date = train_pred_df[train_pred_df.country == 'Norway'].groupby(train_pred_df['date'])
residuals = (by_date.pred.sum() - by_date.num_sold.sum()) / (by_date.pred.sum() + by_date.num_sold.sum()) * 200

plot_all_residuals(residuals)
plot_around(residuals, 1, 1, 15)
plot_around(residuals, 4, 1, 15)
plot_around(residuals, 5, 1, 30)
plot_around(residuals, 6, 1, 30)
plot_around(residuals, 7, 1, 30)
plot_around(residuals, 11, 1, 15)
plot_around(residuals, 12, 1, 30)
```

With the yearly holidays added as features, the residuals improve considerably. Next, let's look at the validation score.
```python
def fit_model(X_tr, X_va=None, outliers=False):
    """Scale the data, fit a model, plot the training history and validate the model"""
    start_time = datetime.now()

    # Preprocess the data
    X_tr_f = X_tr[features]
    preproc = StandardScaler()
    X_tr_f = preproc.fit_transform(X_tr_f)
    y_tr = X_tr.num_sold.values.reshape(-1, 1)

    model = LinearRegression()
    model.fit(X_tr_f, np.log(y_tr).ravel())

    if X_va is not None:
        # Preprocess the validation data
        X_va_f = X_va[features]
        X_va_f = preproc.transform(X_va_f)
        y_va = X_va.num_sold.values.reshape(-1, 1)

        # Inference for validation
        y_va_pred = np.exp(model.predict(X_va_f)).reshape(-1, 1)
        oof.update(pd.Series(y_va_pred.ravel(), index=X_va.index))

        # Evaluation: Execution time and SMAPE
        smape_before_correction = np.mean(smape_loss(y_va, y_va_pred))
        #y_va_pred *= LOSS_CORRECTION
        smape = np.mean(smape_loss(y_va, y_va_pred))
        print(f"Fold {run}.{fold} | {str(datetime.now() - start_time)[-12:-7]}"
              f" | SMAPE: {smape:.5f} (before correction: {smape_before_correction:.5f})")
        score_list.append(smape)

        # Plot y_true vs. y_pred
        if fold == 0:
            plt.figure(figsize=(10, 10))
            plt.scatter(y_va, y_va_pred, s=1, color='r')
            #plt.scatter(np.log(y_va), np.log(y_va_pred), s=1, color='g')
            plt.plot([plt.xlim()[0], plt.xlim()[1]], [plt.xlim()[0], plt.xlim()[1]], '--', color='k')
            plt.gca().set_aspect('equal')
            plt.xlabel('y_true')
            plt.ylabel('y_pred')
            plt.title('OOF Predictions')
            plt.show()

    return preproc, model

#%%time
RUNS = 1  # should be 1; increase the number of runs only if you want to see how the result depends on the random seed
OUTLIERS = True
TRAIN_VAL_CUT = datetime(2018, 1, 1)
LOSS_CORRECTION = 1

# Make the results reproducible
np.random.seed(202100)

total_start_time = datetime.now()
oof = pd.Series(0.0, index=train_df.index)
score_list = []
for run in range(RUNS):
    # Group by year so each fold validates on whole held-out years
    kf = GroupKFold(n_splits=4)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train_df, groups=train_df.date.dt.year)):
        X_tr = train_df.iloc[train_idx]
        X_va = train_df.iloc[val_idx]
        print(f"Fold {run}.{fold}")
        preproc, model = fit_model(X_tr, X_va)

print(f"Average SMAPE: {sum(score_list) / len(score_list):.5f}")
with open('oof.pickle', 'wb') as handle:
    pickle.dump(oof, handle)

# Plot all num_sold_true and num_sold_pred (five years) for one country-store-product combination
def plot_five_years_combination(engineer, country='Norway', store='KaggleMart', product='Kaggle Hat'):
    demo_df = pd.DataFrame({'row_id': 0,
                            'date': pd.date_range('2015-01-01', '2019-12-31', freq='D'),
                            'country': country,
                            'store': store,
                            'product': product})
    demo_df.set_index('date', inplace=True, drop=False)
    demo_df = engineer(demo_df)
    demo_df['num_sold'] = np.exp(model.predict(preproc.transform(demo_df[features])))
    plt.figure(figsize=(20, 6))
    plt.plot(np.arange(len(demo_df)), demo_df.num_sold, label='prediction')
    train_subset = train_df[(original_train_df.country == country) & (original_train_df.store == store) & (original_train_df['product'] == product)]
    # plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
    plt.plot(np.arange(len(train_subset)), train_subset.num_sold, label='true', color='red')
    plt.legend()
    plt.title('Predictions and true num_sold for five years')
    plt.show()

plot_five_years_combination(engineer)
```

Now we generate the submission data.
```python
# Fit the model on the complete training data
train_idx = np.arange(len(train_df))
X_tr = train_df.iloc[train_idx]
preproc, model = fit_model(X_tr, None)

plot_five_years_combination(engineer)  # Quick check for debugging

# Inference for test
test_pred_list = []
test_pred_list.append(np.exp(model.predict(preproc.transform(test_df[features]))) * LOSS_CORRECTION)

# Create the submission file
sub = original_test_df[['row_id']].copy()
sub['num_sold'] = sum(test_pred_list) / len(test_pred_list)
sub.to_csv('submission_linear_model.csv', index=False)

# Plot the distribution of the test predictions
plt.figure(figsize=(16, 3))
plt.hist(train_df['num_sold'], bins=np.linspace(0, 3000, 201), density=True, label='Training')
plt.hist(sub['num_sold'], bins=np.linspace(0, 3000, 201), density=True, rwidth=0.5, label='Test predictions')
plt.xlabel('num_sold')
plt.ylabel('Frequency')
plt.legend()
plt.show()

sub
```
因為銷售數(shù)量是整數(shù),我們對結(jié)果進行四舍五入一下,提交看看哪個分數(shù)要高些。
```python
# Create a rounded submission file
sub_rounded = sub.copy()
sub_rounded['num_sold'] = sub_rounded['num_sold'].round()
sub_rounded.to_csv('submission_linear_model_rounded.csv', index=False)
sub_rounded
```

Score without rounding:
四舍五入的分數(shù)
Related links:
Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)
總結(jié)