Machine Learning in Practice: Predicting Restaurant Traffic with LightGBM
Restaurant visitor data
CSV data source: https://pan.baidu.com/s/1mLZBNv1SszQEnRoBGOYX7w (password: mmrf)
```python
import pandas as pd

air_visit = pd.read_csv('air_visit_data.csv')
air_visit.head()

# Use the visit date as a datetime index so we can resample.
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
```

Aggregate per store by day:
```python
air_visit = air_visit.groupby('air_store_id').apply(
    lambda g: g['visitors'].resample('1d').sum()
).reset_index()
air_visit.head()
```

Fill missing values with 0:
```python
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
# Remember which rows were missing before filling them.
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
```

Calendar data
```python
date_info = pd.read_csv('date_info.csv')
date_info.head()
```

Use shift() to slide the holiday flag by one day, so each row can see whether the previous day and the next day are holidays.
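As a quick illustration of what shift() does here, a minimal sketch on a toy flag series (the values are made up, not the competition data): shifting by one pushes each value down a row, so row *t* sees row *t-1*'s flag, and the first row becomes NaN, which is why the real code fills it with 0.

```python
import pandas as pd

# Hypothetical holiday flags for three consecutive days.
flags = pd.Series([1, 0, 1], name='is_holiday')

# shift() moves values down one row: row t now holds row t-1's flag.
prev_day = flags.shift().fillna(0)
# shift(-1) moves values up one row: row t now holds row t+1's flag.
next_day = flags.shift(-1).fillna(0)

print(prev_day.tolist())  # [0.0, 1.0, 0.0]
print(next_day.tolist())  # [0.0, 1.0, 0.0]
```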
```python
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
```

Store location data
```python
air_store_info = pd.read_csv('air_store_info.csv')
air_store_info.head()
```

Test set
```python
import numpy as np

submission = pd.read_csv('sample_submission.csv')
# The id column packs the store id and the visit date together.
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
```

Merge all the data sources
```python
data = pd.concat((air_visit, submission.drop('id', axis='columns')))
data.head()

data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
```

Load the weather data
```python
import glob
import os

weather_dfs = []
for path in glob.glob('./1-1-16_5-31-17_Weather/*.csv'):
    weather_df = pd.read_csv(path)
    # Derive the station id from the file name. (os.path handles both / and \
    # separators, and avoids the rstrip('.csv') pitfall, which strips any
    # trailing '.', 'c', 's' or 'v' characters rather than the extension.)
    weather_df['station_id'] = os.path.splitext(os.path.basename(path))[0]
    weather_dfs.append(weather_df)

weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
```

Compute the per-day average temperature and precipitation across all stations, and use these global averages to fill missing station readings:
```python
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature',
                      'precipitation': 'global_precipitation'}, inplace=True)

weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
```

Index and sort the combined data
```python
data['visit_date'] = pd.to_datetime(data['visit_date'])
data.index = data['visit_date']
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
```

The data contains some outliers. Starting from the assumption that visitor counts are roughly normally distributed, about 95% of values lie within 1.96 standard deviations of the mean, which is where the 1.96 threshold comes from. Outliers are capped rather than dropped: any flagged value is replaced with the largest normal value.
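A minimal sketch of this capping rule on made-up numbers (the threshold logic mirrors the functions defined next; the series itself is hypothetical):

```python
import pandas as pd

# Hypothetical daily visitor counts with one obvious spike at the end.
s = pd.Series([10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0, 10.0, 200.0])

# Points more than 1.96 standard deviations above the mean are outliers.
outliers = (s - s.mean()) > 1.96 * s.std()

# Cap outliers at the largest non-outlier value.
capped = s.copy()
capped[outliers] = s[~outliers].max()

print(outliers.tolist())     # only the last point is flagged
print(capped.tolist()[-1])   # 12.0
```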
```python
def find_outliers(series):
    # Flag points more than 1.96 standard deviations above the mean.
    return (series - series.mean()) > 1.96 * series.std()

def cap_values(series):
    # Replace outliers with the largest non-outlier value.
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series

stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
```

Date features
```python
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
data.head()
```

The exponentially weighted moving average (EWMA) captures the trend of a time series. It requires an alpha (smoothing) parameter; rather than picking one by hand, we optimize for the best value.
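To make alpha concrete before optimizing it: with adjust=False, pandas computes the EWMA by the recursion y_t = alpha * x_t + (1 - alpha) * y_{t-1}, with y_0 = x_0. A hand-check on a toy series (the values are made up):

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0])
alpha = 0.5

# pandas' result with adjust=False.
ewma = x.ewm(alpha=alpha, adjust=False).mean()

# The same thing by hand: y_t = alpha * x_t + (1 - alpha) * y_{t-1}.
y = [x.iloc[0]]
for v in x.iloc[1:]:
    y.append(alpha * v + (1 - alpha) * y[-1])

print(ewma.tolist())  # [10.0, 15.0, 22.5]
```

A larger alpha weights recent observations more heavily; the optimization below searches for the alpha that best predicts each series from its own past.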
```python
from scipy import optimize

def calc_shifted_ewm(series, alpha, adjust=True):
    # Shift by one so each row's EWMA only uses strictly earlier values.
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()

def find_best_signal(series, adjust=False, eps=10e-5):
    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        corr = np.mean(np.power(series - shifted_ewm, 2))
        return corr

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])
    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

data.head()
```

Extract as much time-series information as possible
```python
def extract_precedent_statistics(df, on, group_by):
    df.sort_values(group_by + ['visit_date'], inplace=True)
    groups = df.groupby(group_by, sort=False)

    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }

    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})

    for _, group in groups:
        # Shift so each row's statistics come only from earlier rows.
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)

        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())

        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())

    suffix = '_&_'.join(group_by)

    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values

extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id', 'day_of_week'])
extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id', 'is_weekend'])
extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id', 'day_of_week'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id', 'is_weekend'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id'])

data.sort_values(['air_store_id', 'visit_date']).head()
```
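The key detail above is the shift() before the rolling window: each row's statistics are computed from strictly earlier rows only, so the features carry no target leakage. A minimal sketch of that pattern on a toy series (values are hypothetical):

```python
import pandas as pd

s = pd.Series([4.0, 2.0, 6.0, 8.0])

# Shift first so row t only sees rows 0..t-1, then take the mean over a
# window as long as the series (which behaves like an expanding mean).
leak_free_mean = s.shift().rolling(window=len(s), min_periods=1).mean()

print(leak_free_mean.tolist())  # [nan, 4.0, 3.0, 4.0]
```

The first row has no history, so it stays NaN; row 3's feature is mean(4, 2, 6) = 4.0 and never sees its own value of 8.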
```python
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data.head()
```

Dataset split
```python
data['visitors_log1p'] = np.log1p(data['visitors'])

train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')

to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors',
           'air_area_name', 'latitude', 'longitude', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']

X_train.head()
y_train.head()
```

Run a few sanity checks:
```python
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019
```

Modeling with LightGBM
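One note before the model code: because the model is trained on log1p(visitors) and evaluated with l2, the square root of the validation l2 is exactly the RMSLE of the back-transformed counts, which is why the training loop below can report RMSLE directly. A quick check on made-up numbers:

```python
import numpy as np

# Hypothetical log1p-space targets and model outputs.
y_true_log = np.log1p(np.array([10.0, 20.0, 30.0]))
y_pred_log = np.array([2.4, 3.0, 3.5])

# RMSE in log1p space (what sqrt(l2) reports during training)...
rmse_log = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# ...equals RMSLE of the back-transformed predictions against the raw
# counts, since log1p(expm1(x)) == x.
visitors_pred = np.expm1(y_pred_log)
visitors_true = np.expm1(y_true_log)
rmsle = np.sqrt(np.mean((np.log1p(visitors_pred) - np.log1p(visitors_true)) ** 2))

print(np.isclose(rmse_log, rmsle))  # True
```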
```python
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection

np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)

val_scores = [0] * n_splits

sub = submission['id'].to_frame()
sub['visitors'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]

    # Note: early_stopping_rounds/verbose as fit() keywords require
    # lightgbm < 4; newer versions pass these via callbacks instead.
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )

    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_

    print('Fold {} RMSLE: {:.5f}'.format(i + 1, val_scores[i]))

# Average the fold predictions and transform back from log space.
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
```

Write out the results
```python
sub.to_csv('result.csv', index=False)
```
```python
import pandas as pd

df = pd.read_csv('result.csv')
df.head()
```
Summary
That covers the full walkthrough of predicting restaurant traffic with LightGBM; hopefully it helps you solve the problems you run into.