Risk-Control Feature Engineering: Study Notes
Overall business-modeling workflow:
1. Abstract the business problem as classification or regression
2. Define the label to obtain y
3. Select suitable samples and join in all available information as the source of features
4. Feature engineering + model training + model evaluation and tuning (these steps may interleave)
5. Produce the model report
6. Deploy and monitor
What is a feature?
In a machine-learning context, a feature is an individual property, or a set of properties, used to explain the phenomenon being modeled. Once such properties are converted into some measurable form, they are called features.
For example, suppose you have a list of students containing each student's name, hours studied, IQ, and total score on previous exams. A new student arrives whose hours studied and IQ you know, but whose exam score is missing, and you need to estimate the score he or she is likely to obtain.
Here you would build a model that estimates the missing score from IQ and study_hours, so IQ and study_hours become the features of that model.
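A minimal sketch of that example with made-up numbers (the student data below is hypothetical, not from the notes):

import pandas as pd
from sklearn.linear_model import LinearRegression

students = pd.DataFrame({
    'name':        ['A', 'B', 'C', 'new'],
    'study_hours': [10, 8, 12, 9],
    'IQ':          [110, 100, 120, 105],
    'score':       [85.0, 70.0, 92.0, None],   # the new student's score is missing
})
known = students.dropna(subset=['score'])
# IQ and study_hours are the features; score is the target
reg = LinearRegression().fit(known[['study_hours', 'IQ']], known['score'])
missing = students['score'].isna()
students.loc[missing, 'score'] = reg.predict(students.loc[missing, ['study_hours', 'IQ']])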
What feature engineering may include:
1. Basic feature construction
2. Data preprocessing
3. Feature derivation
4. Feature transformation
5. Feature selection
This is one complete feature-engineering pipeline, but not the only one; the steps may well be reordered.
I. Basic Feature Construction
""" 預覽數據 """import pandas as pd import numpy as npdf_train = pd.read_csv('train.csv') df_train.head(3) """查看數據基本情況"""df_train.shape df_train.info() df_train.describe()?
"""可以畫3D圖對數據進行可視化,例子下面所示"""from pyecharts import Bar3Dbar3d = Bar3D("2018年申請人數分布", width=1200, height=600) x_axis = ["12a", "1a", "2a", "3a", "4a", "5a", "6a", "7a", "8a", "9a", "10a", "11a","12p", "1p", "2p", "3p", "4p", "5p", "6p", "7p", "8p", "9p", "10p", "11p" ] y_axis = ["Saturday", "Friday", "Thursday", "Wednesday", "Tuesday", "Monday", "Sunday" ] data = [[0, 0, 5], [0, 1, 1], [0, 2, 0], [0, 3, 0], [0, 4, 0], [0, 5, 0],[0, 6, 0], [0, 7, 0], [0, 8, 0], [0, 9, 0], [0, 10, 0], [0, 11, 2],[0, 12, 4], [0, 13, 1], [0, 14, 1], [0, 15, 3], [0, 16, 4], [0, 17, 6],[0, 18, 4], [0, 19, 4], [0, 20, 3], [0, 21, 3], [0, 22, 2], [0, 23, 5],[1, 0, 7], [1, 1, 0], [1, 2, 0], [1, 3, 0], [1, 4, 0], [1, 5, 0],[1, 6, 0], [1, 7, 0], [1, 8, 0], [1, 9, 0], [1, 10, 5], [1, 11, 2],[1, 12, 2], [1, 13, 6], [1, 14, 9], [1, 15, 11], [1, 16, 6], [1, 17, 7],[1, 18, 8], [1, 19, 12], [1, 20, 5], [1, 21, 5], [1, 22, 7], [1, 23, 2],[2, 0, 1], [2, 1, 1], [2, 2, 0], [2, 3, 0], [2, 4, 0], [2, 5, 0],[2, 6, 0], [2, 7, 0], [2, 8, 0], [2, 9, 0], [2, 10, 3], [2, 11, 2],[2, 12, 1], [2, 13, 9], [2, 14, 8], [2, 15, 10], [2, 16, 6], [2, 17, 5],[2, 18, 5], [2, 19, 5], [2, 20, 7], [2, 21, 4], [2, 22, 2], [2, 23, 4],[3, 0, 7], [3, 1, 3], [3, 2, 0], [3, 3, 0], [3, 4, 0], [3, 5, 0],[3, 6, 0], [3, 7, 0], [3, 8, 1], [3, 9, 0], [3, 10, 5], [3, 11, 4],[3, 12, 7], [3, 13, 14], [3, 14, 13], [3, 15, 12], [3, 16, 9], [3, 17, 5],[3, 18, 5], [3, 19, 10], [3, 20, 6], [3, 21, 4], [3, 22, 4], [3, 23, 1],[4, 0, 1], [4, 1, 3], [4, 2, 0], [4, 3, 0], [4, 4, 0], [4, 5, 1],[4, 6, 0], [4, 7, 0], [4, 8, 0], [4, 9, 2], [4, 10, 4], [4, 11, 4],[4, 12, 2], [4, 13, 4], [4, 14, 4], [4, 15, 14], [4, 16, 12], [4, 17, 1],[4, 18, 8], [4, 19, 5], [4, 20, 3], [4, 21, 7], [4, 22, 3], [4, 23, 0],[5, 0, 2], [5, 1, 1], [5, 2, 0], [5, 3, 3], [5, 4, 0], [5, 5, 0],[5, 6, 0], [5, 7, 0], [5, 8, 2], [5, 9, 0], [5, 10, 4], [5, 11, 1],[5, 12, 5], [5, 13, 10], [5, 14, 5], [5, 15, 7], [5, 16, 11], [5, 17, 6],[5, 18, 0], [5, 19, 5], [5, 20, 3], [5, 21, 4], [5, 22, 2], [5, 23, 0],[6, 0, 1], [6, 1, 0], [6, 2, 0], [6, 3, 0], [6, 4, 0], [6, 5, 0],[6, 6, 0], [6, 7, 0], [6, 8, 0], [6, 9, 0], [6, 10, 1], [6, 11, 0],[6, 12, 2], [6, 13, 1], [6, 14, 3], [6, 15, 4], [6, 16, 0], [6, 17, 0],[6, 18, 0], [6, 19, 0], [6, 20, 1], [6, 21, 2], [6, 22, 2], [6, 23, 6] ] range_color = ['#313695', '#4575b4', '#74add1', '#abd9e9', '#e0f3f8', '#ffffbf','#fee090', '#fdae61', '#f46d43', '#d73027', '#a50026'] bar3d.add("",x_axis,y_axis,[[d[1], d[0], d[2]] for d in data],is_visualmap=True,visual_range=[0, 20],visual_range_color=range_color,grid3d_width=200,grid3d_depth=80,is_grid3d_rotate=True, # 自動旋轉grid3d_rotate_speed=180, # 旋轉速度 ) bar3d二、數據預處理
Missing values — two main tools: 1. pandas fillna; 2. sklearn's imputer (Imputer in sklearn < 0.20, SimpleImputer since).
"""均值填充"""df_train['Age'].fillna(value=df_train['Age'].mean()).sample(5)""" 另一種均值填充的方式 """from sklearn.preprocessing import Imputer imp = Imputer(missing_values='NaN', strategy='mean', axis=0) age = imp.fit_transform(df_train[['Age']].values).copy() df_train.loc[:,'Age'] = df_train['Age'].fillna(value=df_train['Age'].mean()).copy() df_train.head(5)數值型 - 數值縮放"""取對數等變換"""import numpy as np log_age = df_train['Age'].apply(lambda x:np.log(x)) df_train.loc[:,'log_age'] = log_agedf_train.head(5)""" 幅度縮放,最大最小值縮放到[0,1]區間內 """from sklearn.preprocessing import MinMaxScaler mm_scaler = MinMaxScaler() fare_trans = mm_scaler.fit_transform(df_train[['Fare']])""" 幅度縮放,將每一列的數據標準化為正態分布 """from sklearn.preprocessing import StandardScaler std_scaler = StandardScaler() fare_std_trans = std_scaler.fit_transform(df_train[['Fare']])""" 中位數或者四分位數去中心化數據,對異常值不敏感 """from sklearn.preprocessing import robust_scale fare_robust_trans = robust_scale(df_train[['Fare','Age']])""" 將同一行數據規范化,前面的同一變為1以內也可以達到這樣的效果 """from sklearn.preprocessing import Normalizer normalizer = Normalizer() fare_normal_trans = normalizer.fit_transform(df_train[['Age','Fare']]) fare_normal_trans統計值""" 最大最小值 """max_age = df_train['Age'].max() min_age = df_train["Age"].min()""" 分位數,極值處理,我們最粗暴的方法就是將前后1%的值替換成前后兩個端點的值 """age_quarter_01 = df_train['Age'].quantile(0.01) print(age_quarter_01) age_quarter_99 = df_train['Age'].quantile(0.99) print(age_quarter_99)""" 四則運算 """df_train.loc[:,'family_size'] = df_train['SibSp']+df_train['Parch']+1 df_train.head(2)df_train.loc[:,'tmp'] = df_train['Age']*df_train['Pclass'] + 4*df_train['family_size'] df_train.head(2)""" 多項式特征 """from sklearn.preprocessing import PolynomialFeaturespoly = PolynomialFeatures(degree=2) df_train[['SibSp','Parch']].head()poly_fea = poly.fit_transform(df_train[['SibSp','Parch']]) pd.DataFrame(poly_fea,columns = poly.get_feature_names()).head()""" 等距切分 """df_train.loc[:, 'fare_cut'] = pd.cut(df_train['Fare'], 20) df_train.head(2)""" 等頻切分 """df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10) df_train.head(2)""" badrate 曲線 """df_train = df_train.sort_values('Fare')alist = list(set(df_train['fare_qcut'])) badrate = {} for x in alist:a = df_train[df_train.fare_qcut == x]bad = a[a.label == 1]['label'].count()good = a[a.label == 0]['label'].count()badrate[x] = bad/(bad+good)f = zip(badrate.keys(),badrate.values()) f = sorted(f,key = lambda x : x[1],reverse = True ) badrate = pd.DataFrame(f) badrate.columns = pd.Series(['cut','badrate']) badrate = badrate.sort_values('cut') print(badrate.head()) badrate.plot('cut','badrate')""" 一般采取等頻分箱,很少等距分箱,等距分箱可能造成樣本非常不均勻 """
""" 一般分5-6箱,保證badrate曲線從非嚴格遞增轉化為嚴格遞增曲線 """
""" OneHot encoding/獨熱向量編碼 """""" 一般像男、女這種二分類categories類型的數據采取獨熱向量編碼, 轉化為0、1 主要用到 pd.get_dummies """embarked_oht = pd.get_dummies(df_train[['Embarked']]) embarked_oht.head(2)fare_qcut_oht = pd.get_dummies(df_train[['fare_qcut']]) fare_qcut_oht.head(2)時間型 日期處理car_sales = pd.read_csv('car_data.csv') car_sales.head(2)car_sales.loc[:,'date'] = pd.to_datetime(car_sales['date_t']) car_sales.head(2) """ 取出關鍵時間信息 """""" 月份 """car_sales.loc[:,'month'] = car_sales['date'].dt.month car_sales.head()""" 幾號 """car_sales.loc[:,'dom'] = car_sales['date'].dt.day""" 一年當中第幾天 """car_sales.loc[:,'doy'] = car_sales['date'].dt.dayofyear""" 星期幾 """car_sales.loc[:,'dow'] = car_sales['date'].dt.dayofweekcar_sales.head(2)文本型數據
from pyecharts import WordCloud

name = ['bupt', '金融', '濤濤', '實戰', '人長得帥', '機器學習', '深度學習',
        '異常檢測', '知識圖譜', '社交網絡', '圖算法', '遷移學習', '不均衡學習',
        '瞪噔', '數據挖掘', '哈哈', '集成算法', '模型融合', 'python', '聰明']
value = [10000, 6181, 4386, 4055, 2467, 2244, 1898, 1484, 1112,
         965, 847, 582, 555, 550, 462, 366, 360, 282, 273, 265]
wordcloud = WordCloud(width=800, height=600)
wordcloud.add("", name, value, word_size_range=[30, 80])

""" Bag-of-words model """
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
    'PS teacher Mei is very very handsome'
]
X = vectorizer.fit_transform(corpus)
X.toarray()

""" N-gram counts (unigrams up to trigrams) """
vec = CountVectorizer(ngram_range=(1, 3))
X_ngram = vec.fit_transform(corpus)
X_ngram.toarray()

""" TF-IDF """
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_X = tfidf_vec.fit_transform(corpus)
tfidf_vec.get_feature_names_out()   # get_feature_names() before sklearn 1.0
tfidf_X.toarray()

Combination features
""" 根據條件去判斷獲取組合特征 """df_train.loc[:,'alone'] = (df_train['SibSp']==0)&(df_train['Parch']==0) df_train.head(3) """ 詞云圖可以直觀的反應哪些詞作用權重比較大 """from sklearn.feature_extraction.text import CountVectorizer vectorizer = CountVectorizer()corpus = ['This is a very good class','students are very very very good','This is the third sentence','Is this the last doc','teacher Mei is very very handsome' ]X = vectorizer.fit_transform(corpus)L = []for item in list(X.toarray()):L.append(list(item))value = [0 for i in range(len(L[0]))]for i in range(len(L[0])):for j in range(len(L)):value[i] += L[j][i]from pyecharts import WordCloudwordcloud = WordCloud(width=800,height=500) #這里是需要做的 wordcloud.add('',vectorizer.get_feature_names(),value,word_size_range=[20,100]) wordcloud三、特征衍生
data = pd.read_excel('textdata.xlsx')
data.head()

""" ft and gt are two variable names; the suffixes 1-12 index the variable's value in each of the last 12 months """
""" 基于時間序列進行特征衍生 """
""" 最近p個月,inv>0的月份數 inv表示傳入的變量名 """def Num(data,inv,p):df=data.loc[:,inv+'1':inv+str(p)]auto_value=np.where(df>0,1,0).sum(axis=1)return data,inv+'_num'+str(p),auto_valuedata_new = data.copy()for p in range(1,12):for inv in ['ft','gt']:data_new,columns_name,values=Num(data_new,inv,p)data_new[columns_name]=values # -*- coding:utf-8 -*-'''@Author : wangtao@Time : 19/9/3 下午6:28@desc : 構建時間序列衍生特征'''import numpy as np import pandas as pdclass time_series_feature(object):def __init__(self):passdef Num(self,data,inv,p):""":param data::param inv::param p::return: 最近p個月,inv大于0的月份個數"""df = data.loc[:,inv+'1':inv+str(p)]auto_value = np.where(df > 0,1,0).sum(axis=1)return inv+'_num'+str(p),auto_valuedef Nmz(self,data,inv,p):""":param data::param inv::param p::return: 最近p個月,inv=0的月份個數"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.where(df == 0, 1, 0).sum(axis=1)return inv + '_nmz' + str(p), auto_valuedef Evr(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv>0的月份數是否>=1"""df = data.loc[:, inv + '1':inv + str(p)]arr = np.where(df > 0, 1, 0).sum(axis=1)auto_value = np.where(arr, 1, 0)return inv + '_evr' + str(p), auto_valuedef Avg(self,data,inv, p):""":param p::return: 最近p個月,inv均值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanmean(df, axis=1)return inv + '_avg' + str(p), auto_valuedef Tot(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv和"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nansum(df, axis=1)return inv + '_tot' + str(p), auto_valuedef Tot2T(self,data,inv, p):""":param data::param inv::param p::return: 最近(2,p+1)個月,inv和 可以看出該變量的波動情況"""df = data.loc[:, inv + '2':inv + str(p + 1)]auto_value = df.sum(1)return inv + '_tot2t' + str(p), auto_valuedef Max(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv最大值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanmax(df, axis=1)return inv + '_max' + str(p), auto_valuedef Min(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv最小值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanmin(df, axis=1)return inv + '_min' + str(p), auto_valuedef Msg(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,最近一次inv>0到現在的月份數"""df = data.loc[:, inv + '1':inv + str(p)]df_value = np.where(df > 0, 1, 0)auto_value = []for i in range(len(df_value)):row_value = df_value[i, :]if row_value.max() <= 0:indexs = '0'auto_value.append(indexs)else:indexs = 1for j in row_value:if j > 0:breakindexs += 1auto_value.append(indexs)return inv + '_msg' + str(p), auto_valuedef Msz(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,最近一次inv=0到現在的月份數"""df = data.loc[:, inv + '1':inv + str(p)]df_value = np.where(df == 0, 1, 0)auto_value = []for i in range(len(df_value)):row_value = df_value[i, :]if row_value.max() <= 0:indexs = '0'auto_value.append(indexs)else:indexs = 1for j in row_value:if j > 0:breakindexs += 1auto_value.append(indexs)return inv + '_msz' + str(p), auto_valuedef Cav(self,data,inv, p):""":param p::return: 當月inv/(最近p個月inv的均值)"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = df[inv + '1'] / np.nanmean(df, axis=1)return inv + '_cav' + str(p), auto_valuedef Cmn(self,data,inv, p):""":param data::param inv::param p::return: 當月inv/(最近p個月inv的最小值)"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = df[inv + '1'] / np.nanmin(df, axis=1)return inv + '_cmn' + str(p), auto_valuedef Mai(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,每兩個月間的inv的增長量的最大值"""arr = np.array(data.loc[:, inv + '1':inv + 
str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]value_lst = []for k in range(len(df_value) - 1):minus = df_value[k] - df_value[k + 1]value_lst.append(minus)auto_value.append(np.nanmax(value_lst))return inv + '_mai' + str(p), auto_valuedef Mad(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,每兩個月間的inv的減少量的最大值"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]value_lst = []for k in range(len(df_value) - 1):minus = df_value[k + 1] - df_value[k]value_lst.append(minus)auto_value.append(np.nanmax(value_lst))return inv + '_mad' + str(p), auto_valuedef Std(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv的標準差"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanvar(df, axis=1)return inv + '_std' + str(p), auto_valuedef Cva(self,data,inv, p):""":param p::return: 最近p個月,inv的變異系數"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanmean(df, axis=1) / np.nanvar(df, axis=1)return inv + '_cva' + str(p), auto_valuedef Cmm(self,data,inv, p):""":param data::param inv::param p::return: (當月inv) - (最近p個月inv的均值)"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = df[inv + '1'] - np.nanmean(df, axis=1)return inv + '_cmm' + str(p), auto_valuedef Cnm(self,data,inv, p):""":param p::return: (當月inv) - (最近p個月inv的最小值)"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = df[inv + '1'] - np.nanmin(df, axis=1)return inv + '_cnm' + str(p), auto_valuedef Cxm(self,data,inv, p):""":param data::param inv::param p::return: (當月inv) - (最近p個月inv的最大值)"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = df[inv + '1'] - np.nanmax(df, axis=1)return inv + '_cxm' + str(p), auto_valuedef Cxp(self,data,inv, p):""":param p::return: ( (當月inv) - (最近p個月inv的最大值) ) / (最近p個月inv的最大值) )"""df = data.loc[:, inv + '1':inv + str(p)]temp = np.nanmin(df, axis=1)auto_value = (df[inv + '1'] - temp) / tempreturn inv + '_cxp' + str(p), auto_valuedef Ran(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月,inv的極差"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = np.nanmax(df, axis=1) - np.nanmin(df, axis=1)return inv + '_ran' + str(p), auto_valuedef Nci(self,data,inv, p):""":param data::param inv::param p::return: 最近min( Time on book,p )個月中,后一個月相比于前一個月增長了的月份數"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]value_lst = []for k in range(len(df_value) - 1):minus = df_value[k] - df_value[k + 1]value_lst.append(minus)value_ng = np.where(np.array(value_lst) > 0, 1, 0).sum()auto_value.append(np.nanmax(value_ng))return inv + '_nci' + str(p), auto_valuedef Ncd(self,data,inv, p):""":param data::param inv::param p::return: 最近min( Time on book,p )個月中,后一個月相比于前一個月減少了的月份數"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]value_lst = []for k in range(len(df_value) - 1):minus = df_value[k] - df_value[k + 1]value_lst.append(minus)value_ng = np.where(np.array(value_lst) < 0, 1, 0).sum()auto_value.append(np.nanmax(value_ng))return inv + '_ncd' + str(p), auto_valuedef Ncn(self,data,inv, p):""":param data::param inv::param p::return: 最近min( Time on book,p )個月中,相鄰月份inv 相等的月份數"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]value_lst = []for k in range(len(df_value) - 1):minus = df_value[k] - df_value[k + 1]value_lst.append(minus)value_ng = np.where(np.array(value_lst) == 0, 1, 
0).sum()auto_value.append(np.nanmax(value_ng))return inv + '_ncn' + str(p), auto_valuedef Bup(self,data,inv, p):""":param p::return:desc:If 最近min( Time on book,p )個月中,對任意月份i ,都有 inv[i] > inv[i+1] 即嚴格遞增,且inv > 0則flag = 1 Else flag = 0"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]index = 0for k in range(len(df_value) - 1):if df_value[k] > df_value[k + 1]:breakindex = + 1if index == p:value = 1else:value = 0auto_value.append(value)return inv + '_bup' + str(p), auto_valuedef Pdn(self,data,inv, p):""":param data::param inv::param p::return:desc: If 最近min( Time on book,p )個月中,對任意月份i ,都有 inv[i] < inv[i+1] ,即嚴格遞減,且inv > 0則flag = 1 Else flag = 0"""arr = np.array(data.loc[:, inv + '1':inv + str(p)])auto_value = []for i in range(len(arr)):df_value = arr[i, :]index = 0for k in range(len(df_value) - 1):if df_value[k + 1] > df_value[k]:breakindex = + 1if index == p:value = 1else:value = 0auto_value.append(value)return inv + '_pdn' + str(p), auto_valuedef Trm(self,data,inv, p):""":param data::param inv::param p::return: 最近min( Time on book,p )個月,inv的修建均值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = []for i in range(len(df)):trm_mean = list(df.loc[i, :])trm_mean.remove(np.nanmax(trm_mean))trm_mean.remove(np.nanmin(trm_mean))temp = np.nanmean(trm_mean)auto_value.append(temp)return inv + '_trm' + str(p), auto_valuedef Cmx(self,data,inv, p):""":param data::param inv::param p::return: 當月inv / 最近p個月的inv中的最大值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = (df[inv + '1'] - np.nanmax(df, axis=1)) / np.nanmax(df, axis=1)return inv + '_cmx' + str(p), auto_valuedef Cmp(self,data,inv, p):""":param data::param inv::param p::return: ( 當月inv - 最近p個月的inv均值 ) / inv均值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = (df[inv + '1'] - np.nanmean(df, axis=1)) / np.nanmean(df, axis=1)return inv + '_cmp' + str(p), auto_valuedef Cnp(self,data,inv, p):""":param p::return: ( 當月inv - 最近p個月的inv最小值 ) /inv最小值"""df = data.loc[:, inv + '1':inv + str(p)]auto_value = (df[inv + '1'] - np.nanmin(df, axis=1)) / np.nanmin(df, axis=1)return inv + '_cnp' + str(p), auto_valuedef Msx(self,data,inv, p):""":param data::param inv::param p::return: 最近min( Time on book,p )個月取最大值的月份距現在的月份數"""df = data.loc[:, inv + '1':inv + str(p)]df['_max'] = np.nanmax(df, axis=1)for i in range(1, p + 1):df[inv + str(i)] = list(df[inv + str(i)] == df['_max'])del df['_max']df_value = np.where(df == True, 1, 0)auto_value = []for i in range(len(df_value)):row_value = df_value[i, :]indexs = 1for j in row_value:if j == 1:breakindexs += 1auto_value.append(indexs)return inv + '_msx' + str(p), auto_valuedef Rpp(self,data,inv, p):""":param data::param inv::param p::return: 近p個月的均值/((p,2p)個月的inv均值)"""df1 = data.loc[:, inv + '1':inv + str(p)]value1 = np.nanmean(df1, axis=1)df2 = data.loc[:, inv + str(p):inv + str(2 * p)]value2 = np.nanmean(df2, axis=1)auto_value = value1 / value2return inv + '_rpp' + str(p), auto_valuedef Dpp(self,data,inv, p):""":param data::param inv::param p::return: 最近p個月的均值 - ((p,2p)個月的inv均值)"""df1 = data.loc[:, inv + '1':inv + str(p)]value1 = np.nanmean(df1, axis=1)df2 = data.loc[:, inv + str(p):inv + str(2 * p)]value2 = np.nanmean(df2, axis=1)auto_value = value1 - value2return inv + '_dpp' + str(p), auto_valuedef Mpp(self,data,inv, p):""":param data::param inv::param p::return: (最近p個月的inv最大值)/ (最近(p,2p)個月的inv最大值)"""df1 = data.loc[:, inv + '1':inv + str(p)]value1 = np.nanmax(df1, axis=1)df2 = data.loc[:, inv + str(p):inv + str(2 * p)]value2 = np.nanmax(df2, 
axis=1)auto_value = value1 / value2return inv + '_mpp' + str(p), auto_valuedef Npp(self,data,inv, p):""":param data::param inv::param p::return: (最近p個月的inv最小值)/ (最近(p,2p)個月的inv最小值)"""df1 = data.loc[:, inv + '1':inv + str(p)]value1 = np.nanmin(df1, axis=1)df2 = data.loc[:, inv + str(p):inv + str(2 * p)]value2 = np.nanmin(df2, axis=1)auto_value = value1 / value2return inv + '_npp' + str(p), auto_valuedef auto_var(self,data_new,inv,p):""":param data::param inv::param p::return: 批量調用雙參數函數"""try:columns_name, values = self.Num(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Nmz(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Evr(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Avg(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Tot(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Tot2T(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Max(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Max(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Min(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Msg(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Msz(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cav(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cmn(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Std(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cva(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cmm(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cnm(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cxm(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cxp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Ran(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Nci(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Ncd(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Ncn(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Pdn(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cmx(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cmp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Cnp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Msx(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Nci(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Trm(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Bup(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Mai(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Mad(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Rpp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Dpp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Mpp(data_new,inv, p)data_new[columns_name] = valuescolumns_name, values = self.Npp(data_new,inv, p)data_new[columns_name] = valuesexcept:passreturn data_newif __name__ == "__main__":file_dir = ""file_name = "textdata.xlsx"data_ = pd.read_excel(file_dir + 
file_name)auto_var2 = time_series_feature()for p in range(1,12):for inv in ['ft','gt']:data_ = auto_var2.auto_var(data_,inv,p)四、特征篩選
Three families of feature-selection methods are in common use:
1. Filter
Removing features with low variance
Univariate feature selection
2. Wrapper
Recursive Feature Elimination
3. Embedded
Feature selection using SelectFromModel
Feature selection as part of a pipeline
Once preprocessing is done, we need to select meaningful features to feed into the algorithm for training.
Features are usually judged from two angles:
1. Whether the feature varies
If a feature barely varies, e.g. its variance is close to 0, the samples are essentially identical on it, and it contributes nothing to telling them apart.
2. Correlation between the feature and the target
This one is fairly self-evident: features highly correlated with the target should be preferred. Besides removing low-variance features, selection can be driven by correlation.
By the form the selection takes, methods fall into three families:
Filter: score every feature by its dispersion or its correlation with the target, then keep features by a score threshold or a preset number to keep.
Wrapper: driven by an objective function (usually a predictive-performance score), add or drop several features at each step.
Embedded: first train a model to obtain a weight coefficient for each feature, then select features by those coefficients from large to small. Similar to Filter, except that feature quality is determined through training.
Feature selection serves two main purposes:
1. reduce the number of features (dimensionality), so the model generalizes better and overfits less;
2. deepen our understanding of the features and their values.
Given a dataset, a single feature-selection method can rarely achieve both purposes at once.
Filter
1)移除低方差的特征 (Removing features with low variance)
假設某特征的特征值只有0和1,并且在所有輸入樣本中,95%的實例的該特征取值都是1,那就可以認為這個特征作用不大。
如果100%都是1,那這個特征就沒意義了。當特征值都是離散型變量的時候這種方法才能用,如果是連續型變量,就需要將連續變量離散化之后才能用。
而且實際當中,一般不太會有95%以上都取某個值的特征存在,所以這種方法雖然簡單但是不太好用。可以把它作為特征選擇的預處理,
先去掉那些取值變化小的特征,然后再從接下來提到的的特征選擇方法中選擇合適的進行進一步的特征選擇。
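A minimal sketch of that 0/1 case with sklearn's VarianceThreshold: a Boolean feature that equals one value in more than 80% of samples has variance below 0.8 * (1 - 0.8) = 0.16 and gets dropped (the 80% threshold here is illustrative):

from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# drop boolean features that are 0 or 1 in more than 80% of the samples
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit_transform(X)   # the first column (mostly 0) is removed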
2)單變量特征選擇 (Univariate feature selection)
單變量特征選擇的原理是分別單獨的計算每個變量的某個統計指標,根據該指標來判斷哪些變量重要,剔除那些不重要的變量。
對于分類問題(y離散),可采用:
卡方檢驗
f_classif
mutual_info_classif
互信息
對于回歸問題(y連續),可采用:
皮爾森相關系數
f_regression,
mutual_info_regression
最大信息系數
這種方法比較簡單,易于運行,易于理解,通常對于理解數據有較好的效果(但對特征優化、提高泛化能力來說不一定有效)。
SelectKBest 移除得分前 k 名以外的所有特征(取top k)
SelectPercentile 移除得分在用戶指定百分比以后的特征(取top k%)
對每個特征使用通用的單變量統計檢驗: 假正率(false positive rate) SelectFpr, 偽發現率(false discovery rate) SelectFdr, 或族系誤差率 SelectFwe.
GenericUnivariateSelect 可以設置不同的策略來進行單變量特征選擇。同時不同的選擇策略也能夠使用超參數尋優,從而讓我們找到最佳的單變量特征選擇策略。
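A minimal sketch of GenericUnivariateSelect on iris (the mode and param choices here are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

X, y = load_iris(return_X_y=True)
# keep the top 50% of features by the ANOVA F-score
transformer = GenericUnivariateSelect(f_classif, mode='percentile', param=50)
X_new = transformer.fit_transform(X, y)
X_new.shape   # (150, 2)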
Notice:
The methods based on F-test estimate the degree of linear dependency between two random variables.
On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
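To see the difference, a small sketch (modeled on sklearn's F-test-vs-MI comparison example): y depends on x only quadratically, so the F-test scores the feature near zero while mutual information does not:

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(1000, 1))
y = x[:, 0] ** 2 + 0.05 * rng.normal(size=1000)   # purely nonlinear dependence

f_score, _ = f_regression(x, y)     # close to 0: no linear relation
mi = mutual_info_regression(x, y)   # clearly > 0: dependency detected
print(f_score, mi)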
Chi-squared (chi2) test
The classical chi-squared test measures the dependence between a categorical feature and a categorical target.
For example, we can run a chi2 test on the samples to select the best two features:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)

Pearson Correlation
The Pearson correlation coefficient is one of the simplest ways to understand the relation between a feature and the response variable.
It measures the linear correlation between variables; the result lies in [-1, 1], where -1 means a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation.
import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)

""" pearsonr(x, y) takes a feature vector and a target vector and returns both the correlation coefficient and the p-value """
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

""" Compare the variable before and after adding noise: when the noise is small, the correlation is strong and the p-value low """
""" Pearson correlation is mainly used here to inspect correlation between features, not with the target """

Wrapper
Recursive Feature Elimination (RFE)
Recursive feature elimination trains a base model over several rounds; after each round, the features with the smallest weight coefficients are removed, and the next round trains on the reduced feature set.
For predictive models that assign weights to features (e.g., the coefficients of a linear model), RFE selects features by recursively shrinking the set under consideration: the model is first trained on all features and each feature receives a weight; the features with the smallest absolute weights are then kicked out; and this repeats until the remaining features reach the requested number.
RFECV runs RFE inside cross-validation to choose the best number of features. A set of d features has 2^d - 1 non-empty subsets, so exhaustive search is infeasible; instead, given an external learner (an SVM, say), RFECV evaluates the cross-validated error along the nested subsets on the RFE path and keeps the subset with the smallest error.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

rf = RandomForestClassifier()
iris = load_iris()
X, y = iris.data, iris.target
rfe = RFE(estimator=rf, n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)
X_rfe.shape
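RFECV itself is not shown above; a minimal sketch (cv and scoring choices are illustrative):

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rfecv = RFECV(estimator=RandomForestClassifier(random_state=0),
              step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)
print(rfecv.n_features_)   # number of features chosen by cross-validation
print(rfecv.support_)      # boolean mask of the kept features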
Embedded
Feature selection using SelectFromModel
L1-based feature selection
Linear models penalized with the L1 norm yield sparse solutions: most feature coefficients are 0.
When you want to reduce dimensionality for use with another classifier, feature_selection.SelectFromModel can keep the features whose coefficients are non-zero.
In particular, sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).
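A minimal sketch following the LinearSVC example in the sklearn documentation (the C value just controls how sparse the solution is):

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)   # keep features with non-zero coefficients
X_new = model.transform(X)
X_new.shape   # fewer than the original 4 columns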
First, recall the problems a model may run into in the business:
1. The model performs poorly: most likely the data is at fault.
2. Good on the training set, poor on the out-of-time test (the test sample is typically about 1/10 of the training data): the test distribution differs from the training distribution, which means some of the selected feature variables fluctuate too much; examine the most volatile features.
3. Good on the out-of-time test too, but poor once live: the variable logic differs between offline and online; the offline features may contain future information.
4. Fine after launch, but the score distribution starts sliding after a few weeks: one or two variables perform poorly across time.
5. Stable for a month or two, then the score distribution drops sharply: possibly an external factor, such as an operations campaign or a change in government policy.
6. No obvious problem, but the model decays gradually month over month: the population is slowly drifting, so the model ages and needs periodic retraining.
Then consider what the business requires of a variable:
- it must contribute to the model, i.e., it must help separate the customer population;
- logistic regression requires the variables to be linearly independent of one another;
- a logistic-regression scorecard also prefers variables with a monotonic trend
  (this is partly a business requirement; from the model's point of view, a monotonic variable is not necessarily better than one with a turning point);
- the population's distribution on each variable must be stable; distribution shift is unavoidable, but the swings must not be too large.
With these in mind, we pick from the methods above the ones that best fit this scenario.
3) Monotonicity
- bivar plot
""" 等頻切分 """ df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10) df_train.head() df_train = df_train.sort_values('Fare') alist = list(set(df_train['fare_qcut'])) badrate = {} for x in alist:a = df_train[df_train.fare_qcut == x]bad = a[a.label == 1]['label'].count()good = a[a.label == 0]['label'].count()badrate[x] = bad/(bad+good) f = zip(badrate.keys(),badrate.values()) f = sorted(f,key = lambda x : x[1],reverse = True ) badrate = pd.DataFrame(f) badrate.columns = pd.Series(['cut','badrate']) badrate = badrate.sort_values('cut') print(badrate) badrate.plot('cut','badrate') def var_PSI(dev_data, val_data):dev_cnt, val_cnt = sum(dev_data), sum(val_data)if dev_cnt * val_cnt == 0:return NonePSI = 0for i in range(len(dev_data)):dev_ratio = dev_data[i] / dev_cntval_ratio = val_data[i] / val_cnt + 1e-10psi = (dev_ratio - val_ratio) * math.log(dev_ratio/val_ratio)PSI += psireturn PSI注意分箱的數量將會影響著變量的PSI值。
PSI can be computed not only for the model score but for individual variables as well: bin the variable in each time window and compute the PSI between the binned distributions.
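For instance, a sketch calling the var_PSI function above with hypothetical bin counts for one variable in two months (the counts are made up):

# counts per qcut bin of the same variable: development month vs. a later month
dev_data = [110, 105, 98, 102, 95, 100, 97, 103, 99, 91]    # hypothetical
val_data = [90, 100, 105, 95, 102, 98, 110, 92, 104, 104]   # hypothetical
print(var_PSI(dev_data, val_data))   # below 0.1 is usually considered stable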
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import numpy as np
import random
import math

data = pd.read_csv(file_dir + 'data.txt')
data.head()

""" Check the month distribution; the last month is held out as the out-of-time validation set """
data.obs_mth.unique()

train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()

feature_lst = ['person_info', 'finance_info', 'credit_info', 'act_info',
               'td_score', 'jxl_score', 'mj_score', 'rh_score']

x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']

lr_model = LogisticRegression(C=0.1)
lr_model.fit(x, y)

y_pred = lr_model.predict_proba(x)[:, 1]
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)
train_ks = abs(fpr_lr_train - tpr_lr_train).max()
print('train_ks : ', train_ks)

y_pred = lr_model.predict_proba(val_x)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ', val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR')
plt.plot(fpr_lr, tpr_lr, label='evl LR')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

""" Feature screening: variance inflation factors (collinearity check) """
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = np.array(x)
for i in range(X.shape[1]):
    print(variance_inflation_factor(X, i))

import lightgbm as lgb
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=0, test_size=0.2)

def lgb_test(train_x, train_y, test_x, test_y):
    clf = lgb.LGBMClassifier(boosting_type='gbdt',
                             objective='binary',
                             metric='auc',
                             learning_rate=0.1,
                             n_estimators=24,
                             max_depth=5,
                             num_leaves=20,
                             max_bin=45,
                             min_data_in_leaf=6,
                             bagging_fraction=0.6,
                             bagging_freq=0,
                             feature_fraction=0.8)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (test_x, test_y)],
            eval_metric='auc')
    return clf, clf.best_score_['valid_1']['auc']

lgb_model, lgb_auc = lgb_test(train_x, train_y, test_x, test_y)
feature_importance = pd.DataFrame({'name': lgb_model.booster_.feature_name(),
                                   'importance': lgb_model.feature_importances_}
                                  ).sort_values(by=['importance'], ascending=False)
feature_importance

""" Keep the four most important, most stable features and refit """
feature_lst = ['person_info', 'finance_info', 'credit_info', 'act_info']
x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']

lr_model = LogisticRegression(C=0.1, class_weight='balanced')
lr_model.fit(x, y)

y_pred = lr_model.predict_proba(x)[:, 1]
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)
train_ks = abs(fpr_lr_train - tpr_lr_train).max()
print('train_ks : ', train_ks)

y_pred = lr_model.predict_proba(val_x)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ', val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR')
plt.plot(fpr_lr, tpr_lr, label='evl LR')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

# coefficients
print('variables:', feature_lst)
print('coefficients:', lr_model.coef_)
print('intercept:', lr_model.intercept_)

""" Model report """
model = lr_model
row_num, col_num = 0, 0
bins = 20
Y_predict = [s[1] for s in model.predict_proba(val_x)]
Y = val_y
nrows = Y.shape[0]
lis = [(Y_predict[i], Y[i]) for i in range(nrows)]
ks_lis = sorted(lis, key=lambda x: x[0], reverse=True)
bin_num = int(nrows / bins + 1)
bad = sum([1 for (p, y) in ks_lis if y > 0.5])
good = sum([1 for (p, y) in ks_lis if y <= 0.5])
bad_cnt, good_cnt = 0, 0
KS = []
BAD = []
GOOD = []
BAD_CNT = []
GOOD_CNT = []
BAD_PCTG = []
BADRATE = []
dct_report = {}
for j in range(bins):
    ds = ks_lis[j * bin_num: min((j + 1) * bin_num, nrows)]
    bad1 = sum([1 for (p, y) in ds if y > 0.5])
    good1 = sum([1 for (p, y) in ds if y <= 0.5])
    bad_cnt += bad1
    good_cnt += good1
    bad_pctg = round(bad_cnt / sum(val_y), 3)
    badrate = round(bad1 / (bad1 + good1), 3)
    ks = round(math.fabs((bad_cnt / bad) - (good_cnt / good)), 3)
    KS.append(ks)
    BAD.append(bad1)
    GOOD.append(good1)
    BAD_CNT.append(bad_cnt)
    GOOD_CNT.append(good_cnt)
    BAD_PCTG.append(bad_pctg)
    BADRATE.append(badrate)
dct_report['KS'] = KS
dct_report['BAD'] = BAD
dct_report['GOOD'] = GOOD
dct_report['BAD_CNT'] = BAD_CNT
dct_report['GOOD_CNT'] = GOOD_CNT
dct_report['BAD_PCTG'] = BAD_PCTG
dct_report['BADRATE'] = BADRATE
val_repot = pd.DataFrame(dct_report)
val_repot

""" Map probabilities to scores """
# features: ['person_info', 'finance_info', 'credit_info', 'act_info']
def score(person_info, finance_info, credit_info, act_info):
    # coefficients and constant taken from the fitted lr_model above
    xbeta = (person_info * 3.49460978
             + finance_info * 11.40051582
             + credit_info * 2.45541981
             + act_info * (-1.68676079)
             + 0.34484897)
    score = 650 - 34 * xbeta / math.log(2)
    return score

val['score'] = val.apply(lambda x: score(x.person_info, x.finance_info,
                                         x.credit_info, x.act_info), axis=1)

fpr_lr, tpr_lr, _ = roc_curve(val_y, val['score'])
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ', val_ks)

# map scores to grades
def level(score):
    if score <= 600:
        return "D"
    elif score <= 640:
        return "C"
    elif score <= 680:
        return "B"
    else:
        return "A"

val['level'] = val.score.map(lambda x: level(x))
val.level.groupby(val.level).count() / len(val)

""" Plot the score distribution """
import seaborn as sns
sns.distplot(val.score, kde=True)   # histplot/displot in newer seaborn

val = val.sort_values('score', ascending=True).reset_index(drop=True)
df2 = val.bad_ind.groupby(val['level']).sum()
df3 = val.bad_ind.groupby(val['level']).count()
print(df2 / df3)

Summary
Starting from raw business data we built basic features, preprocessed them, derived time-series variants in batch, and then screened them for monotonicity (bivar plots), stability (PSI), importance (LightGBM feature importance) and collinearity (VIF), ending with a small, stable logistic-regression scorecard whose probabilities are mapped to scores and grades.