An SVM-Based Financial Distress Early-Warning Model
Preface
This article takes China's A-share listed companies as the research object, selecting companies that were designated ST or *ST between 2015 and 2019 and excluding those flagged ST or *ST for non-financial reasons. Ten financial indicators from period T-3 (three years before the designation) were chosen, including the debt-to-assets ratio, the current ratio, and accounts receivable turnover.
Data Collection and Preprocessing
Import the required libraries
import tushare as ts
import pandas as pd
import time

pro = ts.pro_api('*')  # tushare token

Fetch the list of listed companies via tushare.pro
data_Lcompany = pro.stock_basic(exchange='', list_status='L', fields='ts_code,symbol')
data_Dcompany = pro.stock_basic(exchange='', list_status='D', fields='ts_code,symbol')
data_Pcompany = pro.stock_basic(exchange='', list_status='P', fields='ts_code,symbol')
data_company = pd.concat([data_Lcompany, data_Dcompany, data_Pcompany])

Read the sample data exported from the CSMAR database and preprocess it
data_sample = pd.read_excel(r"C:\Users\hasee\Desktop\樣本.xlsx")
data_sample.columns = data_sample.iloc[0]
data_sample.drop(index=0, inplace=True)
data_sample.loc[:, "執行日期"] = pd.to_datetime(data_sample.loc[:, "執行日期"])
data_sample["年份"] = pd.DatetimeIndex(data_sample.loc[:, "執行日期"]).year
data_sample = data_sample[["證券代碼", "年份"]]
data_sample.columns = ["symbol", "年份"]
data_sample = pd.merge(data_sample, data_company, how='left', on="symbol")

Further preprocess the sample data, merge it with the normal-company list, then fetch the financial indicators via tushare.pro and write them to CSV
year = [2015, 2016, 2017, 2018, 2019]
createVar = locals()
for j in year:
    createVar["data_sample_" + str(j)] = data_sample[data_sample["年份"] == j]
    createVar["data_origin_" + str(j)] = pd.merge(data_Lcompany, createVar["data_sample_" + str(j)], how='outer')
    createVar["data_origin_" + str(j)]["label"] = createVar["data_origin_" + str(j)]["年份"].map({j: 0})
    createVar["data_origin_" + str(j)].fillna(1, inplace=True)
    print(j)
    del createVar["data_origin_" + str(j)]["年份"]
    for i in range(createVar["data_origin_" + str(j)].shape[0]):
        print(i)
        if i == 0:
            createVar["fina_indicator_" + str(j)] = pro.query('fina_indicator',
                                                              ts_code=createVar["data_origin_" + str(j)]["ts_code"][i],
                                                              start_date=str(j - 3) + "1231",
                                                              end_date=str(j - 3) + "1231")
        else:
            time.sleep(0.9)  # throttle requests to respect the API rate limit
            createVar["fina_indicator_" + str(j)] = pd.concat([createVar["fina_indicator_" + str(j)],
                                                               pro.query('fina_indicator',
                                                                         ts_code=createVar["data_origin_" + str(j)]["ts_code"][i],
                                                                         start_date=str(j - 3) + "1231",
                                                                         end_date=str(j - 3) + "1231")])
    createVar["fina_indicator_" + str(j)].to_csv("fina_indicator_" + str(j) + ".csv")

Read the financial indicators back, select the ones to use, and preprocess them
fina_indicator_2015 = pd.read_csv("fina_indicator_2015.csv")
del fina_indicator_2015["Unnamed: 0"]
fina_indicator_2016 = pd.read_csv("fina_indicator_2016.csv")
del fina_indicator_2016["Unnamed: 0"]
fina_indicator_2017 = pd.read_csv("fina_indicator_2017.csv")
del fina_indicator_2017["Unnamed: 0"]
fina_indicator_2018 = pd.read_csv("fina_indicator_2018.csv")
del fina_indicator_2018["Unnamed: 0"]
fina_indicator_2019 = pd.read_csv("fina_indicator_2019.csv")
del fina_indicator_2019["Unnamed: 0"]
data_origin_2015 = pd.merge(fina_indicator_2015, data_origin_2015, on="ts_code", how='left')
data_origin_2016 = pd.merge(fina_indicator_2016, data_origin_2016, on="ts_code", how='left')
data_origin_2017 = pd.merge(fina_indicator_2017, data_origin_2017, on="ts_code", how='left')  # each year merged with its own year's indicators
data_origin_2018 = pd.merge(fina_indicator_2018, data_origin_2018, on="ts_code", how='left')
data_origin_2019 = pd.merge(fina_indicator_2019, data_origin_2019, on="ts_code", how='left')
data_origin = pd.concat([data_origin_2015, data_origin_2016, data_origin_2017, data_origin_2018, data_origin_2019])
data_origin.index = range(len(data_origin))
data_origin = data_origin[["debt_to_assets", "fcff", "profit_dedt", "current_ratio", "ar_turn", "profit_to_gr",
                           "roa", "grossprofit_margin", "or_yoy", "ebt_yoy", "assets_yoy", "label"]]
data_origin["fcff_ratio"] = data_origin["fcff"] / data_origin["profit_dedt"]
data_origin.dropna(axis=0, how='any', inplace=True)
data_origin.index = range(len(data_origin))
label = data_origin["label"]
data_origin.drop("fcff", axis=1, inplace=True)
data_origin.drop("profit_dedt", axis=1, inplace=True)
data_origin.drop("label", axis=1, inplace=True)  # keep the target out of the feature matrix

Model Tuning
Import the required libraries
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import accuracy_score as AC  # used below when scoring the thresholded predictions
from time import time
import datetime
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

Split into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_origin, label, test_size=0.3, random_state=400)

For why we split the data into training and test sets, see here:
https://www.zhihu.com/question/22872584
Standardize the features
ss = StandardScaler()
ss = ss.fit(Xtrain)  # fit on the training set only
Xtrain = ss.transform(Xtrain)
Xtest = ss.transform(Xtest)

Choosing a kernel
times = time()
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel,
              gamma="auto",
              degree=1,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    print("%s's testing accuracy %f, recall is %f, auc is %f" % (kernel, score, recall, auc))
    print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))

Because normal companies far outnumber distressed ones in the training set, the data is imbalanced. A classifier naturally leans toward the majority class, getting it right at the expense of the minority class. That is useless for our purpose, which is precisely to identify companies that will be in financial distress a few years from now. Setting class_weight="balanced" in SVC corrects for the imbalance reasonably well.
Output
Three metrics are reported: accuracy, recall, and AUC.
Accuracy is the number of correctly predicted samples divided by the total number of samples; generally, the closer to 1 the better.
Recall, also known as sensitivity or the true positive rate, is the proportion of samples whose true label is 1 that the model predicts correctly. Higher recall means we capture more of the minority class; lower recall means we miss more of it.
AUC reflects the model's ability to balance capturing the minority class against misclassifying the majority class. AUC is defined from the ROC curve, which is discussed later.
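As a quick illustration of the three metrics on made-up numbers (nothing here comes from the actual dataset):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])                    # toy ground truth
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])                    # toy hard predictions
scores = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.7, 0.1, 0.95])   # toy decision scores

print(accuracy_score(y_true, y_pred))  # correct predictions / all samples -> 0.75
print(recall_score(y_true, y_pred))    # captured positives / all true positives -> 0.8
print(roc_auc_score(y_true, scores))   # AUC needs scores (e.g. decision_function output), not hard labels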
Here we use AUC as the criterion for choosing the kernel. The rbf kernel gives the largest AUC, so we choose the rbf (Gaussian radial basis function) kernel.
Tuning gamma
recallall = []
aucall = []
scoreall = []
gamma_range = np.logspace(-10, 1, 20)
for i in gamma_range:
    clf = SVC(kernel="rbf",
              gamma=i,
              degree=1,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
print(max(scoreall), gamma_range[scoreall.index(max(scoreall))])
print(max(aucall), gamma_range[aucall.index(max(aucall))])
print(max(recallall), gamma_range[recallall.index(max(recallall))])
plt.plot(gamma_range, scoreall)
plt.plot(gamma_range, aucall)
plt.plot(gamma_range, recallall)
plt.show()

Output
With gamma = 0.04832930238571752, the AUC reaches 0.757.
Tuning the soft-margin parameter C
recallall = []
aucall = []
scoreall = []
C_range = np.linspace(0.01, 30, 50)
for i in C_range:
    clf = SVC(kernel="rbf",
              C=i,
              gamma=0.04832930238571752,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
print(max(scoreall), C_range[scoreall.index(max(scoreall))])
print(max(aucall), C_range[aucall.index(max(aucall))])
print(max(recallall), C_range[recallall.index(max(recallall))])
plt.plot(C_range, scoreall)
plt.plot(C_range, aucall)
plt.plot(C_range, recallall)
plt.show()

Output
With C = 0.6220408163265306, the AUC reaches 0.758.
Because the poly kernel has more tunable parameters, I also tried tuning it with a grid search, but even after running for a whole day the results were still not satisfactory.
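For reference, a minimal sketch of what such a grid search could look like (the parameter grid below is illustrative, not the one actually run; scoring is ROC AUC to stay consistent with the rest of the tuning):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {                  # illustrative values only
    "degree": [2, 3],
    "coef0": [0, 0.5, 1],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(SVC(kernel="poly", cache_size=3000, class_weight="balanced"),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(Xtrain, Ytrain)
print(search.best_params_, search.best_score_)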
Plotting the ROC curve
clf_proba = SVC(kernel="rbf",
                C=0.6220408163265306,
                gamma=0.04832930238571752,
                cache_size=3000,
                class_weight="balanced").fit(Xtrain, Ytrain)
FPR, recall, thresholds = roc_curve(Ytest, clf_proba.decision_function(Xtest), pos_label=1)
area = AUC(Ytest, clf_proba.decision_function(Xtest))
plt.figure()
plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

The ROC curve (Receiver Operating Characteristic curve) plots, for varying thresholds, the false positive rate FPR (the rate at which the model misclassifies the majority class as positive) on the x-axis against the recall at the same threshold on the y-axis. In this case the threshold is each sample's signed distance to the separating hyperplane. AUC is simply the area under this curve.
For more on ROC curves, see here:
https://blog.csdn.net/dujiahei/article/details/87932096
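To make the "threshold = distance to the hyperplane" idea concrete, here is a rough sketch that re-derives one (FPR, recall) point of the ROC curve by hand at a single cutoff, assuming the positive class is labelled 1 as in the roc_curve call above:

import numpy as np

scores = clf_proba.decision_function(Xtest)  # signed distance to the separating hyperplane
y_true = np.asarray(Ytest)

cutoff = 0.0                                 # one particular threshold on that distance
y_hat = (scores >= cutoff).astype(int)

tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))
tn = np.sum((y_hat == 0) & (y_true == 0))

print("recall =", tp / (tp + fn))  # y-coordinate of the ROC point at this cutoff
print("FPR    =", fp / (fp + tn))  # x-coordinate of the ROC point at this cutoff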
Output
Plotting the optimal threshold on the ROC curve
maxindex = (recall - FPR).tolist().index(max(recall - FPR))  # index of the threshold where recall - FPR is largest
plt.figure()
plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.scatter(FPR[maxindex], recall[maxindex], c="black", s=30)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Output
Based on the selected optimal threshold, we construct y_pred by hand and compute recall and accuracy at that threshold.
prob = pd.DataFrame(clf_proba.decision_function(Xtest))
prob.loc[prob.iloc[:, 0] >= thresholds[maxindex], "y_pred"] = 1
prob.loc[prob.iloc[:, 0] < thresholds[maxindex], "y_pred"] = 0
prob.loc[:, "y_pred"].isnull().sum()  # sanity check: every sample should have received a prediction
times = time()
score = AC(Ytest, prob.loc[:, "y_pred"].values)
recall = recall_score(Ytest, prob.loc[:, "y_pred"])
print("testing accuracy %f, recall is %f" % (score, recall))
print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))

Output
At this optimal threshold, accuracy reaches 0.683 and recall reaches 0.726.
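If you want to see where those two numbers come from, one quick check (not part of the original workflow) is the confusion matrix at the chosen threshold:

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes.
print(confusion_matrix(Ytest, prob.loc[:, "y_pred"].values))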
Afterword
Overall, the model's performance is fairly mediocre. Much of that comes down to the choice of features and samples, which did not get enough thought here; with more experimentation later, the model should improve somewhat.
Another lesson: features on different scales must be standardized. The first time I tuned parameters I skipped standardization, and the linear kernel ran for a whole day without finishing (I don't think it should actually take that long, and I still don't know why it did).
Finally
The starry heavens above me, and the moral law within me.