An SVM-Based Financial Distress Early-Warning Model
Preface
This article takes China's A-share listed companies as the research object, selecting companies that were designated ST or *ST between 2015 and 2019 and excluding those flagged ST or *ST for non-financial reasons. Ten financial indicators from period T-3 (three years before the designation) were chosen, including the debt-to-assets ratio, the current ratio, and accounts receivable turnover.
Data Collection and Preprocessing
Import the required libraries
import tushare as ts
import pandas as pd
import time

pro = ts.pro_api('*')  # tushare token

Fetch the list of listed companies via tushare.pro
data_Lcompany = pro.stock_basic(exchange='', list_status='L', fields='ts_code,symbol')
data_Dcompany = pro.stock_basic(exchange='', list_status='D', fields='ts_code,symbol')
data_Pcompany = pro.stock_basic(exchange='', list_status='P', fields='ts_code,symbol')
data_company = pd.concat([data_Lcompany, data_Dcompany, data_Pcompany])

Read the sample data exported from the CSMAR database and preprocess it
data_sample = pd.read_excel(r"C:\Users\hasee\Desktop\樣本.xlsx")
data_sample.columns = data_sample.iloc[0]
data_sample.drop(index=0, inplace=True)
data_sample.loc[:, "執行日期"] = pd.to_datetime(data_sample.loc[:, "執行日期"])
data_sample["年份"] = pd.DatetimeIndex(data_sample.loc[:, "執行日期"]).year
data_sample = data_sample[["證券代碼", "年份"]]
data_sample.columns = ["symbol", "年份"]
data_sample = pd.merge(data_sample, data_company, how='left', on="symbol")

Further preprocess the sample data, merge it with the normal-company list, then fetch the financial indicators via tushare.pro and write them to CSV
year = [2015, 2016, 2017, 2018, 2019]
createVar = locals()
for j in year:
    createVar["data_sample_" + str(j)] = data_sample[data_sample["年份"] == j]
    createVar["data_origin_" + str(j)] = pd.merge(data_Lcompany, createVar["data_sample_" + str(j)], how='outer')
    createVar["data_origin_" + str(j)]["label"] = createVar["data_origin_" + str(j)]["年份"].map({j: 0})
    createVar["data_origin_" + str(j)].fillna(1, inplace=True)
    print(j)
    del createVar["data_origin_" + str(j)]["年份"]
    for i in range(createVar["data_origin_" + str(j)].shape[0]):
        print(i)
        if i == 0:
            createVar["fina_indicator_" + str(j)] = pro.query('fina_indicator',
                                                              ts_code=createVar["data_origin_" + str(j)]["ts_code"][i],
                                                              start_date=str(j - 3) + "1231",
                                                              end_date=str(j - 3) + "1231")
        else:
            time.sleep(0.9)  # throttle requests to respect the API rate limit
            createVar["fina_indicator_" + str(j)] = pd.concat([createVar["fina_indicator_" + str(j)],
                                                               pro.query('fina_indicator',
                                                                         ts_code=createVar["data_origin_" + str(j)]["ts_code"][i],
                                                                         start_date=str(j - 3) + "1231",
                                                                         end_date=str(j - 3) + "1231")])
    createVar["fina_indicator_" + str(j)].to_csv("fina_indicator_" + str(j) + ".csv")

Read the financial indicators back, select the ones to use, and preprocess them
fina_indicator_2015 = pd.read_csv("fina_indicator_2015.csv")
del fina_indicator_2015["Unnamed: 0"]
fina_indicator_2016 = pd.read_csv("fina_indicator_2016.csv")
del fina_indicator_2016["Unnamed: 0"]
fina_indicator_2017 = pd.read_csv("fina_indicator_2017.csv")
del fina_indicator_2017["Unnamed: 0"]
fina_indicator_2018 = pd.read_csv("fina_indicator_2018.csv")
del fina_indicator_2018["Unnamed: 0"]
fina_indicator_2019 = pd.read_csv("fina_indicator_2019.csv")
del fina_indicator_2019["Unnamed: 0"]
data_origin_2015 = pd.merge(fina_indicator_2015, data_origin_2015, on="ts_code", how='left')
data_origin_2016 = pd.merge(fina_indicator_2016, data_origin_2016, on="ts_code", how='left')
data_origin_2017 = pd.merge(fina_indicator_2017, data_origin_2017, on="ts_code", how='left')  # each year merged with its own year's indicators
data_origin_2018 = pd.merge(fina_indicator_2018, data_origin_2018, on="ts_code", how='left')
data_origin_2019 = pd.merge(fina_indicator_2019, data_origin_2019, on="ts_code", how='left')
data_origin = pd.concat([data_origin_2015, data_origin_2016, data_origin_2017, data_origin_2018, data_origin_2019])
data_origin.index = range(len(data_origin))
data_origin = data_origin[["debt_to_assets", "fcff", "profit_dedt", "current_ratio", "ar_turn", "profit_to_gr",
                           "roa", "grossprofit_margin", "or_yoy", "ebt_yoy", "assets_yoy", "label"]]
data_origin["fcff_ratio"] = data_origin["fcff"] / data_origin["profit_dedt"]
data_origin.dropna(axis=0, how='any', inplace=True)
data_origin.index = range(len(data_origin))
label = data_origin["label"]
data_origin.drop("fcff", axis=1, inplace=True)
data_origin.drop("profit_dedt", axis=1, inplace=True)
data_origin.drop("label", axis=1, inplace=True)  # keep the target out of the feature matrix

Model Tuning
Import the required libraries
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import accuracy_score as AC  # used below when scoring the thresholded predictions
from time import time
import datetime
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

Split into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_origin, label, test_size=0.3, random_state=400)

For why we split the data into training and test sets, see here:
https://www.zhihu.com/question/22872584
Standardize the features
ss = StandardScaler()
ss = ss.fit(Xtrain)  # fit on the training set only
Xtrain = ss.transform(Xtrain)
Xtest = ss.transform(Xtest)

Choosing a kernel
times = time()
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel,
              gamma="auto",
              degree=1,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    print("%s's testing accuracy %f, recall is %f, auc is %f" % (kernel, score, recall, auc))
    print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))

Because normal companies far outnumber distressed ones in the training set, the data is imbalanced. A classifier naturally leans toward the majority class, getting it right at the expense of the minority class. That is useless for our purpose, which is precisely to identify companies that will be in financial distress a few years from now. Setting class_weight="balanced" in SVC corrects for the imbalance reasonably well.
Output
Three metrics are reported: accuracy, recall, and AUC.
Accuracy is the number of correctly predicted samples divided by the total number of samples; generally, the closer to 1 the better.
Recall, also known as sensitivity or the true positive rate, is the proportion of samples whose true label is 1 that the model predicts correctly. Higher recall means we capture more of the minority class; lower recall means we miss more of it.
AUC reflects the model's ability to balance capturing the minority class against misclassifying the majority class. AUC is defined from the ROC curve, which is discussed later.
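As a quick illustration of the three metrics on made-up numbers (nothing here comes from the actual dataset):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])                    # toy ground truth
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])                    # toy hard predictions
scores = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.7, 0.1, 0.95])   # toy decision scores

print(accuracy_score(y_true, y_pred))  # correct predictions / all samples -> 0.75
print(recall_score(y_true, y_pred))    # captured positives / all true positives -> 0.8
print(roc_auc_score(y_true, scores))   # AUC needs scores (e.g. decision_function output), not hard labels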
Here we use AUC as the criterion for choosing the kernel. The rbf kernel gives the largest AUC, so we choose the rbf (Gaussian radial basis function) kernel.
Tuning gamma
recallall = []
aucall = []
scoreall = []
gamma_range = np.logspace(-10, 1, 20)
for i in gamma_range:
    clf = SVC(kernel="rbf",
              gamma=i,
              degree=1,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
print(max(scoreall), gamma_range[scoreall.index(max(scoreall))])
print(max(aucall), gamma_range[aucall.index(max(aucall))])
print(max(recallall), gamma_range[recallall.index(max(recallall))])
plt.plot(gamma_range, scoreall)
plt.plot(gamma_range, aucall)
plt.plot(gamma_range, recallall)
plt.show()

Output
With gamma = 0.04832930238571752, the AUC reaches 0.757.
Tuning the soft-margin parameter C
recallall = []
aucall = []
scoreall = []
C_range = np.linspace(0.01, 30, 50)
for i in C_range:
    clf = SVC(kernel="rbf",
              C=i,
              gamma=0.04832930238571752,
              cache_size=3000,
              class_weight="balanced").fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
print(max(scoreall), C_range[scoreall.index(max(scoreall))])
print(max(aucall), C_range[aucall.index(max(aucall))])
print(max(recallall), C_range[recallall.index(max(recallall))])
plt.plot(C_range, scoreall)
plt.plot(C_range, aucall)
plt.plot(C_range, recallall)
plt.show()

Output
With C = 0.6220408163265306, the AUC reaches 0.758.
Because the poly kernel has more tunable parameters, I also tried tuning it with a grid search, but even after running for a whole day the results were still not satisfactory.
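For reference, a minimal sketch of what such a grid search could look like (the parameter grid below is illustrative, not the one actually run; scoring is ROC AUC to stay consistent with the rest of the tuning):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {                  # illustrative values only
    "degree": [2, 3],
    "coef0": [0, 0.5, 1],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(SVC(kernel="poly", cache_size=3000, class_weight="balanced"),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(Xtrain, Ytrain)
print(search.best_params_, search.best_score_)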
Plotting the ROC curve
clf_proba = SVC(kernel="rbf",
                C=0.6220408163265306,
                gamma=0.04832930238571752,
                cache_size=3000,
                class_weight="balanced").fit(Xtrain, Ytrain)
FPR, recall, thresholds = roc_curve(Ytest, clf_proba.decision_function(Xtest), pos_label=1)
area = AUC(Ytest, clf_proba.decision_function(Xtest))
plt.figure()
plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

The ROC curve (Receiver Operating Characteristic curve) plots, for varying thresholds, the false positive rate FPR (the rate at which the model misclassifies the majority class as positive) on the x-axis against the recall at the same threshold on the y-axis. In this case the threshold is each sample's signed distance to the separating hyperplane. AUC is simply the area under this curve.
For more on ROC curves, see here:
https://blog.csdn.net/dujiahei/article/details/87932096
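To make the "threshold = distance to the hyperplane" idea concrete, here is a rough sketch that re-derives one (FPR, recall) point of the ROC curve by hand at a single cutoff, assuming the positive class is labelled 1 as in the roc_curve call above:

import numpy as np

scores = clf_proba.decision_function(Xtest)  # signed distance to the separating hyperplane
y_true = np.asarray(Ytest)

cutoff = 0.0                                 # one particular threshold on that distance
y_hat = (scores >= cutoff).astype(int)

tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))
tn = np.sum((y_hat == 0) & (y_true == 0))

print("recall =", tp / (tp + fn))  # y-coordinate of the ROC point at this cutoff
print("FPR    =", fp / (fp + tn))  # x-coordinate of the ROC point at this cutoff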
Output
Plotting the optimal threshold on the ROC curve
maxindex = (recall - FPR).tolist().index(max(recall - FPR))  # index of the threshold where recall - FPR is largest
plt.figure()
plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.scatter(FPR[maxindex], recall[maxindex], c="black", s=30)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Output
Based on the selected optimal threshold, we construct y_pred by hand and compute recall and accuracy at that threshold.
prob = pd.DataFrame(clf_proba.decision_function(Xtest))
prob.loc[prob.iloc[:, 0] >= thresholds[maxindex], "y_pred"] = 1
prob.loc[prob.iloc[:, 0] < thresholds[maxindex], "y_pred"] = 0
prob.loc[:, "y_pred"].isnull().sum()  # sanity check: every sample should have received a prediction
times = time()
score = AC(Ytest, prob.loc[:, "y_pred"].values)
recall = recall_score(Ytest, prob.loc[:, "y_pred"])
print("testing accuracy %f, recall is %f" % (score, recall))
print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))

Output
At this optimal threshold, accuracy reaches 0.683 and recall reaches 0.726.
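If you want to see where those two numbers come from, one quick check (not part of the original workflow) is the confusion matrix at the chosen threshold:

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes.
print(confusion_matrix(Ytest, prob.loc[:, "y_pred"].values))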
Afterword
Overall, the model's performance is fairly mediocre. Much of that comes down to the choice of features and samples, which did not get enough thought here; with more experimentation later, the model should improve somewhat.
Another lesson: features on different scales must be standardized. The first time I tuned parameters I skipped standardization, and the linear kernel ran for a whole day without finishing (I don't think it should actually take that long, and I still don't know why it did).
Finally
The starry heavens above me, and the moral law within me.