Kaggle: Tabular Playground Series - May 2021
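Introduction
The dataset for this competition is synthetic: it was generated with CTGAN from a real dataset. The original dataset is about predicting the category of an e-commerce product from various attributes of its listing. Although the features have been processed, they retain properties related to the real-world features.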
Data
https://www.kaggle.com/c/tabular-playground-series-may-2021/data?select=train.csv
1. Training set: 100000 × 52 (id, 50 features, target)
2. Test set: 50000 × 51 (id, 50 features)
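Data visualization
(1) Features 0~49
The features are discrete: each one takes only a handful of fixed values, and zeros are very common.
(2) target: four labels, 1, 2, 3, and 4 (Class_1 through Class_4)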
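The two observations above (few distinct values with many zeros, and four target classes) can be checked numerically before plotting anything. The following is a minimal sketch, not the author's code: the helper `summarize_features` and the tiny `demo` frame are hypothetical, and the column names `feature_0`, `feature_1`, `id`, and `target` are assumed to follow the layout described in the Data section.

```python
import pandas as pd


def summarize_features(df, feature_cols):
    """Return, per feature, the number of distinct values and the share of zeros."""
    return pd.DataFrame({
        "n_unique": df[feature_cols].nunique(),
        "zero_fraction": (df[feature_cols] == 0).mean(),
    })


# Small made-up stand-in for train.csv (illustrative values, not competition data).
demo = pd.DataFrame({
    "id": range(8),
    "feature_0": [0, 0, 0, 1, 0, 2, 0, 0],
    "feature_1": [0, 3, 0, 0, 0, 0, 1, 0],
    "target": ["Class_1", "Class_2", "Class_2", "Class_3",
               "Class_2", "Class_4", "Class_2", "Class_1"],
})

feature_cols = [c for c in demo.columns if c.startswith("feature_")]
summary = summarize_features(demo, feature_cols)

# Share of each class in the target column.
class_share = demo["target"].value_counts(normalize=True)

print(summary)
print(class_share)
```

On the real data, replacing `demo` with `pd.read_csv("train.csv")` should give the same kind of table for all 50 features.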
Model: XGBoost
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# 1. Data inspection: check for missing values, unique values, and
#    correlations between features.
# Correlation analysis on the feature columns (optional; uncomment to run):
# features = pd.read_csv("train.csv").iloc[:, 1:-1]
# features.corr(method="pearson").to_csv("cor_pearson.csv")


# Build the training and test sets.
def load_data():
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    # Column 0 is the id; the last training column is the target.
    train_X = train.iloc[:, 1:-1]
    X_test = test.iloc[:, 1:]
    # Encode the class labels as integers 0..3.
    le = LabelEncoder()
    train_Y = le.fit_transform(train.iloc[:, -1])
    test_ids = test.iloc[:, 0]
    return train_X, train_Y, X_test, test_ids


train_X, train_Y, X_test, test_ids = load_data()
X_train, X_valid, y_train, y_valid = train_test_split(
    train_X, train_Y, test_size=0.2, random_state=0)

# XGBoost native interface
params = {
    'eta': 0.3,
    'max_depth': 3,
    'min_child_weight': 1,
    'gamma': 0.3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'booster': 'gbtree',
    'objective': 'multi:softprob',
    'num_class': 4,
    'nthread': 12,
    'lambda': 1,
    'seed': 27,
    'verbosity': 1,  # replaces the deprecated 'silent' parameter
    'eval_metric': 'mlogloss',
}
print(X_train.shape)
print(y_train.shape)
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_valid, label=y_valid)
d_test = xgb.DMatrix(X_test)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

print("Training with the XGBoost native interface:")
# The original used early_stopping_rounds=500, which can never trigger
# within 100 boosting rounds; 10 is used here instead.
model = xgb.train(params, d_train, 100, watchlist,
                  early_stopping_rounds=10, verbose_eval=10)

predictions = model.predict(d_test)

# One probability column per class, plus the test ids. Probabilities are
# written at full precision, since rounding to two decimals can turn
# small values into 0.
submission = pd.DataFrame(
    predictions, columns=["Class_1", "Class_2", "Class_3", "Class_4"])
submission.insert(0, "id", test_ids.values)
submission.to_csv('Submission.csv', index=False)

Prediction results
XGBoost is run in multi-class mode (multi:softprob, num_class = 4), so the output for each sample is its probability of belonging to each of the four classes.
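To make the shape of that output concrete, the sketch below computes the multi-class log loss (the `mlogloss` metric named in the parameters) from a probability matrix with one row per sample and one column per class. It is a NumPy illustration only: the function `multiclass_log_loss` and the three sample rows are hypothetical, not XGBoost output.

```python
import numpy as np


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    # Clip to avoid log(0), then renormalize so every row sums to 1 again.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))


# Three made-up samples; columns correspond to Class_1..Class_4.
proba = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.60, 0.20, 0.10],
])
y_true = np.array([0, 2, 1])  # integer-encoded labels, as LabelEncoder produces

loss = multiclass_log_loss(y_true, proba)

# The hard prediction, if one is needed, is the column with the highest probability.
predicted_class = proba.argmax(axis=1)

print(loss)
print(predicted_class)
```

Because the metric rewards well-calibrated probabilities, the submission keeps all four columns instead of collapsing each row to a single predicted class.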
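Analysis and discussion
The author also tried applying PCA dimensionality reduction and data normalization before prediction, but the results were slightly worse.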
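The post does not show how that preprocessing was set up. One common arrangement is sketched below with scikit-learn, assuming min-max scaling followed by PCA; the choice of 20 components and the random count data are illustrative assumptions, not the author's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up sparse count features standing in for the 50 competition columns.
rng = np.random.default_rng(0)
X_train = rng.poisson(0.5, size=(200, 50)).astype(float)
X_valid = rng.poisson(0.5, size=(50, 50)).astype(float)

# Scale every feature to [0, 1], then project onto 20 principal components.
reducer = make_pipeline(MinMaxScaler(), PCA(n_components=20, random_state=0))

# Fit on the training split only, then reuse the fitted transform on validation data.
X_train_red = reducer.fit_transform(X_train)
X_valid_red = reducer.transform(X_valid)

# Fraction of the scaled data's variance retained by the 20 components.
explained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(X_train_red.shape, X_valid_red.shape, explained)
```

The reduced arrays could then be passed to `xgb.DMatrix` in place of the raw features. One plausible reason this does not help: tree splits depend only on the ordering of feature values, so rescaling alone gives a tree model little new information, while PCA discards part of the variance.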
Anyone interested is welcome to discuss how best to handle this kind of data.