Listing which features in a dataset are missing, and by how much
The code below was run successfully in a Kaggle notebook:

    import os

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    import seaborn as sns

    %matplotlib inline

    print(os.listdir("../input"))

    train_identity = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv')
    train_transaction = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')
    test_identity = pd.read_csv('../input/ieee-fraud-detection/test_identity.csv')
    test_transaction = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv')

    # Overall size and null statistics of the training transactions
    nrows = len(train_transaction)
    ncols = len(train_transaction.columns)
    tot_null = train_transaction.isnull().sum().sum()
    print('Number of rows in Train_transaction file : ', nrows)
    print('Number of columns in Train_transaction file : ', ncols)
    print('Number of entries in Train_transaction file : ', ncols * nrows)
    print('Number of Null value entries in Train_transaction file : ', tot_null)
    print('Percentage of Null value entries in Train_transaction file : ',
          round(tot_null / (ncols * nrows) * 100, 2))


    def percent_na(df):
        """Return one row per column with its percentage of missing values."""
        percent_missing = df.isnull().sum() * 100 / len(df)
        missing_value_df = pd.DataFrame({'column_name': df.columns,
                                         'percent_missing': percent_missing},
                                        index=None)
        # missing_value_df.sort_values('percent_missing', inplace=True)
        missing_value_df = missing_value_df.reset_index().drop('index', axis=1)
        return missing_value_df


    train_transaction_na = percent_na(train_transaction)

    sns.set(rc={'figure.figsize': (16, 10)})
    # Keep only the missing percentages shared by more than one column
    pct_counts = train_transaction_na.percent_missing.value_counts()
    col_na_count = pct_counts[pct_counts > 1]
    col_na_count.index = col_na_count.index.to_series().apply(lambda x: np.round(x, 2))
    # Pass x and y as keywords; recent seaborn releases no longer accept them positionally
    plot = sns.barplot(x=col_na_count.index, y=col_na_count.values)
    for p in plot.patches:
        # Write the column count above each bar
        plot.annotate("%d" % p.get_height(),
                      (p.get_x() + p.get_width() / 2., p.get_height()),
                      ha='center', va='top', fontsize=12, color='black',
                      xytext=(0, 20), textcoords='offset points')
    plot = plot.set(xlabel='% of Missing Values in a column ',
                    ylabel='Number of columns')

Running it produces the chart below (x-axis: percentage of missing values; y-axis: number of features whose missing values reach that percentage):
[Figure: bar chart of the number of columns at each missing-value percentage]
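
The two steps behind the chart, computing each column's missing percentage and counting how many columns share a percentage, can be tried on a small hand-made frame. This is a minimal sketch rather than the notebook's own code: the `toy` frame and the helper names `missing_percent` and `columns_per_percent` are invented for illustration. It rounds before counting, on the assumption that near-identical percentages should be merged; rounding only the labels afterwards can leave two bars with the same label (86.123717 and 86.122701 both round to 86.12).

```python
import numpy as np
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null entries in each column."""
    # isnull() gives a boolean frame; the mean of booleans is the null fraction.
    return df.isnull().mean() * 100


def columns_per_percent(percent: pd.Series, decimals: int = 2) -> pd.Series:
    """Count how many columns share each rounded missing percentage."""
    return percent.round(decimals).value_counts().sort_index()


# Hypothetical toy data: a and b are 50% null, c is 25% null, d is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [1, 2, 3, 4],
})

percent = missing_percent(toy)
counts = columns_per_percent(percent)
print(percent)
print(counts)
```

On the real data, `missing_percent(train_transaction)` should give the same numbers as the `percent_missing` column returned by `percent_na`.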
Then run the following code:

    pd.options.display.max_colwidth = 300
    # Collect the names of all columns that share the same missing percentage
    col_na_group = train_transaction_na.groupby('percent_missing')['column_name'].unique().reset_index()
    num_columns = []
    for i in range(len(col_na_group)):
        num_columns.append(len(col_na_group.column_name[i]))
    col_na_group['num_columns'] = num_columns
    # Keep non-zero percentages shared by more than one column, highest first
    col_na_group = col_na_group.loc[(col_na_group['num_columns'] > 1) & (col_na_group['percent_missing'] > 0), ].sort_values(by='percent_missing', ascending=False).reset_index()
    col_na_group


| index | percent_missing | column_name | num_columns |
| --- | --- | --- | --- |
| 37 | 87.312290 | [D8, D9] | 2 |
| 36 | 86.123717 | [V138, V139, V140, V141, V142, V146, V147, V148, V149, V153, V154, V155, V156, V157, V158, V161, V162, V163] | 18 |
| 35 | 86.122701 | [V143, V144, V145, V150, V151, V152, V159, V160, V164, V165, V166] | 11 |
| 34 | 86.054967 | [V322, V323, V324, V325, V326, V327, V328, V329, V330, V331, V332, V333, V334, V335, V336, V337, V338, V339] | 18 |
| 33 | 77.913435 | [V217, V218, V219, V223, V224, V225, V226, V228, V229, V230, V231, V232, V233, V235, V236, V237, V240, V241, V242, V243, V244, V246, V247, V248, V249, V252, V253, V254, V257, V258, V260, V261, V262, V263, V264, V265, V266, V267, V268, V269, V273, V274, V275, V276, V277, V278] | 46 |
| 31 | 76.355370 | [V167, V168, V172, V173, V176, V177, V178, V179, V181, V182, V183, V186, V187, V190, V191, V192, V193, V196, V199, V202, V203, V204, V205, V206, V207, V211, V212, V213, V214, V215, V216] | 31 |
| 30 | 76.323534 | [V169, V170, V171, V174, V175, V180, V184, V185, V188, V189, V194, V195, V197, V198, V200, V201, V208, V209, V210] | 19 |
| 29 | 76.053104 | [V220, V221, V222, V227, V234, V238, V239, V245, V250, V251, V255, V256, V259, V270, V271, V272] | 16 |
| 25 | 58.633115 | [M8, M9] | 2 |
| 21 | 47.293494 | [D11, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11] | 12 |
| 20 | 45.907136 | [M1, M2, M3] | 3 |
| 17 | 28.612626 | [V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52] | 18 |
| 14 | 15.098723 | [V75, V76, V77, V78, V79, V80, V81, V82, V83, V84, V85, V86, V87, V88, V89, V90, V91, V92, V93, V94] | 20 |
| 12 | 13.055170 | [V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63, V64, V65, V66, V67, V68, V69, V70, V71, V72, V73, V74] | 22 |
| 11 | 12.881939 | [V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34] | 23 |
| 9 | 11.126427 | [addr1, addr2] | 2 |
| 3 | 0.214888 | [D1, V281, V282, V283, V288, V289, V296, V300, V301, V313, V314, V315] | 12 |
| 2 | 0.053172 | [V95, V96, V97, V98, V99, V100, V101, V102, V103, V104, V105, V106, V107, V108, V109, V110, V111, V112, V113, V114, V115, V116, V117, V118, V119, V120, V121, V122, V123, V124, V125, V126, V127, V128, V129, V130, V131, V132, V133, V134, V135, V136, V137] | 43 |
| 1 | 0.002032 | [V279, V280, V284, V285, V286, V287, V290, V291, V292, V293, V294, V295, V297, V298, V299, V302, V303, V304, V305, V306, V307, V308, V309, V310, V311, V312, V316, V317, V318, V319, V320, V321] | 32 |
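
The loop that fills `num_columns` can also be written as a single aggregation. The following is a minimal sketch on made-up data, not the notebook's code: the function name `group_columns_by_missing` and the `toy` frame are hypothetical, and the filter mirrors the one above (non-zero percentages shared by more than one column).

```python
import numpy as np
import pandas as pd


def group_columns_by_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Group column names by their exact missing percentage."""
    percent = df.isnull().mean() * 100
    table = pd.DataFrame({
        "column_name": percent.index,
        "percent_missing": percent.values,
    })
    # One row per distinct percentage, with the list of columns that have it.
    grouped = (
        table.groupby("percent_missing")["column_name"]
        .agg(list)
        .reset_index()
    )
    grouped["num_columns"] = grouped["column_name"].map(len)
    # Keep non-zero percentages shared by more than one column, highest first.
    mask = (grouped["num_columns"] > 1) & (grouped["percent_missing"] > 0)
    return (
        grouped[mask]
        .sort_values("percent_missing", ascending=False)
        .reset_index(drop=True)
    )


# Hypothetical toy data: a and b are 50% null, c and e are 25% null,
# d and f are complete, so their 0% group is filtered out.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [1, 2, 3, 4],
    "e": [np.nan, 2.0, 3.0, 4.0],
    "f": [5, 6, 7, 8],
})

result = group_columns_by_missing(toy)
print(result)
```

Columns with exactly the same missing percentage are plausibly null in the same rows, though the percentage alone does not prove it; comparing the `isnull()` masks of two columns would confirm it.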