Python: Finding Missing Values in a Pandas Data Frame
How to find Missing values in a data frame using Python/Pandas
Introduction:
When you start working on a data science project, the data you are given is rarely clean. One of the most common issues with any data set is missing values. Most machine learning algorithms cannot handle missing values, so they need to be addressed before you apply any machine learning algorithm.
Missing values can be handled in different ways depending on whether the affected column is continuous or categorical. In this article I will cover how to find missing values. In the next article I will cover how to handle them.
Finding Missing Values:
For this exercise I will be using the “listings.csv” data file from the Seattle Airbnb data. The data can be found at this link: https://www.kaggle.com/airbnb/seattle?select=listings.csv
Step 1: Load the data frame and study the structure of the data frame.
The first step is to load the file and look at its structure. When you have a big data set with a large number of columns, it is hard to look at each column and study its type.
To find out how many of the columns are categorical and how many are numerical, we can use the pandas “dtypes” attribute to get the data type of each column, and the pandas “value_counts()” function to get the count of each data type. value_counts() groups all the unique instances and gives the count of each of those instances.
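A minimal sketch of this step is below. The small data frame is a made-up stand-in for listings.csv (the column names exist in the Seattle data, the values are invented), so the counts it prints differ from the real file. With the real file you would load it with pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# With the real data you would load the file instead, for example:
# df = pd.read_csv("listings.csv")

# Made-up stand-in for listings.csv; dtypes are set explicitly for clarity.
df = pd.DataFrame({
    "name": pd.Series(["Cozy loft", "Downtown studio", None, "Garden suite"], dtype="object"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})

# dtypes holds one data type per column; value_counts() counts each type.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)
```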
In this data set, 62 columns are objects (categorical data), 17 columns are of float data type and 13 columns are of int data type.
Step 2: Separate categorical and numerical columns in the data frame
The reason to separate the categorical and numerical columns is that the method of handling missing values differs between these two data types, which I will walk through in the next article.
The easiest way to achieve this step is to filter the columns of the original data frame by data type. By using the “dtypes” attribute and the equality operator you can find out which columns are objects (categorical variables) and which are not.
To get the names of the columns which satisfy the above conditions we can use “df.columns”. Indexing “df.columns” with each condition gives the column names which are objects and the column names which are not objects.
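Here is a rough sketch of that filtering on a made-up stand-in for listings.csv. The name “num_vars” comes from the article; “cat_vars” is my assumed name for the categorical counterpart. Depending on your pandas version and settings, text columns may be read in with a dedicated string dtype rather than object, in which case the comparison would need adjusting.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "name": pd.Series(["Cozy loft", "Downtown studio", None, "Garden suite"], dtype="object"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})

# Boolean mask per column: True where the column's dtype is object.
is_object = df.dtypes == "object"

# Index df.columns with the mask to get the matching column names.
cat_vars = df.columns[is_object]    # assumed name, categorical columns
num_vars = df.columns[~is_object]   # name used in the article, everything else

print(list(cat_vars))
print(list(num_vars))
```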
This splits the columns of the original data frame into two groups, each assigned to a new variable: one for categorical variables and one for non-categorical variables.
Step 3: Find the missing values
Finding the missing values works the same way for both categorical and continuous variables. We will use “num_vars”, which holds all the columns which are not of object data type.
df[num_vars] will give you all the columns listed in “num_vars”, that is, all the columns in the data frame which are not of object data type.
We can use the pandas “isnull()” function to find all the fields which have missing values. It returns True for a field whose value is missing and False for a field whose value is present.
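As a small illustration, assuming the same made-up stand-in for listings.csv as before, isnull() gives back a data frame of the same shape filled with True and False. The variable name “null_mask” is mine.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "name": pd.Series(["Cozy loft", "Downtown studio", None, "Garden suite"], dtype="object"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})
num_vars = df.columns[df.dtypes != "object"]

# One True/False per field: True where the value is missing.
null_mask = df[num_vars].isnull()
print(null_mask)
```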
To get how many missing values are in each column we use sum() along with isnull(). This sums up all the True values in each column from the step above.
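A quick sketch of the count, again on an invented stand-in for listings.csv. Summing works because True counts as 1 and False as 0. The variable name “missing_counts” is my own.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "name": pd.Series(["Cozy loft", "Downtown studio", None, "Garden suite"], dtype="object"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})
num_vars = df.columns[df.dtypes != "object"]

# Column-wise sum of the True/False mask = number of missing values per column.
missing_counts = df[num_vars].isnull().sum()
print(missing_counts)
```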
It's always good practice to sort the columns in descending order so you can see which columns have the most missing values. To do this we can use the sort_values() function. By default this function sorts in ascending order. Since we want the columns with the most missing values first, we set it to descending by passing the “ascending=False” parameter to sort_values().
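The sorting step might look like this on the invented stand-in data. The columns are deliberately listed out of order so the effect of sort_values() is visible, and “sorted_counts” is just a name I picked.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
})
num_vars = df.columns[df.dtypes != "object"]

# ascending=False puts the columns with the most missing values first.
sorted_counts = df[num_vars].isnull().sum().sort_values(ascending=False)
print(sorted_counts)
```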
The above gives you the count of missing values in each column. To get the % of missing values in each column you can divide by the length of the data frame. You can use “len(df)”, which gives you the number of rows in the data frame. The result is a fraction between 0 and 1; multiply by 100 to express it as a percentage.
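To make the arithmetic concrete, here is a sketch on the invented stand-in data, which has 4 rows. The names “missing_share” and “missing_pct” are mine, and the numbers will not match the real listings.csv.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})
num_vars = df.columns[df.dtypes != "object"]

# Count of missing values per column, divided by the number of rows.
missing_share = df[num_vars].isnull().sum().sort_values(ascending=False) / len(df)

# Same numbers expressed as percentages.
missing_pct = missing_share * 100
print(missing_pct)
```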
In this data set the license column is missing 100% of its data and the square_feet column is missing 97% of its data.
Conclusion
This article covered how to find missing values in a data frame using the Python pandas library. The steps are:
Use isnull() function to identify the missing values in the data frame
Use sum() function to get the number of missing values per column.
Use sort_values(ascending=False) function to get columns with the missing values in descending order.
Divide by len(df) to get % of missing values in each column.
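If you want all four steps in one place, they could be wrapped in a small helper like the one below. This is only a sketch: the function name “missing_value_report” and the output column names are my own, not something from pandas, and the data frame is the same invented stand-in for listings.csv.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame):
    """Return missing-value counts and percentages per column, highest first."""
    counts = frame.isnull().sum()          # isnull() + sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": counts / len(frame) * 100,  # divide by the row count
    })
    return report.sort_values("missing_count", ascending=False)


# Made-up stand-in for listings.csv (invented values, explicit dtypes).
df = pd.DataFrame({
    "name": pd.Series(["Cozy loft", "Downtown studio", None, "Garden suite"], dtype="object"),
    "license": pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64"),
    "square_feet": pd.Series([np.nan, 800.0, np.nan, np.nan], dtype="float64"),
    "bedrooms": pd.Series([1, 1, 2, 3], dtype="int64"),
})

report = missing_value_report(df)
print(report)
```

Because finding missing values works the same way for categorical and numerical columns, the helper can be called on the whole data frame or on a subset such as df[num_vars].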
In this article we identified missing values; in the next we will go over how to handle them.
Translated from: https://medium.com/analytics-vidhya/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd