Using Pandas Method Chaining to Improve Code Readability
We have been talking about using the Pandas pipe function to improve code readability. In this article, let’s have a look at Pandas Method Chaining.
In Data Processing, it is often necessary to perform operations on a certain row or column to obtain new data. Instead of writing
df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)
df = df.pivot_table(...)
df = df.rename(...)
We can do
(pd.read_csv('data.csv')
 .fillna(...)
 .query('some_condition')
 .assign(new_column=lambda df: pd.cut(...))
 .pivot_table(...)
 .rename(...)
)
Method chaining has always been available in Pandas, but support for it has grown through the addition of new “chain-able” methods, for example query(), assign(), pivot_table(), and in particular pipe(), which allows user-defined functions in a method chain.
Method chaining is a programmatic style of invoking multiple method calls sequentially with each call performing an action on the same object and returning it.
It eliminates the cognitive burden of naming variables at each intermediate step. Fluent Interface, a method of creating object-oriented APIs, relies on method cascading (aka method chaining). This is akin to piping in Unix systems.
By Adiamaan Keerthi
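To make the definition above concrete, here is a minimal sketch that runs the same three operations twice, once with reassignment and once as a chain. The toy DataFrame and its column names are invented for illustration. Note that most Pandas methods return a new object rather than modifying the original, which is what makes the chain possible.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [10.0, None, 30.0, 40.0],
})

# Step-by-step style: one reassignment per operation.
step = toy.fillna({"score": 0.0})
step = step.query("score > 5")
step = step.assign(double=step["score"] * 2)

# Chained style: each method returns a new DataFrame for the next call.
chained = (
    toy
    .fillna({"score": 0.0})
    .query("score > 5")
    .assign(double=lambda d: d["score"] * 2)
)

print(chained)
```

Both versions produce the same result. The lambda inside assign() receives the DataFrame as it exists at that point in the chain, so no intermediate variable is needed.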
Method chaining substantially increases the readability of the code. Let’s dive into a tutorial to see how it improves our code readability.
For source code, please visit my Github notebook.
Dataset preparation
For this tutorial, we will be working on the Titanic Dataset from Kaggle. This is a very famous dataset and very often is a student’s first step in data science. Let’s import some libraries and load data to get started.
import pandas as pd
import sys
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

df = pd.read_csv('data/train.csv')
df.head()
We load the train.csv file into a Pandas DataFrame.
Preview of Titanic data
Data dictionary from Kaggle
Let’s start by checking for missing values. We can use seaborn to create a simple heatmap to see where values are missing.
sns.heatmap(df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')
Output of seaborn heatmap plot for missing values
Age, Cabin, and Embarked have missing values. The proportion of missing Age values is likely small enough to replace reasonably with some form of imputation. The Cabin column is missing a large share of its values. The proportion of missing Embarked values is very small.
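The heatmap gives a visual impression. As a complement, the same information can be read off as numbers. The sketch below uses a small invented DataFrame rather than the real train.csv, so the counts shown are not the Titanic figures.

```python
import pandas as pd

# Toy stand-in for the Titanic columns, invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# Number of missing values per column.
missing_count = toy.isna().sum()

# Share of missing values per column (mean of a boolean mask).
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

On the real data, running the same two lines against df would give the exact proportions that the paragraph above describes qualitatively.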
Task
Suppose we have been asked to take a look at passengers who departed from Southampton, and work out the survival rate for different age groups and Pclass.
Let’s split this task into several steps and accomplish them step by step.
Data cleaning: replace the missing Age with some form of imputation
Create a pivot table to display the survival rate for different age groups and Pclass
Cool, let’s go ahead and use Pandas Method Chaining to accomplish them.
1. Replacing the missing Age with some form of imputation
As mentioned in the Dataset preparation section, we would like to replace the missing Age with some form of imputation. One way to do this is by filling in the mean age of all the passengers. However, we can be smarter about this and check the average age by passenger class. For example:
sns.boxplot(x='Pclass',
            y='Age',
            data=df,
            palette='winter')
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these average age values to impute based on Pclass for Age.
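The per-class averages can also be computed directly instead of being read off a plot. The following is a sketch on invented toy data, so the resulting numbers differ from the values used in this tutorial. The name age_map is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration; not the real Titanic values.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 36.0, 30.0, None, 20.0, 24.0],
})

# Mean age per passenger class; missing ages are skipped by mean().
mean_age = toy.groupby("Pclass")["Age"].mean()

# Round to whole years and turn the result into a {Pclass: age} mapping.
age_map = mean_age.round().astype(int).to_dict()

print(age_map)
```

A mapping built this way has the same shape as a hand-written dictionary keyed by Pclass, so it could be passed to an imputation function in the same manner.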
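pclass_age_map = {
    1: 37,
    2: 29,
    3: 24,
}

def replace_age_na(x_df, fill_map):
    # Rows where Age is missing
    cond = x_df['Age'].isna()
    # Look up the replacement age for each of those rows from its Pclass
    res = x_df.loc[cond, 'Pclass'].map(fill_map)
    x_df.loc[cond, 'Age'] = res
    return x_df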
x_df['Age'].isna() selects the Age column and detects the missing values. Then, x_df.loc[cond, 'Pclass'] accesses the Pclass values of those rows, and Pandas map() substitutes each value with the corresponding entry in fill_map. Finally, x_df.loc[cond, 'Age'] = res replaces all missing Age values with res.
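One thing to be aware of is that the function above writes into the DataFrame it receives. That is harmless when the input comes straight from read_csv() in a chain, but it can surprise callers who pass in a DataFrame they still need. The sketch below is a hypothetical variant (replace_age_na_copy is not part of the original tutorial) that works on a copy, shown on invented toy data.

```python
import pandas as pd

def replace_age_na_copy(x_df, fill_map):
    """Hypothetical variant of replace_age_na that leaves its input untouched."""
    out = x_df.copy()
    cond = out["Age"].isna()
    # Map each affected row's Pclass to its replacement age.
    out.loc[cond, "Age"] = out.loc[cond, "Pclass"].map(fill_map)
    return out

# Toy data, invented for illustration.
toy = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Age": [50.0, None, None, 4.0]})

filled = toy.pipe(replace_age_na_copy, {1: 37, 2: 29, 3: 24})

print(filled)
print(toy)  # the original still contains its missing values
```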
Running the following code
res = (
    pd.read_csv('data/train.csv')
    .pipe(replace_age_na, pclass_age_map)
)
res.head()
All missing ages should be replaced based on Pclass for Age. Let’s check this by running the heatmap on res.
sns.heatmap(res.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='viridis')
Great, it works!
2. Select passengers who departed from Southampton
According to the Titanic Data Dictionary, passengers who departed from Southampton should have Embarked with the value S. Let’s select them using the Pandas query() function.
res = (
    pd.read_csv('data/train.csv')
    .pipe(replace_age_na, pclass_age_map)
    .query('Embarked == "S"')
)
res.head()
To evaluate the query result, we can check it with value_counts()
res.Embarked.value_counts()

S    644
Name: Embarked, dtype: int64
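For readers new to query(), the sketch below shows on invented toy data that the query string selects the same rows as an ordinary boolean mask, and that rows with a missing Embarked value are not selected.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q"]})

# Chain-friendly form used in this tutorial.
via_query = toy.query('Embarked == "S"')

# Equivalent boolean-mask form.
via_mask = toy[toy["Embarked"] == "S"]

print(via_query)
```

The query() form is the one that fits into a chain, because it does not need a name for the intermediate DataFrame.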
3. Convert ages to groups of age ranges: ≤12, Teen (≤ 18), Adult (≤ 60) and Older (>60)
We did this with a custom function in the Pandas pipe function article. Alternatively, we can use the Pandas built-in method assign() to add new columns to a DataFrame. Let’s go ahead with assign().
bins = [0, 13, 19, 61, sys.maxsize]
labels = ['<12', 'Teen', 'Adult', 'Older']

res = (
    pd.read_csv('data/train.csv')
    .pipe(replace_age_na, pclass_age_map)
    .query('Embarked == "S"')
    .assign(ageGroup=lambda df: pd.cut(df['Age'], bins=bins, labels=labels))
)
res.head()
Pandas assign() is used to create a new column ageGroup. The new column is created with a lambda function together with Pandas cut() to convert ages to groups of ranges.
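It can help to see how cut() treats ages that fall exactly on a bin edge. By default cut() uses right-closed intervals, so with the bins above an age of exactly 13 lands in the first group. The sketch below uses a handful of invented ages and also shows the right=False option, which is one possible way to move the edge values into the next group. Which behaviour matches the intended groups is an assumption the reader should check.

```python
import sys
import pandas as pd

bins = [0, 13, 19, 61, sys.maxsize]
labels = ['<12', 'Teen', 'Adult', 'Older']

# Ages chosen to sit on and around the bin edges, invented for illustration.
ages = pd.Series([5.0, 13.0, 14.0, 19.0, 20.0, 61.0, 62.0])

# Default: intervals are (0, 13], (13, 19], (19, 61], (61, maxsize].
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 13), [13, 19), [19, 61), [61, maxsize).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "default": right_closed, "right=False": left_closed}))
```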
By running the code, we should get an output like below:
4. Create a pivot table to display the survival rate for different age groups and Pclass
A pivot table allows us to gain insights into our data. Let’s figure out the survival rate with it.
bins = [0, 13, 19, 61, sys.maxsize]
labels = ['<12', 'Teen', 'Adult', 'Older']

(
    pd.read_csv('data/train.csv')
    .pipe(replace_age_na, pclass_age_map)
    .query('Embarked == "S"')
    .assign(ageGroup=lambda df: pd.cut(df['Age'], bins=bins, labels=labels))
    .pivot_table(
        values='Survived',
        columns='Pclass',
        index='ageGroup',
        aggfunc='mean')
)
The first parameter values='Survived' specifies the column Survived to aggregate. Since the value of Survived is 1 or 0, we can use the aggregation function mean to calculate the survival rate and therefore aggfunc='mean' is used. index='ageGroup' and columns='Pclass' will display ageGroup as rows and Pclass as columns in the output table.
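To see why the mean of a 0/1 column is a survival rate, here is a small sketch on invented data where the rates are easy to verify by hand.

```python
import pandas as pd

# Toy data, invented for illustration; not the real Titanic values.
toy = pd.DataFrame({
    "ageGroup": ["Teen", "Teen", "Adult", "Adult", "Adult", "Adult"],
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

rates = toy.pivot_table(
    values="Survived",
    columns="Pclass",
    index="ageGroup",
    aggfunc="mean",
)

print(rates)
```

One of two Teens in class 1 survived, so that cell is 0.5. One of three Adults in class 3 survived, so that cell is one third. There are no Teens in class 3, so that cell is NaN.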
By running the code, we should get an output like below:
5. Improve the display of the pivot table by renaming axis labels and formatting values
The output we have got so far is not very self-explanatory. Let’s go ahead and improve the display.
bins = [0, 13, 19, 61, sys.maxsize]
labels = ['<12', 'Teen', 'Adult', 'Older']

(
    pd.read_csv('data/train.csv')
    .pipe(replace_age_na, pclass_age_map)
    .query('Embarked == "S"')
    .assign(ageGroup=lambda df: pd.cut(df['Age'], bins=bins, labels=labels))
    .pivot_table(
        values='Survived',
        columns='Pclass',
        index='ageGroup',
        aggfunc='mean')
    .rename_axis('', axis='columns')
    .rename('Class {}'.format, axis='columns')
    .style.format('{:.2%}')
)
rename_axis() is used to clear the columns label. After that, rename('Class {}'.format, axis='columns') is used to format the column labels. Finally, style.format('{:.2%}') is used to format values as percentages with 2 decimal places.
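The renaming steps can be tried in isolation on a small hand-built table, as in the sketch below (values invented for illustration). Because .style returns a Styler object meant for notebook display rather than a DataFrame, the sketch formats the values as plain strings instead, which is an alternative for contexts where the percentages are needed as text.

```python
import pandas as pd

# A small table shaped like the pivot output, with invented values.
table = pd.DataFrame(
    [[0.5, 0.25], [0.75, 0.125]],
    index=pd.Index(["Teen", "Adult"], name="ageGroup"),
    columns=pd.Index([1, 3], name="Pclass"),
)

renamed = (
    table
    .rename_axis('', axis='columns')            # clear the "Pclass" label
    .rename('Class {}'.format, axis='columns')  # 1 -> "Class 1", 3 -> "Class 3"
)

# Plain-string alternative to .style.format('{:.2%}').
as_text = renamed.apply(lambda col: col.map('{:.2%}'.format))

print(as_text)
```

Passing 'Class {}'.format works because rename() accepts any callable and applies it to each label.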
By running the code, we should get an output like below
Performance and drawback
In terms of performance, Data School [2] suggests that a method chain tells pandas everything ahead of time, so pandas can plan its operations more efficiently and should perform better than the conventional approach. Pandas evaluates each call eagerly, though, so it is worth measuring on your own data rather than assuming a speedup.
Method chains are more readable. However, a very long chain can become less readable, especially when other functions are called inside the chain, as with cut() inside assign() in our tutorial.
In addition, a major drawback of using Method Chaining is that debugging can be harder, especially in a very long chain. If something looks wrong at the end, you don’t have intermediate values to inspect.
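One common way to soften the debugging drawback is to insert a pass-through step with pipe() that records something about the DataFrame and then returns it unchanged. The helper below, peek, is a hypothetical name and not part of Pandas. The data is invented for illustration.

```python
import pandas as pd

def peek(df, label, log):
    """Hypothetical helper: record the shape at this point and pass df through."""
    log.append((label, df.shape))
    return df

log = []

# Toy data, invented for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", "S"], "Age": [22.0, 38.0, None]})

result = (
    toy
    .pipe(peek, "loaded", log)
    .query('Embarked == "S"')
    .pipe(peek, "after query", log)
    .dropna(subset=["Age"])
    .pipe(peek, "after dropna", log)
)

for label, shape in log:
    print(label, shape)
```

Because peek returns its input unchanged, the pipe() calls can be removed once the chain behaves as expected without affecting the result.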
For a longer discussion of this topic, see Tom Augspurger’s Method Chaining post [1].
That’s it
Thanks for reading.
Please check out the notebook on my Github for the source code.
Stay tuned if you are interested in the practical aspect of machine learning.
Lastly, here are 2 related articles you may be interested in:
Working with missing values in Pandas
Using Pandas pipe function to improve code readability
Original article: https://towardsdatascience.com/using-pandas-method-chaining-to-improve-code-readability-d8517c5626ac