Practical Pandas Guide
Pandas is a very powerful and versatile Python data analysis library that expedites the data analysis and exploration process. The best way to learn the functions and methods offered by pandas is by practicing.
Practice makes perfect.
In this post, we will do lots of examples to explore various capabilities of pandas. We will do both simple and advanced examples to see what pandas is capable of.
As always, we start with importing numpy and pandas.
import pandas as pd
import numpy as np
Let’s first create a sample dataframe to work on. We can pass a dictionary to the DataFrame function of pandas.
df = pd.DataFrame({'num_1': np.random.random(100),
                   'num_2': np.random.random(100),
                   'num_3': np.random.randint(0, 5, 100),
                   'num_4': np.random.randint(0, 100, 100)})
df.head()
We’ve used numpy arrays to create numerical columns. Let’s also add categorical columns to our dataframe.
from random import sample

name = ['Linda','John','Ashley','Xavi','Betty','Mike'] * 100
cat = ['A','B','C','D'] * 100
names = sample(name, 100)
cats = sample(cat, 100)
The lists “names” and “cats” contain 100 randomly selected samples from the longer lists “name” and “cat”. We used the sample function from Python’s random module.
It is time to add these two categorical features to the dataframe.
df.insert(0, 'names', names)
df['cats'] = cats
df.head()
We added two new columns in two different ways. df['col'] = col adds the new column at the end. We can specify the location of the new column using the insert function, as we did with the “names” column.
Consider that we are interested in the rows in which “num_1” is greater than “num_2”. The following two lines of code accomplish this task and display the first five rows.
df[df.num_1 > df.num_2][:5]
df.query('num_1 > num_2')[:5]

You can use either one, but I prefer the query function. I think it is simpler in the case of more complex filters.
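For instance, with several conditions and an external variable, chained boolean indexing becomes harder to read than the equivalent query call. The filter below is made up purely for illustration and is not part of the original analysis:

# Illustrative only: a made-up multi-condition filter written both ways
threshold = 0.3
df[(df.num_1 > df.num_2) & (df.num_3 >= 2) & (df.num_1 > threshold)][:5]
df.query('num_1 > num_2 and num_3 >= 2 and num_1 > @threshold')[:5]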
If we want to see the comparison of “num_1” and “num_2” based on different categories (“cats”), we can apply the groupby function on the filtered dataframe.

df1 = df.query('num_1 > num_2')[['cats','names']].groupby('cats').count().reset_index()
df1
The column “names” is irrelevant here; it was selected only so that we can count rows. It seems like categories C and D have more rows in which “num_1” is higher than “num_2”. However, these numbers do not make sense unless we know how many rows each category has in the entire dataframe.
ser = df['cats'].value_counts()
df2 = pd.concat((df1, ser), axis=1)
df2.rename(columns={'names':'num_1>num_2', 'cats':'total'}, inplace=True)
df2

We have created a series that contains the count of each category in the “cats” column using value_counts. Then we combined df1 and ser with the concat function and renamed the columns. The final dataframe df2 shows the total number of rows for each category as well as the number of rows that fit the filtering condition.
Assume we want to see the average value in “num_4” for each category in “cats”, but only for a few names. In this case, we can use the isin method for filtering and then apply groupby.
name = ['Ashley','Betty','Mike']
df[df.names.isin(name)][['cats','num_4']].groupby('cats').mean()

If the number of occurrences is also needed, we can apply multiple aggregate functions with groupby.
name = ['Ashley','Betty','Mike']
df[df.names.isin(name)][['cats','num_4']].groupby('cats').agg(['mean','count'])

We have 4 different measurements stored in 4 different columns. We can combine them into one column and indicate the name of the measurement in another column. Here is the original dataframe:
The melt function can be used to achieve what I just described.
df_melted = df.melt(id_vars=['names','cats'])
df_melted.head()

Melt is especially useful when working with wide dataframes (i.e. lots of features). For instance, if we had 100 different measurements (num_1 to num_100), it would be much easier to do analysis on a melted dataframe.
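To give a rough sense of why the long format helps, the sketch below summarizes every measurement column with a single groupby; “variable” and “value” are the default column names that melt produces:

# One groupby on the melted (long) frame covers all measurement columns at once
df_melted.groupby(['cats', 'variable'])['value'].mean()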
Our dataframe contains measurements, so it is likely that we will update it by adding new measurements. Let’s say we update the dataframe with the new_row below.
new_row = {'names':'Mike', 'num_1':0.678, 'num_2':0.345,
           'num_3':3, 'num_4':[68,80], 'cats':'C'}
df_updated = df.append(new_row, ignore_index=True)
df_updated.tail()
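A side note: DataFrame.append, used above, was deprecated in pandas 1.4 and removed in 2.0, so on a recent version the same update would need pd.concat instead. A minimal sketch:

# Equivalent update for pandas 2.0+, where DataFrame.append no longer exists
df_updated = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
df_updated.tail()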
The new_row is added, but there is a problem. It contains a couple of values in “num_4”. We should have them in separate rows. The explode function of pandas can be used for this task.
df_updated = df_updated.explode('num_4').reset_index(drop=True)
df_updated.tail()
We have separated the values in “num_4”. There might be cases in which a column contains lots of combined values in many rows. The explode function may come in handy in those cases.
There might be cases in which we need to replace some values. The pandas replace function makes it very simple. We can even replace multiple values at once by passing a dictionary.
replacement = {'D':'F', 'C':'T'}
df.cats.replace(replacement, inplace=True)
df.head()
Let’s assume we need to replace some values based on a condition. The condition is specified as “the values below 0.5 in num_1 are to be replaced with 0”. We can use the where function in this case.
df['num_1'] = df['num_1'].where(df['num_1'] >= 0.5, 0)
df.head()

The way where works is that the values that fit the condition are kept and the remaining values are replaced with the specified value. where(df['num_1'] >= 0.5, 0) keeps all the values in “num_1” that are greater than or equal to 0.5 and replaces the remaining values with 0.
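For completeness, the same replacement could be written with the complementary mask method or with numpy's where; the lines below are sketches of alternatives shown only for comparison, not part of the original workflow:

# mask replaces the values where the condition IS met (the opposite of where)
df['num_1'] = df['num_1'].mask(df['num_1'] < 0.5, 0)
# np.where picks element-wise between two alternatives
df['num_1'] = np.where(df['num_1'] >= 0.5, df['num_1'], 0)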
The dataframe contains 4 numerical and 2 categorical features. We may need to see a quick summary of how the numerical values change based on categories. The pandas pivot_table function can provide this kind of summary, and it is very flexible in terms of display options.
df.pivot_table(index='names', columns='cats', values='num_2', aggfunc='mean', margins=True)

This table shows how num_2 values change according to names-cats combinations. It is better to set the margins parameter to True to see the comparison with the overall values. There are many options for the aggfunc parameter, such as count, min, and max.
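Since aggfunc also accepts a list, several statistics can be shown side by side; the combination below is only an illustration:

# Multiple aggregation functions in one pivot table (illustrative)
df.pivot_table(index='names', columns='cats', values='num_2',
               aggfunc=['mean', 'count'], margins=True)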
A highly common task with dataframes is to handle missing values. The dataframe we created does not have any missing values. Let’s first add some missing values randomly.
a = np.random.randint(0, 99, 20)
df.iloc[a, 3] = np.nan
We created an array with 20 random integers between 0 and 98 (the upper bound of randint is exclusive). Then we used it as the index in iloc. The rows at those positions in the fourth column (column index 3) are replaced with np.nan, which is the missing value representation of pandas.
df.isna().sum() returns the number of missing values in each column.
Although we passed in 20 indices, the number of missing values turns out to be 15. This is because of the duplicate values in the array a.
len(np.unique(a))
15
We can replace the missing values with the fillna function. We can use a constant to replace missing values, or a statistic of the column such as the mean or median.
df['num_3'].fillna(df['num_3'].mean(), inplace=True)
df.isna().sum()
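As a sketch of the other options mentioned above, the median or even a group-wise mean could be used instead; the lines below are purely illustrative, since the missing values have already been filled at this point:

# Fill with the column median instead of the mean
df['num_3'] = df['num_3'].fillna(df['num_3'].median())
# Or fill with the mean of each category in 'cats' (group-wise fill)
df['num_3'] = df['num_3'].fillna(df.groupby('cats')['num_3'].transform('mean'))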
This is just a small piece of what you can do with pandas. The more you work with it, the more useful and practical techniques you will find. I suggest approaching a problem in different ways and never placing a limit on your thinking.
When you are trying hard to find a solution to a problem, you will almost always learn more than just the solution at hand. You will improve your skill set step by step and build a robust and efficient data analysis process.
Thank you for reading. Please let me know if you have any feedback.
Translated from: https://towardsdatascience.com/practical-pandas-guide-b3eedeb3e88