Exploring Datasets with Pandas Built-in Functionality
Nearly every data scientist relies on Pandas, the famous data-manipulation library built on top of the Python programming language. It is a powerful Python package that makes importing, cleaning, analyzing, and exporting data easier.
In a nutshell, Pandas is like Excel for Python, with tables (which in pandas are called DataFrames), rows and columns (a single column is a Series), and many functions that make it an excellent library for data inspection, processing, and manipulation.
This article shares some useful pandas insights and hacks that make data analysis more fun and convenient.
import pandas as pd

While reading a DataFrame, we often face the problem that the complete set of rows is not visible, which makes analyzing the data quite difficult. Pandas provides the set_option function, which lets us define the maximum number of rows and columns to display:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
1. Importing the data
Here are the different formats of data that can be imported using pandas read functions:
CSV, Excel, HTML, binary files, Pickle files, JSON files, SQL queries
The format most commonly used for machine learning is CSV (comma-separated values), and every data scientist encounters a CSV file on a daily basis, so we will restrict ourselves to CSV files here.
Some of the important keyword arguments when reading a CSV file in pandas are:
delimiter: a blank space, comma, or other character or symbol that separates different cells in a row
header: the row to be used for column names
index_col: the column(s) to be used as row labels
usecols: the names of columns to read; if provided, only that subset of the file is read
skiprows: the number of lines to skip, generally useful when a file has blank lines or unnecessary content at the start
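Putting those arguments together, here is a minimal sketch of reading a CSV. The data is illustrative; an in-memory string stands in for a real file path such as "cars.csv":

```python
import pandas as pd
from io import StringIO

# A small in-memory CSV standing in for a real file (illustrative data).
raw = "Company,Segment,Type\nKia,premium,large\nHyundai,budget,small\n"

df = pd.read_csv(
    StringIO(raw),       # a file path like "cars.csv" works the same way
    delimiter=",",       # cells are comma-separated
    header=0,            # the first row holds the column names
    index_col=None,      # auto-generate an integer row index
    usecols=["Company", "Type"],  # read only a subset of the columns
    skiprows=0,          # nothing to skip in this file
)
print(df.shape)  # (2, 2)
```

Note that usecols keeps only the requested columns, so the resulting frame has two columns even though the file has three.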
Once the data is imported, it is known as a DataFrame in Pandas terminology.
2. Understanding the data
import numpy as np
import pandas as pd

df = pd.DataFrame({"Company": ["Kia", "Hyundai", "Hyundai", "Hyundai", "Hyundai", "Honda", "Honda", "Honda", "Honda", "Kia"],
                   "Segment": ["premium", "budget", "luxury", "premium", "budget", "premium", "budget", "budget", "premium", "luxury"],
                   "Type": ["large", "small", "large", "small", "small", "large", "small", "small", "large", "large"],
                   "CrashRating": [4.5, 2.5, 4, np.nan, 3, 4, 3, 4.2, 4.5, 4.2],
                   "CustomerFeedback": [9, 7, 5, 5, 8, 5.6, np.nan, 9, 9, 4.8]})
We can check the first or last n rows of the raw data using head(n) and tail(n):
df.head(5)  # the first 5 rows of the dataframe
df.tail(5)  # the last 5 rows of the dataframe
This data is displayed as a table, the same as it would be visualized in Excel or any other CSV reader. To interact more with the data, let's look at some useful built-in functions.
info: provides a summary of the dataframe: the number of rows and columns, the name of each column along with its count of non-null values, and the data type of each column
describe: analyzes the data statistically, and thus only returns results for numerical columns in the dataframe. It returns a table comprising count, mean, standard deviation, minimum, maximum, and quantile values, which are useful for detecting outliers and seeing the distribution of the data.
memory_usage: reports the memory usage of each column in bytes
dtypes: reports the data type of each column within the dataframe, returned as a Series
isnull or isna: used to find missing values in the dataframe. On its own it returns a boolean mask (True or False) indicating whether each value is NA; it can be combined with aggregations in multiple ways, for example:
df.isnull().mean() * 100  # the percentage of missing values in each column
unique: returns the distinct values in one pandas Series (i.e., in one specific column of the dataframe). Generally used to analyze categorical columns.
shape: gives the dimensionality of the dataframe as (rows, columns)
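The inspection functions above can be exercised together; here is a short sketch on a reduced version of the toy dataset (three rows, for brevity):

```python
import numpy as np
import pandas as pd

# A reduced version of the toy dataset, just for inspection.
df = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Honda"],
    "CrashRating": [4.5, np.nan, 4.0],
})

df.info()                        # rows, columns, non-null counts, dtypes
print(df.describe())             # count/mean/std/min/quantiles/max, numeric only
print(df.memory_usage())         # bytes used per column (plus the index)
print(df.dtypes)                 # dtype of each column, as a Series
print(df.isnull().mean() * 100)  # percentage of missing values per column
print(df["Company"].unique())    # distinct values of one column
print(df.shape)                  # (3, 2)
```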
3. Exploring the Data
Now that we have loaded our data into a DataFrame and understood its structure, let's select parts of the data and explore them. When it comes to selecting your data, you can do it either by index or based on certain conditions. In this section, we go through each of these methods and do some exploratory analysis.
Selecting the columns
The set of columns we need to analyze can be selected in the following ways:
df[['Company', 'Type']]
df.loc[:, ['Company', 'Type']]
df.iloc[:, [0, 2]]  # positions of 'Company' and 'Type'
Selecting the rows
Selecting specific rows for analysis can be achieved as follows:
df.iloc[[1, 2], :]
df.loc[[1, 2], :]
Selecting columns of a specific type
Sometimes it is helpful to select the subset of columns having a specific data type; select_dtypes can be used for this:
df.select_dtypes(include=['object'])

Selecting both rows and columns
You might wonder whether pandas is so weak that only one axis, either rows or columns, can be indexed at a time. It is not: we can select a subset of rows and columns in a single step:
df.iloc[0:2][['Segment', 'Type']]
df.iloc[0:2, 1:3]
Applying a filter
In a real scenario, selecting rows by their index numbers is rarely enough. The practical requirement is to filter out the rows that satisfy a certain condition. With our dataset, we can filter by conditions such as:
df[df['Type'] == 'large']
df[(df['Type'] == 'large') & (df['Segment'] == 'luxury')]
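As a side note, the same kind of filter can also be expressed with query() or isin(); a minimal sketch on toy data (the column values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Type": ["large", "small", "large"],
                   "Segment": ["premium", "luxury", "luxury"]})

# Equivalent ways to express row filters:
large = df.query("Type == 'large'")                  # string-expression form
lux = df[df["Segment"].isin(["luxury", "premium"])]  # membership test
print(len(large))  # 2
```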
4. Handling and Transforming the Data
After doing basic exploratory analysis on the data, it is time to handle missing values and transform the data to perform some more advanced exploration.
Handling missing data
Handling missing values is one of the trickiest and most crucial parts of data manipulation, because replacing missing cells changes the distribution of the data. Depending on the characteristics of the dataset and the task, we can choose to:
Drop missing values: we can drop a row or column having missing values. When more than about 40% of a column is missing, the whole column is often dropped from the analysis. Dropping a row eliminates the entire observation, reducing the size of the dataframe.
Replace missing values: depending on the distribution of the column, we can replace missing values with a special value or an aggregate value such as the mean, the median, or another dynamic value such as the average of similar observations. For time-series data, missing values are generally filled from a window of values before and after the observation:
df.fillna(axis=0, method='ffill', limit=1)
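The two strategies side by side, as a small sketch on toy data (the column names mirror the example dataset above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"CrashRating": [4.5, np.nan, 3.0],
                   "CustomerFeedback": [9.0, 7.0, np.nan]})

dropped = df.dropna()          # drop every row containing a missing value
filled = df.fillna(df.mean())  # replace NaNs with each column's mean
forward = df.ffill(limit=1)    # forward-fill, at most one step (time-series style)
print(dropped.shape)  # (1, 2) - only the first row is complete
```

Dropping shrinks the frame, while filling preserves its shape but shifts the column distributions toward the fill value.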
Dropping columns or rows
Rows or columns can be removed by specifying label names and the corresponding axis, or by specifying index or column names directly.
Dropping a column is helpful with a labeled dataframe, where we want to remove y_true from the training and test data.
df.drop(['CrashRating'], axis=1)
df.drop([0, 1], axis=0)
Group by
In many situations, we split the data into groups, such as bucketing records by a categorical value, and apply some functionality to each subset. Within the apply step, we can perform the following operations:
- Aggregation: computing a summary statistic
- Transformation: performing some group-specific operation
- Filtration: discarding data that fails some condition
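The three operations above can be sketched with groupby on toy data (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Company": ["Kia", "Hyundai", "Hyundai", "Honda"],
                   "CrashRating": [4.5, 2.5, 4.0, 3.0]})

# Aggregation: mean crash rating per company
means = df.groupby("Company")["CrashRating"].mean()

# Transformation: center each rating around its company's mean
centered = df.groupby("Company")["CrashRating"].transform(lambda s: s - s.mean())

# Filtration: keep only companies that appear more than once
frequent = df.groupby("Company").filter(lambda g: len(g) > 1)
print(means["Hyundai"])  # (2.5 + 4.0) / 2 = 3.25
```

Aggregation shrinks each group to one value, transformation keeps the original shape, and filtration keeps or drops whole groups.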
Pivot table
This is very similar to groupby: it also computes counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you choose the rows and columns to aggregate on and the values for those rows and columns. It lets us summarize data grouped by different values, including values in categorical columns.
The pivot_table function in pandas takes these arguments as input:
index, columns: the keys to group by along the rows and columns of the resulting table
values: the name of the column whose values are aggregated in the final table, grouped by the index and columns and summarized according to the aggregation function
aggfunc: the aggregation function, i.e., how rows are summarized, such as sum, mean, or count
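The arguments above fit together as in this minimal sketch, using a subset of the example dataset:

```python
import pandas as pd

df = pd.DataFrame({"Company": ["Kia", "Hyundai", "Hyundai", "Kia"],
                   "Type": ["large", "small", "large", "large"],
                   "CustomerFeedback": [9.0, 7.0, 5.0, 4.8]})

# Mean customer feedback per company (rows), split by vehicle type (columns).
table = pd.pivot_table(df, index="Company", columns="Type",
                       values="CustomerFeedback", aggfunc="mean")
print(table)
```

For instance, the Kia/large cell averages the two matching rows: (9.0 + 4.8) / 2 = 6.9.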
Merge and concatenation
When importing data from multiple files into separate dataframes, it becomes necessary to concatenate, merge, or join those files into one.
concat(): performs concatenation operations along an axis while applying optional set logic (union or intersection) to the indexes
df_new = pd.DataFrame({"Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
                       "Segment": ["premium", "budget", "luxury", "luxury", "luxury"],
                       "Type": ["large", "small", "large", "large", "large"],
                       "CrashRating": [3.8, 3.5, 4, 4.2, 3],
                       "CustomerFeedback": [8, 7, 7, 6, 7.5]})

df_result = pd.concat([df, df_new])
df_result = pd.concat([df, df_new], keys=['old', 'new'])
merge(): pandas has in-memory join operations very similar to those of relational databases such as SQL, where we use a join to combine two tables on a common key.
df_launchingyear = pd.DataFrame({"Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
                                 "LaunchingYear": [2015, 2018, 2017, 2012, 2019]})

pd.merge(df, df_launchingyear, on='Company')
Creating dummy variables
Categorical variables whose dtype is 'object' cannot be used as-is for training an ML model; we need to create dummy variables for such a column using pandas' get_dummies function:
pd.get_dummies(df['Company'])

Saving a dataframe
After performing exploratory analysis on the dataset, we may want to store the results as a new CSV file: for example, a table returned by pivot_table, a view with unnecessary details filtered out, or a new dataframe obtained from concat or merge operations.
Exporting the results as a CSV file is a simple step: we just call to_csv() with arguments similar to those we used while reading the CSV.
df.to_csv('./data.csv', index_label=False)

Conclusion
In this article, we listed some general pandas functions for analyzing datasets, gathered while working with Python and Jupyter Notebooks. We hope these simple hacks are useful to you and that you take something away from this article. Till then, Happy Coding!
Let us know if you like the blog, comment with any queries or suggestions, and follow us on LinkedIn and Instagram. Your love and support inspire us to share our learning in an even better way!
Translated from: https://medium.com/@datasciencewhoopees/exploring-datasets-with-pandas-inbuilt-functionality-6c322c0cdd7d