
[Learning Python] A Thorough Pandas Tutorial for Absolute Beginners (with Answers to the Chapter 4 Exercises of the Textbook 《Python数据分析与应用》)

Published: 2023/12/10 python 27 豆豆


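(The structure of this post follows the textbook 《Python數據分析與應用》.)

The task data is as follows:

When reading a csv file, the encoding argument must match the file's actual encoding. Common encodings include UTF-8, UTF-16, GBK, GB2312 and GB18030.

If the argument does not match the file's encoding, an error is raised: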

    import pandas as pd

    path = "D:/mystudy/大三上學期作業/PythonPython/chapter4/第4章任務數據/data/meal_order_info.csv"
    order1 = pd.read_csv(path, sep=',', encoding='utf-8')  # utf-8 is already the default

Error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: invalid continuation byte
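When the encoding of a file is unknown, one option is to try a few candidates in turn. The following is only a minimal sketch: the helper name read_csv_with_fallback, the candidate list and the tiny sample file are all assumptions made for this illustration, not part of the textbook task.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample, written as GBK so that UTF-8 decoding fails.
sample = "info_id,name\n417,苗宇怡\n413,徐毅凡\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "demo_gbk.csv")
with open(tmp_path, "w", encoding="gbk") as f:
    f.write(sample)


def read_csv_with_fallback(path, encodings=("utf-8", "gbk", "gb18030")):
    """Try each candidate encoding in order; return (DataFrame, encoding used)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode the file")


demo_df, used_encoding = read_csv_with_fallback(tmp_path)
print(used_encoding)
print(demo_df)
```

A successful decode does not prove the guess is right (some byte sequences are valid in several encodings), so it is still worth eyeballing the resulting text.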

Compare how the result changes when the file is read with different sep arguments:

    order1 = pd.read_csv(path, sep=',', encoding='gbk')
    order2 = pd.read_csv(path, sep=';', encoding='gbk')
    order3 = pd.read_csv(path, sep='"', encoding='gbk')
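To see the effect without the task file, here is a small sketch on an in-memory string (the sample text is made up for illustration). With a separator that never occurs in the data, each line ends up as a single field.

```python
import io

import pandas as pd

# Hypothetical comma-separated text standing in for the real csv file.
raw = "info_id,emp_id,name\n417,1442,A\n301,1095,B\n"

by_comma = pd.read_csv(io.StringIO(raw), sep=",")
by_semicolon = pd.read_csv(io.StringIO(raw), sep=";")

print(by_comma.shape)                 # three separate columns
print(by_semicolon.shape)             # everything lands in one column
print(by_semicolon.columns.tolist())  # the whole header line is one label
```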

The os.listdir() function from the os library lists the entries (files and folders) in the current working directory:

    import os
    print(os.listdir())
    type(os.listdir())

Output:

    ['meal_order_detail.xlsx','meal_order_detail1.sql','meal_order_detail2.sql','meal_order_detail3.sql','meal_order_info.csv','test.py','users.xlsx','數據特征說明.xlsx']
    list

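To list the files in another folder, pass a path to os.listdir(). A relative path is resolved against the current working directory, and '..' means "go up one level". (Suppose the working directory is somewhere under the "第4章任務數據" folder from the path in the first code block of this post.)

    os.listdir('../data/')

Output (not actually run; this is what it should give when the working directory is the data folder itself or one of its siblings; from the "第4章任務數據" folder itself, the path would be 'data/' instead):

    ['meal_order_detail.xlsx','meal_order_detail1.sql','meal_order_detail2.sql','meal_order_detail3.sql','meal_order_info.csv','test.py','users.xlsx','數據特征說明.xlsx']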
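How relative paths and '..' behave can be checked on a throwaway folder. This is a sketch under assumptions: the temporary directory stands in for the chapter folder, and the two empty files only borrow names from the listing above.

```python
import os
import tempfile

# A temporary folder plays the role of the chapter folder (assumption).
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
os.makedirs(data_dir)

# Create two empty placeholder files inside data/.
for name in ("meal_order_info.csv", "users.xlsx"):
    open(os.path.join(data_dir, name), "w").close()

in_root = sorted(os.listdir(root))          # contents of the parent folder
in_data = sorted(os.listdir(data_dir))      # contents of data/
# '..' climbs one level, so data/.. is the parent folder again.
via_parent = sorted(os.listdir(os.path.join(data_dir, "..")))

print(in_root)
print(in_data)
print(via_parent)
```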

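Data can be saved to disk either as a csv file or as an xls file (that is, the Excel format), using the DataFrame methods to_csv() and to_excel().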
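A minimal round trip through to_csv() might look like the sketch below. The DataFrame contents and the file name demo_out.csv are invented for illustration. to_excel() is used the same way but needs an Excel engine such as openpyxl installed, so it is left out here.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row table.
demo = pd.DataFrame({"info_id": [417, 413], "name": ["苗宇怡", "徐毅凡"]})

out_path = os.path.join(tempfile.mkdtemp(), "demo_out.csv")
# index=False keeps the row index out of the file.
demo.to_csv(out_path, index=False, encoding="gbk")

# Read it back with the same encoding and compare.
restored = pd.read_csv(out_path, encoding="gbk")
print(restored.equals(demo))
```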

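The core of pandas: DataFrame

Note: when picking out elements, prefer selecting by index label rather than by row position, because the rows may have been reordered or filtered since the data was loaded.

Let us first look at what order1 actually contains and check its type:

Notice that the leftmost column of the display is always the row index. The header line of the csv file does not become a data row; it becomes the column labels, available through .columns.

Note: for both DataFrame and Series, the index object itself is obtained through the .index attribute. loc[] selects rows by their index labels, but it does not return the index.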

    print("Main attributes of order1:")
    print("values:")
    print(order1.values)
    print("index:")
    print(order1.index)
    print("columns:")
    print(order1.columns)
    print("dtypes:")
    print(order1.dtypes)

Output:

    Main attributes of order1:
    values:
    [[417 1442 4 ... 1 18688880641 '苗宇怡']
     [301 1095 3 ... 1 18688880174 '趙穎']
     [413 1147 6 ... 1 18688880276 '徐毅凡']
     ...
     [692 1155 8 ... 1 18688880327 '習一冰']
     [647 1094 4 ... 1 18688880207 '章春華']
     [570 1113 8 ... 1 18688880313 '唐雅嘉']]
    index:
    RangeIndex(start=0, stop=945, step=1)
    columns:
    Index(['info_id', 'emp_id', 'number_consumers', 'mode', 'dining_table_id',
           'dining_table_name', 'expenditure', 'dishes_count', 'accounts_payable',
           'use_start_time', 'check_closed', 'lock_time', 'cashier_id', 'pc_id',
           'order_number', 'org_id', 'print_doc_bill_num', 'lock_table_info',
           'order_status', 'phone', 'name'],
          dtype='object')
    dtypes:
    info_id                 int64
    emp_id                  int64
    number_consumers        int64
    mode                  float64
    dining_table_id         int64
    dining_table_name       int64
    expenditure             int64
    dishes_count            int64
    accounts_payable        int64
    use_start_time         object
    check_closed          float64
    lock_time              object
    cashier_id            float64
    pc_id                 float64
    order_number          float64
    org_id                  int64
    print_doc_bill_num    float64
    lock_table_info       float64
    order_status            int64
    phone                   int64
    name                   object
    dtype: object

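Note that a DataFrame can be transposed (with .T)!

With transposition available, many tasks become simpler.

Note that a DataFrame also supports arithmetic (multiplication, for example)!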
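As a quick sketch of what transposition does, the small frame below reuses the first three info_id and emp_id values shown earlier (the frame itself is only an illustration). After .T the old column labels become the row index, so a former column is reached with loc[] rather than with square brackets.

```python
import pandas as pd

demo = pd.DataFrame({"info_id": [417, 301, 413],
                     "emp_id": [1442, 1095, 1147]})

flipped = demo.T  # rows and columns swap places

print(flipped.shape)
print(flipped.index.tolist())    # former column labels
print(flipped.columns.tolist())  # former row labels
# A former column is now a row, selected by label with loc:
print(flipped.loc["info_id"].tolist())
```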

Input (note: two int Series can be multiplied element by element!):

    order1['info_id']*order1['emp_id']

    Out[87]:
    0      601314
    1      329595
    2      473711
    3      483890
    4      428848
            ...
    940    701895
    941    731808
    942    799260
    943    707818
    944    634410
    Length: 945, dtype: int64

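1. Viewing data in a DataFrame

Use either of these commands:

    order1['info_id']   # dictionary-style access; the column name must be quoted

or:

    order1.info_id      # attribute-style access; no quotes needed

Output:

    0      417
    1      301
    2      413
    3      415
    4      392
          ...
    940    641
    941    672
    942    692
    943    647
    944    570
    Name: info_id, Length: 945, dtype: int64

Both forms work, but the first is recommended: it cannot clash with keywords or existing attribute names, and it also works for column names such as '123' that are not valid identifiers.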

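Note that after a DataFrame is transposed, the two forms above no longer fetch the same data, because the old column names have become row labels.

Next, let us take out one column, several columns, and all columns.

Keep one distinction in mind: in an expression like order1['info_id'][0:3], the first pair of brackets selects a column, not a row!

To get the index itself, use the .index attribute.

Two more methods worth knowing are .head(x) and .tail(x), which return the first x rows and the last x rows respectively (x defaults to 5 when omitted).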
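A short sketch of head() and tail() on a made-up ten-row frame (the values 100 to 109 are placeholders chosen for this example):

```python
import pandas as pd

demo = pd.DataFrame({"info_id": list(range(100, 110))})

first_five = demo.head()    # no argument: the first 5 rows
first_two = demo.head(2)    # the first 2 rows
last_three = demo.tail(3)   # the last 3 rows

print(len(first_five))
print(first_two.index.tolist())
print(last_three["info_id"].tolist())
```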

There are also the more flexible indexers .loc[] and .iloc[] (note the square brackets!), covered further below.

Take the first three elements of the 'info_id' Series:

Input:

    order1['info_id'][0:3]

    Out[43]:
    0    417
    1    301
    2    413
    Name: info_id, dtype: int64

Input:

    order1['info_id'].shape

    Out[45]: (945,)

Input:

    type(order1['info_id'])

    Out[46]: pandas.core.series.Series

Take the first three elements of the two columns 'info_id' and 'emp_id':

Input:

    order1[['info_id','emp_id']].shape  # do not forget the inner brackets: a list must be passed

    Out[48]: (945, 2)

Input:

    order1[['info_id','emp_id']][0:3]

    Out[49]:
       info_id  emp_id
    0      417    1442
    1      301    1095
    2      413    1147

Input:

    order1[['info_id','emp_id']].head(3)  # with no argument, head() returns 5 rows

    Out[53]:
       info_id  emp_id
    0      417    1442
    1      301    1095
    2      413    1147

Take the first three elements of all columns:

Input:

    order1[:][0:3]

    Out[50]:
       info_id  emp_id  number_consumers  ...  order_status        phone name
    0      417    1442                 4  ...             1  18688880641  苗宇怡
    1      301    1095                 3  ...             1  18688880174  趙穎
    2      413    1147                 6  ...             1  18688880276  徐毅凡
    [3 rows x 21 columns]

Input:

    order1[:][0:3].shape

    Out[51]: (3, 21)

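Next come the sibling indexers loc[] and iloc[] (essentially the same multi-dimensional access you get with arrays). They differ only in the kinds of arguments they accept, with some overlap.

Note: a row slice passed to loc[] includes both endpoints, whereas row and column slices passed to iloc[] include the start but exclude the stop. So loc[0:3] returns four rows and iloc[0:3] returns three.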

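Statements that raise an error:

    order1.iloc[0:3,'info_id']   # iloc accepts integer positions only
    order1.loc[0:3,0:3]          # loc expects column labels, and these columns are labelled with strings

Statements that work:

    order1.loc[0:3,'info_id']
    order1.loc[0:3,:]
    order1.iloc[0:3,0:3]
    order1.iloc[0:3,:]

Examples (the first two give the same output).

Input 1:

    order1.iloc[0:3,[0,1,2]]

    Out[65]:
       info_id  emp_id  number_consumers
    0      417    1442                 4
    1      301    1095                 3
    2      413    1147                 6

Input 2:

    order1.iloc[0:3,0:3]

    Out[62]:
       info_id  emp_id  number_consumers
    0      417    1442                 4
    1      301    1095                 3
    2      413    1147                 6

Input 3 (the labels show up as rows because the row selector is a single integer: selecting one row returns a Series whose index is the column labels; passing a list, as in order1.iloc[[3],[0,1,2]], keeps a one-row DataFrame):

    order1.iloc[3,[0,1,2]]

    Out[66]:
    info_id              415
    emp_id              1166
    number_consumers       4
    Name: 3, dtype: object

Input 4 (loc is extremely flexible: its first argument can be an expression that filters rows, much like a SELECT statement in a database. iloc cannot do this directly, because such an expression returns a boolean Series and iloc does not accept a boolean Series. It can still be done, with more effort, by taking the values of that Series so that it becomes a boolean array):

    order1.loc[order1['number_consumers']==5,['number_consumers','info_id']]

    Out[72]:
         number_consumers  info_id
    38                  5      688
    43                  5      317
    71                  5      831
    97                  5      266
    126                 5      712
    ..                ...      ...
    852                 5     1017
    858                 5     1025
    870                 5     1104
    896                 5     1300
    911                 5      588
    [71 rows x 2 columns]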
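The differences between the two indexers can be reproduced on a small frame. The sketch below reuses the first five rows of three columns shown in this post; the frame and variable names are only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "info_id": [417, 301, 413, 415, 392],
    "emp_id": [1442, 1095, 1147, 1166, 1094],
    "number_consumers": [4, 3, 6, 4, 10],
})

# Label slice: both endpoints included, so rows 0, 1, 2 and 3.
by_label = demo.loc[0:3, "info_id"]
# Position slice: stop excluded, so rows 0, 1 and 2.
by_position = demo.iloc[0:3, 0]

# Boolean filtering: loc takes the boolean Series directly,
# iloc needs it converted to a plain boolean array first.
mask = demo["number_consumers"] == 4
with_loc = demo.loc[mask, ["number_consumers", "info_id"]]
with_iloc = demo.iloc[mask.to_numpy(), [2, 0]]

# A scalar row selector drops a dimension; a list keeps the DataFrame.
row_as_series = demo.iloc[3, [0, 1, 2]]
row_as_frame = demo.iloc[[3], [0, 1, 2]]

print(len(by_label), len(by_position))
print(with_loc.equals(with_iloc))
print(type(row_as_series).__name__, type(row_as_frame).__name__)
```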

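There used to be a third indexer, ix[], but it was deprecated and has been removed from pandas (as of version 1.0), so it is not covered here.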

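2. Modifying data in a DataFrame

Back up the original data before modifying it!

    order2 = order1.copy()

When assigning to individual cells, use loc[] or iloc[] rather than chaining a column access with a row access; the chained form triggers a warning. (Readers who know the details are welcome to add to the discussion in the comments.)

(My current understanding: this is the same reason a 2-D array should be indexed as [x, y] rather than [x][y]. With [x][y] you first build an intermediate 1-D result and then index into that, which is slower and not the standard form. Here, order1['info_id'][0] = 400 first evaluates order1['info_id'], which may be a temporary copy rather than a view of the original, and then assigns into that intermediate object, so the change is not guaranteed to reach order1. That is why reading values this way raises no warning while assigning does. Chained [][] indexing is therefore fine for reading and for whole-column operations, but assignments to individual cells should go through loc[] or iloc[] in a single step. A whole-column assignment such as order1['use_start_time'] = pd.to_datetime(order1['use_start_time']), used below, is a single indexing step and causes no problem. This is only my personal understanding and is not authoritative.)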

Input:

    order1['info_id'][0] = 400

Warning:

    __main__:1: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Input:

    order1.loc[0,'info_id'] = 417

No warning.
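The recommended single-step assignment, together with the backup habit mentioned above, can be sketched as follows (a made-up three-row frame; the chained form is deliberately left out because its behaviour differs between pandas versions):

```python
import pandas as pd

demo = pd.DataFrame({"info_id": [417, 301, 413]})

# Keep an independent copy before changing anything.
backup = demo.copy()

# One indexing operation: row label and column label together.
demo.loc[0, "info_id"] = 400

print(demo.loc[0, "info_id"])    # the modified frame
print(backup.loc[0, "info_id"])  # the backup is untouched
```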

3. Adding data to a DataFrame

Adding a new column:

Input 1 (as mentioned earlier, int Series support arithmetic!):

    order1['hahaha']=order1['info_id']*10
    order1['hahaha']

    Out[84]:
    0      4170
    1      3010
    2      4130
    3      4150
    4      3920
           ...
    940    6410
    941    6720
    942    6920
    943    6470
    944    5700
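For comparison, here is a small sketch of two ways to add a column: in-place assignment as above, and DataFrame.assign(), which returns a new frame and leaves the original alone. The column names times_ten and doubled are invented for this example.

```python
import pandas as pd

demo = pd.DataFrame({"info_id": [417, 301, 413]})

# In-place: the new column is added to demo itself.
demo["times_ten"] = demo["info_id"] * 10

# assign() returns a new DataFrame; demo does not gain 'doubled'.
extended = demo.assign(doubled=demo["info_id"] * 2)

print(demo.columns.tolist())
print(extended.columns.tolist())
```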

4. Deleting data from a DataFrame

Input 1 (note that several columns can be dropped at once):

    order1.drop(labels=['order_status','lock_table_info'],axis=1)  # use axis=1 when labels are column labels

    Out[29]:
         info_id  emp_id  number_consumers  ...  print_doc_bill_num        phone name
    0        417    1442                 4  ...                 NaN  18688880641  苗宇怡
    1        301    1095                 3  ...                 NaN  18688880174  趙穎
    2        413    1147                 6  ...                 NaN  18688880276  徐毅凡
    3        415    1166                 4  ...                 NaN  18688880231  張大鵬
    4        392    1094                10  ...                 NaN  18688880173  孫熙凱
    ..       ...     ...               ...  ...                 ...          ...  ...
    940      641    1095                 8  ...                 NaN  18688880307  李靖
    941      672    1089                 6  ...                 NaN  18688880305  莫言
    942      692    1155                 8  ...                 NaN  18688880327  習一冰
    943      647    1094                 4  ...                 NaN  18688880207  章春華
    944      570    1113                 8  ...                 NaN  18688880313  唐雅嘉
    [945 rows x 19 columns]

Input 2 (drop row 0):

    order1.drop(labels=0,axis=0)

    Out[30]:
         info_id  emp_id  number_consumers  ...  order_status        phone name
    1        301    1095                 3  ...             1  18688880174  趙穎
    2        413    1147                 6  ...             1  18688880276  徐毅凡
    3        415    1166                 4  ...             1  18688880231  張大鵬
    4        392    1094                10  ...             1  18688880173  孫熙凱
    5        381    1243                 4  ...             1  18688880441  沈曉雯
    ..       ...     ...               ...  ...           ...          ...  ...
    940      641    1095                 8  ...             1  18688880307  李靖
    941      672    1089                 6  ...             1  18688880305  莫言
    942      692    1155                 8  ...             1  18688880327  習一冰
    943      647    1094                 4  ...             1  18688880207  章春華
    944      570    1113                 8  ...             1  18688880313  唐雅嘉
    [944 rows x 21 columns]
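One detail worth checking is that drop() returns a new DataFrame by default and leaves the original untouched, which is why order1 still has 21 columns in the second call above. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({
    "info_id": [417, 301],
    "order_status": [1, 1],
    "lock_table_info": [None, None],
})

# Drop two columns (axis=1); the result is a new DataFrame.
trimmed = demo.drop(labels=["order_status", "lock_table_info"], axis=1)

# Drop the row labelled 0 (axis=0).
without_first = demo.drop(labels=0, axis=0)

print(trimmed.columns.tolist())
print(without_first.index.tolist())
print(demo.shape)  # the original is unchanged
```

To keep the result, assign it back, for example demo = demo.drop(...).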

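About the time module

Converting time strings to standard time types

Note that these handy functions come from the datetime library, not from numpy! Interested readers may want to study the datetime library!!

Conceptually, the types above fall into two groups: those representing points in time and those representing spans of time.

There are three ways to convert strings to standard time:

1. Converting directly to Timestamp

The code for converting strings to standard time is very short. Two forms are given here (they are essentially the same; one selects the column directly by name and the other goes through loc[]). In recent pandas versions the loc[:, ...] form may leave the column with its original object dtype, so the plain column assignment is the safer of the two.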

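    order1.loc[:,'lock_time'] = pd.to_datetime(order1.loc[:,'lock_time'])
    order1['use_start_time'] = pd.to_datetime(order1['use_start_time'])

Timestamp values represent points in time and can be subtracted from one another. For example, subtract the start time from the end time and take the minimum of the differences: if that minimum is below 0, the data must contain anomalous records. This is one way of detecting outliers.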
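The outlier check described above could be sketched like this. The three pairs of timestamps are invented for illustration (the third row is deliberately wrong, with lock_time earlier than use_start_time); only the column names come from the task data.

```python
import pandas as pd

demo = pd.DataFrame({
    "use_start_time": ["2016-08-01 11:05:36", "2016-08-01 11:15:57", "2016-08-01 12:42:52"],
    "lock_time": ["2016-08-01 11:11:46", "2016-08-01 11:31:55", "2016-08-01 12:40:00"],
})

# Convert both string columns to datetime64.
demo["use_start_time"] = pd.to_datetime(demo["use_start_time"])
demo["lock_time"] = pd.to_datetime(demo["lock_time"])

# End minus start gives a timedelta Series.
duration = demo["lock_time"] - demo["use_start_time"]
print(duration.min())

# Rows whose duration is negative are suspicious.
suspicious = demo[duration < pd.Timedelta(0)]
print(suspicious.index.tolist())
```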

2. Converting directly to DatetimeIndex

Rarely used; not covered here.

3. Converting directly to PeriodIndex

Rarely used; not covered here.

Extracting information from time series data

Sample code:

    yy = [item.year for item in order1['lock_time']]

The result is a list.
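Besides the list comprehension, a datetime64 Series also offers the .dt accessor, which returns the same information as a Series. The sketch below uses two made-up timestamps and assumes the column has already been converted with pd.to_datetime().

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2016-08-01 11:11:46", "2017-01-15 08:00:00"]))

# List comprehension: each item is a Timestamp with a .year attribute.
years_list = [item.year for item in times]

# Vectorised alternative: the .dt accessor works on the whole Series.
years_series = times.dt.year
months_series = times.dt.month

print(years_list)
print(years_series.tolist())
print(months_series.tolist())
```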

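Adding and subtracting time data: the Timedelta class!

Times can be both added to and subtracted from. For Timestamp values, however, make sure the result stays between 1677-09-21 and 2262-04-11, the range that can be represented at nanosecond resolution.

Practice with time arithmetic, to deepen understanding: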
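The bounds can be inspected through pd.Timestamp.min and pd.Timestamp.max, and a Timedelta shifts a Timestamp in either direction. A minimal sketch (the date 2014-03-05 is borrowed from the exercise in this post, the offsets are arbitrary):

```python
import pandas as pd

# Smallest and largest representable Timestamp.
print(pd.Timestamp.min)
print(pd.Timestamp.max)

start = pd.Timestamp("2014-03-05")

# Shift forward by one day and one second, and backward by two hours.
later = start + pd.Timedelta(days=1, seconds=1)
earlier = start - pd.Timedelta(hours=2)

print(later)
print(earlier)
```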

Enter the following code in the console, in order (df_Update is a DataFrame whose ListingInfo1 column holds dates).

Input 1:

    df_Update['ListingInfo1']

    Out[74]:
    0        2014-03-05
    1        2014-03-05
    2        2014-03-05
    3        2014-03-05
    4        2014-03-05
                ...
    372458   2014-03-05
    372459   2014-03-05
    372460   2014-03-05
    372461   2014-03-05
    372462   2014-03-05
    Name: ListingInfo1, Length: 372463, dtype: datetime64[ns]

Input 2:

    df_Update['ListingInfo1'] = df_Update['ListingInfo1'] + pd.to_datetime('2014-03-05 11:22:33')

This raises an error: two points in time cannot be added.

Input 3:

    df_Update['ListingInfo1'] = df_Update['ListingInfo1'] - pd.to_datetime('2014-03-05 11:22:33')
    df_Update['ListingInfo1']

    Out[77]:
    0        -1 days +12:37:27
    1        -1 days +12:37:27
    2        -1 days +12:37:27
    3        -1 days +12:37:27
    4        -1 days +12:37:27
                   ...
    372458   -1 days +12:37:27
    372459   -1 days +12:37:27
    372460   -1 days +12:37:27
    372461   -1 days +12:37:27
    372462   -1 days +12:37:27
    Name: ListingInfo1, Length: 372463, dtype: timedelta64[ns]

Input 4:

    df_Update['ListingInfo1'] = df_Update['ListingInfo1'] + pd.to_datetime('2014-03-05 11:22:33')
    df_Update['ListingInfo1']

    Out[79]:
    0        2014-03-05
    1        2014-03-05
    2        2014-03-05
    3        2014-03-05
    4        2014-03-05
                ...
    372458   2014-03-05
    372459   2014-03-05
    372460   2014-03-05
    372461   2014-03-05
    372462   2014-03-05
    Name: ListingInfo1, Length: 372463, dtype: datetime64[ns]

Input 5:

    df_Update['ListingInfo1'] = df_Update['ListingInfo1'] + pd.Timedelta(seconds = 1)
    df_Update['ListingInfo1']

    Out[81]:
    0        2014-03-05 00:00:01
    1        2014-03-05 00:00:01
    2        2014-03-05 00:00:01
    3        2014-03-05 00:00:01
    4        2014-03-05 00:00:01
                    ...
    372458   2014-03-05 00:00:01
    372459   2014-03-05 00:00:01
    372460   2014-03-05 00:00:01
    372461   2014-03-05 00:00:01
    372462   2014-03-05 00:00:01
    Name: ListingInfo1, Length: 372463, dtype: datetime64[ns]

Input 6:

    df_Update['ListingInfo1'][0] = df_Update['ListingInfo1'][0] - pd.Timedelta(seconds = 1)

This gives a warning (but the statement still executes):

    __main__:1: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Input 7:

    df_Update['ListingInfo1']

    Out[84]:
    0        2014-03-04 23:59:59
    1        2014-03-05 00:00:00
    2        2014-03-05 00:00:00
    3        2014-03-05 00:00:00
    4        2014-03-05 00:00:00
                    ...
    372458   2014-03-05 00:00:00
    372459   2014-03-05 00:00:00
    372460   2014-03-05 00:00:00
    372461   2014-03-05 00:00:00
    372462   2014-03-05 00:00:00
    Name: ListingInfo1, Length: 372463, dtype: datetime64[ns]
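The same sequence can be replayed on a tiny Series to see which operations are allowed. This is a sketch with made-up data: a two-element Series stands in for df_Update['ListingInfo1'], and the names dates, anchor, gaps, restored and shifted are chosen only for this example.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-03-05", "2014-03-05"]))
anchor = pd.Timestamp("2014-03-05 11:22:33")

# Point in time + point in time is not defined.
try:
    dates + anchor
    add_failed = False
except TypeError:
    add_failed = True

gaps = dates - anchor                         # datetime - datetime -> timedelta
restored = gaps + anchor                      # timedelta + datetime -> datetime
shifted = restored + pd.Timedelta(seconds=1)  # datetime + timedelta -> datetime

print(add_failed)
print(gaps.dtype, restored.dtype)
print(shifted.iloc[0])
```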

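Ways of checking data types:

Either use the dtypes attribute of the DataFrame directly (do not leave out the s!!),

or pull out a single value and call type() on it (or simply display it; the type is usually shown as well).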
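Both approaches side by side, as a small sketch on a made-up one-row frame: dtypes describes how each column is stored, while type() describes the Python object you get back for a single cell.

```python
import pandas as pd

demo = pd.DataFrame({
    "info_id": [417],
    "lock_time": pd.to_datetime(["2016-08-01 11:11:46"]),
})

# Column-level view: one dtype per column, returned as a Series.
print(demo.dtypes)

# Element-level view: the type of a single extracted value.
one_value = demo.loc[0, "lock_time"]
print(type(one_value))
```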

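But why do the two results look different? I could not figure it out.

Answer: it is probably a quirk of the console; once I used print(), the result was as expected.

end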
