Data Analysis with Python (Part 3, First Half)
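In the previous post I recorded some of the basics I picked up while getting started with Python, along with some of the problems I ran into when actually writing code.
In this post I will record how I am learning and using Python for data analysis:
- An introduction to the NumPy and pandas packages
- Analyzing one- and two-dimensional data with NumPy and pandas
- The basic process of data analysis
- Hands-on project: analyzing Chaoyang Hospital's 2018 quarterly drug sales data with Python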
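I. A brief introduction to the NumPy and pandas packages
NumPy and pandas are two commonly used scientific computing packages for Python. They provide array objects that are higher-level than Python lists and run more efficiently. They are typically used to process large amounts of data and to extract and analyze useful metrics from it.
NumPy is short for Numerical Python, and it is currently the most important foundational package for numerical computing in Python. Most computing packages build their scientific functions on NumPy and use NumPy's array object as the common language for exchanging data. The core of NumPy is the ndarray object, which encapsulates an n-dimensional array of a native data type. A NumPy array has a fixed size at creation, all of its elements must share the same data type, and it can be sliced the same way as a Python list. Vectorization and broadcasting are NumPy's signature features.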
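To make the last few points concrete, here is a minimal sketch of vectorization, broadcasting, and the single-dtype rule. The array names (`prices`, `grid`, `offsets`, `mixed`) are made up for illustration and are not part of the walkthrough that follows.

```python
import numpy as np

# Vectorization: one expression applies to every element, no explicit loop.
prices = np.array([2, 3, 4, 5])
doubled = prices * 2                      # array([ 4,  6,  8, 10])

# The plain-Python equivalent needs an explicit loop or comprehension.
doubled_list = [p * 2 for p in [2, 3, 4, 5]]

# Broadcasting: a length-4 row is stretched across each row of a 3x4 array.
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])
offsets = np.array([10, 20, 30, 40])
shifted = grid + offsets                  # shape stays (3, 4)

# All elements share one dtype: mixing int and float upcasts to float64.
mixed = np.array([1, 2.5])

print(doubled, shifted.shape, mixed.dtype)
```

Broadcasting only works when the trailing dimensions are compatible, which is why the length-4 `offsets` lines up with the 4 columns of `grid`.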
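The data structures and data manipulation tools in pandas are designed to make data cleaning and analysis in Python very fast. pandas is often used together with other numerical computing tools such as NumPy and SciPy, and with visualization tools such as matplotlib. pandas supports most of NumPy's idiomatic style of array computing. It represents one- and two-dimensional data with the Series and DataFrame objects respectively, both of which are intuitive to work with. pandas can handle many different data types, deal with missing data, group and aggregate, and it also supports slicing.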
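As a quick taste of the missing-data and group-and-aggregate features mentioned above, here is a small sketch. The `sales` table and its column names are hypothetical, invented only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one quantity is missing (NaN).
sales = pd.DataFrame({
    "drug": ["A", "B", "A", "B"],
    "quantity": [6, 1, np.nan, 2],
})

# Count the missing values in a column.
missing = sales["quantity"].isna().sum()

# Group by drug and aggregate; sum() skips NaN by default.
totals = sales.groupby("drug")["quantity"].sum()

print(missing)
print(totals)
```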
II. Analyzing one- and two-dimensional data with NumPy and pandas
First, install both packages in conda. The install command:
    conda install numpy pandas
    # Import the numpy package
    import numpy as np
    # Import the pandas package
    import pandas as pd
1. Analyzing one-dimensional data with NumPy
1.1 Define a one-dimensional array:
Define a one-dimensional array; the argument passed in is the list [2,3,4,5].
    # Define a one-dimensional array from the list [2,3,4,5]
    a = np.array([2,3,4,5])
1.2 Look up an element:
    # Look up the element at index 0
    a[0]
Output:
    2
1.3 Slicing - get the elements in a given index range
    # Slicing: a[1:3] returns the elements at indices 1 and 2 (the end index 3 is excluded)
    a[1:3]
Output:
    array([3, 4])
1.4 Check the data type:
    # dtype reference: https://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.dtypes.html
    # Check the data type (the default integer type depends on the platform)
    a.dtype
Output:
    dtype('int32')
1.5 Statistics - mean
    # Mean
    a.mean()
Output:
    3.5
1.6 Statistics - standard deviation
    # Standard deviation
    a.std()
Output:
    1.118033988749895
1.7 Vectorized operations - multiplying by a scalar
    # Vectorization: multiply every element by a scalar
    b = np.array([1,2,3])
    c = b * 4
    c
Output:
    array([ 4,  8, 12])
2. Analyzing two-dimensional data with NumPy
2.1 Define a two-dimensional array:
    # NumPy two-dimensional data structure: array
    # Define a two-dimensional array
    a = np.array([
        [1,2,3,4],
        [5,6,7,8],
        [9,10,11,12]
    ])
2.2 Get an element:
Get the element at row index 0, column index 2.
    a[0,2]
Output:
    3
2.3 Get a row:
Get the first row.
    a[0,:]
Output:
    array([1, 2, 3, 4])
2.4 Get a column:
Get the first column.
    a[:,0]
Output:
    array([1, 5, 9])
2.5 The NumPy axis parameter: axis
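One way I keep the axis direction straight: the axis passed in is the one that gets collapsed, so its length disappears from the result's shape. Here is a small sketch using `sum` rather than `mean`, on the same 3x4 values as above (stored under the illustrative name `grid` so the example stands alone).

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# No axis: every element is combined into a single number.
total = grid.sum()

# axis=0 collapses the 3 rows, leaving one value per column: shape (4,).
col_sums = grid.sum(axis=0)

# axis=1 collapses the 4 columns, leaving one value per row: shape (3,).
row_sums = grid.sum(axis=1)

print(grid.shape, col_sums.shape, row_sums.shape)
print(total, col_sums, row_sums)
```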
1) If no axis is specified, the mean of the entire array is computed.
    # Without an axis argument, the mean of the whole array is computed
    a.mean()
Output:
    6.5
2) Along an axis: axis=1 computes the mean of each row.
    # axis = 1: compute the mean of every row
    a.mean(axis = 1)
Output:
    array([ 2.5,  6.5, 10.5])
3) Along an axis: axis=0 computes the mean of each column.
    a.mean(axis = 0)
Output:
    array([5., 6., 7., 8.])
3. Analyzing one-dimensional data with pandas
3.1 Define the pandas one-dimensional data structure: Series
    # pandas one-dimensional data structure: Series
    # One day's stock prices for 6 companies, in USD
    # (Tencent: 427 HKD is equal to 54.74 USD)
    stockS = pd.Series([54.74, 190.9, 173.14, 1050.3, 181.86, 1139.49],
                       index = ['tencent','alibaba','apple','google','facebook','amazon'])
3.2 View the Series
View stockS.
    stockS
Output:
    tencent       54.74
    alibaba      190.90
    apple        173.14
    google      1050.30
    facebook     181.86
    amazon      1139.49
    dtype: float64
3.3 Get descriptive statistics:
    # Get descriptive statistics
    stockS.describe()
Output:
    count       6.000000
    mean      465.071667
    std       491.183757
    min        54.740000
    25%       175.320000
    50%       186.380000
    75%       835.450000
    max      1139.490000
    dtype: float64
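Two details of this output are easy to trip over, so here is a hedged sketch of how I understand them. pandas computes `std` as the sample standard deviation (`ddof=1`), whereas the NumPy `std()` used in section 1.6 defaults to the population version (`ddof=0`). The 25% row comes from linear interpolation between the sorted values. The variable names below are my own.

```python
import pandas as pd

stockS = pd.Series([54.74, 190.9, 173.14, 1050.3, 181.86, 1139.49],
                   index=['tencent', 'alibaba', 'apple', 'google', 'facebook', 'amazon'])

# pandas: sample standard deviation (divides by n - 1).
sample_std = stockS.std()
# NumPy default: population standard deviation (divides by n).
population_std = stockS.to_numpy().std()
# Asking NumPy for ddof=1 reproduces the pandas figure.
numpy_sample_std = stockS.to_numpy().std(ddof=1)

# 25% quantile by hand: position 0.25 * (n - 1) = 1.25 in the sorted values.
ordered = sorted(stockS)
pos = 0.25 * (len(ordered) - 1)
low = int(pos)
manual_q25 = ordered[low] + (pos - low) * (ordered[low + 1] - ordered[low])

print(sample_std, population_std)
print(manual_q25, stockS.quantile(0.25))
```

So when comparing a NumPy result with a pandas one, it is worth checking which `ddof` each side used.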
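3.4 The iloc attribute gets a value by position
    # iloc: get a value by its integer position
    stockS.iloc[0]
Output:
    54.74
3.5 The loc attribute gets a value by index label
    # loc: get a value by its index label
    stockS.loc['tencent']
Output:
    54.74
3.6 Vectorized operations - adding two Series
    # Vectorization: element-wise addition, aligned on index labels
    s1 = pd.Series([1,2,3,4], index = ['a','b','c','d'])
    s2 = pd.Series([10,20,30,40], index = ['a','b','e','f'])
    s3 = s1 + s2
    s3
Output:
    a    11.0
    b    22.0
    c     NaN
    d     NaN
    e     NaN
    f     NaN
    dtype: float64
Labels that appear in only one of the two Series have nothing to be added to, so their result is NaN (a missing value).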
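The result above follows from index alignment: pandas matches values by label, not by position, and the result's index is the union of both indexes. A minimal sketch of that behavior (`total` and `shared` are names introduced here for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

total = s1 + s2

# The result covers every label found in either Series.
print(list(total.index))          # ['a', 'b', 'c', 'd', 'e', 'f']

# Only labels present in both Series produce a real number.
shared = s1.index.intersection(s2.index)
print(list(shared))               # ['a', 'b']
print(total[shared])
```

The integers become floats in the result because NaN is itself a floating-point value.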
3.7 Drop missing values
    # Method 1: drop the missing values
    s3.dropna()
Output:
    a    11.0
    b    22.0
    dtype: float64
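One thing worth knowing here: `dropna()` returns a new Series and leaves the original untouched, so the result has to be assigned to a name if it is needed later. A short sketch (`cleaned` is a name introduced for this example):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])
s3 = s1 + s2

# dropna() builds a new Series; s3 still holds its NaN entries.
cleaned = s3.dropna()

print(len(s3), len(cleaned))      # 6 2
print(s3.isna().sum())            # 4
```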
3.8 Fill missing values
    # Fill missing values: a label missing from one Series is treated as 0 before adding
    s3 = s2.add(s1, fill_value = 0)
    s3
Output:
    a    11.0
    b    22.0
    c     3.0
    d     4.0
    e    30.0
    f    40.0
    dtype: float64
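It is easy to confuse `fill_value` with calling `fillna` afterwards, and the two give different answers. A sketch comparing them, under the assumption that 0 is a sensible stand-in for a missing entry (`filled_before` and `filled_after` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# fill_value substitutes 0 for the missing operand BEFORE adding,
# so the value that does exist is kept.
filled_before = s1.add(s2, fill_value=0)

# fillna replaces the NaN AFTER adding, so the existing value is lost.
filled_after = (s1 + s2).fillna(0)

print(filled_before.tolist())     # [11.0, 22.0, 3.0, 4.0, 30.0, 40.0]
print(filled_after.tolist())      # [11.0, 22.0, 0.0, 0.0, 0.0, 0.0]
```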
4. Analyzing two-dimensional data with pandas
The pandas two-dimensional data structure: DataFrame
4.1 Define a DataFrame
    # pandas two-dimensional data structure: DataFrame
    # Step 1: define a dict that maps column names to their values
    salesDict = {
        'medecine purchased date':['01-01-2018 FRI','02-01-2018 SAT','06-01-2018 WED'],
        'social security card number':['001616528','001616528','0012602828'],
        'commodity code':[236701,236701,236701],
        'commodity name':['strong yinqiao VC tablets',
                          'hot detoxify clearing oral liquid',
                          'GanKang compound paracetamol and amantadine hydrochloride tablets'],
        'quantity sold':[6,1,2],
        'amount receivable':[82.8,28,16.8],
        'amount received':[69,24.64,15]
    }
    # Import OrderedDict
    from collections import OrderedDict
    # Define an OrderedDict
    salesOrderDict = OrderedDict(salesDict)
    # Define the DataFrame by passing in the ordered dict
    salesDf = pd.DataFrame(salesOrderDict)
4.2 View the DataFrame
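    salesDf
4.3 Mean
The mean is computed column by column.
    # Mean of each numeric column
    # (numeric_only=True skips the text columns, which recent pandas versions no longer drop silently)
    salesDf.mean(numeric_only=True)
Output:
    commodity code       236701.000000
    quantity sold             3.000000
    amount receivable        42.533333
    amount received          36.213333
    dtype: float64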
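The per-column mean only makes sense for numeric columns. As far as I know, recent pandas releases (2.0 and later) raise a TypeError when `mean()` meets a text column unless told to skip it, so here is a small sketch of two ways to be explicit. The frame `df` is a cut-down stand-in for the sales table, with placeholder commodity names.

```python
import pandas as pd

df = pd.DataFrame({
    'commodity name': ['x', 'y', 'z'],        # placeholder text column
    'quantity sold': [6, 1, 2],
    'amount receivable': [82.8, 28, 16.8],
    'amount received': [69, 24.64, 15],
})

# Option 1: average only the numeric columns.
means = df.mean(numeric_only=True)

# Option 2: pick the columns to average yourself.
picked = df[['amount receivable', 'amount received']].mean()

print(means)
print(picked)
```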
4.4 Querying data - the iloc attribute gets values by position
1) Get the element in the 1st row, 2nd column
    # iloc gets values by position
    # Look up the element in the 1st row, 2nd column
    salesDf.iloc[0,1]
Output:
    '001616528'
2) Get the first row - the colon (:) stands for all columns
    # Get every item in the first row
    salesDf.iloc[0,:]
Output:
    medecine purchased date                   01-01-2018 FRI
    social security card number                    001616528
    commodity code                                    236701
    commodity name                 strong yinqiao VC tablets
    quantity sold                                          6
    amount receivable                                   82.8
    amount received                                       69
    Name: 0, dtype: object
3) Get the first column - the colon (:) stands for all rows
    # Get every item in the first column
    salesDf.iloc[:,0]
Output:
    0    01-01-2018 FRI
    1    02-01-2018 SAT
    2    06-01-2018 WED
    Name: medecine purchased date, dtype: object
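`iloc` also accepts ranges, negative positions, and lists of positions, following the same rules as Python list indexing (the end of a range is excluded). A sketch on a cut-down numeric frame; `df` and the result names are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'quantity sold': [6, 1, 2],
    'amount receivable': [82.8, 28, 16.8],
    'amount received': [69, 24.64, 15],
})

# Rows 0-1 and columns 0-1: the end position is excluded.
first_two = df.iloc[0:2, 0:2]

# Negative positions count from the end: this is the last row.
last_row = df.iloc[-1]

# Lists pick arbitrary positions: rows 0 and 2, column 1.
picked = df.iloc[[0, 2], [1]]

print(first_two)
print(last_row)
print(picked)
```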
4.5 Querying data - the loc attribute gets values by index label
1) Get the element in the first row, first column
    # loc gets values by index label
    # Look up the element in the first row, first column
    salesDf.loc[0,'medecine purchased date']
Output:
    '01-01-2018 FRI'
2) Get the 'commodity code' column
    # Get every item in the 'commodity code' column
    # Method 1:
    salesDf.loc[:,'commodity code']
Output:
    0    236701
    1    236701
    2    236701
    Name: commodity code, dtype: int64
3) A simpler way to get the 'commodity code' column
    # Get every item in the 'commodity code' column
    # Method 2: the simple way
    salesDf['commodity code']
Output:
    0    236701
    1    236701
    2    236701
    Name: commodity code, dtype: int64
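With the default 0, 1, 2 row index, `loc` and `iloc` look interchangeable for rows. The difference shows up once the index labels are not the positions. A sketch with a hypothetical index of 10, 20, 30 (the frame and names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'quantity sold': [6, 1, 2],
    'amount received': [69, 24.64, 15],
}, index=[10, 20, 30])

# loc uses the label 10; iloc uses position 0. Both reach the same cell.
by_label = df.loc[10, 'quantity sold']
by_position = df.iloc[0, 0]

# Label slices include BOTH ends; position slices exclude the end.
label_slice = df.loc[10:20]       # rows labelled 10 and 20
position_slice = df.iloc[0:1]     # only the row at position 0

print(by_label, by_position)
print(len(label_slice), len(position_slice))   # 2 1
```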
4.6 Complex DataFrame queries - slicing
1) Select several columns with a list
    # Select a few columns by passing a list of column names
    salesDf[['commodity name','quantity sold']]
2) Get a range of columns by slicing
    # Get the columns in a given range by slicing
    salesDf.loc[:,'medecine purchased date':'quantity sold']
4.7 Complex DataFrame queries - conditions
1) Filter by condition - Step 1: build the query condition
    # Filter by condition
    # Step 1: build the query condition
    querySer = salesDf.loc[:,'quantity sold'] > 1
    type(querySer)
Output:
    pandas.core.series.Series
    querySer
Output:
    0     True
    1    False
    2     True
    Name: quantity sold, dtype: bool
    # Step 2: apply the condition to keep only the rows where it is True
    salesDf.loc[querySer,:]
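Several conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses when written inline, because these operators bind more tightly than comparisons. A sketch on a cut-down frame; the 50 threshold and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'quantity sold': [6, 1, 2],
    'amount received': [69, 24.64, 15],
})

many = df['quantity sold'] > 1          # True, False, True
cheap = df['amount received'] < 50      # False, True, True

both = df.loc[many & cheap, :]          # rows meeting both conditions
either = df.loc[many | cheap, :]        # rows meeting at least one
not_many = df.loc[~many, :]             # rows where the first condition fails

print(both)
print(either)
print(not_many)
```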
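4.8 View descriptive statistics for the dataset
1) Read the Excel data
    # Read data from Excel
    fileNameStr = 'C:UsersUSERDesktop#3Python3_The basic process of data analysisSales data of Chaoyang Hospital in 2018 - Copy.xlsx'
    xls = pd.ExcelFile(fileNameStr)
    salesDf = xls.parse('Sheet1')
2) Print the first 3 rows to make sure the data loaded properly
    # Print the first three rows to make sure the data loaded properly
    salesDf.head(3)
3) Get the total number of rows and columns
    salesDf.shape
Output:
    (6578, 7)
4) Check the data type of one column
    # Check the data type of one column
    salesDf.loc[:,'quantity sold'].dtype
Output:
    dtype('float64')
5) View the statistics for each column
    # Check the statistics for each column
    salesDf.describe()
In the next post I will continue with the second half:
- The basic process of data analysis
- Hands-on project: analyzing Chaoyang Hospital's 2018 quarterly drug sales data with Python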