生活随笔
收集整理的這篇文章主要介紹了
Python数据清洗处理,csv,pandas,数据匹配
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
寫在前面
本人python新手一枚,最近畢業論文需要用python處理數據,大概就是兩年前學過一些python基礎,最近重新把python撿起來了,由于這個數據處理過程對我這種新手來說過于艱難,所以想要把它記錄下來,方便以后需要的時候再回顧,也希望其他同學遇到類似問題的時候可以更快找到解決方法~
下面正文開始
1. 要用到的庫
import os
import pandas
as pd
import openpyxl
import csv
import xlrd
2. 遍歷文件夾,獲取文件夾下的文件路徑
def get_files ( ) : l
= [ ] directory
= '文件夾路徑' for f
in os
. listdir
( directory
) : file_path
= os
. path
. join
( directory
, f
) for folder
in os
. listdir
( file_path
) : files
= os
. path
. join
( file_path
, folder
) l
. append
( files
) return l
path
= get_files
( )
3. 文件合并
如果上面文件夾里的文件是xlsx文件,則直接使用pd.read_excel()函數。 如果是xls文件,則使用pd.read_excel(文件路徑,engine=‘xlrd’, index_col=False)
for i
in path
: df
= pd
. read_csv
( i
, index_col
= False ) df
= df
. loc
[ : , [ '區域' , '板塊' , '租金' , '被叫手機號' , '被叫姓名' , '被叫時長(秒)' , '撥打時間' ] ] df
. to_csv
( '文件路徑.csv' , mode
= 'a' , index
= False ) print ( i
)
4. 數據清洗
df2
= pd
. read_csv
( '文件路徑' , index_col
= False )
df2
. drop_duplicates
( inplace
= True )
print ( 'finish part2' )
df2
[ '被叫手機號' ] = df2
[ '被叫手機號' ] . fillna
( 'null' )
df2
= df2
[ ~ df2
[ '被叫手機號' ] . isin
( [ 'null' ] ) ]
df2
. to_csv
( '文件路徑' , mode
= 'w' , index
= False )
print ( 'finish part3' )
5. 遍歷csv文件的每一行
因為我的數據需要,pandas里面我不知道怎么去遍歷每一行,所以我用了csv這個包,如果某一行的數據不符合要求就刪掉,把符合要求的寫進一個新的csv里面
import csv
l
= [ ]
f
= open ( '文件路徑' , 'r' , encoding
= 'utf-8' )
reader
= csv
. reader
( f
)
f1
= open ( '文件路徑' , 'w+' , encoding
= 'utf-8' , newline
= '' )
writer
= csv
. writer
( f1
)
header
= [ '區域' , '板塊' , '租金' , '被叫手機號' , '被叫姓名' , '被叫時長(秒)' , '撥打時間' ]
writer
. writerow
( header
)
for line
in reader
: if line
[ 4 ] [ 0 ] != '1' : continue if len ( line
[ 4 ] ) != 11 : continue if '測試' in line
[ 5 ] : continue else : writer
. writerow
( line
)
print ( 'finish' )
f
. close
( )
f1
. close
( )
數據清洗階段基本上就完成了,如果還有其他需求的小伙伴也可以根據上面的內容自行修改。
接下來就是進行一些數據的計算、分組統計和匹配等等,這里僅列出比較通用的部分。
6. 數據計算和匹配
- 分組統計
如果你的數據里同一個id或者同一個時間有多個值,那么就可以用pandas里的groupby函數進行分組計數、求和、求平均值等等。groupby用法很多,可自行百度。
df
= pd
. read_csv
( '文件路徑' , index_col
= False )
g
= df
. groupby
( [ 'id' , '月份' ] ) [ '被叫次數' ] . sum ( ) . reset_index
( )
g
. to_csv
( '文件路徑' , mode
= 'w' , index
= False )
如果需要計數的話需要用count()或size()函數,這兩個函數的區別是count()只計算有值的數,size則是就算該行有缺失值也會計算進去。
- 數據匹配
當有兩個或多個表格,它們中包含同一個唯一字段(即數據庫里的主碼)需要根據該字段進行匹配時,可以用pandas的merge函數。
left
= pd
. read_csv
( '文件路徑' , index_col
= False )
right
= pd
. read_csv
( '文件路徑' , index_col
= False )
result
= pd
. merge
( left
, right
, left_on
= [ '注冊時間' , '時間' ] , right_on
= [ '注冊時間' , '時間' ] , how
= "left" )
result
. to_csv
( '文件路徑' , mode
= 'w' , index
= False )
print ( 'finish!' )
- 計算兩個時間之間的間隔天數
這個是參考某位大佬的,具體出處戳這里
import time
def demo ( day1
, day2
) : time_array1
= time
. strptime
( day1
, "%Y/%m/%d" ) timestamp_day1
= int ( time
. mktime
( time_array1
) ) time_array2
= time
. strptime
( day2
, "%Y/%m/%d" ) timestamp_day2
= int ( time
. mktime
( time_array2
) ) result
= ( timestamp_day2
- timestamp_day1
) // 60 // 60 // 24 return result
- 生成虛擬變量
df
= pd
. read_csv
( '文件路徑' , index_col
= False )
renttype_dummy
= pd
. get_dummies
( df
[ '租賃類型' ] )
df
= df
. join
( renttype_dummy
)
df
. to_csv
( '文件路徑' , mode
= 'w' , index
= False )
- 判斷某個房子在某月是否處于上架狀態
(在上架日期和下架日期之間) 這其實是一個非常笨的辦法,但我也想不到其他的了… 邏輯就是把上架日期和下架日期的年份和月份取出來合并到一起,即201701這種形式,然后比較大小即可。 例如201709 < 201710 <201801,那這個房子在201710的時候肯定是處于上架狀態的。
f
= open ( '文件路徑' , 'r' , encoding
= 'utf-8' )
reader
= csv
. reader
( f
)
f1
= open ( '文件路徑' , 'w' , encoding
= 'utf-8' )
writer
= csv
. writer
( f1
)
header
= [ 'id' , '月份' ]
writer
. writerow
( header
)
for line
in reader
: xiajia
= line
[ - 6 ] . split
( '/' ) print ( xiajia
) shangjia
= line
[ - 5 ] . split
( '/' ) if len ( xiajia
[ 1 ] ) == 1 : month1
= '0' + xiajia
[ 1 ] xjdate
= xiajia
[ 0 ] + month1
else : xjdate
= xiajia
[ 0 ] + xiajia
[ 1 ] if len ( shangjia
[ 1 ] ) == 1 : month2
= '0' + shangjia
[ 1 ] sjdate
= shangjia
[ 0 ] + month2
else : sjdate
= shangjia
[ 0 ] + shangjia
[ 1 ]
if int ( sjdate
) <= 201701 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201701' ) writer
. writerow
( l
) if int ( sjdate
) <= 201702 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201702' ) writer
. writerow
( l
) if int ( sjdate
) <= 201703 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201703' ) writer
. writerow
( l
) if int ( sjdate
) <= 201704 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201704' ) writer
. writerow
( l
) if int ( sjdate
) <= 201705 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201705' ) writer
. writerow
( l
) if int ( sjdate
) <= 201706 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201706' ) writer
. writerow
( l
) if int ( sjdate
) <= 201707 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201707' ) writer
. writerow
( l
) if int ( sjdate
) <= 201708 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201708' ) writer
. writerow
( l
) if int ( sjdate
) <= 201709 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201709' ) writer
. writerow
( l
) if int ( sjdate
) <= 201710 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201710' ) writer
. writerow
( l
) if int ( sjdate
) <= 201711 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201711' ) writer
. writerow
( l
) if int ( sjdate
) <= 201712 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201712' ) writer
. writerow
( l
) if int ( sjdate
) <= 201801 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201801' ) writer
. writerow
( l
) if int ( sjdate
) <= 201802 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201802' ) writer
. writerow
( l
) if int ( sjdate
) <= 201803 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201803' ) writer
. writerow
( l
) if int ( sjdate
) <= 201804 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201804' ) writer
. writerow
( l
) if int ( sjdate
) <= 201805 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201805' ) writer
. writerow
( l
) if int ( sjdate
) <= 201806 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201806' ) writer
. writerow
( l
) if int ( sjdate
) <= 201807 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201807' ) writer
. writerow
( l
) if int ( sjdate
) <= 201808 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201808' ) writer
. writerow
( l
) if int ( sjdate
) <= 201809 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201809' ) writer
. writerow
( l
) if int ( sjdate
) <= 201810 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201810' ) writer
. writerow
( l
) if int ( sjdate
) <= 201811 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201811' ) writer
. writerow
( l
) if int ( sjdate
) <= 201812 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201812' ) writer
. writerow
( l
) if int ( sjdate
) <= 201901 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201901' ) writer
. writerow
( l
) if int ( sjdate
) <= 201902 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201902' ) writer
. writerow
( l
) if int ( sjdate
) <= 201903 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201903' ) writer
. writerow
( l
) if int ( sjdate
) <= 201904 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201904' ) writer
. writerow
( l
) if int ( sjdate
) <= 201905 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201905' ) writer
. writerow
( l
) if int ( sjdate
) <= 201906 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201906' ) writer
. writerow
( l
) if int ( sjdate
) <= 201907 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201907' ) writer
. writerow
( l
) if int ( sjdate
) <= 201908 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201908' ) writer
. writerow
( l
) if int ( sjdate
) <= 201909 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201909' ) writer
. writerow
( l
) if int ( sjdate
) <= 201910 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201910' ) writer
. writerow
( l
) if int ( sjdate
) <= 201911 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201911' ) writer
. writerow
( l
) if int ( sjdate
) <= 201912 <= int ( xjdate
) : l
= [ ] l
. append
( line
[ 0 ] ) l
. append
( line
[ 1 ] ) l
. append
( '201912' ) writer
. writerow
( l
)
f
. close
( )
f1
. close
( )
最最后,立個小小的flag,今年一定要更深入的學習python和sql,希望所有的數據人都能快樂(哭)。 如果有大佬有更好的方法,請多多指教。
總結
以上是生活随笔 為你收集整理的Python数据清洗处理,csv,pandas,数据匹配 的全部內容,希望文章能夠幫你解決所遇到的問題。
如果覺得生活随笔 網站內容還不錯,歡迎將生活随笔 推薦給好友。