[Python Basics] How to Process Text Data with Pandas

Published: 2025/3/8

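Author: 耿遠昊 (Geng Yuanhao), Datawhale member, East China Normal University

Text data is any sequence of characters that cannot take part in arithmetic; it is also called character data. Examples include English letters, Chinese characters, digits that are not used as numeric values (entered with a leading single quote), and any other typeable characters. Text data is high-dimensional, large in volume, and semantically complex, which makes it a comparatively difficult data type to work with. This article looks at how to process text data with Pandas.

Contents
1. Properties of the string type
    1.1 Differences between string and object
    1.2 Converting to the string type
2. Splitting and concatenation
    2.1 The str.split method
    2.2 The str.cat method
3. Replacement
    3.1 Common uses of str.replace
    3.2 Groups and function replacement
    3.3 Caveats for str.replace
4. Substring matching and extraction
    4.1 The str.extract method
    4.2 The str.extractall method
    4.3 str.contains and str.match
5. Common string methods
    5.1 Filtering methods
    5.2 The isnumeric method
6. Questions and exercises
    6.1 Questions
    6.2 Exercises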

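1. Properties of the string type

1.1 Differences between string and object

The string type differs from object in three ways:

① String accessor methods (such as str.count) return the corresponding Nullable type on string, whereas on object the return type changes depending on whether missing values are present;

② Some Series.str methods cannot be used on string, for example Series.str.decode(), because the Series stores strings rather than bytes;

③ When a string Series stores or operates on missing values, they are represented as pd.NA rather than the floating-point np.nan.

Everything else behaves identically in the current version, but to follow the direction in which Pandas is developing, this article uses string for all string operations.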
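The first and third points can be checked directly. The snippet below is a minimal sketch, and the sample values and variable names are made up for illustration. It compares the dtype returned by str.count on a string Series and on an object Series that both contain one missing value.

```python
import pandas as pd

values = ['a_b', None, 'aa']  # illustrative data, not from the article

s_string = pd.Series(values, dtype='string')
s_object = pd.Series(values, dtype='object')

# On string, the accessor returns a nullable integer dtype.
count_string = s_string.str.count('a')
# On object, the missing value forces a float result containing NaN.
count_object = s_object.str.count('a')

print(count_string.dtype)      # Int64
print(count_object.dtype)      # float64
print(s_string[1] is pd.NA)    # True: the missing value is stored as pd.NA
```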

1.2 Converting to the string type

First, import the packages we need:

import pandas as pd
import numpy as np

At the time of writing, converting a container of another type directly to string may raise an error:

#pd.Series([1,'1.']).astype('string') # raises an error
#pd.Series([1,2]).astype('string') # raises an error
#pd.Series([True,False]).astype('string') # raises an error

For now, the reliable approach is a two-step conversion: first to a str-valued object Series, then to string:

pd.Series([1,'1.']).astype('str').astype('string')
0     1
1    1.
dtype: string
pd.Series([1,2]).astype('str').astype('string')
0    1
1    2
dtype: string
pd.Series([True,False]).astype('str').astype('string')
0     True
1    False
dtype: string
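As a minimal sketch of the two-step conversion, the snippet below applies it to a mixed-type Series and checks the result. The sample data is illustrative. Newer pandas releases may also accept the direct conversion, so the two-step form is shown only because it is the one the text describes.

```python
import pandas as pd

mixed = pd.Series([1, '1.', True])  # illustrative mixed-type data

# Step 1: turn every element into a Python str.
# Step 2: cast the result to the dedicated string dtype.
as_string = mixed.astype('str').astype('string')

print(as_string.tolist())
print(as_string.dtype)
```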

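2. Splitting and concatenation

2.1 The str.split method

(a) Separators and positional element selection with str

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s
0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string

Split on a given separator; the default is whitespace.

s.str.split('_')
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Note that the result of split has dtype object: the elements of the Series are now lists rather than strings, and the string type can hold only strings.

The str accessor also supports element selection. If a cell holds a list, str[i] returns its i-th element; if it holds a single string, the string is indexed character by character, as if it had first been converted to a list of characters.

s.str.split('_').str[1]
0       b
1       d
2    <NA>
3       g
dtype: object
pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[1] # the first element is treated as ['a','_','b','_','c']
0    _
1    b
dtype: object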

(b) Other parameters

The expand parameter controls whether the pieces are spread into separate columns, and n sets the maximum number of splits.

s.str.split('_',expand=True)

s.str.split('_',n=1)
0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object
s.str.split('_',expand=True,n=1)
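To make the effect of expand and n concrete, here is a small sketch that rebuilds the Series from this section and inspects the shape of each result. The variable names `wide` and `limited` are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', None, 'f_g_h'], dtype='string')

# expand=True spreads the pieces into columns: one column per piece.
wide = s.str.split('_', expand=True)
# n=1 stops after the first split, so only two columns are produced.
limited = s.str.split('_', expand=True, n=1)

print(wide.shape)                 # (4, 3)
print(limited.shape)              # (4, 2)
print(limited.iloc[0].tolist())   # ['a', 'b_c']
```

The row that held a missing value stays missing in every column.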

2.2 The str.cat method

(a) Concatenation modes for different objects

The cat method behaves differently depending on what it is given: a single column, two columns, or several columns.

① For a single Series, all elements are joined into one string.

s = pd.Series(['ab',None,'d'],dtype='string')
s
0      ab
1    <NA>
2       d
dtype: string
s.str.cat()
'abd'

The optional sep parameter sets the separator, and na_rep sets the replacement text for missing values.

s.str.cat(sep=',')
'ab,d'
s.str.cat(sep=',',na_rep='*')
'ab,*,d'

② For two Series, elements with the same index are concatenated.

s2 = pd.Series(['24',None,None],dtype='string')
s2
0      24
1    <NA>
2    <NA>
dtype: string
s.str.cat(s2)
0    ab24
1    <NA>
2    <NA>
dtype: string

The same parameters apply here. Note that missing values on both sides are replaced.

s.str.cat(s2,sep=',',na_rep='*')
0    ab,24
1      *,*
2      d,*
dtype: string

③ Multi-column concatenation covers two cases: concatenating with a table and concatenating with several Series.

Concatenating with a table

s.str.cat(pd.DataFrame({0:['1','3','5'],1:['5','b',None]},dtype='string'),na_rep='*')
0    ab15
1     *3b
2     d5*
dtype: string

Concatenating with several Series

s.str.cat([s+'0',s*2])
0    abab0abab
1         <NA>
2        dd0dd
dtype: string

(b) Index alignment in cat

In the current version, if the indexes on the two sides differ and the join parameter is not specified, a left join is used, i.e. join='left'.

s2 = pd.Series(list('abc'),index=[1,2,3],dtype='string')
s2
1    a
2    b
3    c
dtype: string
s.str.cat(s2,na_rep='*')
0    ab*
1     *a
2     db
dtype: string
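The join parameter can also be passed explicitly. The following sketch reuses the two Series above and compares the three most common alignment modes; the names `left`, `outer` and `inner` are only labels for this example.

```python
import pandas as pd

s = pd.Series(['ab', None, 'd'], dtype='string')
s2 = pd.Series(list('abc'), index=[1, 2, 3], dtype='string')

# Keep the caller's index (the default behaviour described above).
left = s.str.cat(s2, join='left', na_rep='*')
# Keep the union of both indexes.
outer = s.str.cat(s2, join='outer', na_rep='*')
# Keep only the labels present on both sides.
inner = s.str.cat(s2, join='inner', na_rep='*')

print(left.tolist())    # ['ab*', '*a', 'db']
print(outer.tolist())   # ['ab*', '*a', 'db', '*c']
print(inner.tolist())   # ['*a', 'db']
```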

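3. Replacement

In the broad sense, replacement means applying the str.replace function. fillna, which replaces missing values, was covered in the previous chapter.

Replacement inevitably involves regular expressions. This section assumes the reader already knows the common regular-expression constructs; readers who do not should work through a regular-expression tutorial first.

3.1 Common uses of str.replace

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'],dtype="string")
s
0       A
1       B
2       C
3    Aaba
4    Baca
5
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

The first argument is the regular expression, written as an r-prefixed raw string, and the second is the replacement string. Since pandas 2.0 the regex parameter defaults to False, so regex=True is passed explicitly here.

s.str.replace(r'^[AB]','***',regex=True)
0       ***
1       ***
2         C
3    ***aba
4    ***aca
5
6      <NA>
7      CABA
8       dog
9       cat
dtype: string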
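The regex parameter decides whether the first argument is read as a pattern or as literal text. Here is a minimal sketch with made-up data, in which one element happens to contain the pattern text itself.

```python
import pandas as pd

# Illustrative data: the last element contains the pattern text literally.
s = pd.Series(['A', 'Aaba', '^[AB]'], dtype='string')

# regex=True: '^[AB]' means "a leading A or B".
as_pattern = s.str.replace(r'^[AB]', '***', regex=True)
# regex=False: '^[AB]' is searched for as the literal five characters.
as_literal = s.str.replace(r'^[AB]', '***', regex=False)

print(as_pattern.tolist())   # ['***', '***aba', '^[AB]']
print(as_literal.tolist())   # ['A', 'Aaba', '***']
```

Passing regex explicitly keeps the behaviour the same regardless of which default a given pandas version uses.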

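3.2 Groups and function replacement

Groups are accessed by positive integers (0 returns the whole match; groups start at 1). A callable replacement requires regex=True.

s.str.replace(r'([ABC])(\w+)',lambda x:x.group(2)[1:]+'*',regex=True)
0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string

Groups can be named with the ?P<....> syntax and then accessed by name.

s.str.replace(r'(?P<one>[ABC])(?P<two>\w+)',lambda x:x.group('two')[1:]+'*',regex=True)
0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string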
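The callable passed to str.replace receives a standard match object from Python's re module. The sketch below uses re directly on a single string to show what group(0), numbered groups and named groups return; the input 'Aaba' is one of the values from the Series above.

```python
import re

pattern = r'(?P<one>[ABC])(?P<two>\w+)'
match = re.match(pattern, 'Aaba')

print(match.group(0))        # 'Aaba': the whole match
print(match.group(1))        # 'A':    first group
print(match.group(2))        # 'aba':  second group
print(match.group('two'))    # 'aba':  the same group, accessed by name

# The same replacement function applied with re.sub to one string.
replaced = re.sub(pattern, lambda m: m.group('two')[1:] + '*', 'CABA')
print(replaced)              # 'BA*'
```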

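3.3 Caveats for str.replace

First, be clear that str.replace and replace are not the same thing:

str.replace works on object or string Series. In the pandas version this article was written against it treated the pattern as a regular expression by default; since pandas 2.0 the default is regex=False, so pass regex=True when a pattern is intended. It is not currently available on a DataFrame;
replace works on a Series or DataFrame of any type. To replace by regular expression you must set regex=True, and with a dictionary it supports replacement across multiple columns.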
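The contrast between the two methods can be sketched as follows. The DataFrame, its column names and the pattern are made up for illustration, and object dtype is used so the example does not depend on the string-type issues described in this section.

```python
import pandas as pd

# Hypothetical two-column table used only for this illustration.
df = pd.DataFrame({'a': ['A1', 'B2'], 'b': ['A3', 'C4']}, dtype='object')

# DataFrame.replace: regex must be switched on explicitly.
whole = df.replace(r'^A', 'X', regex=True)
# Dictionaries restrict the replacement to chosen columns.
only_a = df.replace({'a': r'^A'}, {'a': 'X'}, regex=True)

# str.replace lives on the Series accessor, so it is applied per column.
col_a = df['a'].str.replace(r'^A', 'X', regex=True)

print(whole)
print(only_a)
print(col_a.tolist())
```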

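Because the string type was only recently introduced, some problems arise when using these methods. These issues are expected to be fixed in later versions.

(a) The replacement argument of str.replace must not be pd.NA

This sounds unreasonable. For example, to replace strings that match some regular expression with a missing value, passing the missing value directly raises an error in the current version:

#pd.Series(['A','B'],dtype='string').str.replace(r'[A]',pd.NA,regex=True) # raises an error
#pd.Series(['A','B'],dtype='O').str.replace(r'[A]',pd.NA,regex=True) # raises an error

As a workaround, convert to object first and then convert back:

pd.Series(['A','B'],dtype='string').astype('O').replace(r'[A]',pd.NA,regex=True).astype('string')
0    <NA>
1       B
dtype: string

As for why replace with regex=True is not applied to the string Series directly (non-regex replace on a string Series does work), the reason is given in the next item.

(b) On a string Series

replace cannot perform regular-expression replacement; at the time of writing this bug has not been fixed.

pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)
0    A
1    B
dtype: string
pd.Series(['A','B'],dtype='O').replace(r'[A]','C',regex=True)
0    C
1    B
dtype: object

(c) A string Series that contains missing values cannot use replace

#pd.Series(['A',np.nan],dtype='string').replace('A','B') # raises an error
pd.Series(['A',np.nan],dtype='string').str.replace('A','B')
0       B
1    <NA>
dtype: string

In summary, use str.replace unless you need to set elements to missing values, in which case convert to object and back.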
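Another way to set matching elements to missing values, not taken from the article, is to build a boolean mask with str.contains and pass it to Series.mask. The sketch below assumes this is acceptable for the use case; the sample data and names are illustrative.

```python
import pandas as pd

s = pd.Series(['A', 'B', None], dtype='string')  # illustrative data

# True where the pattern is found; missing entries count as "no match".
hits = s.str.contains(r'[A]', regex=True, na=False).astype(bool)

# mask() replaces the flagged elements with a missing value.
masked = s.mask(hits)

print(masked.isna().tolist())   # [True, False, True]
print(masked.dtype)             # string
```

This avoids the round trip through object, at the cost of replacing the whole element rather than only the matched part.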

4. Substring matching and extraction

4.1 The str.extract method

(a) Common usage

pd.Series(['10-87', '10-88', '10-89'],dtype="string").str.extract(r'([\d]{2})-([\d]{2})')

Use group names as column names:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})')

Use the ? regex quantifier to make a group optional and extract only part of the pattern:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})?-(?P<name_2>[\d]{2})')

pd.Series(['10-87', '10-88', '10-'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})?')
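The effect of the optional group is easiest to see by comparing the two patterns on the same data. The snippet below is a small sketch; `strict` and `relaxed` are names introduced only for this comparison.

```python
import pandas as pd

s = pd.Series(['10-87', '10-88', '-89'], dtype='string')

# Both groups are required: '-89' does not match, so the row is all missing.
strict = s.str.extract(r'(?P<name_1>\d{2})-(?P<name_2>\d{2})')
# The first group is optional: '-89' matches and only name_1 is missing.
relaxed = s.str.extract(r'(?P<name_1>\d{2})?-(?P<name_2>\d{2})')

print(strict)
print(relaxed)
```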

(b) The expand parameter (default True)

For a Series with one group, expand=False returns a Series. With more than one group, expand has no effect and a DataFrame is always returned.

For an Index with one group, expand=False returns the extracted Index. With more than one group and expand=False, an error is raised.

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
s.index
Index(['A11', 'B22', 'C33'], dtype='object')

s.str.extract(r'([\w])')

s.str.extract(r'([\w])',expand=False)
A11    a
B22    b
C33    c
dtype: string
s.index.str.extract(r'([\w])')

s.index.str.extract(r'([\w])',expand=False)
Index(['A', 'B', 'C'], dtype='object')

s.index.str.extract(r'([\w])([\d])')

#s.index.str.extract(r'([\w])([\d])',expand=False) # raises an error
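The return types described above can be verified with isinstance. This is a minimal sketch using the same Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'], index=['A11', 'B22', 'C33'], dtype='string')

one_group_frame = s.str.extract(r'(\w)')                   # DataFrame (expand=True)
one_group_series = s.str.extract(r'(\w)', expand=False)    # Series
two_group_frame = s.str.extract(r'(\w)(\d)', expand=False) # still a DataFrame
index_result = s.index.str.extract(r'(\w)', expand=False)  # Index

print(type(one_group_frame).__name__)
print(type(one_group_series).__name__)
print(type(two_group_frame).__name__)
print(index_result.tolist())   # ['A', 'B', 'C']
```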

4.2 The str.extractall method

Unlike extract, which returns only the first match, extractall finds every match and builds a MultiIndex (even when only one match is found).

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)

s.str.extractall(two_groups)

s['A']='a1'
s.str.extractall(two_groups)

To view the i-th match of each element, use the xs method:

s = pd.Series(["a1a2", "b1b2", "c1c2"], index=["A", "B", "C"],dtype="string")
s.str.extractall(two_groups).xs(1,level='match')
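As a sketch of how the MultiIndex is laid out, the snippet below runs extractall on the first Series from this section and selects individual match levels with xs. The names `all_matches`, `first` and `second` are introduced for illustration.

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1', 'c1'], index=['A', 'B', 'C'], dtype='string')
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

# One row per match; the second index level is named 'match'.
all_matches = s.str.extractall(two_groups)

first = all_matches.xs(0, level='match')    # first match of every element
second = all_matches.xs(1, level='match')   # only 'A' has a second match

print(all_matches)
print(first.index.tolist())    # ['A', 'B', 'C']
print(second.index.tolist())   # ['A']
```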

4.3 str.contains and str.match

str.contains checks whether a string contains a given regex pattern:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains(r'[0-9][a-z]')
0    False
1     <NA>
2     True
3     True
4     True
dtype: boolean

The optional na parameter sets the value returned for missing entries:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains('a', na=False)
0    False
1    False
2     True
3    False
4    False
dtype: boolean

str.match differs in that it relies on Python's re.match, so it checks whether the pattern matches starting from the beginning of the string:

pd.Series(['1', None, '3a_', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)
0    False
1    False
2     True
3     True
4    False
dtype: boolean
pd.Series(['1', None, '_3a', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)
0    False
1    False
2    False
3     True
4    False
dtype: boolean
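The two methods can be compared side by side on the same data. Here is a minimal sketch with an illustrative Series:

```python
import pandas as pd

s = pd.Series(['3a_', '_3a', '03c', None], dtype='string')
pattern = r'[0-9][a-z]'

# contains: the pattern may occur anywhere in the string.
anywhere = s.str.contains(pattern, na=False)
# match: the pattern must match at the very start of the string.
at_start = s.str.match(pattern, na=False)

print(anywhere.tolist())   # [True, True, True, False]
print(at_start.tolist())   # [True, False, False, False]
```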

5. Common string methods

5.1 Filtering methods

(a) str.strip

Commonly used to strip whitespace:

pd.Series(list('abc'),index=[' space1 ','space2 ',' space3'],dtype="string").index.str.strip()
Index(['space1', 'space2', 'space3'], dtype='object')

(b) str.lower and str.upper

pd.Series('A',dtype="string").str.lower()
0    a
dtype: string
pd.Series('a',dtype="string").str.upper()
0    A
dtype: string

(c) str.swapcase and str.capitalize

These swap letter case and capitalize the first letter, respectively:

pd.Series('abCD',dtype="string").str.swapcase()
0    ABcd
dtype: string
pd.Series('abCD',dtype="string").str.capitalize()
0    Abcd
dtype: string

5.2 The isnumeric method

isnumeric checks whether every character is a digit. How, then, would you test whether a string represents a numeric value? (See Question 2.)

pd.Series(['1.2','1','-0.3','a',np.nan],dtype="string").str.isnumeric()
0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
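One possible approach to the question above, offered as a sketch rather than as the article's intended answer, is to let pd.to_numeric attempt the conversion and treat anything it cannot parse as non-numeric. An object Series is used here to keep the example independent of how a given pandas version handles the string dtype.

```python
import pandas as pd

s = pd.Series(['1.2', '1', '-0.3', 'a', None], dtype='object')

# errors='coerce' turns unparseable entries into NaN instead of raising.
as_number = pd.to_numeric(s, errors='coerce')
is_value = as_number.notna()

print(is_value.tolist())   # [True, True, True, False, False]
```

Unlike isnumeric, this accepts signs and decimal points; missing entries are reported as non-numeric.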

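6. Questions and exercises

6.1 Questions

[Question 1] What is the difference between str accessor methods and df/Series methods?

[Question 2] Given a column of string type, how do you determine whether each cell holds numeric data?

[Question 3] What does the rsplit method do? In what situations is it useful?

[Question 4] Sections 2 to 4 of this chapter introduced five kinds of string operations. Consider which scenarios each of them applies to.

6.2 Exercises

[Exercise 1] Given a dataset of strings, solve the following problems:

(a) Encode each person's information as a string and store it in a new ID column (added after the personnel number), using the format "×××（name）：×國人，性別×，生于×年×月×日", that is, name, country code, sex, and year, month and day of birth.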

# Method 1
> ex1_ori = pd.read_csv('data/String_data_one.csv',index_col='人員編號')
> ex1_ori.head()
          姓名  國籍  性別  出生年  出生月  出生日
人員編號
1      aesfd    2   男  1942    8   10
2     fasefa    5   女  1985   10    4
3      aeagd    4   女  1946   10   15
4        aef    4   男  1999    5   13
5        eaf    1   女  2010    6   24
> ex1 = ex1_ori.copy()
> ex1['冒號'] = '：'
> ex1['逗號'] = '，'
> ex1['國人'] = '國人'
> ex1['性別2'] = '性別'
> ex1['生于'] = '生于'
> ex1['年'] = '年'
> ex1['月'] = '月'
> ex1['日'] = '日'
> ID = ex1['姓名'].str.cat([ex1['冒號'], ex1['國籍'].astype('str'), ex1['國人'],ex1['逗號'],ex1['性別2'],ex1['性別'],ex1['逗號'],ex1['生于'],ex1['出生年'].astype('str'),ex1['年'],ex1['出生月'].astype('str'),ex1['月'],ex1['出生日'].astype('str'),ex1['日']])
> ex1_ori['ID'] = ID
> ex1_ori
          姓名  國籍  性別  出生年  出生月  出生日  ID
人員編號
1      aesfd    2   男  1942    8   10  aesfd：2國人，性別男，生于1942年8月10日
2     fasefa    5   女  1985   10    4  fasefa：5國人，性別女，生于1985年10月4日
3      aeagd    4   女  1946   10   15  aeagd：4國人，性別女，生于1946年10月15日
4        aef    4   男  1999    5   13  aef：4國人，性別男，生于1999年5月13日
5        eaf    1   女  2010    6   24  eaf：1國人，性別女，生于2010年6月24日

(b) Change the birthday part of the string in (a) to Chinese numerals (for example 一九七四年十月二十三日), keeping the rest of the format unchanged.

(c) Split the ID column from (b) back into the corresponding columns of the original table, and use equals to check that the result matches the original.

# Reference answer (df_new is the result of part (b), which is not shown here)
> dic_year = {i[0]:i[1] for i in zip(list('零一二三四五六七八九'),list('0123456789'))}
> dic_two = {i[0]:i[1] for i in zip(list('十一二三四五六七八九'),list('0123456789'))}
> dic_one = {'十':'1','二十':'2','三十':'3',None:''}
> # the separator after the name is the full-width colon used in the ID format
> df_res = df_new['ID'].str.extract(r'(?P<姓名>[a-zA-Z]+)：(?P<國籍>[\d])國人，性別(?P<性別>[\w])，生于(?P<出生年>[\w]{4})年(?P<出生月>[\w]+)月(?P<出生日>[\w]+)日')
> # callable replacements require regex=True
> df_res['出生年'] = df_res['出生年'].str.replace(r'(\w)+',lambda x:''.join([dic_year[x.group(0)[i]] for i in range(4)]),regex=True)
> df_res['出生月'] = df_res['出生月'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'0','10',regex=True)
> df_res['出生日'] = df_res['出生日'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'^0','10',regex=True)
> df_res.head()
          姓名  國籍  性別  出生年  出生月  出生日
人員編號
1      aesfd    2   男  1942    8   10
2     fasefa    5   女  1985   10    4
3      aeagd    4   女  1946   10   15
4        aef    4   男  1999    5   13
5        eaf    1   女  2010    6   24

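[Exercise 2] Given a semi-synthetic dataset whose first column contains news headlines about the novel coronavirus, solve the following problems:

(a) Select all rows whose headline concerns Beijing or Shanghai.

(b) Compute the mean of col2.

ex2.col2.str.rstrip('-`').str.lstrip('/').astype(float).mean()

-0.984

(c) Compute the mean of col3.

ex2.columns = ex2.columns.str.strip(' ')
# Helper used to locate dirty data
def is_number(x):
    try:
        float(x)
        return True
    except (SyntaxError, ValueError) as e:
        return False
ex2[~ex2.col3.map(is_number)]

# the character class is a regular expression, so regex=True is passed explicitly
ex2.col3.str.replace(r'[`\\{]', '', regex=True).astype(float).mean()

24.707484999999988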
