Python String and Text Processing Techniques (1): Splitting, Start/End Matching, Pattern Searching, and Replacement

Published: 2025/3/15 python 29 豆豆

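1. Splitting strings

We need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent.

str.split() and re.split()

The split() method of string objects only suits very simple splitting: it does not handle multiple delimiters or a variable amount of whitespace around them. When more flexibility is needed, use re.split():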

import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

print(re.split(r'[;,]', line))
# >>> ['asdf fjdk', ' afed', ' fjek', 'asdf', ' foo']

print(re.split(r'[;,\s]', line))
# >>> ['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', 'foo']

print(re.split(r'[;,\s]\s', line))
# >>> ['asdf fjdk', 'afed', 'fjek,asdf', 'foo']

print(re.split(r'[;,\s]\s*', line))  # \s* matches zero or more whitespace characters
# >>> ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

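re.split() is useful because it lets you describe the delimiter with a regular expression that covers several alternatives. In the last example above, the delimiter is a comma, a semicolon, or a whitespace character, followed by any amount of whitespace. Wherever the pattern matches, the text on either side of the match becomes an element of the result. The result is a list of fields, the same type that str.split() returns.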
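To make the contrast concrete, the sketch below runs both methods on the same sample line used above. The variable names by_comma and fields are introduced here only for illustration.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# str.split() accepts a single literal separator, so ';' and the
# surrounding spaces are left inside the resulting fields.
by_comma = line.split(',')

# re.split() treats ';', ',' or a whitespace character, plus any
# whitespace that follows, as one delimiter.
fields = re.split(r'[;,\s]\s*', line)

print(by_comma)
print(fields)
```

With str.split(',') the first field still contains "fjdk; afed", which is why the regular expression version is the better fit here.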

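2. Matching the start or end of a string

Sometimes we need to check the start or end of a string against specific text, such as a filename extension or a URL scheme.

str.startswith() + str.endswith()

A simple way to check the start or end of a string is to use str.startswith() or str.endswith(). For example: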

filename = 'spam.txt'
print(filename.endswith('.txt'))
# >>> True
print(filename.startswith('file:'))
# >>> False

url = 'http://www.python.org'
print(url.startswith('http:'))
# >>> True

startswith() + endswith() with multiple candidates

To check against several candidates, put them all in a tuple and pass it to startswith() or endswith():

import os

filenames = os.listdir('.')
print(filenames)
"""
>>> ['dictOperation.py', 'lookupNelement.py', 'MatchStartEnd.py', 'multidict.py', 'preorityQueue.py', 'saveNelements.py', 'SegString.py', 'somefile.txt', 'yield_experiment.py']
"""

# The results below follow from the directory listing shown above
print([name for name in filenames if name.endswith(('.txt', '.ini'))])
# >>> ['somefile.txt']

print(any(name.endswith('.py') for name in filenames))
# >>> True
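One detail worth noting is that the argument has to be an actual tuple. In standard CPython, passing a list raises TypeError, so a list of candidates needs converting with tuple() first. The names choices, list_rejected, and matched below are hypothetical and only serve the illustration.

```python
choices = ['http:', 'https:', 'ftp:']  # hypothetical list of URL schemes
url = 'http://www.python.org'

try:
    url.startswith(choices)  # a list is not accepted here
    list_rejected = False
except TypeError:
    list_rejected = True

# Converting the list to a tuple makes the call valid
matched = url.startswith(tuple(choices))

print(list_rejected)
print(matched)
```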

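3. Matching and searching text

How do we match or search for text that follows a specific pattern?

str.find() + str.endswith() + str.startswith()

If the text to match is a literal string, the basic string methods are usually enough, such as str.find(), str.endswith(), str.startswith(), or similar: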

text = 'yeah, but no, but yeah, but no, but yeah'

# Exact match
print(text == 'yeah')
# >>> False

# Match at start or end
print(text.startswith('yeah'))
# >>> True
print(text.endswith('no'))
# >>> False

# Search for the location of the first occurrence
print(text.find('no'))
# >>> 10
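A small sketch of two related details, using the same text: str.find() returns -1 rather than raising when the substring is absent, and the in operator is the usual way to test for presence alone. The variable names are chosen only for this example.

```python
text = 'yeah, but no, but yeah, but no, but yeah'

# find() signals "not found" with -1, which is truthy, so compare
# explicitly instead of writing `if text.find(...)`.
missing = text.find('maybe')

# When only presence matters, `in` reads more clearly.
present = 'no' in text

# rfind() searches from the right and returns the last occurrence.
last_no = text.rfind('no')

print(missing, present, last_no)
```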

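A more powerful approach from the re module: re.match()

More complex matching calls for regular expressions and the re module. To illustrate the basics, suppose we want to match dates written as digits, such as 11/27/2012. We can do it like this: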

import re

text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

# Simple matching: \d+ means match one or more digits
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')
# >>> yes

if re.match(r'\d+/\d+/\d+', text2):
    print('yes')
else:
    print('no')
# >>> no

If the same pattern will be used for many matches, precompile the pattern string into a pattern object first. For example:

import re

text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

datepat = re.compile(r'\d+/\d+/\d+')

if datepat.match(text1):
    print('yes')
else:
    print('no')
# >>> yes

if datepat.match(text2):
    print('yes')
else:
    print('no')
# >>> no

match() always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead. For example:

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(datepat.findall(text))
# >>> ['11/27/2012', '3/13/2013']
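The start-of-string behaviour of match() can be seen in a short sketch. It also uses search() and fullmatch(), two standard re methods not covered in this post: search() returns the first match anywhere in the string, and fullmatch() requires the whole string to match. The variable names are illustrative only.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() only looks at the start of the string, so this is None
at_start = datepat.match(text)

# search() scans forward and returns the first match it finds
first_hit = datepat.search(text)

# match() does not check the end of the string: trailing text is ignored
prefix_only = datepat.match('11/27/2012abcdef')

# fullmatch() succeeds only if the entire string matches the pattern
whole = datepat.fullmatch('11/27/2012abcdef')

print(at_start)
print(first_hit.group(), first_hit.start())
print(prefix_only.group())
print(whole)
```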

findall() searches the text and returns all matches as a list. To get the matches one at a time through iteration, use finditer() instead. For example:

datepat_new = re.compile(r'(\d+)/(\d+)/(\d+)')  # with capture groups
for m in datepat_new.finditer(text):
    print(m.groups())
# >>> ('11', '27', '2012')
#     ('3', '13', '2013')

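The core steps are to compile the regular expression string with re.compile() and then call match(), findall(), or finditer(). When writing the pattern, it is common to use a raw string such as r'(\d+)/(\d+)/(\d+)'. Python leaves the backslashes in a raw string uninterpreted, which is useful for regular expressions.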
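A brief sketch of why raw strings matter: in an ordinary string literal, '\b' is the backspace character, while in a raw string it stays as a backslash followed by b, which the re module reads as a word boundary. The sample text and variable names are assumptions made for this example.

```python
import re

# Ordinary literal: one backspace character. Raw literal: two characters.
plain_len = len('\b')
raw_len = len(r'\b')

sample = 'yeah, but no, but yeah'

# Raw string: \b reaches the regex engine as a word-boundary assertion
word_hits = re.findall(r'\bno\b', sample)

# Ordinary string: the pattern looks for literal backspace characters
backspace_hits = re.findall('\bno\b', sample)

print(plain_len, raw_len)
print(word_hits)
print(backspace_hits)
```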

4. Searching and replacing text

str.replace()

For simple literal patterns, str.replace() is enough. For example:

text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('yeah', 'yep'))
# >>> yep, but no, but yep, but no, but yep

re.sub()

For more complex patterns, use the sub() function in the re module. Suppose we want to rewrite dates of the form 11/27/2012 as 2012-11-27. For example:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))
# >>> Today is 2012-11-27. PyCon starts 2013-3-13.

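The first argument to sub() is the pattern to match and the second is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern.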
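To show how the group numbers line up, the sketch below pulls the three groups out of one date by hand and compares the result with what the \3-\1-\2 replacement produces. The names month, day, year, manual, and via_sub are introduced only for illustration.

```python
import re

pattern = r'(\d+)/(\d+)/(\d+)'
date = '11/27/2012'

# Groups are numbered left to right by their opening parenthesis:
# group 1 is the month, group 2 the day, group 3 the year.
m = re.match(pattern, date)
month, day, year = m.groups()
manual = f'{year}-{month}-{day}'

# The replacement template refers to the same groups by number
via_sub = re.sub(pattern, r'\3-\1-\2', date)

print(manual)
print(via_sub)
```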

re.sub() + re.compile()

If the same pattern will be used for many substitutions, consider compiling it first with re.compile() to improve performance. For example:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat.sub(r'\3-\1-\2', text))
# >>> Today is 2012-11-27. PyCon starts 2013-3-13.
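Compiling pays off mostly when the pattern is applied repeatedly. A minimal sketch, assuming a hypothetical list of input lines, reuses one compiled pattern across all of them:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# Hypothetical input lines, made up for this example
lines = [
    'Today is 11/27/2012.',
    'PyCon starts 3/13/2013.',
    'No date here.',
]

# The compiled pattern object is reused for every line;
# lines without a match are returned unchanged.
converted = [datepat.sub(r'\3-\1-\2', line) for line in lines]

for line in converted:
    print(line)
```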

sub() is already powerful enough for most needs; the harder part that remains is probably writing the regular expression itself.

This post is mainly based on 《python3-codebook》.
