Python - Remove punctuation from the beginning and end of one or more words



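I would like to know how to remove punctuation from the beginning and end of one or more words. Punctuation that appears between words should not be removed.

For example:

Input: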

word = "!.test-one,-"

Output:

word = "test-one"

Use strip:

>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'

The best solution is to use the built-in str.strip(chars) method.
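Since the question mentions one or more words, here is a minimal sketch that applies the same call to every whitespace-separated word. The helper name strip_words is hypothetical (it is not part of the standard library), and dropping tokens that consist only of punctuation is an assumption about the desired behavior.

```python
import string

def strip_words(text, chars=string.punctuation):
    """Strip leading/trailing punctuation from each whitespace-separated word."""
    # str.split() with no arguments splits on runs of whitespace.
    stripped = (w.strip(chars) for w in text.split())
    # Assumption: tokens made up only of punctuation are dropped entirely.
    return [w for w in stripped if w]

print(strip_words("!.test-one,- (hello) ... 'world!'"))
# ['test-one', 'hello', 'world']
```

Punctuation inside a word, such as the hyphen in test-one, is left alone because strip() only looks at the two ends of each string.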

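Another approach is to use a regular expression with the re module.

To understand what strip() and the regular expression do, you can look at two functions that replicate the behavior of strip(). The first uses recursion, the second uses while loops: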


import re
import string

# Same characters as string.punctuation, written out explicitly.
chars = '''!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~'''

def cstm_strip_1(word, chars):
    # Approach using recursion: drop at most one character per end per call.
    if not word:
        return word
    w = word[1 if word[0] in chars else 0 : -1 if word[-1] in chars else None]
    if w == word:
        return w
    return cstm_strip_1(w, chars)

def cstm_strip_2(word, chars):
    # Approach using while loops: move two indices inward from both ends.
    i, j = 0, len(word)
    while i < j and word[i] in chars:
        i += 1
    while j > i and word[j - 1] in chars:
        j -= 1
    return word[i:j]

chars = string.punctuation
word = "~!.test-one^&test-one--two???"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)

word = "__~!.test-one^&test-one--two??__"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
# The regex result differs here: '_' is a word character, so nothing is removed.
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word
print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc)
# A tab is not a word character, but it is not in string.punctuation either.
print('"', re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='')

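Note that the result of the given regex pattern can differ from the result of .strip(string.punctuation), because the set of characters covered by the pattern [^\w] is not the same as the set of characters in string.punctuation.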
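If the regex result should agree with .strip(string.punctuation), one option is to build the character class from string.punctuation itself. This is only a sketch; the names PUNCT_CLASS, EDGE_PUNCT and regex_strip are made up for illustration.

```python
import re
import string

# Character class containing exactly the characters of string.punctuation.
# re.escape() protects characters such as ']', '^', '-' and '\' inside the class.
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# A run of punctuation at the start of the string, or a run at the end.
EDGE_PUNCT = re.compile(rf"^{PUNCT_CLASS}+|{PUNCT_CLASS}+$")

def regex_strip(word):
    """Remove leading and trailing string.punctuation characters with a regex."""
    return EDGE_PUNCT.sub("", word)

for sample in ["!.test-one,-", "__~!.test-one^&test-one--two??__", "\tword\t"]:
    # Both approaches should now agree, including for '_' and for tabs.
    assert regex_strip(sample) == sample.strip(string.punctuation)
```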

Addendum

What does the regular expression pattern

(^[^\w]+)|([^\w]+$)

mean?

Here is a detailed explanation:

The '|' character means 'or'; it provides two alternatives for the
sub-string (called a match) that is to be found in the given string.

'(^[^\w]+)' is the first of the two alternatives for a match.
    '(' and ')' enclose what is called a "capturing group": (^[^\w]+)
    The first of the two '^' asserts the position at the start of the string
        (or of a line, if re.MULTILINE is set).
    '\w' ('w' escaped with a backslash) means "word character":
        with re.ASCII, the letters a-z, A-Z, the digits 0-9 and the underscore '_';
        for str patterns in Python 3, by default also Unicode letters and digits such as 'ö'.
    The second of the two '^' means logical "not"
        (here: not a "word character"),
        i.e. all characters except word characters
        (for example '~' or ' ').
    Notice that the meaning of '^' depends on context:
        '^' outside of [ ] means start of line/string;
        '^' inside of [ ] as the first character means logical not,
        and anywhere else inside [ ] it means itself.
    '[' and ']' enclose the specification of a set of characters
        and match exactly one occurrence of one of them.
    '+' means between one and unlimited occurrences
        of what the preceding token defines.

'([^\w]+$)' is the second alternative for a match,
    differing from the first in that the match
    must be found at the end of the string.
    '$' means "end of the line" (or "end of string").
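One detail worth checking in Python 3: for str patterns, \w is Unicode-aware by default, so a letter such as 'ö' counts as a word character unless the re.ASCII flag is passed. The short sketch below compares the two modes; the helper name strip_nonword is hypothetical.

```python
import re

PATTERN = r"(^[^\w]+)|([^\w]+$)"

def strip_nonword(word, ascii_only=False):
    """Remove leading/trailing non-word characters, optionally in ASCII mode."""
    flags = re.ASCII if ascii_only else 0
    return re.sub(PATTERN, "", word, flags=flags)

print(strip_nonword("öl!"))                   # 'öl'  ('ö' is a word character)
print(strip_nonword("öl!", ascii_only=True))  # 'l'   ('ö' is outside a-zA-Z0-9_)
```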

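Conceptually, the regex pattern tells the regex engine to work as follows:

The engine checks whether a non-word character occurs at the start of the string. If one is found, it is remembered as a match, and the next character is checked and added to the match if it is also a non-word character. In this way the start of the string is scanned for non-word characters, which, when the pattern is used in re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), are replaced with an empty string (in other words, removed from the string).

Once the engine reaches the first word character, the search at the start of the string is over. Because the first alternative is limited to the start of the string, only the second alternative, which is anchored to the end of the string, can still match.

This way, non-word characters in the middle of the string are never part of a match.

The engine looks for non-word characters at the end of the string just as it did at the start, with '$' making sure that the non-word characters it finds really are at the end of the string.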
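To see which parts of the string the pattern actually matches, the matches can be listed with re.finditer. This is a small illustrative sketch; the helper name edge_matches is made up.

```python
import re

PATTERN = re.compile(r"(^[^\w]+)|([^\w]+$)")

def edge_matches(word):
    """Return (group number, span, text) for every match of PATTERN in word."""
    # m.lastindex is 1 when the first alternative matched, 2 for the second.
    return [(m.lastindex, m.span(), m.group()) for m in PATTERN.finditer(word)]

for match in edge_matches("!.test-one,-"):
    print(match)
# (1, (0, 2), '!.')
# (2, (10, 12), ',-')
```

The hyphen in the middle of test-one never shows up in the list, because neither alternative can match there.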

Using re.sub:

import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)

This gives:

test-one

Check this example, which uses slicing:

import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."
# Drop a single leading punctuation character, if there is one.
if sentence and sentence[0] in string.punctuation:
    sentence = sentence[1:]
# Drop a single trailing punctuation character, if there is one.
if sentence and sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)

Output:

blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers
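Note that the slicing example removes at most one character from each end. If several punctuation characters in a row should be removed, the same slices can be repeated in a loop. A minimal sketch follows, with the hypothetical helper name strip_by_slicing:

```python
import string

def strip_by_slicing(text, chars=string.punctuation):
    """Repeat the single-character slices until neither end is punctuation."""
    # The 'text and ...' guard avoids an IndexError on an empty string.
    while text and text[0] in chars:
        text = text[1:]
    while text and text[-1] in chars:
        text = text[:-1]
    return text

print(strip_by_slicing("!.test-one,-"))
# test-one
```

Each slice copies the string, so for long inputs str.strip(chars) remains the simpler and faster choice.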
