Remove punctuation, but keep emoji

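How can I remove only the punctuation without also removing the emoji? I suspect there is a way to do this with a regular expression, but I'm not sure.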

sentence = ['hello', 'world', '!', '🤬']
def remove_punct(token):
    return [word for word in token if word.isalpha()]
print(remove_punct(sentence))
#output
#['hello', 'world']
#desired output
#['hello', 'world', '🤬']

One approach:

from string import punctuation
sentence = ["hello", "world", "!", "🤬"]
punct_set = set(punctuation)

def remove_punct(token):
    return [word for word in token if word not in punct_set]

print(remove_punct(sentence))

['hello', 'world', '🤬']

The variable punctuation contains:

'!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'

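If a token can consist of several punctuation characters, you can use set.isdisjoint to filter out every word that contains at least one punctuation character: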

# notice the ...
sentence = ["hello", "world", "!", "🤬", "..."]
def remove_punct(token):
    return [word for word in token if punct_set.isdisjoint(word)]
print(remove_punct(sentence))

Output (using set.isdisjoint):

['hello', 'world', '🤬']

Finally, if you want to keep words that contain at least one non-punctuation character, use set.issuperset, like this:

# notice the ... and mr.
sentence = ["hello", "world", "mr.", "!", "🤬", "..."]
def remove_punct(token):
    return [word for word in token if not punct_set.issuperset(word)]
print(remove_punct(sentence))

Output (using set.issuperset):

['hello', 'world', 'mr.', '🤬']  # mr. is kept because it contains mr
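
To make the difference between the three filters above easier to see, here is a small side-by-side sketch that runs all of them on the same list. The variable names by_membership, by_disjoint and by_superset are only illustrative and are not part of the answer.

```python
from string import punctuation

punct_set = set(punctuation)
sentence = ["hello", "world", "mr.", "!", "🤬", "..."]

# 1. Exact membership: drops only tokens that are a single punctuation character
by_membership = [w for w in sentence if w not in punct_set]

# 2. isdisjoint: drops any token containing at least one punctuation character
by_disjoint = [w for w in sentence if punct_set.isdisjoint(w)]

# 3. issuperset: drops only tokens made up entirely of punctuation
by_superset = [w for w in sentence if not punct_set.issuperset(w)]

print(by_membership)  # ['hello', 'world', 'mr.', '🤬', '...']
print(by_disjoint)    # ['hello', 'world', '🤬']
print(by_superset)    # ['hello', 'world', 'mr.', '🤬']
```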

Using Unicode code points: ASCII characters all have Unicode code points of 127 or below (255 or below if you count extended ASCII).

sentence = ['hello', 'world', '!', '🤬']
result = [word for word in sentence if (word.isalpha() or ord(word[0]) > 127)]
# ord('🤬') = 129324
# use index to prevent multicharacter words breaking the code
print(result)
# ['hello', 'world', '🤬']

The ord() function in Python (usually) returns the Unicode code point of a character. Common punctuation is ASCII, so its code point is below 128.
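
A quick check of the code points involved, as a sketch. The name highest is illustrative, and the ellipsis character is an extra example that is not in the original list.

```python
from string import punctuation

# Highest code point among the ASCII punctuation characters ('~')
highest = max(ord(ch) for ch in punctuation)
print(highest)     # 126

# The emoji sits far above the ASCII range
print(ord("🤬"))   # 129324

# Caveat: non-ASCII punctuation such as the ellipsis character
# also passes an ord(...) > 127 test and would therefore be kept
print(ord("…"))    # 8230
```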

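Note that, for this reason, this code may break if the Python program is not running in a Unicode (or similar) environment (the other answers rely on a large list of punctuation characters such as string.punctuation instead). The punctuation-list approach in the other answer can break too, but it is much easier to fix than this one: just add more characters to the list (nbsp, for example). That said, if you are not running in a Unicode environment, displaying the emoji at all should be the bigger worry.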
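
If non-ASCII punctuation matters, one alternative (not used in any of the answers here, so treat it as a sketch under that assumption) is to classify characters with the standard unicodedata module rather than by code point or a fixed list. The helper name is_punctuation_only is hypothetical. Be aware that the Unicode punctuation categories (P*) do not match string.punctuation exactly: characters such as $, + and ~ are classified as symbols, not punctuation.

```python
import unicodedata

def is_punctuation_only(word):
    # True when the token is non-empty and every character belongs to a
    # Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)
    return bool(word) and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )

# "…" and "¿" are extra non-ASCII examples, not from the original question
sentence = ["hello", "world", "!", "🤬", "…", "¿", "mr."]
result = [w for w in sentence if not is_punctuation_only(w)]
print(result)
# ['hello', 'world', '🤬', 'mr.']
```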

Another approach using a list comprehension, which also strips several punctuation characters from a list item in one pass:

from string import punctuation
sentence = ['hello', 'world', '!', '🤬']
r = list(filter(None, [w.translate(str.maketrans('', '', punctuation)) for w in sentence]))
print(r)
# ['hello', 'world', '🤬']

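The str.maketrans() method returns a mapping table for use with the translate() method.

maketrans() is a static method that creates a one-to-one mapping from characters to their translation/replacement. Called with three arguments, as here, every character in the third argument is mapped to None, i.e. deleted.

So, for each word (w) in the list, we look for punctuation characters and replace them with nothing, which turns a punctuation-only item into the empty string ''. After that, we need to filter the list to get rid of the empty items ''. For that we use the expression list(filter(None, [...])).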

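In reply to the comment: this approach also works for groups of punctuation characters, for example sentence = ['hello', 'world', '!*;', '🤬', '...']

Two steps:

1/ First, strip the punctuation: ['hello', 'world', '', '🤬', '']

2/ Then filter the list: ['hello', 'world', '🤬']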

from string import punctuation
sentence = ['hello', 'world', '!*', '🤬', '...']
r = list(filter(None, [w.translate(str.maketrans('', '', punctuation)) for w in sentence]))
print(r)
# ['hello', 'world', '🤬']
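
The two steps can also be written out separately, which makes the intermediate list visible. This is only a sketch: the names table, stripped and kept are illustrative, and the "mr." item is added here to show that translate() also removes punctuation from inside words, unlike the set.issuperset filter.

```python
from string import punctuation

# Build the deletion table once instead of once per word
table = str.maketrans("", "", punctuation)

sentence = ["hello", "world", "!*;", "🤬", "...", "mr."]

# Step 1: delete every punctuation character from every item
stripped = [w.translate(table) for w in sentence]
print(stripped)
# ['hello', 'world', '', '🤬', '', 'mr']

# Step 2: drop the items that became empty strings
kept = [w for w in stripped if w]
print(kept)
# ['hello', 'world', '🤬', 'mr']
```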
