I have this string:
("abs, aaaa aaa")
I want to get this back:
("abs",",","aaaa","aaa")
I tried this:
([i for item in lst for i in item.split()])
But it returns:
("abs","aaaa","aaa")
You can use a regular expression:
import re
data = "abs, aaaa aaa"
out = re.findall(r'\w+|\S', data)
print(out)
# ['abs', ',', 'aaaa', 'aaa']
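The pattern looks for either a word (\w+, one or more word characters) or any single non-whitespace character (\S). Because \w+ is tried first, whole words are kept together and \S only picks up what is left over, such as the comma.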
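To show how the pattern behaves on other inputs, here is a minimal sketch that wraps it in a helper. The function name tokenize and the second test string are my own, chosen for illustration only.

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters (letters, digits, underscore).
    # \S matches any single non-whitespace character that was not consumed
    # as part of a word, so each punctuation mark becomes its own token.
    return re.findall(r"\w+|\S", text)

print(tokenize("abs, aaaa aaa"))  # ['abs', ',', 'aaaa', 'aaa']
print(tokenize("hi!! ok"))        # ['hi', '!', '!', 'ok']
```

Note that consecutive punctuation marks are returned one at a time, since \S matches a single character.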
My solution is simple: replace every comma with " , " (the comma padded with a space on each side), then split:
lst = "abs, aaaa aaa"
lst.replace(",", " , ").split() # ==> ['abs', ',', 'aaaa', 'aaa']
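If more than just commas need to be kept as separate tokens, the same pad-then-split idea might be generalized along these lines. This is a sketch, not part of the answer above: the helper name split_keep_punct and the choice of string.punctuation as the set of marks are assumptions.

```python
import string

def split_keep_punct(text, marks=string.punctuation):
    # Build a translation table that maps each punctuation character to
    # itself padded with spaces, e.g. "," -> " , ".
    table = str.maketrans({ch: f" {ch} " for ch in marks})
    # After padding, a plain whitespace split isolates every mark.
    return text.translate(table).split()

print(split_keep_punct("abs, aaaa aaa"))  # ['abs', ',', 'aaaa', 'aaa']
print(split_keep_punct("a.b;c"))          # ['a', '.', 'b', ';', 'c']
```

Be aware that string.punctuation includes the apostrophe and underscore, so words like "don't" would be split apart; pass a narrower marks string if that is not wanted.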
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize("abs, aaaa aaa")
# Output:
# ['abs', ',', 'aaaa', 'aaa']