Rejoining a sentence to look like the original after tokenizing with nltk word_tokenize



If I split a sentence with nltk.tokenize.word_tokenize() and then rejoin it with ' '.join(), the result is not exactly like the original, because words with internal punctuation get split into separate tokens.

How can I rejoin the tokens programmatically so the result matches the original sentence?

from nltk import word_tokenize
sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
print(sentence)
=> Story: I wish my dog's hair was fluffier, and he ate better
tokens = word_tokenize(sentence)
print(tokens)
=> ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
sentence = ' '.join(tokens)
print(sentence)
=> Story : I wish my dog 's hair was fluffier , and he ate better

Note: 's is detached from dog, unlike in the original.

From this answer: you can use MosesDetokenizer as a solution.

Just remember to download NLTK's subpackage first: nltk.download('perluniprops')

>>> import nltk
>>> sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
>>> from nltk.tokenize.moses import MosesDetokenizer
>>> detokens = MosesDetokenizer().detokenize(tokens, return_str=True)
>>> detokens
"Story: I wish my dog's hair was fluffier, and he ate better"

Alternatively, after joining you can chain str.replace() calls to re-attach the split punctuation:

 sentence.replace(" '", "'").replace(" : ", ": ").replace(" ,", ",")
 # o/p
 Story: I wish my dog's hair was fluffier, and he ate better