Is this just an error in my code, or is NLTK really bad at detecting words?

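I've just started developing a very simple program that takes a txt file and tells you which of its words are misspelled. I looked for the best tool for the job, read about NLTK, and used its "words" corpus. Having done that, I noticed it doesn't do the job properly, or maybe I'm doing something wrong and it's actually my fault. Could someone check?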

from nltk.corpus import words

setwords = set(words.words())


def prompt():
    userinput = input("File to Evaluate: ").strip()
    with open(userinput, 'r') as file:
        words = file.read()
    return words


def main():
    error_list = []
    words = prompt()
    words_splitted = words.split()
    for i in words_splitted:
        if i in setwords:
            pass
        elif i not in setwords:
            error_list.append(i)
    print(f"We're not sure these words exist: {error_list}")


if __name__ == '__main__':
    main()

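The program runs fine, but I'd like some help working out whether NLTK is actually bad at detecting words or whether the failure is in my program. I'm testing it with a .txt file that contains the famous letter to John Quincy Adams from his mother.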

The output on the terminal looks like this: (screenshot of the output)

As you can see in the picture, it prints out a lot of words that shouldn't be flagged at all, such as "ages", "heaven" and "contest".

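NLTK is designed to help with natural language analysis. I'm really not sure it's the best tool for attempting spelling correction. For a start, the word list you are using makes no attempt to include every possible correctly spelled word, because it assumes you will use one of the "stemmers" built into NLTK; a stemmer tries to figure out what the "stem" (or base) of each word is. Stemming lets the package analyze "ages" as the plural form of "age", and the fact that this works means there is no need to include "ages" in the word list.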
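To illustrate the idea of reducing a word to its base before looking it up, here is a minimal sketch. It does not use NLTK at all: `KNOWN_WORDS` is a tiny hypothetical word list, and `candidate_bases` is a toy suffix stripper standing in for a real stemmer, so treat it as a demonstration of the principle rather than a usable spell checker.

```python
# Toy stand-in for a real stemmer: strip a few common English suffixes.
# KNOWN_WORDS is a tiny hypothetical word list used only for this example.
KNOWN_WORDS = {"age", "heaven", "contest", "letter", "walk"}

SUFFIXES = ("es", "s", "ed", "ing")


def candidate_bases(word):
    """Yield the word itself, then the word with each matching suffix removed."""
    yield word
    for suffix in SUFFIXES:
        # Require at least two characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            yield word[: -len(suffix)]


def is_known(word):
    """Return True if the word or one of its candidate bases is in the list."""
    return any(base in KNOWN_WORDS for base in candidate_bases(word.lower()))


print(is_known("ages"))     # matches via the base "age"
print(is_known("walking"))  # matches via the base "walk"
print(is_known("brillig"))  # no base found in the list
```

A real stemmer or lemmatizer handles far more cases than this suffix list, but the lookup pattern is the same: normalize the word first, then check the base form against the word list.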

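It's also worth noting that NLTK includes utilities that do a much better job of splitting input into words than simply calling string.split(), which knows nothing about punctuation. If you are going to use NLTK, you would be well advised to let it do that work for you, for example with the nltk.word_tokenize function.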
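The effect of punctuation on whitespace splitting is easy to see in a small sketch. The sample sentence is made up for this example, and the regular expression is only a rough stand-in for a real tokenizer such as nltk.word_tokenize (which needs NLTK and its tokenizer data installed).

```python
import re

# Made-up sample sentence, used only for illustration.
text = "Great ages have passed, my son; heaven knows the contest."

# str.split() only splits on whitespace, so punctuation stays attached.
naive = text.split()

# Rough regex stand-in for a real tokenizer: runs of letters and apostrophes.
tokens = re.findall(r"[A-Za-z']+", text)

print(naive)
print(tokens)
```

With whitespace splitting, tokens such as "passed," and "contest." keep their trailing punctuation, so they can never match an entry in a word list even when the underlying word is spelled correctly.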

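In addition, if NLTK doesn't know a word, it will usually try to guess what it is, which means it can often identify the part of speech of misspelled or even invented words.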

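For example, I ran its default part-of-speech tagger on Lewis Carroll's famous Jabberwocky to produce the following output. (I added the definition of each part-of-speech tag to make it easier to read.)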

>>> import nltk
>>> # tagdict is assumed to map each tag to a tuple whose first element is the tag's definition.
>>> poem = """'Twas brillig, and the slithy toves
... did gyre and gimble in the wabe:
... All mimsy were the borogoves,
... and the mome raths outgrabe.
... """
>>> print('\n'.join(f"{word+' :':<12}({tag}) {tagdict[tag][0]}"
...                 for word, tag in nltk.pos_tag(nltk.word_tokenize(poem))))
'T :        (NN) noun, common, singular or mass
was :       (VBD) verb, past tense
brillig :   (NN) noun, common, singular or mass
, :         (,) comma
and :       (CC) conjunction, coordinating
the :       (DT) determiner
slithy :    (JJ) adjective or numeral, ordinal
toves :     (NNS) noun, common, plural
did :       (VBD) verb, past tense
gyre :      (NN) noun, common, singular or mass
and :       (CC) conjunction, coordinating
gimble :    (JJ) adjective or numeral, ordinal
in :        (IN) preposition or conjunction, subordinating
the :       (DT) determiner
wabe :      (NN) noun, common, singular or mass
: :         (:) colon or ellipsis
All :       (DT) determiner
mimsy :     (NNS) noun, common, plural
were :      (VBD) verb, past tense
the :       (DT) determiner
borogoves : (NNS) noun, common, plural
, :         (,) comma
and :       (CC) conjunction, coordinating
the :       (DT) determiner
mome :      (JJ) adjective or numeral, ordinal
raths :     (NNS) noun, common, plural
outgrabe :  (RB) adverb
. :         (.) sentence terminator

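NLTK is a remarkable body of work with many practical applications. I'm just not sure that yours is one of them. But if you haven't already done so, take a look at the Creative Commons licensed book that describes NLTK, Natural Language Processing with Python. The book is not only a good guide to the NLTK library but also a gentle introduction to text processing in Python 3.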
