NLTK only processes the last string in a TXT file

I have a .txt file containing four strings, each separated by a newline.

When I tokenize the file, it processes every line of data, which works perfectly.

However, when I try to remove stop words from the file, it only removes them from the last string.

I want to process everything in the file, not just the last sentence.

My code:

with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

#STOP WORDS
stop_words = set(stopwords.words("english"))
words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: " , stpWordsRemoved)

Output:

['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED:  ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']

As shown above, it only processes the last line.

EDIT: the contents of my txt file:

this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.  
smile smiling smiled 
there are multiple words here that you should be able to use for lemmas/synonyms.

Consider merging the stop-word removal into your read loop, like this:

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
with open("d:/example.txt") as the_file:
    for each_line in the_file:
        words = nltk.word_tokenize(each_line)   # tokenize each line as it is read
        print(words)
        stp_words_removed = []
        for word in words:
            if word not in stop_words:          # keep only non-stop-word tokens
                stp_words_removed.append(word)
        print("STOP WORDS REMOVED: ", stp_words_removed)

From your description it sounds like you are feeding only the last line to the stop-word remover. What I don't understand is that, if that were the case, you shouldn't be getting all of those empty lists.
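
To see why only the last line survives, here is a minimal sketch (using a hypothetical in-memory list in place of example.txt): a Python for-loop variable keeps the value from its final iteration after the loop ends, so any tokenization or stop-word removal placed after the loop only ever sees the last line.

# Hypothetical stand-in for the lines read from example.txt
lines = ["first line\n", "second line\n", "last line\n"]

for tkn in lines:
    pass  # per-line work (tokenizing, printing) happens here

# After the loop, tkn still refers to the final line, so code placed
# down here only ever operates on that one line.
print(tkn)  # -> "last line\n"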

You need to collect the results of word_tokenize into a list and then process that list. In your example you only use the last line of the file, after the iteration over it has finished.

Try:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():                       # skip blank lines
            words.extend(word_tokenize(tkn))  # collect tokens from every line

# STOP WORDS
stop_words = set(stopwords.words("english"))
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
