I have a .txt file containing four strings, each separated by newlines. When I tokenize the file, it processes every line of data, which is perfect. However, when I try to remove stop words from the file, it only removes them from the last string. I want to process everything in the file, not just the last sentence.

My code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
Output:
['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED: ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']
As shown above, it only processes the last line.

EDIT: the contents of my txt file:
this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.
smile smiling smiled
there are multiple words here that you should be able to use for lemmas/synonyms.
Consider merging the stop-word removal into your read loop, like this:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
with open("d:/example.txt") as the_file:
    for each_line in the_file:
        print(nltk.word_tokenize(each_line))
        words = nltk.word_tokenize(each_line)
        stp_words_removed = []
        for word in words:
            if word not in stop_words:
                stp_words_removed.append(word)
        print("STOP WORDS REMOVED: ", stp_words_removed)
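The reason the original snippet only handles the last line is Python's loop-variable behavior: the loop variable survives after the for loop ends, still bound to the final item, so any code dedented out of the loop only sees that last value. A minimal illustration (plain Python, no NLTK needed):

    lines = ["first line", "second line", "last line"]
    for tkn in lines:
        pass                  # per-line work would go here

    # After the loop, tkn still exists -- bound to the final item,
    # which is why dedented code processes only the last line.
    print(tkn)  # last line
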
From your description, it sounds like you are only feeding the last line to the stop-word removal step. What I don't understand is that, if that were the case, you shouldn't be getting all those empty lists.
You need to collect the results of word_tokenize into a list, then process that list. In your example, you only use the last line of the file, after the iteration has already finished.

Try:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():                    # skip blank lines
            words.extend(word_tokenize(tkn))  # extend keeps words a flat token list

# STOP WORDS
stop_words = set(stopwords.words("english"))
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
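The structural point in both answers -- filter each line's tokens before moving on to the next -- can be sketched without NLTK at all. In this sketch, str.split() and the tiny STOP set are illustrative stand-ins for NLTK's word_tokenize and stopwords.words("english"), not their actual behavior:

    # Minimal sketch: remove stop words per line, inside the read loop.
    # str.split() and STOP stand in for NLTK's tokenizer and corpus.
    import io

    STOP = {"this", "is", "an", "of", "are", "in", "a", "he", "that"}

    def remove_stop_words(lines):
        """Return one filtered token list per input line."""
        filtered = []
        for line in lines:
            tokens = line.lower().split()   # tokenize this line
            filtered.append([t for t in tokens if t not in STOP])
        return filtered

    sample = io.StringIO("this is an example\nsmile smiling smiled\n")
    print(remove_stop_words(sample))
    # [['example'], ['smile', 'smiling', 'smiled']]

Because the filtering happens inside the loop, every line is processed, and the results for all lines are kept rather than only the last one.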