Python NLTK: extracting sentences that contain keywords



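My goal is to extract the sentences in a text file that contain any word from a list of keywords. My script cleans the text file and uses NLTK to split it into sentences and remove stop words. That part of the script works and produces output that looks correct:

["affirms updated 2020 range guidance provided earlier month longterm earnings dividend growth outlooks", "finally look forward increasing engagement existing potential investors coming months", "turning"]

The script I wrote to extract the sentences containing keywords does not work the way I want. It extracts the keywords, but not the sentences they appear in. The output looks like this:

['', '', '', '', '', '', '', '', 'impact', 'zone']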

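    import nltk
    from nltk.tokenize import word_tokenize

    # split the cleaned text into sentences
    fileinC = nltk.sent_tokenize(fileinB)
    # remove stop words from each sentence
    fileinD = []
    for sent in fileinC:
        fileinD.append(' '.join(w for w in word_tokenize(sent) if w not in allinstops))
    # replace newlines inside sentences with spaces
    fileinE = [sent.replace('\n', " ") for sent in fileinD]
    # extract sentences containing keywords
    fileinF = []
    for sent in fileinE:
        fileinF.append(' '.join(w for w in word_tokenize(sent) if w in keywords))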

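The problem is most likely the conditional inside the append on the last line: the generator filters the words of each sentence, so only the tokens that are keywords get joined and appended, and a sentence with no keyword contributes an empty string. It is more intuitive to break this into smaller steps and test each sentence as a whole, for example: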

    fileinF = []
    for sent in fileinE:
        # tokenize and lowercase tokens of the sentence
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        # if any item in the tokenized sentence is a keyword, append the original sentence
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(sent)
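
For anyone who wants to try the idea without the rest of the script, here is a minimal, self-contained sketch of the same sentence-level filter. It rests on a few assumptions that are not in the original post: the `tokenize` helper is a simplified regex stand-in for `word_tokenize` (so the snippet runs without NLTK data), the function name `sentences_with_keywords` is hypothetical, and the sample sentences and keywords are made up for illustration. Note that the loop above lowercases the tokens but not the keywords, so it only matches when the keywords are already lowercase; the sketch lowercases both sides.

```python
import re


def tokenize(sentence):
    # Simplified stand-in for nltk.word_tokenize (an assumption for this sketch):
    # lowercase the sentence and return runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def sentences_with_keywords(sentences, keywords):
    # Hypothetical helper name. Lowercase the keywords once and store them
    # in a set so that membership tests are fast.
    keyword_set = {k.lower() for k in keywords}
    # Keep the whole original sentence when at least one token is a keyword.
    return [s for s in sentences if keyword_set.intersection(tokenize(s))]


# Made-up sample data, for illustration only.
sentences = [
    "affirms updated 2020 range guidance",
    "The impact on the zone was limited",
    "turning",
]
keywords = ["Impact", "zone"]

matches = sentences_with_keywords(sentences, keywords)
print(matches)
```

Because the comparison is made on whole tokens rather than substrings, a keyword such as "impact" would not match a sentence that only contains "impacted"; whether that is desirable depends on the keyword list.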
