Including more stop-word lists in a logical condition to filter words



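I need to add more conditions when cleaning my data, including removing stop words, days of the week, and months. For the days of the week and the months I created separate lists (I don't know whether Python already has a built-in package that includes them). For numbers, I was thinking of a digit check such as isdigit. Something like this: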

days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')]

How can I include these in the code above? I know it comes down to adding extra if conditions, but I am struggling with it.

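You can convert all of the lists to sets and take their union as a single final set. After that, it is just a matter of checking whether each word is a member of that set. Building the set once also avoids calling stopwords.words('english') again for every word, which the original comprehension does. The following should work: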

# existing code
from nltk.corpus import stopwords
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
# add these lines
stop_words = set(stopwords.words('english'))
lowercase_days = {item.lower() for item in days}
lowercase_months = {item.lower() for item in months}
exclusion_set = lowercase_days.union(lowercase_months).union(stop_words)
# now do the final check
cleaned = [w for w in remove_punc.split() if w.lower() not in exclusion_set and not w.isdigit()]
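
On the side question about a built-in source for the names: the standard library's calendar module exposes day_name and month_name, so the two hand-written lists are optional. Below is a minimal, self-contained sketch of the same set-union approach under a few assumptions: the small stop_words set is a hypothetical stand-in for set(stopwords.words('english')) so the snippet runs without NLTK data, the clean helper and the sample sentence are made up for illustration, and the names come out in English only under the default locale.

```python
import calendar

# Day and month names from the standard library. They are locale-dependent
# and are English under Python's default "C" locale.
days = {name.lower() for name in calendar.day_name}
# calendar.month_name[0] is an empty string, so skip empty entries.
months = {name.lower() for name in calendar.month_name if name}

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"the", "on", "in", "a", "is", "and"}

# One combined set, built once, so each membership test is a hash lookup.
exclusion_set = days | months | stop_words


def clean(text):
    """Return tokens that are neither excluded words nor pure digits."""
    return [
        w for w in text.split()
        if w.lower() not in exclusion_set and not w.isdigit()
    ]


sample = "The meeting is on Monday 12 March in Berlin"
cleaned = clean(sample)
print(cleaned)  # ['meeting', 'Berlin']
```

One thing to watch with this approach: some month names double as ordinary words ("may", "march"), so they will be dropped wherever they appear, not only when they are used as dates.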
