I have a long list of reviews (50 of them), for example this one:
"This was the biggest disappointment of our trip. The restaurant had received some very good reviews, so our expectations were high. The service was slow even though the restaurant was not very full. I had the house salad which could have come out of any Sizzler in the US. The keshi yena, although tasty, reminded me of barbequed pulled chicken. This restaurant is very overrated."
I want to use Python to create a list of words that preserves the sentence tokenization.
After removing the stop words, I want the result for all 50 reviews, with the sentence tokens preserved and the word tokens kept within each tokenized sentence. In the end I would like the result to look something like:
list(c("disappointment", "trip"),
c("restaurant", "received", "good", "reviews", "expectations", "high"),
c("service", "slow", "even", "though", "restaurant", "full"),
c("house", "salad", "come", "us"),
c("although", "tasty", "reminded", "pulled"),
"restaurant")
How can I do this in Python? Is R a good option for this? I really appreciate your help.
If you don't want to create a list of stop words by hand, I would recommend using the nltk library in Python. It also handles sentence splitting (as opposed to splitting on every period). An example that parses your sentences might look like this:
import nltk
stop_words = set(nltk.corpus.stopwords.words('english'))
text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentence_detector.tokenize(text.strip())
results = []
for sentence in sentences:
    # tokenize the sentence into words and keep only alphanumeric tokens
    tokens = nltk.word_tokenize(sentence)
    words = [t.lower() for t in tokens if t.isalnum()]
    # drop the stop words and keep the remaining words of this sentence together
    not_stop_words = tuple([w for w in words if w not in stop_words])
    results.append(not_stop_words)
print(results)
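If the Punkt sentence tokenizer or the stopwords corpus is not installed yet, nltk.data.load and nltk.corpus.stopwords will raise a LookupError; in that case you can download both resources once beforehand:
import nltk
nltk.download('punkt')      # sentence tokenizer models
nltk.download('stopwords')  # stop word lists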
Note, however, that this does not give exactly the same output as listed in the question; instead it looks like this:
[('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]
If the output needs to look identical, you would have to add some stop words manually.
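For example, comparing the two outputs, the extra words produced here are 'biggest', 'could', 'sizzler', 'keshi', 'yena', 'barbequed', 'chicken' and 'overrated'. A minimal sketch to suppress them as well would be to extend the set before the loop:
# extra words to filter out so the result matches the output in the question
extra_stop_words = {'biggest', 'could', 'sizzler', 'keshi', 'yena',
                    'barbequed', 'chicken', 'overrated'}
stop_words |= extra_stop_words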
Not sure whether you need R, but based on your requirements I think this can be done in a pure Python way as well.
You basically want a list of sub-lists that contain the important (non-stop) words of each sentence.
You could write something like this:
input_reviews = """
this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high.
the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated.
"""
# load your stop words list here
stop_words_list = ['this', 'was', 'the', 'of', 'our', 'biggest', 'had', 'some', 'very', 'so', 'were', 'not']
def main():
    sentences = input_reviews.split('.')
    sentence_list = []
    for sentence in sentences:
        inner_list = []
        words_in_sentence = sentence.split(' ')
        for word in words_in_sentence:
            # strip the leading newline that split(' ') leaves on the first word of a line
            stripped_word = str(word).lstrip('\n')
            if stripped_word and stripped_word not in stop_words_list:
                # this is a good word
                inner_list.append(stripped_word)
        if inner_list:
            sentence_list.append(inner_list)
    print(sentence_list)

if __name__ == '__main__':
    main()
On my end, this outputs
[['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], ['service', 'slow', 'even', 'though', 'restaurant', 'full'], ['I', 'house', 'salad', 'which', 'could', 'have', 'come', 'out', 'any', 'sizzler', 'in', 'us'], ['keshi', 'yena,', 'although', 'tasty', 'reminded', 'me', 'barbequed', 'pulled', 'chicken'], ['restaurant', 'is', 'overrated']]
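One limitation visible in this output is that punctuation stays attached to some words ('reviews,', 'yena,') and words are compared as-is. If that matters, a small variation (just a sketch, reusing input_reviews and stop_words_list from above) is to strip surrounding punctuation, and lowercase if your stop word list is lowercase, before comparing:
import string

def clean(word):
    # lowercase the word and strip surrounding punctuation and whitespace
    return word.lower().strip(string.punctuation + string.whitespace)

sentence_list = []
for sentence in input_reviews.split('.'):
    words = [clean(w) for w in sentence.split()]
    inner_list = [w for w in words if w and w not in stop_words_list]
    if inner_list:
        sentence_list.append(inner_list)
print(sentence_list)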
Here is one way to do it. You may want to initialize stop_words appropriately for your application. I am assuming stop_words is all lowercase, hence lower() is used on the original sentences for the comparison. sentences.lower().split('.') gives the sentences; s.split() gives the list of words in each sentence.
stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]
You may wonder why we use filter and lambda. An alternative is this:
stokens = [[word for word in s.split() if word not in stop_words] for s in sentences.lower().split('.')]
filter is a functional programming construct. It helps us process an entire list, in this case through an anonymous function using the lambda syntax.
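As a quick end-to-end check, here is the same one-liner run on the first two review sentences with a small hand-made stop word set (illustrative only; in practice you would plug in a full set such as the NLTK one from the first answer):
stop_words = {'this', 'was', 'the', 'biggest', 'of', 'our', 'had', 'some',
              'very', 'so', 'were'}
sentences = ("This was the biggest disappointment of our trip. "
             "The restaurant had received some very good reviews, so our expectations were high.")

stokens = [list(filter(lambda x: x not in stop_words, s.split()))
           for s in sentences.lower().split('.')]
print(stokens)
# [['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], []]
Note the trailing empty list (from the text after the final period) and the comma still attached to 'reviews,'; splitting on whitespace does not strip punctuation the way nltk.word_tokenize plus isalnum() does in the first answer.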