Count all co-occurrences of large lists of nouns and verbs/adjectives in reviews

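I have a dataframe containing a large number of reviews, a large list of nouns (1000 entries), and another large list of verbs/adjectives (1000 entries).

Example dataframe and lists: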

import pandas as pd
data = {'reviews':['Very professional operation. Room is very clean and comfortable',
'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
'The room is very quiet, and well decorated, very clean.',
'He provides the room with towels, tea, coffee and a wardrobe.',
'Daniel is a great host. Always recomendable.',
'My friend and I were very satisfied with our stay in his apartment.']}
df = pd.DataFrame(data)
nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
'à','station','bed','experience','hosts','Thank','bien']
verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

I want to create a dictionary that stores, for each review, all co-occurrences of nouns with verbs/adjectives. For example, the review

"Very professional operation. Room is very clean and comfortable"

{'room': {'is': 1, 'clean': 1, 'comfortable': 1}}

I am using the following code:

from collections import Counter
from copy import deepcopy
from pprint import pprint

def count_co_occurences(reviews):
    # Iterate on each review and count every word once per matching noun
    occurences_per_review = {
        f"review_{i+1}": {
            noun: dict(Counter(review.lower().split(" ")))
            for noun in nouns
            if noun in review.lower()
        }
        for i, review in enumerate(reviews)
    }
    # Remove verb_adj not found in main list
    # (iterate over a deep copy so entries can be deleted from the original)
    opr = deepcopy(occurences_per_review)
    for review, occurences in opr.items():
        for noun, counts in occurences.items():
            for verb_adj in counts.keys():
                if verb_adj not in verbs_adj:
                    del occurences_per_review[review][noun][verb_adj]

    return occurences_per_review

pprint(count_co_occurences(data["reviews"]))

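This works when the lists and the number of reviews are small, but my notebook crashes when I run the function on large lists and a large number of reviews. How can I modify the code to handle this?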

I think you may want to use a couple of libraries to make your life easier. In this example I am using nltk and collections, in addition to pandas of course:

import pandas as pd
import nltk
from collections import Counter
data = {'reviews':['Very professional operation. Room is very clean and comfortable',
'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
'The room is very quiet, and well decorated, very clean.',
'He provides the room with towels, tea, coffee and a wardrobe.',
'Daniel is a great host. Always recomendable.',
'My friend and I were very satisfied with our stay in his apartment.']}
df = pd.DataFrame(data)
nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
'à','station','bed','experience','hosts','Thank','bien']
verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]
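def buildict(x):
    occurdict = {}
    # word_tokenize needs NLTK's Punkt tokenizer data, downloaded once
    # (e.g. nltk.download('punkt'); newer NLTK releases may ask for 'punkt_tab')
    tokens = nltk.word_tokenize(x)
    tokenslower = list(map(str.lower, tokens))
    # Nouns that actually occur in this review
    allnouns = [word for word in tokenslower if word in nouns]
    # Count each verb/adjective once per review, not once per noun
    allverbs_adj = Counter(word for word in tokenslower if word in verbs_adj)
    for noun in allnouns:
        occurdict[noun] = dict(allverbs_adj)
    return occurdict

df['words'] = df['reviews'].apply(buildict)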

Output:

0   Very professional operation. Room is very clea...   {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}
1   Daniel is the most amazing host! His place is ...   {'host': {'is': 3, 'amazing': 1, 'clean': 1, '...
2   The room is very quiet, and well decorated, ve...   {'room': {'is': 1, 'quiet': 1, 'clean': 1}}
3   He provides the room with towels, tea, coffee ...   {'room': {}}
4   Daniel is a great host. Always recomendable.    {'host': {'is': 1, 'great': 1}}
5   My friend and I were very satisfied with our s...   {'stay': {'were': 1, 'stay': 1}, 'apartment': ...
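With lists of around 1000 entries, each `word in nouns` test scans the whole list, so converting both lists to sets once should make lookups much cheaper. Below is a minimal sketch of that idea, not the answer's exact method. It makes three assumptions: the shortened `nouns` and `verbs_adj` lists are illustrative only, the name `build_dict_fast` is hypothetical, and a plain `\w+` regex stands in for `nltk.word_tokenize` (so tokens such as `'s` are not produced). The vocabularies are lowercased here, because entries such as 'Amsterdam' or 'Great' can never match lowercased tokens otherwise.

```python
import re
from collections import Counter

# Illustrative, shortened vocabularies (assumption: the real lists have ~1000 entries)
nouns = ['place', 'Amsterdam', 'apartment', 'host', 'stay', 'room', 'bed']
verbs_adj = ['was', 'is', 'great', 'clean', 'were', 'stay', 'comfortable', 'quiet']

# Build the lookup sets once; membership tests on a set are O(1) on average
NOUN_SET = {w.lower() for w in nouns}
VERB_ADJ_SET = {w.lower() for w in verbs_adj}

# Assumption: a simple word regex replaces nltk.word_tokenize
TOKEN_RE = re.compile(r"\w+")

def build_dict_fast(review):
    """Map each noun found in the review to the review's verb/adjective counts."""
    tokens = TOKEN_RE.findall(review.lower())
    counts = Counter(t for t in tokens if t in VERB_ADJ_SET)
    # Each noun gets its own copy of the counts dictionary
    return {t: dict(counts) for t in tokens if t in NOUN_SET}

print(build_dict_fast('Very professional operation. Room is very clean and comfortable'))
```

On a dataframe this could be applied the same way as above, for instance `df['words'] = df['reviews'].apply(build_dict_fast)`.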
