Finding word frequencies in pandas from a dictionary



This is the code I am using:

import pandas as pd
data = [['This is a long sentence which contains a lot of words among them happy', 1],
        ['This is another sentence which contains the word happy* with special character', 1],
        ['Content and merry are another words which implies happy', 2],
        ['Sad is not happy', 2],
        ['unfortunate has negative connotations', 1]]
df = pd.DataFrame(data, columns=['string', 'id'])
words = {
    "positive": ["happy", "content"],
    "negative": ["sad", "unfortunate"],
    "neutral": ["neutral", "000"]
}

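I want the output DataFrame to be built by taking the keywords from the dictionary and searching for them in the DataFrame, but each category may be counted only once per id.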

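Simply put:

  • Group by id
  • For each group: check whether the group's sentences contain at least one positive word, at least one negative word, and at least one neutral word
  • Then add up the counts across all groups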

For example:

                                              string  id
0  This is a long sentence which contains a lot o...   1
1  This is another sentence which contains the wo...   1
2  Content and merry are another words which impl...   2
3                                   Sad is not happy   2
4              unfortunate has negative connotations   1

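For id "1", rows 0 and 1 both contain values listed under the dictionary key positive. So for id 1, positive may be counted only once. Likewise, the last row contains the word "unfortunate", so

for id 1:

positive: 1

negative: 1

neutral: 0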

After summing over all ids, the final DataFrame should look like this:

word        freq
positive     2
negative     2
neutral      0

Can you tell me how to achieve this in pandas?

This works because any() short-circuits (it stops evaluating at the first matching value).

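# Join all sentences of each id into a single text
texts = df.groupby('id')[['string']].agg(lambda x: ' '.join(x))
for k, v in words.items():
    # One boolean column per category: True if any keyword occurs in the group's text
    texts[k] = texts['string'].transform(
        lambda text: any(word.lower() in text.lower() for word in v)
    )
# Summing the boolean columns counts the ids in which each category occurs
result = texts[list(words)].sum(axis=0)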

result is a Series:

positive    2
negative    2
neutral     0
dtype: int64
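
The short-circuiting mentioned above can be checked in isolation. The sketch below is only an illustration: the `contains` helper and the `checked` list are hypothetical names introduced here to record which keywords were actually tested.

```python
checked = []  # hypothetical log of the keywords that were actually tested

def contains(word, text):
    # Record the attempt, then do the same substring test as the answer
    checked.append(word)
    return word in text

text = 'sad is not happy'
# The generator is consumed lazily, so any() stops at the first True
found = any(contains(w, text) for w in ['happy', 'content'])

print(found)    # True
print(checked)  # ['happy'] ; 'content' was never tested
```

Because evaluation stops at the first hit, a category contributes at most one True per id no matter how many of its keywords occur.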

You can convert it to a DataFrame like this:

result_df = result.to_frame().reset_index().set_axis(['word', 'freq'], axis=1)

       word  freq
0  positive     2
1  negative     2
2   neutral     0
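
One caveat: a plain substring test also matches inside longer words (for example "content" inside "discontent"). If whole-word matching is wanted, a possible variant is sketched below. It assumes the same df and words as in the question; `category_pattern`, `flags` and `per_id` are names made up for this sketch, and it swaps the substring test for a regular expression with word boundaries.

```python
import re
import pandas as pd

data = [['This is a long sentence which contains a lot of words among them happy', 1],
        ['This is another sentence which contains the word happy* with special character', 1],
        ['Content and merry are another words which implies happy', 2],
        ['Sad is not happy', 2],
        ['unfortunate has negative connotations', 1]]
df = pd.DataFrame(data, columns=['string', 'id'])
words = {
    "positive": ["happy", "content"],
    "negative": ["sad", "unfortunate"],
    "neutral": ["neutral", "000"],
}

def category_pattern(keywords):
    # \b keeps matches on whole words; re.escape guards special characters
    return r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

# One boolean column per category, evaluated row by row (case-insensitive)
flags = pd.DataFrame({
    category: df['string'].str.contains(category_pattern(keywords), case=False, regex=True)
    for category, keywords in words.items()
})

# A category counts once per id if any row of that id matched
per_id = flags.groupby(df['id']).any()

# Sum the per-id booleans and shape the result as word/freq
result_df = per_id.sum().rename_axis('word').reset_index(name='freq')
print(result_df)
```

For the sample data this gives the same counts as the substring version, since every keyword appears there as a whole word.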

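The code below should do the job, although it does not rely entirely on pandas. Note that I use phrase.lower() so that matches are counted correctly regardless of case.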

from collections import Counter

# Collect all sentences of each id into one list
out = df.groupby("id")['string'].apply(list)

def get_count(grouped_element):
    # Each category is counted at most once per group
    counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
    words = {
        "positive": ["happy", "content"],
        "negative": ["sad", "unfortunate"],
        "neutral": ["neutral", "000"]
    }
    for phrase in grouped_element:
        if counter["positive"] < 1:
            for word in words["positive"]:
                if word in phrase.lower():
                    counter.update(["positive"])
                    break
        if counter["negative"] < 1:
            for word in words["negative"]:
                if word in phrase.lower():
                    counter.update(["negative"])
                    break
        if counter["neutral"] < 1:
            for word in words["neutral"]:
                if word in phrase.lower():
                    counter.update(["neutral"])
                    break
    return counter

# Add up the per-id counters
counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
for phrases in out:
    result = get_count(phrases)
    counter.update(result)
print(counter)

The output is:

Counter({'positive': 2, 'negative': 2, 'neutral': 0})

To convert it to a DataFrame:

out = {"word": [], "freq": []}
for key, val in counter.items():
    out["word"].append(key)
    out["freq"].append(val)
pd.DataFrame(out)

       word  freq
0  positive     2
1  negative     2
2   neutral     0
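
The three near-identical branches in get_count could also be folded into a single loop over the dictionary. The following is a minimal sketch of that idea, assuming the df and words from the question; `lowered` and `freq_df` are names introduced only for this illustration.

```python
from collections import Counter
import pandas as pd

data = [['This is a long sentence which contains a lot of words among them happy', 1],
        ['This is another sentence which contains the word happy* with special character', 1],
        ['Content and merry are another words which implies happy', 2],
        ['Sad is not happy', 2],
        ['unfortunate has negative connotations', 1]]
df = pd.DataFrame(data, columns=['string', 'id'])
words = {
    "positive": ["happy", "content"],
    "negative": ["sad", "unfortunate"],
    "neutral": ["neutral", "000"],
}

# Start every category at zero so that unmatched ones still appear
counter = Counter({category: 0 for category in words})

for _, phrases in df.groupby('id')['string']:
    lowered = [phrase.lower() for phrase in phrases]
    for category, keywords in words.items():
        # any() stops at the first hit, so a category adds at most 1 per id
        if any(k in phrase for phrase in lowered for k in keywords):
            counter[category] += 1

freq_df = pd.DataFrame({'word': list(counter), 'freq': list(counter.values())})
print(freq_df)
```

Adding a new category then only requires a new dictionary entry, not another branch.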
