I have a dataframe in which the Message column holds lists of tokens (words) and Score is a numeric column; there is also a Month column:
Message Score Month
["a", "a", "b", "c"] 5 1
["a", "b", "d", "e"] 4 1
["b", "b", "d", "e"] 4 1
And I have a list of words:
l = ["a", "b", "c", "d", "e"]
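I need the average score per month for each word in the list. So if the word "a" appears in 2 rows of my dataframe, it should return the mean of the scores of those 2 rows. The expected result should be: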
month word avg_score
1 a 4.5 #--> it's in the first and second rows of my dataframe, so avg = (5+4)/2
1 b 4.33 #--> it's in the first, second and third rows of my dataframe, so avg = (5+4+4)/3
1 c 5 #--> it's in the first row of my dataframe, so avg = 5/1
1 d 4.0 #--> it's in the second and third rows of my dataframe, so avg = (4+4)/2
1 e 4.0 #--> it's in the second and third rows of my dataframe, so avg = (4+4)/2
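What I tried: this only computes the average per word, not per month. That is probably because I store the numbers in a dictionary keyed by word, but I don't know another way to do it.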
import pandas as pd

dicts = {}
for item in l:
    # keep only the rows whose token list contains the current word
    df_new = df[df['Message'].apply(lambda x: item in x)]
    mean = df_new.Score.mean()
    dicts[item] = mean
df_new = pd.DataFrame(dicts.items(), columns=['word', 'avg_score'])
df_new
Given the dataframe
df
Message Score Month
0 [a, a, b, c] 5 1
1 [a, b, d, e] 4 1
2 [b, b, d, e] 4 1
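Since we only care about whether a word occurs in a row, not how many times, we can convert each list to a set to drop the duplicates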
df['Message']=df['Message'].apply(set)
df
Message Score Month
0 {a, c, b} 5 1
1 {a, e, b, d} 4 1
2 {e, b, d} 4 1
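To see why the deduplication matters, here is a small sketch comparing the mean for "a" with and without the set conversion. The names `raw`, `with_dupes`, `deduped` and `without_dupes` are mine, not from the question; the data is the three-row frame above.

```python
import pandas as pd

raw = pd.DataFrame({
    "Message": [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["b", "b", "d", "e"]],
    "Score": [5, 4, 4],
    "Month": [1, 1, 1],
})

# Without deduplication: row 0 contributes its score twice for "a"
with_dupes = raw.explode("Message").groupby(["Month", "Message"])["Score"].mean()

# With deduplication: each row contributes at most once per word
deduped = raw.assign(Message=raw["Message"].apply(set))
without_dupes = deduped.explode("Message").groupby(["Month", "Message"])["Score"].mean()

print(with_dupes.loc[(1, "a")])     # (5 + 5 + 4) / 3
print(without_dupes.loc[(1, "a")])  # (5 + 4) / 2
```

Only the second figure matches the 4.5 that the question expects for "a".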
Then explode the sets into one row per word and group by month and word to get the mean score of each unique entry
df.explode('Message').groupby(['Month','Message']).mean()
Score
Month Message
1 a 4.500000
b 4.333333
c 5.000000
d 4.000000
e 4.000000
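Putting the steps together, a self-contained sketch might look like the following. It assumes the frame and the list `l` from the question; the filter on `l` and the renaming to `month`, `word` and `avg_score` are my additions to match the layout the question asks for, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Message": [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["b", "b", "d", "e"]],
    "Score": [5, 4, 4],
    "Month": [1, 1, 1],
})
l = ["a", "b", "c", "d", "e"]

result = (
    df.assign(Message=df["Message"].apply(set))      # one entry per word per row
      .explode("Message")                            # one row per (row, word) pair
      .rename(columns={"Message": "word", "Month": "month"})
      .loc[lambda d: d["word"].isin(l)]              # keep only the words of interest
      .groupby(["month", "word"], as_index=False)["Score"]
      .mean()
      .rename(columns={"Score": "avg_score"})
)
print(result)
```

With more than one month in the data, the same chain yields one row per (month, word) pair, which is what the dictionary-based attempt could not express.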