How to solve this vague and overly localized pandas problem



I have this DataFrame, and I want to compute a score for each element in the Distinct Values column.

+-------------------------------+-------------------------------+------------------------------+
|            A                  |             B                 |      Distinct Values         |
+-------------------------------+-------------------------------+------------------------------+
| ['a', 'b', 'c']               |   ['a', 'b']                  |  ['a', 'b', 'c']             | 
| ['c', 'b', 'e', 'a']          |   ['b', 'e', 'a']             |  ['a', 'b', 'e', 'c']        |
| ['a', 'b', 'd', 'e']          |   ['a', 'b', 'c']             |  ['a', 'b', 'd', 'e', 'c']   |
| ['a', 'b', 'c']               |   ['a', 'd', 'c']             |  ['a', 'b', 'c', 'd']        |
|                               |                               |                              |  
+-------------------------------+-------------------------------+------------------------------+


          (number of times the element occurs in A and B)
Score =  ----------------------------------------------------
          (total number of elements in Distinct Values)

This is what it should look like after scoring:

+------------------------+--------------------+-----------------------------------------------+
|            A           |         B          |      Distinct_Values_with_scoring             |
+------------------------+--------------------+-----------------------------------------------+
| ['a', 'b', 'c']        |   ['a', 'b']       |['a':2/3, 'b':2/3, 'c':1/3]                    | 
| ['c', 'b', 'e', 'a']   |   ['b', 'e', 'a']  |['a':2/4, 'b':2/4, 'e':2/4, 'c':1/4]           |
| ['a', 'b', 'd', 'e']   |   ['a', 'b', 'c']  |['a':2/5, 'b':2/5, 'd':1/5, 'e':1/5, 'c':1/5]  |
| ['a', 'b', 'c']        |   ['a', 'd', 'c']  |['a':2/4, 'b':1/4, 'c':2/4, 'd':1/4]           |
|                        |                    |                                               |  
+------------------------+--------------------+-----------------------------------------------+

How should I go about solving this with pandas?

import pandas as pd

d = {"A": [['a', 'b', 'c'], ['c', 'b', 'e', 'a'],
           ['a', 'b', 'd', 'e'], ['a', 'b', 'c']],
     "B": [['a', 'b'], ['b', 'e', 'a'], ['a', 'b', 'c'],
           ['a', 'd', 'c']],
     "Distinct Values": [['a', 'b', 'c'], ['a', 'b', 'e', 'c'],
                         ['a', 'b', 'd', 'e', 'c'], ['a', 'b', 'c', 'd']]}
data = pd.DataFrame(d)

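Because columns A and B both hold lists, you can simply add them together to get all the elements of a row. Then use Counter inside a dictionary comprehension to get the count of each letter, and divide each count by the number of unique letters (given by the length of the set).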

from collections import Counter

import pandas as pd

# Sample data.
df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['c', 'b', 'e', 'a'], ['a', 'b', 'd', 'e'], ['a', 'b', 'c']],
    'B': [['a', 'b'], ['b', 'e', 'a'], ['a', 'b', 'c'], ['a', 'd', 'c']]
})

# Solution: concatenate the two lists per row, then score each letter.
>>> df.assign(
...     Distinct_Values_with_scoring=
...         df['A']
...         .add(df['B'])
...         .apply(lambda x: {k: v / len(set(x)) for k, v in Counter(x).items()})
... )
              A          B                       Distinct_Values_with_scoring
0     [a, b, c]     [a, b]  {'a': 0.6666666666666666, 'b': 0.6666666666666...
1  [c, b, e, a]  [b, e, a]          {'c': 0.25, 'b': 0.5, 'e': 0.5, 'a': 0.5}
2  [a, b, d, e]  [a, b, c]  {'a': 0.4, 'b': 0.4, 'd': 0.2, 'e': 0.2, 'c': ...
3     [a, b, c]  [a, d, c]         {'a': 0.5, 'b': 0.25, 'c': 0.5, 'd': 0.25}
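
The output above holds floats, whereas the question displays unreduced fractions such as 2/4. If that display format matters, here is a minimal sketch of one way to produce it. The helper name score_as_text is hypothetical and not part of the answer; it assumes the denominator is the number of unique letters in A and B combined, as in the answer.

```python
from collections import Counter

def score_as_text(a, b):
    """Return {letter: 'count/n_distinct'} for the letters in lists a and b."""
    combined = a + b
    n_distinct = len(set(combined))  # assumption: same denominator as the answer above
    return {k: f"{v}/{n_distinct}" for k, v in Counter(combined).items()}

print(score_as_text(['c', 'b', 'e', 'a'], ['b', 'e', 'a']))
# {'c': '1/4', 'b': '2/4', 'e': '2/4', 'a': '2/4'}
```

On the DataFrame this could be applied per row, for example with df.apply(lambda row: score_as_text(row['A'], row['B']), axis=1). Strings are used deliberately, since fractions.Fraction would reduce 2/4 to 1/2.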

You can do:

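# df is the question's DataFrame, including the "Distinct Values" column.
def func(x):
    d = {}
    ele = x['A'] + x['B']  # all elements of the row, duplicates included
    for i in x["Distinct Values"]:
        d[i] = ele.count(i) / len(x["Distinct Values"])
    return d

df["Distinct_Values_with_scoring"] = df.apply(func, axis=1)
print(df)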
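
The loop above calls list.count once per distinct value, so it rescans the combined list each time. For long lists, a possible variation counts everything in one pass with Counter. This is only a sketch: score_row is a hypothetical name, and the row is shown as a plain dict so the example runs without pandas (a DataFrame row passed by apply with axis=1 supports the same key lookups).

```python
from collections import Counter

def score_row(row):
    """Score each entry of row['Distinct Values'] by its count in A + B."""
    counts = Counter(row['A'] + row['B'])  # one pass over the combined list
    n = len(row['Distinct Values'])
    # Counter returns 0 for a letter that appears in neither A nor B.
    return {k: counts[k] / n for k in row['Distinct Values']}

row = {'A': ['a', 'b', 'c'], 'B': ['a', 'd', 'c'],
       'Distinct Values': ['a', 'b', 'c', 'd']}
print(score_row(row))  # {'a': 0.5, 'b': 0.25, 'c': 0.5, 'd': 0.25}
```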

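First, you can build the combined list of all values and the number of distinct values for each row. Then you can apply value_counts to the combined list and divide by the distinct length.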

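data['all_values'] = data['A'] + data['B']
data['distinct len'] = data['Distinct Values'].apply(len)
# Count each letter per row, then divide by that row's number of distinct values.
data[['all_values', 'distinct len']].apply(
    lambda x: pd.Series(x['all_values']).value_counts() / x['distinct len'], axis=1)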
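
Because each row yields a Series with its own letters as labels, the expression above tends to return a wide table with one column per letter and NaN where a letter is absent, rather than one dictionary per row. If the dictionary shape from the question is wanted, one possible variation is sketched below. The helper row_scores is illustrative and not part of the answer, and the sample frame is a shortened copy of the question's data.

```python
import pandas as pd

data = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['c', 'b', 'e', 'a']],
    'B': [['a', 'b'], ['b', 'e', 'a']],
    'Distinct Values': [['a', 'b', 'c'], ['a', 'b', 'e', 'c']],
})

def row_scores(row):
    """Return {letter: count / number of distinct values} for one row."""
    counts = pd.Series(row['A'] + row['B']).value_counts()
    return (counts / len(row['Distinct Values'])).to_dict()

# One dictionary per row, stored in a new column.
data['Distinct_Values_with_scoring'] = data.apply(row_scores, axis=1)
print(data['Distinct_Values_with_scoring'].tolist())
```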

I'm waiting for my model to run, so here is an overkill answer:

import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['c', 'b', 'e', 'a'], ['a', 'b', 'c']],
    'B': [['a', 'b'], ['b', 'e', 'a'], ['a', 'b', 'd', 'e']]
})

# Explode both columns to one letter per row, then outer-merge on (row, letter).
(df['A'].explode().reset_index()
    .merge(df['B'].explode().reset_index(),
           left_on=['index', 'A'],
           right_on=['index', 'B'],
           how='outer')
    .set_index('index')
    # occur: in how many of A and B the letter appears; value: the letter itself.
    .assign(occur=lambda x: x.notna().sum(axis=1),
            value=lambda x: x.ffill(axis=1)['B'])
    # total: number of distinct letters in the original row.
    .assign(total=lambda x: x.groupby('index')['occur'].transform('count'))
    .assign(score=lambda x: x['occur'] / x['total'])
    .drop(['A', 'B', 'occur', 'total'], axis=1)
    .groupby('index').apply(lambda x: x.set_index('value')['score'].to_dict())
)

Output:

index
0    {'a': 0.6666666666666666, 'b': 0.6666666666666...
1            {'c': 0.25, 'b': 0.5, 'e': 0.5, 'a': 0.5}
2    {'a': 0.4, 'b': 0.4, 'c': 0.2, 'd': 0.2, 'e': ...
dtype: object
