统计数据框列中列表项的出现次数,按另一列分组



我有一个数据框架,其中一列包含非唯一的单词列表,另一列包含从1到3的数字。我想计算数据框中每个单词的出现次数,按第二列中的数字分组。

的例子:

tbody> <<tr>
words category
猫,狗,狗1
猫,猫,老鼠1
老鼠,猫,狗,大象2
大象,大象3

我们可以用DataFrame.explode+crosstab:

# If Not Already a List
# df['words'] = df['words'].str.split(', ')
new_df = df.explode('words')
new_df = pd.crosstab(
new_df['words'], new_df['category']
).reset_index().rename_axis(columns=None)

explodegroupby size+unstack或:

new_df = (
df.explode('words')  # Explode List into Rows
.groupby(['words', 'category']).size()  # Calculate Group Sizes
.unstack(fill_value=0)  # Convert Category values to column names 
.reset_index().rename_axis(columns=None)  # Cleanup
)

DataFrame.value_counts+unstackexplode:

new_df = (
df.explode('words')  # Explode List into Rows
.value_counts()  # Count Value Pairs
.unstack(level='category',  # Convert Category values to column names
fill_value=0)
.reset_index().rename_axis(columns=None)  # Cleanup
)

new_df:

words  1  2  3
0       cat  3  1  0
1       dog  2  1  0
2  elephant  0  1  2
3     mouse  1  1  0

设置:

import pandas as pd
df = pd.DataFrame({
'words': [['cat', 'dog', 'dog'], ['cat', 'cat', 'mouse'],
['mouse', 'cat', 'dog', 'elephant'], ['elephant', 'elephant']],
'category': [1, 1, 2, 3]
})

我觉得对于这样的数据结构,在返回Pandas之前,如果数据在Pandas外部被打乱,您可能会有更好的性能(当然,这只在您关心性能时才重要,没有必要进行不必要的优化)-当然,测试是确保这是正确的唯一方法:

from collections import defaultdict
d = defaultdict(int)
for words, number in zip(df.words, df.category):
for word in words:
d[(word, number)] += 1

d
defaultdict(int,
{('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})

构建DataFrame:

(pd.DataFrame(d.values(), index = d)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0

参考@HenryEcker,你也可以使用Counter函数:

from itertools import product, chain
from collections import Counter
# integers are put into a list as `product` works on iterables
pairing = (product(left, [right]) 
for left, right 
in zip(df.words, df.category))
outcome = Counter(chain.from_iterable(pairing))
outcome
Counter({('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})

像前面那样构建数据框架:

(pd.DataFrame(outcome.values(), index = outcome)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0

最新更新