我有一个数据框架,其中一列包含非唯一的单词列表,另一列包含从1到3的数字。我想计算数据框中每个单词的出现次数,按第二列中的数字分组。
的例子:
words | category | 猫,狗,狗 | 1 |
---|---|
猫,猫,老鼠 | 1 |
老鼠,猫,狗,大象 | 2 |
大象,大象 | 3 |
我们可以用DataFrame.explode
+crosstab
:
# If Not Already a List
# df['words'] = df['words'].str.split(', ')
new_df = df.explode('words')
new_df = pd.crosstab(
new_df['words'], new_df['category']
).reset_index().rename_axis(columns=None)
与explode
后groupby size
+unstack
或:
new_df = (
df.explode('words') # Explode List into Rows
.groupby(['words', 'category']).size() # Calculate Group Sizes
.unstack(fill_value=0) # Convert Category values to column names
.reset_index().rename_axis(columns=None) # Cleanup
)
或DataFrame.value_counts
+unstack
后explode
:
new_df = (
df.explode('words') # Explode List into Rows
.value_counts() # Count Value Pairs
.unstack(level='category', # Convert Category values to column names
fill_value=0)
.reset_index().rename_axis(columns=None) # Cleanup
)
new_df
:
words 1 2 3
0 cat 3 1 0
1 dog 2 1 0
2 elephant 0 1 2
3 mouse 1 1 0
设置:
import pandas as pd
df = pd.DataFrame({
'words': [['cat', 'dog', 'dog'], ['cat', 'cat', 'mouse'],
['mouse', 'cat', 'dog', 'elephant'], ['elephant', 'elephant']],
'category': [1, 1, 2, 3]
})
我觉得对于这样的数据结构,在返回Pandas之前,如果数据在Pandas外部被打乱,您可能会有更好的性能(当然,这只在您关心性能时才重要,没有必要进行不必要的优化)-当然,测试是确保这是正确的唯一方法:
from collections import defaultdict
d = defaultdict(int)
for words, number in zip(df.words, df.category):
for word in words:
d[(word, number)] += 1
d
defaultdict(int,
{('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})
构建DataFrame:
(pd.DataFrame(d.values(), index = d)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
1 2 3
cat 3 1 0
dog 2 1 0
elephant 0 1 2
mouse 1 1 0
参考@HenryEcker,你也可以使用Counter
函数:
from itertools import product, chain
from collections import Counter
# integers are put into a list as `product` works on iterables
pairing = (product(left, [right])
for left, right
in zip(df.words, df.category))
outcome = Counter(chain.from_iterable(pairing))
outcome
Counter({('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})
像前面那样构建数据框架:
(pd.DataFrame(outcome.values(), index = outcome)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
1 2 3
cat 3 1 0
dog 2 1 0
elephant 0 1 2
mouse 1 1 0