Python and pandas: using the bin counts from one df to get similarly binned counts from another df with no shared columns



Given df1, I know how to get binned value counts using .value_counts():

import pandas as pd

df1 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1],
                    'another_column': ['blue', 'blue', 'blue', 'red', 'green', 'purple', 'blue', 'blue', 'blue', 'orange']})
# Count how many values fall into each bin
df1['numbers'].value_counts(bins=[0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1])

Result:

(0.6, 0.7]       2
(0.1, 0.2]       2
(0.9, 1.0]       1
(0.8, 0.9]       1
(0.5, 0.6]       1
(0.3, 0.4]       1
(0.2, 0.3]       1
(-0.001, 0.1]    1
(0.7, 0.8]       0
(0.4, 0.5]       0
Name: numbers, dtype: int64
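
The (-0.001, 0.1] label and the ordering by count can be confusing. value_counts(bins=...) is a convenience wrapper around pd.cut that includes the lowest edge, and it sorts by count unless told otherwise. A minimal sketch using the same numbers as df1; the names edges, counts_in_bin_order and counts_by_hand are my own:

```python
import pandas as pd

numbers = pd.Series([0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1])
edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]

# sort=False keeps the bins in ascending order instead of ordering by count
counts_in_bin_order = numbers.value_counts(bins=edges, sort=False)

# Roughly the same thing by hand: label each value with its bin, then count.
# include_lowest=True is what turns the first bin into (-0.001, 0.1].
labels = pd.cut(numbers, bins=edges, include_lowest=True)
counts_by_hand = labels.value_counts(sort=False)

print(counts_in_bin_order)
print(counts_by_hand)
```

Keeping the bins in ascending order makes it easier to compare the counts with those of another DataFrame.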

Given another df that is much larger than df1 (example below):

df2 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.98],
'nonshared_column': ['cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish']})

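I want to use the bins from df1 to filter df2, so that the output df is a subset of df2 whose per-bin row counts match those of df1.

So the output df would have 1 row with a 'numbers' value between 0 and 0.1, 2 rows with 'numbers' values between 0.1 and 0.2, and so on, up to 1 row with a 'numbers' value between 0.9 and 1. The output rows should include all columns from df2 (in this example nonshared_column as well as the numbers column).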

Use cut with the bins taken from the index of the Series s:

s = df1['numbers'].value_counts(bins=[0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1])
df2['new'] = pd.cut(df2['numbers'], bins=s.index)
print (df2)
    numbers nonshared_column            new
0      0.10              cat  (-0.001, 0.1]
1      0.11              dog     (0.1, 0.2]
2      0.20              cat     (0.1, 0.2]
3      0.30              dog     (0.2, 0.3]
4      0.33             fish     (0.3, 0.4]
5      0.60              cat     (0.5, 0.6]
6      0.66              dog     (0.6, 0.7]
7      0.70              dog     (0.6, 0.7]
8      0.90             fish     (0.8, 0.9]
9      0.10              cat  (-0.001, 0.1]
10     0.11              dog     (0.1, 0.2]
11     0.20              cat     (0.1, 0.2]
12     0.30              dog     (0.2, 0.3]
13     0.33             fish     (0.3, 0.4]
14     0.60              cat     (0.5, 0.6]
15     0.66              dog     (0.6, 0.7]
16     0.70              dog     (0.6, 0.7]
17     0.90             fish     (0.8, 0.9]
18     0.10              cat  (-0.001, 0.1]
19     0.11              dog     (0.1, 0.2]
20     0.20              cat     (0.1, 0.2]
21     0.30              dog     (0.2, 0.3]
22     0.33             fish     (0.3, 0.4]
23     0.60              cat     (0.5, 0.6]
24     0.66              dog     (0.6, 0.7]
25     0.70              dog     (0.6, 0.7]
26     0.98             fish     (0.9, 1.0]
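
To see what passing an index of intervals to pd.cut does in isolation, here is a minimal sketch. The Series named other and its values are hypothetical, and sorting the intervals is my own addition to get an ascending category order; it should not be required.

```python
import pandas as pd

df1_numbers = pd.Series([0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1])
edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]
s = df1_numbers.value_counts(bins=edges)

# s.index holds one interval per bin (ordered by count); sort them ascending
bins = pd.IntervalIndex(s.index).sort_values()

# Hypothetical values: two inside the bins, one outside all of them
other = pd.Series([0.05, 0.98, 1.5])
labels = pd.cut(other, bins=bins)
print(labels)
# 1.5 lies outside every interval, so its label is NaN
```

Each value is labelled with the interval that contains it, so df2 ends up with exactly the same bin labels that index s.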

Finally, count by all 3 columns (if necessary):

df3 = df2.groupby(['numbers','nonshared_column','new'], observed=True).size().reset_index(name='count')
print (df3)
   numbers nonshared_column            new  count
0     0.10              cat  (-0.001, 0.1]      3
1     0.11              dog     (0.1, 0.2]      3
2     0.20              cat     (0.1, 0.2]      3
3     0.30              dog     (0.2, 0.3]      3
4     0.33             fish     (0.3, 0.4]      3
5     0.60              cat     (0.5, 0.6]      3
6     0.66              dog     (0.6, 0.7]      3
7     0.70              dog     (0.6, 0.7]      3
8     0.90             fish     (0.8, 0.9]      2
9     0.98             fish     (0.9, 1.0]      1
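
Before matching counts, it may help to check whether df2 has enough rows in every bin. A rough sketch, assuming the same data as above; the column names wanted, available and enough are my own choices:

```python
import pandas as pd

edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]
df1_numbers = pd.Series([0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1])
base = [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9]
df2_numbers = pd.Series(base * 2 + base[:-1] + [0.98])  # same 27 values as df2

# Same edges and sort=False, so both results list the bins in the same order
wanted = df1_numbers.value_counts(bins=edges, sort=False)
available = df2_numbers.value_counts(bins=edges, sort=False)

comparison = pd.DataFrame({'wanted': wanted, 'available': available})
comparison['enough'] = comparison['available'] >= comparison['wanted']
print(comparison)
```

If any row had enough == False, no subset of df2 could reproduce the df1 counts for that bin.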

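EDIT:

If you need the same counts as in s, first shuffle the rows with sample, then keep only as many rows per bin as s specifies, using head() with the count looked up in s: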

df2 = df2.sample(frac=1).groupby('new', group_keys=False).apply(lambda x: x.head(s[x.name])).sort_index()
print (df2)
    numbers nonshared_column            new
2      0.20              cat     (0.1, 0.2]
3      0.30              dog     (0.2, 0.3]
4      0.33             fish     (0.3, 0.4]
9      0.10              cat  (-0.001, 0.1]
11     0.20              cat     (0.1, 0.2]
14     0.60              cat     (0.5, 0.6]
15     0.66              dog     (0.6, 0.7]
17     0.90             fish     (0.8, 0.9]
24     0.66              dog     (0.6, 0.7]
26     0.98             fish     (0.9, 1.0]
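
Because sample is random, the rows you get will differ on each run. Below is a reproducible sketch of the same idea that swaps groupby.apply for cumcount; this is an alternative, not the code above. It assumes integer bin codes instead of interval labels, and the names bin, quota, rank and subset as well as random_state=0 are my own choices.

```python
import pandas as pd

edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]
df1 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1]})
base = [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9]
pets = ['cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish']
df2 = pd.DataFrame({'numbers': base * 2 + base[:-1] + [0.98],
                    'nonshared_column': pets * 3})

# labels=False returns the integer position of the bin instead of an interval
codes1 = pd.cut(df1['numbers'], bins=edges, include_lowest=True, labels=False)
codes2 = pd.cut(df2['numbers'], bins=edges, include_lowest=True, labels=False)

quota = codes1.value_counts()  # bin code -> number of rows wanted

# Shuffle reproducibly, then number the rows within each bin: 0, 1, 2, ...
shuffled = df2.assign(bin=codes2).sample(frac=1, random_state=0)
rank = shuffled.groupby('bin').cumcount()

# Keep a row only while its bin still has quota left; bins absent from df1 get 0
keep = rank < shuffled['bin'].map(quota).fillna(0)
subset = shuffled[keep].sort_index()
print(subset)
```

If df2 has fewer rows in a bin than df1 asks for, this simply returns all the rows that are available for that bin.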
