Given df1, I know how to get binned value counts using .value_counts():
import pandas as pd

df1 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1],
                    'another_column': ['blue', 'blue', 'blue', 'red', 'green', 'purple', 'blue', 'blue', 'blue', 'orange']})
df1['numbers'].value_counts(bins=[0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1])
Result:
(0.6, 0.7] 2
(0.1, 0.2] 2
(0.9, 1.0] 1
(0.8, 0.9] 1
(0.5, 0.6] 1
(0.3, 0.4] 1
(0.2, 0.3] 1
(-0.001, 0.1] 1
(0.7, 0.8] 0
(0.4, 0.5] 0
Name: numbers, dtype: int64
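For reference, the same counts can be reproduced with pd.cut, which is a handy way to see where the (-0.001, 0.1] label comes from. This is only a sketch: the names edges, binned and counts are mine, and it assumes value_counts(bins=...) behaves like pd.cut with include_lowest=True, which matches the output above.

```python
import pandas as pd

df1 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1]})

# Same bin edges as in the question ('edges' is an illustrative name)
edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]

# include_lowest=True widens the first interval slightly so that 0 itself is covered
binned = pd.cut(df1['numbers'], bins=edges, include_lowest=True)

# Count rows per bin and order by interval instead of by frequency
counts = binned.value_counts().sort_index()
print(counts)
```

Sorting by interval rather than by count makes it easier to compare two frames bin by bin.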
Given another df that is much larger than df1 (example below):
df2 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.98],
'nonshared_column': ['cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish']})
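I want to take the bins from df1 and use them to filter df2, so that the output df is a subset of df2 whose per-bin row counts match those from df1.
So the output df would have 1 row with a 'numbers' value between 0 and 0.1, 2 rows with 'numbers' values between 0.1 and 0.2, and so on, up to 1 row with a 'numbers' value between 0.9 and 1. The output rows should include all columns from df2 (in this example nonshared_column as well as the numbers column).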
Use cut, with bins taken from the index of the Series s:
s = df1['numbers'].value_counts(bins=[0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1])
df2['new'] = pd.cut(df2['numbers'], bins=s.index)
print (df2)
numbers nonshared_column new
0 0.10 cat (-0.001, 0.1]
1 0.11 dog (0.1, 0.2]
2 0.20 cat (0.1, 0.2]
3 0.30 dog (0.2, 0.3]
4 0.33 fish (0.3, 0.4]
5 0.60 cat (0.5, 0.6]
6 0.66 dog (0.6, 0.7]
7 0.70 dog (0.6, 0.7]
8 0.90 fish (0.8, 0.9]
9 0.10 cat (-0.001, 0.1]
10 0.11 dog (0.1, 0.2]
11 0.20 cat (0.1, 0.2]
12 0.30 dog (0.2, 0.3]
13 0.33 fish (0.3, 0.4]
14 0.60 cat (0.5, 0.6]
15 0.66 dog (0.6, 0.7]
16 0.70 dog (0.6, 0.7]
17 0.90 fish (0.8, 0.9]
18 0.10 cat (-0.001, 0.1]
19 0.11 dog (0.1, 0.2]
20 0.20 cat (0.1, 0.2]
21 0.30 dog (0.2, 0.3]
22 0.33 fish (0.3, 0.4]
23 0.60 cat (0.5, 0.6]
24 0.66 dog (0.6, 0.7]
25 0.70 dog (0.6, 0.7]
26 0.98 fish (0.9, 1.0]
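One thing to keep in mind when labelling rows this way: values that fall outside every bin get NaN from pd.cut, so they can be dropped before any counting. A small sketch with made-up values (the Series values and the names labels and in_range are hypothetical, not from the question):

```python
import pandas as pd

edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]

# Hypothetical values: two fall outside the bins, two fall inside
values = pd.Series([-0.5, 0.0, 0.05, 1.2])

# Out-of-range values are labelled NaN; include_lowest=True keeps 0.0 in the first bin
labels = pd.cut(values, bins=edges, include_lowest=True)

# Keep only the values that landed in some bin
in_range = values[labels.notna()]
print(in_range.tolist())
```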
Final counts over all 3 columns (if needed):
df3 = df2.groupby(['numbers','nonshared_column','new'], observed=True).size().reset_index(name='count')
print (df3)
numbers nonshared_column new count
0 0.10 cat (-0.001, 0.1] 3
1 0.11 dog (0.1, 0.2] 3
2 0.20 cat (0.1, 0.2] 3
3 0.30 dog (0.2, 0.3] 3
4 0.33 fish (0.3, 0.4] 3
5 0.60 cat (0.5, 0.6] 3
6 0.66 dog (0.6, 0.7] 3
7 0.70 dog (0.6, 0.7] 3
8 0.90 fish (0.8, 0.9] 2
9 0.98 fish (0.9, 1.0] 1
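Before filtering, it may help to check that df2 actually has enough rows in every bin to satisfy the counts from df1. A rough sketch of such a check; the names needed, available, summary and short are my own, and both frames are binned with the same explicit edges rather than with s.index:

```python
import pandas as pd

edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]
df1_numbers = pd.Series([0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1])
df2_numbers = pd.Series([0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9,
                         0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9,
                         0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.98])

# Rows per bin wanted (from df1) and rows per bin on offer (from df2), in interval order
needed = pd.cut(df1_numbers, bins=edges, include_lowest=True).value_counts().sort_index()
available = pd.cut(df2_numbers, bins=edges, include_lowest=True).value_counts().sort_index()

# Side-by-side table, indexed by the interval label as text
summary = pd.DataFrame({'needed': needed.to_numpy(),
                        'available': available.to_numpy()},
                       index=[str(interval) for interval in needed.index])

# Bins where df2 cannot supply enough rows (empty for this data)
short = summary[summary['available'] < summary['needed']]
print(summary)
print(short.empty)
```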
Edit:
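If you need the same counts as in s, first shuffle the rows with sample, then keep the first s[bin] rows of each bin with head(), looking up the count in s by the group name. Because the shuffle is random, the selected rows differ from run to run: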
df2 = df2.sample(frac=1).groupby('new', group_keys=False).apply(lambda x: x.head(s[x.name])).sort_index()
print (df2)
numbers nonshared_column new
2 0.20 cat (0.1, 0.2]
3 0.30 dog (0.2, 0.3]
4 0.33 fish (0.3, 0.4]
9 0.10 cat (-0.001, 0.1]
11 0.20 cat (0.1, 0.2]
14 0.60 cat (0.5, 0.6]
15 0.66 dog (0.6, 0.7]
17 0.90 fish (0.8, 0.9]
24 0.66 dog (0.6, 0.7]
26 0.98 fish (0.9, 1.0]
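If a repeatable result is wanted, one possible variation is to sample each bin with a fixed seed. This is a sketch under my own assumptions rather than part of the answer above: random_state=0, the column name bin and the names needed, parts and subset are all illustrative, and an explicit loop stands in for groupby().apply().

```python
import pandas as pd

edges = [0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]
df1 = pd.DataFrame({'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9, 1]})
df2 = pd.DataFrame({
    'numbers': [0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9,
                0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.9,
                0.1, 0.11, 0.2, 0.3, 0.33, 0.6, 0.66, 0.7, 0.98],
    'nonshared_column': ['cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish'] * 3,
})

# Target number of rows per bin, keyed by Interval, taken from df1
needed = pd.cut(df1['numbers'], bins=edges, include_lowest=True).value_counts().to_dict()

# Label df2 rows with the same bins ('bin' is a hypothetical column name)
df2['bin'] = pd.cut(df2['numbers'], bins=edges, include_lowest=True)

parts = []
for interval, group in df2.groupby('bin', observed=True):
    # Never ask for more rows than the bin holds
    n = min(needed.get(interval, 0), len(group))
    if n > 0:
        # Fixed seed (an assumption) so the same rows come back on every run
        parts.append(group.sample(n=n, random_state=0))

# Restore the original row order of df2
subset = pd.concat(parts).sort_index()
print(subset)
```

The min() guard means a bin that is short in df2 simply contributes fewer rows instead of raising an error; whether that is the right behaviour depends on the use case.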