Pandas: sampling a different fraction from each group

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0],
                   })
grouped = df.groupby('b')

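Now I want to sample from each group: for example, 30% of the b = 1 group and 20% of the b = 0 group. How can I do that? And if I want 150% of some group, is that possible?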

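You can dynamically return a random sample DataFrame in which each group has its own sample percentage. This works for percentages below 100% (see the first example, Out[1]) and, by passing replace=True, for percentages above 100% (see the second example, Out[2]).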

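Use np.select to create a new column c that holds, for each group, the fraction of rows to sample at random (20%, 40%, and so on, as you choose).
From there, you can sample x rows per group according to those percentages. Take the .index of the sampled rows and use .loc to select those rows from the original DataFrame along with the columns 'a' and 'b'. The expression grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) produces a Series containing the rows you are looking for, but it needs some cleanup. That is why I find it easier to take its .index and filter the original DataFrame with .loc than to clean up the messy sampled Series.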

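import numpy as np

# Fraction to sample per group: 40% of the b == 0 rows, 20% of the b == 1 rows.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
grouped = df.groupby('b', group_keys=False)
# Sample within each group, then use the sampled index to select rows from df.
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]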
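Out[1]: 
   a  b
6  7  0
8  9  0
3  4  1

Note that the row labels shown here (6, 8 and 3) do not match the 7-row DataFrame in the question, which has no label 8 and has b = 0 at label 3. This output was evidently produced on a different, larger DataFrame, and the sample is random, so your result will differ.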
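As an alternative to the helper column, the same per-group sampling can be sketched with a plain dictionary of fractions. This is only an illustration on the question's 7-row DataFrame: the names fractions, parts and sampled are made up here, and random_state=0 is an assumption added so the result is repeatable.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Hypothetical mapping: value of 'b' -> fraction of that group to sample.
fractions = {1: 0.3, 0: 0.2}

# Sample each group on its own, then concatenate the pieces.
parts = [
    group.sample(frac=fractions[key], random_state=0)
    for key, group in df.groupby('b')
]
sampled = pd.concat(parts)
print(sampled)
```

The result keeps the original row labels and both columns, so no index cleanup is needed afterwards.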

If you want a larger random sample that repeats existing rows, simply pass replace=True. Then do some cleanup to get the output.

grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
# Number of rows to draw per group: 120% of the b == 0 rows, 200% of the b == 1 rows.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
# Older pandas versions do not accept frac > 1 in sample(), so we calculate
# the integer number of rows to be sampled and pass it as n instead.
# Note: the column renamed to 'a' below holds the original row labels
# of the sampled rows, not the values of column 'a'.
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
        .reset_index()
        .rename({'index' : 'a'}, axis=1))
Out[2]: 
     a  b
0    7  0
1    8  0
2    9  0
3    7  0
4    7  0
5    8  0
6    1  1
7    3  1
8    3  1
9    1  1
10   0  1
11   0  1
12   4  1
13   2  1
14   3  1
15   0  1
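For the 150% case from the question, a minimal sketch along the same lines computes the row count per group and samples with replacement only when the fraction exceeds 1. The helper sample_group, the fractions mapping and the fixed seed are hypothetical names and choices for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Hypothetical mapping: 30% of the b == 1 group, 150% of the b == 0 group.
fractions = {1: 0.3, 0: 1.5}

def sample_group(group, frac, seed=0):
    """Draw round(frac * len(group)) rows; allow repeats only if frac > 1."""
    n = int(round(frac * len(group)))
    return group.sample(n=n, replace=frac > 1, random_state=seed)

oversampled = pd.concat(
    [sample_group(group, fractions[key]) for key, group in df.groupby('b')]
)
print(oversampled)
```

Because the b == 0 group has only 4 rows, drawing 6 of them necessarily repeats some row labels; call reset_index(drop=True) on the result if unique labels are needed.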

You can get a DataFrame out of the GroupBy object with, for example, grouped.get_group(0). If you want to sample from it, use the .sample method. For instance, grouped.get_group(0).sample(frac=0.2) gives:

a
5  6

For the example you give, both samples will contain only one element, because the groups have 4 and 3 elements, and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
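The rounding can be checked directly. The following sketch, again on the question's DataFrame, computes the expected sample size per group and compares it with what .sample actually returns; the names fractions, sizes and expected are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

fractions = {1: 0.3, 0: 0.2}      # fractions from the question
sizes = df.groupby('b').size()    # number of rows in each group

# Expected sample size per group: frac * group size, rounded.
expected = {key: int(round(frac * int(sizes.loc[key])))
            for key, frac in fractions.items()}
print(expected)  # {1: 1, 0: 1}

# Compare with the number of rows .sample actually returns.
grouped = df.groupby('b')
actual = {key: len(grouped.get_group(key).sample(frac=frac))
          for key, frac in fractions.items()}
print(actual)
```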
