Grouping a pandas DataFrame into random groups

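Question:

How can I create randomly selected groups with the pandas df.groupby() function?

Example:

I want to split a DataFrame into random groups of size n, where n is the number of unique values of a given column in each group.

I have a DataFrame with various columns, including "id". Some rows have a unique id, while others may share the same id. For example: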

   c1 id c2
0   a  1  4
1   b  2  6
2   c  2  2
3   d  5  7
4   y  9  3

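In practice, this DataFrame can have up to about 1000 rows.

I would like to group the DataFrame according to the following criteria:

Each group should contain at most n unique ids
No id should appear in more than one group
The specific ids in a given group should be chosen at random
Every id should appear in exactly one group

For example, the sample DataFrame above could become:

Group 1: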

   c1 id c2
0   a  1  4
4   y  9  3

Group 2:

   c1 id c2
1   b  2  6
2   c  2  2
3   d  5  7

where n = 2.

Thanks for any suggestions.

This seems hard to do with a single groupby statement. One approach:

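import numpy as np
uniq = df['id'].unique()
np.random.shuffle(uniq)  # shuffle the unique ids in place
groups = np.split(uniq, 2)  # 2 equal sections; len(uniq) must be even
dfr = df.set_index(df['id'])  # index by id, keeping the id column
for gp in groups:
    print(dfr.loc[gp])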

This prints, for example:

   c1  id  c2
id           
9   y   9   3
1   a   1   4
   c1  id  c2
id           
5   d   5   7
2   b   2   6
2   c   2   2

If the group size n does not evenly divide len(uniq), use np.split(uniq, range(n, len(uniq), n)) instead; the last group then holds the remaining ids.
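
A minimal sketch of that uneven case. The five ids and the seed below are made up for illustration and are not from the question:

```python
import numpy as np

# Hypothetical ids: five unique values, so n = 2 does not divide evenly.
uniq = np.array([1, 2, 5, 9, 11])
n = 2

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
shuffled = rng.permutation(uniq)

# Split at positions n, 2n, ...; the last chunk holds the remainder.
groups = np.split(shuffled, range(n, len(shuffled), n))

for gp in groups:
    print(gp)
```

Passing a sequence of split points rather than a section count is what lets np.split return chunks of unequal length.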

Here is one way to do it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2
# Shuffle the unique ids, then cut them into chunks of at most n ids.
shuffled_ids = np.random.permutation(df['id'].unique())
id_groups = [shuffled_ids[i:i+n] for i in range(0, len(shuffled_ids), n)]
# One boolean row mask per group of ids.
groups = [df['id'].isin(g) for g in id_groups]

Output:

In [1]: df[groups[0]]
Out[1]:
  c1  c2  id
1  b   6   2
2  c   2   2
3  d   7   5
In [2]: df[groups[1]]
Out[2]:
  c1  c2  id
0  a   4   1
4  y   3   9

This approach does not change the index, in case you need to preserve it.
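
Since the question asks about groupby specifically, here is a sketch of a variant that is not taken from either answer: give each shuffled id a group label, map the labels onto the rows, and pass them to df.groupby. The names label_of and group_labels are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2

# Shuffle the unique ids, then label them 0, 0, 1, 1, ... in chunks of n.
shuffled_ids = np.random.permutation(df['id'].unique())
label_of = pd.Series(np.arange(len(shuffled_ids)) // n, index=shuffled_ids)

# Map every row's id to its group label; rows sharing an id share a label.
group_labels = df['id'].map(label_of)

for label, group in df.groupby(group_labels):
    print(f"Group {label + 1}:")
    print(group)
```

Like the mask-based version, this leaves the original index untouched, and the result is a regular groupby object, so aggregations such as .size() or .agg() can be applied to the random groups directly.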
