Drawing an equal number of samples from each class in stratified sampling



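So I have 1000 samples of class 1 and 2500 samples of class 2. Naturally, when I use sklearn's train_test_split with stratification, I get an imbalanced test set, because the split preserves the class distribution of the original dataset. However, I want exactly 100 samples of class 1 and 100 samples of class 2 in the test set.

How can I do this? Any suggestions would be appreciated.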

Manual split

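A manual solution is not that scary. The main steps are:

  1. Isolate the row indexes of class 1 and class 2.
  2. Use np.random.permutation() to randomly pick n1 and n2 test samples for class 1 and class 2 respectively.
  3. Use df.index.difference() to obtain the training samples as the complement of the test selection.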

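The code can easily be generalized to any number of classes and any number of test samples per class (just put n1/n2, idx1/idx2, etc. into lists and process them in a loop). But that is beyond the scope of the question itself.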

Code

import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)

# sampling number for each class
n1 = 100
n2 = 100

# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1)  # 1000
len2 = len(idx2)  # 2500

# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1]  # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])

# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)

# split
df_train = df.loc[idx_train, :]  # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) = 3300
# len(df_test) = 200

# verify that no row was missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
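
The generalization to any number of classes mentioned earlier could look roughly like the sketch below. The function name split_fixed_per_class and its parameters (label_col, n_test_per_class, seed) are made up for illustration and are not part of sklearn or pandas. It also assumes the DataFrame index is unique, and it uses a NumPy Generator with choice(..., replace=False) instead of np.random.permutation(), which draws without replacement in the same way.

```python
import numpy as np
import pandas as pd


def split_fixed_per_class(df, label_col, n_test_per_class, seed=None):
    """Hypothetical helper: draw a fixed number of test rows per class.

    n_test_per_class maps each class label to the number of test rows wanted.
    All rows that are not drawn become training rows.
    """
    rng = np.random.default_rng(seed)
    test_parts = []
    for label, n_test in n_test_per_class.items():
        # 1. isolate the index values of this class
        idx = df.index[df[label_col] == label].to_numpy()
        if n_test > len(idx):
            raise ValueError(f"class {label!r} has only {len(idx)} rows")
        # 2. draw n_test of them without replacement
        test_parts.append(rng.choice(idx, size=n_test, replace=False))
    idx_test = np.concatenate(test_parts)
    # 3. the training index is everything that was not drawn
    idx_train = df.index.difference(idx_test)
    return df.loc[idx_train], df.loc[idx_test]


# same class sizes as in the question: 1000 rows of class 1, 2500 of class 2
df = pd.DataFrame({"label": [1] * 1000 + [2] * 2500, "value": np.arange(3500)})
df_train, df_test = split_fixed_per_class(df, "label", {1: 100, 2: 100}, seed=0)
print(len(df_train), len(df_test))  # 3300 200
```

Passing a dict keeps the per-class counts next to their labels, so adding a third class only means adding another entry.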
