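So I have 1000 samples of class 1 and 2500 samples of class 2. Naturally, when I use sklearn's train_test_split, I get an imbalanced test set, because it preserves the class distribution of the original dataset. However, I would like the test set to contain 100 samples of class 1 and 100 samples of class 2.
How can I do this? Any suggestions would be appreciated.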
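For illustration, the imbalance described in the question can be reproduced with a short sketch. The exact call is an assumption: it supposes a stratified split with 200 test rows, and the random_state value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data matching the question: 1000 rows of class 1, 2500 rows of class 2
df = pd.DataFrame({"label": np.array([1] * 1000 + [2] * 2500)})

# Assumed call: a stratified split that keeps 200 rows for testing
df_train, df_test = train_test_split(
    df, test_size=200, stratify=df["label"], random_state=0
)

# The test set follows the original 1000:2500 ratio instead of 100:100
test_counts = df_test["label"].value_counts().to_dict()
print(test_counts)
```

With stratification the test set contains far fewer rows of class 1 than of class 2, which is the behaviour the question wants to avoid.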
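Manual split
The manual solution is not that scary. The main steps:
- Isolate the row indexes of class 1 and class 2.
- Use np.random.permutation() to randomly select n1 and n2 test samples for class 1 and class 2 respectively.
- Use df.index.difference() to perform the inverse selection for the train samples.
The code can easily be generalized to an arbitrary number of classes and arbitrary test counts (just put n1/n2, idx1/idx2, etc. into lists and process them in a loop). That, however, is beyond the scope of the question itself.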
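As a rough sketch of that generalization (not part of the original answer), the per-class values can be kept in a dict and processed in a loop. The function name split_fixed_test_counts and its parameters are hypothetical, and the sketch assumes the DataFrame index is unique. It uses rng.choice without replacement in place of slicing a permutation; both draw a random subset.

```python
import numpy as np
import pandas as pd


def split_fixed_test_counts(df, label_col, n_test_per_class, seed=None):
    """Split df so the test set holds a fixed number of rows per class.

    n_test_per_class maps each label to its number of test rows
    (hypothetical helper; labels not listed contribute no test rows).
    """
    rng = np.random.default_rng(seed)
    test_parts = []
    for label, n_test in n_test_per_class.items():
        # index labels of all rows belonging to this class
        mask = (df[label_col] == label).to_numpy()
        idx = df.index[mask].to_numpy()
        if n_test > len(idx):
            raise ValueError(f"class {label!r} has only {len(idx)} rows")
        # draw n_test of them without replacement
        test_parts.append(rng.choice(idx, size=n_test, replace=False))
    idx_test = np.concatenate(test_parts)
    # everything not drawn for the test set goes to the train set
    idx_train = df.index.difference(idx_test)
    return df.loc[idx_train], df.loc[idx_test]


# usage with three classes and a different test count per class
df = pd.DataFrame({"label": [1] * 1000 + [2] * 2500 + [3] * 500})
df_train, df_test = split_fixed_test_counts(
    df, "label", {1: 100, 2: 100, 3: 50}, seed=0
)
```

Passing the counts as a dict keeps each label next to its test size, so adding a class means adding one entry.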
Code
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)
# sampling number for each class
n1 = 100
n2 = 100
# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1) # 1000
len2 = len(idx2) # 2500
# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1] # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])
# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)
# split
df_train = df.loc[idx_train, :] # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) = 3300
# len(df_test) = 200
# verify that no row was missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
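
A shorter alternative, not part of the original answer, may be worth considering when every class should contribute the same number of test rows. It assumes pandas 1.1 or newer, where the groupby object provides sample(), and a unique DataFrame index. The random_state value is arbitrary.

```python
import numpy as np
import pandas as pd

# same toy data as in the manual solution
df = pd.DataFrame({
    "label": np.array([1] * 1000 + [2] * 2500),
    "value": np.hstack([np.random.uniform(0, 1, 1000),
                        np.random.uniform(-1, 0, 2500)]),
})

# draw 100 rows per class for the test set (requires pandas >= 1.1)
df_test = df.groupby("label").sample(n=100, random_state=0)

# the remaining rows form the train set
df_train = df.drop(df_test.index)

print(df_test["label"].value_counts().to_dict())
```

Unlike the manual version, this sketch cannot express different counts per class in a single call; for that case the index-based approach above remains the more flexible one.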