Creating a custom LeaveOneGroupOut cross-validator that allows subsetting



I have a dataset containing two groups. For my cross-validation strategy it is important that the (training or test) folds always contain samples from only one of the groups. Using sklearn.model_selection.LeaveOneGroupOut already goes in the right direction, but it does not allow drawing subsets, which means n_splits can never be higher than the number of groups in the sample. What I am looking for is an extension of sklearn.model_selection.LeaveOneGroupOut that draws subsets from the groups, resulting in more folds with fewer samples in each fold.

Example data:

import numpy as np
X = np.arange(16).reshape(8,2)
y = np.arange(8)
groups = np.array([0,0,0,0,1,1,1,1])

In this example, the minimum n_splits would be 2 (which is the same as using LeaveOneGroupOut), but the maximum n_splits could be 8, meaning that at some point every sample forms a training or test fold on its own. Have I overlooked a cross-validation algorithm in sklearn that implements this splitting strategy? If not, I would be happy to receive some code that can do this.
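For reference, a minimal sketch of what the stock LeaveOneGroupOut produces on the example data above, to illustrate the limitation: it caps at one split per group, i.e. two splits here.

from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
print(logo.get_n_splits(X, y, groups))   # 2 splits: one per group, no subsetting possible

for train_idx, test_idx in logo.split(X, y, groups):
    print('train:', train_idx, 'test:', test_idx)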

P.S.: It would be best if the algorithm allowed both non-random folds (drawing the training and test folds by splitting the dataset into chunks) and random folds (drawing n samples at random from either of the two groups to form the training and test folds).
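For the random variant, here is a rough sketch of what such a draw could look like, using the example data above. The random_single_group_fold helper and its parameters are made up for illustration and are not part of sklearn.

import numpy as np

def random_single_group_fold(groups, train_group, n_samples, seed=None):
    # hypothetical helper: draw n_samples training indices at random (without
    # replacement) from train_group and use the other group as the test fold
    rng = np.random.default_rng(seed)
    train_pool = np.where(groups == train_group)[0]
    train_idx = rng.choice(train_pool, size=n_samples, replace=False)
    test_idx = np.where(groups != train_group)[0]
    return train_idx, test_idx

# e.g. draw 2 random training samples from group 0, test on all of group 1
# (groups as defined in the example data above)
train_idx, test_idx = random_single_group_fold(groups, train_group=0, n_samples=2, seed=42)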

This should do it now:


# create data
import numpy as np

# user settings
sample_size = 100
group_probabilities = [0.6, 0.4]

rng = np.random.RandomState(42)
X = np.arange(sample_size * 2).reshape(sample_size, 2)
y = np.arange(sample_size)
groups = rng.choice(a=[1, 2], size=sample_size, p=group_probabilities)
groups = np.sort(groups)

# user settings
n_splits = 4

def grouped_kfold_subsetted(groups, n_splits):
    # get unique group labels (because they don't always have to be 0 and 1) and
    # count how often each group is present
    unique_groups, group_counts = np.unique(groups, return_counts=True)

    # get the size of the smallest group. This determines how big/small
    # a fold can maximally get
    smallest_group_size = np.min(group_counts)
    fold_size = smallest_group_size // n_splits

    if n_splits > smallest_group_size:
        raise ValueError('Number of folds must not be greater than the number of samples in the smallest group')
    if fold_size == 1:
        raise ValueError('Training folds must contain at least two samples. Choose a smaller n_splits to increase fold size')

    train_and_test_idxs = []
    group_switch = 0
    fold_start = 0

    for split in range(n_splits):

        # decide which of the two groups forms the train fold and which the test fold
        train_group = unique_groups[group_switch]
        test_group = unique_groups[1 - group_switch]

        # get all training and testing indices (we will subset afterwards)
        train_idxs = np.where(groups == train_group)[0]
        test_idxs = np.where(groups == test_group)[0]

        # subset the training idxs to the chosen fold size
        fold_end = fold_start + fold_size
        train_idxs = train_idxs[fold_start:fold_end]

        # Optional: make the test set the same size as the training set
        # test_idxs = test_idxs[fold_start:fold_end]
        train_and_test_idxs.append((train_idxs, test_idxs))

        # in the next cycle the other group will form the train fold
        group_switch = 1 - group_switch

        # advance the fold window every 2nd cycle, because by then each of the
        # two groups has formed a train fold once
        if split % 2 != 0:
            fold_start = fold_end

    return train_and_test_idxs

train_and_test_idxs = grouped_kfold_subsetted(groups, n_splits)
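A quick way to sanity-check and use the folds: since the function returns a plain list of (train, test) index tuples, it can be passed as the cv argument to sklearn utilities that accept an iterable of index splits. The LinearRegression estimator below is only a placeholder for illustration.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# inspect which group ends up in each train/test fold
for fold, (train_idx, test_idx) in enumerate(train_and_test_idxs):
    print(f'fold {fold}: train group {np.unique(groups[train_idx])}, '
          f'test group {np.unique(groups[test_idx])}')

# sklearn's cv parameter accepts an iterable of (train, test) index arrays
scores = cross_val_score(LinearRegression(), X, y, cv=train_and_test_idxs)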
