Split a dataset into two subsets in MATLAB/Octave



Split a dataset into two subsets, e.g. "train" and "test", so that the training set contains 80% of the data and the test set contains the remaining 20%.

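Here, splitting means generating a logical index whose length equals the number of observations in the dataset, where 1 marks a training sample and 0 marks a test sample.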

N = length(data.x)

Output: logical arrays named idxTrain and idxTest.

This should do the trick:

% Generate sample data...
data = rand(32000,1);
% Calculate the number of training entries...
train_off = round(numel(data) * 0.8);
% Split data into training and test vectors...
train = data(1:train_off);
test = data(train_off+1:end);

However, if you really want to rely on logical indexing, you can do the following:

% Generate sample data...
data = rand(32000,1);
data_len = numel(data);
% Calculate the number of training entries...
train_count = round(data_len * 0.8);
% Create the logical indexing...
is_training = [true(train_count,1); false(data_len-train_count,1)];
% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);

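You can also use the randsample function to add some randomness to the extraction, but this will not give you exactly the same number of test and training elements every time you run the script: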

% Generate sample data...
data = rand(32000,1);
% Generate a random true/false indexing with unequally weighted probabilities...
is_training = logical(randsample([0 1],numel(data),true,[0.2 0.8]));
% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);
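
To see why the counts are not exact with weighted sampling, here is a small check. It is only a sketch: it uses plain rand instead of randsample (my substitution, in case randsample is not available in your installation), and the names n_train and n_test are made up for illustration.

```matlab
% Generate sample data...
data = rand(32000,1);
% Assign each element to training independently with probability 0.8
% (assumption: rand-based stand-in for the weighted randsample call)...
is_training = rand(numel(data),1) < 0.8;
% Count what actually landed in each subset...
n_train = sum(is_training);
n_test = sum(~is_training);
% The counts fluctuate around 25600 and 6400 from run to run...
fprintf('train: %d, test: %d\n', n_train, n_test);
```

Because every element is drawn independently, only the expected proportion is 80/20; the realized split differs slightly on each run.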

You can avoid this problem by generating the correct number of test and training indices and then shuffling them with randperm-based indexing:

% Generate sample data...
data = rand(32000,1);
data_len = numel(data);
% Calculate the number of training entries...
train_count = round(data_len * 0.8);
% Create the logical indexing...
is_training = [true(train_count,1); false(data_len-train_count,1)];
% Shuffle the logical indexing...
is_training = is_training(randperm(data_len));
% Split data into training and test vectors...
train = data(is_training);
test = data(~is_training);
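
To match the names used in the question, the same idea can be written roughly as below. This is a sketch that assumes data is a struct with a field x holding one observation per element; the sample size of 100 and the variables nTrain, p, xTrain and xTest are only illustrative.

```matlab
% Hypothetical sample data: a struct with a field x, as in the question...
data.x = rand(100,1);
N = length(data.x);
% Number of training observations (80% of N)...
nTrain = round(N * 0.8);
% Pick nTrain random positions and mark them as training...
p = randperm(N);
idxTrain = false(N,1);
idxTrain(p(1:nTrain)) = true;
% The test mask is the complement of the training mask...
idxTest = ~idxTrain;
% Apply the masks to the observations...
xTrain = data.x(idxTrain);
xTest = data.x(idxTest);
```

idxTrain and idxTest are both logical arrays of length N, and every observation belongs to exactly one of the two subsets.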