I was previously using cross_validation.train_test_split to split my dataset in a 90:10 ratio. I am now switching to StratifiedShuffleSplit (which merges KFold and ShuffleSplit in scikit-learn). I would like to know whether it is better to do the stratified split with an explicitly specified test size, or without specifying one.
This is what I am doing:
import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# Read the documents, one per line; each document becomes a list of tokens.
train = []
with open("/Users/minks/Documents/documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
train = np.array(train)

# Read the labels.
labels = []
with open("/Users/minks/Documents/Labels.txt") as t:
    for line in t:
        labels.extend(line.strip().split())
labels = np.array(labels)

# 5 stratified shuffle splits, holding out 10% of the data for testing each time.
kf = StratifiedShuffleSplit(labels, n_iter=5, test_size=0.10)
for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
I would like to know whether specifying test_size is a good decision performance-wise, because if I don't specify it, it seems to pick a random ratio.
If you don't specify a test size of your own, it will default to 0.1; it will not use a random ratio. You can find the default value in the documentation (the function's docstring):
In IPython, run
In [1]: from sklearn.cross_validation import StratifiedShuffleSplit
In [2]: StratifiedShuffleSplit?
and you will see:
[...]
Parameters
----------
n : int
Total number of elements in the dataset.
n_iter : int (default 10)
Number of re-shuffling & splitting iterations.
test_size : float (default 0.1), int, or None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the test split. If
int, represents the absolute number of test samples. If None,
the value is automatically set to the complement of the train size.
[...]
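To see this concretely, here is a minimal sketch that checks the split sizes when test_size is left at its default. It assumes the same (now deprecated) sklearn.cross_validation API used in your code; the labels array is synthetic and only for illustration:

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# Synthetic labels: 100 samples with a 90/10 class ratio (illustration only).
labels = np.array(["a"] * 90 + ["b"] * 10)

# No test_size given, so it falls back to the documented default of 0.1.
sss = StratifiedShuffleSplit(labels, n_iter=3)
for train_index, test_index in sss:
    # Each iteration holds out 10% of the samples, with class proportions preserved.
    print(len(train_index), len(test_index))  # expect 90 and 10 every time

So the default already gives you the same 90:10 split you had with train_test_split; specifying test_size=0.10 explicitly just makes that choice visible in your code.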