Given a dataset with 10,000 observations and 50 features, plus one label, what are the dimensions of X_train, y_train, X_test and y_test, assuming a 75%/25% train/test split? Should it be
X_train: (2500, 50)
y_train: (2500, )
X_test: (7500, 50)
y_test: (7500, )
or
X_train: (7500, 50)
y_train: (7500, )
X_test: (2500, 50)
y_test: (2500, )
You can check this yourself with train_test_split from sklearn:
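import numpy as np
from sklearn.model_selection import train_test_split

n = 10000  # number of observations
p = 50     # number of features
X = np.random.random((n, p))     # random feature matrix
y = np.random.randint(0, 2, n)   # random binary labels
test = 0.25                      # fraction held out for testing

d = {}
d["X_train"], d["X_test"], d["y_train"], d["y_test"] = train_test_split(X, y, test_size=test)

# print the shape of each piece
for split in d:
    print(split, d[split].shape)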
X_train (7500, 50)
X_test (2500, 50)
y_train (7500,)
y_test (2500,)
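The second one.
"assuming a 75%/25% train/test split"
This means that 75% of the dataset is used for training and the rest for testing. You have 10,000 observations, so the training set gets 7,500 and the test set gets 2,500.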
In general, when we say an A/B split of X%/Y%, it means that A gets X% and B gets Y%. Always. Also, X + Y should equal 100.
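To make the arithmetic concrete, here is a minimal sketch in plain Python. The helper name split_sizes is hypothetical, and rounding the test part up is an assumption made for this example, not something stated above; for 10,000 rows and 25% no rounding is needed anyway.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a train/test split.

    Hypothetical helper: the test part is rounded up (an assumption),
    and the training part takes whatever rows remain.
    """
    n_test = math.ceil(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(10_000, 0.25)
print(n_train, n_test)  # 7500 2500
```

With 50 features, those counts give X_train the shape (7500, 50) and X_test the shape (2500, 50).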
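You want 75% of the data as the training set and the remaining 25% as the test set. This usually gives good results. (It also depends on the size of your dataset.)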
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75, random_state=101)
X_train: (7500, 50)
y_train: (7500, )
X_test: (2500, 50)
y_test: (2500, )
Look at the "train_size" and "test_size" parameters of the train/test split method. You can set other values there.
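As a rough illustration of other values, the sketch below passes test_size once as a fraction and once as an absolute row count. The zero-filled arrays are placeholders made up for this example; only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for this sketch): 10,000 rows, 50 features.
X = np.zeros((10000, 50))
y = np.zeros(10000)

# A float is read as a fraction of the dataset: here a 50%/50% split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
half_shapes = (X_train.shape, X_test.shape)
print(half_shapes)  # ((5000, 50), (5000, 50))

# An int is read as an absolute number of test rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
count_shapes = (X_train.shape, X_test.shape)
print(count_shapes)  # ((9000, 50), (1000, 50))
```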
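The random_state parameter seeds the shuffling that is applied to the dataset before it is split, so a fixed value gives the same split on every run.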
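A small sketch of what that means in practice, using a made-up 10-row dataset so the rows are easy to follow: two calls with the same random_state return identical pieces, and with shuffle=False the test rows are simply the last ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for this sketch): 10 rows, labels 0..9.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state, same shuffle, same split.
a = train_test_split(X, y, test_size=0.25, random_state=101)
b = train_test_split(X, y, test_size=0.25, random_state=101)
same = all(np.array_equal(u, v) for u, v in zip(a, b))
print(same)  # True

# Without shuffling, the split just cuts the data in order.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.25, shuffle=False)
print(y_test_ordered)  # [7 8 9]
```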