I am trying to write my own train/test split function using numpy, rather than using sklearn's train_test_split function. I am splitting the data into 70% training and 30% test, using the Boston housing dataset from sklearn.
Here is the shape of the data:
housing_features.shape  # (506, 13): 506 is the sample size, 13 is the number of features
Here is my code:
import numpy as np
from sklearn import datasets

city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test = X[~split]
    y_Test = y[~split]

    print(len(X_Train), len(y_Train), len(X_Test), len(y_Test))
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print("Successful")
except:
    print("Fail")
The printed output I get is:
362 362 144 144
Successful
But I know it was not actually successful, because I get different length numbers every time I run it, whereas using sklearn's train_test_split I always get an X_train of length 354.
# correct output
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print(len(X_train))
# 354
What am I missing?
Because you are using np.random.rand, which gives you uniform random numbers, the fraction of values below the 0.7 threshold only approaches 70% for very large sample sizes. To get an exact split, you can use np.percentile to find the value at the 70th percentile and then compare against that value, just as you did:
def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]

    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test
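A quick sanity check of the percentile-based split (a minimal sketch; stand-in random arrays of the same shape as the Boston data are used here, since the dataset itself is not needed to verify the split sizes):

```python
import numpy as np

def shuffle_split_data(X, y):
    # Rank-based mask: exactly the values below the 70th percentile go to train.
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)
    return X[split], y[split], X[~split], y[~split]

# Stand-in data with the same shape as the Boston set (506 samples, 13 features).
X = np.random.rand(506, 13)
y = np.random.rand(506)

X_train, y_train, X_test, y_test = shuffle_split_data(X, y)
print(len(X_train), len(X_test))  # 354 152 on every run
```

Unlike the raw `< 0.7` comparison, the percentile cutoff is computed from the ranks of the drawn values, so the train set size is fixed regardless of which random numbers come up.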
Edit
Alternatively, you can use np.random.choice to select the desired number of indices. Note that you need replace=False so that no index is drawn twice. For your case:
np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)
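Putting the index-based approach together (a sketch; the use of np.setdiff1d to collect the leftover indices for the test set is my addition, not part of the original answer):

```python
import numpy as np

def shuffle_split_data(X, y, train_frac=0.7):
    n = X.shape[0]
    # Sample exactly int(0.7 * n) distinct row indices for training
    # (replace=False prevents the same row from being picked twice).
    train_idx = np.random.choice(n, int(train_frac * n), replace=False)
    # Every index not chosen for training goes to the test set.
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Stand-in data with the Boston housing shapes.
X = np.random.rand(506, 13)
y = np.random.rand(506)

X_train, y_train, X_test, y_test = shuffle_split_data(X, y)
print(len(X_train), len(X_test))  # 354 152
```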