I have a house price prediction dataset, and I need to split it into train and test sets.
Is it possible to do this using only numpy or scipy?
I cannot use scikit-learn at the moment.
Your question asks for a train_test_split using only numpy or scipy, but there is actually a very simple way to do it with pandas:
import pandas as pd
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
This is for anyone who wants a quick and easy solution.
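If the split needs to be reproducible, one possible variant of the pandas approach above passes a seed to sample and builds the test set from the rows that were not sampled. This is only a minimal sketch: the toy DataFrame, its column names, and the seed value 42 are placeholders rather than part of the original answer, and it assumes the index has no duplicate labels.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical columns and values)
df = pd.DataFrame({"area": range(10), "price": range(100, 110)})

# Sample 70% of the rows; random_state fixes the shuffle so reruns match
train_set = df.sample(frac=0.7, random_state=42)

# Every row that was not sampled goes to the test set (assumes a unique index)
test_set = df.drop(train_set.index)

print(len(train_set), len(test_set))
```

Dropping by index avoids slicing a shuffled copy, at the cost of relying on unique index labels.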
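Although this is an old question, this answer might still help.
This is roughly how sklearn implements train_test_split, and the function given below takes parameters similar to sklearn's.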
import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split in train and test
    :param test_size: size of test set in range (0,1)
    :param shuffle: whether to shuffle arrays or not
    :param random_seed: random seed value
    :return: list of 2*len(arrays) items, a train and a test part for each array
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length
    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test
    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)
    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
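Of course, sklearn's implementation also supports stratified k-fold, splitting pandas Series, and so on. This one only works for splitting lists and numpy arrays, which I think will be enough for your case.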
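For readers who do need stratification, the idea can be approximated with numpy alone by shuffling and cutting each class separately. The sketch below is not sklearn's implementation; the function name stratified_split_indices, its parameters, and the toy labels are made up for illustration. It only makes sense for categorical labels, so continuous house prices would first have to be binned.

```python
import numpy as np

def stratified_split_indices(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with roughly test_size of each class in test."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for value in np.unique(labels):
        # Positions of the rows belonging to this class, in random order
        idx = np.flatnonzero(labels == value)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical labels: 8 rows of class 0 and 4 rows of class 1
labels = np.array([0] * 8 + [1] * 4)
train_idx, test_idx = stratified_split_indices(labels, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))
```

The returned index arrays can then be used to index any numpy array that has the same number of rows as labels.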
This solution uses only pandas and numpy:
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    test_set_size = int(len(data) * test_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_indices = shuffled_indices[valid_set_size:valid_set_size + test_set_size]
    # the training set is everything after the validation and test rows,
    # so the three parts do not overlap
    train_indices = shuffled_indices[valid_set_size + test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]

train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
# out (for the 20640-row dataset implied by the original output): 12384 4128 4128
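The same three-way split can also be expressed on index positions alone by cutting one shuffled permutation at two points with np.split. This is a minimal sketch under assumed names (three_way_indices and the seed parameter are not part of the answer above); the returned arrays can be passed to data.iloc or used to index a numpy array.

```python
import numpy as np

def three_way_indices(n_rows, valid_ratio, test_ratio, seed=0):
    """Return disjoint (train, valid, test) index arrays covering 0..n_rows-1."""
    perm = np.random.default_rng(seed).permutation(n_rows)
    n_valid = int(n_rows * valid_ratio)
    n_test = int(n_rows * test_ratio)
    # Cut the permutation at two points: [valid | test | train]
    valid_idx, test_idx, train_idx = np.split(perm, [n_valid, n_valid + n_test])
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = three_way_indices(100, valid_ratio=0.2, test_ratio=0.2)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because all three pieces come from a single permutation, no row can land in more than one set.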
This code should work (assuming X_data is a pandas DataFrame):
import numpy as np
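num_of_rows = int(len(X_data) * 0.8)  # slice bounds must be integers
values = X_data.to_numpy().copy()  # copy so the DataFrame itself is not shuffled
np.random.shuffle(values)  # shuffles the rows in place to make the order random
train_data = values[:num_of_rows]  # first 80% of the shuffled rows for training
test_data = values[num_of_rows:]  # remaining rows for testing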
Hope this helps!
import numpy as np
import pandas as pd
X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well
# create random train/test split
indices = np.arange(X_data.shape[0])  # an array, because a range cannot be shuffled in place
num_training_indices = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]
# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
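This assumes you want a random split. We create a list of indices as long as the number of data points, i.e. the size of the first axis of X_data (or Y_data). We then put them in random order and take the first 80% of those shuffled indices as training data and the rest for testing. The slice [:num_training_indices] just selects the first num_training_indices entries from the list. After that, you extract the rows from your data using the lists of random indices, and your data is split. Remember to drop the prices from your X_data, and to set a seed (np.random.seed(some_integer) at the beginning) if you want the split to be reproducible.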
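To illustrate the point about seeding, the sketch below builds the shuffled index arrays twice with the same seed and checks that the two splits match. It uses NumPy's Generator API (np.random.default_rng) rather than np.random.seed as in the answer; the helper name split_indices and the seed value 123 are assumptions made for this example.

```python
import numpy as np

def split_indices(n_rows, train_fraction=0.8, seed=None):
    """Shuffle 0..n_rows-1 and cut it into train and test index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(train_fraction * n_rows)
    return shuffled[:n_train], shuffled[n_train:]

# The same seed yields the same split on every call
first_train, first_test = split_indices(10, seed=123)
second_train, second_test = split_indices(10, seed=123)
print(np.array_equal(first_train, second_train))
```

Passing the generator's seed explicitly keeps the split reproducible without changing NumPy's global random state.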
Here is a quick way to perform an 80/20 split using only the random import:
import random
# Define a sample size, here 80% of the observations
sample_size = int(len(x)*0.80)
# Set seed for reproducibility
random.seed(47202182)
# indices are randomly sampled from 0 to the length of the original sample
train_idx = random.sample(range(0, len(x)), sample_size)
# Indices not in the train set must be in the test set
train_idx_set = set(train_idx)  # set lookup keeps this step linear in len(x)
test_idx = [i for i in range(0, len(x)) if i not in train_idx_set]
# apply indices to lists to assign data to corresponding variables
x_train = [x[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]
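The same standard-library idea can be wrapped in a small function that shuffles one list of indices and slices it, so x and y always stay aligned. This is a sketch under assumptions: split_lists, its parameters, and the toy data are hypothetical, and it uses a private random.Random instance instead of seeding the global generator.

```python
import random

def split_lists(x, y, train_fraction=0.8, seed=None):
    """Split two equal-length lists into train and test parts with one shuffle."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(indices)
    n_train = int(len(x) * train_fraction)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data: y is simply twice x, so alignment is easy to check
x = list(range(10))
y = [value * 2 for value in x]
x_train, x_test, y_train, y_test = split_lists(x, y, seed=1)
print(len(x_train), len(x_test))
```

Slicing a shuffled index list avoids the membership test against the training indices altogether.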