Splitting a DataFrame by label (converting a DataFrame to NumPy arrays)



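I have a DataFrame and I want to split its rows into separate arrays according to their labels. I'm not sure how to filter it by those values, or how to do this correctly:

Example dataset (df):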

Cancer_Type | Variable | Data Split | Target
Cancer1     | 43       | Train      | Good
Cancer5     | 34       | Train      | Bad
Cancer2     | 34       | Test       | Good
Cancer3     | 23       | Test       | Bad
Cancer4     | 25       | Test       | Good

Would something like this work?

#initial split into train/test data
train = df['split'] == 'train'
print("train")
print(train)
test = df['split'] == 'test'
print("valid")
print(test)
X_test = test.values[-1, :-1]
y_test = test.values[-1, -1]
# Get the remaining dataset
X = train.values[:-1, :-1]
y = train.values[:-1, -1]
print("X")
#print(type(X))
#print(X)
print("y")
#print(type(y))
#print(y)
# Split the remaining dataset into train and calibration sets.
X_train, X_cal, y_train, y_cal = train_test_split(X, y)

print(X_train.shape, y_train.shape)
print(X_cal.shape, y_cal.shape)

Hopefully that makes sense.

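As I understand it, you want to split the data into training and test sets based on the value in the Data Split column, and then split the training set again into training and calibration sets. The standard preprocessing approach is to build the features, X, and the target, y: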

from sklearn.model_selection import train_test_split

# Get arrays of train and test features (everything except the Target column).
# Note: X still contains Cancer_Type and Data Split; drop them too if they are not features.
X_train = df[df['Data Split'] == 'Train'].drop(columns=['Target']).to_numpy()
X_test = df[df['Data Split'] == 'Test'].drop(columns=['Target']).to_numpy()
# Get arrays of train and test targets
y_train = df[df['Data Split'] == 'Train']['Target'].to_numpy()
y_test = df[df['Data Split'] == 'Test']['Target'].to_numpy()
# Split the train dataset further into train and validation/calibration sets.
X_train, X_cal, y_train, y_cal = train_test_split(X_train, y_train)

You now have the training, validation/calibration, and test sets as arrays.
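For reference, here is a minimal self-contained sketch of the same idea, run on the five sample rows from the question. It also shows why the attempt in the question does not work: `df['split'] == 'train'` only produces a boolean Series (a mask), and the column is actually named `Data Split` with capitalized values. The choice of `Variable` as the only feature column, and the `test_size=0.5` and `random_state=0` arguments, are assumptions made for illustration rather than anything stated in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample data copied from the question.
df = pd.DataFrame({
    "Cancer_Type": ["Cancer1", "Cancer5", "Cancer2", "Cancer3", "Cancer4"],
    "Variable": [43, 34, 34, 23, 25],
    "Data Split": ["Train", "Train", "Test", "Test", "Test"],
    "Target": ["Good", "Bad", "Good", "Bad", "Good"],
})

# A comparison returns a boolean mask, not the matching rows.
is_train = df["Data Split"] == "Train"
is_test = df["Data Split"] == "Test"

# Assumption: only "Variable" is a model feature.
feature_cols = ["Variable"]

# Use the masks with .loc to pick rows and columns, then convert to arrays.
X_train = df.loc[is_train, feature_cols].to_numpy()
y_train = df.loc[is_train, "Target"].to_numpy()
X_test = df.loc[is_test, feature_cols].to_numpy()
y_test = df.loc[is_test, "Target"].to_numpy()

# Split the training rows again into training and calibration sets.
# test_size and random_state are illustrative choices.
X_train, X_cal, y_train, y_cal = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

print(X_train.shape, X_cal.shape, X_test.shape)
```

With only two training rows in the sample, each of the training and calibration sets ends up with a single row; on real data you would pick `test_size` to suit your calibration needs.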

If you want to keep the Target variable in the arrays, simply use:

train = df[df['Data Split'] == 'Train'].to_numpy()
test = df[df['Data Split'] == 'Test'].to_numpy()
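If there are more labels than just Train and Test, writing one filter per label gets repetitive. One possible alternative, sketched here with `DataFrame.groupby`, builds a dictionary with one array per label. The name `arrays_by_split` is made up for this example, and the sample rows are copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Cancer_Type": ["Cancer1", "Cancer5", "Cancer2", "Cancer3", "Cancer4"],
    "Variable": [43, 34, 34, 23, 25],
    "Data Split": ["Train", "Train", "Test", "Test", "Test"],
    "Target": ["Good", "Bad", "Good", "Bad", "Good"],
})

# Iterating over a groupby yields (label, sub-DataFrame) pairs,
# so each label maps to the array of its own rows (all columns kept).
arrays_by_split = {
    label: group.to_numpy() for label, group in df.groupby("Data Split")
}

train = arrays_by_split["Train"]
test = arrays_by_split["Test"]

print(train.shape, test.shape)
```

Because the columns mix strings and integers, the resulting arrays have `object` dtype; select only the numeric columns before calling `to_numpy()` if you need a numeric array.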
