I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = df.loc[:, [0, 1, 2, 3]].to_numpy()  # plain ndarray; recent scikit-learn versions reject np.matrix
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I tried this:
df['y_hats'] = y_hats
But this raises:
ValueError: Length of values does not match length of index
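I know I could split df into train_df and test_df and the problem would go away, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting it into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and seemingly all information about which rows ended up in X_test and y_test is lost? Or am I relegated to splitting the dataframe into train and test first, and then building the feature matrices? I would like to simply fill the rows that went into train with np.nan values in the dataframe.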
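Your y_hats has only the length of the test data (20%) because you predicted on X_test. Once you have validated your model and are happy with the test predictions (by comparing the model's predictions on X_test against the true values in y_test), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom: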
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
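If you would rather keep predictions only for the held-out rows and leave NaN everywhere else, while still building X and y as plain arrays, one option is to pass the row labels through train_test_split as a third array. The following is a minimal sketch under that assumption, not the only way to do it; the names idx_train and idx_test and the random_state values are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split splits every array it is given in the same way,
# so the row labels travel alongside the features and the targets.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Start with NaN in every row, then fill in only the test rows by label.
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

The rows used for training keep np.nan in y_hats, so the column only ever holds out-of-sample predictions.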
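EDIT: Based on your comment, here is an updated version that returns the original dataset with the predictions attached to the rows that were in the test set: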
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I had (almost) the same problem.
I fixed it this way:
...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
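You can create a y_hats DataFrame that copies the index from X_test and then merge it with the original data. This requires X_test to be a DataFrame (or Series) so that it still carries the original index.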
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
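To make the difference between the two joins concrete, here is a small sketch that runs both on the iris data. It assumes the split is done on DataFrame/Series objects so that X_test keeps its index; the variable names left and inner and the random_state values are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
target = pd.Series(data.target, name='class')

# Splitting pandas objects preserves the original row labels.
X_train, X_test, y_train, y_test = train_test_split(
    df, target, train_size=0.8, random_state=0
)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predictions come back as a bare array, so reattach the test index.
y_hats_df = pd.DataFrame({'y_hats': model.predict(X_test)}, index=X_test.index)

# how='left': every row of df, NaN where no prediction exists (train rows).
left = pd.merge(df, y_hats_df, how='left', left_index=True, right_index=True)

# Default how='inner': only the rows present in both frames (test rows).
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)
```

Here left has as many rows as df, while inner has as many rows as X_test.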
Try the following:
y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2
compare_df = pd.DataFrame(y_val)
Then simply create a new column holding the predicted data.
compare_df['predicted_res'] = y_pred_val
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
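The three steps above can be sketched end to end as follows. The labels and predictions are made-up values for illustration, and the sketch assumes y_val is a Series named 'y_val'; the column lookup compare_df['y_val'] only works when the Series carries that name.

```python
import pandas as pd

# Hypothetical held-out labels that kept their original (shuffled) index,
# plus hypothetical model output in the same row order.
y_val = pd.Series([0, 1, 1, 2], index=[7, 3, 9, 1], name='y_val')
y_pred_val = [0, 1, 2, 2]

# The resulting column is named after the Series: 'y_val'.
compare_df = pd.DataFrame(y_val)

# A plain list is assigned by position, so row order must match y_val.
compare_df['predicted_res'] = y_pred_val

# Keep only the rows where the prediction equals the true label.
matches = compare_df[compare_df['y_val'] == compare_df['predicted_res']]
accuracy = len(matches) / len(compare_df)
```

Because compare_df keeps the index of y_val, the matching rows can be traced back to the original dataframe by label.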
You could make a new dataframe and add the predicted values to the test data:
data['y_hats'] = y_hats
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)
This worked well for me. It preserves the index positions.
pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
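One caveat with the snippet above: pd.concat with axis=1 aligns rows by index label, and a DataFrame built from a bare array gets a fresh 0..n-1 index. If my_old_df still carries its original (shuffled) labels, the predictions will not line up. The sketch below uses made-up values to show both outcomes; the misaligned variable is only there for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical test rows that kept their original, shuffled index labels.
my_old_df = pd.DataFrame({'feature': [0.2, 0.9, 0.6]}, index=[12, 4, 7])
pred_prob = np.array([0.2, 0.9, 0.6])  # stand-in for model scores
pred_class = np.where(pred_prob > 0.5, "Yes", "No")

# Reusing the index keeps each prediction on the row it was made for.
predictions = pd.DataFrame(pred_class, columns=['Prediction'],
                           index=my_old_df.index)
my_new_df = pd.concat([my_old_df, predictions], axis=1)

# Without index=..., the labels 0, 1, 2 do not match 12, 4, 7, so concat
# returns the union of both indexes with NaN filling the gaps.
misaligned = pd.concat(
    [my_old_df, pd.DataFrame(pred_class, columns=['Prediction'])], axis=1
)
```

If my_old_df already has a default 0..n-1 index in the same row order as X_test, the original snippet works as written.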
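Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe that contains the observed and predicted values for that fold's test set; this way you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the dataframes you generated (from the test data of 5 folds in my case) and paste the result into your original dataset.
I hope this helps!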
FoldNr = 0
for train_index, test_index in skf.split(X, y):
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # Save predicted values for each test set
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name = 'y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)
# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
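The same idea can be written without creating variables through globals(), by collecting each fold's frame in a list. This is a sketch of that variant rather than the code I actually ran: it uses the iris data with made-up subject IDs as the index, a DecisionTreeClassifier as a placeholder model, and the names fold_frames and obs_pred are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Hypothetical subject IDs used as the index, mirroring 'SubjID' above.
ids = pd.Index([f'S{i:03d}' for i in range(len(iris.target))], name='SubjID')
X = pd.DataFrame(iris.data, index=ids, columns=iris.feature_names)
y = pd.Series(iris.target, index=ids, name='y_true')
original_df = X.copy()

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_frames = []
for fold_nr, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Observed and predicted values for this fold, keyed by subject ID.
    fold_frames.append(pd.DataFrame(
        {'y_true': y_test, 'y_pred': clf.predict(X_test), 'fold': fold_nr},
        index=y_test.index,
    ))

# Each subject appears in exactly one test fold, so the index is unique
# and the assignment below aligns on SubjID.
obs_pred = pd.concat(fold_frames)
original_df['y_pred'] = obs_pred['y_pred']
```

Since every row lands in exactly one test fold, original_df ends up with an out-of-fold prediction for every subject.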
You can also use:
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']