Adding RandomForestClassifier predict_proba results to the original dataframe



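I'm new to ML and this is my first 'real' algorithm. Sorry if this is a duplicate, but I couldn't find an answer.

I have the following dataframe (df):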

index    Feature1  Feature2  Feature3  Target
001       01         01        03        0
002       03         03        01        1
003       03         02        02        1

My code looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.8)
clf = RandomForestClassifier().fit(X_train, y_train)
prediction_of_probability = clf.predict_proba(X_test)

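What I'm struggling with is how to get 'prediction_of_probability' back into the dataframe df.

I understand that the predictions will not cover every row of the original dataframe.

Thanks in advance for helping a newbie like me!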

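What you have done is train a model. That is, given features and labels, you fit the model so that it can make predictions on future data. To check the quality of the model (for example, when selecting features), the model is evaluated on X_test and y_test. In this case you have no future data, so you are not applying the model yet, only training and evaluating it. You can assess the model's quality with the ROC curve or the AUC.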
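
As a rough illustration of the evaluation step mentioned above, here is a minimal sketch that computes the ROC curve and the AUC on a held-out split. The data comes from make_classification and is only a synthetic stand-in for the real features and target (an assumption, not the asker's data); the random_state values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (an assumption).
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Probability of the positive class: the column that matches clf.classes_[1].
prob_1 = clf.predict_proba(X_test)[:, 1]

# AUC summarises the ROC curve in one number between 0 and 1.
auc = roc_auc_score(y_test, prob_1)
fpr, tpr, thresholds = roc_curve(y_test, prob_1)
print(f"ROC AUC on the held-out set: {auc:.3f}")
```

The score is computed only on rows the model did not see during fitting, which is why the split happens before the call to fit.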

In any case, you can attach the results to a dataframe like this:

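# X_test is already a DataFrame; copy it so the new columns do not touch a slice of df.
df_test = X_test.copy()
df_test['Target'] = y_test
# The columns of predict_proba follow the order of clf.classes_.
df_test['prob_0'] = prediction_of_probability[:, 0]
df_test['prob_1'] = prediction_of_probability[:, 1]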
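
If hard-coding prob_0 and prob_1 feels fragile, one possible variation is to build the column names from clf.classes_ and let pandas align everything on the index. This is only a sketch: the eight-row dataframe below is made up, reusing the column names from the question, and the random_state values are arbitrary.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data with the same column names as the question (an assumption).
df = pd.DataFrame({
    "Feature1": [1, 3, 3, 2, 1, 2, 3, 1],
    "Feature2": [1, 3, 2, 2, 3, 1, 1, 2],
    "Feature3": [3, 1, 2, 1, 2, 3, 3, 1],
    "Target":   [0, 1, 1, 0, 1, 0, 1, 0],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["Feature1", "Feature2", "Feature3"]], df["Target"],
    test_size=0.5, random_state=0, stratify=df["Target"])

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# One column per class, named from clf.classes_ and indexed like X_test.
proba = pd.DataFrame(clf.predict_proba(X_test),
                     columns=[f"prob_{c}" for c in clf.classes_],
                     index=X_test.index)

# join aligns on the index, so each row gets its own probabilities.
df_test = X_test.join(y_test).join(proba)
print(df_test)
```

Because the names come from clf.classes_, the same code keeps working if the target has more than two classes.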

You can keep track of the train and test indices and then put everything back together:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
indices = df.index.values 
# Split the index values instead of the labels so each split remembers which rows it holds.
X_train, X_test, indices_train, indices_test = train_test_split(data, indices, test_size=0.33, random_state=42)
# .loc selects by index label, which is what `indices` contains.
y_train, y_test = labels.loc[indices_train], labels.loc[indices_test]

clf = RandomForestClassifier().fit(X_train, y_train)
prediction_of_probability = clf.predict_proba(X_test)

Then you can put the probabilities into a new dataframe, df_new:

>>> df_new = df.copy()
>>> # predict_proba returns one column per class (in clf.classes_ order); store a single column, here the last class
>>> df_new.loc[indices_test, 'pred_test'] = prediction_of_probability[:, -1]
>>> print(df_new)
   Feature1  Feature2  Feature3  Target  pred_test
1         3         3         1       1        NaN
2         3         2         2       1        NaN
0         1         1         3       0        1.0

Or even the predictions for the training set:

>>> df_new.loc[indices_train, 'pred_train'] = clf.predict_proba(X_train)[:, -1]
>>> print(df_new)
   Feature1  Feature2  Feature3  Target  pred_test  pred_train
1         3         3         1       1        NaN         1.0
2         3         2         2       1        NaN         1.0
0         1         1         3       0        1.0         NaN

Or, if you want to combine the train and test probabilities, just use the same column name for both (e.g. pred).
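
A minimal sketch of that single-column variant follows. The dataframe is made up (same column names as the question), the is_test flag is an extra added here for illustration, and the positive-class column [:, 1] is chosen on the assumption that the target is binary and both classes appear in the training split.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data with the same column names as the question (an assumption).
df = pd.DataFrame({
    "Feature1": [1, 3, 3, 2, 1, 2, 3, 1],
    "Feature2": [1, 3, 2, 2, 3, 1, 1, 2],
    "Feature3": [3, 1, 2, 1, 2, 3, 3, 1],
    "Target":   [0, 1, 1, 0, 1, 0, 1, 0],
})
data = df[["Feature1", "Feature2", "Feature3"]]
labels = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.5, random_state=0, stratify=labels)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

df_new = df.copy()
# X_train and X_test keep the original index labels, so .loc can align on them.
df_new.loc[X_test.index, "pred"] = clf.predict_proba(X_test)[:, 1]
df_new.loc[X_train.index, "pred"] = clf.predict_proba(X_train)[:, 1]
# Hypothetical helper column: marks held-out rows, since probabilities
# for rows the model was fitted on tend to look better than they should.
df_new["is_test"] = df_new.index.isin(X_test.index)
print(df_new)
```

Here the split keeps the dataframe's own index, so a separate indices array is not needed; X_test.index and X_train.index play that role.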

You need something like this:

import pandas as pd

# Create a new dataframe to store the test data.
df1 = X_test.copy()
df1['Target'] = y_test
# Probability of the first class in clf.classes_.
df1['prob'] = prediction_of_probability[:, 0]
# Create another dataframe to store the train data.
df2 = X_train.copy()
df2['Target'] = y_train
# Concatenate both dataframes (DataFrame.append was removed in pandas 2.0).
df = pd.concat([df1, df2]).sort_index()
