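This may be a silly question, but when I use the H2O predict function in R, is there a way to tell it to retain one or more columns from the scoring data? Specifically, I want to keep my unique ID key.
As it stands, I end up using a very inefficient approach: I assign an index key to the original data set, assign an index key to the scores, and then merge the scores back onto the scoring data set. I would rather say "score this data set and also keep columns x, y, z, ...". Any suggestions?
Inefficient code: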
# Use the H2O predict function to score new data
NL2L_SCore_SetScored.hex <- h2o.predict(object = best_gbm, newdata = NL2L_SCore_Set.hex)
# Convert the scores from an H2OFrame to a data frame
NL2L_SCore_SetScored.df <- as.data.frame(NL2L_SCore_SetScored.hex)
# Add an index to the scores so we can merge the two data sets
NL2L_SCore_SetScored.df$ID <- seq.int(nrow(NL2L_SCore_SetScored.df))
# Convert the original scoring set from an H2OFrame to a data frame
NL2L_SCore_Set.df <- as.data.frame(NL2L_SCore_Set.hex)
# Add an index to the original scoring data so we can merge the two data sets
NL2L_SCore_Set.df$ID <- seq.int(nrow(NL2L_SCore_Set.df))
# Then merge by the newly created ID key so I have the scores on my scoring
# data set. Ideally I wouldn't have to create this key at all and could keep
# the original columns from the data set, which include the customer ID key.
library(dplyr)  # provides inner_join()
Full_Scored_Set <- inner_join(NL2L_SCore_Set.df, NL2L_SCore_SetScored.df, by = "ID")
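There is no need to perform a join. Just bind the ID column to the prediction frame, since the rows of the prediction frame are in the same order as the rows of the scoring data.
R example (ignore the fact that I am predicting on the original training set; this is for demonstration purposes only):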
library(h2o)
h2o.init()
data(iris)
iris$id <- 1:nrow(iris) #add ID column
iris_hf <- as.h2o(iris) #convert iris to an H2OFrame
fit <- h2o.gbm(x = 1:4, y = 5, training_frame = iris_hf)
pred <- h2o.predict(fit, newdata = iris_hf)
pred$id <- iris_hf$id
head(pred)
Now you have a prediction frame that includes the ID column:
  predict    setosa   versicolor    virginica id
1  setosa 0.9989301 0.0005656447 0.0005042210  1
2  setosa 0.9985183 0.0006462680 0.0008354416  2
3  setosa 0.9989298 0.0005663071 0.0005038929  3
4  setosa 0.9989310 0.0005660443 0.0005029535  4
5  setosa 0.9989315 0.0005649384 0.0005035886  5
6  setosa 0.9983457 0.0011517334 0.0005025218  6
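If you want to keep several columns (the "x, y, z" case from the question) rather than just the ID, the same idea should extend to h2o.cbind. The following is a minimal sketch, not a definitive recipe: it assumes the prediction rows come back in the same order as newdata, as described above, and the keep_cols selection and the scored_hf / scored_df names are illustrative only.

```r
library(h2o)
h2o.init()

data(iris)
iris$id <- 1:nrow(iris)      # unique ID column
iris_hf <- as.h2o(iris)      # convert iris to an H2OFrame

# Train on the four measurement columns only, so id is not used as a predictor
fit <- h2o.gbm(x = 1:4, y = 5, training_frame = iris_hf)
pred <- h2o.predict(fit, newdata = iris_hf)

# Columns to carry over from the scoring frame (illustrative choice)
keep_cols <- c("id", "Sepal.Length", "Sepal.Width")

# Column-bind the kept columns with the predictions; rows are matched by position
scored_hf <- h2o.cbind(iris_hf[, keep_cols], pred)

# Convert to a regular data frame once, at the end
scored_df <- as.data.frame(scored_hf)
head(scored_df)
```

This keeps everything inside H2O until the final as.data.frame call, so there is no need to create an artificial index or to join two data frames in R.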