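I use targets as a pipelining tool for an ML project with H2O. The main peculiarity of using H2O here is that it creates a new "cluster" (basically a new local process/server which, as far as I understand, communicates via a REST API).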

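My problem is twofold:

1. How can I stop/operate the cluster within the targets framework in a smart way?
2. How can I save and load the data/models within the targets framework?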

MWE

A minimal working example I came up with looks like this (this is the _targets.R file):

library(targets)
library(h2o)

# start the H2O cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to h2o dataframe
  as.h2o(iris)
}

train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)

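This works somewhat, but it runs into the two problems outlined above and detailed below.

Ad 1 - Stopping the cluster

Usually I would put an h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but that does not work in this case. Alternatively, I came up with a new target that is always run.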

# in _targets.R in the final list
tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))

This works when I run tar_make(), but not when I use tar_visnetwork().

Another option is to use:

# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)

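Another alternative I came up with is to handle the server outside of targets and only connect to it. But I feel that this might break the targets workflow.

Do you have any other ideas on how to handle this?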

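Ad 2 - Saving the dataset and the model

The code in the MWE does not save the data for the targets model and predict in the correct format (format = "qs"). Sometimes (I think when the cluster gets restarted) the data gets "invalidated" and h2o throws an error. Data in h2o format in the R session is a pointer to the h2o data frame (see also the docs).

For keras, which similarly stores its models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. Similarly, H2O would require h2o::h2o.exportFile() and h2o::h2o.importFile() for the dataset, and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also the docs).

Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file? The downside is that this file is outside of the _targets folder system, if I am not mistaken.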

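Ad 1

I recommend handling the H2O cluster outside the pipeline, in a separate script. That way, tar_visnetwork() won't start or stop the cluster, and you can more cleanly separate the software engineering from the data analysis.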

# run_pipeline.R
start_h2o_cluster(port = ...)
on.exit(stop_h2o_cluster(port = ...))
targets::tar_make_clustermq(workers = 4)
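
The helpers in the script above are placeholders. A minimal sketch of what they might look like follows. It assumes a local cluster on port 54322 (the port from the question) and wraps the run in a function so that on.exit() has a function frame to attach to. The names start_h2o_cluster(), stop_h2o_cluster(), and run_pipeline() are hypothetical, and plain tar_make() is swapped in for tar_make_clustermq() to keep the sketch small.

```r
# Sketch only: hypothetical helpers, not part of targets or h2o.
# Assumes the cluster runs on localhost and the default port below is free.

start_h2o_cluster <- function(port = 54322) {
  # Starts a local H2O server if none is reachable on this port.
  h2o::h2o.init(nthreads = 2, max_mem_size = "2G", port = port)
}

stop_h2o_cluster <- function(port = 54322) {
  # Connect without starting a new server, then shut the cluster down.
  h2o::h2o.init(ip = "localhost", port = port, startH2O = FALSE)
  h2o::h2o.shutdown(prompt = FALSE)
}

run_pipeline <- function(port = 54322) {
  start_h2o_cluster(port)
  # Registered inside a function, so it fires even if tar_make() errors.
  on.exit(stop_h2o_cluster(port), add = TRUE)
  targets::tar_make()
}

# Usage (from the project root): run_pipeline()
```

With this layout, _targets.R only connects to the cluster (startH2O = FALSE), so inspecting the pipeline with tar_visnetwork() neither starts nor stops anything.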

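Ad 2

It sounds like H2O objects are not exportable. Currently, you need to save those files manually, identify the paths, and write format = "file" in tar_target(). I am willing to consider H2O-based formats. Do h2o.exportFile(), h2o.importFile(), h2o::h2o.saveModel(), and h2o::h2o.loadModel() somehow cover all objects, or are there more kinds of objects with different serialization functions? And does h2o have utilities to perform this (un)serialization in memory, like serialize_model()/unserialize_model() in keras?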
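
As a rough illustration of the format = "file" route, here is one way the MWE might be restructured so that each H2O-backed target returns a file path instead of a pointer. This is a sketch under assumptions: the helper names, the "h2o_store" directory, and the CSV file name are all hypothetical, and the H2O server is assumed to run on the same machine so that it can read and write these paths.

```r
# Sketch only: hypothetical helpers; assumes a local cluster on port 54322.
connect_h2o <- function(port = 54322) {
  h2o::h2o.init(ip = "localhost", port = port, startH2O = FALSE)
}

# Export the frame to CSV and return the path so targets can track the file.
create_dataset_file <- function(path = file.path("h2o_store", "iris.csv")) {
  connect_h2o()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  # Absolute path, because the server resolves paths on its own side.
  abs_path <- file.path(normalizePath(dirname(path)), basename(path))
  h2o::h2o.exportFile(h2o::as.h2o(iris), path = abs_path, force = TRUE)
  abs_path
}

# Train on the re-imported frame; h2o.saveModel() returns the saved model path.
train_model_file <- function(data_path, model_dir = "h2o_store") {
  connect_h2o()
  dir.create(model_dir, recursive = TRUE, showWarnings = FALSE)
  hex_data <- h2o::h2o.importFile(path = data_path)
  hex_data$Species <- h2o::as.factor(hex_data$Species)
  model <- h2o::h2o.randomForest(
    x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
    y = "Species",
    training_frame = hex_data,
    model_id = "our.rf",
    seed = 1234
  )
  h2o::h2o.saveModel(model, path = normalizePath(model_dir), force = TRUE)
}

# Predictions are converted to a plain data.frame, which targets can store itself.
predict_model_file <- function(model_path, data_path) {
  connect_h2o()
  model <- h2o::h2o.loadModel(model_path)
  hex_data <- h2o::h2o.importFile(path = data_path)
  as.data.frame(h2o::h2o.predict(model, newdata = hex_data))
}

h2o_targets <- function() {
  list(
    targets::tar_target(data_file, create_dataset_file(), format = "file"),
    targets::tar_target(model_file, train_model_file(data_file), format = "file"),
    targets::tar_target(predictions, predict_model_file(model_file, data_file))
  )
}

# In _targets.R, end the file with: h2o_targets()
```

The exported files do live outside the _targets folder, as the question suspects, but with format = "file" targets tracks them and reruns downstream targets when their contents change.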
