Extracting the embedding vector of each word from an H2O.Word2Vec object



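I am trying to build a pre-trained embedding layer using h2o.word2vec, and I would like to extract every word in the model together with its corresponding embedding vector.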

Code:

library(data.table)
library(h2o)
h2o.init(nthreads = -1)
comment <- data.table(comments='ExplanationWhy the edits made under my username Hardcore Metallica 
                      Fan were reverted They werent vandalisms just closure on some GAs after I voted 
                      at New York Dolls FAC And please dont remove the template from the talk page since Im retired now')
comments.hex <- as.h2o(comment, destination_frame = "comments.hex", col.types=c("String"))
words <- h2o.tokenize(comments.hex$comments, "\\W+")
vectors <- 3 # Only 3 dimensions per word vector to save time & memory
w2v.model <- h2o.word2vec(words
                          , model_id = "w2v_model"
                          , vec_size = vectors
                          , min_word_freq = 1
                          , window_size = 2
                          , init_learning_rate = 0.025
                          , sent_sample_rate = 0
                          , epochs = 1) # only one epoch to save time
print(h2o.findSynonyms(w2v.model, "the",2))

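The h2o API lets me look up the synonyms of a word by cosine similarity, but I simply want the vector for every word in the vocabulary. How can I get it? I could not find any simple way to do this in the API.

Thanks in advance.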

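You can use the method w2v_model.transform(words=words). This is the Python API; in the R API used in the question, the corresponding call is h2o.transform(w2v.model, words).

(The full signature is: w2v_model.transform(words=, aggregate_method=).)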

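Here words is an H2O frame made of a single column containing the source words (note that you can pass a subset of this frame), and aggregate_method specifies how to aggregate sequences of words.

If you do not specify an aggregation method, no aggregation is performed and each input word is mapped to a single word vector. If the method is AVERAGE, the input is treated as sequences of words delimited by NA, and one averaged vector is returned per sequence.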

For example:

av_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
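
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the behavior described above. It does not call H2O at all: EMBEDDINGS, transform_none and transform_average are hypothetical names with made-up toy values, None stands in for NA, and the handling of out-of-vocabulary words (an empty row without aggregation, skipped when averaging) is an assumption for illustration, not documented H2O behavior.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]

# Hypothetical toy embeddings standing in for a trained model's vocabulary.
EMBEDDINGS: Dict[str, Vector] = {
    "the": [1.0, 0.0, 2.0],
    "talk": [3.0, 2.0, 0.0],
    "page": [2.0, 4.0, 4.0],
}


def transform_none(words: Sequence[Optional[str]]) -> List[Optional[Vector]]:
    """No aggregation: one output row per input token.

    None (NA) tokens and unknown words produce a None row (assumption).
    """
    return [EMBEDDINGS.get(w) if w is not None else None for w in words]


def _mean(vectors: List[Vector]) -> Optional[Vector]:
    """Component-wise mean of the vectors, or None if there are none."""
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def transform_average(words: Sequence[Optional[str]]) -> List[Optional[Vector]]:
    """Averaging: None (NA) ends a sequence; one output row per sequence.

    Unknown words are skipped when averaging (assumption).
    """
    rows: List[Optional[Vector]] = []
    current: List[Vector] = []
    for w in words:
        if w is None:
            # An NA token closes the current sequence.
            rows.append(_mean(current))
            current = []
        elif w in EMBEDDINGS:
            current.append(EMBEDDINGS[w])
    # A final sequence that is not terminated by NA still counts.
    if words and words[-1] is not None:
        rows.append(_mean(current))
    return rows


tokens = ["the", "talk", None, "page", "unknown", None]
per_word = transform_none(tokens)         # six rows, one per token
per_sequence = transform_average(tokens)  # two rows, one per sequence
```

So to get one vector per vocabulary word, pass the words without an aggregation method; use AVERAGE only when you want one vector per sentence or document.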
