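I am trying to create a pre-trained embedding layer using h2o.word2vec, and I want to extract each word in the model together with its corresponding embedding vector.
Code: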
library(data.table)
library(h2o)
h2o.init(nthreads = -1)
comment <- data.table(comments='ExplanationWhy the edits made under my username Hardcore Metallica
Fan were reverted They werent vandalisms just closure on some GAs after I voted
at New York Dolls FAC And please dont remove the template from the talk page since Im retired now')
comments.hex <- as.h2o(comment, destination_frame = "comments.hex", col.types=c("String"))
words <- h2o.tokenize(comments.hex$comments, "\\W+")
vectors <- 3 # Only 3 dimensions per word vector, to save time & memory
w2v.model <- h2o.word2vec(words
, model_id = "w2v_model"
, vec_size = vectors
, min_word_freq = 1
, window_size = 2
, init_learning_rate = 0.025
, sent_sample_rate = 0
, epochs = 1) # only one epoch, to save time
print(h2o.findSynonyms(w2v.model, "the",2))
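The h2o API lets me get the cosine similarity between two words, but I just want the vector for every word in my vocabulary. How can I get it? I could not find any simple method for this in the API.
Thanks in advance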
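You can use the method w2v_model.transform(words=words)
(the complete set of options is: w2v_model.transform(words =, aggregate_method =)),
where words is an H2O Frame made of a single column containing the source words (note that you can pass a subset of this frame), and aggregate_method specifies how sequences of words are aggregated.
If you do not specify an aggregation method, no aggregation is performed and each input word is mapped to a single word vector. If the method is AVERAGE, the input is treated as sequences of words delimited by NA, and one averaged vector is returned per sequence.
For example: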
av_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
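Since the question is written in R, here is a minimal sketch of what I believe is the R equivalent, assuming a recent h2o release in which h2o.transform() and h2o.toFrame() accept a word2vec model. The two sentences in the corpus are made-up illustration data, not the original comment, and the variable names (txt, word_vecs, avg_vecs, vocab) are my own.

```r
library(h2o)
h2o.init(nthreads = -1)

# Hypothetical toy corpus, used only to keep the example small.
txt <- data.frame(
  comments = c("the quick brown fox jumps over the lazy dog",
               "the dog barks at the quick fox"),
  stringsAsFactors = FALSE
)
comments.hex <- as.h2o(txt)

# h2o.tokenize() needs a string column; it returns one token per row,
# with an NA row marking the end of each input sentence.
words <- h2o.tokenize(as.character(comments.hex$comments), "\\W+")

w2v.model <- h2o.word2vec(words,
                          vec_size = 3,
                          min_word_freq = 1,
                          window_size = 2,
                          sent_sample_rate = 0,
                          epochs = 1)

# No aggregation: one embedding row per row of `words`.
word_vecs <- h2o.transform(w2v.model, words, aggregate_method = "NONE")

# Averaging: one embedding row per NA-delimited sequence of words.
avg_vecs <- h2o.transform(w2v.model, words, aggregate_method = "AVERAGE")

# Whole vocabulary at once: each word alongside its vector components.
vocab <- h2o.toFrame(w2v.model)
print(head(vocab))
```

If you only need the word-to-vector lookup table, h2o.toFrame() is probably the shortest route; as.data.frame(vocab) then pulls it into a regular R data frame. The transform call is more useful when you want vectors lined up row by row with your tokenized text.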