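Is it possible to load pre-trained word2vec vectors into Spark?

Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to implement the loading and the operations from scratch?

In the post "Load Word2Vec model in Spark", Tom Lous suggests converting the bin file to txt and starting from there. I have already done that, but what comes next?

In a question I posted yesterday, I got an answer saying that models in Parquet format can be loaded in Spark, so I am posting this question to make sure there is no other option.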

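Disclaimer: I'm pretty new to Spark, but the following at least works for me.

The trick is figuring out how to construct a Word2VecModel from a set of word vectors, and handling some of the gotchas that come with creating the model this way.

First, load your word vectors into a Map. For example, I have saved my word vectors in Parquet format (in a folder called "wordvectors.parquet"), where the "term" column holds the word as a String and the "vector" column holds the vector as an array[float]. I can load it in Java like this: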

// Imports this snippet relies on (Floats comes from Guava):
import java.util.*;
import java.util.stream.Collectors;
import com.google.common.primitives.Floats;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;
import scala.collection.JavaConverters;
import scala.collection.Seq;

// Load the dataset: the "term" column holds the word and the "vector" column
// holds the vector as an array[float].
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");
// Collect the rows to the driver and convert them to a map (column 1 is "vector").
Map<String, List<Float>> vectorMap = Arrays.stream((Row[]) vectorModel.collect())
    .collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));
// Convert to the format the word2vec model expects: float[] rather than List<Float>.
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getKey, entry -> Floats.toArray(entry.getValue())));
// Convert to a Scala immutable map, because that is what word2vec needs.
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

// Helper method; declare it at class level, outside the method holding the statements above.
private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    final List<Tuple2<K, V>> list = pFromMap.entrySet().stream()
        .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
    Seq<Tuple2<K, V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}
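
If the vectors are only available as the txt file mentioned in the question rather than as Parquet, the same kind of Map<String, float[]> could probably be built by parsing the lines directly. The following is a minimal sketch, not something I have run against the GoogleNews file. It assumes the usual word2vec text layout: an optional first line of the form "<vocabSize> <dimension>", then one "<word> <v1> ... <vN>" entry per line, separated by whitespace. The class name TextVectorLoader and the method parse are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark): parses word2vec text-format lines.
final class TextVectorLoader {

    // Assumed layout: optional "<vocabSize> <dim>" header on the first line,
    // then "<word> <v1> ... <vN>" on every following line.
    static Map<String, float[]> parse(Iterable<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        boolean first = true;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            boolean isFirst = first;
            first = false;
            if (parts.length < 2) {
                continue; // blank line or a word without any vector components
            }
            // A first line made of exactly two integers is treated as the header.
            if (isFirst && parts.length == 2
                    && parts[0].matches("\\d+") && parts[1].matches("\\d+")) {
                continue;
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }
}
```

The lines could come from something like java.nio.file.Files.readAllLines, and the resulting map would then go through toScalaImmutableMap in the same way as word2vecMap. Note that the header check would misread a one-dimensional vector for a purely numeric word on the first line, which should not matter for 300-dimensional models.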

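Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and in a slightly odd way. Otherwise it defaults to 100 and you get an error when you try to call .transform(). Here is a way I found that works; I'm not sure everything in it is necessary: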

// The parent estimator is not used for fitting; it only carries the vector size param
// (not sure whether this is needed or whether result.set alone is enough).
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);
// Word2Vec and the outer Word2VecModel come from org.apache.spark.ml.feature; the wrapped model is the mllib one.
Word2VecModel result = new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);
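
Since a mismatch between the configured vector size and the actual vectors is what triggers the error, it may be safer to derive the size from the map than to hard-code 300. The following is a small sketch under that assumption. The class VectorSizeCheck and its method commonVectorSize are hypothetical names, not part of Spark.

```java
import java.util.Map;

// Hypothetical helper (not part of Spark): finds the dimension shared by all vectors.
final class VectorSizeCheck {

    // Returns the common vector length, or throws if the map is empty
    // or if any two vectors differ in length.
    static int commonVectorSize(Map<String, float[]> vectors) {
        if (vectors.isEmpty()) {
            throw new IllegalArgumentException("no vectors supplied");
        }
        int size = -1;
        for (Map.Entry<String, float[]> entry : vectors.entrySet()) {
            int length = entry.getValue().length;
            if (size == -1) {
                size = length; // the first vector fixes the expected dimension
            } else if (length != size) {
                throw new IllegalArgumentException(
                    "vector for '" + entry.getKey() + "' has length " + length
                        + ", expected " + size);
            }
        }
        return size;
    }
}
```

The value returned by VectorSizeCheck.commonVectorSize(word2vecMap) could then presumably be passed to parent.setVectorSize and result.set in place of the literal 300.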

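Now you should be able to use result.transform() just as you would with a self-trained model.

I haven't tested the other Word2VecModel functions to see whether they work correctly; I only tested .transform().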
