Word2vec in Spark MLlib



I am trying to run Spark MLlib's word2vec implementation, using Scala. My input to the model is an array of string sequences. It looks like this:

scala> f.take(5)
res11: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0_42)], [WrappedArray(big, baller, shoe, ?)], [WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, tribe, become, future, kal...
val v=f.map(l=>Seq(l.toString))
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List  ([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ....

Each sentence is in its own list, as shown above. I run the model with v as the input:

scala> val model = word2vec.fit(v)

But the output of this model does not look right. When I save the model and read its parquet file back (as a), I get the following result.

   model.save(sc, "myModelPath")
   val a=sqlContext.read.parquet("myModelPath")
   a.show(20,false)
+--------------------------------------------------------------------+
|word                                                                |
+--------------------------------------------------------------------+
|[WrappedArray(coffee, machine)]                                     |
|[WrappedArray(good, experience)]                                    |
|[WrappedArray(love, room, !)]                                       |
|[WrappedArray(parking, .)]                                          |
|[WrappedArray(breakfast, great, !)]                                 |
|[WrappedArray(bed, comfortable, room, spacious, .)]                 |

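Instead of creating a vector for each word, this word2vec model creates a vector for each array of words. I am not sure what the correct way to feed input to this model is, or how it splits sentences into words.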

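I bet that if you look at v.first you will see List([WrappedArray(0_42)]), and if you look at v.first.head you will see [WrappedArray(0_42)]. But v.first.head is a String, so what you are actually looking at is "[WrappedArray(0_42)]". There is no WrappedArray, only a string. Perhaps you accidentally called toString on a WrappedArray (or fell victim to an implicit conversion to String). Word2Vec really does see strings like "[WrappedArray(coffee, machine)]" in its input, and it builds the model from those strings.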
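The effect can be reproduced without Spark. The following is only a minimal sketch using plain Scala collections; the object name ToStringPitfall and the sample tokens are made up for illustration. It shows that wrapping the result of toString in a Seq yields a single "word", while passing the tokens through unchanged keeps one element per word.

```scala
object ToStringPitfall {
  // Hypothetical sample sentence, already tokenized.
  val tokens: Seq[String] = Seq("coffee", "machine")

  // Same shape of mistake as in the question: the whole collection is
  // stringified first, so the "sentence" contains exactly one element.
  val wrong: Seq[String] = Seq(tokens.toString)

  // What a word-level model needs: one element per token.
  val right: Seq[String] = tokens

  def main(args: Array[String]): Unit = {
    println(s"wrong has ${wrong.length} element(s): $wrong")
    println(s"right has ${right.length} element(s): $right")
  }
}
```

With a Row the stringified form looks like [WrappedArray(coffee, machine)] rather than List(coffee, machine), but the consequence is the same: the model's vocabulary ends up holding whole stringified sentences.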

Update

If I have the types right, f is a DataFrame in which each Row contains a single field holding a Seq[String] (actually a WrappedArray).

So, instead of

val v=f.map(l=>Seq(l.toString))

what you should do is extract that field:

val v = f.map(r => r.getSeq[String](0))

This will produce a Dataset[Seq[String]], which should be suitable as input to Word2Vec.
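Putting the pieces together, a hedged sketch of the corrected pipeline might look like the following. It assumes the RDD-based org.apache.spark.mllib.feature.Word2Vec (which matches the model.save(sc, ...) call in the question) and that column 0 of f holds the token array. The object and method names are hypothetical, and going through f.rdd is my own choice so that fit receives an RDD regardless of Spark version.

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.DataFrame

object Word2VecInput {
  // Assumption: column 0 of `f` is an array<string> of tokens.
  def train(f: DataFrame): Word2VecModel = {
    // Extract the token sequence itself instead of stringifying the Row.
    val sentences = f.rdd.map(row => row.getSeq[String](0))

    // fit accepts an RDD whose elements are Iterable[String].
    new Word2Vec().fit(sentences)
  }
}
```

In the shell, Word2VecInput.train(f).getVectors.keys.take(5) should then list individual words rather than stringified arrays. Note that this Word2Vec drops words occurring fewer than 5 times by default, so a very small corpus may need setMinCount lowered.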
