I am looking at the Word2Vec example on the Spark site:
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("country name here", 40)
How do I do the interesting vector arithmetic, such as king - man + woman = queen? I can use model.getVectors, but I'm not sure how to proceed from there.
Here is an example in pyspark, which I guess is straightforward to port to Scala. The key is the use of model.transform.
First, we train the model as in the example:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
sc = SparkContext()
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
k = 220 # vector dimensionality
word2vec = Word2Vec().setVectorSize(k)
model = word2vec.fit(inp)
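k is the dimensionality of the word vectors. Higher is generally better (the default is 100), but it costs memory, and 220 was the highest value my machine could handle. (EDIT: typical values in the relevant publications are between 300 and 1000.)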
After we have trained the model, we can define a simple function as follows:
def getAnalogy(s, model):
    # qry = s[0] - s[1] - s[2], so (-1)*qry = s[1] - s[0] + s[2]
    qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])
    res = model.findSynonyms((-1)*qry, 5)  # return 5 "synonyms"
    res = [x[0] for x in res]  # keep only the words, drop the similarities
    for k in range(0, 3):
        if s[k] in res:
            res.remove(s[k])
    return res[0]
Now, here are some examples with countries and their capitals:
s = ('france', 'paris', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('china', 'beijing', 'russia')
getAnalogy(s, model)
# u'moscow'
s = ('spain', 'madrid', 'greece')
getAnalogy(s, model)
# u'athens'
s = ('germany', 'berlin', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('japan', 'tokyo', 'sweden')
getAnalogy(s, model)
# u'stockholm'
s = ('finland', 'helsinki', 'iran')
getAnalogy(s, model)
# u'tehran'
s = ('egypt', 'cairo', 'finland')
getAnalogy(s, model)
# u'helsinki'
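The results are not always correct (I'll leave it to you to experiment), but they get better with more training data and a higher vector dimensionality k.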
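The for loop in the function removes entries that belong to the input query itself, because I noticed that the correct answer was frequently the second one in the returned list, with the first usually being one of the input terms.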
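To make the filtering step concrete, here is a minimal pure-Python sketch that needs no Spark. The three-dimensional vectors, the `toy_vectors` dictionary, and the `toy_analogy` and `cosine` helpers are all hypothetical and invented for illustration. They only mimic what getAnalogy does: rank words by cosine similarity to the query vector, then drop the input terms.

```python
import math

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
toy_vectors = {
    "france":   [1.0, 0.0, 0.0],
    "paris":    [1.0, 0.3, 0.0],
    "portugal": [0.0, 0.0, 1.0],
    "lisbon":   [0.3, 0.5, 0.8],
    "banana":   [-1.0, -0.5, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_analogy(s, vectors, num=5):
    """Return (ranked words, first ranked word that is not an input term)."""
    a, b, c = (vectors[w] for w in s)
    # b - a + c, the same quantity as (-1) * (a - b - c) in getAnalogy
    query = [bi - ai + ci for ai, bi, ci in zip(a, b, c)]
    ranked = sorted(vectors, key=lambda w: cosine(vectors[w], query), reverse=True)[:num]
    candidates = [w for w in ranked if w not in s]
    return ranked, candidates[0]

ranked, answer = toy_analogy(("france", "paris", "portugal"), toy_vectors)
print(ranked)  # the input term 'portugal' ranks ahead of 'lisbon'
print(answer)  # 'lisbon' once the input terms are filtered out
```

With these made-up vectors the query lies closest to 'portugal' itself, which is the situation the filter is meant to handle.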
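import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.DenseVector

// sameModel is a trained Word2VecModel; getVectors returns a Map[String, Array[Float]] of word -> vector
val w2v_map = sameModel.getVectors
// Note: the third key is "women" (plural), possibly a typo for "woman"; kept because the output shown below comes from this code
val (king, man, woman) = (w2v_map.get("king").get, w2v_map.get("man").get, w2v_map.get("women").get)
val n = king.length
// saxpy(n: Int, sa: Float, sx: Array[Float], incx: Int, sy: Array[Float], incy: Int) computes sy := sa * sx + sy in place
blas.saxpy(n, -1, man, 1, king, 1)  // king := king - man
blas.saxpy(n, 1, woman, 1, king, 1) // king := king - man + woman
val vec = new DenseVector(king.map(_.toDouble))
val most_similar_word_to_vector = sameModel.findSynonyms(vec, 10) // findSynonyms accepts either a word or a vector
for ((synonym, cosineSimilarity) <- most_similar_word_to_vector) {
  println(s"$synonym $cosineSimilarity")
}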
The output is as follows:
women 0.628454885964967
philip 0.5539534290356802
henry 0.5520055707837214
vii 0.5455116413024774
elizabeth 0.5290994886254643
**queen 0.5162519562606844**
men 0.5133851770249461
wenceslaus 0.5127030522678778
viii 0.5104392579985102
eldest 0.510425791249559
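For readers unfamiliar with BLAS, the two saxpy calls in the Scala snippet each perform the in-place update y := a * x + y, so the king array ends up holding king - man + woman. Below is a minimal Python sketch of that contract. The `axpy` helper and the three-dimensional vectors are hypothetical and invented for illustration; unlike the Scala code, the sketch works on a copy so the original vector stays intact.

```python
def axpy(a, x, y):
    """In-place y := a * x + y, mirroring the BLAS axpy contract with unit stride."""
    for i in range(len(y)):
        y[i] += a * x[i]

# Hypothetical 3-dimensional vectors, invented for illustration only.
king = [0.5, 0.75, 0.25]
man = [0.25, 0.5, 0.0]
woman = [0.25, 0.0, 0.5]

result = list(king)       # copy, so the original king vector is left untouched
axpy(-1.0, man, result)   # result = king - man
axpy(1.0, woman, result)  # result = king - man + woman
print(result)
```

Copying first is a design choice rather than a requirement; it only matters if the original vector is needed again later.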
This is pseudocode. For a full implementation, read the documentation: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html
w2v_map = model.getVectors() # this gives you a map {word: vec}
my_vector = w2v_map.get('king') - w2v_map.get('man') + w2v_map.get('woman') # do the vector algebra here
most_similar_word_to_vector = model.findSynonyms(my_vector, 10) # findSynonyms has one API for a word and one for a vector
EDIT: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html#findSynonyms(org.apache.spark.mllib.linalg.Vector,%20int)