Jaccard similarity using cartesian

I have this code:

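import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

// spark, sc and shinglesDocs are defined elsewhere in the program.
// One row per document: file name plus its array of shingles.
StructType schema = new StructType(
        new StructField[] { DataTypes.createStructField("file_path", DataTypes.StringType, false),
                DataTypes.createStructField("file_content",
                        DataTypes.createArrayType(DataTypes.StringType, false), false) });
Dataset<Row> df = spark.createDataFrame(shinglesDocs.map(new Function<Tuple2<String, String[]>, Row>() {
    @Override
    public Row call(Tuple2<String, String[]> record) {
        // Keep only the file name, dropping the directory part of the path.
        return RowFactory.create(record._1().substring(record._1().lastIndexOf("/") + 1), record._2());
    }
}), schema);
df.show(true);
// Binary CountVectorizer: 1 if the shingle occurs in the document, 0 otherwise.
CountVectorizer vectorizer = new CountVectorizer().setInputCol("file_content").setOutputCol("feature_vector")
        .setBinary(true);
CountVectorizerModel cvm = vectorizer.fit(df);
Broadcast<Integer> vocabSize = sc.broadcast(cvm.vocabulary().length);
System.out.println("vocab size = " + cvm.vocabulary().length);
for (int i = 0; i < vocabSize.value(); i++) {
    System.out.print(cvm.vocabulary()[i] + "(" + i + ") ");
}
System.out.println();
Dataset<Row> characteristicMatrix = cvm.transform(df);
characteristicMatrix.show(false);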

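cm (characteristicMatrix in the code above) contains = [column-for-document1, column-for-document2, column-for-document3]

where column-for-document1 looks like this: (1, 0, 1, 1, 0, 0, 1, 1, 1)

I need to calculate JS = a/(a+b+c) for each pair of columns:

the Jaccard similarity (JS) between column-for-document1 and column-for-document2
the Jaccard similarity (JS) between column-for-document1 and column-for-document3
the Jaccard similarity (JS) between column-for-document2 and column-for-document3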
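
For a single pair of columns, the formula can be sketched in plain Java as below. The question does not define a, b and c, so this assumes the usual reading: a counts positions where both columns are 1, b positions where only the first is 1, and c positions where only the second is 1. The class name and the second column are made up for illustration, and returning 0.0 for two all-zero columns is a convention chosen here, not something the question specifies.

```java
// Minimal sketch: Jaccard similarity of two binary columns, JS = a / (a + b + c).
// Assumption: a = both are 1, b = only the first is 1, c = only the second is 1.
final class JaccardBinary {
    static double jaccard(int[] x, int[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        int a = 0, b = 0, c = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == 1 && y[i] == 1) {
                a++;
            } else if (x[i] == 1) {
                b++;
            } else if (y[i] == 1) {
                c++;
            }
        }
        int union = a + b + c;
        // Convention chosen for this sketch: two all-zero columns give 0.0.
        return union == 0 ? 0.0 : (double) a / union;
    }

    public static void main(String[] args) {
        int[] doc1 = {1, 0, 1, 1, 0, 0, 1, 1, 1}; // column from the question
        int[] doc2 = {1, 1, 0, 1, 0, 0, 0, 1, 1}; // hypothetical second column
        System.out.println("JS(doc1, doc2) = " + jaccard(doc1, doc2));
    }
}
```

With these two columns a = 4, b = 2 and c = 1, so the similarity is 4/7.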

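But cm is a large file spread across 3 different computers (because this is big-data programming), so:

column-for-document1 is on one computer; column-for-document2 is on another computer; column-for-document3 is on a third computer

If they are all on different computers, how do you calculate the above?

I need to use cartesian for this: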

cm.cartesian(cm)

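But I am not even sure where to start, because cm is a Dataset. I thought maybe I could convert it to an array and then compare indices, but I have never worked with a Dataset before, so I do not know how to do that or what the best strategy would be.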
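
One way to think about the "compare indices" idea: a cartesian product hands a function two rows at a time, so the per-pair work only needs the positions of the 1-entries in each column. The sketch below is plain Java with no Spark dependency, so it shows the pair logic rather than a working Spark job. The names `PairwiseJaccard`, `Doc` and `allPairs` are hypothetical, the nested loop only stands in for what `JavaRDD.cartesian` or `Dataset.crossJoin` would produce, and it assumes each column is available as a sorted array of indices (the shape `SparseVector.indices()` returns).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Java standing in for the function applied to each
// element of a cartesian product. All names here are hypothetical.
final class PairwiseJaccard {
    // One column of the characteristic matrix: file name plus the sorted
    // positions of its 1-entries.
    static final class Doc {
        final String name;
        final int[] indices;

        Doc(String name, int[] indices) {
            this.name = name;
            this.indices = indices;
        }
    }

    // Jaccard similarity of two sorted, duplicate-free index arrays.
    static double jaccard(int[] x, int[] y) {
        int i = 0, j = 0, both = 0;
        while (i < x.length && j < y.length) {
            if (x[i] == y[j]) {
                both++;
                i++;
                j++;
            } else if (x[i] < y[j]) {
                i++;
            } else {
                j++;
            }
        }
        int union = x.length + y.length - both;
        return union == 0 ? 0.0 : (double) both / union;
    }

    // Emulates cartesian(docs, docs) and keeps each unordered pair once,
    // dropping self-pairs and mirrored duplicates.
    static Map<String, Double> allPairs(List<Doc> docs) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Doc p : docs) {
            for (Doc q : docs) {
                if (p.name.compareTo(q.name) < 0) {
                    out.put(p.name + "|" + q.name, jaccard(p.indices, q.indices));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("doc1", new int[] {0, 2, 3, 6, 7, 8})); // column from the question
        docs.add(new Doc("doc2", new int[] {0, 1, 3, 7, 8}));    // made-up data
        docs.add(new Doc("doc3", new int[] {1, 4, 5}));          // made-up data
        allPairs(docs).forEach((pair, js) -> System.out.println(pair + " -> " + js));
    }
}
```

The name comparison is there because a cartesian product yields every ordered pair, including (doc1, doc1) and both (doc1, doc2) and (doc2, doc1); filtering on the file name leaves the three comparisons listed above.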

Please write your answer in Java Spark.

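This looks like an ideal case for the MinHash algorithm.

The algorithm lets you take in streams of data (for example, from 3 different computers) and, using many hash functions, estimate the similarity between the streams, i.e. the Jaccard similarity.

You can find an implementation of the MinHash algorithm in the Spark documentation: http://spark.apache.org/docs/2.2.3/ml-features.html#minhash-for-jaccard-distance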
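
To make the idea concrete, here is a small plain-Java sketch of the MinHash estimator. It is not Spark's `MinHashLSH` implementation; the class name, the hash family (a*x + b) mod p, the prime, and the seed are all assumptions chosen for illustration. Each document is reduced to a short signature (the minimum hash value per hash function), and the fraction of signature positions on which two documents agree estimates their Jaccard similarity. It assumes non-empty sets of non-negative indices smaller than the prime.

```java
import java.util.Random;

// Sketch of the MinHash idea, not Spark's MinHashLSH implementation.
// All names and constants are illustrative assumptions.
final class MinHashSketch {
    static final long PRIME = 2147483647L; // 2^31 - 1

    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int k = 0; k < numHashes; k++) {
            a[k] = 1 + rnd.nextInt((int) (PRIME - 1)); // 1 .. PRIME - 1
            b[k] = rnd.nextInt((int) PRIME);           // 0 .. PRIME - 1
        }
    }

    // Signature of a non-empty set of indices in [0, PRIME):
    // for each hash function, the smallest hash value over the set.
    long[] signature(int[] indices) {
        long[] sig = new long[a.length];
        for (int k = 0; k < a.length; k++) {
            long min = Long.MAX_VALUE;
            for (int x : indices) {
                long h = (a[k] * x + b[k]) % PRIME;
                if (h < min) {
                    min = h;
                }
            }
            sig[k] = min;
        }
        return sig;
    }

    // Fraction of positions where the two signatures agree.
    static double estimate(long[] s, long[] t) {
        int same = 0;
        for (int k = 0; k < s.length; k++) {
            if (s[k] == t[k]) {
                same++;
            }
        }
        return (double) same / s.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(256, 42L);
        long[] doc1 = mh.signature(new int[] {0, 2, 3, 6, 7, 8}); // column from the question
        long[] doc2 = mh.signature(new int[] {0, 1, 3, 7, 8});    // made-up data
        System.out.println("estimated JS = " + estimate(doc1, doc2));
    }
}
```

The point for the distributed setting is that each machine can compute the signature of its own document independently, and only the short signatures need to be brought together for comparison. The result is an estimate that tightens as the number of hash functions grows.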
