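To train myself in Spark and classical statistical analysis, I am trying to run some examples taken from books (general statistics books, not written specifically for computing or for Spark).
One example in the book shows how to compute the Spearman correlation coefficient for the scores two judges gave to ten athletes: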
|Judge 1|8.3|7.6|9.1|9.5|8.4|6.9|9.2|7.8|8.6|8.2
|Judge 2|7.9|7.4|9.1|9.3|8.4|7.5|9.0|7.2|8.2|8.1
It then builds an intermediate table of ranks:
|Judge 1|5|2|8|10|6|1|9|3|7|4
|Judge 2|4|2|9|10|7|3|8|1|6|5
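The rank rows above can be reproduced with a few lines of plain Java, without Spark. This is only a sketch: the class and method names are made up for illustration, and it assumes there are no tied scores within a judge's row, which holds for this data.

```java
import java.util.Arrays;

final class RankSketch {
    // Rank 1 goes to the lowest score. Assumes no ties within the array.
    static int[] ranks(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int[] ranks = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Position of the score in the sorted copy, shifted to be 1-based.
            ranks[i] = Arrays.binarySearch(sorted, scores[i]) + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        System.out.println(Arrays.toString(ranks(judge1))); // [5, 2, 8, 10, 6, 1, 9, 3, 7, 4]
        System.out.println(Arrays.toString(ranks(judge2))); // [4, 2, 9, 10, 7, 3, 8, 1, 6, 5]
    }
}
```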
The book's example ends with the result
r=0.915
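That figure can be checked by hand from the two rank rows. I am assuming here that the book uses the usual rank-difference form of Spearman's coefficient, which applies because neither judge gave tied scores. The symbol d_i (Judge 1 rank minus Judge 2 rank for athlete i) is my own notation, not necessarily the book's.

```latex
\begin{aligned}
d_i &= (1,\ 0,\ -1,\ 0,\ -1,\ -2,\ 1,\ 2,\ 1,\ -1), \qquad \sum_{i=1}^{10} d_i^2 = 14, \\
r_s &= 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)}
     = 1 - \frac{6 \cdot 14}{10 \cdot 99}
     = 1 - \frac{84}{990} \approx 0.915 .
\end{aligned}
```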
Following the API documentation for Correlation, I tried to implement it with Spark like this:
List<Row> data = Arrays.asList(
    RowFactory.create(Vectors.dense(8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2)),
    RowFactory.create(Vectors.dense(7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1))
);
StructType schema = new StructType(new StructField[]{
    new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
Dataset<Row> df = this.session.createDataFrame(data, schema);
Row r2 = Correlation.corr(df, "features", "spearman").head();
System.out.println("Spearman correlation matrix:\n" + r2.get(0).toString());
But it does not give me a single coefficient. Instead I get another matrix that looks strange to me:
Spearman correlation matrix:
1.0 0.9999999999999998 NaN ... (10 total)
0.9999999999999998 1.0 NaN ...
NaN NaN 1.0 ...
0.9999999999999998 0.9999999999999998 NaN ...
NaN NaN NaN ...
-0.9999999999999998 -0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
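I am new to MLlib and not very strong in statistics. Clearly I am doing something wrong.
What am I seeing here instead of what I expected, and how can I get the result I am after?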
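Part of the solution to the problem:
I simply had the vectors the wrong way round. Each row has to hold one athlete's pair of scores (Judge 1, Judge 2), not one judge's ten scores. This corrects it: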
List<Row> data = Arrays.asList(
    RowFactory.create(Vectors.dense(8.3, 7.9)),
    RowFactory.create(Vectors.dense(7.6, 7.4)),
    RowFactory.create(Vectors.dense(9.1, 9.1)),
    RowFactory.create(Vectors.dense(9.5, 9.3)),
    RowFactory.create(Vectors.dense(8.4, 8.4)),
    RowFactory.create(Vectors.dense(6.9, 7.5)),
    RowFactory.create(Vectors.dense(9.2, 9.0)),
    RowFactory.create(Vectors.dense(7.8, 7.2)),
    RowFactory.create(Vectors.dense(8.6, 8.2)),
    RowFactory.create(Vectors.dense(8.2, 8.1))
);
Correlation between the two sports judges' scores:
1.0 0.9151515151515153
0.9151515151515153 1.0
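As for the strange 10x10 matrix in the question: Correlation.corr treats each row as one observation and each vector element as one variable, so two rows of ten values give the correlation between every pair of athletes, each estimated from only two points. With two observations a correlation can only be +1 (both values move the same way from the first judge to the second), -1 (they move in opposite ways) or NaN (one of them does not move at all, as for the athletes scored 9.1/9.1 and 8.4/8.4). The plain-Java sketch below, with hypothetical names and no Spark involved, reproduces the pattern of the first column of that matrix. I read the 0.9999999999999998 values Spark printed as floating-point rounding of 1.

```java
final class TwoPointCorrelation {
    // Correlation of two variables that were each observed only twice:
    // the sign of the product of their changes, or NaN if either one is constant.
    static double corr(double x1, double x2, double y1, double y2) {
        double dx = x2 - x1;
        double dy = y2 - y1;
        if (dx == 0.0 || dy == 0.0) {
            return Double.NaN; // zero variance, correlation undefined
        }
        return Math.signum(dx) * Math.signum(dy);
    }

    public static void main(String[] args) {
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        // First column of the 10x10 matrix: athlete 0 against every athlete.
        for (int j = 0; j < judge1.length; j++) {
            System.out.println(corr(judge1[0], judge2[0], judge1[j], judge2[j]));
        }
        // Prints 1.0, 1.0, NaN, 1.0, NaN, -1.0, 1.0, 1.0, 1.0, 1.0 (one per line)
    }
}
```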