We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2 and CentOS 7.2. We wrote some new code using the Spark DataFrame/Dataset APIs, but noticed incorrect results on a join after writing the data to, and then reading it back from, Windows Azure Storage Blobs (the default HDFS location). I have been able to reproduce the problem on the cluster with the following snippet.
import spark.implicits._  // encoders for toDS / as[...]; already in scope when using spark-shell
case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)
val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS
dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
Output:
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345| 0| 1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
| 0| 1| 1.0|
| 1| 0| 1.0|
| 2| 2| 1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345| 0| 1.0| 0| 1| 1.0|
+-----+---------+-----+---------+-------+-----+
which is correct. However, after writing the data out and reading it back, we see this:
dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")
val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]
dims2.show
cent2.show
dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
Output:
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345| 0| 1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
| 0| 1| 1.0|
| 1| 0| 1.0|
| 2| 2| 1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345| 0| 1.0| null| null| null|
+-----+---------+-----+---------+-------+-----+
However, using the RDD API produces the correct result:
dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)
res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))
We have tried changing the output format to ORC instead of Parquet (a sketch of that variant follows), but we see the same result. Running Spark 2.0 locally, rather than on the cluster, does not have this problem. Running Spark in local mode on the master node of the Hadoop cluster also works. We only see this problem when running on YARN.
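For reference, the ORC round trip we tried would look roughly like the following; this is a sketch with illustrative paths, not the exact commands from our session:

// Sketch only: same round trip as above, but writing ORC instead of Parquet.
// Paths are illustrative.
dims.write.mode("overwrite").format("orc").save("/tmp/dims2.orc")
cent.write.mode("overwrite").format("orc").save("/tmp/cent2.orc")
val dimsOrc = spark.read.format("orc").load("/tmp/dims2.orc").as[UserDimensions]
val centOrc = spark.read.format("orc").load("/tmp/cent2.orc").as[CentroidClusterScore]
dimsOrc.join(centOrc, dimsOrc("dimension") === centOrc("dimension")).show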
This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896
This issue has been resolved via https://issues.apache.org/jira/browse/SPARK-17806
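As a purely illustrative diagnostic (not part of the original report), one way to check whether the Parquet round trip changes the Dataset schema, for example a field's nullable flag, is to compare the schemas directly:

// Illustrative check: compare the in-memory and round-tripped schemas.
// Any field-level difference (name, type, nullable flag) would show up here.
dims.printSchema()
dims2.printSchema()
println(dims.schema == dims2.schema)

cent.printSchema()
cent2.printSchema()
println(cent.schema == cent2.schema)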