Spark DataFrame self-join produces an empty DataFrame



Below is the data I am reading from a CSV into a DataFrame.

id,pid,pname,ppid
1,  1,    5,  -1
2,  1,    7,  -1
3,  2,    9,   1
4,  2,   11,   1
5,  3,    5,   1
6,  4,    7,   2
7,  1,    9,   3

I am reading this data into a DataFrame, data_df, and trying a self-join on different columns, but the resulting DataFrame is empty. I have tried several variants.

Here is my code. Only the last join, joined4, produces any rows.

// Attempt 1: cross join, then filter on the join condition
val joined = data_df.as("first").join(data_df.as("second")).where(col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)

// Attempt 2: inner join with an explicit join condition
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)

// Attempt 3: aliased copies joined with the $ column syntax
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)

// Attempt 4: join on the shared "id" column (the only attempt that returns rows)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)

Here are the outputs of joined, joined2, joined3, and joined4:

joined:

+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+

joined2:

+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+

joined3:

+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+

joined4:

+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 |  1|    5|  -1|  1|    5|  -1|
| 2 |  1|    7|  -1|  1|    7|  -1|
| 3 |  2|    9|   1|  2|    9|   1|
| 4 |  2|   11|   1|  2|   11|   1|
| 5 |  3|    5|   1|  3|    5|   1|
| 6 |  4|    7|   2|  4|    7|   2|
| 7 |  1|    9|   3|  1|    9|   3|
+---+---+-----+----+---+-----+----+

My apologies; I later found that whitespace in the CSV was causing the problem. With a correctly formatted CSV for the initial data, the problem goes away.
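To see why: assuming the columns come in as strings, the padded fields keep their leading spaces, so a condition like col("first.ppid") === col("second.pid") compares "   1" against "  1" and never matches; id is the only column without padding, which is why only joined4 (joining on id) returned rows. A minimal sketch of the effect:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// pid is read as "  1" (two leading spaces) and ppid as "   1" (three),
// so string equality is false and the join condition never matches.
val padded = Seq(("  1", "   1")).toDF("pid", "ppid")
padded.select($"pid" === $"ppid").show() // prints false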

Here is the CSV with its formatting corrected:

id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3

Ideally, I could also have used the reader option that ignores leading whitespace, like this:

val data_df = spark.read
  .schema(dataSchema)
  .option("mode", "FAILFAST")
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true") // strip the padding that broke the joins
  .csv(dataSourceName)
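
With either fix in place, the join conditions above match as intended. As a sanity check, the first attempt (a parent lookup via ppid) can be rerun against the reloaded data_df; this is a sketch reusing the names above:

import org.apache.spark.sql.functions.col

// Pair each row with the row whose pid equals its ppid; rows with
// ppid = -1 have no parent and are dropped by the inner join.
val parents = data_df.as("child")
  .join(data_df.as("parent"), col("child.ppid") === col("parent.pid"))
parents.show(50, truncate = false)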
