Combine all DataFrame columns into one column as JSON, preserving the JSON type


val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

someDF.show(false) gives:

+------+--------------------------------------------------+
|number|word                                              |
+------+--------------------------------------------------+
|8     |{"details":{"decision":"ACCEPT","source":"Rules"}}|
|64    |{"details":{"decision":"ACCEPT","source":"Rules"}}|
+------+--------------------------------------------------+

Problem statement: I want to merge all the columns into a single column, with the JSON kept as JSON in that single output column, i.e. without the escaped quotes and so on that I get below.

What I tried:

someDF.toJSON.toDF.show(false)

// this escaped the quotes, which I don't want
+-----------------------------------------------------------------------------------+
|value                                                                              |
+-----------------------------------------------------------------------------------+
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"} |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"}|
+-----------------------------------------------------------------------------------+

someDF.select(to_json(struct(col("*"))).alias("value")) has the same problem.

What I want:

+------------------------------------------------------------------------------------------------+
|value                                                                                           |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                          |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                         |
+------------------------------------------------------------------------------------------------+

Is there a way to do this?

Update: although I'm using a simple DataFrame here, in reality I have hundreds of columns, so a manually defined schema won't work for me.

The "word" column in "someDF" is of string type, so to_json treats it as a regular string. The key here is to convert the "word" column to a struct type before applying to_json:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

// Schema matching the JSON held in the "word" column
val schema = StructType(Seq(
  StructField("details", StructType(Seq(
    StructField("decision", StringType),
    StructField("source", StringType)
  )))
))

someDF.select(
  to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")
).show(false)

Result:

+-----------------------------------------------------------------------+
|value                                                                  |
+-----------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
+-----------------------------------------------------------------------+
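
If writing the schema by hand is not practical (per the update, there may be hundreds of columns), one alternative is to let Spark infer the JSON schema from the data rather than declaring it manually. The snippet below is a minimal sketch of that idea, not part of the original answer; the names wordSchema and value are just illustrative, and it assumes the "word" column always holds well-formed JSON:

import org.apache.spark.sql.functions.{col, from_json, struct, to_json}

// Infer the schema of the "word" column by reading it as a JSON dataset
// (needs spark.implicits._ for .as[String], as in the examples above).
val wordSchema = spark.read.json(someDF.select("word").as[String]).schema

// Replace the string column with its parsed struct, then serialize the whole row.
val value = someDF
  .withColumn("word", from_json(col("word"), wordSchema))
  .select(to_json(struct(col("*"))).alias("value"))

value.show(false)

The same pattern can be repeated for every column that holds a JSON string, at the cost of one extra pass over the data per inferred schema.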

You can retrieve the list of columns with the DataFrame's columns method, and then build the JSON string by hand with a combination of the concat and concat_ws built-in functions:

import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val result = someDF.select(
  concat(
    lit("{"),
    concat_ws(
      ",",
      someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
    ),
    lit("}")
  ).as("value")
)
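
For reference, a quick check of the hand-built string (a sketch, with the column names from the question). Note that values are concatenated verbatim, so this only yields valid JSON when every column value is itself a valid JSON fragment (a number, or an embedded JSON object like "word"); a plain free-text string column would still need quoting and escaping.

result.show(false)
// Expected rows with the sample data above:
// {"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}
// {"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}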
