val someDF = Seq(
(8, """{"details":{"decision":"ACCEPT","source":"Rules"}"""),
(64, """{"details":{"decision":"ACCEPT","source":"Rules"}""")
).toDF("number", "word")
someDF.show(false)
+------+--------------------------------------------------+
|number|word                                              |
+------+--------------------------------------------------+
|8     |{"details":{"decision":"ACCEPT","source":"Rules"}}|
|64    |{"details":{"decision":"ACCEPT","source":"Rules"}}|
+------+--------------------------------------------------+
Problem statement: I want to merge all columns into a single output column, with the JSON kept as JSON in that column, i.e. without the escaped quotes I get below.
What I tried:
someDF.toJSON.toDF.show(false)
// this escaped the quotes, which I don't want
+----------------------------------------------------------------------------------+
|value                                                                             |
+----------------------------------------------------------------------------------+
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"} |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"}|
+----------------------------------------------------------------------------------+
The problem is the same with someDF.select(to_json(struct(col("*"))).alias("value")).
What I want:
+------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
+------------------------------------------------------------------------------------------------+
Is there a way to do this?
Update: although I used a simple DataFrame here, in reality I have hundreds of columns, so a manually defined schema does not work for me.
;单词";列中的";someDF";是字符串类型,因此to_json
将其视为常规字符串。这里的关键是将";单词";列转换为结构类型,然后再使用to_json
。
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val someDF = Seq(
(8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
(64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")
val schema = StructType(Seq(
  StructField("details", StructType(Seq(
    StructField("decision", StringType),
    StructField("source", StringType)
  )))
))
someDF.select(to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")).show(false)
Result:
+-----------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
+-----------------------------------------------------------------------+
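To see why the straight to_json call escaped the quotes in the first place, here is a plain-Scala sketch of the two serialization paths (no Spark needed; asJsonString is a hypothetical helper that mimics JSON string-escaping):

```scala
// Sketch of the two ways a JSON serializer can emit the "word" cell.
// When the cell is a StringType column, the serializer must emit it as a
// JSON *string*, escaping the embedded quotes; only a struct serializes
// as a nested object.
def asJsonString(raw: String): String =
  "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

val word = """{"details":{"decision":"ACCEPT","source":"Rules"}}"""

// StringType path: quotes get escaped (what the question observed)
println(s"""{"number":8,"word":${asJsonString(word)}}""")

// struct path: nesting is preserved (what from_json + to_json produces)
println(s"""{"number":8,"word":$word}""")
```

If writing the schema by hand is impractical (the update mentions hundreds of columns), Spark 2.4+ also offers schema_of_json, which can infer the schema of the "word" column from a sample value, e.g. from_json($"word", schema_of_json(lit(sampleJson))).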
You can retrieve the list of columns with the DataFrame's columns method, then build the JSON string manually with a combination of the concat and concat_ws built-in functions:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val result = someDF.select(
  concat(
    lit("{"),
    concat_ws(
      ",",
      // emit "colName":value for each column; cells holding JSON are spliced in as-is
      someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
    ),
    lit("}")
  ).as("value")
)
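As a plain-Scala sanity check of what that expression assembles per row (no Spark required; the column names and raw cell values below stand in for Spark columns):

```scala
// Mirrors the concat/concat_ws expression: "name":value pairs joined by
// commas and wrapped in braces. Because the "word" cell already holds
// valid JSON, splicing it in unquoted yields the nested object directly.
def rowToJson(cells: Seq[(String, String)]): String =
  cells.map { case (name, value) => "\"" + name + "\":" + value }
    .mkString("{", ",", "}")

println(rowToJson(Seq(
  "number" -> "8",
  "word"   -> """{"details":{"decision":"ACCEPT","source":"Rules"}}"""
)))
// {"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}
```

Note that, unlike the from_json/to_json route, this string splicing produces invalid JSON if a column holds a plain string that is not already quoted or valid JSON.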