How to split JSON-formatted column values in a Spark DataFrame using foreach

I want to split the values of a JSON-formatted column in a Spark DataFrame.

The Hive table allrules_internal:

-----------------------------------------------------------------
| tablename | condition                            | filter     |
|---------------------------------------------------------------|
| documents | {"col_list":"document_id,comments"}  | NA         |
| person    | {"per_list":"person_id, name, age"}  | NA         |
-----------------------------------------------------------------

Code:

val allrulesDF = spark.read.table("default" + "." + "allrules_internal")
allrulesDF.show()
val df1 = allrulesDF.select(allrulesDF.col("tablename"), allrulesDF.col("condition"), allrulesDF.col("filter"), allrulesDF.col("dbname")).collect()

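Here I want to split the value of the condition column. In the example above, I want to keep only the "document_id,comments" part. In other words, the condition column holds a key/value pair, but I only want the value.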

How do I split the values when allrules_internal contains more than one row?

df1.foreach(row => { 
//   condition = row.getAs("condition").toString() // how do I retrieve the value here?
println(condition)
val tableConditionDF = spark.sql("SELECT "+ condition + " FROM " + db_name + "." + table_name)
tableConditionDF.show()
})

You can use the from_json function:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // StructType, StructField, DataTypes
import spark.implicits._
allrulesDF
.withColumn("condition", from_json($"condition", StructType(Seq(StructField("col_list", DataTypes.StringType, true)))))
.select($"tablename", $"condition.col_list".as("condition"))
.show(false) // false = do not truncate long values

For the documents row, this prints:

+---------+---------------------+
|tablename|condition            |
+---------+---------------------+
|documents|document_id, comments|
+---------+---------------------+
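
The schema in this answer names only the col_list key, so a row whose JSON uses a different key (such as per_list in the person row) would come back as null. One possible way to cover several rows with differing keys is to parse the JSON as a map and take its first value. The following is only a sketch under assumptions: the local SparkSession and the inline rules DataFrame are stand-ins for the real Hive table, each JSON object is assumed to hold exactly one key, and map_values needs Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, map_values}
import org.apache.spark.sql.types.{MapType, StringType}

// Local session for the sketch; in spark-shell, getOrCreate returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("json-condition").getOrCreate()
import spark.implicits._

// Stand-in for allrules_internal, built from the two rows shown in the question.
val rules = Seq(
  ("documents", """{"col_list":"document_id,comments"}""", "NA"),
  ("person", """{"per_list":"person_id, name, age"}""", "NA")
).toDF("tablename", "condition", "filter")

// Parse the JSON as a string-to-string map so the key name does not matter,
// then keep the first (and, by assumption, only) value of that map.
val firstValue = map_values(from_json(col("condition"), MapType(StringType, StringType))).getItem(0)
val parsed = rules.select(col("tablename"), firstValue.as("condition"))

parsed.show(false)
```

With the sample rows, the condition column then holds document_id,comments for documents and person_id, name, age for person.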

Explanation:

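The withColumn method creates a new column (or replaces an existing column of the same name) from an expression over one or more columns. Here the expression is from_json, which takes a column containing JSON strings and a StructType describing the schema of those strings. After that, you only need to select the columns you want.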
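
Once the JSON value has been extracted into a plain string column, the loop from the question could look roughly like the sketch below. It is not a definitive implementation: buildQuery is a hypothetical helper, the inline rules DataFrame stands in for the Hive table, and dbname is assumed to be a column of that table because the question's select refers to it. The typed accessor row.getAs[String](...) is what reads a field from each collected Row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for the sketch; in spark-shell, getOrCreate returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("condition-loop").getOrCreate()
import spark.implicits._

// Hypothetical helper: turns one rule into a SELECT statement.
def buildQuery(dbName: String, tableName: String, columns: String): String =
  s"SELECT $columns FROM $dbName.$tableName"

// Stand-in for allrules_internal; the dbname column is an assumption.
val rules = Seq(
  ("documents", """{"col_list":"document_id,comments"}""", "NA", "default")
).toDF("tablename", "condition", "filter", "dbname")

// Extract the col_list value from the JSON string.
val schema = StructType(Seq(StructField("col_list", StringType, nullable = true)))
val columns = from_json(col("condition"), schema).getField("col_list").as("columns")

// The rule table is small, so collecting it to the driver is acceptable here.
val rows = rules.select(col("dbname"), col("tablename"), columns).collect()

// Read each field with a typed getAs and build one query per rule.
val queries = rows.map { row =>
  buildQuery(row.getAs[String]("dbname"), row.getAs[String]("tablename"), row.getAs[String]("columns"))
}

// foreach on the collected Array runs on the driver, where spark.sql may be called.
queries.foreach { query =>
  println(query)
  // spark.sql(query).show() // enable once the target tables exist
}
```

Collecting first matters: calling spark.sql inside DataFrame.foreach would run on the executors, where the SparkSession is not usable, whereas foreach on the collected Array stays on the driver.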

Hope it helps!
