I have a DataFrame that contains rows sharing the same id. I need to merge all rows with the same id into a single row (one JSON record).
Sample data:
id first_name last_name
1 JAMES SMITH
2 MARY BROWN
2 DAVID WILLIAMS
1 ROBERT DAVIS
The requested result is:
{
id:1,
entities: [{
first_name:JAMES,
last_name:SMITH
}, {
first_name:ROBERT,
last_name:DAVIS
}]
}
{
id:2,
entities: [{
first_name:MARY,
last_name:BROWN
}, {
first_name:DAVID,
last_name:WILLIAMS
}]
}
Is this possible?
Regards, Yaniv
You can use groupBy and collect_list after "combining" the relevant columns into a nested struct:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes an existing SparkSession named `spark`

val input: DataFrame = Seq(
  (1, "JAMES", "SMITH"),
  (2, "MARY", "BROWN"),
  (2, "DAVID", "WILLIAMS"),
  (1, "ROBERT", "DAVIS")
).toDF("id", "first_name", "last_name")

val result = input
  .withColumn("entity", struct($"first_name", $"last_name"))
  .groupBy("id")
  .agg(collect_list($"entity") as "entities")
result.show(false)
// +---+--------------------------------+
// |id |entities |
// +---+--------------------------------+
// |1 |[[JAMES,SMITH], [ROBERT,DAVIS]] |
// |2 |[[MARY,BROWN], [DAVID,WILLIAMS]]|
// +---+--------------------------------+
result.printSchema()
// root
// |-- id: integer (nullable = false)
// |-- entities: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- first_name: string (nullable = true)
// | | |-- last_name: string (nullable = true)
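
If you need the output as actual JSON documents (as in the requested result), one option is to serialize each grouped row. A minimal sketch, assuming the result DataFrame from above; the output path is just an illustrative example:

// Print one JSON string per grouped id, e.g.
// {"id":1,"entities":[{"first_name":"JAMES","last_name":"SMITH"},{"first_name":"ROBERT","last_name":"DAVIS"}]}
result.toJSON.collect().foreach(println)

// Or write the whole result as newline-delimited JSON files
result.write.json("/tmp/entities_by_id")

toJSON converts each row (including the nested entities array of structs) into a JSON string, so no extra schema handling is needed.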