使用 Spark (1.6) 从 Scala 中的数据帧中的数组列中删除 null



我有一个带有"id"列的数据帧和一个具有结构数组的列。架构:

root
|-- id: string (nullable = true)
|-- desc: array (nullable = false)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- age: long (nullable = false)

数组 "desc" 可以具有任意数量的空值。我想使用 Spark 1.6 创建数组中没有空值的最终数据帧:

一个例子是:

Key  .   Value
1010 .   [[George,21],null,[MARIE,13],null]
1023 .   [null,[Watson,11],[John,35],null,[Kyle,33]]

我希望最终数据帧为:

id   .   desc
1010 .   [[George,21],[MARIE,13]]
1023 .   [[Watson,11],[John,35],[Kyle,33]]

我尝试使用 UDF 和case class执行此操作,但得到了

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema 不能强制转换为...。

非常感谢任何帮助,如果需要,我宁愿在不转换为RDD的情况下这样做。

这是另一个版本:

case class Person(name: String, age: Int)
root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- age: integer (nullable = false)
+----+-----------------------------------------------+
|id  |desc                                           |
+----+-----------------------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|
+----+-----------------------------------------------+

val filterOutNull = udf((xs: Seq[Row]) => {
xs.flatMap {
case null => Nil
// convert the Row back to your specific struct:
case Row(s: String,i: Int) => List(Person(s, i))
}
})
val result = df.withColumn("filteredListDesc", filterOutNull($"desc"))
+----+-----------------------------------------------+-----------------------------------+
|id  |desc                                           |filteredListDesc                   |
+----+-----------------------------------------------+-----------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |[[George,21], [MARIE,13]]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------------------+-----------------------------------+

假设原始数据帧具有以下架构

root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- age: long (nullable = false)

定义一个udf函数来从数组中删除空值应该适合

import org.apache.spark.sql.functions._
def removeNull = udf((array: Seq[Row])=> array.filterNot(_ == null).map(x => element(x.getAs[String]("name"), x.getAs[Long]("age"))))
df.withColumn("desc", removeNull(col("desc")))

其中elementcase class

case class element(name: String, age: Long)

你应该得到

+----+-----------------------------------+
|id  |desc                               |
+----+-----------------------------------+
|1010|[[George,21], [MARIE,13]]          |
|1010|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------+

Spark 3.4+

array_compact($"desc")
<小时 />

示例输入:

case class Person(name: String, age: Long)
val df1 = Seq(
("1010", Seq(Person("George", 21), null, Person("MARIE", 13), null)),
("1023", Seq(null, Person("Watson", 11), Person("John", 35), null, Person("Kyle", 33)))
).toDF("id", "desc")
df1.show(truncate=false)
// +----+--------------------------------------------------+
// |id  |desc                                              |
// +----+--------------------------------------------------+
// |1010|[{George, 21}, null, {MARIE, 13}, null]           |
// |1023|[null, {Watson, 11}, {John, 35}, null, {Kyle, 33}]|
// +----+--------------------------------------------------+
df1.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- desc: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- name: string (nullable = true)
//  |    |    |-- age: long (nullable = false)

使用array_compact

val df2 = df1.withColumn("desc", array_compact($"desc"))
df2.show(truncate=false)
// +----+--------------------------------------+
// |id  |desc                                  |
// +----+--------------------------------------+
// |1010|[{George, 21}, {MARIE, 13}]           |
// |1023|[{Watson, 11}, {John, 35}, {Kyle, 33}]|
// +----+--------------------------------------+

相关内容

  • 没有找到相关文章

最新更新