检查数据帧中的重复值并实现ignoreNulls参数



我创建了一个函数来检查数据帧中是否存在基于列序列的重复值。

我想实现一个";ignoreNulls";,作为布尔参数传递到函数

  • 如果为true,将忽略而不对null值进行分组和计数。因此,对于null值;newColName";将返回false
  • 如果为false(默认值(,则将nulls值视为一个组,如果我正在检查的键有多个带nulls的值,则返回true

我不知道该怎么做。我应该使用if还是case?有一些表达式可以忽略partitionBy语句中的null?

有人能帮我吗?

这是的当前功能

def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
val repeatedCondition = $"sum" > 1
val windowCondition   = Window.partitionBy(keys.head, keysToCheck.tail: _*)
dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
}

一些测试数据

val testDF = Seq(
("1", Some("name-1")),
("2", Some("repeated-name")),
("3", Some("repeated-name")),
("4", Some("name-4")),
("5", None),
("6", None)
).toDF("name_key", "name")

测试功能

val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"))

输出(没有ignoreNulls实现(

+--------+---------------+--------------------+
|name_key|       name    |  has_repeated_name |
+--------+---------------+--------------------+
|     1  |      name-1   |              false |
+--------+---------------+--------------------+
|     2  | repeated-name |               true |
+--------+---------------+--------------------+
|     3  | repeated-name |               true |
+--------+---------------+--------------------+
|     4  |      name-4   |              false |
+--------+---------------+--------------------+
|     5  |         null  |               true |
+--------+---------------+--------------------+
|     6  |         null  |               true |
+--------+---------------+--------------------+

使用ignoreNulls=,真正的实现应该是这样的


-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = 
-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true)
-- expected output for nulls
+--------+---------------+--------------------+
|     5  |         null  |              false |
+--------+---------------+--------------------+
|     6  |         null  |              false |
+--------+---------------+--------------------+

首先,如果keys中只有一部分列为null,您应该正确定义逻辑-它应该算作null值还是只有当keys中的所有列都为null时才定义

空值为了简单起见,我们假设keys中只有一列(您可以轻松地将逻辑扩展到多列(。您只需在checkRepeatedKey函数中添加一个简单的if

def checkIfNullValue(keys: Seq[String]): Column = {
// for the sake of simplicity checking only the first key 
col(keys.head).isNull
}

def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
...
...
val df = dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")

if (ignoreNulls) 
df.withColumn(newColName, when(checkIfNullValue(keys), df(newColName)).otherwise(lit(false)) 
else df
}

最新更新