检查数据帧中的重复值并实现ignoreNulls参数

我创建了一个函数来检查数据帧中是否存在基于列序列的重复值。

我想实现一个"；ignoreNulls"；，作为布尔参数传递到函数

如果为true，将忽略而不对null值进行分组和计数。因此，对于null值；newColName"；将返回false
如果为false(默认值(，则将nulls值视为一个组，如果我正在检查的键有多个带nulls的值，则返回true

我不知道该怎么做。我应该使用if还是case？有一些表达式可以忽略partitionBy语句中的null？

有人能帮我吗？

这是的当前功能

def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
val repeatedCondition = $"sum" > 1
val windowCondition   = Window.partitionBy(keys.head, keysToCheck.tail: _*)
dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
}

一些测试数据

val testDF = Seq(
("1", Some("name-1")),
("2", Some("repeated-name")),
("3", Some("repeated-name")),
("4", Some("name-4")),
("5", None),
("6", None)
).toDF("name_key", "name")

测试功能

val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"))

输出(没有ignoreNulls实现(

+--------+---------------+--------------------+
|name_key|       name    |  has_repeated_name |
+--------+---------------+--------------------+
|     1  |      name-1   |              false |
+--------+---------------+--------------------+
|     2  | repeated-name |               true |
+--------+---------------+--------------------+
|     3  | repeated-name |               true |
+--------+---------------+--------------------+
|     4  |      name-4   |              false |
+--------+---------------+--------------------+
|     5  |         null  |               true |
+--------+---------------+--------------------+
|     6  |         null  |               true |
+--------+---------------+--------------------+

使用ignoreNulls=，真正的实现应该是这样的


-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = 
-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true)
-- expected output for nulls
+--------+---------------+--------------------+
|     5  |         null  |              false |
+--------+---------------+--------------------+
|     6  |         null  |              false |
+--------+---------------+--------------------+

首先，如果keys中只有一部分列为null，您应该正确定义逻辑-它应该算作null值还是只有当keys中的所有列都为null时才定义

空值为了简单起见，我们假设keys中只有一列(您可以轻松地将逻辑扩展到多列(。您只需在checkRepeatedKey函数中添加一个简单的if：

def checkIfNullValue(keys: Seq[String]): Column = { // for the sake of simplicity checking only the first key col(keys.head).isNull } def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = { ... ... val df = dataframe .withColumn("count", lit(1)) .withColumn("sum", sum("count").over(windowCondition)) .withColumn(newColName, repeatedCondition) .drop("count", "sum") if (ignoreNulls) df.withColumn(newColName, when(checkIfNullValue(keys), df(newColName)).otherwise(lit(false)) else df }

相关内容

最新更新

热门标签：