我创建了一个函数来检查数据帧中是否存在基于列序列的重复值。
我想实现一个";ignoreNulls";,作为布尔参数传递到函数
- 如果为true,将忽略而不对null值进行分组和计数。因此,对于null值;newColName";将返回false
- 如果为false(默认值(,则将nulls值视为一个组,如果我正在检查的键有多个带nulls的值,则返回true
我不知道该怎么做。我应该使用if
还是case
?有一些表达式可以忽略partitionBy语句中的null?
有人能帮我吗?
这是的当前功能
def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
val repeatedCondition = $"sum" > 1
val windowCondition = Window.partitionBy(keys.head, keysToCheck.tail: _*)
dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
}
一些测试数据
val testDF = Seq(
("1", Some("name-1")),
("2", Some("repeated-name")),
("3", Some("repeated-name")),
("4", Some("name-4")),
("5", None),
("6", None)
).toDF("name_key", "name")
测试功能
val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"))
输出(没有ignoreNulls实现(
+--------+---------------+--------------------+
|name_key| name | has_repeated_name |
+--------+---------------+--------------------+
| 1 | name-1 | false |
+--------+---------------+--------------------+
| 2 | repeated-name | true |
+--------+---------------+--------------------+
| 3 | repeated-name | true |
+--------+---------------+--------------------+
| 4 | name-4 | false |
+--------+---------------+--------------------+
| 5 | null | true |
+--------+---------------+--------------------+
| 6 | null | true |
+--------+---------------+--------------------+
使用ignoreNulls=,真正的实现应该是这样的
-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame =
-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true)
-- expected output for nulls
+--------+---------------+--------------------+
| 5 | null | false |
+--------+---------------+--------------------+
| 6 | null | false |
+--------+---------------+--------------------+
首先,如果 空值为了简单起见,我们假设keys
中只有一部分列为null,您应该正确定义逻辑-它应该算作null值还是只有当keys
中的所有列都为null时才定义keys
中只有一列(您可以轻松地将逻辑扩展到多列(。您只需在checkRepeatedKey
函数中添加一个简单的if
:def checkIfNullValue(keys: Seq[String]): Column = {
// for the sake of simplicity checking only the first key
col(keys.head).isNull
}
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
...
...
val df = dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
if (ignoreNulls)
df.withColumn(newColName, when(checkIfNullValue(keys), df(newColName)).otherwise(lit(false))
else df
}