Sum of PySpark columns, ignoring NaN values



I have a PySpark dataframe in the following format:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|   1|   3|
|  2| NaN|   4|
|  3|   3|   5|
+---+----+----+

I want to sum col1 and col2 so that the result looks like this:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|  4|
|  3|   3|   5|  8|
+---+----+----+---+

Here is what I have tried:

import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()

This code returns:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|NaN| # <-- I want a 4 here, not this NaN
|  3|   3|   5|  8|
+---+----+----+---+

Can anyone help me with this?

Use F.nanvl to replace NaN with a given value (0 here):

import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
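
With the sample data above this is already enough, since only col1 can contain NaN; result.show() should then give 4.0 instead of NaN in the second row's sum (exact column formatting may differ slightly depending on how the dataframe was created):

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1| 1.0|   3|4.0|
|  2| NaN|   4|4.0|
|  3| 3.0|   5|8.0|
+---+----+----+---+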

Following up on the comment, to also handle the case where both columns are NaN:

result = test.withColumn(
    'sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
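
A minimal sketch to exercise the both-NaN branch, assuming a hypothetical both_nan dataframe built the same way as test (the dataframe name and values are only for illustration):

import pyspark.sql.functions as F

both_nan = spark.createDataFrame(
    [
        (1, float('nan'), float('nan')),  # both columns NaN -> sum stays NaN
        (2, 1., float('nan')),            # only col2 NaN -> treated as 0, sum is 1.0
    ],
    ('id', 'col1', 'col2'),
)
both_nan.withColumn(
    'sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
).show()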

Here is a more general solution in case you have more than two columns. Note that this targets NaN; null values are a different thing and behave differently.

import pyspark.sql.functions as F

def with_notnan_sum(df, cols=None):
    if cols is None:
        cols = df.columns
    # Treat NaN as 0 in the sum, and count how many non-NaN values each row has.
    df = df.withColumn("_sum", sum(F.when(F.isnan(c), 0).otherwise(df[c]) for c in cols))
    df = df.withColumn("_count", sum(F.when(F.isnan(c), 0).otherwise(1) for c in cols))
    # If every column is NaN, _count is 0 and the result stays null instead of 0.
    df = df.withColumn("notnan_sum", F.when(F.col("_count") > 0, F.col("_sum")))
    df = df.drop("_sum", "_count")
    return df
test = spark.createDataFrame(
    [
        (1, 1., 3),
        (2, float('nan'), 4),
        (3, 3., 5),
    ],
    ('id', 'col1', 'col2'),
)
test.show()
test = with_notnan_sum(test, ['col1', 'col2'])
test.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1| 1.0|   3|
|  2| NaN|   4|
|  3| 3.0|   5|
+---+----+----+
+---+----+----+----------+
| id|col1|col2|notnan_sum|
+---+----+----+----------+
|  1| 1.0|   3|       4.0|
|  2| NaN|   4|       4.0|
|  3| 3.0|   5|       8.0|
+---+----+----+----------+
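
Since the note above only covers NaN, here is a sketch of the same idea adapted for SQL nulls (None), assuming the same column layout; with_notnull_sum is a hypothetical name, not part of the answer above:

import pyspark.sql.functions as F

def with_notnull_sum(df, cols=None):
    # Same structure as with_notnan_sum, but guards against SQL nulls instead of NaN.
    if cols is None:
        cols = df.columns
    df = df.withColumn("_sum", sum(F.when(F.col(c).isNull(), 0).otherwise(df[c]) for c in cols))
    df = df.withColumn("_count", sum(F.when(F.col(c).isNull(), 0).otherwise(1) for c in cols))
    # Keep the result null when every column in the row was null, mirroring notnan_sum.
    df = df.withColumn("notnull_sum", F.when(F.col("_count") > 0, F.col("_sum")))
    return df.drop("_sum", "_count")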
