Sum of PySpark columns, ignoring NaN values



I have a PySpark dataframe in the following format:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|   1|   3|
|  2| NaN|   4|
|  3|   3|   5|
+---+----+----+

I want to sum col1 and col2 so that the result looks like this:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|  4|
|  3|   3|   5|  8|
+---+----+----+---+

Here is what I have tried:

import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()

This code returns:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|NaN| # <-- I want a 4 here, not this NaN
|  3|   3|   5|  8|
+---+----+----+---+

Can anyone help me with this?

Use F.nanvl to replace NaN with a given value (0 here):

import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
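
With the sample data above this is already enough, since only col1 can contain NaN; result.show() should then give 4.0 instead of NaN in the second row's sum (exact column formatting may differ slightly depending on how the dataframe was created):

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1| 1.0|   3|4.0|
|  2| NaN|   4|4.0|
|  3| 3.0|   5|8.0|
+---+----+----+---+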

Following up on the comment, to also handle the case where both columns are NaN:

result = test.withColumn(
    'sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
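
A minimal sketch to exercise the both-NaN branch, assuming a hypothetical both_nan dataframe built the same way as test (the dataframe name and values are only for illustration):

import pyspark.sql.functions as F

both_nan = spark.createDataFrame(
    [
        (1, float('nan'), float('nan')),  # both columns NaN -> sum stays NaN
        (2, 1., float('nan')),            # only col2 NaN -> treated as 0, sum is 1.0
    ],
    ('id', 'col1', 'col2'),
)
both_nan.withColumn(
    'sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
).show()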

Here is a more general solution in case you have more than two columns. Note that this targets NaN; null values are a different thing and behave differently.

import pyspark.sql.functions as F

def with_notnan_sum(df, cols=None):
    if cols is None:
        cols = df.columns
    # Treat NaN as 0 in the sum, and count how many non-NaN values each row has.
    df = df.withColumn("_sum", sum(F.when(F.isnan(c), 0).otherwise(df[c]) for c in cols))
    df = df.withColumn("_count", sum(F.when(F.isnan(c), 0).otherwise(1) for c in cols))
    # If every column is NaN, _count is 0 and the result stays null instead of 0.
    df = df.withColumn("notnan_sum", F.when(F.col("_count") > 0, F.col("_sum")))
    df = df.drop("_sum", "_count")
    return df
test = spark.createDataFrame(
    [
        (1, 1., 3),
        (2, float('nan'), 4),
        (3, 3., 5),
    ],
    ('id', 'col1', 'col2'),
)
test.show()
test = with_notnan_sum(test, ['col1', 'col2'])
test.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1| 1.0|   3|
|  2| NaN|   4|
|  3| 3.0|   5|
+---+----+----+
+---+----+----+----------+
| id|col1|col2|notnan_sum|
+---+----+----+----------+
|  1| 1.0|   3|       4.0|
|  2| NaN|   4|       4.0|
|  3| 3.0|   5|       8.0|
+---+----+----+----------+
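
Since the note above only covers NaN, here is a sketch of the same idea adapted for SQL nulls (None), assuming the same column layout; with_notnull_sum is a hypothetical name, not part of the answer above:

import pyspark.sql.functions as F

def with_notnull_sum(df, cols=None):
    # Same structure as with_notnan_sum, but guards against SQL nulls instead of NaN.
    if cols is None:
        cols = df.columns
    df = df.withColumn("_sum", sum(F.when(F.col(c).isNull(), 0).otherwise(df[c]) for c in cols))
    df = df.withColumn("_count", sum(F.when(F.col(c).isNull(), 0).otherwise(1) for c in cols))
    # Keep the result null when every column in the row was null, mirroring notnan_sum.
    df = df.withColumn("notnull_sum", F.when(F.col("_count") > 0, F.col("_sum")))
    return df.drop("_sum", "_count")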
