I have a PySpark dataframe in the following format:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I want to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here is what I have tried:
import pandas as pd
import pyspark.sql.functions as F
test = pd.DataFrame({
'id': [1, 2, 3],
'col1': [1, None, 3],
'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
Can anyone help me?
Use F.nanvl to replace NaN with a given value (0 here):
import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
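The NaN in the question's output is standard IEEE-754 behavior: any addition involving NaN yields NaN, and Spark's column `+` follows the same rule, so `F.nanvl` works by substituting 0 before the addition happens. A minimal pure-Python sketch of that behavior (the `nanvl` helper below just illustrates the semantics; it is not the Spark function):

```python
import math

def nanvl(x, default):
    # Mimics the idea of Spark's F.nanvl: replace NaN with a default value
    return default if math.isnan(x) else x

nan = float('nan')
print(math.isnan(nan + 4))   # True: NaN propagates through addition
print(nanvl(nan, 0) + 4)     # 4: the guard substitutes 0 first
print(nanvl(3.0, 0) + 5)     # 8.0: non-NaN values pass through unchanged
```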
Per the comments, to return NaN when both columns are NaN:
result = test.withColumn('sum',
F.when(
F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
F.lit(float('nan'))
).otherwise(
F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
)
)
If you have more than two columns, here is a more general solution. Note that this handles NaN; null values are treated differently.
import pyspark.sql.functions as F
def with_notnan_sum(df, cols=None):
    if cols is None:
        cols = df.columns
    # Row-wise sum, treating NaN as 0
    df = df.withColumn("_sum", sum(F.when(F.isnan(c), 0).otherwise(df[c]) for c in cols))
    # Row-wise count of non-NaN values
    df = df.withColumn("_count", sum(F.when(F.isnan(c), 0).otherwise(1) for c in cols))
    # Keep the sum only if at least one value was not NaN; otherwise null
    df = df.withColumn("notnan_sum", F.when(F.col("_count") > 0, F.col("_sum")))
    df = df.drop("_sum", "_count")
    return df
test = spark.createDataFrame(
[
(1, 1., 3),
(2, float('nan'), 4),
(3, 3., 5),
],
('id', 'col1', 'col2'),
)
test.show()
test = with_notnan_sum(test, ['col1', 'col2'])
test.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1.0| 3|
| 2| NaN| 4|
| 3| 3.0| 5|
+---+----+----+
+---+----+----+----------+
| id|col1|col2|notnan_sum|
+---+----+----+----------+
| 1| 1.0| 3| 4.0|
| 2| NaN| 4| 4.0|
| 3| 3.0| 5| 8.0|
+---+----+----+----------+
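The NaN-vs-null distinction above matters in practice: `F.isnan` and `F.nanvl` only catch floating-point NaN, while Spark nulls (Python `None`) need `F.coalesce` or `isNull` instead. A pure-Python sketch of the coalesce idea (the helper below illustrates the semantics, not the Spark API):

```python
def coalesce(*values):
    # Mimics the idea of Spark's F.coalesce: first non-null argument, else None
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, 0) + 4)     # 4: null is replaced by 0
print(coalesce(3, 0) + 5)        # 8: non-null values pass through
print(coalesce(float('nan'), 0)) # nan: NaN is not null, so it slips past coalesce
```

The last line is exactly why the answer warns about nulls: a NaN passes through a null check unchanged, so use `F.nanvl` for NaN and `F.coalesce` for nulls (or both, if a column can contain either).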