I have the following DataFrame:
+---+---------+
| ID|    Title|
+---+---------+
|  1|[2, test]|
|  3|     [4,]|
+---+---------+
It is created with the code below:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import col, expr
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, [2, 'test']), (3, [4, None])]
schema = (StructType([
    StructField("ID", IntegerType(), False),
    StructField("Title", StructType([
        StructField("TitleID", IntegerType(), False),
        StructField("Type", StringType(), True),
    ]), False)
]))
df = spark.createDataFrame(data, schema)
Now I am trying to replace null values of Title.Type with a default value. I tried fillna, but it has no effect:
default_type = 'type one'
df = df.fillna({'Title.Type':default_type})
I also tried using expr:
df = df.withColumn('Title', expr('struct(Title.TitleID, Title.Type if Title.Type.isNotNull() else default_type'))
But now I get a ParseException:
ParseException:
extraneous input 'Title' expecting {')', ','}(line 1, pos 36)
== SQL ==
struct(Title.TitleID, Title.Type if Title.Type.isNotNull() else default_type
------------------------------------^^^
What am I doing wrong here?
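You are mixing up Spark SQL expression syntax with Python expression syntax. The string passed to expr is parsed as Spark SQL, so Python's "x if cond else y" and .isNotNull() are not valid inside it, and the Python variable default_type has to be interpolated into the string as a quoted literal: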
import pyspark.sql.functions as F
df = df.withColumn(
    'Title',
    F.expr(f"struct(Title.TitleID as TitleID, case when Title.Type is not null then Title.Type else '{default_type}' end as Type)")
)