PySpark - fill null values in a Struct column



I have the following dataframe:

+---+---------+
| ID|    Title|
+---+---------+
|  1|[2, test]|
|  3|     [4,]|
+---+---------+

which was created with the code below:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, expr
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, [2, 'test']), (3, [4, None])]
schema = StructType([
    StructField("ID", IntegerType(), False),
    StructField("Title", StructType([
        StructField("TitleID", IntegerType(), False),
        StructField("Type", StringType(), True),
    ]), False),
])
df = spark.createDataFrame(data, schema)

Now I am trying to replace null Title Types with a default value. I tried it with fillna, but it has no effect:

default_type = 'type one'
df = df.fillna({'Title.Type':default_type})

I also tried using expr:

df = df.withColumn('Title', expr('struct(Title.TitleID, Title.Type if Title.Type.isNotNull() else default_type'))

But now I get a ParseException:

ParseException: 
extraneous input 'Title' expecting {')', ','}(line 1, pos 36)
== SQL ==
struct(Title.TitleID, Title.Type if Title.Type.isNotNull() else default_type
------------------------------------^^^

What am I doing wrong here?

You are mixing up Spark SQL expressions and Python expressions: the string passed to `expr` must be valid Spark SQL, so Python's `if/else` and the DataFrame method `.isNotNull()` cannot appear inside it. Use a SQL `CASE WHEN` instead:

import pyspark.sql.functions as F

df = df.withColumn(
    'Title',
    F.expr(f"struct(Title.TitleID as TitleID, case when Title.Type is not null then Title.Type else '{default_type}' end as Type)")
)
