I have a DataFrame in which one column holds an array whose elements are dictionaries.
Sample values of that column (one array per row):

[{"deleteDate": null, "class": "AB", "validFrom": "2022-09-01", "validTo": "2009-08-31"},
 {"deleteDate": null, "class": "CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"}]

[{"deleteDate": "2021-09-01", "class": "AB", "validFrom": "2003-09-01", "validTo": "2009-03-01"},
 {"deleteDate": null, "class": "CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"}]
For better performance (see: Spark functions vs UDF performance?), you can solve this with Spark transformations alone.
I assume that (value[i].validFrom >= (date of today))
should actually be (value[i].validTo >= (date of today)):
import pyspark.sql.functions as f

def getelement(df, value, entity):
    # Keep only the array elements that are not deleted and are valid today,
    # then take the `entity` field of the last matching element.
    df = (
        df
        .withColumn(
            'output',
            f.expr(
                f'filter({value}, element -> '
                '(element.deleteDate is null) '
                'AND (element.validFrom <= current_date()) '
                'AND (element.validTo >= current_date()))'
            )[entity][-1]
        )
    )
    return df
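For intuition, the filter predicate above can be sketched in plain Python. This is only an illustration of the condition Spark evaluates per array element; the sample dictionaries and the fixed "today" date are hypothetical:

```python
from datetime import date

def keep(element, today=date(2023, 6, 1)):
    # Mirror of the Spark lambda: not deleted, and today falls in [validFrom, validTo].
    return (
        element["deleteDate"] is None
        and date.fromisoformat(element["validFrom"]) <= today
        and date.fromisoformat(element["validTo"]) >= today
    )

elements = [
    {"deleteDate": None, "class": "CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"},
    {"deleteDate": "2021-09-01", "class": "AB", "validFrom": "2003-09-01", "validTo": "2009-03-01"},
]

matches = [e for e in elements if keep(e)]
# The Spark expression then reads the last match's field, e.g. matches[-1]["class"].
```

Here only the first element survives: the second is dropped because its deleteDate is set.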
You can use struct to bundle the arguments into a single object, and then access the struct's fields with the . operator.
Code example:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

def getelement(object):
    # Unpack the struct fields and build the result string.
    value = object.value
    entity = object.entity
    return str(entity + " " + value)

udfgeturl = f.udf(getelement, StringType())

df.select(
    udfgeturl(
        f.struct(
            f.col("col1").alias("value"),
            f.col("col2").alias("entity"))
    )
).show()
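Inside the UDF, the struct built by f.struct(...) arrives as a Row-like object with attribute access, which is why object.value and object.entity work. A minimal stand-in using a namedtuple (the field values below are made up for illustration) shows what getelement sees:

```python
from collections import namedtuple

# Stand-in for the Row object the UDF receives from f.struct(...):
# a lightweight record whose fields are read with the . operator.
Packed = namedtuple("Packed", ["value", "entity"])

def getelement(object):
    value = object.value
    entity = object.entity
    return str(entity + " " + value)

row = Packed(value="some-value", entity="some-entity")
result = getelement(row)  # "some-entity some-value"
```

The aliases given in f.struct (here "value" and "entity") determine the attribute names available inside the UDF.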