How to extract an array column in PySpark by selecting one field of an array-of-structs column



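I have a DataFrame df that contains an array-of-structs column, properties (each element of the array is a struct with the fields x and y). I want to create a new array column by extracting the x values from properties.

A sample input DataFrame looks like this: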

import pyspark.sql.functions as F
from pyspark.sql.types import *
data = [
    (1, [{'x':11, 'y':'str1a'}, ]),
    (2, [{'x':21, 'y':'str2a'}, {'x':22, 'y':0.22, 'z':'str2b'}, ]),
]
my_schema = StructType([
    StructField('id', LongType()),
    StructField('properties', ArrayType(
        StructType([
            StructField('x', LongType()),
            StructField('y', StringType()),
        ])
    )),
])
df = spark.createDataFrame(data, schema=my_schema)
df.show()
# +---+--------------------+
# | id|          properties|
# +---+--------------------+
# |  1|       [[11, str1a]]|
# |  2|[[21, str2a], [22...|
# +---+--------------------+
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)

The desired output, df_new, should look like this:

df_new.show()
# +---+--------------------+--------+
# | id|          properties|x_values|
# +---+--------------------+--------+
# |  1|       [[11, str1a]]|    [11]|
# |  2|[[21, str2a], [22...|[21, 22]|
# +---+--------------------+--------+
df_new.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)
#  |-- x_values: array (nullable = true)
#  |    |-- element: long (containsNull = true)

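Does anyone know a solution for this task?

Ideally, I am looking for a solution that operates row by row without relying on F.explode. In my actual data I have not identified a column equivalent to the id column, so after calling F.explode I am not sure how I would merge the exploded values back together.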

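Try selecting properties.x. Accessing a struct field on an array-of-structs column returns an array containing that field's value for every element.

Example: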

from pyspark.sql.functions import col, expr

df.withColumn("x_values", col("properties.x")).show(10, False)
# or by using higher-order functions (the SQL transform function needs Spark 2.4+)
df.withColumn("x_values", expr("transform(properties, p -> p.x)")).show(10, False)
#+---+-------------------------+--------+
#|id |properties               |x_values|
#+---+-------------------------+--------+
#|1  |[[11, str1a]]            |[11]    |
#|2  |[[21, str2a], [22, 0.22]]|[21, 22]|
#+---+-------------------------+--------+
