array_contains() on an array-of-structs column raises "Column is not iterable"



I have two columns with the following schema:

root
|-- parent_column: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- item_1: integer (nullable = true)
|    |    |-- item_2: long (nullable = true)
|    |    |-- item_3: integer (nullable = true)
|    |    |-- item_4: boolean (nullable = true)
|-- child_column: struct (nullable = false)
|    |-- item_1: integer (nullable = true)
|    |-- item_2: long (nullable = true)
|    |-- item_3: integer (nullable = true)
|    |-- item_4: boolean (nullable = false)

I want to check whether child_column is present in parent_column by calling array_contains(F.col('parent_column'), F.col('child_column')), but I get a "Column is not iterable" error.

Sample data:

+----------------------------------------------+--------------------------------------------+--------------+
|parent_column                                 | child_column                               | data_check   |
+----------------------------------------------+--------------------------------------------+--------------+
|[[1, 2, 3, 4, false]]                         | [1, 2, 3, 4, false]                        |    true      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
+----------------------------------------------+--------------------------------------------+--------------+
Runnable code example:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [([(2, 2, 2,)],)],
    'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
)
df = df.withColumn(
    'child_column',
    F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
)
# WITH ERRORS
# df = df.withColumn(
#     'contains',
#     F.array_contains(F.col('parent_column'), F.col('child_column'))
# )
df.show(truncate=False)

As I understand it, I am checking whether a struct exists in an array of structs, so I am not sure why I get this error. Any suggestions?

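It looks like your sample data is off: in your runnable example, child_column is an array of structs rather than a single struct. I fixed that below (see the child_column definition, which takes element [0]) and passed both columns to array_contains through a SQL expression. I am not sure whether this is the same problem you hit in your original query.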

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(
...     [([(2, 2, 2,)],)],
...     'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
... )
>>> 
>>> df = df.withColumn(
...     'child_column',
...     F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
... )
>>> df = df.withColumn(
...     'child_column',
...     F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")[0])
>>> df.withColumn('contains', F.expr("array_contains(parent_column, child_column)")).show()
+-------------+------------+--------+
|parent_column|child_column|contains|
+-------------+------------+--------+
|  [[2, 2, 2]]|   [2, 2, 2]|    true|
+-------------+------------+--------+
