I have two columns with the following schema:
root
|-- parent_column: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- item_1: integer (nullable = true)
| | |-- item_2: long (nullable = true)
| | |-- item_3: integer (nullable = true)
| | |-- item_4: boolean (nullable = true)
|-- child_column: struct (nullable = false)
| |-- item_1: integer (nullable = true)
| |-- item_2: long (nullable = true)
| |-- item_3: integer (nullable = true)
| |-- item_4: boolean (nullable = false)
I want to check whether child_column exists in parent_column by running array_contains(F.col('parent_column'), F.col('child_column')), but I get a "Column is not iterable" error.
Sample data:
+----------------------------------------------+--------------------------------------------+--------------+
|parent_column | child_column | data_check |
+----------------------------------------------+--------------------------------------------+--------------+
|[[1, 2, 3, 4, false]] | [1, 2, 3, 4, false] | true |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
+----------------------------------------------+--------------------------------------------+--------------+
Runnable code example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[([(2, 2, 2,)],)],
'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
)
df = df.withColumn(
'child_column',
F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
)
# WITH ERRORS
# df = df.withColumn(
# 'contains',
# F.array_contains(F.col('parent_column'), F.col('child_column'))
# )
df.show(truncate=False)
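In my mind, I am checking whether a struct exists in an array of structs, so I am not sure why I get this error. Any suggestions?
Your sample data looks off, so I fixed it; see the child_column definition. In your snippet, transform returns an array, so child_column was an array of structs rather than a single struct. Below I take element [0] to get one struct. I am not sure whether this was the problem with your original query.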
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(
... [([(2, 2, 2,)],)],
... 'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
... )
>>>
>>> df = df.withColumn(
... 'child_column',
... F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
... )
>>> df = df.withColumn(
... 'child_column',
... F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")[0])
>>> df.withColumn('contains', F.expr("array_contains(parent_column, child_column)")).show()
+-------------+------------+--------+
|parent_column|child_column|contains|
+-------------+------------+--------+
| [[2, 2, 2]]| [2, 2, 2]| true|
+-------------+------------+--------+