udf declaration for 2 lists and/or 1 2D array

I would like to declare a udf that returns either 2 1D arrays or 1 2D array (examples of both would be great). I know that this works for 1D:

@udf("array<int>")

However, I have tried many variations, such as the following, without success:

@udf("array<int>,array<int>")
@udf("array<int>","array<int>")
@udf("array<int,int>")
etc.

To return two lists, you can use a struct:

@udf("struct<_1: array<int>, _2: array<int>>")

or

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructField, StructType, IntegerType
@udf(StructType([
    StructField("_1", ArrayType(IntegerType())),
    StructField("_2", ArrayType(IntegerType()))]))

where the function should return (in PEP 484 typing notation)

Tuple[List[int], List[int]]

i.e.

return [1, 2, 3], [4, 5, 6]
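
For a fuller picture, here is a minimal sketch of the struct approach. The names split_even_odd and make_split_udf, and the column name "values" in the usage comment, are hypothetical and chosen only for illustration. The Spark import is deferred into a helper so that the plain Python logic can be checked without a Spark installation.

```python
from typing import List, Tuple


def split_even_odd(values: List[int]) -> Tuple[List[int], List[int]]:
    """Plain Python logic: return two lists packed in one tuple."""
    evens = [v for v in values if v % 2 == 0]
    odds = [v for v in values if v % 2 != 0]
    return evens, odds


def make_split_udf():
    """Wrap split_even_odd as a UDF that returns a struct of two arrays."""
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.sql.functions import udf

    return udf(split_even_odd, "struct<_1: array<int>, _2: array<int>>")


# Usage, assuming a DataFrame `df` with an array<int> column "values":
#   split = make_split_udf()
#   out = df.withColumn("parts", split("values"))
#   out.select("parts._1", "parts._2").show()
```

The two lists come back as one struct column, and each list can then be selected by its field name.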

To return a 2D array, declare:

@udf("array<array<int>>")

or

@udf(ArrayType(ArrayType(IntegerType())))

where the function should return

List[List[int]]

i.e.

return [[1, 2, 3], [4, 5, 6]]
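
The nested-array declaration could be used along these lines. This is only a sketch: chunk_pairs and make_chunk_udf are hypothetical names, and the chunking logic is just a stand-in for whatever produces your rows.

```python
from typing import List


def chunk_pairs(values: List[int]) -> List[List[int]]:
    """Split a flat list into rows of at most two elements."""
    return [values[i:i + 2] for i in range(0, len(values), 2)]


def make_chunk_udf():
    """Wrap chunk_pairs as a UDF that returns array<array<int>>."""
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    return udf(chunk_pairs, ArrayType(ArrayType(IntegerType())))


# Usage, assuming a DataFrame `df` with an array<int> column "values":
#   chunk = make_chunk_udf()
#   df.withColumn("rows", chunk("values")).show(truncate=False)
```

Note that the inner lists do not have to be the same length; the last row here is shorter when the input has an odd number of elements.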

If you return an array of fixed-size tuples

List[Tuple[int, int]]

i.e.

return [(1, 2), (3, 4), (5, 6)]

the schema should be

@udf("array<struct<_1: int, _2: int>>")

or

@udf(ArrayType(StructType([
    StructField("_1", IntegerType()),
    StructField("_2", IntegerType())])))

array<array<int>>, while not canonical, should also work in this case.
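
A short sketch of the array-of-structs case, again with hypothetical names (index_pairs, make_index_pairs_udf) and a hypothetical input column "values":

```python
from typing import List, Tuple


def index_pairs(values: List[int]) -> List[Tuple[int, int]]:
    """Pair each element with its position: [(0, v0), (1, v1), ...]."""
    return list(enumerate(values))


def make_index_pairs_udf():
    """Wrap index_pairs as a UDF that returns array<struct<_1, _2>>."""
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.sql.functions import udf

    return udf(index_pairs, "array<struct<_1: int, _2: int>>")


# Usage, assuming a DataFrame `df` with an array<int> column "values":
#   pairs = make_index_pairs_udf()
#   df.withColumn("pairs", pairs("values")).show(truncate=False)
```

Each tuple is mapped onto the struct fields by position, so the first element lands in _1 and the second in _2.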

Note:

The field names used above (_1 and _2) are arbitrary and can be adjusted to suit your requirements.
