创建从列表中获得上下边框的列



我有一个DataFrame和一个边框列表:

test = spark.createDataFrame(
[
(1,),
(2,),
(234,),
(0,),
(6,),
(7,),
(35,),
(46,),
(8,),
],  
"Population int",
)
border_list = [0, 1.5, 7, 41, 235]

我想为DataFrame ("LowerBorder", "UpperBorder")添加两个新列,用于"Population"列。

当我尝试用python列表和函数来做时,它工作了:

lower = lambda x: max([i for i in border_list if x >= i])
upper = lambda x: min([i for i in border_list if x < i])
list_value = [1, 2, 234, 0, 6, 7, 35, 46, 8]
for i in list_value:
print(lower(i), upper(I))
# Output:
# low  high
0    1.5
1.5  7
41   235
0    1.5
1.5  7
7    41
7    41
41   235
7    41

但是当我尝试将其转换为与列一起工作时,它没有:

from pyspark.sql.types import FloatType
lower_border = F.udf(lambda x: max([i for i in border_list if x >= i]), FloatType())
upper_border = F.udf(lambda x: min([i for i in border_list if x < i]), FloatType())
test.withColumn("LowBorder", lower_border("Population")) 
.withColumn("UpBorder", upper_border("Population"))
display(test) # no changes in test Dataframe

如果我尝试通过select添加列,它也不像预期的那样工作:

display(test.select(lower_border("Population").alias('low'), upper_border("Population").alias('high')))
# Output:
low    high
-----------
null    1.5
1.5     null
null    null
null    1.5
1.5     null
null    null
null    null
null    null
null    null

测试DataFrame的预期输出是:

Population | LowBorder | UpBorder
---------------------------------
1            0           1.5
2            1.5         7
234          41          235
0            0           1.5
6            1.5         7
7            7           41
35           7           41
46           41          235
8            7           41

您可以从您的border_list创建一个数组,然后filter它并选择最小或最大

from pyspark.sql import functions as F
test = spark.createDataFrame([(1,), (2,), (234,), (0,), (6,), (7,), (35,), (46,), (8,)], "Population int")
border_list = [0, 1.5, 7, 41, 235]
arr = F.array_sort(F.array([F.lit(x) for x in border_list]))
test = test.select(
'Population',
F.element_at(F.filter(arr, lambda x: x <= F.col('Population')), -1).alias('LowBorder'),
F.element_at(F.filter(arr, lambda x: x > F.col('Population')), 1).alias('UpBorder'),
)
test.show(truncate=0)
# +----------+---------+--------+
# |Population|LowBorder|UpBorder|
# +----------+---------+--------+
# |1         |0.0      |1.5     |
# |2         |1.5      |7.0     |
# |234       |41.0     |235.0   |
# |0         |0.0      |1.5     |
# |6         |1.5      |7.0     |
# |7         |7.0      |41.0    |
# |35        |7.0      |41.0    |
# |46        |41.0     |235.0   |
# |8         |7.0      |41.0    |
# +----------+---------+--------+

最新更新