在pyspark中找到距离最小的中间值



我需要帮助我有这个数据框,列'b'有偶数个值

df1 = spark.createDataFrame([
('c',1),
('c',2),
('c',4),
('c',6),
('c',7),
('c',8),
], ['a', 'b'])
df1.show()
+---+---+
|  a|  b|
+---+---+
|  c|  1|
|  c|  2|
|  c|  4|
|  c|  6|
|  c|  7|
|  c|  8|

我想取middle[4,6]的中间值

计算这些数与相邻数之间的距离:dis[2,4]=2, dis[6,7]=1

取最小距离min_dis[6,7]=1

,最后显示min_dis[6,7]和mid[4,6]中的值

result = spark.createDataFrame([
('c',6),
], ['a', 'b'])
result.show()
+---+---+
|  a|  b|
+---+---+
|  c|  6|
+---+---+

有没有办法在pyspark中做到这一点?

试试这个:

假设df1如下:

df1 = spark.createDataFrame(
[
("c", 1),
("c", 2),
("c", 4),
("c", 6),
("c", 7),
("c", 8),
("d", 1),
("d", 2),
("d", 4),
("d", 6),
("d", 7),
("d", 8),
("d", 9),
("d", 10),
],
["a", "b"],
)

首先定义window函数

from pyspark.sql import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy("a")
.orderBy(F.asc("b"))
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

然后让我们看一步一步的部分:

df2 = df1.withColumn("list", F.collect_list("b").over(w)).
select("a", "list").distinct()
df2.show(2, False)
+---+-------------------------+
|a  |list                     |
+---+-------------------------+
|c  |[1, 2, 4, 6, 7, 8]       |
|d  |[1, 2, 4, 6, 7, 8, 9, 10]|
+---+-------------------------+

然后找到距离:

df3 = df2.withColumn("mid_ind", (F.size("list") / 2).cast("int")).
withColumn('lower_mid_diff', F.abs(F.col('list')[F.col('mid_ind')-1]-F.col('list')[F.col('mid_ind')-2])).
withColumn('upper_mid_diff', F.abs(F.col('list')[F.col('mid_ind')]-F.col('list')[F.col('mid_ind')+1]))
df3.show(2, False)
+---+-------------------------+-------+--------------+--------------+
|a  |list                     |mid_ind|lower_mid_diff|upper_mid_diff|
+---+-------------------------+-------+--------------+--------------+
|c  |[1, 2, 4, 6, 7, 8]       |3      |2             |1             |
|d  |[1, 2, 4, 6, 7, 8, 9, 10]|4      |2             |1             |
+---+-------------------------+-------+--------------+--------------+

和应用最终的when语句:

df3.withColumn('b', F.when(F.col('lower_mid_diff')<=F.col('upper_mid_diff'), F.col('list')[F.col('mid_ind')-1]).
otherwise(F.col('list')[F.col('mid_ind')])).select('a', 'b').show()
+---+---+
|  a|  b|
+---+---+
|  c|  6|
|  d|  7|
+---+---+

希望这对你有帮助。但也要注意,您可以将许多步骤组合起来,使其更短:)

相关内容

  • 没有找到相关文章

最新更新