我需要帮助我有这个数据框,列'b'有偶数个值
df1 = spark.createDataFrame([
('c',1),
('c',2),
('c',4),
('c',6),
('c',7),
('c',8),
], ['a', 'b'])
df1.show()
+---+---+
| a| b|
+---+---+
| c| 1|
| c| 2|
| c| 4|
| c| 6|
| c| 7|
| c| 8|
我想取middle[4,6]的中间值
计算这些数与相邻数之间的距离:dis[2,4]=2, dis[6,7]=1
取最小距离min_dis[6,7]=1
,最后显示min_dis[6,7]和mid[4,6]中的值
result = spark.createDataFrame([
('c',6),
], ['a', 'b'])
result.show()
+---+---+
| a| b|
+---+---+
| c| 6|
+---+---+
有没有办法在pyspark中做到这一点?
试试这个:
假设df1
如下:
df1 = spark.createDataFrame(
[
("c", 1),
("c", 2),
("c", 4),
("c", 6),
("c", 7),
("c", 8),
("d", 1),
("d", 2),
("d", 4),
("d", 6),
("d", 7),
("d", 8),
("d", 9),
("d", 10),
],
["a", "b"],
)
首先定义window
函数
from pyspark.sql import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy("a")
.orderBy(F.asc("b"))
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
然后让我们看一步一步的部分:
df2 = df1.withColumn("list", F.collect_list("b").over(w)).
select("a", "list").distinct()
df2.show(2, False)
+---+-------------------------+
|a |list |
+---+-------------------------+
|c |[1, 2, 4, 6, 7, 8] |
|d |[1, 2, 4, 6, 7, 8, 9, 10]|
+---+-------------------------+
然后找到距离:
df3 = df2.withColumn("mid_ind", (F.size("list") / 2).cast("int")).
withColumn('lower_mid_diff', F.abs(F.col('list')[F.col('mid_ind')-1]-F.col('list')[F.col('mid_ind')-2])).
withColumn('upper_mid_diff', F.abs(F.col('list')[F.col('mid_ind')]-F.col('list')[F.col('mid_ind')+1]))
df3.show(2, False)
+---+-------------------------+-------+--------------+--------------+
|a |list |mid_ind|lower_mid_diff|upper_mid_diff|
+---+-------------------------+-------+--------------+--------------+
|c |[1, 2, 4, 6, 7, 8] |3 |2 |1 |
|d |[1, 2, 4, 6, 7, 8, 9, 10]|4 |2 |1 |
+---+-------------------------+-------+--------------+--------------+
和应用最终的when
语句:
df3.withColumn('b', F.when(F.col('lower_mid_diff')<=F.col('upper_mid_diff'), F.col('list')[F.col('mid_ind')-1]).
otherwise(F.col('list')[F.col('mid_ind')])).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| c| 6|
| d| 7|
+---+---+
希望这对你有帮助。但也要注意,您可以将许多步骤组合起来,使其更短:)