Data arrangement using window functions in Hive or Spark Scala



I have to arrange the data.

Input:

ID |VALUE
1|a
2|null
3|null
4|b
5|null
6|null
7|c

The output is needed using Hive or DataFrames.

Output:

ID|Value
1|a
2|b
3|b
4|b
5|c
6|c
7|c

In Spark, use `first(expr[, isIgnoreNull=true])` over a window ordered by the `monotonically_increasing_id()` function, with rows between `currentRow` and `unboundedFollowing`.

Example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.show()
//+---+-----+
//| ID|VALUE|
//+---+-----+
//|  1|    a|
//|  2| null|
//|  3| null|
//|  4|    b|
//|  5| null|
//|  6| null|
//|  7|    c|
//+---+-----+
//if ID is sequentially increasing, order the window by it:
val w = Window.orderBy("ID").rowsBetween(0, Window.unboundedFollowing)
//otherwise, order by the original row order instead:
//val w = Window.orderBy(monotonically_increasing_id()).rowsBetween(0, Window.unboundedFollowing)
df.withColumn("VALUE", first("value",true).over(w)).show()
//+---+-----+
//| ID|VALUE|
//+---+-----+
//|  1|    a|
//|  2|    b|
//|  3|    b|
//|  4|    b|
//|  5|    c|
//|  6|    c|
//|  7|    c|
//+---+-----+
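To see what `first(..., ignoreNulls = true)` over a current-row-to-unbounded-following window computes, here is a minimal plain-Scala sketch of the same backward fill on a `Seq` (no Spark required; the data is the example input above):

```scala
// Each row's VALUE, with nulls modeled as None
val values: Seq[Option[String]] =
  Seq(Some("a"), None, None, Some("b"), None, None, Some("c"))

// scanRight carries the next non-null value backwards through the nulls,
// mirroring "first non-null value from the current row to the end".
// .init drops the trailing seed element that scanRight appends.
val filled: Seq[Option[String]] =
  values.scanRight(Option.empty[String])((v, next) => v.orElse(next)).init

// filled == Seq(Some("a"), Some("b"), Some("b"), Some("b"),
//               Some("c"), Some("c"), Some("c"))
```

This is only an illustration of the window's semantics; on a real DataFrame the window expression above does the same per-row lookup in a distributed way.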

Hive solution:

with mytable as (
select stack(7,
1,'a'  ,
2,null ,
3,null ,
4,'b'  ,
5,null ,
6,null ,
7,'c'
) as (id, value)
)
SELECT id, 
first_value(value,true) over(order by id rows between current row and unbounded following) value
FROM mytable;

Result:

id  value
1   a
2   b
3   b
4   b
5   c
6   c
7   c
