我有一个大型数据集,其中包含引用、资产、开始和结束日期。我想通过引用为每个资产分配一个密钥,从1开始,如果引用是相同的,并且开始和结束日期彼此相同,那么使用相同的密钥,这样我最终得到:
Asset Ref Start End Key
A23BCD 12345 01/01/1900 01/01/2020 1
A23BCD 12345 02/01/2020 17/06/2020 1
A23BCD 67890 01/09/2020 31/10/2020 2
A23BCD 77777 01/11/2020 31/12/9999 3
我在Hadoop中使用数据,并使用HiveQL来分配密钥,但这只检查前5行:
create table temp_user.a1
row format delimited fields terminated by ' 01'
stored as orc tblproperties("ORC.COMPRESS"="SNAPPY","ORC.COMPRESS.SIZE"="16384") as
select a.*
,LAG(ref) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref
,LAG(endDt) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt
,LAG(rowNum) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum
,LAG(ref,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_1
,LAG(endDt,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_1
,LAG(rowNum,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_1
,LAG(ref,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_2
,LAG(endDt,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_2
,LAG(rowNum,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_2
,LAG(ref,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_3
,LAG(endDt,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_3
,LAG(rowNum,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_3
,LAG(ref,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_4
,LAG(endDt,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_4
,LAG(rowNum,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_4
from temp_user.BigDataSet a;
然后,我可以将上一次引用和上一次结束日期与当前引用和当前开始日期进行比较。
有没有更好的方法来获取以前的记录,而不是多个LAG?
使用asset和ref以及结束日期=(开始日期-1(的位置将大数据集连接到自身会更好吗?然后我将如何分配密钥?
谢谢D
您应该能够在一个滞后的情况下使其工作,并使其适用于任何数量的先前记录。
我多年前为甲骨文写过这篇文章:
https://betteratoracle.com/posts/35-collapsing-continuous-ranges-into-single-rows
这在很大程度上满足了你的需求。下面的查询会给出一个想法,但我没有Hive来测试它。
select t.*,
max(t.grp) over (order by asset, ref, start asc)
from
(
select a.*,
case
when start - lag(end) over (partition by asset, ref order by start asc) < 1 then
null
else
rownumber // this is an oracle function - maybe Hive has something similar?
end grp
from table
order by asset, ref, start
) t