根据条件确定hive中的第二个最大值



我有一个表,其中的行看起来像这样,其中一列根据票证id和时间戳desc对所有行进行排序。

所有行只能有一个标志等于1

ticketID  |  flag 1  | flag 2 | flag 3 | flag 4 | Timestamp  |  Rank    |  stringvalue  |  
----------------------------------------------------------------------------------------|
1      |    0     |    0   |    1   |    0   |  xxxxxx    |    2     |   aaaaaa      |
1      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   bbbbbb      |
1      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   aaaaaa      |
2      |    1     |    0   |    0   |    0   |  xxxxxx    |    2     |   bbbbbb      |
2      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   xxxxxx      |
3      |    0     |    0   |    1   |    0   |  xxxxxx    |    4     |   aaaaaa      |
3      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   bbbbbb      |
3      |    1     |    0   |    0   |    0   |  xxxxxx    |    1     |   ssssss      |
3      |    0     |    0   |    0   |    1   |  xxxxxx    |    2     |   nnnnnn      |
4      |    0     |    1   |    0   |    0   |  xxxxxx    |    2     |   gggggg      |
4      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   iiiiii      |

对于每个ticketID,我需要根据排名获得第一行,但对于特定标志有一个例外:

当一张票的第1位是标志为4 = 1的一行时,我需要将第2位作为第1位。如果票证的第二个等级是标志3 = 1,那么我需要从第一个等级(标志= 4)与第二个等级(标志= 3)连接stringvalue。

如果第二个秩是flag = 1或flag = 2,则忽略第一个秩,将第二个秩作为第一个返回。

我希望我的问题很清楚。

感谢编辑

示例输出:

----------------------------------------------------------------------------------------
ticketID  |  flag 1  | flag 2 | flag 3 | Timestamp  |  Rank    |  stringvalue          |  
---------------------------------------------------------------------------------------|
1      |    0     |    0   |    1   |  xxxxxx    |    1     |   aaaaaa / bbbbbbb    |
2      |    1     |    0   |    0   |  xxxxxx    |    1     |        bbbbbb         |
3      |    1     |    0   |    0   |  xxxxxx    |    1     |        ssssss         |
4      |    0     |    1   |    0   |  xxxxxx    |    1     |        gggggg         |
---------------------------------------------------------------------------------------- 

我将使用结构组by的一些子查询。这将允许我们在不使用窗口的情况下询问有关多行的问题。可能会执行得更快,因为我们不需要维护窗口状态。

create table theRanks (ticketID int, flag_1 int, flag_2 int, flag_3 int, flag_4 int, Timestamp string, Rank int, stringvalue string)
-- create some dummy data
insert into theRanks values ( 1 , 0, 0, 1, 0, 'xxxxxx', 2, 'aaaaaa')
insert into theRanks values ( 1 , 0, 0, 0, 1, 'xxxxxx', 1, 'bbbbbb')
insert into theRanks values ( 1 , 0, 1, 0, 0, 'xxxxxx', 3, 'aaaaaa')
with stuct_table as -- sub-query syntax
( 
select 
ticketID, 
struct( -- struct will allow us to group rows together.
Rank as rawRank, -- this has to be first in strut as we use it for sorting
flag_1 , 
flag_2, 
flag_3, 
flag_4 , 
Timestamp , 
stringvalue 
) as myRow 
from 
theRanks 
where 
rank in (1,2) -- only look at first two ranks
), 
constants as -- subquery
( 
select 0 as rank1, 1 as rank2 -- strictly not needed just to help make it more readable 
), 
grouped_rows as --subquery
(
select 
ticketID, 
array_sort(collect_list(myRow)) as row_list  -- will sort on rank all structs into a list
from stuct_table 
group by ticketID
) , 
raw_rows as (select --sub-query styntax
ticketId, 
case 
when 
row_list[constants.rank2].flag_1 + row_list[constants.rank2].flag_2 > 0 or (row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3  = 0 )
then
row_list[constants.rank2]
when 
row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3  = 1 -- condition to concat string
then
struct( -- this struct must match the original one we created
row_list[constants.rank2].rawRank as rawRank, 
row_list[constants.rank2].flag_1 as flag_1,
row_list[constants.rank2].flag_2 as flag_2,
row_list[constants.rank2].flag_3 as flag_3,
row_list[constants.rank2].flag_4 as flag_4,
row_list[constants.rank2].Timestamp as Timestamp,
concat(
row_list[constants.rank1].stringvalue, 
' / ', 
row_list[constants.rank2].stringvalue) as stringvalue
)
else
row_list[constants.rank1]
end as rankedRow,
1 as Rank
from grouped_rows
cross join constants) -- not strictly needed, just replace all constants.rank1 with 0 and constants.rank2 with 1.  I just use it to make it more clear what I'm doing.  Could be replaced in production.
select rankedRow.* , 1 as Rank from raw_rows; -- makes struct columns into table columns

相关内容

  • 没有找到相关文章

最新更新