在我的hadoop集群中有如下表
ID | CATEGORY | COUNT
101 | A | 40
101 | B | 40
101 | C | 20
102 | D | 10
102 | A | 20
102 | E | 30
102 | F | 40
我必须写一个Hive查询,将显示id和前2个类别的百分比计数作为列。结果表应该像
ID | CAT1 | % | CAT2 | %
101 | A | 40 | B | 40
102 | F | 40 | E | 30
要获得每个ID的前2名,可以使用rank()
函数,参见这里的示例。
要获得总体百分比,您可以将ID与聚合表连接:
select ID,sum(count) as sum from input_table group by ID
最后,如果您想将表从ID, Cat, %
转换为每行一个ID,则需要在子查询中为Cat
和%
使用collect_list
,然后为数组元素创建一列
Select ID, categories[0], pcts[0],categories[1], pcts[1] from (
Select a.ID, collect_list(Cat) as categories , collect_list(Count/sum) as pcts from (
Select ID, Cat, Count, rank from (
SELECT ID, Cat, Count,
rank() over (PARTITION BY ID ORDER BY Count DESC) as rank
FROM input_table) inner where rank <= 2 ) a,
(select ID,sum(count) as sum from input_table group by ID) b where a.ID = b.ID
group by a.ID ) inner;