Hive查询按百分比选择前2行并显示为列



在我的hadoop集群中有如下表

ID  |   CATEGORY     |  COUNT
101 |   A            |   40
101 |   B            |   40
101 |   C            |   20
102 |   D            |   10
102 |   A            |   20
102 |   E            |   30
102 |   F            |   40

我必须写一个Hive查询,将显示id和前2个类别的百分比计数作为列。结果表应该像

ID  |   CAT1 |  %   | CAT2  | %
101 |   A    |   40 | B     | 40
102 |   F    |   40 | E     | 30

要获得每个ID的前2名,可以使用rank()函数,参见这里的示例。

要获得总体百分比,您可以将ID与聚合表连接:

select ID,sum(count) as sum from input_table group by ID

最后,如果您想将表从ID, Cat, %转换为每行一个ID,则需要在子查询中为Cat%使用collect_list,然后为数组元素创建一列

Select ID, categories[0], pcts[0],categories[1], pcts[1] from (
  Select a.ID, collect_list(Cat) as categories , collect_list(Count/sum) as pcts from (
    Select ID, Cat, Count, rank from (
      SELECT ID, Cat, Count, 
      rank() over (PARTITION BY ID ORDER BY Count DESC) as rank
      FROM input_table) inner where rank <= 2 ) a, 
    (select ID,sum(count) as sum from input_table group by ID) b where a.ID = b.ID
  group by a.ID ) inner;

最新更新