How do I find the most frequently repeated value for each group in Hive?



I'm using Hive 0.14.0 on the Hortonworks Data Platform, on a large file similar to the following input data:

| tpep_pickup_datetime | pulocationid |
| --- | --- |
| 2022-01-28 23:32:52.0 | 100 |
| 2022-02-28 23:02:40.0 | 202 |
| 2022-02-28 17:22:45.0 | 102 |
| 2022-02-28 23:19:37.0 | 102 |
| 2022-03-29 17:32:02.0 | 102 |
| 2022-01-28 23:32:40.0 | 101 |
| 2022-02-28 17:28:09.0 | 201 |
| 2022-03-28 23:59:54.0 | 100 |
| 2022-02-28 21:02:40.0 | 100 |


    with raw_groups as -- first CTE: one row per (location, hour) with its count
    (
        select
            named_struct( -- named fields, so they can be referenced by name below
                'cnt', count(time_stamp), -- must be first for max to sort on it
                'pulocationid', location.pulocationid,
                'hour', hour(time_stamp)
            ) as mylocation -- create a struct to make max do the work for us
        from
            location
        group by
            location.pulocationid,
            hour(time_stamp)
    ),
    grouped_data as -- second CTE, also introduced by `with`
    (
        select
            max(mylocation) as location -- will pick max based on count(time_stamp)
        from
            raw_groups
        group by
            mylocation.pulocationid
    )
    select -- format data into your requested format
        location.pulocationid,
        location.hour
    from
        grouped_data

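I don't remember whether Hive 0.14 can use the `with` clause, but you can easily rewrite the query without it (by putting each select in place of the table name it stands for). I just don't find that version as readable: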

    select -- format data into your requested format
        location.pulocationid,
        location.hour
    from
    (
        select
            max(mylocation) as location -- will pick max based on count(time_stamp)
        from
        (
            select
                named_struct( -- named fields, so they can be referenced by name
                    'cnt', count(time_stamp), -- must be first for max to sort on it
                    'pulocationid', location.pulocationid,
                    'hour', hour(time_stamp)
                ) as mylocation -- create a struct to make max do the work for us
            from
                location
            group by
                location.pulocationid,
                hour(time_stamp)
        ) raw_groups -- Hive requires an alias on a subquery in from
        group by
            mylocation.pulocationid
    ) grouped_data
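
To see why the count has to be the first field, here is a minimal sketch. It assumes that `max` compares structs field by field, from left to right, which is what the query above relies on. The field names and literal values are made up for illustration, and `select` without a `from` clause assumes Hive 0.13 or later.

```sql
-- Hypothetical check, not part of the original answer:
-- which of these two structs does max() pick?
select max(s) as winner
from (
    select named_struct('cnt', 2, 'hour', 17) as s  -- higher count, earlier hour
    union all
    select named_struct('cnt', 1, 'hour', 23) as s  -- lower count, later hour
) t;
```

If the comparison works as assumed, the struct with `cnt` = 2 wins even though its hour is smaller. By the same reasoning, when two hours tie on the count for a location, the later fields decide, so the larger hour would be the one returned.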

You're halfway there!

The idea is right, but the syntax is a little off:

  1. First, find the count for each hour:

    select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt
    from yellowtaxi22
    group by pulocationid, hour(tpep_pickup_datetime)
    
  2. Then add row_number, but you need to order by the count in descending order:

    select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no
    from -- the query from step 1, as an aliased subquery

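  3. Last but not least, keep only the row with the highest count (incidentally, this can also be done with the max function instead of row_number).

Or all together: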

    select pulocationid, hour
    from (
        select pulocationid, hour, cnt,
               row_number() over (partition by pulocationid order by cnt desc) as row_no
        from (
            select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt
            from yellowtaxi22
            group by pulocationid, hour(tpep_pickup_datetime)
        ) counts -- Hive requires an alias on each subquery in from
    ) ranked
    where row_no = 1
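
To try this on the sample rows from the question, something like the following sketch may help. The scratch table name `yellowtaxi22_sample`, the column alias `pickup_hour`, and the subquery aliases are my own choices, not from the question. The secondary sort key is also an assumption: it makes the earliest hour win when two hours tie on the count, where a plain `order by cnt desc` would pick one of them arbitrarily.

```sql
-- Hypothetical scratch table holding the nine sample rows from the question.
create table yellowtaxi22_sample (
    tpep_pickup_datetime timestamp,
    pulocationid int
);

-- insert ... values is available from Hive 0.14 on.
insert into table yellowtaxi22_sample values
    ('2022-01-28 23:32:52.0', 100),
    ('2022-02-28 23:02:40.0', 202),
    ('2022-02-28 17:22:45.0', 102),
    ('2022-02-28 23:19:37.0', 102),
    ('2022-03-29 17:32:02.0', 102),
    ('2022-01-28 23:32:40.0', 101),
    ('2022-02-28 17:28:09.0', 201),
    ('2022-03-28 23:59:54.0', 100),
    ('2022-02-28 21:02:40.0', 100);

select pulocationid, pickup_hour
from (
    select pulocationid, pickup_hour, cnt,
           row_number() over (
               partition by pulocationid
               order by cnt desc, pickup_hour  -- assumption: earliest hour wins a tie
           ) as row_no
    from (
        select pulocationid,
               hour(tpep_pickup_datetime) as pickup_hour,
               count(*) as cnt
        from yellowtaxi22_sample
        group by pulocationid, hour(tpep_pickup_datetime)
    ) counts
) ranked
where row_no = 1
order by pulocationid;

-- Result expected for the sample rows (pulocationid, pickup_hour):
-- 100  23
-- 101  23
-- 102  17
-- 201  17
-- 202  23
```

Location 100 has two pickups at hour 23 and one at hour 21, and location 102 has two at hour 17 and one at hour 23, so those are the only groups where the count actually decides the outcome. If you would rather see every tied hour instead of just one, `rank()` in place of `row_number()` should do that.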