I am using Hive 0.14.0 on the Hortonworks Data Platform, working with a large file whose input data looks like this:
tpep_pickup_datetime | pulocationid |
---|---|
2022-01-28 23:32:52.0 | 100 |
2022-02-28 23:02:40.0 | 202 |
2022-02-28 17:22:45.0 | 102 |
2022-02-28 23:19:37.0 | 102 |
2022-03-29 17:32:02.0 | 102 |
2022-01-28 23:32:40.0 | 101 |
2022-02-28 17:28:09.0 | 201 |
2022-03-28 23:59:54.0 | 100 |
2022-02-28 21:02:40.0 | 100 |
with raw_groups as -- subquery syntax
(
    select
        named_struct( -- named_struct so the fields can be referenced by name later
            'cnt', count(time_stamp), -- must be first for max to use it to sort on
            'pulocationid', location.pulocationid,
            'hour', hour(time_stamp)
        ) as mylocation -- create a struct to make max do the work for us
    from
        location
    group by
        location.pulocationid,
        hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
    select
        max(mylocation) as location -- will pick max based on count(time_stamp)
    from
        raw_groups
    group by
        mylocation.pulocationid
)
select -- format data into your requested format
    location.pulocationid,
    location.hour
from
    grouped_data
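I don't remember whether Hive 0.14 supports the `with` clause, but you can easily rewrite the query without it (by substituting each select in place of the table name that refers to it). I just don't find that version as readable: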
select -- format data into your requested format
    location.pulocationid,
    location.hour
from
(
    select
        max(mylocation) as location -- will pick max based on count(time_stamp)
    from
    (
        select
            named_struct( -- named_struct so the fields can be referenced by name later
                'cnt', count(time_stamp), -- must be first for max to use it to sort on
                'pulocationid', location.pulocationid,
                'hour', hour(time_stamp)
            ) as mylocation -- create a struct to make max do the work for us
        from
            location
        group by
            location.pulocationid,
            hour(time_stamp)
    ) raw_groups -- a subquery in FROM needs an alias in Hive
    group by
        mylocation.pulocationid
) grouped_data
You're halfway there!
The idea is right, but the syntax is a bit off:
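- First, find the count per hour:
  `select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt from yellowtaxi22 group by pulocationid, hour(tpep_pickup_datetime)`
- Then add `row_number` on top of that query, but you need to order by the count, descending:
  `select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from (<query from the first step>)`
- Last but not least, take only the row with the highest count (incidentally, this could also be done with the `max` function instead of `row_number`).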
Or, all together:
select pulocationid, hour from (
    select pulocationid, hour, cnt,
           row_number() over (partition by pulocationid order by cnt desc) as row_no
    from (
        select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt
        from yellowtaxi22
        group by pulocationid, hour(tpep_pickup_datetime)
    ) counts -- pickups per location and hour; a subquery in FROM needs an alias in Hive
) ranked
where row_no = 1
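
The remark above that `max` could do the job instead of `row_number` can be sketched roughly as follows. This is only one possible reading of that remark, not necessarily what was meant: it assumes the same `yellowtaxi22` table and column names, and the aliases `c`, `h`, `m`, `pickup_hour` and `max_cnt` are made up for illustration.

```sql
-- Sketch only: assumes the yellowtaxi22 table and columns used in this thread.
-- c: pickups per (location, hour)
-- m: the highest hourly count per location
-- The join keeps the (location, hour) rows whose count equals that maximum.
select c.pulocationid, c.pickup_hour
from (
    select pulocationid,
           hour(tpep_pickup_datetime) as pickup_hour,
           count(*) as cnt
    from yellowtaxi22
    group by pulocationid, hour(tpep_pickup_datetime)
) c
join (
    select pulocationid, max(cnt) as max_cnt
    from (
        select pulocationid,
               hour(tpep_pickup_datetime) as pickup_hour,
               count(*) as cnt
        from yellowtaxi22
        group by pulocationid, hour(tpep_pickup_datetime)
    ) h
    group by pulocationid
) m
on c.pulocationid = m.pulocationid and c.cnt = m.max_cnt
```

One difference worth keeping in mind: if two hours tie for the highest count at a location, this version returns both rows, whereas `row_number` with `row_no = 1` keeps just one of them, chosen arbitrarily. On the ten sample rows from the question there are no such ties, so both approaches should give 100 → 23, 101 → 23, 102 → 17, 201 → 17 and 202 → 23.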