MySQL query to Hiveql



work(id, rank)

数据:

work
------------------
1 | A
1 | B
1 | C
1 | D
2 | A
2 | C
2 | B
3 | C

我需要找到所有对具有其计数的共同排名的id,并且只有当排名计数大于2时才应该显示,并按降序打印它们。我为此写了一个mysql查询,但是,我是SparkSQL和HIVEQL的新手。所以请告诉我怎么做。例如,使用上面的数据,结果集应该是:

mysql查询是:

select a.id,b.id
from work as a, work as b
where a.id>b.id
group by a.id,b.id having group_concat(distinct a.rank order by a.rank)=group_concat(distinct b.rank order by b.rank)
---------------------
id1 | id2 | Count
---------------------
 A  | B   |  3
 B  | C   |  3

我认为Hive不支持group_concat()。我认为这是相同的事情:

select a.id, b.id, a.cnt
from (select a.*, count(*) over (partition by a.id) as cnt
      from work a
     ) a join
     (select b.*, count(*) over (partition by b.id) as cnt
      from work b
     ) b
     on a.rank = b.rank and a.cnt = b.cnt
where a.id < b.id   -- I *think* this is allowed in Hive; it not, a subquery or expression in the `having` clause will do the same thing
group by a.id, b.id, a.cnt
having count(*) = a.cnt;

这是获得具有相同排名的id对的更自然的方法。事实上,在几乎任何数据库中,它都应该比MySQL版本更有效。cross join生成大量数据

相关内容

  • 没有找到相关文章