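I am running the query below on Spark, but it does not work. When it reaches stage 13, it hangs. Disk usage keeps growing while the job sits in that same stage with nothing else happening, until the disk is full. Is something wrong with the query? Do you see any problem with this Spark query?
First, I create a view in Hive: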
create view q2_min_ps_supplycost as
select
    p_partkey as min_p_partkey,
    min(ps_supplycost) as min_ps_supplycost
from
    part,
    partsupp,
    supplier,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
group by
    p_partkey;
Then, this is the query I run in Spark with a HiveContext:
select
    s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
from
    part,
    supplier,
    partsupp,
    nation,
    region,
    q2_min_ps_supplycost
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 37
    and p_type like '%COPPER'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
    and ps_supplycost = min_ps_supplycost
    and p_partkey = min_p_partkey
order by
    s_acctbal desc,
    n_name,
    s_name,
    p_partkey
limit 100;
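You can split the query into several smaller queries so that each one joins only two tables, with the last one producing the same result as the original. This keeps the intermediate files small and should avoid the hang.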
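
As a rough sketch of that approach, the six-way join could be staged as below. The intermediate table names (q2_tmp_eu_nation, q2_tmp_eu_supplier, q2_tmp_part_ps, q2_tmp_min_ps) are hypothetical and chosen only for illustration. The sketch assumes that CREATE TABLE ... AS SELECT is available in your HiveContext and that the view q2_min_ps_supplycost already exists as defined in the question.

```sql
-- Step 1: nations in the EUROPE region (nation joined to region).
create table q2_tmp_eu_nation as
select n_nationkey, n_name
from nation
join region on n_regionkey = r_regionkey
where r_name = 'EUROPE';

-- Step 2: suppliers located in those nations.
create table q2_tmp_eu_supplier as
select s_suppkey, s_acctbal, s_name, s_address, s_phone, s_comment, n_name
from supplier
join q2_tmp_eu_nation on s_nationkey = n_nationkey;

-- Step 3: parts matching the size/type filter, with their supply offers.
create table q2_tmp_part_ps as
select p_partkey, p_mfgr, ps_suppkey, ps_supplycost
from part
join partsupp on p_partkey = ps_partkey
where p_size = 37
  and p_type like '%COPPER';

-- Step 4: keep only the offers whose cost equals the European minimum.
create table q2_tmp_min_ps as
select p_partkey, p_mfgr, ps_suppkey
from q2_tmp_part_ps
join q2_min_ps_supplycost
  on p_partkey = min_p_partkey
 and ps_supplycost = min_ps_supplycost;

-- Step 5: attach the supplier details and produce the final result.
select
    s_acctbal, s_name, n_name, p_partkey,
    p_mfgr, s_address, s_phone, s_comment
from q2_tmp_min_ps
join q2_tmp_eu_supplier on ps_suppkey = s_suppkey
order by s_acctbal desc, n_name, s_name, p_partkey
limit 100;
```

Each step uses an explicit JOIN ... ON between exactly two inputs, so no step has to pair tables that share no join condition (in the original FROM list, part and supplier are adjacent but are only related through partsupp). The view itself is still a five-table join; if it turns out to be the slow part, it could be staged the same way. Drop the temporary tables once the final result has been written.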