Spark query problem: processing blocks at one stage and stays blocked until the disk is full



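I am running the query below on Spark, but it does not work. When it reaches stage 13, it blocks. While it is stuck in that stage nothing seems to progress, yet disk usage keeps growing until the disk is full. Is there something wrong with the query? Can you see any problem with this Spark query?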

First, I create a view in Hive:

create view q2_min_ps_supplycost as
select
    p_partkey as min_p_partkey,
    min(ps_supplycost) as min_ps_supplycost
from
    part,
    partsupp,
    supplier,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
group by
    p_partkey;

Then, this is the query I run in Spark with a HiveContext:

 select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
    from
        part,
        supplier,
        partsupp,
        nation,
        region,
        q2_min_ps_supplycost
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 37
        and p_type like '%COPPER'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost = min_ps_supplycost
        and p_partkey = min_p_partkey
    order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey
    limit 100;

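You can split the query into several queries, each of which joins only two tables, and still get the same result from the last one. This minimizes the size of the intermediate files and should avoid the blocking.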
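As a rough sketch of that idea (untested against your data), the query could be broken into steps like the ones below. The intermediate table names (q2_eu_nation, q2_eu_supplier, q2_eu_partsupp, q2_min_cost, q2_candidate) are made up for illustration, and the sketch assumes your HiveContext allows CREATE TABLE ... AS SELECT. The minimum cost is computed here without joining part, on the assumption that every ps_partkey has a matching row in part; the final steps join part anyway.

```sql
-- Step 1: nations in the target region (nation x region).
create table q2_eu_nation as
select n.n_nationkey, n.n_name
from nation n
join region r on n.n_regionkey = r.r_regionkey
where r.r_name = 'EUROPE';

-- Step 2: suppliers located in those nations (supplier x step 1).
create table q2_eu_supplier as
select s.s_suppkey, s.s_acctbal, s.s_name, s.s_address,
       s.s_phone, s.s_comment, e.n_name
from supplier s
join q2_eu_nation e on s.s_nationkey = e.n_nationkey;

-- Step 3: supply offers made by those suppliers (partsupp x step 2).
create table q2_eu_partsupp as
select ps.ps_partkey, ps.ps_supplycost,
       es.s_acctbal, es.s_name, es.n_name,
       es.s_address, es.s_phone, es.s_comment
from partsupp ps
join q2_eu_supplier es on ps.ps_suppkey = es.s_suppkey;

-- Step 4: cheapest offer per part; plays the role of the
-- q2_min_ps_supplycost view (assumes every ps_partkey exists in part).
create table q2_min_cost as
select ps_partkey as min_p_partkey,
       min(ps_supplycost) as min_ps_supplycost
from q2_eu_partsupp
group by ps_partkey;

-- Step 5: keep only the parts of interest (part x step 3).
create table q2_candidate as
select p.p_partkey, p.p_mfgr, ep.ps_supplycost,
       ep.s_acctbal, ep.s_name, ep.n_name,
       ep.s_address, ep.s_phone, ep.s_comment
from part p
join q2_eu_partsupp ep on p.p_partkey = ep.ps_partkey
where p.p_size = 37
  and p.p_type like '%COPPER';

-- Step 6: keep only the cheapest offers and sort (step 5 x step 4).
select c.s_acctbal, c.s_name, c.n_name, c.p_partkey,
       c.p_mfgr, c.s_address, c.s_phone, c.s_comment
from q2_candidate c
join q2_min_cost m
  on c.p_partkey = m.min_p_partkey
 and c.ps_supplycost = m.min_ps_supplycost
order by c.s_acctbal desc, c.n_name, c.s_name, c.p_partkey
limit 100;
```

Each step uses an explicit JOIN ... ON between exactly two inputs and applies its filter as early as possible, so each intermediate table holds only the rows the next step needs. Drop the intermediate tables once the final result has been collected.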
