优化BigQuery中的LEFT JOIN速度



我需要连接BigQuery中的六个表。总的来说,数据总量为500亿,每个表包含数千行。我正在尝试LEFT JOIN共享列上的表。该查询目前估计运行时间为数周,在接近完成之前很长一段时间就会超时。有更好的方法来优化这个查询吗?

SELECT
*
FROM
`report1` t1
LEFT JOIN
`report2` t2
ON
t1.campaignid = t2.campaignid
LEFT JOIN
`report3` t3
ON
t1.campaignid = t3.campaignid
LEFT JOIN
`report4` t4
ON
t1.campaignid = t4.campaignid
LEFT JOIN
`report5` t5
ON
t1.campaignid = t5.campaignid
LEFT JOIN
`report6` t6
ON
t1.campaignid = t6.campaignid

这对于注释来说太长了。

查询的问题不在于left joins本身。相反,问题是(很可能(底层表每个campaignid有多行。您可以通过执行以下操作来估计查询生成的行数:

select sum(t1.cnt * t2.cnt * t3.cnt * t4.cnt * t5.cnt * t6.cnt)
from (select campaignid, count(*) as cnt from table1 group by 1) t1 left join
(select campaignid, count(*) as cnt from table2 group by 1) t2 
using (campaignid) left join
(select campaignid, count(*) as cnt from table3 group by 1) t3 
using (campaignid) left join
(select campaignid, count(*) as cnt from table4 group by 1) t4 
using (campaignid) left join
(select campaignid, count(*) as cnt from table5 group by 1) t5 
using (campaignid) left join
(select campaignid, count(*) as cnt from table6 group by 1) t6 
using (campaignid);

这可能会返回一个非常大的数字——这解释了运行查询的时间很长。

您需要修复join条件或以不同的方式构建查询。您可能会问另一个问题,包括示例数据、所需结果和逻辑解释。

最新更新