BQ self-join across partitions doubles the cost



We recently ran into the following "issue" with a self-join on a partitioned table; so far the cause has not been uncovered…

PS: Consider that the partitions involved have slightly different row counts. Let's say, "the same size".

WITH partition_one AS (
  SELECT
    -- Why does using ONLY `id` get billed the same as using
    -- [id, col_2, col_3, col_4]?
    id
  FROM
    `partition_tbl_a`
  WHERE
    DATE(_PARTITIONTIME) = "YYYY-MM-01"
)
-- This query basically doubles the price, as if it were executed twice,
-- even though partition_one has only 1 column.
SELECT
  id,
  col_2,
  col_3,
  col_4
FROM
  `partition_tbl_a` AS partition_two
-- Could be any other JOIN type – like INNER, RIGHT, or even EXISTS/IN
LEFT JOIN
  partition_one
ON
  partition_one.id = partition_two.id
WHERE
  DATE(partition_two._PARTITIONTIME) = "YYYY-MM-02"

Simulation

-- Partitions of "transactions" having very similar sizes
-- 2022-08-01 – 13.76 GB
-- 2022-09-01 – 13.48 GB
WITH AUG AS (
  -- ========================================================
  -- Total single execution of this query – 557.25 MB
  -- ========================================================
  SELECT
    AUG.hash
  FROM
    `bigquery-public-data.crypto_bitcoin.transactions` AS AUG
  WHERE
    block_timestamp_month = "2022-08-01"
)
-- ========================================================
-- Total single execution of this query – 1.42 GB
-- ========================================================
-- SELECT
--   SEP.hash,
--   SEP.size,
--   SEP.virtual_size,
--   SEP.version,
--   SEP.lock_time,
--   SEP.block_hash,
--   SEP.block_number,
--   SEP.output_value
-- FROM
--   `bigquery-public-data.crypto_bitcoin.transactions` AS SEP
-- WHERE
--   block_timestamp_month = "2022-09-01"
-- ========================================================
-- Total join execution – 2.84 GB
--
-- Q1: Why exactly double the single SEP query (1.42 GB),
--     when only one column (`hash`) comes from the AUG query?
--
-- Q2: Why not
--     557.25 MB (single AUG) + 1.42 GB (single SEP) instead?
-- ========================================================
SELECT
  SEP.hash,
  SEP.size,
  SEP.virtual_size,
  SEP.version,
  SEP.lock_time,
  SEP.block_hash,
  SEP.block_number,
  SEP.output_value
FROM
  `bigquery-public-data.crypto_bitcoin.transactions` AS SEP
LEFT JOIN
  AUG
ON SEP.hash = AUG.hash
WHERE
  block_timestamp_month = "2022-09-01" AND
  AUG.hash IS NULL

Q: Given BigQuery's columnar billing model, why does using only `id` in partition_one cost the same as using [id, col_2, col_3, col_4] during the self-join?

I have seen this behavior many times: if you use "CREATE TEMP TABLE" in your BQ script, everything runs much faster, and the actual cost is far lower than using the same table in multiple subqueries that reference different columns.

(If you split the same table across multiple subqueries that select different numbers of columns, then, presumably for ease of implementation on the back end, BQ may simply take the maximum of all the column sets referenced and bill for that accordingly. There is no good documentation on this, but you are right: the query execution plan may help you learn more.)
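A quick back-of-the-envelope check with the measured numbers from the simulation below is consistent with that reading: the observed join price is exactly twice the wide single-query scan rather than the sum of the two single scans. The figures are the question's measurements; the billing model itself is an assumption, not documented behavior:

```python
# Sketch of the billing model described above (an assumption, not
# documented behavior): both sides of the self-join are billed for the
# widest column set referenced anywhere. Sizes are the question's
# measured values, in MB.

GB = 1024  # MB per GB

aug_hash_only_mb = 557.25       # measured: AUG subquery, `hash` only
sep_all_cols_mb = 1.42 * GB     # measured: SEP query, all eight columns

# Q2's expectation: each side billed for the columns it actually reads
expected_mb = aug_hash_only_mb + sep_all_cols_mb

# Q1's observation: both sides billed for the wide column set
observed_mb = 2 * sep_all_cols_mb

print(f"expected: {expected_mb / GB:.2f} GB")   # ~1.96 GB
print(f"observed: {observed_mb / GB:.2f} GB")   # 2.84 GB, as measured
```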

Now, I see that you are applying the partition filter for partition_two in the final WHERE clause, like:

WHERE
DATE(partition_two._PARTITIONTIME) = "YYYY-MM-02"

This makes BigQuery execute the entire LEFT JOIN first and only then apply the WHERE filter for the second partition, so you have already lost billed bytes processed on that side. (Without knowing the source tables, this could be the real culprit behind the cost inflation.)

(One more point that is not clear from your example: you have a LEFT JOIN but do nothing further with it, so your output will be the whole first table, possibly with NULL columns coming from the second table.)

So you can try optimizing in the following two ways:

1] Instead of the pattern below

FROM
  `partition_tbl_a` AS partition_two
... join ... on ... where partition filtering

do this:

FROM
  (SELECT id, col_2, col_3, col_4
   FROM `partition_tbl_a`
   WHERE DATE(_PARTITIONTIME) = "YYYY-MM-02") AS partition_two

This applies the filter first and only then performs the join, reducing the total bytes processed.
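A toy illustration of why this matters, under the claim above that a late filter prevents pruning (that premise is the answer's claim, not documented behavior; the only measured figure is the simulation's 1.42 GB per-partition scan, and the partition count is hypothetical):

```python
# Toy cost model for the claim above: if the partition filter is applied
# only after the join, the joined side may be billed for every partition
# it scans; filtering inside a subquery first limits billing to the
# single target partition.

per_partition_gb = 1.42   # measured: one month of the eight columns
n_partitions = 12         # hypothetical: a year of monthly partitions

late_filter_gb = per_partition_gb * n_partitions  # whole table scanned
early_filter_gb = per_partition_gb                # one partition scanned

print(f"filter after join : ~{late_filter_gb:.2f} GB billed")
print(f"filter in subquery: ~{early_filter_gb:.2f} GB billed")
```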

2]

Create two temp tables as shown below and use them. (Similar to the first idea, but it separates the data into standalone temp tables.)

BEGIN
  CREATE TEMP TABLE `T1` CLUSTER BY id AS (
    SELECT DISTINCT
      id
    FROM
      `partition_tbl_a`
    WHERE
      DATE(_PARTITIONTIME) = "YYYY-MM-01"
  );

  CREATE TEMP TABLE `T2` CLUSTER BY id AS (
    SELECT DISTINCT
      id,
      col_2,
      col_3,
      col_4
    FROM
      `partition_tbl_a`
    WHERE
      DATE(_PARTITIONTIME) = "YYYY-MM-02"
  );

  SELECT
    T2.id,
    col_2,
    col_3,
    col_4
  FROM
    T2
  LEFT JOIN
    T1
  ON
    T1.id = T2.id;
END
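If the claim above holds (each temp-table creation is billed once, and the final join then reads only the small materialized temp tables), plugging in the simulation's measured numbers as stand-ins gives roughly the price the question's Q2 expected:

```python
# Cost sketch for the temp-table script above, using the simulation's
# measured single-query sizes as stand-ins. The final join reads only
# the temp tables, billed at their much smaller materialized size
# (treated as ~0 MB here for simplicity -- an assumption).

GB = 1024  # MB per GB

t1_create_mb = 557.25       # measured: `hash`-only partition scan
t2_create_mb = 1.42 * GB    # measured: eight-column partition scan
self_join_mb = 2.84 * GB    # measured: the original self-join price

temp_table_total_mb = t1_create_mb + t2_create_mb
saving_mb = self_join_mb - temp_table_total_mb

print(f"temp-table script: ~{temp_table_total_mb / GB:.2f} GB")
print(f"saving           : ~{saving_mb / GB:.2f} GB")
```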

See if these tricks help.
