Optimizing a large query to use fewer joins with count distinct



On Google BigQuery we have a report with about 10 columns, for example:

+----------------+-----------------+---------------+-------------+
|     uniquesent | uniquedelivered | uniquebounced | uniqueopens |
+----------------+-----------------+---------------+-------------+

We have a long query that uses many joins to compute these values. Roughly, the big query is organized like this:

select
  ...report_columns...,
  sent.uniquesent,
  delivered.uniquedelivered
from [main table]
left join (
  select
    language,
    exact_count_distinct(e.user_id) as uniquesent
  from emailevent e
  where country=1 and event='sent'
  group by 1
) as sent
left join (
  select
    language,
    exact_count_distinct(e.user_id) as uniquedelivered
  from emailevent e
  where country=1 and event='delivered'
  group by 1
) as delivered

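...and this list of JOINs continues in the same style for about 10 other similar columns. Now imagine the same query also grouped by day/week/month: it becomes very hard to read. We also get error messages such as: Resource exceeded.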

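We would like to rewrite and optimize the query so that it returns the same numbers more efficiently. Let me know if you have further questions, but mainly we want to get rid of the joins somehow, so the query is more compact and performs better.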

We have already compacted parts of the query using the following syntax:

sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'drop_status',1,0)) as bounced,
sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'spam_report',1,0)) as spam_reported

But this syntax does not support distinct counts.

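Can you lift the conditions you are looking for into fields of a subselect, and then count the distinct values of those fields? In other words, something like: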

select
  ...report_columns...,
  t1.uniquesent,
  t1.uniquedelivered
from [main table]
left join (
  select
    language,
    exact_count_distinct(sent) as uniquesent,
    exact_count_distinct(delivered) as uniquedelivered
  from (
    select
      language,
      if (country=1 and event='sent', e.user_id, null) as sent,
      if (country=1 and event='delivered', e.user_id, null) as delivered
    from emailevent e
  ) group by language
) as t1
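To check that this idea returns the same numbers as the separate filtered subqueries, here is a minimal local sketch. It is not BigQuery: it assumes SQLite through Python's sqlite3 module, uses `count(distinct case when ... end)` as a stand-in for `exact_count_distinct(if(...))`, and the sample rows and the `unique_users` helper are made up for illustration. The distinct count skips NULLs, which is what makes the conditional version work.

```python
import sqlite3

# In-memory stand-in for the emailevent table (sample data is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table emailevent (user_id int, language text, country int, event text)"
)
conn.executemany(
    "insert into emailevent values (?, ?, ?, ?)",
    [
        (1, "en", 1, "sent"),
        (1, "en", 1, "sent"),       # same user sent twice, counted once
        (2, "en", 1, "sent"),
        (1, "en", 1, "delivered"),
        (3, "en", 2, "sent"),       # other country, must be ignored
        (4, "de", 1, "sent"),
        (4, "de", 1, "delivered"),
        (5, "de", 1, "delivered"),
    ],
)

# One pass over the table: each condition yields user_id or NULL,
# and count(distinct ...) ignores the NULLs.
single_pass = conn.execute("""
    select
      language,
      count(distinct case when country = 1 and event = 'sent'
                          then user_id end) as uniquesent,
      count(distinct case when country = 1 and event = 'delivered'
                          then user_id end) as uniquedelivered
    from emailevent
    group by language
    order by language
""").fetchall()


def unique_users(event):
    """Original style: one filtered query per event, keyed by language."""
    return dict(conn.execute(
        "select language, count(distinct user_id) from emailevent "
        "where country = 1 and event = ? group by language",
        (event,),
    ).fetchall())


sent = unique_users("sent")
delivered = unique_users("delivered")
for language, uniquesent, uniquedelivered in single_pass:
    # Both formulations agree for every language in the sample data.
    assert uniquesent == sent.get(language, 0)
    assert uniquedelivered == delivered.get(language, 0)
print(single_pass)
```

The single-pass form scans the table once instead of once per report column, which is the part that should carry over to BigQuery; whether it avoids Resource exceeded there still depends on how many distinct users each group holds.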

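Running an exact count distinct over too many distinct values can push you into resources-exceeded territory. Note that if you use count distinct with a bucket count (COUNT(DISTINCT field, n) in legacy SQL), the result is exact up to that bucket count and approximate beyond it. Most of the time people care about the exact number when it is small, and are fine with an approximation when it is large.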

For the piece of the query you posted, you could do something like this to reduce the number of joins.

select
  ...report_columns...,
  SUM(IF(event='sent', uniqueevent, 0)) as uniquesent,
  SUM(IF(event='delivered', uniqueevent, 0)) as uniquedelivered
from [main table]
left join (
  select
    event,
    language,
    exact_count_distinct(e.user_id) as uniqueevent
  from emailevent e
  where country=1 and event in ('sent', 'delivered')
  group by event, language
) as sent
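The same pattern can be tried locally. This is only a sketch under assumptions: SQLite via Python's sqlite3 instead of BigQuery, `sum(case when ...)` in place of `SUM(IF(...))`, made-up sample rows, and the join to the main table left out so that only the count-then-pivot step is shown.

```python
import sqlite3

# In-memory stand-in for the emailevent table (sample data is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table emailevent (user_id int, language text, country int, event text)"
)
conn.executemany(
    "insert into emailevent values (?, ?, ?, ?)",
    [
        (1, "en", 1, "sent"),
        (1, "en", 1, "sent"),       # duplicate, counted once
        (2, "en", 1, "sent"),
        (2, "en", 1, "delivered"),
        (2, "en", 1, "opened"),     # event not in the IN list, filtered out
        (3, "fr", 1, "sent"),
        (3, "fr", 2, "delivered"),  # other country, filtered out
    ],
)

# Inner query: one distinct count per (event, language) pair.
# Outer query: pivot those rows into one column per event.
pivoted = conn.execute("""
    select
      language,
      sum(case when event = 'sent' then uniqueevent else 0 end) as uniquesent,
      sum(case when event = 'delivered' then uniqueevent else 0 end) as uniquedelivered
    from (
      select
        event,
        language,
        count(distinct user_id) as uniqueevent
      from emailevent
      where country = 1 and event in ('sent', 'delivered')
      group by event, language
    )
    group by language
    order by language
""").fetchall()

print(pivoted)
```

Because the distinct counts are computed per event before pivoting, a language with no rows for an event ends up with 0 in that column rather than NULL.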
