BigQuery - 获取每个组的聚合结果,即使其中一些组没有任何成员



使用BigQuery,我想根据页面标题将页面分组为一个查询,并计算组中的不同度量。由于标题的规则并不相互排斥,我是这样做的:

SELECT SUM(views) views, ..., title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`, 
UNNEST([
CASE WHEN (title LIKE '%game%') 
THEN 'games_group' END, 
CASE WHEN (title LIKE '%sport%') 
THEN 'sports_group' END, 
CASE WHEN (title LIKE '%rarerare%')
THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group

我得到:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

我想获得的有点不同:

views     ...   title_group
4355264   ...   games_group
1361074   ...   sports_group
0         ...   rare_group

事实上,即使一个小组没有出现,我也需要有结果。

如何修改查询以获得此结果?

最简单的方法是用附加的左联接来包装原始查询(没有任何更改(,如下面的示例所示

#standardSQL
SELECT IFNULL(views, 0) views, title_group
FROM (SELECT title_group FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group)
LEFT JOIN (
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN UNNEST([
CASE WHEN (title LIKE '%game%') THEN 'games_group' END, 
CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END, 
CASE WHEN (title LIKE '%rarerare%') THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
)
USING(title_group)
-- ORDER BY views DESC   

带输出

Row views   title_group  
1   4355264 games_group  
2   1361074 sports_group     
3   0       rare_group    

如果您愿意对上面的版本进行优化/重构,请考虑下面的版本(显然输出相同(

#standardSQL
WITH title_groups AS (
SELECT title_group, pattern 
FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
])
)
SELECT IFNULL(views, 0) views, title_group FROM title_groups
LEFT JOIN (
SELECT SUM(CASE WHEN title LIKE pattern THEN views END) views, title_group 
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN title_groups
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
)
USING(title_group)

为什么不把每个小时的值放在单独的列中?

SELECT SUM(CASE WHEN title LIKE '%game%' THEN views ELSE 0 END) as game_views,
SUM(CASE WHEN title LIKE '%sport%' THEN views ELSE 0 END) as sport_views,
SUM(CASE WHEN title LIKE '%rarerare%' THEN views ELSE 0 END) as rare_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en';

如果你喜欢的话,你可以取消这个:

SELECT el.title_group, el.views
FROM (SELECT SUM(CASE WHEN title LIKE '%game%' THEN views ELSE 0 END) as game_views,
SUM(CASE WHEN title LIKE '%sport%' THEN views ELSE 0 END) as sport_views,
SUM(CASE WHEN title LIKE '%rarerare%' THEN views ELSE 0 END) as rare_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en'
) v CROSS JOIN
UNNEST([STRUCT('games_groups' as title_group, v.game_views as views),
STRUCT('sports_groups' as title_group, v.sports_views as views),
STRUCT('rare_groups' as title_group, v.rare_views as views)
]
) el

最新更新