使用BigQuery,我想根据页面标题将页面分组为一个查询,并计算组中的不同度量。由于标题的规则并不相互排斥,我是这样做的:
SELECT SUM(views) views, ..., title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST([
CASE WHEN (title LIKE '%game%')
THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%')
THEN 'sports_group' END,
CASE WHEN (title LIKE '%rarerare%')
THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
我得到:
views ... title_group
3414469869 ...
4355264 ... games_group
1361074 ... sports_group
我想获得的有点不同:
views ... title_group
4355264 ... games_group
1361074 ... sports_group
0 ... rare_group
事实上,即使一个小组没有出现,我也需要有结果。
如何修改查询以获得此结果?
最简单的方法是用附加的左联接来包装原始查询(没有任何更改(,如下面的示例所示
#standardSQL
SELECT IFNULL(views, 0) views, title_group
FROM (SELECT title_group FROM UNNEST(['games_group', 'sports_group', 'rare_group']) AS title_group)
LEFT JOIN (
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN UNNEST([
CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END,
CASE WHEN (title LIKE '%rarerare%') THEN 'rare_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
)
USING(title_group)
-- ORDER BY views DESC
带输出
Row views title_group
1 4355264 games_group
2 1361074 sports_group
3 0 rare_group
如果您愿意对上面的版本进行优化/重构,请考虑下面的版本(显然输出相同(
#standardSQL
WITH title_groups AS (
SELECT title_group, pattern
FROM UNNEST([STRUCT<title_group STRING, pattern STRING>
('games_group', '%game%'), ('sports_group', '%sport%'), ('rare_group', '%rarerare%')
])
)
SELECT IFNULL(views, 0) views, title_group FROM title_groups
LEFT JOIN (
SELECT SUM(CASE WHEN title LIKE pattern THEN views END) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN title_groups
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
)
USING(title_group)
为什么不把每个小时的值放在单独的列中?
SELECT SUM(CASE WHEN title LIKE '%game%' THEN views ELSE 0 END) as game_views,
SUM(CASE WHEN title LIKE '%sport%' THEN views ELSE 0 END) as sport_views,
SUM(CASE WHEN title LIKE '%rarerare%' THEN views ELSE 0 END) as rare_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en';
如果你喜欢的话,你可以取消这个:
SELECT el.title_group, el.views
FROM (SELECT SUM(CASE WHEN title LIKE '%game%' THEN views ELSE 0 END) as game_views,
SUM(CASE WHEN title LIKE '%sport%' THEN views ELSE 0 END) as sport_views,
SUM(CASE WHEN title LIKE '%rarerare%' THEN views ELSE 0 END) as rare_views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en'
) v CROSS JOIN
UNNEST([STRUCT('games_groups' as title_group, v.game_views as views),
STRUCT('sports_groups' as title_group, v.sports_views as views),
STRUCT('rare_groups' as title_group, v.rare_views as views)
]
) el