Is it possible to write a BigQuery query that retrieves binned counts of PyPI downloads over time?



The following Google BigQuery SQL query counts how many times my PyPI package was downloaded in the last 30 days.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pycotools'
-- Only query the last 30 days of history
AND _TABLE_SUFFIX
BETWEEN FORMAT_DATE(
'%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

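Is it possible to modify this query so that I get the number of downloads for every 30-day period since the package was uploaded? The output would be a .csv that looks like this: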

date          count
01-01-2016    10
01-02-2016    20
..        ..
01-05-2018    100

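I suggest using EXTRACT (standard SQL) or MONTH() (legacy SQL) and counting only the file.project field, since that should let the query run faster. A query you could use is: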

#standardSQL
SELECT
EXTRACT(MONTH FROM _PARTITIONDATE) AS month_, 
EXTRACT(YEAR FROM _PARTITIONDATE) AS year_,
count(file.project) as count
FROM
`the-psf.pypi.downloads*`
WHERE
file.project= 'pycotools'
GROUP BY 1, 2
-- Sort by year first, then month, so rows from different years do not interleave
ORDER BY year_, month_

I tried it with a public dataset:

#standardSQL
SELECT
EXTRACT(MONTH FROM pickup_datetime) AS month_, 
EXTRACT(YEAR FROM pickup_datetime) AS year_,
count(rate_code) as count
FROM
`nyc-tlc.green.trips_2015`
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC

Or, using legacy SQL:

SELECT
MONTH(pickup_datetime) AS month_, 
YEAR(pickup_datetime) AS year_,
count(rate_code) as count
FROM
[nyc-tlc:green.trips_2015]
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC

The result is:

month_  year_   count    
1       2015    34228    
2       2015    36366    
3       2015    42221    
4       2015    41159    
5       2015    41934    
6       2015    39506        

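I see you are using _TABLE_SUFFIX. If you are querying a partitioned table, you can filter on the _PARTITIONDATE column instead of formatting dates and calling DATE_SUB. That also uses less compute time.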

Querying by partition date:

SELECT
[COLUMN]
FROM
[DATASET].[TABLE]
WHERE
_PARTITIONDATE BETWEEN '2016-01-01'
AND '2016-01-02'
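
The monthly grouping above follows calendar months, while the question asks for fixed 30-day periods. As a rough sketch of that variant (not tested against the live dataset), the query below derives each row's date from _TABLE_SUFFIX, so it does not rely on _PARTITIONDATE, which is only available on ingestion-time partitioned tables. It assumes every table matched by the wildcard has a suffix in YYYYMMDD form, as the original query implies. DATE '2016-01-01' is a placeholder for the package's first upload date, and bin_offset_days, bin_start and binned are illustrative names.

```sql
#standardSQL
-- Sketch: count downloads in consecutive 30-day bins from an assumed start date.
-- Assumption: DATE '2016-01-01' stands in for the package's first upload date.
WITH binned AS (
  SELECT
    -- Days since the start date, rounded down to a multiple of 30
    DIV(
      DATE_DIFF(PARSE_DATE('%Y%m%d', _TABLE_SUFFIX), DATE '2016-01-01', DAY),
      30
    ) * 30 AS bin_offset_days
  FROM `the-psf.pypi.downloads*`
  WHERE file.project = 'pycotools'
    -- Restrict to daily tables between the start date and today
    AND _TABLE_SUFFIX
      BETWEEN FORMAT_DATE('%Y%m%d', DATE '2016-01-01')
      AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
)
SELECT
  -- First day of the 30-day bin this download falls into
  DATE_ADD(DATE '2016-01-01', INTERVAL bin_offset_days DAY) AS bin_start,
  COUNT(*) AS num_downloads
FROM binned
GROUP BY bin_start
ORDER BY bin_start
```

Each output row is the first day of a 30-day bin and the number of downloads in it. Bins with no downloads do not appear in the result, and the result can be saved as a .csv from the BigQuery console.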
