Find max, min, avg, and percentiles of count(*) per mmdd in PostgreSQL



Postgres version 9.4.18, PostGIS Version 2.2.

Below are the tables I am working with (major changes to the table structure are unlikely):

Table ltg_data (spans 1988 to 2018):

Column   |           Type           | Modifiers 
----------+--------------------------+-----------
intensity | integer                  | not null
time      | timestamp with time zone | not null
lon       | numeric(9,6)             | not null
lat       | numeric(8,6)             | not null
ltg_geom  | geometry(Point,4269)     | 
Indexes:
    "ltg_data2_ltg_geom_idx" gist (ltg_geom)
    "ltg_data2_time_idx" btree ("time")

Size of ltg_data (~800M rows):

ltg=# select pg_relation_size('ltg_data');
 pg_relation_size 
------------------
     149729288192

Table counties:

Column     |            Type             |                       Modifiers
-----------+-----------------------------+-------------------------------------------------------
gid        | integer                     | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer                     | 
objectid   | integer                     | 
state      | character varying(2)        | 
cwa        | character varying(9)        | 
countyname | character varying(24)       | 
fips       | character varying(5)        | 
time_zone  | character varying(2)        | 
fe_area    | character varying(2)        | 
lon        | double precision            | 
lat        | double precision            | 
the_geom   | geometry(MultiPolygon,4269) | 
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)

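I have a query that calculates the total number of rows for each day of the year (month-day) across the 30 years. With help from Stack Overflow, the query that gets these counts works fine. Below are the query and its results, using the following function.

Function: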

CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
$$SELECT to_char($1, 'MMDD')::int$$;

Query:

SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM  (
SELECT f_mmdd(d::date) AS mmdd  -- ignoring the year
FROM   generate_series(timestamp '2018-01-01'  -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT  JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM   counties c
JOIN   ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE  cwa = 'MFR'
GROUP  BY 1
) ct USING (mmdd)
ORDER  BY 1;

Results:

 mmdd | total_count
------+-------------
  725 |        2126
  726 |         558
  727 |           2
  728 |           2
  729 |           2
  730 |           0
  731 |           0
  801 |           0
  802 |          10

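Desired results: I am trying to find additional statistics on the counts for each day of the year. For example, for July 25 (725 in the table below) I know that the total across all years in the table is 2126. What I am looking for, for July 25 (725), is the maximum daily count, the minimum, the percentage of years in which count(*) for that day is not zero, and percentiles (10th, 25th, 50th, 75th, 90th; stdev would be useful as well). It would also be good to see the year in which max_daily occurred. I assume that if there were no counts for that day in any year, year_max_daily would be blank or zero.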

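 mmdd | total_count | max_daily | year_max_daily | percent_years_count_not_zero | 10th_percentile_daily | 90th_percentile_daily
------+-------------+-----------+----------------+------------------------------+-----------------------+-----------------------
  725 |        2126 |      1000 |           1990 |                           30 |                    15 |                   900
  726 |         558 |       120 |           1992 |                           20 |                    10 |                    80
  727 |           2 |         1 |           1991 |                            2 |                     0 |                     1
  728 |           2 |         1 |           1990 |                            2 |                     0 |                     1
  729 |           2 |         1 |           1989 |                            2 |                     0 |                     1
  730 |           0 |         0 |                |                            0 |                     0 |                     0
  731 |           0 |         0 |                |                            0 |                     0 |                     0
  801 |           0 |         0 |                |                            0 |                     0 |                     0
  802 |          10 |        10 |           1990 |                            0 |                     1 |                     8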

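What I have tried so far does not work. It returns the same results as the totals. I think that is because I only take the average after the totals have already been computed, so I never actually look at the count for each day of each year and average those.

Attempt: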
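The difference between the two kinds of average can be seen on a few made-up rows. This is only an illustrative sketch: the `strikes` rows and the CTE names `totals` and `per_year` are hypothetical and are not taken from ltg_data. Grouping by mmdd alone leaves a single row per day, so averaging it just returns the total. Grouping by (year, mmdd) first keeps one count per year to average.

```sql
-- Hypothetical per-strike rows (year, mmdd); not real ltg_data values.
WITH strikes(yr, mmdd) AS (
    VALUES (1990, 725), (1990, 725), (1990, 725)
         , (1991, 725)
)
, totals AS (            -- one row per mmdd: the years are already collapsed
    SELECT mmdd, count(*) AS ct
    FROM   strikes
    GROUP  BY mmdd
)
, per_year AS (          -- one row per (year, mmdd)
    SELECT yr, mmdd, count(*) AS ct
    FROM   strikes
    GROUP  BY yr, mmdd
)
SELECT t.mmdd
     , (SELECT avg(ct) FROM totals   WHERE mmdd = t.mmdd) AS avg_of_total   -- average of one row: the total, 4
     , (SELECT avg(ct) FROM per_year WHERE mmdd = t.mmdd) AS avg_of_yearly  -- (3 + 1) / 2 = 2
FROM   totals t;
```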

SELECT AVG(CAST(total_count AS float)), day
FROM (
    SELECT d.mmdd AS day, COALESCE(ct.ct, 0) AS total_count
    FROM (
        SELECT f_mmdd(d::date) AS mmdd
        FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 day') d
    ) d
    LEFT JOIN (
        -- averaging here sees only one row per mmdd, so it returns the total
        SELECT mmdd, avg(q.ct) AS ct
        FROM (
            SELECT f_mmdd((time at time zone 'utc+12')::date) AS mmdd, count(*) AS ct
            FROM counties c
            JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
            WHERE cwa = 'MFR'
            GROUP BY 1
        ) q
        GROUP BY 1
    ) ct USING (mmdd)
) sub
GROUP BY day
ORDER BY day;

Thanks for any help!

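I have not included calculations for all of the requested statistics (that is too much for one question), but I hope you will be able to extend the query below and add the extra statistics you need.

I use CTEs below to keep the query readable. If you prefer, you can put it all into one huge query. I recommend running the query step by step, CTE by CTE, and examining the intermediate results to understand how it works.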

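CTE_Dates is a simple list of every calendar date from 1988 through 2018.

CTE_DailyCounts is the basic list of counts for each date on which at least one row exists (I took your existing query and grouped by the full date instead of by month-day).

CTE_FullStats is again the list of all dates, together with some statistics calculated for each (month, day) using window functions partitioned by month and day. ROW_NUMBER is used to find, within each (month, day), the year that had the largest count.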

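The final query keeps, for each (month, day), only the row of the year with the largest count, along with the rest of the statistics.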
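The window-function step can be tried in isolation on a handful of made-up rows. This is a minimal sketch, not a tested run against the real tables: the `daily` and `ranked` names and all values are hypothetical. It uses `NULLS LAST` because PostgreSQL sorts NULLs first in descending order by default, and NULL is what the LEFT JOIN from the date list produces for years with no rows.

```sql
-- Made-up daily counts; NULL stands for a year with no strikes on that day.
WITH daily(yr, mon, dy, ct) AS (
    VALUES (1989, 7, 25, 15)
         , (1990, 7, 25, 1000)
         , (1991, 7, 25, NULL)
         , (1989, 7, 26, 120)
         , (1990, 7, 26, NULL)
         , (1991, 7, 26, 10)
)
, ranked AS (
    SELECT yr, mon, dy, ct
           -- same value repeated on every row of the (mon, dy) partition
         , max(ct) OVER (PARTITION BY mon, dy) AS max_daily
           -- 1 = the year with the largest count for this (mon, dy)
         , row_number() OVER (PARTITION BY mon, dy
                              ORDER BY ct DESC NULLS LAST) AS rn
    FROM   daily
)
SELECT mon, dy, max_daily, yr AS year_max_daily
FROM   ranked
WHERE  rn = 1
ORDER  BY mon, dy;
```

With this toy data the query should report 1990 as the peak year for July 25 and 1989 for July 26.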

I have not tried to run the query, because the question has no sample data, so it may contain some typos.

WITH
CTE_Dates
AS
(
SELECT
d::date AS dt
,EXTRACT(MONTH FROM d::date) AS dtMonth
,EXTRACT(DAY FROM d::date) AS dtDay
,EXTRACT(YEAR FROM d::date) AS dtYear
FROM
generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
-- full range of possible dates
)
,CTE_DailyCounts
AS
(
SELECT
time::date AS dt
,count(*) AS ct
FROM
counties c
INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY time::date
)
,CTE_FullStats
AS
(
SELECT
CTE_Dates.dt
,CTE_Dates.dtMonth
,CTE_Dates.dtDay
,CTE_Dates.dtYear
,CTE_DailyCounts.ct
,SUM(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS total_count
,MAX(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS max_daily
,SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) AS nonzero_day_count
,COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS years_count
,100.0 * SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) 
/ COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS percent_years_count_not_zero
-- NULLS LAST: with DESC, PostgreSQL sorts NULLs (years without rows) first by default
,ROW_NUMBER() OVER (PARTITION BY dtMonth, dtDay ORDER BY CTE_DailyCounts.ct DESC NULLS LAST) AS rn
FROM
CTE_Dates
LEFT JOIN CTE_DailyCounts ON CTE_DailyCounts.dt = CTE_Dates.dt
)
SELECT
dtMonth
,dtDay
,total_count
,max_daily
,CASE WHEN max_daily IS NOT NULL THEN dtYear END AS year_max_daily -- blank when no year had a count for this day
,percent_years_count_not_zero
FROM
CTE_FullStats
WHERE
rn = 1
ORDER BY
dtMonth
,dtDay
;
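The remaining statistics (minimum, average, standard deviation, percentiles) could be added along the following lines. This is a sketch under assumptions, not a tested extension of the query above: the `daily` CTE and its values are invented for illustration. `percentile_cont` is an ordered-set aggregate (available since PostgreSQL 9.4) and cannot be used with `OVER`, so the sketch groups by (month, day) instead of using window functions.

```sql
-- Toy input, one row per (year, month, day); ct is assumed to be already
-- COALESCEd to 0 for years without strikes. All values are invented.
WITH daily(yr, mon, dy, ct) AS (
    VALUES (1988, 7, 25, 0)
         , (1989, 7, 25, 15)
         , (1990, 7, 25, 1000)
         , (1991, 7, 25, 900)
         , (1988, 7, 26, 0)
         , (1989, 7, 26, 120)
         , (1990, 7, 26, 10)
         , (1991, 7, 26, 0)
)
SELECT mon
     , dy
     , min(ct)                   AS min_daily
     , round(avg(ct), 1)         AS avg_daily
     , round(stddev_samp(ct), 1) AS stddev_daily
       -- ordered-set aggregates: one value per (mon, dy) group
     , percentile_cont(0.10) WITHIN GROUP (ORDER BY ct) AS p10_daily
     , percentile_cont(0.25) WITHIN GROUP (ORDER BY ct) AS p25_daily
     , percentile_cont(0.50) WITHIN GROUP (ORDER BY ct) AS p50_daily
     , percentile_cont(0.75) WITHIN GROUP (ORDER BY ct) AS p75_daily
     , percentile_cont(0.90) WITHIN GROUP (ORDER BY ct) AS p90_daily
FROM   daily
GROUP  BY mon, dy
ORDER  BY mon, dy;
```

In the query above, the input for such an aggregation would be CTE_Dates LEFT JOIN CTE_DailyCounts with `COALESCE(ct, 0)`, grouped by dtMonth and dtDay, so that years without any rows count as zero rather than being skipped.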
