Assigning an age to a person based on known population averages, without a date of birth



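I want to use Postgres SQL to assign age categories to a list of households where we do not know the date of birth of any household member.

The dataset looks like this: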

household_id | household_size
-------------+---------------
           1 |              5
           2 |              1
           3 |              8
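
For anyone who wants to reproduce this, here is a minimal setup sketch. The table name test_area and the three quoted count columns are taken from the UPDATE statement in the first answer; the column types and the primary key are assumptions, since the question does not show the DDL.

```sql
-- Assumed schema: names come from the UPDATE in the answer, types are guesses.
-- bigint is used for the counters because count(*) returns bigint.
CREATE TABLE test_area (
    household_id   integer PRIMARY KEY,
    household_size integer NOT NULL,
    "0-18"         bigint,
    "19-30"        bigint,
    "31-100"       bigint
);

-- The three households from the dataset above.
INSERT INTO test_area (household_id, household_size)
VALUES (1, 5), (2, 1), (3, 8);
```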

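I did not want to post a link-only answer, so I thought I would give it a shot and see whether I could simplify depesz's weighted_random down to plain SQL. The result is this slower, less readable, worse version, but it is shorter and simpler SQL: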

CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
    select choice
    from (
        -- keep a choice only once the running total of weights reaches the hit
        select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
                    then choice end as choice
        from ( select unnest(p_choices) as choice,
                      unnest(p_weights) as weight ) inputs,
             -- one random point between 0 and the total weight
             ( select sum(weight)*random() as hit
               from unnest(p_weights) a(weight) ) as random_hit
    ) chances
    -- the first choice that survived the filter is the draw
    where choice is not null
    limit 1
$$;

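Because of the aggregate and window function calls, the function is not inlined. It would be faster if you assumed the weights are plain probabilities that sum to 1.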
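
A rough sketch of that variant, assuming the caller guarantees that the weights already sum to 1, so the extra pass that totals them can be dropped. The name weighted_random_unit is hypothetical, the speed-up has not been benchmarked here, and the window function means this version is still not inlined.

```sql
-- Hypothetical variant: assumes p_weights sums to 1 (not checked).
CREATE FUNCTION weighted_random_unit(IN p_choices ANYARRAY, IN p_weights float8[])
RETURNS ANYELEMENT LANGUAGE sql AS $$
    SELECT choice
    FROM (
        -- running total of the probabilities, in array order
        SELECT choice, ord,
               sum(weight) OVER (ORDER BY ord) AS cumulative
        FROM unnest(p_choices, p_weights) WITH ORDINALITY AS inputs(choice, weight, ord)
    ) running,
    -- random() is in [0, 1), so no scaling by the total is needed
    (SELECT random() AS hit) AS random_hit
    WHERE cumulative >= hit
    ORDER BY ord
    LIMIT 1
$$;

-- Example call with probabilities instead of percentages:
SELECT weighted_random_unit('{0-18,19-30,31-100}'::text[], '{0.3,0.4,0.3}'::float8[]);
```

One caveat with this shortcut: if floating-point rounding leaves the final running total slightly below 1, a hit above that total matches no row and the function returns NULL.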

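The principle is that you supply an array of arbitrary choices and an array of weights of the same length (the weights can be percentages but do not have to be, and they do not have to sum to any particular number):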

update test_area t
set ("0-18",
     "19-30",
     "31-100")
=   (with cte AS (
         select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
                as age_group
         from generate_series(1,household_size,1))
     select count(*) filter (where age_group='0-18')   as "0-18",
            count(*) filter (where age_group='19-30')  as "19-30",
            count(*) filter (where age_group='31-100') as "31-100"
     from cte)
returning *;

The online demo shows that both his version and mine are statistically reliable.
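
A check of that kind can be sketched as follows, assuming the weighted_random function defined above already exists. The sample size of 100000 is an arbitrary choice; with weights 30, 40 and 30 the observed shares should land close to those percentages, though the exact figures differ on every run.

```sql
-- Draw many samples and compare the observed share of each age group
-- with the weights that were passed in.
SELECT age_group,
       count(*) AS draws,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM (
    SELECT weighted_random('{0-18,19-30,31-100}'::text[],
                           '{30,40,30}'::float8[]) AS age_group
    FROM generate_series(1, 100000)
) samples
GROUP BY age_group
ORDER BY age_group;
```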

A minimal start could be:

SELECT
    household_id,
    MIN(household_size) as size,
    ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
    ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
    ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
    SELECT
        h.household_id,
        h.household_size,
        p.agegroup_from,
        p.percentage/100.0 * h.household_size as g
    FROM households h
    CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;

Output:

household_id | size | g1  | g2  | g3
-------------+------+-----+-----+----
           1 |    5 | 1.5 | 2.0 | 1.5
           2 |    1 | 0.3 | 0.4 | 0.3
           3 |    8 | 2.4 | 3.2 | 2.4
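
The query reads from two tables whose definitions are not shown. A setup sketch that reproduces the output above might look like this. The table and column names households, PercPerAge, agegroup_from and percentage come from the query; the agegroup_to column, the column types, and the 30/40/30 percentages are assumptions, the latter inferred from the output (for example 1.5 out of 5 is 30%).

```sql
-- Assumed definition of the households table used by the query.
CREATE TABLE households (
    household_id   integer PRIMARY KEY,
    household_size integer NOT NULL
);
INSERT INTO households VALUES (1, 5), (2, 1), (3, 8);

-- Assumed definition of the population shares per age group.
CREATE TABLE PercPerAge (
    agegroup_from integer PRIMARY KEY,
    agegroup_to   integer NOT NULL,   -- assumed column, not used by the query
    percentage    numeric NOT NULL    -- share of the population, in percent
);
INSERT INTO PercPerAge VALUES (0, 18, 30), (19, 30, 40), (31, 100, 30);
```

Note that g1, g2 and g3 are expected values (household size times population share), not whole-person counts, so they still need rounding or random assignment to become an actual allocation.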
