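I want to use Postgres SQL to assign age categories to a list of households, where we don't know the birth dates of any of the household members.
The dataset looks like this: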
household_id | household_size
---|---
x1 | 5
x2 | 1
x3 | 8
… | …
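I didn't want to post an answer that was nothing but a link, so I thought I'd give it a shot and see whether I could simplify depesz's weighted_random
into plain SQL. The result is this slower, less readable, generally worse version, but it is shorter and written in simple SQL: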
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
select choice
from
( select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
then choice end as choice
from ( select unnest(p_choices) as choice,
unnest(p_weights) as weight ) inputs,
( select sum(weight)*random() as hit
from unnest(p_weights) a(weight) ) as random_hit
) chances
where choice is not null
limit 1
$$;
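Because of the aggregate and window function calls, it is not inlined. It would be faster if you assumed that the weights are always probabilities that sum to 1.
The principle is that you supply an array of arbitrary choices and an array of weights of the same length (the weights can be percentages, but they don't have to be, nor do they have to sum to any particular number):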
update test_area t
set ("0-18",
"19-30",
"31-100")
= (with cte AS (
select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
as age_group
from generate_series(1,household_size,1))
select count(*) filter (where age_group='0-18') as "0-18",
count(*) filter (where age_group='19-30') as "19-30",
count(*) filter (where age_group='31-100') as "31-100"
from cte)
returning *;
The online demo shows that both his version and mine are statistically reliable.
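As a rough way to check that claim yourself, here is a minimal sketch that assumes only the weighted_random function defined above. It draws 10,000 samples (an arbitrary sample size) with the same 30/40/30 weights and reports the observed share of each age group; the aliases samples, draws and pct are just illustrative names.

```sql
-- Draw 10,000 samples and compare the observed shares with the 30/40/30 weights.
-- The sample size is arbitrary; a larger one gives shares closer to the weights.
select age_group,
       count(*) as draws,
       round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
from ( select weighted_random('{0-18,19-30,31-100}'::text[], '{30,40,30}') as age_group
       from generate_series(1, 10000) ) samples
group by age_group
order by age_group;
```

The pct column should land near 30, 40 and 30, with some random variation from run to run.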
A minimal start could be:
SELECT
household_id,
MIN(household_size) as size,
ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT
h.household_id,
h.household_size,
p.agegroup_from,
p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
Output:
household_id | size | g1 | g2 | g3
---|---|---|---|---
x1 | 5 | 1.5 | 2.0 | 1.5
x2 | 1 | 0.3 | 0.4 | 0.3
x3 | 8 | 2.4 | 3.2 | 2.4
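The query above reads from two tables, households and PercPerAge, whose definitions aren't shown. Here is a minimal setup sketch for trying it out. The column types are assumptions, and the 30/40/30 percentages are inferred from the output above (1.5, 2.0 and 1.5 for a household of 5), not taken from the original post.

```sql
-- Hypothetical setup: column types are assumed, not taken from the original post.
create table households (
    household_id   text primary key,
    household_size integer not null
);

create table PercPerAge (
    agegroup_from integer primary key,  -- lower bound of the age group: 0, 19 or 31
    percentage    numeric not null      -- share of the population, in percent
);

insert into households (household_id, household_size)
values ('x1', 5), ('x2', 1), ('x3', 8);

-- Percentages inferred from the output above; replace them with real demographic shares.
insert into PercPerAge (agegroup_from, percentage)
values (0, 30), (19, 40), (31, 30);
```

Note that g1, g2 and g3 are expected values rather than whole person counts, so they would still need rounding or a random draw before they can be stored as headcounts per household.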