假设我在Postgres中有一个表,它存储了一列这样的字符串。
animal |
---|
猫/狗/鸟 |
狗/狮子 |
鸟/狗 |
狗/猫 |
猫/鸟 |
不会比这更简单或更快:
WITH flat AS (
SELECT t.id, a
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
, b AS corr_animal
, COALESCE(xc.ct, 0) AS instance_count
, COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM (
SELECT a.a, b.a AS b, a.ct
FROM ct a, ct b
) x
LEFT JOIN (
SELECT f1.a, f2.a AS b, count(*) AS ct
FROM flat f1
JOIN flat f2 USING (id)
GROUP BY 1,2
) xc USING (a,b)
ORDER BY a, instance_count DESC;
db<>小提琴在这里
产生您想要的结果,除了...
- 添加了一致的排序顺序
- 正确舍入
这假定源数据中每行都有不同的动物。否则不清楚如何准确计算同一行中的同一只动物......
一步一步地设置
CTEflat
附加任意行号作为唯一id
。(如果您有PRIMARY KEY
,请改用它并跳过子查询t
。然后给动物筑巢,每排养一只宠物(id
)。
CTEct
获得不同动物及其总数的列表。
外SELECT
在子查询x
中构建动物对(a
/b
)的完整栅格,加上a
的总数。LEFT JOIN
子查询xc
中的实际对计数。需要两个步骤来保留结果中从未相遇的对。最后,巧妙地计算和舍入"相关性"。看:
- 查找文本块中单词/短语中的字符百分比
更新的任务
如果你不需要从未见过面的配对,也不需要与自己配对,这可能是你的查询:
-- updated task excluding pairs that never met and same pairing with self
WITH flat AS (
SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
SELECT f1.a AS base_animal
, f1.ct AS base_count
, f2.a AS corr_animal
, count(*) AS instance_count
, round(count(*) * 100.0 / f1.ct) AS correlation
FROM flat f1
JOIN flat f2 USING (id)
JOIN (SELECT a, count(*) AS ct FROM flat GROUP BY 1) ct ON ct.a = f1.a
WHERE f1.a <> f2.a -- exclude pairing with self
GROUP BY f1.a, f1.ct, f2.a
ORDER BY f1.a, instance_count DESC;
db<>小提琴在这里
我将基础动物的总出现次数添加为base_count
。
最值得注意的是,我删除了额外的 CTEct
,并使用窗口功能从第一个 CTE 获取base_count
。这本身的成本大致相同,但我们不需要在外部查询中再连接,这总体上应该更便宜.
如果你包含与 self 的对,你仍然可以使用它。检查小提琴。
哦,我们不再需要COALESCE
了。
的想法是将数据拆分为行(使用unnest(string_to_array())
),然后交叉连接以获得所有排列。
with data1 as (
select *
from corr_tab), data2 as (
select distinct un as base_animal, x.correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un,
(select distinct un as correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal,
(case
when
data2.base_animal = data2.correlated_animal
then
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case
when
data2.base_animal = data2.correlated_animal
then
100
else
ceil(
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL) * 100 /
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal
请参阅此处的小提琴。