计算两个单词之间的相关性



假设我在Postgres中有一个表,它存储了一列这样的字符串。

animal
猫/狗/鸟
狗/狮子
鸟/狗
狗/猫
猫/鸟

不会比这更简单或更快:

WITH flat AS (
SELECT t.id, a
FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
, b AS corr_animal
, COALESCE(xc.ct, 0) AS instance_count
, COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM  (
SELECT a.a, b.a AS b, a.ct
FROM   ct a, ct b
) x
LEFT   JOIN (
SELECT f1.a, f2.a AS b, count(*) AS ct
FROM   flat f1
JOIN   flat f2 USING (id)
GROUP  BY 1,2
) xc USING (a,b)
ORDER  BY a, instance_count DESC;

db<>小提琴在这里

产生您想要的结果,除了...

  1. 添加了一致的排序顺序
  2. 正确舍入

这假定源数据中每行都有不同的动物。否则不清楚如何准确计算同一行中的同一只动物......

一步

一步地设置

CTEflat附加任意行号作为唯一id。(如果您有PRIMARY KEY,请改用它并跳过子查询t。然后给动物筑巢,每排养一只宠物(id)。

CTEct获得不同动物及其总数的列表。

SELECT在子查询x中构建动物对(a/b)的完整栅格,加上a的总数。LEFT JOIN子查询xc中的实际对计数。需要两个步骤来保留结果中从未相遇的对。最后,巧妙地计算和舍入"相关性"。看:

  • 查找文本块中单词/短语中的字符百分比

更新的任务

如果你不需要从未见过面的配对,也不需要与自己配对,这可能是你的查询:

-- updated task excluding pairs that never met and same pairing with self
WITH flat AS (
SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
SELECT f1.a AS base_animal
, f1.ct AS base_count
, f2.a AS corr_animal
, count(*) AS instance_count
, round(count(*) * 100.0 / f1.ct) AS correlation
FROM   flat f1
JOIN   flat f2 USING (id)
JOIN  (SELECT a, count(*) AS ct FROM flat GROUP BY 1) ct ON ct.a = f1.a
WHERE  f1.a <> f2.a  -- exclude pairing with self
GROUP  BY f1.a, f1.ct, f2.a
ORDER  BY f1.a, instance_count DESC;

db<>小提琴在这里

我将基础动物的总出现次数添加为base_count

最值得注意的是,我删除了额外的 CTEct,并使用窗口功能从第一个 CTE 获取base_count。这本身的成本大致相同,但我们不需要在外部查询中再连接,这总体上应该更便宜.
如果你包含与 self 的对,你仍然可以使用它。检查小提琴。

哦,我们不再需要COALESCE了。

的想法是将数据拆分为行(使用unnest(string_to_array())),然后交叉连接以获得所有排列。

with data1 as (
select * 
from corr_tab), data2 as (
select distinct un as base_animal, x.correlated_animal 
from corr_tab, unnest(string_to_array(animal,'/')) un,
(select distinct un as correlated_animal 
from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal, 
(case 
when 
data2.base_animal = data2.correlated_animal
then 
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else 
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case 
when 
data2.base_animal = data2.correlated_animal
then 
100
else 
ceil(
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL) * 100 /
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal

请参阅此处的小提琴。

相关内容

  • 没有找到相关文章

最新更新