如何在Google BigQuery中对与属性记录最接近的匹配项进行排名



我有一个BiqQuery表(表a(,它有大约1000条记录,其中包含一个ID和15个数据点,范围在0-100之间。想象一下,它就像一张顶级王牌,但有15个属性。这里有一个例子:

Record_ID = 0001
Size = 56
Height = 34
Width = 23
Weight = 78
Color = 42
Volume = 8
Density = 77
Smell = 23
Touch = 67
Hearing = 52
Power = 87
Sensitivity = 3
Strength = 78
Endurance = 45
Reliability = 87

我有一个单独的表(表B(,它有完全相同的模式,大约有5000个不同的记录

我需要从表A中提取每个Record_ID,然后以某种方式对表B中所有属性最匹配的记录进行排序。如果我只是试图根据单个属性(如大小(对记录进行排名,那么这将非常容易,但当我试图在所有属性中找到最接近的匹配和排名时,我不知道从哪里开始。

有什么模式或方法可以帮助我实现这一点吗?我一直在读集群和K-表示最近的邻居,但这些似乎没有帮助。

两个表的交叉联接如何。1000乘以1000行将生成一个100万行的表。将其聚合为排名最接近的值。

我们通过减法和取绝对值abs(A.size-B.size)来比较两个表中的值。在下面的例子中,我只选择了两个条目,您可以添加更多。算术权重将是pow(...,2)而不是abs。但是,事先将每个变量归一化到0到1的范围将有助于误导结果;我没有在这里做这件事。

with recursive 
tblA_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id ),
tblB_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id),
tblA as (Select * from tblA_tmp union all select * from tblA where false),
tblB as (Select * from tblB_tmp union all select * from tblB where false) 
SELECT * 
from (
SELECT A.id as id_A,
array_agg(B.id order by abs(A.size-B.size)+abs(A.height-B.height) limit 1)[safe_offset(0)] as id_B,
min(abs(A.size-B.size)+abs(A.height-B.height)) as distance
from tblA A
cross join tblB B
group by 1
)
left join tblA on id_A=tblA.id
left join tblB on id_B=tblB.id

请忽略CTEwith部分。只有使用递归,我才能使示例表A和B永久化,而不是在每个步骤中生成。

考虑以下方法(注意CORR函数的使用(

select a_id, b_id from (
select a.Record_ID a_id, b.Record_ID b_id, (
select corr(x.value, y.value)
from (
select as struct value, col 
from (select * from unnest([a]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) x 
join (
select as struct value, col 
from (select * from unnest([b]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) y
using(col) 
) a_b_corr
from tableA a
cross join tableB b
)
qualify 1 = row_number() over(partition by a_id order by a_b_corr desc)

作为改进方向-您可以将所有这些取消平移移到from tableA a cross join tableB b

最新更新