我有一个BiqQuery表(表a(,它有大约1000条记录,其中包含一个ID和15个数据点,范围在0-100之间。想象一下,它就像一张顶级王牌,但有15个属性。这里有一个例子:
Record_ID = 0001
Size = 56
Height = 34
Width = 23
Weight = 78
Color = 42
Volume = 8
Density = 77
Smell = 23
Touch = 67
Hearing = 52
Power = 87
Sensitivity = 3
Strength = 78
Endurance = 45
Reliability = 87
我有一个单独的表(表B(,它有完全相同的模式,大约有5000个不同的记录
我需要从表A中提取每个Record_ID,然后以某种方式对表B中所有属性最匹配的记录进行排序。如果我只是试图根据单个属性(如大小(对记录进行排名,那么这将非常容易,但当我试图在所有属性中找到最接近的匹配和排名时,我不知道从哪里开始。
有什么模式或方法可以帮助我实现这一点吗?我一直在读集群和K-表示最近的邻居,但这些似乎没有帮助。
两个表的交叉联接如何。1000乘以1000行将生成一个100万行的表。将其聚合为排名最接近的值。
我们通过减法和取绝对值abs(A.size-B.size)
来比较两个表中的值。在下面的例子中,我只选择了两个条目,您可以添加更多。算术权重将是pow(...,2)
而不是abs。但是,事先将每个变量归一化到0到1的范围将有助于误导结果;我没有在这里做这件事。
with recursive
tblA_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id ),
tblB_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id),
tblA as (Select * from tblA_tmp union all select * from tblA where false),
tblB as (Select * from tblB_tmp union all select * from tblB where false)
SELECT *
from (
SELECT A.id as id_A,
array_agg(B.id order by abs(A.size-B.size)+abs(A.height-B.height) limit 1)[safe_offset(0)] as id_B,
min(abs(A.size-B.size)+abs(A.height-B.height)) as distance
from tblA A
cross join tblB B
group by 1
)
left join tblA on id_A=tblA.id
left join tblB on id_B=tblB.id
请忽略CTEwith
部分。只有使用递归,我才能使示例表A和B永久化,而不是在每个步骤中生成。
考虑以下方法(注意CORR函数的使用(
select a_id, b_id from (
select a.Record_ID a_id, b.Record_ID b_id, (
select corr(x.value, y.value)
from (
select as struct value, col
from (select * from unnest([a]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) x
join (
select as struct value, col
from (select * from unnest([b]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) y
using(col)
) a_b_corr
from tableA a
cross join tableB b
)
qualify 1 = row_number() over(partition by a_id order by a_b_corr desc)
作为改进方向-您可以将所有这些取消平移移到from tableA a cross join tableB b