根据匹配的客户字段(电话、电子邮件、地址)为客户分配一个household_id——重复的问题



我有一个客户表,其中每个客户都有一个唯一的id,这取决于他们下订单的电子邮件。此外,phoneemailaddress还有单独的表列。如果客户ID的phoneemailaddress与另一个"匹配",则我试图在相同的household_id下将客户ID分组在一起;客户";在客户表中。

我遇到的问题是,我可以将客户分组在一起,并给他们一个household_id,但我很难在筛选中完全删除这些客户分组的重复出现。下面查询中的最后两条注释旨在帮助解释我当前的过滤逻辑,并说明它的失败之处。这种逻辑适用于成对的客户,但一旦3个或更多的客户需要绑定到同一个单一的household_id,就会开始失败。有没有更好的方法可以过滤这些结果,或者我需要添加一些额外的CTE,使用min()/max()函数和其他类型的分组来在这里添加更多的智能?除了rank(),还有其他聪明的窗口函数可以帮助我吗?

with household as (
select
c1.id as parent_id,
c2.id as child_id,
rank() over (partition by c1.id order by c2.id) as child_number
-- order by clause is important here to ensure lowest c2.id is always rank 1 (referenced later on in household join onto customer table)

from customer c1
left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)

order by c1.id, child_number
)

select
'H-' || h.parent_id as household_id, -- effectively creates a unique household_id
h.child_id

from household h
where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
-- ^this where clause is my attempt at removing the duplicate groupings of customers
-- it works in the instance when there is a pair of customers tied to a household_id, but when there are 3 or more it starts to fail

请参阅链接的图片以查看家庭cte的视图,查看由3个客户id组成的组,因为他们有匹配的电话、电子邮件或地址而连接在一起。突出显示的行将通过上面查询的where子句中的过滤器
我的查询如何失败

我使用与评论w/Julius:中讨论的方法类似的方法解决了这个问题

  1. 使用递归ctegrp将所有类似的客户分组到阵列中
  2. 使用householdcte中的select distinct oncardinality()函数,将递归grp修剪为仅包含所有家庭成员的阵列
  3. 扫描household表,并通过取消测试linked_customers数组并根据数组中的最小id将其所有元素绑定到单个household_id来删除任何重复项

这似乎运行得很好,但我相信这个查询可以进一步简化,我很乐意接受任何反馈!

with recursive grp as (
select
c1.id,
c1.email,
c1.phone,
c1.address,
array[c1.id] as linked_customers -- initializes an array based of the id of every customer
from customer c1

union all

select
c2.id,
c2.email,
c2.phone,
c2.address,
c2.id || linked_customers
from grp g
join customer c2 on (g.email = c2.email or g.phone = c2.phone or g.address = c2.address)
where c2.id <> all(linked_customers) -- ensures the same customer id that already was used in the array initialization is not being looked at again
), -- creates several similar groups as well as intermediary groups of customers
household as (
select
distinct on (g.id) g.linked_customers
from grp g
order by g.id, cardinality(linked_customers) desc
) -- extracts largest array for each customer_id (still duplicate groupings here), but only the max length arrays are being pulled out (all household members)
select
distinct on (p.parent_id) 'H-' || parent_id as household_id,
unnest(p.linked_customers) as child_id
from (
select
min(parent_id) as parent_id, -- pulls out the minimum id of each linked customers group which will remove the creation of multiple household_ids for the same customer groups in the select clause above
h.linked_customers
from household h, unnest(h.linked_customers) parent_id
group by h.linked_customers
) p
order by parent_id

最新更新