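Looking for some help structuring a query. I have a table with rows of link timestamp, user_id, linked_id, and type_of_link. The link types are, for example, "email" and "phone number", so in the example below you can see that user 1 is not directly connected to user 3, but is connected via user 2. Another complication is that every "linked account" also appears in r1, so each link is stored in both directions and there are several "duplicate" rows (in the example: rows 1+2 and rows 3+4).
Example: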
Link time          user id   linked_id   link type
---------------------------------------------------
link_occurred_at   user 1    user 2      link a
link_occurred_at   user 2    user 1      link a
link_occurred_at   user 2    user 3      link b
link_occurred_at   user 3    user 2      link b
link_occurred_at   user 4    user 5      link a
link_occurred_at   user 5    user 4      link a
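What functions can I use to get the first user id, the count of all (direct + indirect) linked accounts, and possibly an array of the linked account ids?
For example, the output I want here is: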
initial user   count of linked accounts   array of linked accounts
--------------------------------------------------------------
user 1         2 linked accounts          [user 2, user 3]
user 4         1 linked account           [user 5]
This would give me a mutually exclusive grouping of all linked customer networks.
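I didn't know about recursive CTEs until Erwin Brandstetter mentioned them in the comments above. The concept is just what the name suggests: a CTE that references itself and has a base case so that the recursion terminates. For your problem, a recursive CTE solution might look something like this: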
WITH RECURSIVE accumulate_users (user_id, linked_accounts) AS (
  -- Base case: the direct links from a user_id.
  SELECT
    user_id,
    ARRAY_AGG(linked_id) AS linked_accounts
  FROM your_table
  GROUP BY user_id
  UNION ALL
  -- Recursive case: transitively linked accounts.
  -- The column list must match the base case, so user_id is carried along.
  SELECT
    accumulate_users.user_id,
    ARRAY_UNION(
      accumulate_users.linked_accounts,
      ARRAY_AGG(DISTINCT your_table.linked_id)
    ) AS linked_accounts
  FROM accumulate_users
  JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
  GROUP BY accumulate_users.user_id, accumulate_users.linked_accounts
  -- But there is no enforced termination condition, hopefully it just
  -- ends at some point? This is part of why implementing recursive CTEs
  -- is challenging, I think.
)
SELECT
  user_id,
  CARDINALITY(linked_accounts) AS count_linked_accounts,
  linked_accounts
FROM accumulate_users
However, I can't test this query because, as detailed in another Stack Overflow Q&A, Presto doesn't support recursive CTEs.
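On an engine that does support recursive CTEs, the idea can be tried out end to end. Below is a minimal sketch using SQLite through Python's sqlite3 module, not Presto. The table name your_table and the sample rows come from the question; the CTE name reachable is made up for illustration. It accumulates (user, reachable account) pairs rather than arrays, and it relies on UNION (instead of UNION ALL) to drop pairs that were already produced, which is what makes the recursion stop.

```python
import sqlite3

# Sample rows from the question: every link is stored in both directions.
rows = [
    ("user 1", "user 2", "link a"),
    ("user 2", "user 1", "link a"),
    ("user 2", "user 3", "link b"),
    ("user 3", "user 2", "link b"),
    ("user 4", "user 5", "link a"),
    ("user 5", "user 4", "link a"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (user_id TEXT, linked_id TEXT, link_type TEXT)")
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)

# `reachable` is a hypothetical name. UNION discards pairs that already exist,
# so the recursion ends once no new (user, account) pair is found.
query = """
WITH RECURSIVE reachable(user_id, linked_id) AS (
    SELECT user_id, linked_id FROM your_table
    UNION
    SELECT r.user_id, t.linked_id
    FROM reachable AS r
    JOIN your_table AS t ON t.user_id = r.linked_id
    WHERE t.linked_id <> r.user_id  -- a user is not linked to itself
)
SELECT user_id, linked_id FROM reachable ORDER BY user_id, linked_id
"""

# Collect the pairs into one list of linked accounts per user.
linked = {}
for user_id, linked_id in conn.execute(query):
    linked.setdefault(user_id, []).append(linked_id)

for user_id, accounts in linked.items():
    print(user_id, len(accounts), accounts)
```

This prints one row per account (so user 2 and user 3 get their own rows as well as user 1). To get the mutually exclusive grouping from the question, one option is to keep only the rows whose user_id sorts before every account in its list.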
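By repeatedly joining back to the table you have, you can traverse an arbitrary but finite number of links. Something like this, where I've included the individual per-degree link columns just for clarity: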
SELECT
  yt1.user_id,
  -- yt2.user_id = yt1.linked_id, i.e. the accounts one link away.
  ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
  ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
  ARRAY_AGG(DISTINCT yt3.linked_id) AS third_degree_links,
  ARRAY_UNION(
    ARRAY_AGG(DISTINCT yt2.user_id),
    ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_id))
  ) AS up_to_third_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_id = yt2.user_id
JOIN your_table AS yt3 ON yt2.linked_id = yt3.user_id
GROUP BY yt1.user_id
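To make the "arbitrary but finite" limitation concrete, here is a small Python sketch that is loosely analogous to the repeated self-join: each loop iteration plays the role of one more join. The function name links_up_to_depth and the five-account chain are made up for illustration and are not from the question.

```python
from collections import defaultdict


def links_up_to_depth(pairs, start, max_depth):
    """Collect accounts reachable from `start` in at most `max_depth` hops."""
    neighbours = defaultdict(set)
    for user_id, linked_id in pairs:
        neighbours[user_id].add(linked_id)

    seen = {start}
    frontier = {start}
    for _ in range(max_depth):
        # One iteration corresponds to one additional self-join.
        frontier = {n for u in frontier for n in neighbours[u]} - seen
        if not frontier:
            break
        seen |= frontier
    return sorted(seen - {start})


# A hypothetical chain 1 - 2 - 3 - 4 - 5, stored in both directions.
chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
chain += [(b, a) for a, b in chain]

print(links_up_to_depth(chain, 1, 3))  # accounts within three hops of account 1
print(links_up_to_depth(chain, 1, 4))  # one more hop is needed to reach account 5
```

With a depth of three, account 5 is missed, which is the same thing that happens when the query has too few joins for the longest chain in the data.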
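I've been working with a similar dataset, although I have the raw identifiers as part of the original dataset, in other words the "email" and "phone number" in your example. I found it helpful to create a table that groups user ids by the identifier that connects them: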
CREATE TABLE email_connections AS
SELECT
email,
ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email
Then the same arbitrary-but-finite-depth set of links can be computed by looking for intersections between the user arrays:
SELECT
  3764350 AS user_id,
  -- ARRAY_DISTINCT drops users that are reached along more than one join path,
  -- so the count below is not inflated by repeats.
  ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS all_users,
  CARDINALITY(ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1
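Computing links to arbitrary depth is a good use case for graph database technologies such as Neo4j or JanusGraph. That's what I'm now looking into to solve this "user linking" problem.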
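Underneath, the grouping being asked for is the set of connected components of the link graph. If the links fit in memory, that can also be computed outside the database. The following is a minimal union-find sketch in plain Python, not a Neo4j or JanusGraph query; the function name linked_groups is made up for illustration, and the pairs are the (user_id, linked_id) columns from the question's example.

```python
def linked_groups(pairs):
    """Group accounts into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so the "initial user" is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Report each group as: initial user -> the other accounts linked to it.
    return {
        root: sorted(m for m in members if m != root)
        for root, members in groups.items()
    }


pairs = [
    ("user 1", "user 2"), ("user 2", "user 1"),
    ("user 2", "user 3"), ("user 3", "user 2"),
    ("user 4", "user 5"), ("user 5", "user 4"),
]

for initial_user, accounts in linked_groups(pairs).items():
    print(initial_user, len(accounts), accounts)
```

Because every account ends up in exactly one group, the result is mutually exclusive by construction, and the duplicate reverse-direction rows are harmless.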