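I want to work out the new visitors for each day from the list of everyone who visited that day. The only data I have right now is the first two columns, so I need to derive the last two columns from them.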
Date | Visitors_Today | New_Visitors | All_Visitors_To_Date
---|---|---|---
Dec 6 | Allie, Jon | Allie, Jon | Allie, Jon
Dec 7 | Allie, Jon, Zach | Zach | Allie, Jon, Zach
Dec 8 | Barb, Jon | Barb | Allie, Barb, Jon, Zach
Dec 9 | Janet, Zach | Janet | Allie, Barb, Janet, Jon, Zach
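To make the target concrete, the two derived columns can be sketched outside SQL as a running set of names already seen. This is only an illustration in Python, not part of any answer; the function name `derive_columns` and the in-memory `rows` list are assumptions made for the example.

```python
def derive_columns(rows):
    """For each (date, visitors_today) pair, given in date order, return
    (date, new_visitors, all_visitors_to_date) with sorted name lists."""
    seen = set()  # everyone who visited on an earlier day
    result = []
    for date, visitors_today in rows:
        today = set(visitors_today)
        new_visitors = today - seen  # present today, never seen before
        seen |= today                # running union of all visitors so far
        result.append((date, sorted(new_visitors), sorted(seen)))
    return result


# Sample data from the question, already in date order.
rows = [
    ("Dec 6", ["Allie", "Jon"]),
    ("Dec 7", ["Allie", "Jon", "Zach"]),
    ("Dec 8", ["Barb", "Jon"]),
    ("Dec 9", ["Janet", "Zach"]),
]

for date, new_visitors, all_to_date in derive_columns(rows):
    print(date, new_visitors, all_to_date)
```

Any SQL solution has to reproduce the same two steps: a set difference against earlier days, and a cumulative union up to and including the current day.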
You can use a window function with array aggregation (with ARRAY_JOIN removed from the visitor_names CTE, so the visitors stay as arrays):
-- sample data
with dataset(date, visitors_today) as (
    values ('Dec 6', array['Allie', 'Jon']),
           ('Dec 7', array['Allie', 'Jon', 'Zach']),
           ('Dec 8', array['Barb', 'Jon']),
           ('Dec 9', array['Janet', 'Zach'])
)
-- query
select date,
       visitors_today,
       array_distinct(visitors_today || prev_visitors) all_visitors_to_date,
       array_except(visitors_today, prev_visitors) new_visitors
from (
    select *,
           coalesce(
               flatten(array_distinct(array_agg(visitors_today)
                   over (order by date rows between UNBOUNDED PRECEDING and 1 PRECEDING))),
               array[]) as prev_visitors -- combine all visitors before today into non null array
    from dataset);
Output:
date | visitors_today | all_visitors_to_date | new_visitors
---|---|---|---
Dec 6 | [Allie, Jon] | [Allie, Jon] | [Allie, Jon]
Dec 7 | [Allie, Jon, Zach] | [Allie, Jon, Zach] | [Zach]
Dec 8 | [Barb, Jon] | [Barb, Jon, Allie, Zach] | [Barb]
Dec 9 | [Janet, Zach] | [Janet, Zach, Allie, Jon, Barb] | [Janet]
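One caveat about the sample data: `date` is a plain string there, so `order by date` sorts lexicographically. That happens to work for 'Dec 6' through 'Dec 9', but a label such as 'Dec 10' would sort before 'Dec 6'. The Python sketch below only illustrates the effect; the `parse_day` helper and the year 2022 are assumptions made for the example, and `%b` relies on an English locale. With real data, ordering by a proper date column avoids the issue.

```python
from datetime import datetime

labels = ["Dec 6", "Dec 7", "Dec 8", "Dec 9", "Dec 10"]

# Lexicographic order, which is what ordering by a string column gives.
as_strings = sorted(labels)


def parse_day(label, year=2022):
    """Turn a label such as 'Dec 6' into a real date (the year is assumed)."""
    return datetime.strptime(f"{label} {year}", "%b %d %Y").date()


# Chronological order after parsing the labels into dates.
as_dates = sorted(labels, key=parse_day)

print(as_strings)  # 'Dec 10' comes first
print(as_dates)    # 'Dec 10' comes last
```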
You can flatten the arrays and run a new aggregation in a subquery:
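select t.*,
       -- new visitors: today's names minus every name seen on an earlier date
       (select array_join(array_agg(t1.v), ', ')
        from (select v from unnest(t.visitors_today) v
              except
              select v from tbl t2 cross join unnest(t2.visitors_today) v
              where t2.date < t.date) t1) as new_visitors,
       -- all visitors to date: distinct names from this date and all earlier ones
       (select array_join(array_distinct(array_agg(v)), ', ')
        from tbl t1 cross join unnest(t1.visitors_today) v
        where t1.date <= t.date) as all_visitors_to_date
from tbl t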
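See the fiddle (a demo of the query above in Postgres).
Starting from the normalized table (one row per date and visitor), you can rank each visitor's visits by date, so that the first visit gets rank 1. Then aggregate per date, keeping the rank-1 names as that day's new visitors, and apply a window function over the new-visitors field to get the cumulative list of visitors to date.
If you want strings (fiddle):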
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Visitors_Today ORDER BY Date_) AS rn
    FROM tab
), cte2 AS (
    SELECT Date_,
           STRING_AGG(Visitors_Today, ', ') AS Visitors_Today,
           STRING_AGG(Visitors_Today, ', ') FILTER (WHERE rn = 1) AS New_Visitors
    FROM cte
    GROUP BY Date_
)
SELECT *,
       STRING_AGG(New_Visitors, ', ') OVER(ORDER BY Date_) AS All_Visitors_To_Date
FROM cte2
If you want arrays (fiddle):
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Visitors_Today ORDER BY Date_) AS rn
    FROM tab
), cte2 AS (
    SELECT Date_,
           ARRAY_AGG(Visitors_Today) AS Visitors_Today,
           ARRAY_AGG(Visitors_Today) FILTER (WHERE rn = 1) AS New_Visitors
    FROM cte
    GROUP BY Date_
)
SELECT DISTINCT Date_, Visitors_Today, New_Visitors,
       ARRAY_AGG(elements) OVER(ORDER BY Date_) AS All_Visitors_To_Date
FROM cte2, UNNEST(New_Visitors) AS elements
ORDER BY Date_
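The rank-based idea can be tried locally without a Presto or Postgres server. The sketch below uses Python's built-in `sqlite3` module and assumes a SQLite build with window functions (3.25 or later). The table name `tab`, the column names `date_` and `visitor`, and the ISO-formatted dates are assumptions made for the example. It first normalizes the question's arrays into one row per date and visitor, then keeps the rows with rank 1.

```python
import sqlite3

# The question's data: one array of visitor names per date (dates are assumed).
rows = [
    ("2022-12-06", ["Allie", "Jon"]),
    ("2022-12-07", ["Allie", "Jon", "Zach"]),
    ("2022-12-08", ["Barb", "Jon"]),
    ("2022-12-09", ["Janet", "Zach"]),
]

# Normalize: one (date, visitor) row per array element.
normalized = [(date_, name) for date_, names in rows for name in names]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (date_ TEXT, visitor TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?)", normalized)

# Rank each visitor's visits by date; rank 1 marks the first visit.
query = """
WITH ranked AS (
    SELECT date_, visitor,
           ROW_NUMBER() OVER (PARTITION BY visitor ORDER BY date_) AS rn
    FROM tab
)
SELECT date_, visitor
FROM ranked
WHERE rn = 1
ORDER BY date_, visitor
"""

# Group the first visits by date to get each day's new visitors.
new_by_date = {}
for date_, visitor in conn.execute(query):
    new_by_date.setdefault(date_, []).append(visitor)

print(new_by_date)
```

Accumulating these per-date lists in date order then gives the all-visitors-to-date column, which is the role the final window aggregate plays in the queries above.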