Creating columns of new values from a daily field in SQL (Presto)



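I want to work out the new visitors for each day from the full list of visitors on that day. The only data I have right now is the first two columns, so I need to derive the last two columns from them.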

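| Date | Visitors_Today | New_Visitors | All_Visitors_To_Date |
|---|---|---|---|
| Dec 6 | Allie, Jon | Allie, Jon | Allie, Jon |
| Dec 7 | Allie, Jon, Zach | Zach | Allie, Jon, Zach |
| Dec 8 | Barb, Jon | Barb | Allie, Barb, Jon, Zach |
| Dec 9 | Janet, Zach | Janet | Allie, Barb, Janet, Jon, Zach |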

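You can use a window function together with array aggregation (drop the ARRAY_JOIN from your visitor_names CTE so that visitors_today remains an array):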

-- sample data
with dataset(date, visitors_today) as (
    values ('Dec 6', array['Allie', 'Jon']),
           ('Dec 7', array['Allie', 'Jon', 'Zach']),
           ('Dec 8', array['Barb', 'Jon']),
           ('Dec 9', array['Janet', 'Zach'])
)
-- query
select date,
       visitors_today,
       array_distinct(visitors_today || prev_visitors) all_visitors_to_date,
       array_except(visitors_today, prev_visitors) new_visitors
from (
    select *,
           coalesce(
               flatten(array_distinct(array_agg(visitors_today)
                   over (order by date rows between UNBOUNDED PRECEDING and 1 PRECEDING))),
               array[]) as prev_visitors  -- combine all visitors before today into non null array
    from dataset);

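Output:

| date | visitors_today | all_visitors_to_date | new_visitors |
|---|---|---|---|
| Dec 6 | [Allie, Jon] | [Allie, Jon] | [Allie, Jon] |
| Dec 7 | [Allie, Jon, Zach] | [Allie, Jon, Zach] | [Zach] |
| Dec 8 | [Barb, Jon] | [Barb, Jon, Allie, Zach] | [Barb] |
| Dec 9 | [Janet, Zach] | [Janet, Zach, Allie, Jon, Barb] | [Janet] |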
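
As a cross-check on this kind of output, the same running logic can be sketched outside SQL. The Python below is only an illustration under stated assumptions, not part of the query: the function name `visitors_by_day` and the in-memory `rows` list are hypothetical, and the rows are assumed to be sorted by date already. It mirrors the query's steps: collect everyone seen on earlier days, then take the de-duplicated union and the difference.

```python
def visitors_by_day(rows):
    """rows: list of (date, visitors_today) pairs, already sorted by date."""
    prev_visitors = []  # everyone seen on earlier days, in first-seen order
    result = []
    for date, visitors_today in rows:
        # new_visitors: today's visitors that never appeared on an earlier day
        new_visitors = list(
            dict.fromkeys(v for v in visitors_today if v not in prev_visitors)
        )
        # all_visitors_to_date: today's visitors followed by earlier ones, de-duplicated
        all_to_date = list(dict.fromkeys(visitors_today + prev_visitors))
        result.append((date, visitors_today, all_to_date, new_visitors))
        # today's visitors count as "previous" for the following day
        prev_visitors = list(dict.fromkeys(prev_visitors + visitors_today))
    return result


rows = [
    ("Dec 6", ["Allie", "Jon"]),
    ("Dec 7", ["Allie", "Jon", "Zach"]),
    ("Dec 8", ["Barb", "Jon"]),
    ("Dec 9", ["Janet", "Zach"]),
]
for row in visitors_by_day(rows):
    print(row)
```

The element order inside each list follows the concatenation order (today's visitors first), which is why the cumulative column is not alphabetical.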

You can also flatten the arrays and run a fresh aggregation inside correlated subqueries:

select t.*,
       -- visitors seen today but on no earlier date
       (select array_join(array_agg(t1.v), ', ')
        from (select v from unnest(t.visitors_today) as u(v)
              except
              select v from tbl t2 cross join unnest(t2.visitors_today) as u(v)
              where t2.date < t.date) t1) as new_visitors,
       -- distinct visitors seen on or before today
       (select array_join(array_distinct(array_agg(v)), ', ')
        from tbl t1 cross join unnest(t1.visitors_today) as u(v)
        where t1.date <= t.date) as all_visitors_to_date
from tbl t

See the fiddle (a demo of the query above in Postgres).

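Starting from a normalized table (one row per date and visitor), you can rank each visitor's visits by date so that the first visit gets rank 1, aggregate the visitors per date while keeping only rank 1 for the new-visitors field, and then apply a window function over that field to accumulate all visitors to date.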

If you want strings (fiddle):

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Visitors_Today ORDER BY Date_) AS rn
    FROM tab
), cte2 AS (
    SELECT Date_,
           STRING_AGG(Visitors_Today, ', ') AS Visitors_Today,
           STRING_AGG(Visitors_Today, ', ') FILTER (WHERE rn = 1) AS New_Visitors
    FROM cte
    GROUP BY Date_
)
SELECT *,
       STRING_AGG(New_Visitors, ', ') OVER(ORDER BY Date_) AS All_Visitors_To_Date
FROM cte2

If you want arrays (fiddle):

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Visitors_Today ORDER BY Date_) AS rn
    FROM tab
), cte2 AS (
    SELECT Date_,
           ARRAY_AGG(Visitors_Today) AS Visitors_Today,
           ARRAY_AGG(Visitors_Today) FILTER (WHERE rn = 1) AS New_Visitors
    FROM cte
    GROUP BY Date_
)
SELECT DISTINCT Date_, Visitors_Today, New_Visitors,
       ARRAY_AGG(elements) OVER(ORDER BY Date_) AS All_Visitors_To_Date
FROM cte2, UNNEST(New_Visitors) AS elements
ORDER BY Date_
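
Both queries above assume a normalized table `tab` with one row per date and visitor. For readers whose data is still one array per day, the idea can be sketched in Python as follows. This is a rough illustration only: the helper names `normalize` and `rank_visits` are hypothetical, and the dates are assumed to sort correctly as plain strings, which holds for this sample but not in general.

```python
def normalize(daily):
    """Flatten (date, [visitors]) pairs into one (date, visitor) row per visit."""
    return [(date, visitor) for date, visitors in daily for visitor in visitors]


def rank_visits(rows):
    """Attach rn = 1, 2, ... per visitor in date order, like ROW_NUMBER() above."""
    visits_so_far = {}
    ranked = []
    for date, visitor in sorted(rows, key=lambda r: r[0]):
        visits_so_far[visitor] = visits_so_far.get(visitor, 0) + 1
        ranked.append((date, visitor, visits_so_far[visitor]))
    return ranked


daily = [
    ("Dec 6", ["Allie", "Jon"]),
    ("Dec 7", ["Allie", "Jon", "Zach"]),
    ("Dec 8", ["Barb", "Jon"]),
    ("Dec 9", ["Janet", "Zach"]),
]

# Keep only first visits (rn == 1) per date, as the FILTER clause does.
new_by_date = {}
for date, visitor, rn in rank_visits(normalize(daily)):
    new_by_date.setdefault(date, [])
    if rn == 1:
        new_by_date[date].append(visitor)
print(new_by_date)
```

Accumulating `new_by_date` in date order then gives the all-visitors-to-date column, which is what the final window function does in the queries.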