Azure SQL DWH - CTE和随机样本的问题



查询Azure SQL数据仓库(即Azure Synapse)的版本:

Microsoft Azure SQL Data Warehouse - 10.0.15554.0 Dec 10 2020

那么我有一个包含事件和额外数据的表,看起来像这样:

event_id  |  date    | field3    | field4 |.......  
1     |  10/20   | ......    | ...... |.......  
2     |  10/21   | ......    | ...... |.......  
3     |  10/22   | ......    | ...... |.......  
1     |  10/20   | ......    | ...... |.......  

我想获得在表中出现两次的10个事件的示例,因此它应该返回20行。我试图做到这一点与CTE,得到10 event_id出现两次。然后,在SELECT子句中,我可以使用CTE来过滤并获得其余的信息:

-- CTE
-- get sample of 10 event_id
-- that appear twice
WITH SPL_2_ROWS AS 
(SELECT TOP 10 event_id, COUNT(*) AS q_rows
FROM report_table
GROUP BY event_id
HAVING COUNT(*) = 2
ORDER BY NEWID())
-- Main Query
-- get all the information for the 
-- previous 10 event_id
SELECT REP.*
FROM report_table  REP
WHERE event_id IN
(SELECT event_id FROM SPL_2_ROWS)

我遇到的问题是,它返回它看起来像一个随机数行(介于10到30行之间)。由于查询的编写方式,不应该发生这种情况。我尝试更改JOIN的WHERE过滤器,结果相同:

JOIN SPL_2_ROWS  SPL
ON REP.event_id = SPL.event_id

奇怪的是,如果我使用时态表而不是CTE,查询工作得很好。
我在这里错过了什么?

ADITIONAL信息:
表有大约60个字段和6M条记录,没有主键:DDL创建语句类似于:

CREATE TABLE [my_schema].[report_table]
(
[my_date] [datetime2](7) NULL,
[field2] [nvarchar](4000) NULL,
...
...
...
[event_id] [int] NULL,
[client_id] [int] NULL,
[field30] [nvarchar](4000) NULL,
[field31] [nvarchar](4000) NULL,
[field32] [int] NULL,
...
...
...
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
HEAP
)
GO

可以使用窗口函数。但是使用聚合和连接可能会更简单:

select rt.*
from (select event_id, row_number() over (order by newid()) as seqnum
from report_table rt
group by event_id
having count(*) = 2
) e join
report_table rt
on rt.event_id = e.event_id
where seqnum <= 10;

最新更新