使用SQL高效地查询节点和边



我有2个SQL Server表存储网络信息,EF模式是:

public partial class edge
{
public long edge_id { get; set; }
public string source { get; set; }
public string target { get; set; }
public Nullable<System.DateTime> edgedate { get; set; }
}
public partial class node
{
public string node_id { get; set; }
public string name { get; set; }
public string address { get; set; }
}

我正在从UI传递边缘和节点特定的过滤器,以内置到SQL查询中,像这样:

select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
or address = '123 Fake Street'  
select * 
from [dbo].[Edges] 
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59
然而,这些查询必须考虑到整个网络,即节点过滤器必须应用于边缘,反之亦然-

-- nodes example with edge filters applied
select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
or address = '123 Fake Street'  
and node_id in (select source 
from EDGESTEMP 
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59'
union 
select target 
from EDGESTEMP 
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59') 

这在小规模网络上工作得很好,但是如果我处理的是100万条边和50万个节点的网络,那么运行这些查询的性能会受到影响,因为中的

在每个实例中检查另一个表时使用。我已经为查询的所有附属列添加了索引,但是需要知道是否有更有效的方法来做到这一点?

<附加信息/strong>

查询计划-这里

在每个主键上设置集群索引,例如node_id和edge_id,在其余主键上设置非集群索引,例如-

CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
[name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

首先,您的查询似乎有一个逻辑错误:应该在or周围有括号。

其次,UNION ALL通常优于UNION,尽管在半连接中,如INEXISTS,这并不重要

select n.* 
from [dbo].[Nodes] n
where (n.name = 'John Doe' 
or n.address = '123 Fake Street')
and node_id in (
select source
from EDGES 
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59'
union all
select target 
from EDGES
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59'
);

最后,对于这个查询,您可能应该有以下索引

NODES (name) INCLUDE (node_id)
NODES (address) INCLUDE (node_id)
EDGES (edgedate) INCLUDE (source, target)

or条件可能仍然会导致问题,因为您可能仍然会对NODES进行索引扫描。如果是这样,您可能需要重写查询来强制使用索引联合。

select n.* 
from (
select *
from [dbo].[Nodes] n
where n.name = 'John Doe'
union
select *
from [dbo].[Nodes] n
where n.address = '123 Fake Street'
) n
where node_id in (
select source
from EDGES 
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59'
union all
select target 
from EDGES
where edgedate >= '01/12/2020 00:00:00' 
and edgedate <= '01/12/2021 23:59:59'
);

您可以使用exists来进行更有效的查询。

同样,你不应该使用模棱两可的日期字面值。我不确定这些日期是1月12日还是12月1日。另外,对于日期时间范围查询,不应该使用>=和<=,而应该使用>=和<。您可以在代码中看到这些调整:

select * 
from [dbo].[Nodes] n
where (name = 'John Doe' 
or address = '123 Fake Street')  
and exists (select * 
from EDGESTEMP e
where (n.node_id = e.source or n.node_id = e.target)
and e.edgedate >= '20200112' 
and e.edgedate < '20210113');

顺便说一句,我假设您已经在source、target和edgedate上有了索引。

最新更新