SQL / VBScript /智能算法快速找到和组合



我试图列出同一主题内所有可能的顺序(仅连续和向前方向)和组合。列出row_id和求和中涉及的行数。

Sample :
Input (Source Table :) 
DLID    Subject Total   
1   Science 70  
2   Science 70  
3   Science 70  
4   Science 70  
5   Maths   80  
6   Maths   80  
7   English 90  
8   English 90  
9   English 90  
10  Science 75  
Expected Result :           
ID  Number of Rows  Subject Total
1   1   Science 70
2   1   Science 70
3   1   Science 70
4   1   Science 70
5   1   Maths   80
6   1   Maths   80
7   1   English 90
8   1   English 90
9   1   English 90
10  1   Science 75
1   2   Science 140
2   2   Science 140
3   2   Science 140
5   2   Maths   160
7   2   English 180
8   2   English 180
1   3   Science 210
2   3   Science 210
7   3   English 270
1   4   Science 280

vbscript代码:

' myarray -从access数据库读取整个表"i"是读取的总行数"j"表示逐个访问每一行"m"是具有相同主题的后续行数,我们正在尝试检查"n"是一个计数器,从每一行开始,检查到m - 1行是否相同的子"k"用于将结果存储到"resultarray"

myarray(0,j) = holds the row_id
myarray(1,j) = holds the subject
myarray(2,j) = holds the score
myarray(3 4 5 6 are other details
i is the total number of rows - around 80,000
There can be conitnuous records from the same subject as many as 700 - 800
m = is the number of rows matching / number of rows leading to the sum

For m = 1 to 700
For j = 0 to i-m
matchcount = 1
For n = 1 to m-1 
if  myarray(1,j) = myarray (1,j+n) Then 
matchcount = matchcount + 1
Else
Exit For
End If
Next
If matchcount = m Then
resultarray(2,k) = 0
For o = 0 to m - 1
resultarray(2,k) = CDbl(resultarray(2,k)) + CDbl (myarray (2,j+o))
resultarray(1,k) = m
resultarray(0,k) = ( myarray (0,j) )
resultarray(3,k) = ( myarray (3,j) )
resultarray(4,k) = ( myarray (4,j) )
resultarray(5,k) = ( myarray (1,j) )
resultarray(7,k) = ( myarray (5,j) )
resultarray(8,k) = ( myarray (6,j) )
Next
resultarray(2,k) = round(resultarray(2,k),0)
k = k + 1
ReDim Preserve resultarray(8,k)
End If
Next
Next

代码工作完美,但非常慢。我处理80,000行和从5到900连续行相同的主题。所以组合的数量,有几百万种。一组8万行需要几个小时。每天要做很多组。

请建议如何加快速度。更好的算法/代码改进/不同的编程语言请协助。

以下是"真正的"Access (SQL)解决方案的构建模块。

观察# 1

在我看来,一个好的第一步是在[SourceTable]中添加两个数字(长整数)列:

[SubjectBlock]将对主题相同的行进行"块"编号

[SubjectBlockSeq]将按顺序编号每个块中的行

它们都应该被索引(duplicate OK)。填充这些列的代码如下:

Public Sub UpdateBlocksAndSeqs()
Dim cdb As DAO.Database, rst As DAO.Recordset
Dim BlockNo As Long, SeqNo As Long, PrevSubject As String
Set cdb = CurrentDb
Set rst = cdb.OpenRecordset("SELECT * FROM [SourceTable] ORDER BY [DLID]", dbOpenDynaset)
PrevSubject = "(an impossible value)"
BlockNo = 0
SeqNo = 0
DBEngine.Workspaces(0).BeginTrans  ''speeds up bulk updates
Do While Not rst.EOF
    If rst!Subject <> PrevSubject Then
        BlockNo = BlockNo + 1
        SeqNo = 0
    End If
    SeqNo = SeqNo + 1
    rst.Edit
    rst!SubjectBlock = BlockNo
    rst!SubjectBlockSeq = SeqNo
    rst.Update
    PrevSubject = rst!Subject
    rst.MoveNext
Loop
DBEngine.Workspaces(0).CommitTrans
rst.Close
Set rst = Nothing
End Sub

…更新后的SourceTable将是…

DLID    Subject   Total   SubjectBlock    SubjectBlockSeq
1       Science   70      1               1
2       Science   60      1               2
3       Science   75      1               3
4       Science   70      1               4
5       Maths     80      2               1
6       Maths     90      2               2
7       English   90      3               1
8       English   80      3               2
9       English   70      3               3
10      Science   75      4               1

(请注意,我调整了测试数据,以便更容易验证下面的结果。)

现在,当我们遍历不断增加的"要包含在总数中的序列长度"时,我们可以简单地通过使用像…

这样的查询来快速识别感兴趣的"块"。
SELECT SubjectBlock FROM SourceTable WHERE SubjectBlockSeq=3

…这将返回…

1
3

…这表明在计算"3跑"的总数时,我们根本不需要看第2块("数学")和第4块(最后一个"科学"块)。

观察# 2

第一次通过,当NumRows=1时,是一个特殊情况:它只是将[SourceTable]中的行复制到[Expected Results]表中。我们可以通过一个查询来节省时间:

INSERT INTO ExpectedResult ( DLID, NumRows, Subject, Total, SubjectBlock, NextSubjectBlockSeq )
SELECT SourceTable.DLID, 1 AS Expr1, SourceTable.Subject, SourceTable.Total, 
    SourceTable.SubjectBlock, SourceTable.SubjectBlockSeq+1 AS Expr2
FROM SourceTable;

你可能会注意到,我已经添加了两列到[ExpectedResult]表:[SubjectBlock](和以前一样)和[NextSubjetBlockSeq](这只是[SubjectBlockSeq]+1)。同样,它们都应该被索引,以允许重复。我们将在下面使用它们

观察# 3

当我们继续寻找越来越长的"运行"来求和时,每次运行实际上只是一个更早(更短)的运行,并在末尾添加了一个额外的行。如果我们将结果写入[ExpectedResults]表中,我们就可以重用这些值,而不必在整个运行过程中返回并将单个值相加。

当NumRows=2时,"附加"行是SubjectBlockSeq>=2

SELECT SourceTable.*
FROM SourceTable
WHERE (((SourceTable.SubjectBlockSeq)>=2))
ORDER BY SourceTable.DLID;

…这是…

DLID    Subject   Total SubjectBlock    SubjectBlockSeq
2       Science   60    1               2
3       Science   75    1               3
4       Science   70    1               4
6       Maths     90    2               2
8       English   80    3               2
9       English   70    3               3

…与"较早(较短)运行"的[ExpectedResult]行,我们将"添加"额外的行是那些

  • 来自同一个[SubjectBlock],

  • with [NumRows]=1, and

  • [ExpectedResult].[NextSubjectBlockSeq] = [SourceTable].[SubjectBlockSeq]

所以我们可以得到新的总数并像这样将它们附加到[ExpectedResult]中

INSERT INTO ExpectedResult ( DLID, NumRows, Subject, Total, SubjectBlock, NextSubjectBlockSeq )
SELECT SourceTable.DLID, 2 AS Expr1, SourceTable.Subject, 
    [ExpectedResult].[Total]+[SourceTable].[Total] AS NewTotal, 
    SourceTable.SubjectBlock, [SourceTable].[SubjectBlockSeq]+1 AS Expr2
FROM SourceTable INNER JOIN ExpectedResult 
    ON (SourceTable.SubjectBlockSeq = ExpectedResult.NextSubjectBlockSeq) 
        AND (SourceTable.SubjectBlock = ExpectedResult.SubjectBlock)
WHERE (((SourceTable.SubjectBlockSeq)>=2) AND (ExpectedResult.NumRows=1));

附加到[ExpectedResult]的行

DLID    NumRows Subject   Total SubjectBlock    NextSubjectBlockSeq
2       2       Science   130   1               3
3       2       Science   135   1               4
4       2       Science   145   1               5
6       2       Maths     170   2               3
8       2       English   170   3               3
9       2       English   150   3               4

使用与之前相同的逻辑,我们现在可以处理为NumRows=3。唯一的区别是,我们将在NumRows中插入值3,我们的选择标准将是

WHERE (((SourceTable.SubjectBlockSeq)>=3) AND (ExpectedResult.NumRows=2))

完整的查询是

INSERT INTO ExpectedResult ( DLID, NumRows, Subject, Total, SubjectBlock, NextSubjectBlockSeq )
SELECT SourceTable.DLID, 3 AS Expr1, SourceTable.Subject, 
    [ExpectedResult].[Total]+[SourceTable].[Total] AS NewTotal, 
    SourceTable.SubjectBlock, [SourceTable].[SubjectBlockSeq]+1 AS Expr2
FROM SourceTable INNER JOIN ExpectedResult 
    ON (SourceTable.SubjectBlockSeq = ExpectedResult.NextSubjectBlockSeq) 
        AND (SourceTable.SubjectBlock = ExpectedResult.SubjectBlock)
WHERE (((SourceTable.SubjectBlockSeq)>=3) AND (ExpectedResult.NumRows=2));

和附加到[ExpectedResult]的行

DLID    NumRows Subject   Total SubjectBlock    NextSubjectBlockSeq
3       3       Science   205   1               4
4       3       Science   205   1               5
9       3       English   240   3               4

参数

由于每个连续的查询都是如此相似,如果我们可以只编写一次并重复使用它,那将是非常好的。幸运的是,我们可以,如果我们把它变成一个"参数查询":

PARAMETERS TargetNumRows Long;
INSERT INTO ExpectedResult ( DLID, NumRows, Subject, Total, SubjectBlock, NextSubjectBlockSeq )
SELECT SourceTable.DLID, [TargetNumRows] AS Expr1, SourceTable.Subject, 
    [ExpectedResult].[Total]+[SourceTable].[Total] AS NewTotal, 
    SourceTable.SubjectBlock, [SourceTable].[SubjectBlockSeq]+1 AS Expr2
FROM SourceTable INNER JOIN ExpectedResult 
    ON (SourceTable.SubjectBlock = ExpectedResult.SubjectBlock) 
        AND (SourceTable.SubjectBlockSeq = ExpectedResult.NextSubjectBlockSeq)
WHERE (((SourceTable.SubjectBlockSeq)>=[TargetNumRows]) 
    AND ((ExpectedResult.NumRows)=[TargetNumRows]-1));

创建一个新的Access查询,将上述内容粘贴到SQL窗格中,然后将其保存为pq_appendToExpectedResult。("pq_"只是一个视觉提示,它是一个参数查询。)

从VBA调用参数查询

你可以通过QueryDef object在VBA中调用(执行)一个参数查询:

Dim cdb As DAO.Database, qdf As DAO.QueryDef
Set cdb = CurrentDb
Set qdf = cdb.QueryDefs("pq_appendToExpectedResult")
qdf!TargetNumRows = 4  '' parameter value
qdf.Execute
Set qdf = Nothing

何时停止

现在您可以看到,这只是增加NumRows并重新运行参数查询的问题,但是何时停止呢?这很简单:

在VBA中增加NumRows变量后,测试

DCount("DLID", "SourceTable", "SubjectBlockSeq=" & NumRows)

如果返回0,则完成。

Show me (all) code

对不起,现在不行。,)尝试一下,让我们知道效果如何。

问题是:"请建议如何加快速度。更好的算法/代码改进/不同的语言代码请协助。"

我可以简短地回答你问题的部分"不同的语言编码" == SQL. 在细节:

无论你想要实现的是什么,看起来都是数据集密集型的。我几乎可以肯定,这种处理将在存储数据的DBMS中更有效地处理,因为DBMS能够采用(编写得相当好的)SQL查询,并根据自己对您正在查询的数据的了解对其进行优化,并且非常快速有效地对大型数据集/子集执行聚合。

逐行迭代大型数据集以积累值很少(我敢说从不)会产生可接受的性能。这就是为什么dbms不原生地这样做(如果您不通过使用迭代代码或需要调查每一行的代码(例如VB代码)来强迫它们这样做的话)。


现在,为了实现更好的算法/代码改进/不同的语言。

我已经在SQL中这样做了,但不管你是否使用我的解决方案,我仍然会高度建议你迁移你的数据到例如MS SQL或Oracle或mySql等,如果你发现你使用MS Access将你绑定到迭代方法(这并不是建议它这样做…我不知道是不是这样。

但是,如果这真的不是家庭作业,并且/或者你真的与MS Access联系在一起,那么也许投入精力将其转换为MS Access可能会在性能方面取得成果。原则应该都是一样的——这是一个关系数据库,这都是相当标准的SQL,所以我本以为我在这里所做的会有等价的Access。

如果做不到这一点,您应该能够将MSSQL实例"指向"MS Access文件,作为通过Access提供程序链接的服务器。如果你需要建议,请告诉我。

这里有一些本质上是过程性的代码,为了设置一些"辅助"表,这些表将允许使用基于集合的操作来完成对序列的繁重聚合。

我将源表命名为"Your_Source_Table"。对所有实例执行搜索替换,将其重命名为您所命名的名称。

还请注意,我没有在任何东西上设置索引…你应该这么做。我希望应该为连接中涉及的所有列创建索引。检查执行计划以确保没有不必要的表扫描是明智的。

我使用以下代码创建Your_Source_Table:

-- Create Your apparent table structure
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Your_Source_Table]') AND type in (N'U'))
    DROP TABLE [dbo].[Your_Source_Table]
GO
CREATE TABLE [dbo].[Your_Source_Table](
    [DLID] [int] NOT NULL,
    [Subject] [nchar](10) NOT NULL,
    [Total] [int] NOT NULL
) ON [PRIMARY]
GO

并填充为:

DLID        Subject    Total
----------- ---------- -----------
1           Science    70
2           Science    70
3           Science    70
4           Science    70
5           Maths      80
6           Maths      80
7           English    90
8           English    90
9           English    90
10          Science    75

然后,我创建了下面的"助手"。代码中的解释。

-- Set up helper structures.
-- Build a number table
if object_id('tempdb..##numbers') is not null
    BEGIN DROP TABLE ##numbers END
SELECT TOP 10000 IDENTITY(int,1,1) AS Number -- Can be 700, 800, or 900 contiguous rows, depending on which comment I read.  So I'll run with 100000 to be sure :-)
    INTO ##numbers
    FROM sys.objects s1
    CROSS JOIN sys.objects s2
ALTER TABLE ##numbers ADD CONSTRAINT PK_numbers PRIMARY KEY CLUSTERED (Number)

-- Determine where each block starts.
if object_id('tempdb..#tempGroups') is not null
    BEGIN DROP TABLE #tempGroups END
GO
CREATE TABLE #tempGroups (
    [ID] [int] NOT NULL IDENTITY,
    [StartID] [int] NULL,
    [Subject] [nchar](10) NULL
) ON [PRIMARY]
INSERT INTO #tempGroups
    SELECT t.DLID, t.Subject FROM Your_Source_Table t WHERE DLID=1
    UNION
    SELECT 
        t.DLID, t.Subject
    FROM
        Your_Source_Table t
        INNER JOIN Your_Source_Table t2 ON t.DLID = t2.DLID+1 AND t.subject != t2.subject
-- Determine where each block ends
if object_id('tempdb..##groups') is not null
    BEGIN DROP TABLE ##groups END
CREATE TABLE ##groups (
    [ID] [int] NOT NULL,
    [Subject] [nchar](10) NOT NULL,
    [StartID] [int] NOT NULL,
    [EndID] [int] NOT NULL
) ON [PRIMARY]
INSERT INTO ##groups
SELECT
    g1.id as ID,
    g1.subject,
    g1.startID as startID,
    CASE
        WHEN g2.id is not null THEN g2.startID-1
        ELSE (SELECT max(dlid) FROM Your_Source_Table) -- Boundary case when there is no following group (ie return the last row)
    END as endID
FROM
    #tempGroups g1
    LEFT JOIN #tempGroups g2 ON g1.id = g2.id-1
DROP TABLE #tempGroups;
GO
-- We now have a helper table called ##groups, that identifies the subject, start DLID and end DLID of each continuous block of a particular subject in your dataset.
-- So now, we can build up the possible sequences within each group, by joining to a number table.
if object_id('tempdb..##sequences') is not null
    BEGIN DROP TABLE ##sequences END
CREATE TABLE ##sequences (
    [seqID] [int] NOT NULL IDENTITY,
    [groupID] [int] NOT NULL,
    [start_of_sequence] [int] NOT NULL,
    [end_of_sequence] [int] NOT NULL
) ON [PRIMARY]
INSERT INTO ##sequences
SELECT
    g.id,
    ns.number start_of_sequence,
    ne.number end_of_sequence
FROM
    ##groups g
    INNER JOIN ##numbers ns
    ON ns.number <= (g.endid - g.startid + 1) -- number is in range for this block
    INNER JOIN ##numbers ne
    ON ne.number <= (g.endid - g.startid + 1) -- number is in range for this block
    and ne.number >= ns.number -- end after start
ORDER BY
    1,2,3

然后,你想要的结果可以通过一个基于集合的操作来实现:

-- By joining groups to your dataset we can add a group identity to each record.
-- By joining sequences we can generate copies of the rows for aggregation into each sequence.
select 
    min(t.dlid) as ID, -- equals (s.start_of_sequence + g.startid - 1) (sequence positions offset by group start position)
    count(t.dlid) as number_of_rows,    
    g.subject,
    sum(t.total) as total
--select *
from
    Your_Source_Table t
    inner join ##groups g
    on t.dlid >= g.startid and t.dlid <= g.endid -- grouping rows into each group.
    inner join ##sequences s
    on s.groupid = g.id -- get the sequences for this group.
    and t.dlid >= (s.start_of_sequence + g.startid - 1) -- include the rows required for this sequence (sequence positions offset by group start position)
    and t.dlid <= (s.end_of_sequence + g.startid - 1)
group by
    g.subject,
    s.seqid
order by 2, 1

但注意:这个结果与你的"预期结果"不完全相同。您错误地包含了从第1行开始的1行序列的重复实例(对于科学,sum total 1*70=70),但没有包含从第1行开始的4行序列(对于科学,sum total 4*70 = 280)。

正确的结果,恕我直言:

ID          number_of_rows subject    total
----------- -------------- ---------- -----------
1           1              Science    70 <-- You've got this row twice.
2           1              Science    70
3           1              Science    70
4           1              Science    70
5           1              Maths      80
6           1              Maths      80
7           1              English    90
8           1              English    90
9           1              English    90
10          1              Science    75
1           2              Science    140
2           2              Science    140
3           2              Science    140
5           2              Maths      160
7           2              English    180
8           2              English    180
1           3              Science    210
2           3              Science    210
7           3              English    270
1           4              Science    280 <-- You don't have this row.
(20 row(s) affected)

最新更新