VBA - 匹配 2 个排序字符串数组,其中某些元素不匹配 - 优化



我有 2 个数据集,其中包含很多字符串,我需要匹配它们。 第一个是 1200 行,第二个是大约 800 000 行。我通过通过 VBA 调用的 Excel 排序对这两个集进行排序,以便它们按升序排列,因此我可以通过在上次匹配后一行开始第二个数据集的下一次迭代来显着优化搜索速度。

不幸的是,当找不到匹配项时,即使根据我搜索的词检查的字符串在字母表中(>我的字符串),也永远不会遇到Exit For。我尝试实现比较If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then(也许之前使用"Mod"检查,如果在每次迭代中执行此操作会很慢),但我遇到了不正确的比较值,例如?"µm">"zzzzz"返回 true,而在数据集中它是应该的,在以"a"开头的字符串之前。

有没有可靠的方法来解决这个问题?

Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long
Dim vData1 as variant, vData2 as variant
'sort 2 data sets in Excel ascending
'assign data sets to arrays vData1 and vData2
optimizedCounter = LBound(vData2)
For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
End If
Next arrayIndex2
Next arrayIndex1

编辑:

感谢大家的精彩建议。目前,A.S.H 的Application.Evaluate/StrComp解决方案为我解决了问题。因为我使用默认的二进制比较vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1),我想保留当前速度,所以我不能使用选项比较文本。

For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
ElseIf arrayIndex2 Mod 1000 = 0 Then
If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _
""" > """ & vData1(arrayIndex1, 1) & """") Then Exit For
'line below can be used instead of Application.Evaluate, the same speed, easier structure
'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For
End If           
Next arrayIndex2
Next arrayIndex1 

由于此方法需要一些时间,因此我被迫每n次迭代使用它以获得性能提升。根据数据集长度和匹配值的百分比,最佳模值会有所不同。

作为对检查的组合数量的注释,我的搜索词列表包含重复项。

香草代码:

执行时间:12.76

处理的组合:144596591

Application.Evaluate or StrComp:

执行时间:17.30

处理的组合:1192341

在条件 mod 50 = 0 下评估或 StrComp

:执行时间:0.48

处理的组合:1201717

在条件 mod 1000 = 0 下评估或 StrComp

:执行时间:0.16

处理的组合:1376317

从这一点开始,增加 mod 值将增加执行时间,因为处理的组合数量更多。

我尝试将With Application放在主循环之外并使用.评估,它对速度完全没有影响。

编辑2:

Application.MatchApplication.Vlookup不适用于具有>65536 行的数组。然而,正如评论指出的那样,它们确实适用于范围。

Dim vMatch as Variant, myRng as Range
'myRng is set to one-column range of values to look for, about 800K rows
For arrayIndex1 = LBound(vData1) To UBound(vData1)
vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0)
If Not IsError(vMatch) Then
'do action when strings match
End If
Next arrayIndex1

应用程序匹配与匹配类型 = 0:

执行时间:28.81

查找次数:1200

If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then...我遇到了不正确的比较值,例如?"µm">"zzzzz"返回 true,而在数据集中,在以"a"开头的字符串之前,它应该是应该的。

实际上,如果字符串比较操作在先前的排序和代码中是不同的,则先前的排序将变得毫无用处。发生这种情况是因为

默认情况下,VBA 中的比较是二进制的

?"µm">"zzzzz"                              --->   True
?Application.Evaluate("""µm"">""zzzzz""")  --->   False
?StrComp("µm", "zzzzz")                    --->    1 
?StrComp("µm", "zzzzz", vbTextCompare)     --->   -1 
^^^^^^^^^^^^^^

p.s. 除非您按照注释中指出的Option Compare TextstrComp或使用 Excel 的比较设置

If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _ 
""" < """ & vData2(arrayIndex2, 1) & """") Then

这将解决比较不匹配的问题。事实上,根据<比较停止循环应该会让它更快。这是否是最好的算法是另一个争论。您的数组正在排序,二进制搜索应该是一个完美的候选者。

除非您进行二进制搜索,否则请考虑使用 Excel 的内置函数,即Application.VLookupApplication.Match,它们几乎比VBA循环快一个数量级,即使后者正在处理预取的数组。

我用一些二进制匹配函数运行了一个小测试,它在大约 2 秒内运行 129K 行对 780K 行,并进行 335K 比较检查。这就是二叉搜索+一点调整的力量。

一些修改的"二叉搜索"实用程序函数:

Public Function wsArrayBinaryMatch( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryMatch = "Not Found in Range"
a = IIf(optimalStart, optimalStart, LBound(arr))
z = IIf(optimalEnd, optimalEnd, UBound(arr))
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryMatch = a
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryMatch = curr
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryMatch = z
Else
If Not exactMatch Then
wsArrayBinaryMatch = a
Else
'approx match to val was found inside the range...
wsArrayBinaryMatch = "ApproxIndex" & a
End If
End If
End Function
Public Function wsArrayBinaryLookup( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
ByVal returnCol As Long, _
Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryLookup = CVErr(xlErrNA)
a = LBound(arr)
z = UBound(arr)
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(a, returnCol)
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryLookup = arr(curr, returnCol)
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(z, returnCol)
Else
If Not exactMatch Then
wsArrayBinaryLookup = arr(a, returnCol)
End If
End If
End Function
Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long
If IsNumeric(x) And IsNumeric(y) Then
Select Case x - y
Case Is = 0
compare = 0
Case Is > 0
compare = 1
Case Is < 0
compare = -1
End Select
Else
If TypeName(x) = "String" And TypeName(y) = "String" Then
compare = StrComp(x, y, vbTextCompare)
End If
End If
End Function

然后我写了一个子(可以转换为函数),试图充分利用排序的数据,并提高限制搜索范围的效率。这涉及在第一个数据集中尝试查找低项和高项之间交替。

请注意,两个数据集各只有 2 列,并且它正在从每个数据集的第一列中搜索匹配项。如果找到匹配项,则设置第一个集合中第二列的值。

在字符串中返回近似匹配的方法有点笨拙,但我累了......

Sub BinaryMatchInSortedSets()
Dim set1() As Variant, set2() As Variant
set1 = Sheet1.Range("E2:F129601").Value  '129K rows of strings and column F says 'Default'
set2 = Sheet1.Range("I2:J780001").Value  '780K rows of strings and numbers
Dim low1 As Long, high1 As Long
Dim low2 As Long, high2 As Long
Dim indexToFind As Long, approxIndex As Long
low1 = LBound(set1)
high1 = UBound(set1)
low2 = LBound(set2)
high2 = UBound(set2)
Dim errString As String
Dim matchIndex As Variant
Dim searchingForLow As Boolean: searchingForLow = True
While low1 <= high1 And low2 < high2
indexToFind = IIf(searchingForLow, low1, high1)
matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True)
If IsNumeric(matchIndex) Then
'match found
low2 = IIf(searchingForLow, matchIndex, low2)
high2 = IIf(searchingForLow, high2, matchIndex)
'do all other stuff in here that needs doing when match is found...
set1(indexToFind, 2) = set2(matchIndex, 2)  'Just an example of what you could do
Else
'no match, so set up efficient search range if possible
If Left(matchIndex, 11) = "ApproxIndex" Then
approxIndex = Mid(matchIndex, 12)
If searchingForLow Then
low2 = approxIndex + 1
Else
high2 = approxIndex - 1
End If
End If
End If
If searchingForLow Then
low1 = low1 + 1
Else
high1 = high1 - 1
End If
searchingForLow = Not searchingForLow
Wend
Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1
End Sub

最新更新