Swift regex optimization



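I want to search an array of strings for every string that contains a given substring. The function should also support wildcards. I wrote this function: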

func wordcontains(word: String, from words: [String]) -> [String] {
    // If there are wildcards, use the regex method;
    // otherwise use the simple method, since it is much faster
    let foundWords = words.filter { otherWord in
        // "?" is the wildcard: translate it to the regex "." (any single character)
        let wordregex = word.replacingOccurrences(of: "?", with: ".")
        if (otherWord.range(of: "[A-Z]*\(wordregex)[A-Z]*", options: .regularExpression) != nil) {
            return true
        } else {
            return false
        }
    }
    return foundWords
}

It works like this:

input : wordcontains(word: "ARC?", from: ["BOU", "BAC", "ARCS", "ARCH", "TREE","ARCHE","PROUE"])
output : ["ARCS", "ARCH", "ARCHE"]
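
To make the wildcard handling concrete, here is a minimal sketch of how the pattern is built and applied to a few words. The helper name `makePattern` is hypothetical and only used for illustration; it assumes the input contains no regex metacharacters other than `?`.

```swift
import Foundation

// Hypothetical helper: turns the "?" wildcard into the regex "." (any single
// character) and wraps the result in "[A-Z]*" on both sides.
func makePattern(for word: String) -> String {
    let wordregex = word.replacingOccurrences(of: "?", with: ".")
    return "[A-Z]*\(wordregex)[A-Z]*"
}

let pattern = makePattern(for: "ARC?")
// Keep only the words in which the pattern matches somewhere.
let matches = ["BAC", "ARCS", "ARCHE"].filter {
    $0.range(of: pattern, options: .regularExpression) != nil
}
print(pattern, matches)
```

Here "ARC?" becomes "[A-Z]*ARC.[A-Z]*", so "BAC" is rejected while "ARCS" and "ARCHE" are kept.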

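It works well with a small array, but I need to check an array of 300,000 words, and that takes a while.

What is the best way to optimize the regex or the function?

Maybe there is a better approach altogether?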

For your interest, here is the code I used for testing.

Create a Command Line Tool project.

import Foundation
func wordcontains(word: String, from words: [String]) -> [String] {
...(exactly the same code as yours)...
}
/// Creating NSRegularExpression outside of the loop
func wordcontains2(word: String, from words: [String]) -> [String] {
    let wordregex = word.replacingOccurrences(of: "?", with: ".")
    let pattern = "[A-Z]*\(wordregex)[A-Z]*"
    let regex: NSRegularExpression
    do {
        // Compile the pattern once instead of once per word
        regex = try NSRegularExpression(pattern: pattern)
    } catch {
        fatalError(error.localizedDescription)
    }
    let foundWords = words.filter { otherWord in
        regex.firstMatch(in: otherWord, range: NSRange(0..<otherWord.utf16.count)) != nil
    }
    return foundWords
}
/// Removing `[A-Z]*` from both ends as suggested in rmaddy's comment.
/// This assumes all words in the parameter `words` consist only of capital letters.
func wordcontains3(word: String, from words: [String]) -> [String] {
    let wordregex = word.replacingOccurrences(of: "?", with: ".")
    let regex: NSRegularExpression
    do {
        regex = try NSRegularExpression(pattern: wordregex)
    } catch {
        fatalError(error.localizedDescription)
    }
    let foundWords = words.filter { otherWord in
        regex.firstMatch(in: otherWord, range: NSRange(0..<otherWord.utf16.count)) != nil
    }
    return foundWords
}

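In general, creating an NSRegularExpression instance is an expensive operation, so moving it out of the loop may improve performance (provided, of course, that the regex does not change), but the effect is very limited.

I added some code for testing.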

func makeRandomWords(_ count: Int) -> [String] {
    var words: [String] = []
    for _ in 0..<count {
        let len = Int.random(in: 3...5)
        var word = ""
        for _ in 0..<len {
            let charCode = UInt32.random(in: UInt32(UInt8(ascii: "A"))...UInt32(UInt8(ascii: "Z")))
            word.append(Character(UnicodeScalar(charCode)!))
        }
        words.append(word)
    }
    return words
}
let words = makeRandomWords(300_000) // I have found the number of words is `300000` after I wrote my comment...
do {
    let date1 = Date()
    let w1 = wordcontains(word: "ARC?", from: words)
    let date2 = Date()
    print(date2.timeIntervalSince(date1), w1)
    let date3 = Date()
    let w2 = wordcontains2(word: "ARC?", from: words)
    let date4 = Date()
    print(date4.timeIntervalSince(date3), w2)
    let date5 = Date()
    let w3 = wordcontains3(word: "ARC?", from: words)
    let date6 = Date()
    print(date6.timeIntervalSince(date5), w3)
}

Result:

6.443639039993286 ["ARCQJ", "ARCZB", "AARCI", "ARCR", "ARCR", "ARCQS", "ARCGM", "ARCKL", "UARCN", "FARCS", "ARCNA", "ARCZM", "PARCL", "ARCTA", "ARCS", "ARCE", "ARCG", "ARCE"]
1.7534430027008057 ["ARCQJ", "ARCZB", "AARCI", "ARCR", "ARCR", "ARCQS", "ARCGM", "ARCKL", "UARCN", "FARCS", "ARCNA", "ARCZM", "PARCL", "ARCTA", "ARCS", "ARCE", "ARCG", "ARCE"]
1.4359259605407715 ["ARCQJ", "ARCZB", "AARCI", "ARCR", "ARCR", "ARCQS", "ARCGM", "ARCKL", "UARCN", "FARCS", "ARCNA", "ARCZM", "PARCL", "ARCTA", "ARCS", "ARCE", "ARCG", "ARCE"]

Because this code uses random words, the matched words will change from run to run, but the elapsed time of each run should not differ much.
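
As for whether there is a better approach, one option worth trying is to skip the regex engine entirely. The sketch below is an assumption on my part, not something measured above: `wordcontainsNoRegex` is a hypothetical name, and it assumes the words are ASCII-only (as in the test data), so comparing UTF-8 bytes is safe. It uses a plain substring search when there is no wildcard and a simple sliding-window comparison otherwise.

```swift
import Foundation

// Sketch only: a hand-written matcher where "?" matches any single character.
// Assumes ASCII-only words, so byte-wise comparison is valid.
func wordcontainsNoRegex(word: String, from words: [String]) -> [String] {
    // Fast path: without wildcards a plain substring search is enough.
    guard word.contains("?") else {
        return words.filter { $0.contains(word) }
    }
    let pattern = Array(word.utf8)
    let wildcard = UInt8(ascii: "?")
    return words.filter { otherWord in
        let bytes = Array(otherWord.utf8)
        guard bytes.count >= pattern.count else { return false }
        // Try every start offset at which the pattern still fits.
        for start in 0...(bytes.count - pattern.count) {
            var matched = true
            // A mismatch is a non-wildcard byte that differs from the word's byte.
            for (i, p) in pattern.enumerated() where p != wildcard && bytes[start + i] != p {
                matched = false
                break
            }
            if matched { return true }
        }
        return false
    }
}

let found = wordcontainsNoRegex(word: "ARC?", from: ["BOU", "BAC", "ARCS", "ARCH", "TREE", "ARCHE", "PROUE"])
print(found)
```

Whether this actually beats `wordcontains3` depends on the data and the build settings, so it would need to be timed with the same harness before drawing any conclusion.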
