去字符串.Contains()比Python3慢2倍

我将文本模式扫描仪从Python3转换为Go1.10，但我很惊讶它实际上慢了2倍。根据分析，罪魁祸首在strings.Contains()中。请参阅下面的简单基准。我错过什么了吗？你能为Go推荐一种更快的模式搜索算法吗？在这种情况下，它会表现得更好？我不在乎启动时间，同样的模式将用于扫描数百万个文件。

Py3基准：

import time
import re
RUNS = 10000
if __name__ == '__main__':
with open('data.php') as fh:
testString = fh.read()
def do():
return "576ad4f370014dfb1d0f17b0e6855f22" in testString
start = time.time()
for i in range(RUNS):
_ = do()
duration = time.time() - start
print("Python: %.2fs" % duration)

Go1.10基准：

package main
import (
"fmt"
"io/ioutil"
"log"
"strings"
"time"
)
const (
runs = 10000
)
func main() {
fname := "data.php"
testdata := readFile(fname)
needle := "576ad4f370014dfb1d0f17b0e6855f22"
start := time.Now()
for i := 0; i < runs; i++ {
_ = strings.Contains(testdata, needle)
}
duration := time.Now().Sub(start)
fmt.Printf("Go: %.2fsn", duration.Seconds())
}
func readFile(fname string) string {
data, err := ioutil.ReadFile(fname)
if err != nil {
log.Fatal(err)
}
return string(data)
}

data.php是一个528KB的文件，可以在这里找到。

输出：

Go:     1.98s
Python: 0.84s

为什么Python 3(24.79s(比Go(5.47s(慢4.5倍？你得到了什么结果？

Python:

$ cat contains.py
import time
import re
RUNS = 10000
if __name__ == '__main__':
# The Complete Works of William Shakespeare by William Shakespeare
# http://www.gutenberg.org/files/100/100-0.txt
file = '/home/peter/shakespeare.100-0.txt' # 'data.php'
with open(file) as fh:
testString = fh.read()
def do():
return "Means to immure herself and not be seen." in testString
start = time.time()
for i in range(RUNS):
_ = do()
duration = time.time() - start
print("Python: %.2fs" % duration)
print(do())
$ python3 --version
Python 3.6.5
$ python3 contains.py
Python: 24.79s
True
$

转到：

$ cat contains.go
package main
import (
"fmt"
"io/ioutil"
"log"
"strings"
"time"
)
const (
runs = 10000
)
func main() {
// The Complete Works of William Shakespeare by William Shakespeare
// http://www.gutenberg.org/files/100/100-0.txt
fname := `/home/peter/shakespeare.100-0.txt` // "data.php"
testdata := readFile(fname)
needle := "Means to immure herself and not be seen."
start := time.Now()
for i := 0; i < runs; i++ {
_ = strings.Contains(testdata, needle)
}
duration := time.Now().Sub(start)
fmt.Printf("Go: %.2fsn", duration.Seconds())
fmt.Println(strings.Contains(testdata, needle))
fmt.Println(strings.Index(testdata, needle))
}
func readFile(fname string) string {
data, err := ioutil.ReadFile(fname)
if err != nil {
log.Fatal(err)
}
return string(data)
}
$ go version
go version devel +5332b5e75a Tue Jul 31 15:44:37 2018 +0000 linux/amd64
$ go run contains.go
Go: 5.47s
true
5837178
$

我在维基百科上找到了各种字符串搜索实现，比如：

https://github.com/cloudflare/ahocorasick
https://github.com/cubicdaiya/bms
https://github.com/kkdai/kmp
https://github.com/paddie/gokmp
https://github.com/hillu/go-yara(亚拉似乎在幕后实施了Aho&Corasick(

基准结果(此处代码(：

BenchmarkStringsContains-4         10000        208055 ns/op
BenchmarkBMSSearch-4                1000       1856732 ns/op
BenchmarkPaddieKMP-4                2000       1069495 ns/op
BenchmarkKkdaiKMP-4                 1000       1440147 ns/op
BenchmarkAhocorasick-4              2000        935885 ns/op
BenchmarkYara-4                     1000       1237887 ns/op

然后，我针对本机(strings.Contains和regexp(和基于C的Yara实现，在一个500KB的无匹配文件中测试了大约1100个签名(100个正则表达式，1000个文本(的实际用例：

BenchmarkScanNative-4              2     824328504 ns/op
BenchmarkScanYara-4              300       5338861 ns/op

尽管围棋中的C调用被认为是昂贵的，但在这些"繁重"的操作中，利润是惊人的。侧面观察：Yara只需要5倍的CPU时间就可以匹配1100个签名，而不是1个。

相关内容

最新更新

热门标签：