关于 re:run 与"dotall"选项的效率?



当我使用re:run时,我发现一个有趣的事情:当我使用dotall选项时,效率非常低。

源代码:

main3() ->
    Sdp = 
"v=0rno=- 1001 11112 IN IP4 10.10.121.7rns=-rnt=0 0rnm=audio 52363 RTP/AVPF 0 8rnc=IN IP4 10.10.121.7rna=rtcp:52369 IN IP4 138.85.151.208rna=candidate:1783138469 1 udp 2113937151 138.85.151.208 52363 typ host generation 0rna=candidate:4012290674 1 udp 2113937151 192.168.125.1 52364 typ host generation 0rna=candidate:1760259326 1 udp 2113937151 192.168.2.12 52367 typ host generation 0rna=candidate:2294684747 1 udp 2113937151 192.168.58.1 52368 typ host generation 0rna=candidate:1783138469 2 udp 2113937150 138.85.151.208 52369 typ host generation 0rna=candidate:4012290674 2 udp 2113937150 192.168.125.1 52370 typ host generation 0rna=candidate:1760259326 2 udp 2113937150 192.168.2.12 52371 typ host generation 0rna=candidate:2294684747 2 udp 2113937150 192.168.58.1 52372 typ host generation 0rna=candidate:617313365 1 tcp 1509957375 138.85.151.208 52530 typ host generation 0rna=candidate:2711965314 1 tcp 1509957375 192.168.125.1 52531 typ host generation 0rna=candidate:644386830 1 tcp 1509957375 192.168.2.12 52532 typ host generation 0rna=candidate:3326468283 1 tcp 1509957375 192.168.58.1 52533 typ host generation 0rna=candidate:617313365 2 tcp 1509957374 138.85.151.208 52534 typ host generation 0rna=candidate:2711965314 2 tcp 1509957374 192.168.125.1 52535 typ host generation 0rna=candidate:644386830 2 tcp 1509957374 192.168.2.12 52536 typ host generation 0rna=candidate:3326468283 2 tcp 1509957374 192.168.58.1 52537 typ host generation 0rna=ice-ufrag:rootrna=ice-pwd:myreallysecretpasswordrna=sendrecvrna=rtpmap:0 PCMU/8000rna=rtpmap:8 PCMA/8000rna=ssrc:1947760130 cname:OCGE4NpwFpLE/BFWrna=ssrc:1947760130 mslabel:oBAkRgSOpLdfl7u1JWdnMyUytcGGD4COvttPrna=ssrc:1947760130 label:oBAkRgSOpLdfl7u1JWdnMyUytcGGD4COvttP00rnm=video 52373 RTP/AVPF 126rnc=IN IP4 10.10.121.7rna=rtcp:52377 IN IP4 138.85.151.208rna=candidate:1783138469 1 udp 2113937151 138.85.151.208 52373 typ host generation 0rna=candidate:4012290674 1 udp 2113937151 192.168.125.1 52374 typ host generation 0rna=candidate:1760259326 1 udp 2113937151 192.168.2.12 52375 typ host generation 0rna=candidate:2294684747 1 udp 2113937151 192.168.58.1 52376 typ host generation 0rna=candidate:1783138469 2 udp 2113937150 138.85.151.208 52377 typ host generation 0rna=candidate:4012290674 2 udp 2113937150 192.168.125.1 52378 typ host generation 0rna=candidate:1760259326 2 udp 2113937150 192.168.2.12 52379 typ host generation 0rna=candidate:2294684747 2 udp 2113937150 192.168.58.1 52380 typ host generation 0rna=candidate:617313365 1 tcp 1509957375 138.85.151.208 52538 typ host generation 0rna=candidate:2711965314 1 tcp 1509957375 192.168.125.1 52539 typ host generation 0rna=candidate:644386830 1 tcp 1509957375 192.168.2.12 52540 typ host generation 0rna=candidate:3326468283 1 tcp 1509957375 192.168.58.1 52541 typ host generation 0rna=candidate:617313365 2 tcp 1509957374 138.85.151.208 52542 typ host generation 0rna=candidate:2711965314 2 tcp 1509957374 192.168.125.1 52543 typ host generation 0rna=candidate:644386830 2 tcp 1509957374 192.168.2.12 52544 typ host generation 0rna=candidate:3326468283 2 tcp 1509957374 192.168.58.1 52545 typ host generation 0rna=ice-ufrag:rootrna=ice-pwd:myreallysecretpasswordrna=sendrecvrna=rtpmap:126 H264/90000rn",
    ReStr = "(.*)a=candidate.*host.*a=candidate.*host(.*)a=ice-ufrag.*a=setup:active(.*)a=mid:audio(.*)a=candidate.*host.*a=candidate.*host(.*)a=ice-ufrag.*a=setup:active(.*)a=mid:video(.*)",
    {ok, Pattern1} = re:compile(ReStr, [{newline, crlf}]),
    {Time1, _} = timer:tc(re, run, [ Sdp, Pattern1, [{capture,all_but_first,list}] ]),
    io:format("not using dotall, time is ~p~n", [Time1]),
    {ok, Pattern2} = re:compile(ReStr, [{newline, crlf}, dotall]),
    {Time2, _} = timer:tc(re, run, [ Sdp, Pattern2, [{capture,all_but_first,list}] ]),
    io:format("using dotall, time is ~p~n", [Time2]).

运行结果:

101> tt:main3().
not using dotall, time is 4499
using dotall, time is 2760364
ok

从结果中,我们可以发现差异是如此之大

通常默认情况下,当dotall未设置时,.模式不匹配n,因此搜索将只扩展到行尾。设置dotall后,.将匹配输入字符串结束前的所有字符。这在您的情况下会产生差异,因为您的输入字符串包含许多行。

要记住的是re是基于PCRE的,它遵循Perl正则表达式。其中一个特点是,它们是使用回溯算法实现的,这意味着当你的模式中有备选项时,比如.*,它会导致大量搜索以找到匹配项。这是Perl正则表达式的一个属性,而不是由于糟糕的实现。

关于实现正则表达式的各种方法的更长的讨论,请参见wikipedia正则表达式(第三种算法),Russ Cox的实现正则表达式(第一篇论文),或Friedl的精通正则表达式(尽管他给了3种算法错误的名称)。

最新更新