Split by regex vs. multiple single-character splits: performance



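I compared splitting a string with a single regular expression against splitting it several times by one character each, using this benchmark: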

import org.openjdk.jmh.annotations.*;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
public class Test {
static String start = "1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.";
    public static void main(String[] args) throws IOException {
        org.openjdk.jmh.Main.main(args);
    }

    // One split using a character class that matches '1', ',' or '.'.
    @Fork(value = 1, warmups = 0)
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Warmup(iterations = 0)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public static void splitByRegex() {
        String test = start;
        test = String.join("_", test.split("[1,.]"));
    }

    // Three consecutive splits, each on a single character.
    @Fork(value = 1, warmups = 0)
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Warmup(iterations = 0)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public static void multipleSplitByOneChar() {
        String test = start;
        test = String.join("_", test.split("\\."));
        test = String.join("_", test.split(","));
        test = String.join("_", test.split("1"));
    }
}

and got these results:

Benchmark                    Mode  Cnt      Score     Error  Units
Test.multipleSplitByOneChar  avgt    5  10493,118 ± 572,528  ns/op
Test.splitByRegex            avgt    5  15519,418 ± 913,220  ns/op

Why is splitting by a regex slower than splitting by several individual characters, even though both produce the same result?
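
The "same result" claim can be checked directly. The following is a minimal sketch, not part of the benchmark: the class and method names are hypothetical, and it runs on a shortened copy of the input (two repetitions of the pattern) rather than the full string.

```java
// Hypothetical helper class, used only to compare the two approaches.
final class SplitEquivalenceCheck {

    // Same transformation as splitByRegex() in the benchmark.
    static String byRegex(String s) {
        return String.join("_", s.split("[1,.]"));
    }

    // Same transformation as multipleSplitByOneChar() in the benchmark.
    static String byThreeSplits(String s) {
        s = String.join("_", s.split("\\."));
        s = String.join("_", s.split(","));
        return String.join("_", s.split("1"));
    }

    public static void main(String[] args) {
        String unit = "1, 2, 3, 4, 5, 6, 7, 8. 9. 10. 11. 12.";
        String sample = unit + unit; // shortened stand-in for the benchmark input
        System.out.println(byRegex(sample).equals(byThreeSplits(sample)));
    }
}
```

The two approaches are not equivalent for every input, because split drops trailing empty strings at each step. They only need to agree on input shaped like the one used here.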

Notes:

  1. I ran the code on JDK 14.0.2.
  2. I used JMH 1.28.

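The String.split implementation has an optimized fast path for splitting on a single character. The patterns `,`, `1` and `\.` all qualify for it, whereas `[1,.]` does not and is handled by the general regex machinery instead: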

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     * (1) one-char String and this character is not one of the
     *     RegEx's meta characters ".$|()[{^?*+\", or
     * (2) two-char String and the first char is the backslash and
     *     the second is not the ascii digit or ascii letter.
     */
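
To make the two conditions in that comment concrete, here is a rough sketch of a predicate that mirrors them. It is an illustration only, not the JDK code: the class and method names are hypothetical, and it ignores further details the real implementation handles (for example surrogate characters).

```java
// Hypothetical sketch of the fast-path conditions quoted above; not the JDK source.
final class SplitFastPathSketch {

    // The metacharacters listed in the JDK comment.
    private static final String META_CHARS = ".$|()[{^?*+\\";

    static boolean looksLikeFastPath(String regex) {
        if (regex.length() == 1) {
            // (1) a single character that is not a regex metacharacter
            return META_CHARS.indexOf(regex.charAt(0)) == -1;
        }
        if (regex.length() == 2 && regex.charAt(0) == '\\') {
            // (2) a backslash followed by something other than an ASCII digit or letter
            char c = regex.charAt(1);
            boolean asciiDigit = c >= '0' && c <= '9';
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            return !asciiDigit && !asciiLetter;
        }
        return false;
    }

    public static void main(String[] args) {
        // The four patterns used in the benchmark.
        for (String regex : new String[] {"\\.", ",", "1", "[1,.]"}) {
            System.out.println(regex + " -> " + looksLikeFastPath(regex));
        }
    }
}
```

Under these assumptions, the three patterns in multipleSplitByOneChar satisfy the predicate and "[1,.]" does not, which matches the explanation above.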
