The Absolute Fastest Java HTML Escape Function



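Basically, this post is a challenge. I've been trying to optimize an HTML escape function, with moderate success. But I know there are some serious Java hackers out there who can probably do this better than I can, and I'd love to learn.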

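I've been profiling my Java web application and found that the main hotspot is our string escaping function. We currently use Apache Commons Lang for this task, calling StringEscapeUtils.escapeHtml(). I assumed that since it is so widely used it would be reasonably fast, but even my most naive implementation is considerably faster.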

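Here is the benchmark code I used, along with the naive implementation. It tests strings of various lengths, some containing only plain text and some containing HTML that needs to be escaped.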

public class HTMLEscapeBenchmark {
    public static String escapeHtml(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '\'') {
                sb.append("&#39;");
            } else if (c == '"') {
                sb.append("&quot;");
            } else if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>') {
                sb.append("&gt;");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
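    /*
    // Commons Lang 2.x variant to benchmark instead of the method above; needs
    // import org.apache.commons.lang.StringEscapeUtils;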
    public static String escapeHtml(String text) {
        if (text == null) return null;
        return StringEscapeUtils.escapeHtml(text);
    }
    */

    public static void main(String[] args) {
        final int RUNS = 5;
        final int ITERATIONS = 1000000;

        // Standard lorem ipsum text.
        String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
            "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
            "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
            "dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
            "deserunt mollit anim id est laborum. ";
        while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;
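        // Add some characters that need HTML escaping.  Bold every 2 and 3 letter word, quote every 5 letter word.
        // Note: as written, the first pattern requires a literal ']' after two letters, so it never matches this
        // text. It is left unchanged here rather than guessing the intended pattern.
        String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");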
        System.out.print("\nNormal-10");
        String text = loremIpsum.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
        System.out.print("\nNormal-100");
        text = loremIpsum.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
        System.out.print("\nNormal-1000");
        text = loremIpsum.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
        System.out.print("\nHtml-10");
        text = loremIpsumHtml.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
        System.out.print("\nHtml-100");
        text = loremIpsumHtml.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
        System.out.print("\nHtml-1000");
        text = loremIpsumHtml.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
    }
}

On my two-year-old MacBook Pro, I get the following results (seconds per run of 1,000,000 iterations, five runs per row).

Commons Lang StringEscapeUtils.escapeHtml

Normal-10     0.439     0.357     0.351     0.343     0.342
Normal-100     2.244     0.934     0.930     0.932     0.931
Normal-1000     8.993     9.020     9.007     9.043     9.052
Html-10     0.270     0.259     0.258     0.258     0.257
Html-100     1.769     1.753     1.765     1.754     1.759
Html-1000     17.313     17.479     17.347     17.266     17.246

Naive implementation

Normal-10     0.111     0.091     0.086     0.084     0.088
Normal-100     0.636     0.627     0.626     0.626     0.627
Normal-1000     5.740     5.755     5.721     5.728     5.720
Html-10     0.145     0.138     0.138     0.138     0.138
Html-100     0.899     0.901     0.896     0.901     0.900
Html-1000     8.249     8.288     8.272     8.262     8.284

I'll post my own best optimization attempt as an answer. So, my question is: can you do better? What is the fastest way to escape HTML?

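Here is my best attempt at optimizing it. I optimized for what I hope is the common case, plain-text strings that need no escaping, but I couldn't make it any better for strings that contain characters requiring HTML entities.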

    public static String escapeHtml(String value) {
        if (value == null) return null;
        int length = value.length();
        String encoded;
        for (int i = 0; i < length; i++) {
            char c = value.charAt(i);
            if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                // We found a character to encode, so we need to start from here and buffer the encoded string.
                StringBuilder sb = new StringBuilder((int) (length * 1.25));
                sb.append(value.substring(0, i));
                sb.append(encoded);
                i++;
                for (; i < length; i++) {
                    c = value.charAt(i);
                    if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                        sb.append(encoded);
                    } else {
                        sb.append(c);
                    }
                }
                value = sb.toString();
                break;
            }
        }
        return value;
    }
    private static String getHtmlEntity(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '\'': return "&#39;";
            case '"': return "&quot;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return null;
        }
    }

Normal-10     0.021     0.023     0.011     0.012     0.011
Normal-100     0.074     0.074     0.073     0.074     0.074
Normal-1000     0.671     0.678     0.675     0.675     0.680
Html-10     0.222     0.152     0.153     0.153     0.154
Html-100     0.739     0.715     0.718     0.724     0.706
Html-1000     6.812     6.828     6.802     6.802     6.806
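
The scan-first, allocate-late idea above can be written in other ways too. The following is a minimal sketch of one variation, not the code that produced the timings above and not benchmarked here: it assumes a small lookup array in place of the switch, and copies unescaped runs in bulk rather than one character at a time. The class name TableEscapeSketch and the ENTITIES table are made up for illustration.

```java
final class TableEscapeSketch {
    // Indexed by char code. '>' is 62, so codes 0..62 cover every character we escape.
    private static final String[] ENTITIES = new String[63];
    static {
        ENTITIES['&'] = "&amp;";
        ENTITIES['\''] = "&#39;";
        ENTITIES['"'] = "&quot;";
        ENTITIES['<'] = "&lt;";
        ENTITIES['>'] = "&gt;";
    }

    static String escapeHtml(String value) {
        if (value == null) return null;
        int length = value.length();
        StringBuilder sb = null; // allocated only when the first escapable character is found
        int copied = 0;          // everything before this index is already in sb
        for (int i = 0; i < length; i++) {
            char c = value.charAt(i);
            String encoded = c < ENTITIES.length ? ENTITIES[c] : null;
            if (encoded == null) continue;
            if (sb == null) sb = new StringBuilder(length + length / 4);
            // Copy the untouched run before this character, then the entity.
            sb.append(value, copied, i).append(encoded);
            copied = i + 1;
        }
        if (sb == null) return value; // plain text: return the same instance, no allocation
        return sb.append(value, copied, length).toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("Tom & \"Jerry\" <3")); // Tom &amp; &quot;Jerry&quot; &lt;3
        String plain = "plain text";
        System.out.println(escapeHtml(plain) == plain);       // true
    }
}
```

Whether this beats the switch-based version would have to be measured with the same benchmark; the sketch only shows the shape of the idea.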

"I assumed that since it is so widely used it would be reasonably fast, but even my most naive implementation is faster."

If you look at the source code for the Apache version, you'll see that it handles a number of cases your version ignores:

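  • It encodes all entities defined in HTML 4 (plus any others added by the application), not just a hard-wired minimal subset.
  • It encodes all characters greater than or equal to 0x7f.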

In short, it is slower because it is more general.
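
To make the difference concrete, here is a rough sketch of what the extra generality costs per character. It is not the Commons Lang source: the class name GeneralEscapeSketch and the tiny NAMED table are assumptions for illustration (HTML 4 defines a few hundred named entities), and the 0x7f threshold simply follows the description above.

```java
import java.util.HashMap;
import java.util.Map;

final class GeneralEscapeSketch {
    // Hypothetical, deliberately tiny entity table standing in for the full HTML 4 set.
    private static final Map<Character, String> NAMED = new HashMap<>();
    static {
        NAMED.put('&', "amp");
        NAMED.put('"', "quot");
        NAMED.put('<', "lt");
        NAMED.put('>', "gt");
        NAMED.put('\u00A0', "nbsp");
        NAMED.put('\u00A9', "copy");
        NAMED.put('\u00E9', "eacute");
    }

    // Works per UTF-16 char, like the other code in this thread.
    static String escape(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMED.get(c); // a table lookup for every character, plain or not
            if (name != null) {
                sb.append('&').append(name).append(';');
            } else if (c >= 0x7f) {
                // No named entity: fall back to a decimal numeric reference.
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: caf&eacute; &lt;b&gt;&#9731;&lt;/b&gt;
        System.out.println(escape("caf\u00E9 <b>\u2603</b>"));
    }
}
```

Every character pays for the lookup and the range check, and there is no early exit for plain text, which is consistent with the general version losing to a five-character special case.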
