Regex backtracks until it overflows in Java



The following expression:

^(#ifdef FEATURE)+?\s*$((\r\n.*?)*^(#endif)+\s*[//]*\s*(end of)*\s*FEATURE)+?$

causes a stack overflow when I run my compiled .jar file.

The strings to match can look like this:

this is a junk line

#ifdef FEATURE
#endif // end of FEATURE

this is a junk line

#ifdef FEATURE

this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj #endif // h

#endif FEATURE

this is a junk line

So each #ifdef FEATURE ... #endif block, including the lines inside it, should match. The error is as follows:

   at java.util.regex.Pattern$GroupHead.match(Unknown Source)
   at java.util.regex.Pattern$Loop.match(Unknown Source)
   at java.util.regex.Pattern$GroupTail.match(Unknown Source)
   at java.util.regex.Pattern$Curly.match1(Unknown Source)
   at java.util.regex.Pattern$Curly.match(Unknown Source)
   at java.util.regex.Pattern$Slice.match(Unknown Source)
   at java.util.regex.Pattern$GroupHead.match(Unknown Source)
   at java.util.regex.Pattern$Loop.match(Unknown Source)
   at java.util.regex.Pattern$GroupTail.match(Unknown Source)
   at java.util.regex.Pattern$Curly.match1(Unknown Source)
   at java.util.regex.Pattern$Curly.match(Unknown Source)
   at java.util.regex.Pattern$Slice.match(Unknown Source)
   at java.util.regex.Pattern$GroupHead.match(Unknown Source)
   at java.util.regex.Pattern$Loop.match(Unknown Source)
   at java.util.regex.Pattern$GroupTail.match(Unknown Source)
   at java.util.regex.Pattern$Curly.match1(Unknown Source)
   at java.util.regex.Pattern$Curly.match(Unknown Source)
   at java.util.regex.Pattern$Slice.match(Unknown Source)
   at java.util.regex.Pattern$GroupHead.match(Unknown Source)
   at java.util.regex.Pattern$Loop.match(Unknown Source)
   at java.util.regex.Pattern$GroupTail.match(Unknown Source)
   at java.util.regex.Pattern$Curly.match1(Unknown Source)
   at java.util.regex.Pattern$Curly.match(Unknown Source)
   at java.util.regex.Pattern$Slice.match(Unknown Source)
   at java.util.regex.Pattern$GroupHead.match(Unknown Source)
   at java.util.regex.Pattern$Loop.match(Unknown Source)
   at java.util.regex.Pattern$GroupTail.match(Unknown Source)
   at java.util.regex.Pattern$Curly.match1(Unknown Source)
   at java.util.regex.Pattern$Curly.match(Unknown Source)
   at java.util.regex.Pattern$Slice.match(Unknown Source)

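Any strategy for avoiding the backtracking, or an improved expression, is welcome. I have already tried atomic groups (?>), but for some reason they did not help.

The code is as follows:

public String strip(String text) {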

    ArrayList<String> patterns=new ArrayList<String>();
    patterns=readFile("Disabled_Features.txt");
    for(int i = 0; i < patterns.size(); ++i)
    {
      Pattern todoPattern = Pattern.compile("^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$",Pattern.MULTILINE); 
      Matcher m = todoPattern.matcher(text);
      text = m.replaceAll("");
    }
    return text;        
}

I tried the code written by @Wiktor and it works fine:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestRegex {
  public static void main(String[] args) {
    String text = "this is a junk line\n" + 
        "\n" + 
        "#ifdef FEATURE \n" + 
        "#endif // end of FEATURE\n" + 
        "\n" + 
        "this is a junk line\n" + 
        "\n" + 
        "#ifdef FEATURE\n" + 
        "\n" + 
        "this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj #endif // h\n" + 
        "\n" + 
        "#endif FEATURE\n" + 
        "\n" + 
        "this is a junk line";
    // this version does not use Pattern.MULTILINE, which should reduce the backtracking
    Matcher matcher2 = Pattern.compile("\n#ifdef FEATURE((?:\r?\n(?!#endif (?:// end of )?FEATURE).*)*)\r?\n#endif (?:// end of )?FEATURE").matcher(text);
    while (matcher2.find()) {
      System.out.println(matcher2.group());
    }
  }
}

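This makes me think your problem is caused by the size of the input file.

So, if your file is too large, you can implement the input as a CharSequence that wraps your large text file. Why? Because building a Matcher from a Pattern takes a CharSequence as its argument.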

https://github.com/fge/largetext
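To illustrate the point about CharSequence, here is a minimal sketch. `CharWindow` and `CharSequenceDemo` are hypothetical names made up for this example, and the in-memory char[] only stands in for whatever file-backed storage a real wrapper (such as the library linked above) would use; this is not that library's API. It only shows that a Matcher reads its input through `length()` and `charAt()`, and it does not by itself change how deeply the regex engine recurses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical read-only view over a window of a char[] (illustration only).
final class CharWindow implements CharSequence {
    private final char[] data;
    private final int start; // inclusive
    private final int end;   // exclusive

    CharWindow(char[] data, int start, int end) {
        if (start < 0 || end > data.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException();
        }
        return data[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException();
        }
        // Shares the same backing array, so no characters are copied.
        return new CharWindow(data, start + from, start + to);
    }

    @Override
    public String toString() {
        return new String(data, start, end - start);
    }
}

final class CharSequenceDemo {
    // Counts "#ifdef FEATURE" occurrences in any CharSequence, not just a String.
    static int countIfdefs(CharSequence text) {
        Matcher m = Pattern.compile("#ifdef FEATURE").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Calling `CharSequenceDemo.countIfdefs(new CharWindow(chars, 0, chars.length))` works the same way as passing a String, because `Pattern.matcher` is declared to take a CharSequence.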

Update:

I tried to implement Wiktor's solution:

"^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$"

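It only captures the second block, but not a block like this one:

#ifdef FEATURE

junk text that should be captured

#endif // end of FEATURE

In any case, it still overflows when I run the jar.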
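One thing that might be worth trying is a possessive quantifier (`*+`) on the group that consumes the lines between #ifdef and #endif. The sketch below assumes the overflow comes from that repeated group being matched recursively, once per line, as the Loop/GroupHead/GroupTail frames in the stack trace suggest; a possessive loop never needs to backtrack into earlier lines, and the negative lookahead already stops it at the #endif line. `FeatureStripper` and `blockPattern` are hypothetical names, `Pattern.quote` is my addition and assumes the entries in Disabled_Features.txt are literal names rather than regexes, and the unused capturing group has been dropped.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes "#ifdef X ... #endif X" blocks for one feature name.
final class FeatureStripper {

    // Builds the block pattern for a single feature name.
    static Pattern blockPattern(String feature) {
        // Assumption: the feature name is literal text, so quote it.
        String f = Pattern.quote(feature);
        // The closing line, with or without the "// end of " comment.
        String end = "#endif (?:// end of )?" + f + "$";
        return Pattern.compile(
                "^#ifdef " + f
                // Possessive loop: take whole lines until the next one is the #endif line.
                + "(?:\\r?\\n(?!" + end + ").*)*+"
                + "\\r?\\n" + end,
                Pattern.MULTILINE);
    }

    // Returns text with every matching block for the feature removed.
    static String strip(String text, String feature) {
        return blockPattern(feature).matcher(text).replaceAll("");
    }
}
```

In the original method this would be called once per entry, for example `text = FeatureStripper.strip(text, patterns.get(i));`. Like the pattern above, it expects nothing after the feature name on the #ifdef line, so a trailing space there would still prevent a match.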

Latest update