Matching double hyphens in malformed XML comments

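I need to parse XML files that violate the rule that comments must not contain double hyphens, which makes MSXML complain. I'm looking for a way to remove the offending hyphens.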

I'm using StringRegExpReplace(). I tried the following regular expressions:

<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the correct pattern, I would call:

StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing

How can I match the extra hyphens inside XML comments while keeping the rest of the text?

You can use this pattern:

(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))

Details:

(?| 
    \G(?!\A) # contiguous to the preceding match (inside a comment)
    (?|
        -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
      |
         (-[^-]+)          # preserve isolated hyphens 
      |
         -+ (?=-->)        # hyphens before closing sequence, break contiguity
      |
         -->[^<]*          # closing sequence, go to next <
         (*SKIP)(*FAIL)    # break contiguity
    )
  |
    [^<]*<+ # reach the next < (outside comment)
    (?> [^<]+ <+ )*?       # next < until !-- or the end of the string 
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)   # possible hyphens not followed by >
      |
        -+ (?=-->)         # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)        # one hyphen followed by >
      |
        -->[^<]*           # closing sequence, go to next <
        (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
    )                      # in regex101, you can remove it)
)

With this replacement: \1

Online demo

The \G anchor ensures that matches are contiguous. There are two ways to break contiguity:

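the lookahead (?=-->)
the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent any of the characters matched before them from being retried.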

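So when contiguity is broken, or at the start of the string, the first main branch fails (because of the \G anchor) and the second branch is used.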

\K removes everything to its left from the match result.

(*ACCEPT) makes the pattern succeed unconditionally.

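This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capture groups share the same number (in other words, there is only one group, group 1).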

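Note: even though this pattern is long, it needs few steps to obtain a match. The impact of non-greedy quantifiers is kept as small as possible, and the alternatives are ordered to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process the string.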

(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)

matches -- (or ----, etc.) only between <!-- and -->. You need to set the /s flag so that the dot also matches newlines.

Explanation:

(?<!<!)   # Assert that we're not right at the start of a comment
--+       # Match two or more dashes --
(?=       # only if the following can be matched further onwards:
 (?!-?>)  # First, make sure we're not at the end of the comment.
 (?:      # Then match the following group
  (?!-->) # which must not contain -->
  .       # but may contain any character
 )*       # any number of times
 -->      # as long as --> follows.
)         # End of lookahead assertion.
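To see why the /s flag matters, here is a small sketch in Python rather than AutoIt. Python's `re` module is not the PCRE engine AutoIt uses, but it supports every construct in this pattern, and `re.DOTALL` plays the role of /s. The sample string and variable names are made up for illustration.

```python
import re

REGEX = r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)"

# Same pattern compiled twice: with and without "dot matches newline".
with_dotall = re.compile(REGEX, re.DOTALL)
without_dotall = re.compile(REGEX)

# A comment that spans two lines and contains two runs of stray hyphens.
sample = "<a><!-- one -- two\nthree ---- four --></a>"

matches_with_flag = [m.group() for m in with_dotall.finditer(sample)]
matches_without_flag = [m.group() for m in without_dotall.finditer(sample)]

print(matches_with_flag)     # ['--', '----']
print(matches_without_flag)  # ['----']
```

Without the flag, the lookahead starting at the first run cannot cross the newline to reach `-->`, so that run is missed. The opening `<!--` and the closing `-->` are skipped in both cases.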

Test it live on regex101.com.

I guess the correct AutoIt syntax is

StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
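For readers without AutoIt at hand, the same replacement can be sketched in Python. This is only an approximation of the AutoIt call: it assumes Python's `re` treats the pattern the same way PCRE does, and the helper name `strip_comment_hyphens` and the sample strings are hypothetical.

```python
import re

# Same pattern as the AutoIt call, including the inline (?s) flag.
DOUBLE_HYPHENS = re.compile(r"(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)")


def strip_comment_hyphens(xml_string: str) -> str:
    """Delete runs of two or more hyphens that are followed later by '-->'."""
    return DOUBLE_HYPHENS.sub("", xml_string)


# Hyphen runs inside the comment are removed; the delimiters survive.
broken = "<root><!-- first -- second ---- third --><item>a--b</item></root>"
cleaned = strip_comment_hyphens(broken)
print(cleaned)
# <root><!-- first  second  third --><item>a--b</item></root>

# Caveat: text *before* a later comment is also affected.
before_comment = "<item>a--b</item><!-- note -->"
print(strip_comment_hyphens(before_comment))
# <item>ab</item><!-- note -->
```

The second call shows a limit worth knowing: the lookahead only checks that a `-->` appears somewhere later in the string, not that the hyphens sit after an opening `<!--`. If element text or attribute values outside comments can contain `--`, the result should be checked before the file is handed to the XML parser.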
