Ignoring a prefix at the start of a word in a regular expression

I am trying to parse all links in a message.

My Java code looks like this:

// Backslashes must be doubled inside a Java string literal.
Pattern URLPATTERN = Pattern.compile(
    "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
// Store the start and end offsets of every link found in the message.
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]

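The problem is that links sometimes begin with a color code, which looks like this: [&§]{1}[a-z0-9]{1}

For example: Please use Google: §ehttps://google.com, and don't ask me.

With this regex, which I found somewhere on the internet, the match is ehttps://google.com, but it should be only https://google.com

How can I change the regex above so that it excludes the following pattern but still matches the link that follows the color code?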

[&§]{1}[a-z0-9]{1}

You can add the (?:[&§][a-z0-9])? pattern at the start of the regex (an optional sequence that matches & or §, followed by an ASCII letter or digit):

Pattern URLPATTERN = Pattern.compile(
    "(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?", Pattern.CASE_INSENSITIVE);

See the regex demo.

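When the regex finds §ehttps://google.com, the §e part is consumed by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value. Because the code reads the link from matcher.start(1) to matcher.end(), the color code is skipped.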
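As a rough illustration of the point about Group 1, here is a minimal, self-contained sketch. The class name ColorCodeLinks and the helper findLinks are hypothetical names introduced only for this example, and § is written as the escape \u00A7 so that the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class ColorCodeLinks {
    // Pattern from the answer. \u00A7 is the section sign used by the color codes.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})"
            + "((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE);

    // Collects the text of every link, starting at group 1 so that
    // a leading color code is left out.
    static List<String> findLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(findLinks(message));
    }
}
```

The whole match (group 0) still contains the color code here; only the offsets taken from group 1 leave it out, which is why the sketch uses matcher.start(1) rather than matcher.start().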

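There is no need to use Pattern.MULTILINE | Pattern.DOTALL with this regex: the pattern contains no unescaped . and no ^/$ anchors, so these flags have no effect.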
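One way to check the remark about the flags is to compile the same pattern with and without them and compare the results on a multi-line message. This is only a sketch: the class FlagComparison, the helper findLinks, and the sample message are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class FlagComparison {
    static final String REGEX =
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})"
            + "((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?";

    // Returns the group-1-based link text for every match of the given pattern.
    static List<String> findLinks(Pattern pattern, String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = pattern.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "first \u00A7ehttps://google.com\nsecond &ahttp://example.org";
        Pattern minimal = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
        Pattern withFlags = Pattern.compile(REGEX,
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        // Both lists should be identical, since neither flag affects this pattern.
        System.out.println(findLinks(minimal, message));
        System.out.println(findLinks(withFlags, message));
    }
}
```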
