Remove URLs from text using a regular expression, except for the current domain



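I'm trying to preg_replace a string and remove all URLs from it that do not contain the current domain.

So far I have this regex, but it does not exclude mydomain. What am I doing wrong?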
http[s]?://[w]{0,3}\.{0,1}((?<!mydomain)[^.].*)

Expected input and output:
http://regex101. => should match
http://www.regex101. => should match
https://regex101. => should match
https://www.regex101. => should match
https://www.mydomain. => should not match, but it does

https://regex101.com/r/kGil9O/1

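I've read several SO questions and answers, but they either don't apply to my case or differ in some way. When answering, please explain where I went wrong so I can do better next time. Thanks.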

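The negative lookbehind asserts that mydomain is not directly to the left of the current position. By the time it runs, the text to the left is something like https:// or https://www., so the assertion is always true and you get all the matches with the pattern you tried.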
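
To make that concrete, here is a small sketch that runs the question's pattern (with the dot after www escaped, and ~ chosen as the delimiter, both assumptions on my part) and prints the text sitting to the left of the position where the lookbehind is evaluated:

```php
<?php
// The question's pattern; ~ is used as the delimiter so :// needs no escaping.
$lookbehind = '~http[s]?://[w]{0,3}\.{0,1}((?<!mydomain)[^.].*)~';

$url = 'https://www.mydomain.';
preg_match($lookbehind, $url, $m, PREG_OFFSET_CAPTURE);

// Group 1 starts exactly where the lookbehind was checked.
$offset = $m[1][1];
$left   = substr($url, 0, $offset);

echo "Left of the lookbehind: $left" . PHP_EOL;
// The lookbehind only fails if these last 8 characters are 'mydomain'.
echo "Last 8 characters: " . substr($left, -8) . PHP_EOL;
```

The text to the left ends in the www. prefix rather than in mydomain, so the lookbehind has nothing to reject.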

You could optionally match www. using a possessive quantifier, followed by a negative lookahead:

^https?://(?:www\.)?+(?!mydomain\.)\S+$

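The pattern matches:

  • ^ Start of string
  • https?:// Match the protocol with an optional s, followed by ://
  • (?:www\.)?+ Optionally match www. and, because the quantifier is possessive, do not backtrack once it has matched
  • (?!mydomain\.) Negative lookahead, assert that mydomain. is not directly to the right of the current position
  • \S+ Match 1 or more non-whitespace characters
  • $ End of string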
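
The possessive quantifier is what keeps the engine from giving www. back and retrying the lookahead one step earlier. A short sketch of the difference (the variable names are mine, the second pattern is the same one with a plain greedy ? for comparison):

```php
<?php
$possessive = '~^https?://(?:www\.)?+(?!mydomain\.)\S+$~';
$greedy     = '~^https?://(?:www\.)?(?!mydomain\.)\S+$~';

$url = 'https://www.mydomain.';

// Possessive: www. stays consumed, the lookahead sees "mydomain." and the match fails.
$possessiveResult = preg_match($possessive, $url);

// Greedy: the engine backtracks, skips www., and the lookahead then sees
// "www.mydomain.", which does not start with "mydomain.", so the match succeeds.
$greedyResult = preg_match($greedy, $url);

echo "possessive: $possessiveResult, greedy: $greedyResult" . PHP_EOL;
```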


Example

$strings = [
    "http://regex101.",
    "http://www.regex101.",
    "https://regex101.",
    "https://www.regex101.",
    "https://www.mydomain.",
    "https://mydomain."
];

// ~ is the delimiter, so the slashes in :// need no escaping.
$pattern = '~^https?://(?:www\.)?+(?!mydomain\.)\S+$~';

foreach ($strings as $s) {
    if (preg_match($pattern, $s)) {
        echo "Match: $s" . PHP_EOL;
    } else {
        echo "No match: $s" . PHP_EOL;
    }
}

Output

Match: http://regex101.
Match: http://www.regex101.
Match: https://regex101.
Match: https://www.regex101.
No match: https://www.mydomain.
No match: https://mydomain.
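
Since the question is about preg_replace on a longer string, here is a rough sketch of how the same idea might be applied without the ^ and $ anchors. The helper name removeForeignUrls and the domain mydomain.example are placeholders of mine, not something from the question. Note that this is only a prefix check on the host, that subdomains other than www. are treated as foreign, and that \S+ also swallows punctuation glued to the end of a URL.

```php
<?php
// Hypothetical helper: removes every http(s) URL whose host, after an
// optional "www.", does not start with $domain.
function removeForeignUrls(string $text, string $domain): string
{
    // preg_quote escapes the dot in the domain and the ~ delimiter.
    $pattern = '~https?://(?:www\.)?+(?!' . preg_quote($domain, '~') . ')\S+~i';

    // preg_replace returns null on error; fall back to the untouched text.
    return preg_replace($pattern, '', $text) ?? $text;
}

$text   = 'See https://www.mydomain.example/page and https://regex101.com/r/kGil9O/1 here.';
$result = removeForeignUrls($text, 'mydomain.example');

echo $result . PHP_EOL;
```

The URL on the placeholder domain is kept and the other one is removed, leaving the surrounding whitespace in place.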

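You are using the lookbehind incorrectly: it checks the text to the left of the current position, and you try to match www. before the lookbehind. The text immediately before that position is the scheme and the optional www. prefix, which is never mydomain.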

Use a lookahead instead:

https?:\/\/(?!(?:www\.)?mydomain)(?:www\.)?([^.].*)


Explanation

--------------------------------------------------------------------------------
http                     'http'
--------------------------------------------------------------------------------
s?                       's' (optional (matching the most amount
                         possible))
--------------------------------------------------------------------------------
:                        ':'
--------------------------------------------------------------------------------
\/                       '/'
--------------------------------------------------------------------------------
\/                       '/'
--------------------------------------------------------------------------------
(?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
(?:                      group, but do not capture (optional
                         (matching the most amount possible)):
--------------------------------------------------------------------------------
www                      'www'
--------------------------------------------------------------------------------
\.                       '.'
--------------------------------------------------------------------------------
)?                       end of grouping
--------------------------------------------------------------------------------
mydomain                 'mydomain'
--------------------------------------------------------------------------------
)                        end of look-ahead
--------------------------------------------------------------------------------
(?:                      group, but do not capture (optional
                         (matching the most amount possible)):
--------------------------------------------------------------------------------
www                      'www'
--------------------------------------------------------------------------------
\.                       '.'
--------------------------------------------------------------------------------
)?                       end of grouping
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
[^.]                     any character except: '.'
--------------------------------------------------------------------------------
.*                       any character except \n (0 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
)                        end of \1
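
As a quick check of the lookahead version, the sketch below runs it with preg_match and prints capture group 1. I've used ~ as the delimiter so the slashes need no escaping; the sample URLs are the ones from the question.

```php
<?php
$pattern = '~https?://(?!(?:www\.)?mydomain)(?:www\.)?([^.].*)~';

$urls = ['https://www.regex101.', 'https://www.mydomain.', 'https://mydomain.'];

$captured = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m)) {
        // Group 1 holds everything after the optional "www." prefix.
        $captured[$url] = $m[1];
        echo "$url => group 1: {$m[1]}" . PHP_EOL;
    } else {
        $captured[$url] = null;
        echo "$url => no match" . PHP_EOL;
    }
}
```

Here the lookahead is evaluated right after ://, before www. is consumed, so it can reject both www.mydomain and mydomain.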
