Regex to match domain.com but not @domain.com

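This should be simple, but it eludes me. There are many good and bad regex approaches for matching a URL, with or without a protocol, with or without www. The problem I have is this (in JavaScript): if I use a regex to match URLs in a text string and set it up so that it matches just 'domain.com', it also catches the domain of an email address (the part after the '@'), which I don't want. A negative lookbehind would solve it, but apparently not in JS.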

This is the closest I have come so far:

 /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

but it fails if the match is not at the start of the string. I am sure I am going about it the wrong way. Is there a simple answer to this?

Edit: revised the regex in response to some of the comments below (sticking with 'www' rather than allowing subdomains):

\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$

However, as mentioned in the comments, this still matches the domain that follows an @.

Thanks

"but it fails if the match is not at the start of the string."

That is because of the ^ at the start of the pattern. Without it:

/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu toto@foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]

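though it still matches the space before the domain. It also makes false assumptions about what a domain looks like:

xyz.example.org is a valid domain that your regexp does not match;
www.3x4mpl3.org is a valid domain that your regexp does not match;
example.co.uk is a valid domain that your regexp does not match;
ουτοπία.δπθ.gr is a valid domain that your regexp does not match.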
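As for the leading space in the transcript above, and the fact that the domain of toto@foobar.com still matches: ([^@]) has to consume one character, so the engine simply starts the match wherever that character is not an @. A small sketch that exposes the consumed character (the helper name inspect is made up for illustration):

```javascript
// The question's pattern without the leading ^ (no g flag, so exec is stateless).
const rx = /(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/;

// Hypothetical helper: returns the full match and the character
// swallowed by the ([^@]) group, or null when nothing matches.
function inspect(text) {
  const m = rx.exec(text);
  return m === null ? null : { match: m[0], consumed: m[2] };
}

console.log(inspect("aoeuaoeu foobar.com")); // the consumed character is the space
console.log(inspect("toto@foobar.com"));     // the consumed character is "f"
```

In the second call the match just begins one character after the @, which is why a consuming group cannot stand in for a lookbehind here.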

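What defines a legal domain name? Loosely, it is just a sequence of UTF-8 characters separated by dots. It cannot have two dots in a row, and the minimal canonical form is \w\.\w\w (as I don't think one-letter TLDs exist).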
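That loose definition can be written down as a rough shape check. This is only a sketch under the assumptions just stated (dot-separated labels, no consecutive dots, a final label of at least two characters); looksLikeDomain is a hypothetical helper, and because it relies on ASCII \w it rejects hyphens and non-ASCII labels, so it is not real validation.

```javascript
// Hypothetical helper: checks only the rough shape described above.
function looksLikeDomain(s) {
  // one or more "label." groups, then a final label of 2+ word characters
  return /^(\w+\.)+\w{2,}$/.test(s);
}

console.log(looksLikeDomain("xyz.example.org")); // true
console.log(looksLikeDomain("example..org"));    // false: two dots in a row
console.log(looksLikeDomain("a.b"));             // false: final label too short
```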

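That said, what I would do is simply match everything that looks like a domain, taking any run of text with dot separators, delimited by word boundaries (\b):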

/\b(\w+\.)+\w+\b/g

js> "aoe toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]

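and then run a second round of checks to see whether each domain in the list of matches really exists. The downside is that regexps in JavaScript cannot check against Unicode characters: neither \b nor \w will accept ουτοπία.δπθ.gr as a valid domain name.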
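The second round of checks could look roughly like the sketch below. It assumes that "really exists" is approximated by comparing the last label against a list of known TLDs; KNOWN_TLDS is a tiny hand-picked set for illustration only, and findDomains is a made-up helper name. A real check would need a complete TLD list or a DNS lookup.

```javascript
// Assumption: a tiny, hand-picked set of TLDs purely for illustration.
const KNOWN_TLDS = new Set(["com", "net", "org", "edu", "gr", "uk"]);

// First pass: collect everything that looks like a domain.
// Second pass: keep only candidates whose last label is a known TLD.
function findDomains(text) {
  const candidates = text.match(/\b(\w+\.)+\w+\b/g) || [];
  return candidates.filter(function (candidate) {
    const tld = candidate.slice(candidate.lastIndexOf(".") + 1).toLowerCase();
    return KNOWN_TLDS.has(tld);
  });
}

console.log(findDomains("aoe toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com"));
```

Here foo.bar is dropped because "bar" is not in the illustrative set. Note that this pass does nothing about email addresses; their domains are still collected by the first regex.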

In ES6 there is the /u flag, which should work in the latest browsers (though I have not tested it so far):

"ουτοπία.δπθ.gr aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/gu)

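One caveat about the snippet above: the u flag on its own does not widen \w or \b, which stay ASCII-only. A possible workaround, assuming an engine with ES2018 Unicode property escapes and lookbehind, is to spell out the character class and replace \b with lookarounds. The names unicodeDomain and findUnicodeDomains are made up for this sketch.

```javascript
// \p{L} = letters, \p{M} = combining marks, \p{N} = digits (all need the u flag).
// The lookarounds stand in for \b, which only knows ASCII word characters.
const unicodeDomain =
  /(?<![\p{L}\p{M}\p{N}_])([\p{L}\p{M}\p{N}_]+\.)+[\p{L}\p{M}\p{N}_]+(?![\p{L}\p{M}\p{N}_])/gu;

function findUnicodeDomains(text) {
  return text.match(unicodeDomain) || [];
}

console.log(findUnicodeDomains("ουτοπία.δπθ.gr aoe toto.example.org  uaoeu foo.bar"));
```

This only addresses the Unicode limitation; it still picks up the domain part of an email address.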
Edit:

"A negative lookbehind would solve it, but apparently not in JS."

Yes, it would: to skip all email addresses, here is a working lookbehind version of the regex:

/(?<!@)\b(\w+\.)+\w+\b/g

js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]

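though, as with Unicode support, lookbehind was not available in JS engines when this was written; it was later standardized in ES2018.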
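For engines that do support lookbehind (ES2018 and later), a slightly stricter variant may be worth considering. The sketch below is an assumption on my part rather than the pattern from the answer: it rejects a start position that follows an @, a word character, or a dot, so that the tail of an address such as toto@mail.example.org is not picked up either. The names are made up for illustration.

```javascript
// Assumes lookbehind support (ES2018+). The start of a match must not
// directly follow "@", a word character, or ".".
const domainNotEmail = /(?<![@\w.])(\w+\.)+\w+\b/g;

function findNonEmailDomains(text) {
  return text.match(domainNotEmail) || [];
}

console.log(findNonEmailDomains(
  "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com"
));
```

It does not handle a dotted local part such as first.last@example.org, where first.last would still look like a domain.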

The only workaround is to actually keep the @ in the matched regexp, and then discard any match that contains an @:

js> "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).map(function (x) { if (!x.match(/@/)) return x })
["toto.net", (void 0), "toto.example", "foo.bar", "f00bar.com"]

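or using array comprehensions, a non-standard JS 1.7 (SpiderMonkey) extension that was proposed for ES6 but dropped from the final specification, so this form does not run in current engines:

[x for x of "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g) if (!x.match(/@/))];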
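A standard equivalent of that comprehension is Array.prototype.filter, which also avoids the undefined holes that map leaves behind. A minimal sketch (domainsWithoutEmails is a hypothetical name):

```javascript
// Keep the optional "@" in the pattern so email domains are matched
// together with their "@", then drop every match that contains one.
function domainsWithoutEmails(text) {
  const matches = text.match(/@?\b\w+\.+\w+\b/g) || [];
  return matches.filter(function (x) {
    return !x.includes("@");
  });
}

console.log(domainsWithoutEmails(
  "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com"
));
// ["toto.net", "toto.example", "foo.bar", "f00bar.com"]
```

As in the transcript, toto.example.org is cut short to toto.example because this pattern allows only one dot-separated group.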

One last update:

/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g

> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
  '11foo.com',
  'toto.net',
  'toto.example.org',
  'foo.bar',
  'f00bar.com' ]

After a fair bit of fiddling, this ended up working (with thanks to @zmo's last comment):

var rx = /\b(www\.)?(\w*@)?([a-zA-Z-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
var link = txt.match(rx); // txt is the text string being searched
if (link !== null) {
  for (var i = 0; i < link.length; i++) {
    if (link[i].indexOf('@') == -1) {
      // no @ in the match: create link
    } else {
      // the match contains an @: create mailto
    }
  }
}

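I am aware of the limitations regarding subdomains, TLDs and so on (which @zmo has addressed above; if you need to catch every URL, I suggest adapting that code), but that was not my main concern. The code in my answer matches URLs that appear in a text string without 'www.', without also catching the domain of an email address.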
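For completeness, here is one way the two placeholder branches might be filled in. This is a sketch, not the code I actually used: the linkify helper, the http:// prefix and the bare anchor markup are all assumptions, and no HTML escaping is done.

```javascript
var rx = /\b(www\.)?(\w*@)?([a-zA-Z-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;

// Hypothetical helper: wraps each match in an anchor tag, choosing
// between a web link and a mailto link based on the presence of "@".
function linkify(txt) {
  return txt.replace(rx, function (match) {
    if (match.indexOf("@") === -1) {
      // Assumption: bare domains are linked over http.
      return '<a href="http://' + match + '">' + match + "</a>";
    }
    return '<a href="mailto:' + match + '">' + match + "</a>";
  });
}

console.log(linkify("see example.com or mail bob@example.org"));
```

Using replace with a callback keeps the surrounding text intact, whereas match only returns the list of hits.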
