Python - Find all URLs that are not already wrapped in tags



I'm trying to work out a regex that detects URLs in text and wraps them in tags, except for URLs that are already wrapped in <a href="url">...</a>.

input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

This answer helped me a lot, but it does not account for URLs that are already wrapped.

import re

def fix_urls(text):
    # re.VERBOSE is passed as a flag rather than an inline (?x), which newer
    # Python versions only accept at the very start of the pattern.
    pat_url = re.compile(r'''
                         ( # identify URLs within text
         (https|http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
         (?=[\s.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''', re.VERBOSE)
    # findall returns one tuple per match; index 0 is the outer group (the whole URL)
    for url in re.findall(pat_url, text):
        text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})
    return text

If the text already contains any <a> tags, this function wraps those URLs again, which I don't want. Do you know how to make it work?

Use a negative lookbehind to check that href=" does not come right before your URL (second line):

(?x) # verbose
(?<!href=") # don't match already inside hrefs
(https?|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,}) # at least two domain groups, e.g. (gnosis.)(cx) fixed capture group*
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r"]+ # or stuff then space, newline, tab, quote
[\w/]) # resource name ends in alphanumeric or slash
(?=[\s.,>)'"\]]) # assert: followed by white or clause ending

https://regex101.com/r/EpcMKw/2/
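
For reference, here is a minimal sketch of how the lookbehind could be plugged back into fix_urls. It is not the original poster's final code. Two things in it are my own assumptions: a second lookbehind, (?<!">), so that a URL used as the link text right after the opening tag is skipped too, and re.sub in place of the findall/replace loop, so that only the occurrences the regex actually matched get rewritten.

```python
import re

# Sketch only: the pattern follows the answer's regex, with flags passed via re.VERBOSE.
URL_PATTERN = re.compile(
    r'''
    (?<!href=")              # skip URLs that are already an href value
    (?<!">)                  # assumption: also skip a URL used as link text right after ">
    (?:https?|ftp|gopher)    # resource type
    ://                      # colon-slash-slash
    (?:\w+[:.]?){2,}         # at least two domain groups
    (?:/?|[^ \n\r"]+[\w/])   # bare domain (maybe with slash) or a longer path
    (?=[\s.,>)'"\]])         # followed by whitespace or clause ending
    ''',
    re.VERBOSE,
)


def fix_urls(text):
    """Wrap bare URLs in <a> tags, leaving already wrapped ones alone."""
    def wrap(match):
        url = match.group(0)
        return '<a href="{0}">{0}</a>'.format(url)

    # re.sub rewrites only the matched spans, unlike str.replace,
    # which would also touch identical URLs inside existing hrefs.
    return URL_PATTERN.sub(wrap, text)


if __name__ == "__main__":
    print(fix_urls("http://google.sk this is an url"))
    print(fix_urls('<a href="http://google.sk">http://google.sk</a> this is an url'))
```

Like the original pattern, this still requires a character such as whitespace or punctuation after the URL, so a URL at the very end of the text may not be matched in full. For markup more complicated than this, an HTML parser is probably a safer choice than a regex.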
