Python - find all URLs not surrounded by a tags already

I'm trying to figure out a regex that detects the URLs in a text and wraps them in tags, except for those that are already surrounded by <a href="url">...</a>.

input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

This answer helped me a lot, but it doesn't account for URLs that are already surrounded.

import re

def fix_urls(text):
    # re.VERBOSE replaces the inline (?x), which newer Python versions
    # only accept at the very start of the pattern
    pat_url = re.compile(r'''
                         ( # identify URLs within text
   (https|http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
          (?=[\s.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''', re.VERBOSE)
    # findall returns one tuple of groups per match; url[0] is the whole URL
    for url in re.findall(pat_url, text):
        text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})
    return text

If the text already contains any <a> tags, this function wraps those URLs again, which I don't want. Do you know how to make it work?

Use a negative lookbehind to check that href=" does not come right before your URL (second line):

(?x) # verbose
(?<!href=") # don't match already inside hrefs
(https?|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,}) # at least two domain groups, e.g. (gnosis.)(cx) fixed capture group*
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r"]+ # or stuff then space, newline, tab, quote
[\w/]) # resource name ends in alphanumeric or slash
(?=[\s.,>)'"\]]) # assert: followed by white or clause ending

https://regex101.com/r/EpcMKw/2/
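
A lookbehind only inspects the characters immediately before the URL, so it guards the href value but not the link text between <a ...> and </a>, which may still be matched. As a rough sketch of a different technique (not the pattern from the answer above), one could match whole existing <a> elements first and return them untouched, wrapping only the bare URLs that remain. The names URL_OR_ANCHOR and linkify are hypothetical, and the URL part is a simplified variant that assumes URLs contain no whitespace, quotes, or angle brackets:

```python
import re

# Hypothetical names for this sketch; the URL part is a simplified assumption,
# not the exact pattern from the answer.
URL_OR_ANCHOR = re.compile(
    r'''
    (?P<anchor><a\b[^>]*>.*?</a>)   # an existing anchor element, kept as-is
    |
    (?P<url>
        (?:https?|ftp|gopher)://    # resource type and colon-slash-slash
        (?:\w+[:.]?){2,}            # at least two domain groups
        (?:/[^\s"<>]*[\w/])?        # optional path ending in alphanumeric or slash
    )
    ''',
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)


def linkify(text):
    """Wrap bare URLs in <a> tags and leave existing <a> elements alone."""
    def repl(match):
        # Already wrapped: return the whole element unchanged.
        if match.group('anchor') is not None:
            return match.group('anchor')
        url = match.group('url')
        return '<a href="{0}">{0}</a>'.format(url)

    return URL_OR_ANCHOR.sub(repl, text)


print(linkify('http://google.sk this is an url'))
print(linkify('<a href="http://google.sk">http://google.sk</a> this is an url'))
```

Both calls should print the same line as the "result" in the question. Because the anchor alternative is tried first and consumes the whole element, neither the href value nor the link text is ever seen by the URL alternative. This does not escape special characters in the URL and is not an HTML parser, so nested or malformed markup would need a proper parser instead.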
