I'm trying to work out a regex that detects URLs in text and wraps them in <a> tags, except for URLs that are already wrapped in <a href="url">...</a>.
input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
This answer helped me a lot, but it doesn't account for URLs that are already wrapped.
import re

def fix_urls(text):
    pat_url = re.compile(r'''
        (                        # verbose: identify URLs within text
        (https|http|ftp|gopher)  # make sure we find a resource type
        ://                      # ...needs to be followed by colon-slash-slash
        (\w+[:.]?){2,}           # at least two domain groups, e.g. (gnosis.)(cx)
        (/?|                     # could be just the domain name (maybe w/ slash)
        [^ \n\r"]+               # or stuff then space, newline, tab, quote
        [\w/])                   # resource name ends in alphanumeric or slash
        (?=[\s.,>)'"\]])         # assert: followed by white or clause ending
        )                        # end of match group
        ''', re.VERBOSE)
    # findall returns one tuple per match; url[0] is the outer group (the full URL)
    for url in re.findall(pat_url, text):
        text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})
    return text
If the text already contains <a> tags, this function wraps their URLs a second time, which I don't want. Any idea how to make this work?
Use a negative lookbehind to check that href=" does not come right before your URL (second line):
(?x) # verbose
(?<!href=") # don't match already inside hrefs
(https?|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,}) # at least two domain groups, e.g. (gnosis.)(cx) fixed capture group*
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r"]+ # or stuff then space, newline, tab, quote
[\w/]) # resource name ends in alphanumeric or slash
(?=[\s.,>)'"\]]) # assert: followed by white or clause ending
https://regex101.com/r/EpcMKw/2/
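One thing to watch for: the lookbehind only guards a URL that directly follows href=". When the link text is itself a URL, as in the second example from the question, that visible copy is not preceded by href=" and may still be matched. A rough sketch of one way around this (not part of the answer above, and assuming simple, non-nested <a ...>...</a> elements) is to let the regex consume whole existing anchors first and only wrap what matches the URL branch. The names LINK_OR_URL and wrap are made up for this illustration.

```python
import re

# Assumption: anchors are simple, non-nested <a ...>...</a> elements.
LINK_OR_URL = re.compile(r'''
    (?P<anchor><a\b[^>]*>.*?</a>)   # an existing link, consumed whole
    |
    (?P<url>
        (?<!href=")                 # the lookbehind from the answer, kept as a guard
        (?:https?|ftp|gopher)       # resource type
        ://
        (?:\w+[:.]?){2,}            # at least two domain groups
        (?:/?|                      # just the domain name (maybe with a slash)
           [^ \n\r"]+               # or a longer path
           [\w/])                   # ending in alphanumeric or slash
        (?=[\s.,>)'"\]])            # followed by whitespace or clause ending
    )
''', re.VERBOSE | re.DOTALL)


def fix_urls(text):
    """Wrap bare URLs in <a> tags, leaving existing anchors untouched."""
    def wrap(match):
        if match.group('anchor'):
            return match.group('anchor')   # already a link: keep it as-is
        url = match.group('url')
        return '<a href="%s">%s</a>' % (url, url)
    return LINK_OR_URL.sub(wrap, text)
```

Because re.sub scans left to right and the anchor branch is tried first, anything between <a ...> and </a> is skipped in one piece, so the URL branch never sees it. A real HTML parser would be more robust for messy markup; this is only meant for inputs shaped like the examples in the question.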