How to replace all URLs in a string with their hostname and TLD (e.g. google.com)



I have a string that contains several URLs mixed with some text.

How can I replace each URL with its hostname and top-level domain?

Example input: www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask

Expected output: google.com some text google.com some text google.com some text stackoverflow.com

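I found the Python module tldextract, but it only helps with extracting the hostname + TLD from a single URL; it does not help with finding and replacing all the URLs in a string.

Thanks in advance!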

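You can also use a regex with the following logic:

  1. (http[s]?://) --> matches http:// or https://
  2. (www\.) --> matches www.
  3. (?<=\.[a-z][a-z][a-z])(/[^ ]*) --> matches the slash and everything after it that follows .com, without consuming .com itself (it also covers other suffixes such as org or net, as long as they are 3 letters long)

The code below uses the narrower look-behind (?<=\.com), which is enough for the sample input:

import re

yourString = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'
re.sub(r'(http[s]?://)|(?<=\.com)(/[^ ]*)|(www\.)', '', yourString)
Out[1]: 'google.com some text google.com some text google.com some text stackoverflow.com'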
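
The three rules in the list above can also be folded into one compiled pattern that uses the wider three-letter look-behind from rule 3 instead of a hard-coded .com. This is only a minimal sketch: the names URL_NOISE and strip_url_noise are made up for illustration, and it assumes every suffix is exactly three lowercase letters, so paths after two-letter or multi-part suffixes (io, co.uk) are left in place.

```python
import re

# Assumption: suffixes are exactly three lowercase letters (com, org, net, ...),
# because look-behinds in Python's re module must have a fixed width.
URL_NOISE = re.compile(
    r'https?://'                # rule 1: the scheme
    r'|www\.'                   # rule 2: a literal "www."
    r'|(?<=\.[a-z]{3})/[^ ]*'   # rule 3: a path right after a 3-letter suffix
)


def strip_url_noise(text):
    """Delete scheme, 'www.' and path so that only host + suffix remain."""
    return URL_NOISE.sub('', text)


sample = ('www.google.com some text google.com some text '
          'http://google.com some text https://stackoverflow.com/questions/ask')
print(strip_url_noise(sample))
```

Because the pattern knows nothing about real suffixes, it is best treated as a quick fix for input that looks like the sample; a public-suffix-aware library is the safer choice for arbitrary text.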

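You can replace the parts in front of the domain, such as 'www.' (and so on), with '', but that approach ignores whatever comes after the suffix, which cannot be predicted.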

Try this:

import tldextract

somestr = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'
newstr = ''
for word in somestr.split(' '):
    # split each word into subdomain, domain and suffix
    extracted = tldextract.extract(word)
    if extracted.domain != '' and extracted.suffix != '':
        # the word looks like a URL: keep only domain + suffix
        newstr += extracted.domain + '.' + extracted.suffix + ' '
    else:
        # plain text: keep the word unchanged
        newstr += word + ' '
print(newstr)
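
The loop above splits on single spaces and leaves a trailing space in the result. One possible variation on the same word-by-word idea is to let re.sub call a function for every run of non-whitespace, which keeps the original spacing intact. To stay free of third-party dependencies, this sketch uses a hypothetical helper, naive_registered_domain, built on urllib.parse.urlsplit as a stand-in for tldextract; it assumes the suffix is a single label, so it would get co.uk-style suffixes wrong. Both function names and the to_domain parameter are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def naive_registered_domain(token):
    """Hypothetical stand-in for tldextract: return 'domain.tld' or None.

    Assumes a single-label suffix (com, org, ...); multi-label suffixes
    such as co.uk need a public-suffix-aware library like tldextract.
    """
    try:
        # urlsplit only fills in .hostname when a scheme or '//' is present
        parts = urlsplit(token if '://' in token else '//' + token)
        host = parts.hostname or ''
    except ValueError:
        # e.g. a stray '[' is rejected as an invalid IPv6 literal
        return None
    labels = host.split('.')
    if len(labels) < 2 or not all(labels) or not labels[-1].isalpha():
        return None
    return '.'.join(labels[-2:])


def replace_urls(text, to_domain=naive_registered_domain):
    """Replace every whitespace-delimited token that looks like a URL."""
    def swap(match):
        token = match.group(0)
        # fall back to the untouched token when no domain was recognised
        return to_domain(token) or token
    # \S+ matches each run of non-whitespace, so spacing is preserved
    return re.sub(r'\S+', swap, text)


somestr = ('www.google.com some text google.com some text '
           'http://google.com some text https://stackoverflow.com/questions/ask')
print(replace_urls(somestr))
```

Passing a function that wraps tldextract.extract as to_domain should bring back proper public-suffix handling while keeping the same replacement loop.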

Here is another version that uses "re" and "tldextract" on a pandas column:

import re
import pandas as pd
import tldextract
#define the regex pattern to catch any url (try it on regex101.com)
ANY_URL_REGEX = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")
# you may want to lowercase the strings in your column
data['column'] = data['column'].str.lower()
# to simplify the process, create 2 more columns:
# 1- 'url' holds the full URL that was matched, e.g. xyz.co.uk or asd.am.edu or google.com
#    (str.extract returns only the first match in each row)
# 2- 'domain' holds the domain part of that URL
data['url'] = data['column'].str.extract(ANY_URL_REGEX, expand=False)
data['domain'] = data['url'].apply(lambda url: tldextract.extract(url).domain if pd.notnull(url) else '')
# now replace the matched URL in 'column' with its domain; rows without a URL are left unchanged
data['column'] = data.apply(lambda x: x['column'].replace(x['url'], x['domain']) if pd.notnull(x['url']) else x['column'], axis=1)

Note: this sample code extracts sample_site from (http(s)://www.)sample_site.com/whatever. You can modify it to extract sample_site.com instead.
