Python - Splitting a website URL using a start and end condition

I extracted a column containing website links from a data frame.

tweets = csv_doc.Tweet
URL = [re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',x) for x in tweets]

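Now I want to select everything except the http(s):// prefix, up to the ', which marks the end of the information I need. Example: it shows as https://t(.)co/V3aoj9RUh4 (parentheses added to avoid creating a link); needed: 't.co'.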

I thought that, by working on the variable URL instead of using findall, I could use split directly. However, I first tried it on its own: [re.split(r'//?'',x) for x in tweets]

I get no errors, but nothing is displayed on the screen.

My understanding of the pattern:

r'// means it splits the string starting at "//"

? means the length of the string is unknown

' means it ends at "'"

What is going wrong here? How can I fix this so that I keep the main part of the URL?

Once the URLs are split, how can I group and count them? I want to get a list of how many times each link appears.

How about this?

import re

# Create an empty dictionary to store the counts
counts = {}
# Loop over the tweets
for tweet in tweets:
    # Match everything after http(s):// up to the next '/'
    matches = re.findall(r'http[s]?://([^/]*)', tweet)
    for match in matches:
        counts[match] = counts.get(match, 0) + 1
# Print the number of occurrences of each domain
for domain, frequency in counts.items():
    print("{} {}".format(domain, frequency))
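The counting loop above can also be written with collections.Counter from the standard library. The following is only a sketch: the sample tweets list and the count_domains helper are made up for illustration, and the pattern is a slightly stricter variant of the one above that also stops at whitespace, so a URL without a path does not swallow the rest of the tweet.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the `tweets` column.
tweets = [
    "Look at this https://t.co/V3aoj9RUh4",
    "Two links: http://example.com/a and https://t.co/abc",
    "No link here",
]

# Capture everything after http(s):// up to the next '/' or whitespace.
# (Stopping at whitespace is an assumption added on top of the pattern above.)
DOMAIN_PATTERN = re.compile(r"https?://([^/\s]+)")


def count_domains(texts):
    """Return a Counter mapping each domain to its number of occurrences."""
    counts = Counter()
    for text in texts:
        # findall returns the captured group (the domain) for every match.
        counts.update(DOMAIN_PATTERN.findall(text))
    return counts


domain_counts = count_domains(tweets)
# most_common() yields (domain, count) pairs, most frequent first.
for domain, frequency in domain_counts.most_common():
    print(domain, frequency)
```

Counter behaves like a dictionary, so domain_counts["t.co"] works the same way as counts["t.co"] in the loop above.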

You can simply use split with maxsplit set to 2. The URL will be the last item in the list, which you can access with [-1]:

url = 'https://twitter.com/V3aoj9RUh4'
url.split('/', 2)[-1]

Result: twitter.com/V3aoj9RUh4
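Note that this result still contains the path after the domain. If only the host part is needed (as in the 't.co' example from the question), one possible sketch is to split one more time and take index 2, or to let urllib.parse.urlsplit do the parsing. The variable names below are made up for illustration.

```python
from urllib.parse import urlsplit

url = "https://twitter.com/V3aoj9RUh4"

# Option 1: split on '/' at most 3 times.
# The pieces are ['https:', '', 'twitter.com', 'V3aoj9RUh4'], so index 2 is the host.
host_via_split = url.split("/", 3)[2]

# Option 2: let the standard library parse the URL; netloc is the host part.
host_via_urlsplit = urlsplit(url).netloc

print(host_via_split)
print(host_via_urlsplit)
```

The urlsplit version is probably the safer choice for messy input, since it does not depend on counting slashes.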

For URLs inside a longer string, split the string first:

text = 'This is a text https://twitter.com/V3aoj9RUh4 nice'
[url.split('/', 2)[-1] for url in text.split() if 'http' in url]

Applied to your data, this becomes:

URL = [url.split('/', 2)[-1] for x in tweets for url in x.split() if 'http' in url]
