How do I format dictionary objects in a list of dictionaries?



I scraped the following list from a website; let's assume it is random.com.

tags1 = [{tag.name: tag['src']} for tag in soup.find_all('script')]
tags2 = [{tag.name: tag['href']} for tag in soup.find_all(name="link",attrs={'rel':'stylesheet'})]
tag_list = tags1 + tags2 
print(tag_list)
[{'script': 'js/custom.js'},
 {'script': 'https:cdnjs.cloudflare.com/ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js'},
 {'link': 'css/bootstrap.min.css'},
 {'link': 'css/style.css'},
 {'link': 'css/responsive.css'},
 {'link': 'css/jquery.mCustomScrollbar.min.css'},
 {'link': 'https://netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css'}]

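I want to modify this list according to the following rules:

  1. Remove https:// from each value.
  2. Split the value into two parts: domain and path.
  3. If the value has no domain, use "random.com" as the domain.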

The expected output is:

[{'script': [{'domain': 'random.com', 'path': 'js/custom.js'}]},
 {'script': [{'domain': 'cdnjs.cloudflare.com', 'path': 'ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js'}]},
 {'link': [{'domain': 'random.com', 'path': 'css/bootstrap.min.css'}]},
 {'link': [{'domain': 'random.com', 'path': 'css/style.css'}]},
 {'link': [{'domain': 'random.com', 'path': 'css/responsive.css'}]},
 {'link': [{'domain': 'random.com', 'path': 'css/jquery.mCustomScrollbar.min.css'}]},
 {'link': [{'domain': 'netdna.bootstrapcdn.com', 'path': 'font-awesome/4.0.3/css/font-awesome.css'}]}]

Something like this.

You can try urllib.parse:

import re
from urllib.parse import urlparse

output_ = []
# --> Regex that captures everything after the scheme's colon,
#     so a malformed URI such as "https:host/path" can be repaired
extract_uri = re.compile(r":(.+)")
for tag in tag_list:
    for k, v in tag.items():
        extract_ = extract_uri.search(v)
        # --> If the value has a scheme, rebuild it with a well-formed "https://" prefix
        if extract_:
            v = "https://" + extract_.group(1).replace("//", "")
        parse_ = urlparse(v)  # --> Parse the URI
        output_.append({
            k: [{
                "domain": parse_.netloc if parse_.netloc else "random.com",
                "path": parse_.path
            }]
        })
print(output_)

[{'script': [{'domain': 'random.com', 'path': 'js/custom.js'}]},
{'script': [{'domain': 'cdnjs.cloudflare.com',
'path': '/ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js'}]},
{'link': [{'domain': 'random.com', 'path': 'css/bootstrap.min.css'}]},
{'link': [{'domain': 'random.com', 'path': 'css/style.css'}]},
{'link': [{'domain': 'random.com', 'path': 'css/responsive.css'}]},
{'link': [{'domain': 'random.com',
'path': 'css/jquery.mCustomScrollbar.min.css'}]},
{'link': [{'domain': 'netdna.bootstrapcdn.com',
'path': '/font-awesome/4.0.3/css/font-awesome.css'}]}]
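
Note that the paths parsed from full URLs in the output above keep a leading slash ('/ajax/...'), while the expected output in the question has none. A minimal sketch of one way to close that gap follows. The helper name split_url and its default_domain parameter are hypothetical and not part of the answer. It also accepts http, which the question does not mention.

```python
import re
from urllib.parse import urlparse


# Hypothetical helper, not part of the original answer.
def split_url(value, default_domain="random.com"):
    """Return {'domain': ..., 'path': ...} for a src/href value."""
    # Normalise "https:host/..." and "https://host/..." to "https://host/...".
    # Only netloc and path are used afterwards, so the scheme itself is irrelevant.
    match = re.match(r"https?:/*", value)
    if match:
        value = "https://" + value[match.end():]
    parsed = urlparse(value)
    return {
        "domain": parsed.netloc or default_domain,
        # Drop the leading slash so relative and absolute paths look alike.
        "path": parsed.path.lstrip("/"),
    }


tag_list = [
    {"script": "js/custom.js"},
    {"script": "https:cdnjs.cloudflare.com/ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js"},
    {"link": "https://netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css"},
]
result = [{k: [split_url(v)]} for tag in tag_list for k, v in tag.items()]
print(result)
```

Keeping the parsing in one function also makes the fallback domain easy to change if the scraped site is not random.com.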

Try this:

import re

tagList = [{'script': 'js/custom.js'},
           {'script': 'https:cdnjs.cloudflare.com/ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js'},
           {'link': 'css/bootstrap.min.css'}, {'link': 'css/style.css'},
           {'link': 'css/responsive.css'}, {'link': 'css/jquery.mCustomScrollbar.min.css'},
           {'link': 'https://netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css'}]
print(tagList)

reqTagList = []
for i in tagList:
    for k, v in i.items():
        # Match "https" followed by any non-word characters,
        # so it works for both "https:" and "https://"
        result = re.match(r"https\W*", v)
        if result is not None:
            # Everything after the scheme: first segment is the domain, the rest is the path
            url = v[result.end():]
            reqTagList.append(
                {k: [
                    {
                        'domain': url.split('/')[0],
                        'path': '/'.join(url.split('/')[1:])
                    }]
                })
        else:
            # No scheme, so treat the value as a path on the default domain
            reqTagList.append(
                {k: [
                    {
                        'domain': 'random.com',
                        'path': v
                    }]
                })
print(reqTagList)
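
The code above splits the URL on '/' twice to get the domain and the path. As a small sketch, str.partition does the same split in one call. The function name split_domain_path is hypothetical, and the input is assumed to already have its scheme stripped.

```python
# Hypothetical helper: split "host/some/path" into (host, "some/path").
def split_domain_path(url):
    # partition splits at the first "/" only; the separator itself is discarded
    domain, _, path = url.partition("/")
    return domain, path


print(split_domain_path("cdnjs.cloudflare.com/ajax/libs/fancybox/2.1.5/jquery.fancybox.min.js"))
```

If the string contains no '/', partition returns the whole string as the domain and an empty path.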
