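I have a list of URLs that all share the same first part. Every URL contains "ingredient-disclosure", followed by the product category, with the two separated by a /. I want to build a list of all the product categories.
For the URL given below, I want to grab the text 'commercial-professional' and store it in a list that holds all the product categories.
Here is one of the URLs: https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx
Thanks for your help!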
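You may want to store the categories in a Python set, so that you end up with exactly one entry per category.
Try the following example, which uses their index page to collect the candidate links: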
import requests
from bs4 import BeautifulSoup

url = "https://churchdwight.com/ingredient-disclosure/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

# A set keeps only one copy of each category
categories = set()

for a_tag in soup.find_all("a", href=True):
    # Split the href on "/" and drop the empty pieces
    url_parts = [p for p in a_tag["href"].split('/') if p]

    # Keep only links of the form /ingredient-disclosure/<category>/<product>
    if len(url_parts) > 2 and url_parts[0] == "ingredient-disclosure":
        categories.add(url_parts[1])

print("\n".join(sorted(categories)))
This gives you the following categories:
Nausea-Relief
antiperspirant-deodorant
cleaning-products
commercial-professional
cough-allergy
dental-care
depilatories
fabric-softener-sheets
feminine-hygiene
hair-care
hand-sanitizer
hemorrhoid-relief
laundry-fabric-care
nasal-care
oral-care
pain-relief
pet-care
pool-products
sexual-health
skin-care
wound-care
You can split the URL on the "/" character and take the part you need from the resulting list:
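prod_cat_list = []
url = 'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx'
# parts[0] is 'https:', parts[1] is '' (from the '//'), parts[2] is the domain
parts = url.split('/')
domain = parts[2]
# parts[3] is 'ingredient-disclosure', so the category is parts[4]
prod_category = parts[4]
prod_cat_list.append(prod_category)
print(prod_cat_list)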
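If you would rather not rely on a fixed index such as parts[4], here is a minimal sketch that looks for the "ingredient-disclosure" segment and takes whatever follows it. The helper name extract_category and its marker parameter are made up for this illustration, and the urls list simply repeats the one URL from the question to show that duplicates collapse; swap in your own list.

```python
from urllib.parse import urlsplit


def extract_category(url, marker="ingredient-disclosure"):
    """Return the path segment right after `marker`, or None if there is none.

    `extract_category` and `marker` are hypothetical names for this sketch.
    """
    # urlsplit isolates the path, so the scheme and domain never shift the index
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if marker in segments:
        idx = segments.index(marker)
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


# Placeholder input: the URL from the question, repeated to show de-duplication
urls = [
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
]

# The set comprehension drops duplicates and skips URLs with no category
prod_categories = sorted({c for c in map(extract_category, urls) if c})
print(prod_categories)
```

Because it works on the path only, the same helper should also accept relative links such as /ingredient-disclosure/commercial-professional/..., which is the form the index-page example above assumes.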