Use BeautifulSoup to get the second part of a URL and store that text in a variable



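I have a list of URLs that all share the same first part. Every URL contains "ingredient-disclosure", followed by the product category, separated by a /. I want to build a list containing all the product categories.

For a given URL, I want to grab the text 'commercial-professional' and store it in a list that holds all the product categories.

Here is one of the URLs: https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx

Thanks for your help!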

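You may want to consider using a Python set to store the categories, so that you end up with exactly one entry per category.

Try the following example, which uses their index page to collect the possible links: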

import requests
from bs4 import BeautifulSoup

url = "https://churchdwight.com/ingredient-disclosure/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
categories = set()

for a_tag in soup.find_all("a", href=True):
    # Split the href on "/" and drop empty segments
    url_parts = [p for p in a_tag["href"].split('/') if p]
    # Keep only links shaped like ingredient-disclosure/<category>/<product>
    if len(url_parts) > 2 and url_parts[0] == "ingredient-disclosure":
        categories.add(url_parts[1])

print("\n".join(sorted(categories)))

This would give you the following categories:

Nausea-Relief
antiperspirant-deodorant
cleaning-products
commercial-professional
cough-allergy
dental-care
depilatories
fabric-softener-sheets
feminine-hygiene
hair-care
hand-sanitizer
hemorrhoid-relief
laundry-fabric-care
nasal-care
oral-care
pain-relief
pet-care
pool-products
sexual-health
skin-care
wound-care

Split the URL on the "/" character and take the element you need from the resulting list:

prod_cat_list = []
url = 'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx'
# Splitting on "/" yields: 'https:', '', domain, 'ingredient-disclosure', category, file name
parts = url.split('/')
domain = parts[2]
prod_category = parts[4]
prod_cat_list.append(prod_category)
print(prod_cat_list)
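
If you already have the full list of URLs, the fixed index 4 can be avoided by looking for the segment that follows "ingredient-disclosure" in the URL path. The following is only a minimal sketch using the standard library's urllib.parse; the function name product_category, its prefix parameter, and the urls list are hypothetical names chosen for illustration, and the list holds just the one URL from the question.

```python
from urllib.parse import urlsplit


def product_category(url, prefix="ingredient-disclosure"):
    """Return the path segment right after `prefix`, or None if there is none."""
    # urlsplit separates scheme and domain from the path, so only the path is split
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if prefix in segments:
        idx = segments.index(prefix)
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


# Hypothetical input list; replace it with your own URLs
urls = [
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
]

# A set comprehension removes duplicates; sorted() turns it back into a list
prod_categories = sorted({c for c in map(product_category, urls) if c})
print(prod_categories)
```

For the single URL above this prints ['commercial-professional']. URLs that do not contain the prefix, or that end right after it, are skipped rather than raising an IndexError.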
