Unable to scrape <strong> tags correctly using Beautiful Soup



I am trying to scrape product release dates from the Adidas website using the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                         '84.0.4147.105 Safari/537.36'}
url = "https://www.adidas.com.sg/release-dates"
productsource = requests.get(url, headers=headers, timeout=15)
productinfo = BeautifulSoup(productsource.text, "lxml")


def jdMonitor():
    # webscraper
    all_items = productinfo.find_all(name="div", class_="gl-product-card")
    # print(all_items)
    for item in all_items:
        # print(item)
        pname = item.find(name="div", class_="plc-product-name___2cofu").text
        pprice = item.find(name="div", class_="gl-price-item").text
        imagelink = item.find(name="img")['src']
        plink = f"https://www.adidas.com.sg/{item.a['href']}"
        try:
            pdate = item.find(name="div", class_="plc-product-date___1zgO_").strong.text
        except AttributeError as e:
            print(e)
            pdate = "Data Not Available"
        print(f"""
        Product Name: {pname}
        Product Price: {pprice}
        Image Link: {imagelink}
        Product Link: {plink}
        Product Date: {pdate}
        """)


jdMonitor()

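However, pdate comes back as an empty string. If I use print(productinfo.find_all(name="strong")) to extract every strong tag on the page, all the tags are returned, but the one I need has no content. The output I get is: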

... <strong>All Recycled Materials</strong>, <strong> </strong> ...

The blank space inside the second pair of strong tags should contain a date like this:

<strong>Wednesday 30 Jun 21:30</strong>

Can someone explain why this happens, and how to extract the date?

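The dates appear to be filled in dynamically: they are not present in the page source (open the source and search for "WEDNESDAY 30 JUN 19:00", and nothing is found). The most obvious way to make this work is selenium, although it may be a slow solution. requests-html did not work for me, same as bs4. Rendering the page did not help either (or I did something wrong).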

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# running in headless mode for some reason gives no result or throws an error.
# options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('https://www.adidas.com.sg/release-dates')
# find_elements(By.CSS_SELECTOR, ...) is used here because find_elements_by_css_selector
# was deprecated in Selenium 4 and later removed.
for date in driver.find_elements(By.CSS_SELECTOR, '.plc-product-date___1zgO_.gl-label.gl-label--m.gl-label--condensed'):
    print(date.text)
driver.quit()
# output:
'''
WEDNESDAY 30 JUN 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
'''
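
To check the claim that the date is missing from the static HTML (rather than the selector being wrong), it can help to look at the strong tags exactly as the server sends them. The following is a minimal sketch using only the standard library's html.parser instead of bs4, so it runs without extra installs. The StrongTextCollector class, the strong_texts helper, and the raw_html snippet are hypothetical stand-ins for illustration; raw_html only imitates the empty tag shown in the question and is not the real server response.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text found inside every <strong> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <strong> elements are currently open
        self.texts = []   # one entry per <strong> start tag encountered

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


def strong_texts(html):
    """Return the stripped text of each <strong> element in the given HTML."""
    parser = StrongTextCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Hypothetical snippet imitating the static HTML from the question:
# the date placeholder is present, but its text has not been filled in yet.
raw_html = '<div class="plc-product-date___1zgO_"><strong> </strong></div>'
print(strong_texts(raw_html))  # [''] -> nothing for a static parser to extract
```

If the same check on productsource.text also yields an empty string for the date element, the value is most likely inserted by JavaScript after the page loads, which is why a browser-driven tool such as selenium sees it and requests does not.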

You can also use a regex to pull out these dates (if they show up in the text), like this:

import re
test = '''
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
'''
matches = re.findall(r"[a-zA-Z]+\s\d+\s\w+\s\d+:\d+", str(test))
finall_matches = '\n'.join(matches)
print(finall_matches)
# output before joining: "['Wednesday 30 Jun 19:00', 'THURSDAY 01 JUL 05:00', 'THURSDAY 01 FEb 25:00']"
# output after joining:
'''
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
'''
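
Note that the pattern only checks the shape of the text, so an impossible time such as "25:00" still matches. A possible follow-up, sketched below, is to pass each match through datetime.strptime and keep None for values that do not parse. The parse_release_dates function and its year parameter are assumptions made for this example: the scraped strings carry no year, so one has to be supplied by the caller. The sketch also assumes English day and month names (the default C/English locale).

```python
import re
from datetime import datetime

# Same pattern as above: weekday, day of month, month abbreviation, HH:MM
DATE_PATTERN = re.compile(r"[a-zA-Z]+\s\d+\s\w+\s\d+:\d+")


def parse_release_dates(text, year):
    """Return (raw_match, datetime_or_None) pairs for every date-like match.

    `year` is supplied by the caller because the scraped text has no year.
    """
    results = []
    for raw in DATE_PATTERN.findall(text):
        try:
            # strptime matches day and month names case-insensitively
            parsed = datetime.strptime(f"{raw} {year}", "%A %d %b %H:%M %Y")
        except ValueError:
            parsed = None  # matched the pattern but is not a valid date/time
        results.append((raw, parsed))
    return results


sample = """
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
"""
for raw, parsed in parse_release_dates(sample, year=2021):
    print(raw, "->", parsed)
```

With this sample, the first two entries parse into datetime objects, while the "25:00" entry comes back as None and can be skipped or logged.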
