Hello! I only need the href="this value" inside the h4 block. Unfortunately, the a element has no class or id. This is what the block looks like in the HTML:
<h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
Python code:
page = requests.get(product_fetch_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
product_fetch_url_class = "article_title_list"
product_fetch_url_html = "h4"
find_urls = soup.find_all('{0}'.format(product_fetch_url_html), class_='{0}'.format(product_fetch_url_class))
for row in find_urls:
    string = row
    print("Produkt: {0}".format(string))
    html = BeautifulSoup(string, "html.parser")
    for a in html.find('a', href=True):
        print("Produkt URL-Slug: {0}".format(a['href']))
Output:
Produkt: <h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
Traceback (most recent call last):
  File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 45, in <module>
    fetch_urls()
  File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 38, in fetch_urls
    html = BeautifulSoup(string, "html.parser")
  File "/usr/lib/python3.10/site-packages/bs4/__init__.py", line 312, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
Expected output:
Produkt: <h4 class="article_title_list" itemprop="name"><a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
Produkt Url-slug: 10-deutsche-pokemon-karten-sparpack
Any ideas on how to solve this with BeautifulSoup rather than re/regex?
If you are just trying to get the links, select the elements more specifically.
for a in soup.select('h4>a'):
    print(a.get('href'))
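As a minimal sketch of why the loop in the question fails and how the selector above sidesteps it: every item returned by find_all is already a parsed Tag, so it can be searched directly instead of being passed to BeautifulSoup a second time. The snippet uses the HTML from the question as an inline string (an assumption, so that it runs without a network request); the variable names slugs_from_loop and slugs_from_select are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the block from the question (stand-in for the downloaded page).
html = """
<h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: keep the find_all loop, but search each Tag directly.
slugs_from_loop = []
for row in soup.find_all("h4", class_="article_title_list"):
    a = row.find("a", href=True)  # a single Tag, or None if no link exists
    if a is not None:
        slugs_from_loop.append(a["href"])

# Option 2: let a CSS selector do the narrowing.
slugs_from_select = [a.get("href") for a in soup.select("h4.article_title_list > a")]

print(slugs_from_loop)
print(slugs_from_select)
```

Note that find returns one element rather than a list, so looping over its result would walk the children of the link instead of a collection of links.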
Or, if you prefer to go row by row:
for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))
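One caveat with the row-by-row form: e.h4.a raises an AttributeError as soon as a row has no h4, because e.h4 is then None. A slightly more defensive sketch, assuming a hypothetical product list in which one row lacks a title link (the markup and slugs below are invented for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a product list; the second row has no link.
html = """
<div id="product-list">
  <div><h4 class="article_title_list"><a href="slug-one">One</a></h4></div>
  <div><p>Advertisement without a title</p></div>
  <div><h4 class="article_title_list"><a href="slug-two">Two</a></h4></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for e in soup.select("#product-list > div"):
    link = e.select_one("h4 > a[href]")  # None when the row has no title link
    if link is None:
        continue  # skip rows without a product link
    hrefs.append(link["href"])

print(hrefs)
```

Whether such rows actually occur depends on the page, so the guard is only worth keeping if they do.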
Example
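import requests
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get('https://www.lotticards.de/pokemon-sammelkarten').text, 'html.parser')
for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))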
Output
10-deutsche-pokemon-karten-sparpack
Glaenzendes-Schicksal-Booster-Deutsch
Pokemon-Celebrations-Booster-Packung-Deutsch
Pikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch
Verborgenes-Schicksal-Top-Trainer-Box
Sun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch
Champions-Path-Elite-Trainer-Box-Englisch
Glaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch
...
Or as a list comprehension, based on itemprop="url":
[a.get('content') for a in soup.select('#product-list [itemprop="url"]')]
Output:
['https://www.lotticards.de10-deutsche-pokemon-karten-sparpack',
'https://www.lotticards.deGlaenzendes-Schicksal-Booster-Deutsch',
'https://www.lotticards.dePokemon-Celebrations-Booster-Packung-Deutsch',
'https://www.lotticards.dePikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch',
'https://www.lotticards.deVerborgenes-Schicksal-Top-Trainer-Box',
'https://www.lotticards.deSun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch',
'https://www.lotticards.deChampions-Path-Elite-Trainer-Box-Englisch',
'https://www.lotticards.deGlaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch',
'https://www.lotticards.deShining-Fates-Elite-Trainer-Box-Englisch',
'https://www.lotticards.deHidden-Fates-Elite-Trainer-Box-Reprint-Januar-2021',
'https://www.lotticards.deVMAX-Climax-s8b-Display-Japanisch',
'https://www.lotticards.deSonne-Mond-Ultra-Prisma-Booster-Deutsch',
'https://www.lotticards.desonne-mond-2-stunde-der-waechter-booster-deutsch-kaufen',
'https://www.lotticards.deSchwert-Schild-Kampfstile-Display-Deutsch',
'https://www.lotticards.dePokemon-Celebrations-Booster-Pack-Englisch',...]
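The content values above appear to lack a slash between the host and the slug. If absolute URLs are needed, one option is to resolve the href slugs against the page URL with urllib.parse.urljoin from the standard library. This is only a sketch: it assumes the slugs are relative to the page URL used in the example and that the page sets no base element; the names page_url, slugs and absolute_urls are chosen for illustration.

```python
from urllib.parse import urljoin

# Page URL taken from the example above; slugs copied from its output.
page_url = "https://www.lotticards.de/pokemon-sammelkarten"
slugs = [
    "10-deutsche-pokemon-karten-sparpack",
    "Glaenzendes-Schicksal-Booster-Deutsch",
]

# urljoin resolves each relative href against the page URL, as a browser would.
absolute_urls = [urljoin(page_url, slug) for slug in slugs]

for url in absolute_urls:
    print(url)
```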