BeautifulSoup: get the href inside an h tag



Hello! I only need the href="this value" inside the h4 block. Unfortunately, the a tag has no class or id. This is what the block looks like in the HTML:

<h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>

Python code:

page = requests.get(product_fetch_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
product_fetch_url_class = "article_title_list"
product_fetch_url_html = "h4"
find_urls = soup.find_all('{0}'.format(product_fetch_url_html), class_='{0}'.format(product_fetch_url_class))
for row in find_urls:
    string = row
    print("Produkt: {0}".format(string))
    html = BeautifulSoup(string, "html.parser")

    for a in html.find('a', href=True):
        print("Produkt URL-Slug: {0}".format(a['href']))

Output:

Produkt: <h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
Traceback (most recent call last):
  File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 45, in <module>
    fetch_urls()
  File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 38, in fetch_urls
    html = BeautifulSoup(string, "html.parser")
  File "/usr/lib/python3.10/site-packages/bs4/__init__.py", line 312, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable

Expected output:

Produkt: <h4 class="article_title_list" itemprop="name"><a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4> 
Produkt Url-slug: 10-deutsche-pokemon-karten-sparpack

Any ideas on how to solve this with BeautifulSoup rather than re/regex?

If you are only trying to get the links, select your elements more specifically.

for a in soup.select('h4>a'):
    print(a.get('href'))

Or, if you prefer to go row by row:

for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))
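
For reference, here is a minimal offline sketch that stays close to the loop in the question. It assumes the markup looks exactly like the h4 snippet posted there, and the `html_doc` and `slugs` names are only for illustration. Each `row` returned by `find_all` is already a bs4 `Tag`, so it can be searched directly and does not need to be passed to `BeautifulSoup()` again, which is the call the traceback points at. Note also that `find` returns a single `Tag` (or `None`), so looping over its result would iterate the tag's children rather than a list of links.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; a real page would hold many such blocks.
html_doc = """
<h4 class="article_title_list" itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
"""

soup = BeautifulSoup(html_doc, "html.parser")

slugs = []
for row in soup.find_all("h4", class_="article_title_list"):
    # row is already a bs4 Tag, so search inside it directly
    # instead of parsing it a second time.
    a = row.find("a", href=True)
    if a is not None:
        slugs.append(a["href"])

print(slugs)
```

The `if a is not None` check is there in case a heading on the page contains no link.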

Example

import requests
from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get('https://www.lotticards.de/pokemon-sammelkarten').text, 'html.parser')
for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))

Output

10-deutsche-pokemon-karten-sparpack
Glaenzendes-Schicksal-Booster-Deutsch
Pokemon-Celebrations-Booster-Packung-Deutsch
Pikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch
Verborgenes-Schicksal-Top-Trainer-Box
Sun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch
Champions-Path-Elite-Trainer-Box-Englisch
Glaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch
...

Or as a list comprehension, based on itemprop="url":

[a.get('content') for a in soup.select('#product-list [itemprop="url"]')]

Output:

['https://www.lotticards.de10-deutsche-pokemon-karten-sparpack',
'https://www.lotticards.deGlaenzendes-Schicksal-Booster-Deutsch',
'https://www.lotticards.dePokemon-Celebrations-Booster-Packung-Deutsch',
'https://www.lotticards.dePikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch',
'https://www.lotticards.deVerborgenes-Schicksal-Top-Trainer-Box',
'https://www.lotticards.deSun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch',
'https://www.lotticards.deChampions-Path-Elite-Trainer-Box-Englisch',
'https://www.lotticards.deGlaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch',
'https://www.lotticards.deShining-Fates-Elite-Trainer-Box-Englisch',
'https://www.lotticards.deHidden-Fates-Elite-Trainer-Box-Reprint-Januar-2021',
'https://www.lotticards.deVMAX-Climax-s8b-Display-Japanisch',
'https://www.lotticards.deSonne-Mond-Ultra-Prisma-Booster-Deutsch',
'https://www.lotticards.desonne-mond-2-stunde-der-waechter-booster-deutsch-kaufen',
'https://www.lotticards.deSchwert-Schild-Kampfstile-Display-Deutsch',
'https://www.lotticards.dePokemon-Celebrations-Booster-Pack-Englisch',...]
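
The `content` values above come out with no slash between the host and the slug. If absolute links are needed, one option is to build them from the `href` slugs with `urllib.parse.urljoin` from the standard library. This is only a sketch: it assumes the slugs are relative to the listing page used in the example, and `base_url`, `slugs` and `absolute_urls` are illustrative names.

```python
from urllib.parse import urljoin

# Listing page URL taken from the example above.
base_url = "https://www.lotticards.de/pokemon-sammelkarten"

# A couple of slugs copied from the href output above.
slugs = [
    "10-deutsche-pokemon-karten-sparpack",
    "Glaenzendes-Schicksal-Booster-Deutsch",
]

# urljoin resolves each relative slug against the base URL,
# replacing the last path segment of the listing page.
absolute_urls = [urljoin(base_url, slug) for slug in slugs]
print(absolute_urls)
```

Whether the resolved URLs match the shop's real product pages would need to be checked against the live site.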
