'NoneType' error when scraping marketwatch.com with BS4



I get a NoneType error when I try to scrape this HTML:

<div class="article__content">


<h3 class="article__headline">
<a class="link" href="https://www.marketwatch.com/story/infrastructure-bill-looks-set-to-pass-senate-without-changes-sought-by-crypto-advocates-2021-08-10?mod=cryptocurrencies">

Infrastructure bill looks set to pass Senate without changes sought by crypto advocates
</a>
</h3>
<p class="article__summary">A $1 trillion bipartisan infrastructure bill on Tuesday appeared on track to pass the Senate without changes sought by the cryptocurrency industry&#x27;s supporters, as a deal among key senators on an amendment didn&#x27;t get suppo...</p>
<div class="content--secondary">
<div class="group group--tickers">
<bg-quote class="negative" channel="/zigman2/quotes/31322028/realtime">
<a class="ticker qt-chip j-qt-chip" data-charting-symbol="CRYPTOCURRENCY/US/COINDESK/BTCUSD" data-track-hover="QuotePeek" href="https://www.marketwatch.com/investing/cryptocurrency/btcusd?mod=cryptocurrencies">
<span class="ticker__symbol">BTCUSD</span>
<bg-quote class="ticker__change" field="percentChange" channel="/zigman2/quotes/31322028/realtime">-1.07%</bg-quote>
<i class="icon"></i>
</a>
</bg-quote>
</div>

</div>
<div class="article__details">
<span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10, 2021 at 10:42 a.m. ET</span>
<span class="article__author">by Victor Reklaitis</span>

</div>
</div>

My code looks like this:

for article in soup.find_all('div', class_='article__content'):
    date = article.find('span', class_='article__timestamp')['data-est']
    print(date)

Can anyone tell me what the problem is, and why this span can't be found?

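You need to skip the <div> tags that have no timestamp. Not every div with class article__content on the page contains an article__timestamp span. For those, find() returns None, and indexing None with ['data-est'] raises the error: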

import requests
from bs4 import BeautifulSoup

url = "https://www.marketwatch.com/investing/cryptocurrency?mod=side_nav"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for article in soup.find_all("div", class_="article__content"):
    date = article.find("span", class_="article__timestamp")
    # find() returns None when the article has no timestamp span
    if not date:
        continue
    print(date["data-est"])

Prints:

2021-08-10T10:42:34
2021-08-10T05:30:00
2021-08-09T19:15:00
2021-08-09T12:33:00
2021-08-09T11:22:00
2021-08-08T20:09:00
2021-08-07T15:14:00
2021-08-07T15:04:00
2021-08-06T09:15:27
2021-08-05T14:25:00
2021-08-05T11:17:00
2021-08-04T16:11:00
2021-08-02T17:07:00
2021-08-02T06:54:00
2021-08-01T21:01:00
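
To see the failure without hitting the network, here is a minimal sketch that runs on a made-up two-article snippet. The HTML string and the `timestamps` helper are illustrative assumptions, not the real page markup. The second block has no timestamp span, so `find` returns `None` for it, and indexing `None` is what produces the `'NoneType' object is not subscriptable` error in the original loop.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no timestamp span.
HTML = """
<div class="article__content">
  <span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10, 2021</span>
</div>
<div class="article__content">
  <h3 class="article__headline">Item without a timestamp</h3>
</div>
"""


def timestamps(html):
    """Return the data-est values, skipping articles with no timestamp span."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for article in soup.find_all("div", class_="article__content"):
        span = article.find("span", class_="article__timestamp")
        if span is None:  # find() returns None when nothing matches
            continue
        found.append(span["data-est"])
    return found


print(timestamps(HTML))
```

Removing the `if span is None` check makes the loop fail on the second div, which is the same situation as on the live page.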

Or with a CSS selector:

for span in soup.select(".article__content .article__timestamp[data-est]"):
    print(span["data-est"])
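
The selector version needs no explicit None check, because `select` only returns elements that actually match, and the `[data-est]` part also drops spans that lack the attribute. A small sketch on invented markup (the HTML string below is an assumption for illustration, not taken from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one complete article, one with no timestamp span,
# and one whose span is missing the data-est attribute.
HTML = """
<div class="article__content">
  <span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10</span>
</div>
<div class="article__content"><p>no timestamp here</p></div>
<div class="article__content">
  <span class="article__timestamp">Aug. 9</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
selector = ".article__content .article__timestamp[data-est]"
# Only spans inside an article block that carry data-est are selected.
dates = [span["data-est"] for span in soup.select(selector)]
print(dates)
```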
