Web scraping with Beautiful Soup: 'NoneType' object has no attribute 'get_text'



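I'm trying to learn BeautifulSoup by scraping the text of NYT politics articles. The code I have now does manage to scrape two paragraphs, but after that it throws AttributeError: 'NoneType' object has no attribute 'get_text'. I've looked up this error, and some threads claim it comes from using legacy functions from BeautifulSoup 3. That doesn't seem to be the problem here, though. Any ideas?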

Code:

import requests
from urllib import request, response, error, parse
from urllib.request import urlopen
from bs4 import BeautifulSoup


url = "https://www.nytimes.com/2020/02/10/us/politics/trump-manchin-impeachment.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

title = soup.title
titleText = title.get_text()
body = soup.find('article', class_='css-1vxca1d')
section = soup.find('section', class_="css-1r7ky0e")
for elem in section:
    div1 = elem.findAll('div')
    for x in div1:
        div2 = elem.findAll('div')
        for i in div2:
            text = i.find('p').get_text()
            print (text)
            print("----------")

Output:

WASHINGTON — Senator Joe Manchin III votes with President Trump more than any other Democrat in the Senate. But his vote last week to convict Mr. Trump of impeachable offenses has eclipsed all of that, earning him the rage of a president who coveted a bipartisan acquittal.
----------
“Munchkin means that you’re small, right?” he said. “I’m bigger than him — of course he has me by weight, now, he has more volume than I have by about 30 or 40 pounds. I’m far from being weak and pathetic, and I’m far from being a munchkin, and I still want him to succeed as president of the United States.”
----------
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/project2/webscrapper.py", line 25, in <module>
    text = i.find('p').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Process finished with exit code 1

As I mentioned in the comments, when you write text = i.find('p').get_text(), you are actually performing two operations.

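First it looks up the first <p> tag inside i, then it gets that tag's text. At some point i.find('p') returns None, so the second step becomes None.get_text(), which raises the error.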

You can see this from the error message itself: 'NoneType' object has no attribute 'get_text'.

From the documentation:

If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None.
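To make that difference concrete, here is a minimal sketch. It uses a small made-up HTML snippet (an assumption for illustration, not the NYT page) and the same html.parser as the question:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration: the second <div> contains no <p>.
html = "<div><p>first</p></div><div><span>no paragraph here</span></div>"
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")
found = divs[0].find("p")      # a Tag object
missing = divs[1].find("p")    # None, so missing.get_text() would raise AttributeError
empty = divs[1].find_all("p")  # empty result, safe to loop over

print(found.get_text())
print(missing)
print(len(empty))
```

The second <div> is the situation the traceback points at: find() has nothing to return, so the chained get_text() call fails.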

A quick fix is to check that i.find('p') did not return None before calling get_text():

# ...
for elem in section:
    div1 = elem.findAll('div')
    for x in div1:
        div2 = elem.findAll('div')
        for i in div2:
            p = i.find('p')
            if p is not None:
                text = p.get_text()
                print (text)
                print("----------")

Also note that find() only returns the first <p> it finds; if there are more, the others are ignored.
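If you want every paragraph rather than only the first one per <div>, one option is to call find_all('p') on the section directly. The sketch below is an assumption-laden illustration: the HTML is invented, and the real page would still need the section lookup from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only, not the real article markup.
html = """
<section>
  <div><p>one</p><p>two</p></div>
  <div><span>no paragraph</span></div>
  <div><p>three</p></div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("section")

# find_all("p") searches all descendants and returns an empty result when
# nothing matches, so no None check is needed inside the loop.
paragraphs = [p.get_text() for p in section.find_all("p")]
for text in paragraphs:
    print(text)
    print("----------")
```

This also avoids the nested loops over <div> elements, which can visit the same paragraph more than once.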
