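So I'm trying to scrape a news website and get the actual text of its articles. My problem right now is that the article body is split across several p tags, which in turn sit inside a div tag.
It looks like this: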
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I have tried so far:
import requests
from bs4 import BeautifulSoup

article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')
article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''
for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
    article_text += element.find('p').text
But it fails with: 'NoneType' object has no attribute 'text'
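Because the expected output is not that clear from the question, a general approach would be to select all p elements in your div, e.g. with CSS selectors, extract their text, and join() it with whatever separator you like: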
article_text = '\n'.join(e.text for e in soup.select('div p'))
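For context, a likely reason for the error in the question: iterating over a tag yields its direct children, and find('p') searches inside each child rather than testing the child itself, so it returns None for a p that has no nested p. The sketch below reproduces that on a trimmed copy of the question's markup and then collects the paragraphs with a selector scoped to the container. The class name `wysiwyg` on the div and the helper name `extract_paragraphs` are assumptions made for illustration; the real page may differ.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the class on the div is assumed.
html = (
    '<div class="wysiwyg">'
    '<p><strong>S</strong>paragraph</p>'
    '<p>paragraph</p>'
    '<h2>header</h2>'
    '<p>text</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_='wysiwyg')

# Iterating over a tag yields its direct children (here the p and h2 tags).
first_child = next(iter(container))
# find() looks at descendants only, so a p without a nested p gives None.
nested = first_child.find('p')


def extract_paragraphs(tag):
    """Hypothetical helper: join the text of every p inside the given tag."""
    return '\n'.join(p.get_text() for p in tag.select('p'))


article_text = extract_paragraphs(container)
print(nested)
print(article_text)
```

Scoping the selector to the container keeps p tags from other parts of the page (navigation, footer) out of the result, which a bare 'div p' on the whole document might pick up.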
If you only want to scrape the text from the siblings of the h2 in your example, use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
Or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
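Example:
from bs4 import BeautifulSoup

html = '''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
# 'h2 ~ p' matches every p that follows an h2 under the same parent
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)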
Output:
text
text
text
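As a rough cross-check, the sibling selector and the find() / find_next_siblings() variant can be run side by side on the same markup. This is only a sketch: `text_after_heading` is a hypothetical helper name, and the guard for a missing heading is an addition of mine, included because find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

html = '''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''


def text_after_heading(soup, heading='h2'):
    """Hypothetical helper: text of the p siblings after the first heading."""
    first = soup.find(heading)
    if first is None:
        # find() returns None when nothing matches; avoid calling methods on it.
        return ''
    return '\n'.join(p.get_text() for p in first.find_next_siblings('p'))


soup = BeautifulSoup(html, 'html.parser')
via_siblings = text_after_heading(soup)
via_selector = '\n'.join(p.get_text() for p in soup.select('h2 ~ p'))
# A document without any h2 falls back to an empty string instead of raising.
missing = text_after_heading(BeautifulSoup('<div><p>x</p></div>', 'html.parser'))
print(via_siblings == via_selector)
```

On this markup both approaches give the same three paragraphs. They can diverge on other pages, since find('h2') only starts from the first h2 in the document while 'h2 ~ p' considers every h2.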