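I need the body text of several articles on one page. Each article is wrapped in a section tag, and its paragraphs are spread across several p tags. For example: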
<section class="...">
<div>...</div>
<figure>...</figure>
<p id='...' class='...'></p>
<p id='...' class='...'></p>
<p id='...' class='...'></p>
</section>
<section class="...">
<div>...</div>
<figure>...</figure>
<p id='...' class='...'></p>
<p id='...' class='...'></p>
<p id='...' class='...'></p>
</section>
If I use the code below:
import requests
from bs4 import BeautifulSoup

r = requests.get('url')
# Parse the response body; without this step `soup` is undefined
soup = BeautifulSoup(r.text, 'html.parser')
all_bodies = soup.find_all('section')
for body in all_bodies:
    print(body)
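This returns the full contents of each section. If I pass 'p' to find_all instead, every p tag comes back as its own list element, but I want all the p tags of one section together in a single list element.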
EDIT
from bs4 import BeautifulSoup
html = '''
<section class="...">
<div>...</div>
<figure>...</figure>
<p id='...' class='...'>1</p>
<p id='...' class='...'>2</p>
<p id='...' class='...'>3</p>
</section>
<section class="...">
<div>...</div>
<figure>...</figure>
<p id='...' class='...'>1</p>
<p id='...' class='...'>2</p>
<p id='...' class='...'>3</p>
</section>
'''
soup = BeautifulSoup(html, 'lxml')
[e.select('p') for e in soup.select('section')]
[[<p class="..." id="...">1</p>,
<p class="..." id="...">2</p>,
<p class="..." id="...">3</p>],
[<p class="..." id="...">1</p>,
<p class="..." id="...">2</p>,
<p class="..." id="...">3</p>]]
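If a single string per section is wanted rather than a nested list of tags, the p tags of each section can be joined. The following is a minimal sketch, not a definitive solution: the class and id values are placeholders invented for illustration, the built-in 'html.parser' is used instead of 'lxml' only to avoid an extra dependency, and the names bodies_html and bodies_text are hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup; the class/id values are made up for this sketch.
html = """
<section class="a">
<div>...</div>
<figure>...</figure>
<p id="p1">1</p>
<p id="p2">2</p>
<p id="p3">3</p>
</section>
<section class="b">
<div>...</div>
<p id="p4">4</p>
<p id="p5">5</p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
sections = soup.find_all("section")

# One list element per section: that section's <p> tags joined as HTML.
bodies_html = [
    "".join(str(p) for p in section.find_all("p"))
    for section in sections
]

# One list element per section: the paragraph text only, one line per <p>.
bodies_text = [
    "\n".join(p.get_text(strip=True) for p in section.find_all("p"))
    for section in sections
]

for text in bodies_text:
    print(repr(text))
```

Note that find_all("p") searches recursively, so p tags nested inside the div or figure of a section would also be collected; passing recursive=False limits the search to direct children of the section.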