Get all of the soup above a certain div

I have a soup in this format:

<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>

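I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three paragraphs (there can be anywhere from 1 to 5).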

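How can I split this soup to get those paragraphs? Regex looked fine at first, but it doesn't work for me because I still need a soup object afterwards for further extraction.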

Thanks a ton

You can select your element, iterate over its siblings, and break as soon as one is not a p:

for t in soup.div.table.find_next_siblings():
    # stop at the first sibling that is not a <p> (here, the bar div)
    if t.name != 'p':
        break
    print(t)

Or approach the original question from the other direction: select <div class = 'bar'> and call find_previous_siblings('p'):

for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)
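One detail worth noting about the second approach: find_previous_siblings() walks backwards from the bar div, so the closest paragraph comes first. Below is a minimal sketch of restoring document order while keeping Tag objects for further extraction. The paragraph texts ("one", "two", "three") and the closing tags are placeholders added for illustration and are not part of the original markup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the paragraph texts are made up.
html = """
<div class="foo">
<table></table>
<p>one</p>
<p>two</p>
<p>three</p>
<div class="bar">
<p>inside bar</p>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
bar = soup.select_one(".bar")

# find_previous_siblings() returns the nearest sibling first,
# so reverse the result to get the paragraphs in document order.
paragraphs = list(reversed(bar.find_previous_siblings("p")))

# Each item is still a Tag, so further extraction works as usual.
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

Because the filter is 'p', this would also pick up any p siblings that appear before the table, if the real page has them.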

from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
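soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
for t in soup.div.table.find_next_siblings():
    # stop at the first sibling that is not a <p> (here, the bar div)
    if t.name != 'p':
        break
    print(t)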

Output

<p> </p>
<p> </p>
<p> </p>

If the HTML is as shown, you can simply use :not to filter out the sibling p tags that come after .bar:

from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
print(soup.select('.foo > table ~ p:not(.bar ~ p)'))
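To see what the :not(.bar ~ p) part contributes, here is a small sketch using a variant of the markup with an extra paragraph after the bar div. That extra paragraph and the texts "a" to "d" are assumptions for illustration only; the original HTML has no such sibling.

```python
from bs4 import BeautifulSoup

# Variant of the question's markup: <p>d</p> is a hypothetical sibling after .bar.
html = """
<div class="foo">
<table></table>
<p>a</p>
<p>b</p>
<div class="bar"><p>c</p></div>
<p>d</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "table ~ p" matches every <p> sibling that follows the table;
# ":not(.bar ~ p)" then drops the ones that also follow .bar.
selected = soup.select(".foo > table ~ p:not(.bar ~ p)")

texts = [p.get_text(strip=True) for p in selected]
print(texts)
```

The paragraph nested inside .bar is never a candidate, since it is a child of .bar rather than a sibling of the table. select() returns a list of Tag objects, so each result can be queried further.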

