How to scrape a specific HTML line that follows another HTML line

I want to scrape some data from an HTML page that looks like this:

<tr>
 <td> Some information </td>
 <td> 123 </td>
</tr>
<tr>
 <td> some other information </td>
 <td> 456 </td>
</tr>
<tr>
 <td> and the info continues </td>
 <td> 789 </td>
</tr>

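What I want is to get the HTML line that comes right after a given HTML line. That is, if I see "some other information", I want the output to be "456". I thought of combining regex with BeautifulSoup's .find_next, but I haven't had any luck with it (I'm also not familiar with regex). Does anyone know how to do this? Thanks a lot in advance.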

Actually, by combining regex with find_next you can get what you want:

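from bs4 import BeautifulSoup
import re

html = """
<tr>
 <td> Some information </td>
 <td> 123 </td>
</tr>
<tr>
 <td> some other information </td>
 <td> 456 </td>
</tr>
<tr>
 <td> and the info continues </td>
 <td> 789 </td>
</tr>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
# Find the <td> whose text matches the label (string= is the newer name for the text= argument).
x = soup.find('td', string=re.compile('some other information'))
# The next <td> in document order holds the value; strip the surrounding spaces.
print(x.find_next('td').text.strip())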

Output:

456

Edit: replaced x.find_next('td').contents[0] with x.find_next('td').text, which is shorter.
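
If several labels need to be looked up, one possible alternative (a sketch, not part of the original answer) is to walk the rows once and build a dictionary from each row's first cell to its second cell. The helper name rows_to_dict is hypothetical, and the sketch assumes every row of interest holds exactly a label cell followed by a value cell, as in the sample markup from the question.

```python
from bs4 import BeautifulSoup

html = """
<tr>
 <td> Some information </td>
 <td> 123 </td>
</tr>
<tr>
 <td> some other information </td>
 <td> 456 </td>
</tr>
<tr>
 <td> and the info continues </td>
 <td> 789 </td>
</tr>
"""


def rows_to_dict(markup):
    """Hypothetical helper: map each row's first cell text to its second cell text."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:  # skip rows that lack a label/value pair
            label = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)
            result[label] = value
    return result


table = rows_to_dict(html)
# dict.get returns None instead of raising when the label is absent
print(table.get("some other information"))
```

This avoids regex altogether and does not fail with an AttributeError when a label is missing, at the cost of requiring an exact match on the stripped label text.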
