How to extract the text of one element when its siblings share the same tag name under one parent (BeautifulSoup)



For example:

<ul class="key-dates">

<li>
Birthday: Monday 26 April 2021
</li>
<li>
Christmas: Saturday 25 December 2021
</li>
<li>
New Years: Saturday 1 January 2021
</li>

</ul>

Say I only want to pull out the Birthday date. How would I do that?

import requests
import bs4
info = requests.get('url')

You can use a CSS selector with :-soup-contains (the older spelling, :contains, is deprecated):

from bs4 import BeautifulSoup
html_doc = """
<ul class="key-dates">

<li>
Birthday: Monday 26 April 2021
</li>
<li>
Christmas: Saturday 25 December 2021
</li>
<li>
New Years: Saturday 1 January 2021
</li>

</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")
birthday = soup.select_one('.key-dates li:-soup-contains("Birthday")')
print(birthday.text.strip())

Prints:

Birthday: Monday 26 April 2021

Or without CSS:

# "t and ..." guards against tags whose .string is None
birthday = soup.find("li", string=lambda t: t and "Birthday" in t)
print(birthday.text.strip())
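
If more than one of the dates is needed, the same idea extends to the whole list. The following is a minimal sketch, assuming every item has the "Label: date" shape shown in the question. parse_key_dates is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
<li>
Birthday: Monday 26 April 2021
</li>
<li>
Christmas: Saturday 25 December 2021
</li>
<li>
New Years: Saturday 1 January 2021
</li>
</ul>
"""


def parse_key_dates(html):
    """Return {label: date text} for every <li> directly under ul.key-dates."""
    soup = BeautifulSoup(html, "html.parser")
    dates = {}
    for li in soup.select("ul.key-dates > li"):
        # Split on the first colon only: "Birthday: Monday ..." -> label, value
        label, sep, value = li.get_text(strip=True).partition(":")
        if sep:  # skip items that do not contain a colon
            dates[label.strip()] = value.strip()
    return dates


key_dates = parse_key_dates(html_doc)
print(key_dates["Birthday"])
```

Looking items up by label avoids depending on the order of the li tags, which is the only thing that distinguishes them in the markup.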

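Unfortunately, because the li tags carry no unique identifier, no approach is 100% guaranteed. The most reliable option is to find the nearest parent tag that does have a unique identifier and parse the birthday from within it.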

In this case, that looks like:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    # Fetch the page and parse it; 'url' below is a placeholder for the real address.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(response.text, 'html.parser')

soup = get_source('url')
ul_list = soup.find('ul', class_='key-dates')  # the parent ul tag with class="key-dates"
# Match on the label instead of the full text, so surrounding whitespace and the date itself don't matter.
list_item = ul_list.find('li', string=lambda s: s and s.strip().startswith('Birthday:'))
print(list_item)                        # the matching <li> tag
text = list_item.get_text(strip=True)
print(text)                             # Birthday: Monday 26 April 2021
print(text.split('Birthday: ', 1)[1])   # Monday 26 April 2021
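
Once the date text has been isolated, it can be turned into a real date object if that is more convenient. This is a small sketch, assuming the "Weekday day Month year" format from the question and an English locale for the day and month names. parse_long_date is a hypothetical helper name.

```python
from datetime import datetime


def parse_long_date(text):
    """Parse text such as 'Monday 26 April 2021' into a datetime.date."""
    # %A = full weekday name, %d = day of month, %B = full month name, %Y = year
    return datetime.strptime(text.strip(), "%A %d %B %Y").date()


birthday = parse_long_date("Monday 26 April 2021")
print(birthday.isoformat())  # 2021-04-26
```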
