How do I extract all nested <option> tags and their contents with BeautifulSoup?



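I am trying to pull out all nested <option> tags and their values using BeautifulSoup in Python. The first code block below produces the expected result, of type unicode (over 60 pages of output). Part of the HTML tree is included further down. Note that the <option> tags I want are nested.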

Problem: the second code block below produces no output and raises no error.

from bs4 import BeautifulSoup
import requests


def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # prettify is a method, so it has to be called to get the formatted HTML
    print(soup.prettify())


main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

from bs4 import BeautifulSoup
import requests


def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Look for the <select> element with id="pufnumber"
    select_id = soup.find_all("select", id="pufnumber")
    print(select_id)
    # Collect the <option> tags nested inside each matching <select>
    nested_option = [x.find_all("option") for x in select_id]
    print(nested_option)


main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

Partial output of the first code block:

</table>
<!-- 3/23/06 <img src="../images/bullets/spacer.gif" width="1" height="3" alt="">
<table role="presentation" width="430" height="15" border="0" cellpadding="6" cellspacing="0">
<tr>
<td height="0" bgcolor="#F9F9F9" class="contentStyle"><strong><font color="#006600">Option
2: </font><font color="#003399"><label for="pufnumber">Select by data file number/title </label></font></strong></td>
</tr>
</table>      
<table role="presentation" width="430" height="25" border="0" cellpadding="5" cellspacing="0" class="BlueBox">
<tr>
<td width="430" height="0"> <span class="contentStyle">

<select id="pufnumber" size=1 name="cboPufNumber">
<option value="All">All data files</option>


<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option> 


<option value="HC-224">MEPS HC-224: 2020 Full Year Consolidated Data File</option> 


<option value="HC-223">MEPS HC-223: 2020 Person Round Plan File</option> 

My goal is to pull out the nested option tags, like this one:

<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option> 

I am not interested in the following <option> tags:

<option value="All">All available years</option>
<option value="2020">2020</option>
<option value="2019">2019</option>
<option value="2018">2018</option>
<option value="2017">2017</option>
<option value="2016">2016</option>
...

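I noticed that the part of the HTML you want to process sits inside a comment block. BeautifulSoup stores a comment as a single Comment string and does not parse the markup inside it into tags, which is why find_all finds no matching <select>.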

<!-- 3/23/06 <img src=" -->
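To make the effect concrete, here is a minimal sketch that needs no network access. The `html` string is a hypothetical miniature of the page, modelled on the snippet in the question rather than copied from the live site.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: the <select> sits inside an HTML comment.
html = """
<div>
<!-- 3/23/06 <img src="spacer.gif" alt="">
<select id="pufnumber" size=1 name="cboPufNumber">
<option value="All">All data files</option>
<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option>
</select>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The commented-out markup is not part of the tag tree, so this finds nothing.
selects = soup.find_all("select", id="pufnumber")

# The same markup is still reachable as the text of a Comment node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

print(selects)
print(len(comments))
```

The search for the `<select>` comes back empty, while the comment search returns one node whose text still contains the `<select>` markup.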

Try the code below to see all the comments:

import requests
from bs4 import BeautifulSoup, Comment


def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect every HTML comment node in the document
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for c in comments:
        print(c)
        print("===========")
        c.extract()  # remove the comment from the parse tree


main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')

So your question now becomes how to process the comment to extract the data you want.

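Below is a working example in which I use regular expressions on the raw comment text. Note that it is designed only for this specific page structure and may not be useful for other sites.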

import re

import requests
from bs4 import BeautifulSoup, Comment


# Return the text between the "All data files" option and the closing </select>
def extractOptions(inputData):
    sub1 = re.escape('<option value="All">All data files</option>')
    sub2 = re.escape('</select>')
    result = re.findall(sub1 + "(.*)" + sub2, inputData, flags=re.S)
    if len(result) > 0:
        return result[0]
    return ''  # no match: return an empty string so the caller can still split it


# Return the text between the opening <option ...> tag and </option> on one line
def extracData(inputData):
    sub1 = re.escape('>')
    sub2 = re.escape('</option>')
    result = re.findall(sub1 + "(.*)" + sub2, inputData, flags=re.S)
    if len(result) > 0:
        return result[0]
    return ''


def main(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for c in comments:
        if '<select id="pufnumber" size=1 name="cboPufNumber">' in c:
            options = extractOptions(c)
            ops = options.splitlines()  # split the text into lines
            for op in ops:
                data = extracData(op)
                if data != '':  # skip lines that contain no option
                    print(data)


main('https://meps.ahrq.gov/data_stats/download_data_files.jsp')
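
As a possible alternative to the regular expressions, the comment text can itself be handed to BeautifulSoup and searched like any other HTML. The sketch below is only an illustration: the helper name `options_from_comments` and the `sample` string are hypothetical, and it assumes the commented-out markup is well-formed enough for `html.parser` to recover the `<select>`.

```python
from bs4 import BeautifulSoup, Comment


def options_from_comments(html, select_id="pufnumber"):
    """Return (value, text) pairs for the <option> tags of a commented-out <select>."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    pairs = []
    for comment in comments:
        # Parse the comment text as HTML in its own right.
        inner = BeautifulSoup(str(comment), "html.parser")
        select = inner.find("select", id=select_id)
        if select is None:
            continue
        for option in select.find_all("option"):
            if option.get("value") != "All":  # skip the catch-all entry
                pairs.append((option.get("value"), option.get_text(strip=True)))
    return pairs


# Hypothetical sample modelled on the snippet in the question.
sample = """
<!-- 3/23/06 <img src="spacer.gif" alt="">
<select id="pufnumber" size=1 name="cboPufNumber">
<option value="All">All data files</option>
<option value="HC-225">MEPS HC-225: MEPS Panel 24 Longitudinal Data File</option>
<option value="HC-224">MEPS HC-224: 2020 Full Year Consolidated Data File</option>
</select>
-->
"""

print(options_from_comments(sample))
```

For the live page, `response.text` would be passed in place of `sample`. Because the lookup is restricted to the `<select>` with id `pufnumber`, the year options from the other drop-down are left out.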
