How can I identify the correct td tag with bs4 in Python when the HTML is inconsistent?



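I'm using BeautifulSoup4 in Python to parse some HTML. I've managed to drill down to the right table and identify the td tags, but the problem is that the style attribute on those tags is applied inconsistently, which makes picking out the correct td tag a real challenge.

The data I'm trying to pull is a date field, but at any given time several of the td tags are hidden with CSS (which one is visible depends on an option value selected elsewhere in the HTML).

Actual examples: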

<td style="display: none;">01/03/2016</td>
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want

<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td> <-- this is the tag I want
<td style="display: none;">29/03/2011</td>
<td style="display:none">19/10/2010</td>

<td>27/10/2015</td> <-- this is the tag I want
<td style="display: none">01/03/2016</td>
<td style="display: none">22/03/2016</td>

<td style="display:none">11/04/2015</td>
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
<td style="display: none">18/10/2013</td>

How can I exclude/remove the incorrect items (the ones styled with display:none or display: none) so that I'm left with the one I actually want?

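Use a list comprehension to filter the tds, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}. A td with no style attribute at all gives None from td.get("style"), so it is kept as well:
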
In [7]: from bs4 import BeautifulSoup
In [8]: h1 = """<td style="display: none;">01/03/2016</td>
   ...: <td style="display: table-cell;">27/10/2015</td>"""
In [9]: h2 = """<td style="display:none">23/02/2016</td>
   ...: <td style="">09/05/2011</td> <-- this is the tag I want
   ...: <td style="display: none;">29/03/2011</td>
   ...: <td style="display:none">19/10/2010</td>"""
In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
   ....: <td style="display: none">01/03/2016</td>
   ....: <td style="display: none">22/03/2016</td>"""
In [11]: h4 = """<td style="display:none">11/04/2015</td>
   ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
   ....: <td style="display: none">18/10/2013</td>"""
In [12]: ignore = {"display:none", "display: none;", "display: none"}
In [13]: for html in [h1, h2, h3, h4]:
   ....:     soup = BeautifulSoup(html, "html.parser")
   ....:     # keep only the tds whose style value is not one of the known "hidden" variants
   ....:     print([td for td in soup.find_all("td") if td.get("style") not in ignore])
   ....:     
[<td style="display: table-cell;">27/10/2015</td>]
[<td style="">09/05/2011</td>]
[<td>27/10/2015</td>]
[<td style="display: table-cell;">02/02/2016</td>]
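
The set above only matches the exact spellings listed, so a variant such as "display:none;" or "DISPLAY: none" would slip through. If the markup turns out to be even less consistent than the samples, one option is to parse the style string instead of comparing it literally. The following is only a sketch under that assumption: is_hidden and visible_cells are hypothetical helper names (not part of bs4), and the check looks at the inline style attribute only. It does not account for stylesheet rules or for !important precedence between competing declarations.

```python
def is_hidden(style):
    """Return True if an inline style string ends up with display set to none.

    Hypothetical helper: inspects only the inline style attribute value.
    """
    if not style:
        # Missing (None) or empty style attribute: treat the cell as visible.
        return False
    display = None
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if sep and prop.strip().lower() == "display":
            # Drop a trailing "!important" and normalise case and whitespace.
            # A later display declaration overwrites an earlier one.
            display = value.split("!")[0].strip().lower()
    return display == "none"


def visible_cells(cells):
    """Keep the cells whose inline style does not hide them.

    `cells` can be any iterable of objects with a .get("style") method,
    for example the tags returned by soup.find_all("td").
    """
    return [cell for cell in cells if not is_hidden(cell.get("style"))]
```

With the soup objects from the session above, visible_cells(soup.find_all("td")) should keep the same tds as the set-based filter for h1 to h4, while also tolerating differences in spacing, case, and trailing semicolons.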
