Beautiful Soup: select a table row if one of its cells contains a given word



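Thank you for reading my post.

After a lot of searching, I still cannot find a way to scrape only the rows of a table in which a cell contains a specific value.

More specifically: I want to keep the rows that contain the word "oui" in the last column of the table below: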

<table align="center" cellspacing="0" cellpadding="3" width="100%">
<tbody><tr>
<td class="tdhg" align="left"><b>Liste des candidats</b></td>
<td class="tdhv"><strong>Voix</strong></td>
<td class="tdhv"><strong>%&nbsp;Inscrits</strong></td>
<td class="tdhv"><strong>%&nbsp;Exprimés</strong></td>
<td class="tdhv"><strong>Elu(e)</strong></td>
</tr>
<tr>
<td class="tdcbf" align="left">M.&nbsp;Jean-François LAMOUR&nbsp;(UMP) </td>
<td class="tdcd" align="right">23&nbsp;964</td>
<td class="tdcd" align="right">  33,01</td>
<td class="tdcd" align="right">  54,60</td>
<td class="tdcd" align="center">oui
&nbsp;</td>
</tr>
<tr>
<td class="tdcbf" align="left">M.&nbsp;Gilles ALAYRAC&nbsp;(RDG) </td>
<td class="tdcd" align="right">19&nbsp;927</td>
<td class="tdcd" align="right">  27,45</td>
<td class="tdcd" align="right">  45,40</td>
<td class="tdcd" align="center">
&nbsp;</td>
</tr>
</tbody></table>

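I first tried regular expressions. I managed to match the word, but keeping the row it belongs to turned out to be complicated, so I changed approach and switched to BeautifulSoup.

The best I have managed so far is: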

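import requests
from bs4 import BeautifulSoup as soup

# Placeholder address; requests needs a full URL including the scheme (http:// or https://)
url = 'www.someurl.com'
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
html_soup = soup(response.content, 'lxml')
html_soup.select('td.tdcd')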

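I cannot get any further, in particular keeping the "tr" whenever one of its "tdcd" cells contains "oui". I have read the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/, but unless I am mistaken, it is hard to select a parent row based on the value of one of its child cells.

Thank you,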

Find the td.tdcd cells that contain "oui" and select their parent:

html_soup = soup(response.content, 'lxml')
tds = html_soup.select('td.tdcd')
for td in tds:
    if 'oui' in td.text:
        print(td.parent)
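
The same idea can be sketched as a self-contained script. This is only an illustration under a few assumptions: the table is a shortened copy of the one in the question, the built-in "html.parser" is used instead of lxml so nothing extra needs installing, and the helper name clean is made up for this example. It compares the cleaned cell text to "oui" exactly rather than using a substring test, and uses find_parent("tr") to walk up to the row.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumed markup).
html = """
<table>
<tr><td class="tdhg">Liste des candidats</td><td class="tdhv">Voix</td><td class="tdhv">Elu(e)</td></tr>
<tr><td class="tdcbf">M.&nbsp;Jean-François LAMOUR&nbsp;(UMP) </td>
    <td class="tdcd">23&nbsp;964</td>
    <td class="tdcd">oui
&nbsp;</td></tr>
<tr><td class="tdcbf">M.&nbsp;Gilles ALAYRAC&nbsp;(RDG) </td>
    <td class="tdcd">19&nbsp;927</td>
    <td class="tdcd">
&nbsp;</td></tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace (including non-breaking spaces) to single spaces."""
    return " ".join(text.split())


soup = BeautifulSoup(html, "html.parser")

elected_rows = []
for td in soup.select("td.tdcd"):
    # Exact comparison, so a cell that merely contains "oui" inside a longer word is ignored.
    if clean(td.get_text()) == "oui":
        tr = td.find_parent("tr")  # walk up from the cell to its row
        elected_rows.append([clean(cell.get_text()) for cell in tr.find_all("td")])

print(elected_rows)
```

If a row could hold "oui" in more than one cell, it would be appended once per matching cell, so some de-duplication would be needed in that case.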

This gets you what you want: read the table into a DataFrame, then filter the DataFrame.

html = '''<table align="center" cellspacing="0" cellpadding="3" width="100%">
<tbody><tr>
<td class="tdhg" align="left"><b>Liste des candidats</b></td>
<td class="tdhv"><strong>Voix</strong></td>
<td class="tdhv"><strong>%&nbsp;Inscrits</strong></td>
<td class="tdhv"><strong>%&nbsp;Exprimés</strong></td>
<td class="tdhv"><strong>Elu(e)</strong></td>
</tr>
<tr>
<td class="tdcbf" align="left">M.&nbsp;Jean-François LAMOUR&nbsp;(UMP) </td>
<td class="tdcd" align="right">23&nbsp;964</td>
<td class="tdcd" align="right">  33,01</td>
<td class="tdcd" align="right">  54,60</td>
<td class="tdcd" align="center">oui
&nbsp;</td>
</tr>
<tr>
<td class="tdcbf" align="left">M.&nbsp;Gilles ALAYRAC&nbsp;(RDG) </td>
<td class="tdcd" align="right">19&nbsp;927</td>
<td class="tdcd" align="right">  27,45</td>
<td class="tdcd" align="right">  45,40</td>
<td class="tdcd" align="center">
&nbsp;</td>
</tr>
</tbody></table>'''
from io import StringIO
import pandas as pd

# Recent pandas versions expect a file-like object rather than a literal HTML string
table = pd.read_html(StringIO(html))[0]
# Keep any rows that have 'oui' in the row; doesn't matter which column
filter_table = table[(table == 'oui').any(axis=1)]
# Or if you specifically need to look in the last column
#filter_table = table[table.iloc[:, -1] == 'oui']
# Or by specific column name
#filter_table = table[table[4] == 'oui']

Output:

print(filter_table)
                               0       1     2     3    4
1  M. Jean-François LAMOUR (UMP)  23 964  3301  5460  oui
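
In the output above the columns are just numbered 0 to 4, because the header cells are td rather than th elements and so end up as an ordinary data row. A possible variation, sketched here on a shortened copy of the table, is to pass header=0 so the first row becomes the column labels. It assumes pandas can find an HTML parser backend (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Shortened copy of the table from the question (assumed markup).
html = """
<table>
<tr><td><b>Liste des candidats</b></td><td><strong>Voix</strong></td><td><strong>Elu(e)</strong></td></tr>
<tr><td>M.&nbsp;Jean-François LAMOUR&nbsp;(UMP) </td><td>23&nbsp;964</td><td>oui
&nbsp;</td></tr>
<tr><td>M.&nbsp;Gilles ALAYRAC&nbsp;(RDG) </td><td>19&nbsp;927</td><td>
&nbsp;</td></tr>
</table>
"""

# header=0 promotes the first table row to column labels.
table = pd.read_html(StringIO(html), header=0)[0]

# Keep only the rows whose last column is exactly 'oui'.
elected = table[table.iloc[:, -1] == "oui"]
print(elected)
```

Filtering by position with iloc avoids depending on the exact spelling of the header text.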

Alternative:

Here you iterate over the rows and append a row only if it contains 'oui'.

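from bs4 import BeautifulSoup
import pandas as pd

html_soup = BeautifulSoup(html, 'lxml')
data_rows = html_soup.select('tr')
rows = []
for row in data_rows:
    # Only the td.tdcd cells are collected, so the candidate name (td.tdcbf) is not included
    data = [x.text.strip() for x in row.find_all('td', {'class': 'tdcd'})]
    if 'oui' in data:
        rows.append(data)

table = pd.DataFrame(rows)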
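
If the candidate name should stay in the result, one option is to collect every td in the row and test only the last cell. The following is a rough sketch on a shortened copy of the table. It assumes the first tr is the header row and uses the built-in "html.parser", and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the table from the question (assumed markup).
html = """
<table>
<tr><td class="tdhg"><b>Liste des candidats</b></td><td class="tdhv"><strong>Voix</strong></td><td class="tdhv"><strong>Elu(e)</strong></td></tr>
<tr><td class="tdcbf">M.&nbsp;Jean-François LAMOUR&nbsp;(UMP) </td><td class="tdcd">23&nbsp;964</td><td class="tdcd">oui
&nbsp;</td></tr>
<tr><td class="tdcbf">M.&nbsp;Gilles ALAYRAC&nbsp;(RDG) </td><td class="tdcd">19&nbsp;927</td><td class="tdcd">
&nbsp;</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of cleaned cell texts per row; split/join also removes non-breaking spaces.
all_rows = [
    [" ".join(td.get_text().split()) for td in tr.find_all("td")]
    for tr in soup.select("tr")
]

# Assumption: the first row holds the column titles.
header, body = all_rows[0], all_rows[1:]

# Keep a row only when its last cell is exactly 'oui'.
kept = [cells for cells in body if cells and cells[-1] == "oui"]

table = pd.DataFrame(kept, columns=header)
print(table)
```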
