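Thanks in advance to anyone reading this post.
After a lot of searching, I can't find a way to scrape only one row of a table when one of its cells contains a specific value.
More specifically: I want to keep the rows that contain the word "oui" in the last column of the following table: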
<table align="center" cellspacing="0" cellpadding="3" width="100%">
<tbody><tr>
<td class="tdhg" align="left"><b>Liste des candidats</b></td>
<td class="tdhv"><strong>Voix</strong></td>
<td class="tdhv"><strong>% Inscrits</strong></td>
<td class="tdhv"><strong>% Exprimés</strong></td>
<td class="tdhv"><strong>Elu(e)</strong></td>
</tr>
<tr>
<td class="tdcbf" align="left">M. Jean-François LAMOUR (UMP) </td>
<td class="tdcd" align="right">23 964</td>
<td class="tdcd" align="right"> 33,01</td>
<td class="tdcd" align="right"> 54,60</td>
<td class="tdcd" align="center">oui
</td>
</tr>
<tr>
<td class="tdcbf" align="left">M. Gilles ALAYRAC (RDG) </td>
<td class="tdcd" align="right">19 927</td>
<td class="tdcd" align="right"> 27,45</td>
<td class="tdcd" align="right"> 45,40</td>
<td class="tdcd" align="center">
</td>
</tr>
</tbody></table>
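I first tried regular expressions and managed to match the word, but keeping the associated row seemed complicated, so I changed approach and switched to BeautifulSoup.
The best I have managed so far is: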
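import requests
from bs4 import BeautifulSoup as soup

url = 'www.someurl.com'
headers = {"User-Agent": "Mozilla/5.0"}
# Pass the headers explicitly; they were defined but never sent
response = requests.get(url, headers=headers)
html_soup = soup(response.content, 'lxml')
html_soup.select('td.tdcd')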
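I can't get any further, in particular keeping the "tr" whose "tdcd" cell contains "oui". Even after reading the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), unless I am mistaken, it is hard to select a row based on the value of one of its child cells.
Thank you,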
Find the td.tdcd that contains oui, then select its parent:
html_soup = soup(response.content, 'lxml')
tds = html_soup.select('td.tdcd')
for td in tds:
    if 'oui' in td.text:
        print(td.parent)
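For anyone who wants to try the idea without hitting the site, here is a minimal, self-contained sketch. It assumes a trimmed copy of the table from the question (the `html` string below is my own shortened version, not the real page), uses the built-in `html.parser` instead of `lxml`, and compares the stripped cell text exactly rather than with `in`, so a cell that merely contains the letters "oui" would not match. The names `matching_rows` and `row_values` are just illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same classes as the real page)
html = """
<table>
<tr><td class="tdhg">Liste des candidats</td><td class="tdhv">Voix</td><td class="tdhv">Elu(e)</td></tr>
<tr><td class="tdcbf">M. Jean-François LAMOUR (UMP) </td><td class="tdcd">23 964</td><td class="tdcd">oui
</td></tr>
<tr><td class="tdcbf">M. Gilles ALAYRAC (RDG) </td><td class="tdcd">19 927</td><td class="tdcd">
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# For every td.tdcd whose stripped text is exactly 'oui', climb to the enclosing <tr>
matching_rows = [
    td.find_parent("tr")
    for td in soup.select("td.tdcd")
    if td.get_text(strip=True) == "oui"
]

# Turn each matching <tr> into a plain list of cell strings
row_values = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in matching_rows
]
print(row_values)
```

`find_parent("tr")` does the same job as `td.parent` here, but it keeps working if the cell ever ends up wrapped in another tag.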
This does what you want. Just read the table into a DataFrame, then filter the DataFrame:
html = '''<table align="center" cellspacing="0" cellpadding="3" width="100%">
<tbody><tr>
<td class="tdhg" align="left"><b>Liste des candidats</b></td>
<td class="tdhv"><strong>Voix</strong></td>
<td class="tdhv"><strong>% Inscrits</strong></td>
<td class="tdhv"><strong>% Exprimés</strong></td>
<td class="tdhv"><strong>Elu(e)</strong></td>
</tr>
<tr>
<td class="tdcbf" align="left">M. Jean-François LAMOUR (UMP) </td>
<td class="tdcd" align="right">23 964</td>
<td class="tdcd" align="right"> 33,01</td>
<td class="tdcd" align="right"> 54,60</td>
<td class="tdcd" align="center">oui
</td>
</tr>
<tr>
<td class="tdcbf" align="left">M. Gilles ALAYRAC (RDG) </td>
<td class="tdcd" align="right">19 927</td>
<td class="tdcd" align="right"> 27,45</td>
<td class="tdcd" align="right"> 45,40</td>
<td class="tdcd" align="center">
</td>
</tr>
</tbody></table>'''
from io import StringIO
import pandas as pd

# Newer pandas versions expect a file-like object rather than a raw HTML string
table = pd.read_html(StringIO(html))[0]
# Keep any rows that have 'oui' in the row; doesn't matter which column
filter_table = table[(table == 'oui').any(axis=1)]
# Or if you specifically need to look in the last column
#filter_table = table[table.iloc[:,-1] == 'oui']
# Or specific column name
#filter_table = table[table[4] == 'oui']
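Output:
print(filter_table)
                               0       1     2     3    4
1  M. Jean-François LAMOUR (UMP)  23 964  3301  5460  oui
Note that 33,01 and 54,60 show up as 3301 and 5460 because read_html treats ',' as a thousands separator by default.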
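To make the difference between the filters easier to see, here is a small sketch on a hand-built DataFrame. It is only a stand-in for what `read_html` returns: the column names are made up for illustration, and I am assuming the empty "Elu(e)" cell comes back as a missing value.

```python
import pandas as pd

# Hand-built stand-in for the parsed table (assumption: the empty cell is a missing value)
table = pd.DataFrame(
    [
        ["M. Jean-François LAMOUR (UMP)", "23 964", "oui"],
        ["M. Gilles ALAYRAC (RDG)", "19 927", None],
    ],
    columns=["candidate", "votes", "elected"],  # hypothetical column names
)

# Row-wise mask: True for rows where any cell equals 'oui'
any_col = table[table.eq("oui").any(axis=1)]

# Mask built from the last column only
last_col = table[table.iloc[:, -1].eq("oui")]

print(any_col)
print(last_col)
```

Both filters return the same single row here. The last-column version is the safer choice if "oui" could ever appear in another column.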
Alternative:
Here you iterate over the rows and append a row only if it contains 'oui':
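from bs4 import BeautifulSoup
import pandas as pd

html_soup = BeautifulSoup(html, 'lxml')
data_rows = html_soup.select('tr')
rows = []
for row in data_rows:
    # Only the td.tdcd cells are collected, so the candidate name (td.tdcbf) is not included
    data = [x.text.strip() for x in row.find_all('td', {'class': 'tdcd'})]
    if 'oui' in data:
        rows.append(data)
table = pd.DataFrame(rows)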
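If you also want the candidate name and proper column labels, a variation along these lines might help. It is a sketch under a few assumptions: the `html` string is my own shortened copy of the table, the first `<tr>` is treated as the header row, only the last cell of each row is tested, and `html.parser` is used so that `lxml` is not required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: first row holds the headers)
html = """
<table>
<tr><td class="tdhg"><b>Liste des candidats</b></td><td class="tdhv"><strong>Voix</strong></td><td class="tdhv"><strong>Elu(e)</strong></td></tr>
<tr><td class="tdcbf">M. Jean-François LAMOUR (UMP) </td><td class="tdcd">23 964</td><td class="tdcd">oui
</td></tr>
<tr><td class="tdcbf">M. Gilles ALAYRAC (RDG) </td><td class="tdcd">19 927</td><td class="tdcd">
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every cell of every row, whatever its class, as stripped text
all_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
header, body = all_rows[0], all_rows[1:]

# Keep only the rows whose last cell is exactly 'oui'
elected = pd.DataFrame([r for r in body if r[-1] == "oui"], columns=header)
print(elected)
```

Because the whole row is kept, the resulting DataFrame has the name, the vote count and the "Elu(e)" flag in labelled columns.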