Looping through HTML with Beautiful Soup in Python

I am trying to iterate over an HTML table.

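The page I am scraping contains only one table, so it is easy to find. Under it there are several <tr> rows, and I want to look at all of them except the header rows, which are defined with <th> instead of <td>. Each <tr> consists of several <td> cells with different classes. I only want to collect the two <td> cells with class="table-name" and the <td> with class="table-score".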

I tried:

rows = html.find("table", class_="table").find_all("tr")
for row in rows:
    if row.find("th") is None:
        td_names = row.findall("td")
for td_name in td_names:
    print(td_name)

But I have not had any success with this.

So basically the HTML looks like this:

<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>

I am only looking for "John", "Jim", and "2-1".

Thanks in advance.

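find_all() returns a list of all elements that match the filter. You can use list indexes to select the elements you need: 0 for the first, 1 for the second, and so on.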

from bs4 import BeautifulSoup
html="""
<table>
<tr>
<th>Header</th>
</tr>
<tr>
<td class="table-rank">1</td>
<td class="table-name">John</td>
<td class="table-name">Jim</td>
<td class="table-place">Russia</td>
<td class="table-score">2-1</td>
</tr>
</table>
"""
soup=BeautifulSoup(html,'html.parser')
our_tr=soup.find('table').find_all('tr')[1] #the second tr in the table - index starts at 0
# get all td's of the second tr
our_tds=our_tr.find_all('td')
print(our_tds[1].text)
print(our_tds[2].text)
print(our_tds[4].text)

Output:

John
Jim
2-1
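
The same index-based idea can be applied to every data row rather than only the second one. The following is a minimal sketch, assuming each data row always has five <td> cells in the same order; the second row (Anna, Eve) is hypothetical data added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row variant of the table from the question.
html = """
<table>
  <tr><th>Header</th></tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
  <tr>
    <td class="table-rank">2</td>
    <td class="table-name">Anna</td>
    <td class="table-name">Eve</td>
    <td class="table-place">Spain</td>
    <td class="table-score">0-3</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

picked = []
for tr in soup.find("table").find_all("tr"):
    tds = tr.find_all("td")
    # Header rows contain only <th> cells, so they yield too few <td> and are skipped.
    if len(tds) < 5:
        continue
    # Positions 1, 2 and 4 hold the two names and the score.
    picked.append([tds[i].text for i in (1, 2, 4)])

print(picked)
```

Selecting by position is brittle if the column order ever changes, which is why selecting by class may be the safer choice.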

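In your specific example, .find("table", class_="table") will not return anything (it returns None), because it is looking for a table whose class name is "table". Your <table> tag here is just <table>, not <table class="table">.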
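
To make the point about the class filter concrete, here is a small sketch on a cut-down, hypothetical version of the table. It only illustrates the behaviour of find(); the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the table in the question (no class on <table>).
html = "<table><tr><td class='table-name'>John</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# No <table> carries class="table", so find() returns None.
missing = soup.find("table", class_="table")
print(missing)  # None

# Dropping the class filter matches the plain <table> tag.
table = soup.find("table")
# Guard against None before chaining find_all().
rows = table.find_all("tr") if table is not None else []
print(len(rows))
```

Chaining .find_all() directly onto a find() that returned None raises an AttributeError, so a guard like the one above can help when the markup is not under your control.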

I did the following and was able to extract the items with the classes you want.

from bs4 import BeautifulSoup
html = """
<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
t = soup.find('table')
td_data = []
for row in t.find_all('tr'):
    # Ignore any rows containing a <th> cell.
    if not row.th:
        # Generate a list of any strings found inside <td class="table-name"> tags.
        # Concatenate this list with td_data.  Do the same with cells of the class "table-score".
        td_data += [ s.string for s in row.find_all('td', class_="table-name") ]
        td_data += [ s.string for s in row.find_all('td', class_="table-score") ]
print(td_data)

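The reason I declare td_data as an empty list and then keep adding new lists to it is that you can run this algorithm on a table with multiple rows, any of which may contain what you are looking for. There are also ways to do various "or" searches that find tags with either of the desired classes, but since there are only two, I thought it was simple enough to collect the full list of table-name values and then the full list of table-score values. If one of the results is empty, td_data stays unchanged.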
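
For reference, the "or" search mentioned above could look something like the sketch below. It shows two possible forms, a list passed to class_ and a CSS selector group passed to select(); the variable names by_list and by_css are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A list passed to class_ matches a cell carrying any of the listed classes.
by_list = [td.string for td in soup.find_all("td", class_=["table-name", "table-score"])]

# The same selection written as a CSS selector group.
by_css = [td.string for td in soup.select("td.table-name, td.table-score")]

print(by_list)
print(by_css)
```

Both forms return the cells in document order, so names and scores stay in the order they appear in each row instead of being grouped by class.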

If I see table tags, I usually let pandas do the work; you can then filter out the columns you don't need or want.

html = """
<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""
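from io import StringIO

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found.
# Recent pandas versions deprecate passing literal HTML strings, hence StringIO.
df = pd.read_html(StringIO(html), skiprows=1)
results = df[0]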

Edit: if you care more about the actual class attributes, I can offer 2 alternatives.

Option 1:

Still use pandas to parse the table, but before that, use BeautifulSoup to remove the unwanted columns/tags/classes (whatever you want to call them) with .decompose():

from io import StringIO

import bs4
import pandas as pd

html = """
<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
for data in soup.find_all('td'):
    # Drop every cell whose first class is not one we want to keep.
    if data['class'][0] not in keep_list:
        data.decompose()
# Wrap the markup in StringIO: recent pandas versions deprecate literal HTML strings.
df = pd.read_html(StringIO(str(soup)), skiprows=1)
results = df[0]

Output:

print (results)
      0    1    2
0  John  Jim  2-1

Option 2:

Similar to the other solutions: just look for the specific class attributes.

import bs4
html = """
<table>
  <tr>
    <th>Header</th>
  </tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
alpha = soup.find_all('td', class_=lambda x: x in keep_list)
for data in alpha:
    print (data.text)
# or if wanted in list
results = [ data.text for data in alpha ]

Output:

John
Jim
2-1

Alternatively, the list can be built in 3 lines:

soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
results = [ data.text for data in soup.find_all('td', class_=lambda x: x in keep_list)]

Output:

print (results)
['John', 'Jim', '2-1']
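
A flat list like this loses the row boundaries once the table has more than one data row. As a rough sketch of one way to keep them, the loop below builds one record per row; the second row and the "names"/"score" keys are hypothetical and only serve the illustration.

```python
import bs4

# Hypothetical two-row variant of the table from the question.
html = """
<table>
  <tr><th>Header</th></tr>
  <tr>
    <td class="table-rank">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-place">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
  <tr>
    <td class="table-rank">2</td>
    <td class="table-name">Anna</td>
    <td class="table-name">Eve</td>
    <td class="table-place">Spain</td>
    <td class="table-score">0-3</td>
  </tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    names = [td.text for td in tr.find_all("td", class_="table-name")]
    score = tr.find("td", class_="table-score")
    # Header rows have neither names nor a score, so they are skipped here.
    if names and score is not None:
        records.append({"names": names, "score": score.text})

print(records)
```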
