Extracting data from a table only returns the last row



I got the following table from this website:

<table id="sample">
<tbody>
<tr class="toprow">
<td></td>
<td colspan="5">Number of Jurisdictions</td>
</tr>
<tr class="toprow">
<td>Region</td>
<td>Jurisdictions in the region</td>
<td>Jurisdictions that require IFRS&nbsp;Standards&nbsp;<br>
for all or most domestic publicly accountable entities</td>
<td>Jurisdictions that require IFRS Standards&nbsp;as % of total jurisdictions in the region</td>
<td>Jurisdictions that permit or require IFRS&nbsp;Standards for at least some (but not all or most) domestic publicly accountable entities</td>
<td>Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities</td>
</tr>
<tr>
<td class="leftcol">Europe</td>
<td class="data">44</td>
<td class="data">43</td>
<td class="data">98%</td>
<td class="data">1</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Africa</td>
<td class="data">23</td>
<td class="data">19</td>
<td class="data">83%</td>
<td class="data">1</td>
<td class="data">3</td>
</tr>
<tr>
<td class="leftcol">Middle East</td>
<td class="data">13</td>
<td class="data">13</td>
<td class="data">100%</td>
<td class="data">0</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Asia-Oceania</td>
<td class="data">33</td>
<td class="data">24</td>
<td class="data">73%</td>
<td class="data">3</td>
<td class="data">6</td>
</tr>
<tr>
<td class="leftcol">Americas</td>
<td class="data">37</td>
<td class="data">27</td>
<td class="data">73%</td>
<td class="data">8</td>
<td class="data">2</td>
</tr>
<tr>
<td class="leftcol" style="border-top:2px solid #000000"><strong>Totals</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>150</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>126</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>84%</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>13</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>11</strong></td>
</tr>
<tr>
<td class="leftcol"><strong>As % <br>
of 150</strong></td>
<td class="data"><strong>100%</strong></td>
<td class="data"><strong>84%</strong></td>
<td class="data"><strong>&nbsp;</strong></td>
<td class="data"><strong>9%</strong></td>
<td class="data"><strong>7%</strong></td>
</tr>
</tbody>
</table>

Here is my attempt:

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Site URL
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the HTML of the whole page
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify())  # print the parsed HTML
# Select the table whose id is "sample"
gdp = soup.select("table#sample")[0]
rows = []
cols = []
# Collect the header text from every row with the class "toprow"
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)

# Collect the cell text of every remaining row
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)

As far as I can tell, cols comes out correctly:

['', 'Number of Jurisdictions', 'Region', 'Jurisdictions in the region', 'Jurisdictions that require IFRS\xa0Standards\xa0\r\n        for all or most domestic publicly accountable entities', 'Jurisdictions that require IFRS Standards\xa0as % of total jurisdictions in the region', 'Jurisdictions that permit or require IFRS\xa0Standards for at least some (but not all or most) domestic publicly accountable entities', 'Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities']

The problem is the rows: I only get the last row:

['As % \r\n            of 150', '100%', '84%', '\xa0', '9%', '7%']

And I get this error:

ValueError: 8 columns passed, passed data have 6 columns

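There are two tr elements with the class toprow. The first one only holds the spanning "Number of Jurisdictions" caption, so skip it and read the column names from the second: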

for g in gdp.select('tr.toprow')[1:]:
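
To see why the slice resolves the ValueError, here is a minimal sketch that counts the cells in each row. It assumes a trimmed, inline copy of the table (the HTML string and the variable names are illustrative, not taken from the live page) and uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real table: two "toprow" header rows and one data row.
HTML = """
<table id="sample">
<tbody>
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td></tr>
<tr><td class="leftcol">Europe</td><td class="data">44</td><td class="data">43</td>
<td class="data">98%</td><td class="data">1</td><td class="data">0</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.select_one("table#sample")

# Number of <td> cells in each header row and in each data row.
header_counts = [len(tr.select("td")) for tr in table.select("tr.toprow")]
data_counts = [len(tr.select("td")) for tr in table.select("tr:not(.toprow)")]

all_headers = sum(header_counts)       # both header rows together
kept_headers = sum(header_counts[1:])  # after skipping the first header row

print(header_counts, data_counts, all_headers, kept_headers)
```

Taking both header rows yields 2 + 6 = 8 names for rows of 6 cells, which is the mismatch pandas reports; after the slice, 6 names remain for 6 cells.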

Your solution would then look like this:

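from bs4 import BeautifulSoup
import pandas as pd

# html_content is the page source fetched with requests in the question
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.select("table#sample")[0]
rows = []
cols = []
# Skip the first "toprow" row so only the six real column names are collected
for g in gdp.select('tr.toprow')[1:]:
    for c in g.select('td'):
        cols.append(c.text)

# Every row without the "toprow" class is a data row
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)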
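
The cell text above still carries non-breaking spaces (\xa0) and line breaks (\r\n) from the page. As an optional follow-up, here is a minimal sketch that normalizes the whitespace while building the DataFrame. The trimmed HTML string and the clean helper are hypothetical and only there to make the example self-contained; picking the last toprow row is equivalent to the [1:] slice when there are exactly two header rows.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed, illustrative copy of the table (three columns instead of six).
HTML = """
<table id="sample">
<tbody>
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>Jurisdictions in the region</td><td>Jurisdictions that require IFRS&nbsp;Standards&nbsp;<br>
for all or most entities</td></tr>
<tr><td class="leftcol">Europe</td><td class="data">44</td><td class="data">43</td></tr>
<tr><td class="leftcol"><strong>As % <br>
of 150</strong></td><td class="data"><strong>100%</strong></td><td class="data"><strong>&nbsp;</strong></td></tr>
</tbody>
</table>
"""


def clean(cell):
    """Hypothetical helper: collapse all whitespace, including \\xa0, to single spaces."""
    return " ".join(cell.get_text(" ").split())


soup = BeautifulSoup(HTML, "html.parser")
table = soup.select_one("table#sample")

# Use the last "toprow" row as the header.
header_row = table.select("tr.toprow")[-1]
cols = [clean(td) for td in header_row.select("td")]

# One list of cleaned cell strings per data row.
rows = [[clean(td) for td in tr.select("td")]
        for tr in table.select("tr:not(.toprow)")]

df = pd.DataFrame(rows, columns=cols)
print(df)
```

str.split() with no arguments splits on any Unicode whitespace, so a cell that holds only &nbsp; ends up as an empty string rather than '\xa0'.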
