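How to scrape specific columns from a table with BeautifulSoup and return them as a pandas DataFrame

I am trying to parse the HDI table and load the data into a pandas DataFrame with the columns Country and HDI_score.

I have been trying to load the country column with the following code: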

import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
# `table` was not defined in the original post; this selector is an assumption.
table = bsObj.find('table', class_='wikitable')
df = pd.DataFrame(columns=['Countries', 'HDI_score'])
for row in table.find_all('tr'):
    columns = row.find_all('td')

    if(columns != []):
        countries = columns[1].text.strip()
        hdi_score = columns[2].text.strip()
        # Note: DataFrame.append was removed in pandas 2.0, so this line needs an older pandas.
        df = df.append({'Countries': countries, 'HDI_score': hdi_score}, ignore_index=True)

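In the result of my code, instead of country names I get the values from the "Rank change over 5 years" column. I tried changing the column index, but that did not help.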

You can use pandas to grab the appropriate table: match='Rank' returns the right table, and you can then extract the columns of interest.

import pandas as pd
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index', match='Rank')[0]
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
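
To see what match does without hitting Wikipedia, here is a minimal offline sketch. The two-table HTML string and the names HTML, tables and hdi are made up for illustration; it assumes pandas has an HTML parser such as lxml available.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one contains the text "Rank".
HTML = """
<table>
  <tr><th>Legend</th><th>Meaning</th></tr>
  <tr><td>+</td><td>increase</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
  <tr><td>2</td><td>Ireland</td><td>0.955</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(HTML), match="Rank")

# Pick the columns of interest by name from the first (and only) matching table.
hdi = tables[0][["Nation", "HDI"]]
print(hdi)
```

Selecting by column name rather than by position keeps working if extra columns are added in between, as long as the header labels stay the same.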

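As per the comments, I don't see much point in involving bs4 if you are still using pandas. It would look like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
soup = bs(r.content, 'lxml')
# Newer soupsieve releases prefer :-soup-contains over the deprecated :contains.
table = pd.read_html(str(soup.select_one('table:has(th:contains("Rank"))')))[0]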
columns = ['Nation','HDI']
table = table.loc[:, columns].iloc[:, :2]
table.columns = columns
print(table)
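
A small offline variant of the same idea, in case it helps to experiment. This is only a sketch: PAGE, soup, target and result are illustrative names, the HTML is invented, and it assumes a soupsieve version that supports :-soup-contains plus an HTML parser for pandas (for example lxml).

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: a navigation table followed by the table we want.
PAGE = """
<table>
  <tr><th>Section</th></tr>
  <tr><td>Overview</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Nation</th><th>HDI</th></tr>
  <tr><td>1</td><td>Norway</td><td>0.957</td></tr>
  <tr><td>2</td><td>Ireland</td><td>0.955</td></tr>
  <tr><td>2</td><td>Switzerland</td><td>0.955</td></tr>
</table>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# Select the first table that has a header cell containing the text "Rank".
target = soup.select_one('table:has(th:-soup-contains("Rank"))')

# Wrap the markup in StringIO so pandas treats it as a file-like object.
result = pd.read_html(StringIO(str(target)))[0][["Nation", "HDI"]]
print(result)
```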

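Note: upvote QHarr's answer, because in my opinion it is also the most straightforward solution using pandas.

In addition, and to answer your question: you can also select the columns with BeautifulSoup alone. Just combine css selectors and stripped_strings.

Example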

import requests
import pandas as pd
from bs4 import BeautifulSoup
html = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index")
bsObj = BeautifulSoup(html.text, 'html.parser')
pd.DataFrame(
[list(r.stripped_strings)[-3:-1] for r in bsObj.select('table tr:has(span[data-sort-value])')],
columns=['Countries', 'HDI_score']
)

Output

Countries	HDI_score
Norway	0.957
Ireland	0.955
Switzerland	0.955