我想从下面的页面读取表格:各国GDP
我尝试了pandas read_html命令,得到以下结果:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data =pd.DataFrame(pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2])
print(data.head())
Country/Territory UN Region ... United Nations[15]
Country/Territory UN Region ... Estimate Year
0 World — ... 85328323 2020
1 United States Americas ... 20893746 2020
2 China Asia ... 14722801 [n 1]2020
3 Japan Asia ... 5057759 2020
4 Germany Europe ... 3846414 2020
问题是:我如何删除第一个内容?例如,我们可以遍历所有行:
for index, row in data.iterrows():
print(index,row)
并创建空数据框和所有元素从索引1开始(因此删除第一个索引并将其余元素保存在空数据框中),但我确信存在更专业的方式(也许我们需要正则表达式的一点点帮助?还是BeautifulSoup?)提前感谢
在伟大的人的帮助下,我已经解决了这个问题,但还有一个额外的问题,请查看下表:
Country/Territory UN Region Estimate ... Year Estimate Year
0 World — 101560901 ... 2021 85328323 2020
1 United States Americas 25035164 ... 2021 20893746 2020
2 China Asia 18321197 ... [n 3]2021 14722801 [n 1]2020
3 Japan Asia 4300621 ... 2021 5057759 2020
4 Germany Europe 4031149 ... 2021 3846414 2020
你可以在相同的数据前面看到额外的不必要的符号[n 1],可能是因为NaN或类似的东西,我们可以过滤掉这些数据吗?把剩下的都留下?
编辑:我试图创建一个通用函数,然后应用到所有的年份列,这里是一个非常简单的例子:如果我们只提取第一个元素:
result =data.loc[2,"Year"][0]
等于:[n 1]2022
当然我可以拆分它:
print(result.split("]")[1])
根据这些信息,我创建了以下函数:
def split_column(text):
if len(text)==4:
return int(text)
else:
return int(text.split("]")[1])
逻辑是,如果len(text)==4意味着我们有一个仅由4位数字组成的字符串(如2020),因此直接转换为数字,否则应用前面示例中提到的逻辑,但是当我运行:
data['Year'] =data['Year'].apply(split_column,axis=1)
它给我错误:
AttributeError: 'Series' object has no attribute 'split'. Did you mean: 'plot'?
是不是很奇怪?提前感谢
最后用给定代码求解:
def split_column(text):
if len(text)==4:
return int(text)
elif text=="—":
return 0
else:
return int(text.split("]")[1])
是完整的代码:
import requests
from bs4 import BeautifulSoup
import pandas as pd
wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
def split_column(text):
if len(text)==4:
return int(text)
elif text=="—":
return 0
else:
return int(text.split("]")[1])
data = pd.DataFrame(pd.read_html(wiki_link)[2]).droplevel(0, axis=1).loc[:, lambda x: ~x.columns.duplicated()]
data.dropna(inplace=True)
print(data['Year'].value_counts())
print(data['Year'])
# data['Year'] =data['Year'].map(split_column)
# for col in data.columns:
# print(col)
# # if col=='Year':
# # data[col] =data[col].map(split_column)
data['Year'] =data['Year'].map(split_column)
print(data.head())
# result =data.loc[2,"Year"][0]
# print(result.split("]")[1])
# print(result)
# print(data.head())
# print(data.columns)
# print(data['Year'].loc[2,:][0].split("]")[1])
# print(len(data['Year'].iloc[0,0]))
似乎列是一个MultiIndex,所以你可以使用droplevel
:
wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
data = (
pd.DataFrame(pd.read_html(wiki_link)[2])
.droplevel(0, axis=1) # <- we drop here the top header
.loc[:, lambda x: ~x.columns.duplicated()]
.query(r'~Year.str.contains(r"[")')
)
输出:
print(data)
Country/Territory UN Region Estimate Year
0 World — 101560901 2022
1 United States Americas 25035164 2022
3 Japan Asia 4300621 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
[207 rows x 4 columns]
更新:
根据@Corralien的评论,你似乎想要这个:
df = pd.DataFrame(pd.read_html(wiki_link)[2])
out = (
df.iloc[:, :2].droplevel(0, axis=1)
.join(df.iloc[:, 2:]
.pipe(lambda df_: df_.set_axis([f"{c2} ({c1.split('[')[0]})"
for c1,c2 in df_.columns], axis=1)))
.replace(r"[.*]", "", regex=True)
)
输出:
print(out)
Country/Territory UN Region Estimate (IMF) Year (IMF)
0 World — 101560901 2022
1 United States Americas 25035164 2022
2 China Asia 18321197 2022
.. ... ... ... ...
214 Nauru Oceania 134 2022
215 Montserrat Americas — —
216 Tuvalu Oceania 64 2022
Estimate (World Bank) Year (World Bank) Estimate (United Nations)
0 96513077 2021 85328323
1 22996100 2021 20893746
2 17734063 2021 14722801
.. ... ... ...
214 133 2021 135
215 — — 68
216 63 2021 55
Year (United Nations)
0 2020
1 2020
2 2020
.. ...
214 2020
215 2020
216 2020
[217 rows x 8 columns]
您应该手动解析HTML表:
import pandas as pd
import numpy as np
import requests
import bs4
import re
import io
# Get HTML table
resp = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
soup = bs4.BeautifulSoup(resp.text)
table = soup.find('caption', text=re.compile('GDP.*by country')).parent
# Create headers
header = pd.MultiIndex.from_product([['IMF', 'World Bank', 'United Nations'], ['Estimate', 'Year']], names=['Source', 'Data'])
# Parse data
data = {}
for row in table.find_all('tr', {'class': None}):
cols = row.find_all('td')
key = cols[0].text.strip(), cols[1].text.strip()
data[key] = []
for col in cols[2:]:
if col.has_attr('colspan'):
data[key].append(np.nan) # Estimate
data[key].append(pd.NA) # Year
continue
val = int(re.sub('[.*]', '', col.text).strip().replace(',', ''))
data[key].append(val)
df = pd.DataFrame.from_dict(data, orient='index', columns=header)
df.index = pd.MultiIndex.from_tuples(df.index, names=['Country', 'Territory'])
输出:
>>> df
Source IMF World Bank United Nations
Data Estimate Year Estimate Year Estimate Year
Country Territory
United States Americas 25035164.0 2022 22996100.0 2021 20893746.0 2020
China Asia 18321197.0 2022 17734063.0 2021 14722801.0 2020
Japan Asia 4300621.0 2022 4937422.0 2021 5057759.0 2020
Germany Europe 4031149.0 2022 4223116.0 2021 3846414.0 2020
India Asia 3468566.0 2022 3173398.0 2021 2664749.0 2020
... ... ... ... ... ... ...
Palau Oceania 226.0 2022 218.0 2021 264.0 2020
Kiribati Oceania 207.0 2022 207.0 2021 181.0 2020
Nauru Oceania 134.0 2022 133.0 2021 135.0 2020
Montserrat Americas NaN <NA> NaN <NA> 68.0 2020
Tuvalu Oceania 64.0 2022 63.0 2021 55.0 2020
[216 rows x 6 columns]
现在你可以重塑你的数据框架了:
>>> pd.concat([df['IMF'], df['World Bank'], df['United Nations']], keys=df.columns.levels[0])
Data Estimate Year
Source Country Territory
IMF United States Americas 25035164.0 2022
China Asia 18321197.0 2022
Japan Asia 4300621.0 2022
Germany Europe 4031149.0 2022
India Asia 3468566.0 2022
... ... ...
World Bank Palau Oceania 264.0 2020
Kiribati Oceania 181.0 2020
Nauru Oceania 135.0 2020
Montserrat Americas 68.0 2020
Tuvalu Oceania 55.0 2020
[648 rows x 2 columns]
不同来源的平均总估计数(地区、国家、年份):
>>> (pd.concat([df['IMF'], df['World Bank'], df['United Nations']])
.groupby(['Territory', 'Country', 'Year'])['Estimate'].mean())
Territory Country Year
Africa Algeria 2020 147689.0
2021 167983.0
2022 187155.0
Angola 2020 62307.0
2021 72547.0
...
Oceania Tuvalu 2021 63.0
2022 64.0
Vanuatu 2020 855.0
2021 984.0
2022 984.0
Name: Estimate, Length: 608, dtype: float64