从Dataframe中移除顶部标题

我想从下面的页面读取表格:各国GDP

我尝试了pandas read_html命令，得到以下结果:

import requests
from bs4 import BeautifulSoup
import pandas as pd
data =pd.DataFrame(pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")[2])
print(data.head())
Country/Territory UN Region  ... United Nations[15]           
Country/Territory UN Region  ...           Estimate       Year
0             World         —  ...           85328323       2020
1     United States  Americas  ...           20893746       2020
2             China      Asia  ...           14722801  [n 1]2020
3             Japan      Asia  ...            5057759       2020
4           Germany    Europe  ...            3846414       2020

问题是:我如何删除第一个内容?例如，我们可以遍历所有行:

for index, row  in data.iterrows():
print(index,row)

并创建空数据框和所有元素从索引1开始(因此删除第一个索引并将其余元素保存在空数据框中)，但我确信存在更专业的方式(也许我们需要正则表达式的一点点帮助?还是BeautifulSoup?)提前感谢

在伟大的人的帮助下，我已经解决了这个问题，但还有一个额外的问题，请查看下表:

Country/Territory UN Region   Estimate  ...       Year  Estimate       Year
0             World         —  101560901  ...       2021  85328323       2020
1     United States  Americas   25035164  ...       2021  20893746       2020
2             China      Asia   18321197  ...  [n 3]2021  14722801  [n 1]2020
3             Japan      Asia    4300621  ...       2021   5057759       2020
4           Germany    Europe    4031149  ...       2021   3846414       2020

你可以在相同的数据前面看到额外的不必要的符号[n 1]，可能是因为NaN或类似的东西，我们可以过滤掉这些数据吗?把剩下的都留下?

编辑:我试图创建一个通用函数，然后应用到所有的年份列，这里是一个非常简单的例子:如果我们只提取第一个元素:

result =data.loc[2,"Year"][0]

等于:[n 1]2022

当然我可以拆分它:

print(result.split("]")[1])

根据这些信息，我创建了以下函数:

def  split_column(text):
if  len(text)==4:
return int(text)
else:
return int(text.split("]")[1])

逻辑是，如果len(text)==4意味着我们有一个仅由4位数字组成的字符串(如2020)，因此直接转换为数字，否则应用前面示例中提到的逻辑，但是当我运行:

data['Year'] =data['Year'].apply(split_column,axis=1)

它给我错误:

AttributeError: 'Series' object has no attribute 'split'. Did you mean: 'plot'?

是不是很奇怪?提前感谢

最后用给定代码求解:

def  split_column(text):
if  len(text)==4:
return int(text)
elif text=="—":
return 0
else:
return int(text.split("]")[1])

是完整的代码:

import requests
from bs4 import BeautifulSoup
import pandas as pd
wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
def  split_column(text):
if  len(text)==4:
return int(text)
elif text=="—":
return 0
else:
return int(text.split("]")[1])
data = pd.DataFrame(pd.read_html(wiki_link)[2]).droplevel(0, axis=1).loc[:, lambda x: ~x.columns.duplicated()]
data.dropna(inplace=True)
print(data['Year'].value_counts())
print(data['Year'])
# data['Year'] =data['Year'].map(split_column)
# for col in data.columns:
#     print(col)
#     # if col=='Year':
#     #     data[col] =data[col].map(split_column)
data['Year'] =data['Year'].map(split_column)
print(data.head())
# result =data.loc[2,"Year"][0]
# print(result.split("]")[1])
# print(result)
# print(data.head())
# print(data.columns)
# print(data['Year'].loc[2,:][0].split("]")[1])
# print(len(data['Year'].iloc[0,0]))

似乎列是一个MultiIndex，所以你可以使用droplevel:

wiki_link = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

data = (
pd.DataFrame(pd.read_html(wiki_link)[2])
.droplevel(0, axis=1) # <- we drop here the top header
.loc[:, lambda x: ~x.columns.duplicated()]
.query(r'~Year.str.contains(r"[")')
)

输出:

print(data)
Country/Territory UN Region   Estimate  Year
0               World         —  101560901  2022
1       United States  Americas   25035164  2022
3               Japan      Asia    4300621  2022
..                ...       ...        ...   ...
214             Nauru   Oceania        134  2022
215        Montserrat  Americas          —     —
216            Tuvalu   Oceania         64  2022
[207 rows x 4 columns]

更新:

根据@Corralien的评论，你似乎想要这个:

df = pd.DataFrame(pd.read_html(wiki_link)[2])

out = (
df.iloc[:, :2].droplevel(0, axis=1)
.join(df.iloc[:, 2:]
.pipe(lambda df_: df_.set_axis([f"{c2} ({c1.split('[')[0]})"
for c1,c2 in df_.columns], axis=1)))
.replace(r"[.*]", "", regex=True)
)

输出:

print(out)
Country/Territory UN Region Estimate (IMF) Year (IMF)  
0               World         —      101560901       2022   
1       United States  Americas       25035164       2022   
2               China      Asia       18321197       2022   
..                ...       ...            ...        ...   
214             Nauru   Oceania            134       2022   
215        Montserrat  Americas              —          —   
216            Tuvalu   Oceania             64       2022   
Estimate (World Bank) Year (World Bank) Estimate (United Nations)  
0                96513077              2021                  85328323   
1                22996100              2021                  20893746   
2                17734063              2021                  14722801   
..                    ...               ...                       ...   
214                   133              2021                       135   
215                     —                 —                        68   
216                    63              2021                        55   
Year (United Nations)  
0                    2020  
1                    2020  
2                    2020  
..                    ...  
214                  2020  
215                  2020  
216                  2020  
[217 rows x 8 columns]

您应该手动解析HTML表:

import pandas as pd
import numpy as np
import requests
import bs4
import re
import io
# Get HTML table
resp = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
soup = bs4.BeautifulSoup(resp.text)
table = soup.find('caption', text=re.compile('GDP.*by country')).parent
# Create headers
header = pd.MultiIndex.from_product([['IMF', 'World Bank', 'United Nations'], ['Estimate', 'Year']], names=['Source', 'Data'])
# Parse data
data = {}
for row in table.find_all('tr', {'class': None}):
cols = row.find_all('td')
key = cols[0].text.strip(), cols[1].text.strip()
data[key] = []
for col in cols[2:]:
if col.has_attr('colspan'):
data[key].append(np.nan)  # Estimate
data[key].append(pd.NA)   # Year
continue
val = int(re.sub('[.*]', '', col.text).strip().replace(',', ''))
data[key].append(val)
df = pd.DataFrame.from_dict(data, orient='index', columns=header)
df.index = pd.MultiIndex.from_tuples(df.index, names=['Country', 'Territory'])

输出:

>>> df
Source                          IMF        World Bank       United Nations      
Data                       Estimate  Year    Estimate  Year       Estimate  Year
Country       Territory                                                         
United States Americas   25035164.0  2022  22996100.0  2021     20893746.0  2020
China         Asia       18321197.0  2022  17734063.0  2021     14722801.0  2020
Japan         Asia        4300621.0  2022   4937422.0  2021      5057759.0  2020
Germany       Europe      4031149.0  2022   4223116.0  2021      3846414.0  2020
India         Asia        3468566.0  2022   3173398.0  2021      2664749.0  2020
...                             ...   ...         ...   ...            ...   ...
Palau         Oceania         226.0  2022       218.0  2021          264.0  2020
Kiribati      Oceania         207.0  2022       207.0  2021          181.0  2020
Nauru         Oceania         134.0  2022       133.0  2021          135.0  2020
Montserrat    Americas          NaN  <NA>         NaN  <NA>           68.0  2020
Tuvalu        Oceania          64.0  2022        63.0  2021           55.0  2020
[216 rows x 6 columns]

现在你可以重塑你的数据框架了:

>>> pd.concat([df['IMF'], df['World Bank'], df['United Nations']], keys=df.columns.levels[0])
Data                                  Estimate  Year
Source     Country       Territory                  
IMF        United States Americas   25035164.0  2022
China         Asia       18321197.0  2022
Japan         Asia        4300621.0  2022
Germany       Europe      4031149.0  2022
India         Asia        3468566.0  2022
...                                        ...   ...
World Bank Palau         Oceania         264.0  2020
Kiribati      Oceania         181.0  2020
Nauru         Oceania         135.0  2020
Montserrat    Americas         68.0  2020
Tuvalu        Oceania          55.0  2020
[648 rows x 2 columns]

不同来源的平均总估计数(地区、国家、年份):

>>> (pd.concat([df['IMF'], df['World Bank'], df['United Nations']])
.groupby(['Territory', 'Country', 'Year'])['Estimate'].mean())
Territory  Country  Year
Africa     Algeria  2020    147689.0
2021    167983.0
2022    187155.0
Angola   2020     62307.0
2021     72547.0
...   
Oceania    Tuvalu   2021        63.0
2022        64.0
Vanuatu  2020       855.0
2021       984.0
2022       984.0
Name: Estimate, Length: 608, dtype: float64

相关内容

最新更新

热门标签：