I am trying to build a database of US universities. I have been using Beautiful Soup and Pandas to do this, but ran into difficulties because each page has several tables to scrape. To recombine the data extracted from the two tables I tried using .merge(), but with no success at all.
My code is as follows:

```
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Connecticut
url = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Connecticut'
soup = bs(requests.get(url).text, 'html.parser')
table = soup.find_all('table')

# Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1.rename(columns={'Enrollment(2020)[4]': 'Enrollment', 'Founded[5]': 'Founded'}, inplace=True)
df2 = pd.read_html(str(table))[1]
df2 = df2.drop(['Type', 'Ref.'], axis=1)

df_Connecticut = df1.merge(df2, on=['School', 'Location', 'Control', 'Founded'])
df_Connecticut
```
I have tried doing the same thing for other states, but keep running into the same problem:
```
# Maine
url = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Maine'
soup = bs(requests.get(url).text, 'html.parser')
table = soup.find_all('table')

# Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1 = df1.drop(['Type[a]'], axis=1)
df1.rename(columns={'Location(s)': 'Location', 'Enrollment (2019)[b]': 'Enrollment'}, inplace=True)
df1 = df1.astype({'School': 'string', 'Location': 'string', 'Control': 'string', 'Enrollment': 'string', 'Founded': 'string'})
df2 = pd.read_html(str(table))[1]
df2 = df2.drop(['Cite'], axis=1)
df2.rename(columns={'Location(s)': 'Location'}, inplace=True)
df2 = df2.astype({'School': 'string', 'Location': 'string', 'Founded': 'string', 'Closed': 'string'})

df_Maine = df1.merge(df2, on=['School', 'Location', 'Founded'])
df_Maine
```
I am a complete Python beginner.
I agree with @Clegane that you probably mean concat, since merge implies that both frames must contain values that can be matched:

```
pd.concat([df1, df2], ignore_index=True)
```
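A toy illustration of the difference (school names taken from the tables below, the rows themselves are made up for the example):

```python
import pandas as pd

# Two small frames with only one school in common (hypothetical data)
df1 = pd.DataFrame({'School': ['Yale University', 'Bates College'],
                    'Founded': [1701, 1855]})
df2 = pd.DataFrame({'School': ['Yale University', 'Colby College'],
                    'Founded': [1701, 1813]})

# merge keeps only rows whose key values match in BOTH frames
merged = df1.merge(df2, on=['School', 'Founded'])
print(len(merged))   # 1 -- only Yale University appears in both

# concat simply stacks the rows of both frames
stacked = pd.concat([df1, df2], ignore_index=True)
print(len(stacked))  # 4 -- every row is kept
```

So if your two Wikipedia tables describe different sets of schools, merge can silently produce an empty (or near-empty) result, while concat gives you the union of the rows.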
Below is a complete example for the two states you mentioned. You can, of course, include other states in the list as well.
```
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_df = pd.DataFrame()
states = ['Maine', 'Connecticut']

for x in states:
    r = s.get(f'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_{x}')
    soup = bs(r.text, 'html.parser')
    # The first table on each page lists the currently operating schools
    df = pd.read_html(str(soup))[0]
    df.columns = ['School', 'Location', 'Control', 'Type', 'Enrollment', 'Founded']
    df['State'] = x
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df)
```
Terminal output:
```
School Location Control Type Enrollment Founded State
0 Bates College Lewiston Private Baccalaureate college 1964 1855 Maine
1 Beal University Bangor Private (for-profit) Associates college 229 1891[5] Maine
2 Bowdoin College Brunswick Private Baccalaureate college 1969 1794 Maine
3 Central Maine Community College Auburn Public Associates college 3746 1963[6] Maine
4 Colby College Waterville Private Baccalaureate college 2150 1813[7] Maine
... ... ... ... ... ... ... ...
61 University of New Haven West Haven Private Masters university 7587 1920 Connecticut
62 University of Saint Joseph West Hartford Private (Catholic) Masters university 2975 1932 Connecticut
63 Wesleyan University Middletown Private Baccalaureate college 3599 1831 Connecticut
64 Western Connecticut State University Danbury Public Masters university 6463 1903 Connecticut
65 Yale University New Haven Private Doctoral university 14910 1701 Connecticut
66 rows × 7 columns
```
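As you can see in the output, some values in the Founded column still carry Wikipedia footnote markers such as 1891[5]. A quick way to strip them, as a sketch assuming the big_df built above:

```python
import pandas as pd

# Sample values as they come out of pd.read_html (hypothetical frame
# standing in for big_df from the loop above)
big_df = pd.DataFrame({'Founded': ['1855', '1891[5]', '1813[7]']})

# Remove trailing footnote markers like "[5]" with a regex
big_df['Founded'] = big_df['Founded'].astype(str).str.replace(r'\[\w+\]', '', regex=True)
print(big_df['Founded'].tolist())  # ['1855', '1891', '1813']
```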