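I have two data frames that I eventually want to merge so that I can compare differences between differently spelled leader names.
My first data frame looks like this: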
year country_isocode country_name leader leader_start_date leader_end_date
20 1986 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
21 1987 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
22 1988 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
23 1989 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
24 1990 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
25 1991 AFG Afghanistan Mohammed Najibullah 1986-05-04 1992-04-16
26 1992 AFG Afghanistan Burhanuddin Rabbani 1992-06-28 1996-09-27
27 1993 AFG Afghanistan Burhanuddin Rabbani 1992-06-28 1996-09-27
28 1994 AFG Afghanistan Burhanuddin Rabbani 1992-06-28 1996-09-27
And my second looks like this:
leader_start_date leader_end_date LeaderCountryOrIGO LeaderCountryISO LeaderTitle LeaderLastName LeaderFullName
0 1986-05-04 1990-06-28 Afghanistan AFG General Secretary Najibullah Mohammad Najibullah
1 1989-02-21 1990-05-07 Afghanistan AFG Prime Minister Keshtmand Ali Keshtmand
2 1990-05-07 1992-04-15 Afghanistan AFG Prime Minister Khaliqyar Fazal Haq Khaliqyar
3 1992-04-16 1992-04-28 Afghanistan AFG President (Acting) Hatef Abdul Rahim Hatef
4 1992-04-28 1992-06-28 Afghanistan AFG President (Acting) Mojadedi Sibghatullah Mojadedi
5 1992-06-28 1996-09-27 Afghanistan AFG President Rabbani Burhanuddin Rabbani
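The first data frame has a separate row for each country/year entry in the data set, while the second has a single row for each unique leader and their period in office. My goal is to reshape the second data frame so that its shape is comparable to the first.
I want to take the range of years implied by the two datetime columns "leader_start_date" and "leader_end_date" in the second data set and use it to create a new set of rows, one for each year in that range, with the country and leader name information repeated. I then need to apply this across the whole second data frame to cover every unique leader name and their range of years.
Although the data sets are not a perfect match, giving both data frames the same shape will let me identify many of the matches.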
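To make the end goal concrete, here is a minimal sketch of the comparison once both frames share the country/year shape. It assumes the second frame has already been expanded to one row per leader and year (`df2_yearly` is a hypothetical name), that its ISO column has been renamed to match the first frame, and it uses a handful of rows taken from the samples above.

```python
import pandas as pd

# First data frame: one row per country/year (a few rows from the sample above).
df1 = pd.DataFrame({
    "year": [1988, 1993, 1994],
    "country_isocode": ["AFG", "AFG", "AFG"],
    "leader": ["Mohammed Najibullah", "Burhanuddin Rabbani", "Burhanuddin Rabbani"],
})

# Second data frame, assumed to be already expanded to one row per leader/year.
# The name df2_yearly is hypothetical; LeaderCountryISO is renamed so the keys match.
df2_yearly = pd.DataFrame({
    "year": [1988, 1993],
    "LeaderCountryISO": ["AFG", "AFG"],
    "LeaderFullName": ["Mohammad Najibullah", "Burhanuddin Rabbani"],
}).rename(columns={"LeaderCountryISO": "country_isocode"})

# Left merge keeps every country/year of the first frame;
# indicator=True adds a _merge column showing whether a partner row was found.
merged = df1.merge(df2_yearly, how="left", on=["country_isocode", "year"], indicator=True)

# Flag rows where both sources have a leader but the spelling differs.
merged["same_spelling"] = merged["leader"] == merged["LeaderFullName"]
mismatches = merged[(merged["_merge"] == "both") & ~merged["same_spelling"]]
print(mismatches[["year", "country_isocode", "leader", "LeaderFullName"]])
```

Rows flagged `left_only` in `_merge` are country/years with no partner in the second source, which keeps them separate from genuine spelling differences.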
Use:
import pandas as pd

# df is the second data frame (one row per leader)
# convert both columns to datetimes
df['leader_start_date'] = pd.to_datetime(df['leader_start_date'])
df['leader_end_date'] = pd.to_datetime(df['leader_end_date'])
# create a new year column from the start date
df.insert(0, 'year', df['leader_start_date'].dt.year)
# number of rows per leader: end year minus start year;
# missing end dates are replaced by the current year
# (cast to int because repeat() needs integer counts)
s = (df['leader_end_date'].dt.year.fillna(pd.to_datetime('now').year) - df['year']).astype(int)
# use this if the last year should be the one before the year of leader_end_date
df = df.loc[df.index.repeat(s)].copy()
# use this instead if the year of leader_end_date should be included as well
# df = df.loc[df.index.repeat(s + 1)].copy()
# add a per-leader counter to the year column
df['year'] += df.groupby(level=0).cumcount()
# create a default index
df = df.reset_index(drop=True)
print(df)
year leader_start_date leader_end_date LeaderCountryOrIGO
0 1986 1986-05-04 1990-06-28 Afghanistan
1 1987 1986-05-04 1990-06-28 Afghanistan
2 1988 1986-05-04 1990-06-28 Afghanistan
3 1989 1986-05-04 1990-06-28 Afghanistan
4 1989 1989-02-21 1990-05-07 Afghanistan
5 1990 1990-05-07 1992-04-15 Afghanistan
6 1991 1990-05-07 1992-04-15 Afghanistan
7 1992 1992-06-28 1996-09-27 Afghanistan
8 1993 1992-06-28 1996-09-27 Afghanistan
9 1994 1992-06-28 1996-09-27 Afghanistan
10 1995 1992-06-28 1996-09-27 Afghanistan
LeaderCountryISO LeaderTitle LeaderLastName LeaderFullName
0 AFG General Secretary Najibullah Mohammad Najibullah
1 AFG General Secretary Najibullah Mohammad Najibullah
2 AFG General Secretary Najibullah Mohammad Najibullah
3 AFG General Secretary Najibullah Mohammad Najibullah
4 AFG Prime Minister Keshtmand Ali Keshtmand
5 AFG Prime Minister Khaliqyar Fazal Haq Khaliqyar
6 AFG Prime Minister Khaliqyar Fazal Haq Khaliqyar
7 AFG President Rabbani Burhanuddin Rabbani
8 AFG President Rabbani Burhanuddin Rabbani
9 AFG President Rabbani Burhanuddin Rabbani
10 AFG President Rabbani Burhanuddin Rabbani
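As an alternative to `Index.repeat`, the same reshaping can be sketched with `DataFrame.explode`, building an explicit list of years for each leader. This is only an illustration on three rows from the sample: it assumes no end dates are missing and, unlike the output above, it includes the year of `leader_end_date` (the `s + 1` variant). The name `expanded` is made up for the example.

```python
import pandas as pd

# Three rows from the second data frame in the question.
df = pd.DataFrame({
    "leader_start_date": ["1986-05-04", "1989-02-21", "1992-06-28"],
    "leader_end_date": ["1990-06-28", "1990-05-07", "1996-09-27"],
    "LeaderCountryISO": ["AFG", "AFG", "AFG"],
    "LeaderFullName": ["Mohammad Najibullah", "Ali Keshtmand", "Burhanuddin Rabbani"],
})
df["leader_start_date"] = pd.to_datetime(df["leader_start_date"])
df["leader_end_date"] = pd.to_datetime(df["leader_end_date"])

# One list of years per leader; the end year is included (an assumption,
# drop the "+ 1" to stop at the year before leader_end_date).
df["year"] = [
    list(range(start, end + 1))
    for start, end in zip(df["leader_start_date"].dt.year, df["leader_end_date"].dt.year)
]

# explode() turns each list element into its own row and repeats the other columns.
expanded = df.explode("year", ignore_index=True)
expanded["year"] = expanded["year"].astype(int)
print(expanded[["year", "LeaderCountryISO", "LeaderFullName"]])
```

Including the end year keeps leaders whose term starts and ends in the same calendar year, who get zero rows with `repeat(s)`, which is why Hatef and Mojadedi are absent from the output above.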