比较panda中的两个数据帧并编辑结果



我有两个数据帧,df1和df2,df1包含将用于匹配df2 中数据的正确数据

我想在df2中找到与df1中城市名称不匹配的纬度和经度。

此外,我想在df2中找到";位于";在错误的国家

这是df1数据帧

id城市lat lng国家

1036323110 Katherine-14.4667 132.2667澳大利亚

1840015979 South Pasadena 27.7526-82.7394美国

1124755118 Beaconsfield 45.4333-73.8667加拿大

1250921305 Ferney Voltaire 46.2558 6.1081法国

1156346497江山28.7412 118.6225中国

1231393325迪拉6.4104 38.3100埃塞俄比亚

1192391794直布罗陀21.1072-76.1367古巴

1840054954汉普斯特德42.8821-71.1709美国

1840005111 West Islip 40.7097-732971美国

1076327352 Paulínia-22.7611-47.1542巴西

这是df2数据帧

id位置城市国家

16620625-5686 45.5333,-73.2833加拿大大教堂

16310427-5502 52.0000,84.9833 Belokurikha俄罗斯

16501010-4957-14.4667136.267Katherine Australia

16110430-8679 40.5626,-74.5743 Finderne美国

16990624-4174 27.7526,-90.7394 South Pasadena China

16790311-9092 35.98157,-160.41182美国江山

16650927-9151 44.7667,39.8667 West Islip Russia

16530328-2221-22.8858,-48.4450博图卡图巴西

16411229-7314 42.8821,-71.1709汉普斯特德美国

16060229-4175-7.7296,38.9500 Kibiti坦桑尼亚

到目前为止,我的代码是:

city_df = pd.merge(df1,df2,on ='city',how ='left')

首先将latlng列添加到df2

df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)

然后基于城市将df1合并为df2

city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))

在df2中查找";位于";在错误的国家

m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct  lng_correct            city country_correct             id              location        country       lat        lng
4      27.7526     -82.7394  South Pasadena   United States  16990624-4174     27.7526, -90.7394          China  27.75260  -90.73940
5      28.7412     118.6225       Jiangshan           China  16790311-9092  35.98157, -160.41182  United States  35.98157 -160.41182
6      40.7097     -73.2971      West Islip   United States  16650927-9151      44.7667, 39.8667         Russia  44.76670   39.86670

要比较这两个数据帧,首先可以更容易地使用类似格式的df1和df2。例如,df1是这样的:

lat       lng        country
city                                             
Katherine       -14.4667  132.2667      Australia
South Pasadena   27.7526  -82.7394  United States
Beaconsfield     45.4333  -73.8667         Canada
Ferney-Voltaire  46.2558    6.1081         France
Jiangshan        28.7412  118.6225          China
Dīla              6.4104   38.3100       Ethiopia
Gibara           21.1072  -76.1367           Cuba
Hampstead        42.8821  -71.1709  United States
West Islip       40.7097  -73.2971  United States
Paulínia        -22.7611  -47.1542         Brazil

和df2:

country2         lng2      lat2
city                                                       
Saint-Basile-le-Grand         Canada     -73.2833   45.5333
Belokurikha                   Russia      84.9833   52.0000
Katherine                  Australia     132.2667  -14.4667
Finderne               United States     -74.5743   40.5626
South Pasadena         United States     -82.7394   27.7526
West Islip             United States   -160.41182  35.98157
Belorechensk                  Russia      39.8667   44.7667
Botucatu                      Brazil     -48.4450  -22.8858
Hampstead              United States     -71.1709   42.8821
Kibiti                      Tanzania      38.9500   -7.7296

然后您可以在axis=1上使用pd.concat方法,如下所示:

df3 = pd.concat([df1,df2],axis=1)以便获得以下df:

lat       lng        country       country2         lng2      lat2
city                                                                                         
Katherine             -14.4667  132.2667      Australia      Australia     132.2667  -14.4667
South Pasadena         27.7526  -82.7394  United States  United States     -82.7394   27.7526
Beaconsfield           45.4333  -73.8667         Canada            NaN          NaN       NaN
Ferney-Voltaire        46.2558    6.1081         France            NaN          NaN       NaN
Jiangshan              28.7412  118.6225          China            NaN          NaN       NaN
Dīla                    6.4104   38.3100       Ethiopia            NaN          NaN       NaN
Gibara                 21.1072  -76.1367           Cuba            NaN          NaN       NaN
Hampstead              42.8821  -71.1709  United States  United States     -71.1709   42.8821
West Islip             40.7097  -73.2971  United States  United States   -160.41182  35.98157
Paulínia              -22.7611  -47.1542         Brazil            NaN          NaN       NaN
Saint-Basile-le-Grand      NaN       NaN            NaN         Canada     -73.2833   45.5333
Belokurikha                NaN       NaN            NaN         Russia      84.9833   52.0000
Finderne                   NaN       NaN            NaN  United States     -74.5743   40.5626
Belorechensk               NaN       NaN            NaN         Russia      39.8667   44.7667
Botucatu                   NaN       NaN            NaN         Brazil     -48.4450  -22.8858
Kibiti                     NaN       NaN            NaN       Tanzania      38.9500   -7.7296

最后,从连接的df3中,您可以得到df2中纬度和经度与df1:中城市名称不匹配的行

df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat      lng        country       country2         lng2      lat2
city                                                                             
West Islip  40.7097 -73.2971  United States  United States   -160.41182  35.98157

要在df2中找到";位于";在错误的国家:

df3[df3['country']!=df3['country2']]

最新更新