我有两个数据帧,df1和df2,df1包含将用于匹配df2 中数据的正确数据
我想在df2中找到与df1中城市名称不匹配的纬度和经度。
此外,我想在df2中找到";位于";在错误的国家
这是df1数据帧
id城市lat lng国家
1036323110 Katherine-14.4667 132.2667澳大利亚
1840015979 South Pasadena 27.7526-82.7394美国
1124755118 Beaconsfield 45.4333-73.8667加拿大
1250921305 Ferney Voltaire 46.2558 6.1081法国
1156346497江山28.7412 118.6225中国
1231393325迪拉6.4104 38.3100埃塞俄比亚
1192391794直布罗陀21.1072-76.1367古巴
1840054954汉普斯特德42.8821-71.1709美国
1840005111 West Islip 40.7097-732971美国
1076327352 Paulínia-22.7611-47.1542巴西
这是df2数据帧
id位置城市国家
16620625-5686 45.5333,-73.2833加拿大大教堂
16310427-5502 52.0000,84.9833 Belokurikha俄罗斯
16501010-4957-14.4667136.267Katherine Australia
16110430-8679 40.5626,-74.5743 Finderne美国
16990624-4174 27.7526,-90.7394 South Pasadena China
16790311-9092 35.98157,-160.41182美国江山
16650927-9151 44.7667,39.8667 West Islip Russia
16530328-2221-22.8858,-48.4450博图卡图巴西
16411229-7314 42.8821,-71.1709汉普斯特德美国
16060229-4175-7.7296,38.9500 Kibiti坦桑尼亚
到目前为止,我的代码是:
city_df = pd.merge(df1,df2,on ='city',how ='left')
首先将lat
和lng
列添加到df2
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
然后基于城市将df1
合并为df2
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))
在df2中查找";位于";在错误的国家
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
要比较这两个数据帧,首先可以更容易地使用类似格式的df1和df2。例如,df1是这样的:
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
和df2:
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
然后您可以在axis=1
上使用pd.concat
方法,如下所示:
df3 = pd.concat([df1,df2],axis=1)
以便获得以下df:
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
最后,从连接的df3中,您可以得到df2中纬度和经度与df1:中城市名称不匹配的行
df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
要在df2中找到";位于";在错误的国家:
df3[df3['country']!=df3['country2']]