Find duplicates and add their IDs as an attribute in pandas



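I'm working with a geopandas GeoDataFrame of roughly 4.5 million objects, where each object has a unique ID number ('PARCEL_SPI') and another code ('PC_PLANNO').

What I'd like to do is write some code that, for each object, finds all other objects with the same PC_PLANNO and adds their ID numbers as a list in a new attribute, say 'Same_code'. The df is called spine_copy.

Here is a quick example: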

PARCEL_SPI  PC_PLANNO
23908       LP12345
90435       LP12345
329048      LP90803
6409        LP2399
34534       LP90803
092824      LP12345
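
To make the sample reproducible, it can be built as a small DataFrame. This is only a sketch: a plain pandas DataFrame stands in for the GeoDataFrame (no geometry column), and storing PARCEL_SPI as integers is an assumption, which drops the leading zero of 092824.

```python
import pandas as pd

# Sample rows from the table above; integer IDs are an assumption.
spine_copy = pd.DataFrame({
    'PARCEL_SPI': [23908, 90435, 329048, 6409, 34534, 92824],
    'PC_PLANNO': ['LP12345', 'LP12345', 'LP90803', 'LP2399', 'LP90803', 'LP12345'],
})
print(spine_copy)
```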

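There is no need to convert to lists here. Filter the duplicated rows with Series.duplicated, use GroupBy.transform to reverse the order within each group, and pass the mask to numpy.where: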

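import numpy as np
# True for every row whose PC_PLANNO occurs more than once
m = spine_copy['PC_PLANNO'].duplicated(keep=False)
# Reverse the IDs within each group so each row receives its partner's ID
s = spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].transform(lambda x: x.to_numpy()[::-1])
spine_copy['Same_code'] = np.where(m, s, None)
print(spine_copy)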
   PARCEL_SPI PC_PLANNO Same_code
0       23908   LP12345     90435
1       90435   LP12345     23908
2      329048   LP90803     34534
3        6409    LP2399      None
4       34534   LP90803    329048
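
Note that reversing each group pairs a row with its mirror position, so this only gives the intended result when a code occurs at most twice. A quick check with the six-row sample from the question (a sketch; PARCEL_SPI is assumed to be stored as integers) shows the middle row of a three-row group being matched to itself:

```python
import numpy as np
import pandas as pd

# Six-row sample from the question, including the third LP12345 parcel.
spine_copy = pd.DataFrame({
    'PARCEL_SPI': [23908, 90435, 329048, 6409, 34534, 92824],
    'PC_PLANNO': ['LP12345', 'LP12345', 'LP90803', 'LP2399', 'LP90803', 'LP12345'],
})

m = spine_copy['PC_PLANNO'].duplicated(keep=False)
s = spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].transform(lambda x: x.to_numpy()[::-1])
paired = np.where(m, s, None)

# Collect the IDs that were paired with themselves.
self_matched = [own for own, other in zip(spine_copy['PARCEL_SPI'], paired) if own == other]
print(self_matched)
```

Here 90435 sits in the middle of the three LP12345 rows, so it comes back as its own partner.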

EDIT: with new data:

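# True for every row whose PC_PLANNO occurs more than once
m = spine_copy['PC_PLANNO'].duplicated(keep=False)
# One list of all PARCEL_SPI values per PC_PLANNO
new = spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].apply(list).rename('Same_code')
vals = spine_copy.join(new, on='PC_PLANNO')[['PARCEL_SPI','Same_code']]
# For each row, drop its own ID from its group's list
s = [[z for z in y if z != x] for x, y in vals.to_numpy()]
# np.where(m, s, None) fails on ragged lists in recent NumPy, so select row by row
spine_copy['Same_code'] = [x if dup else None for x, dup in zip(s, m)]
print(spine_copy)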
   PARCEL_SPI PC_PLANNO       Same_code
0       23908   LP12345  [90435, 92824]
1       90435   LP12345  [23908, 92824]
2      329048   LP90803         [34534]
3        6409    LP2399            None
4       34534   LP90803        [329048]
5       92824   LP12345  [23908, 90435]

Maybe you can try this (with df being your DataFrame):

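# Collect all IDs per PC_PLANNO into one list
other = df.groupby('PC_PLANNO')['PARCEL_SPI'].apply(lambda x: x.tolist()).reset_index()
# Attach each group's list to every row of that group
df = df.merge(other.rename(columns={'PARCEL_SPI':'Same_code'}), how='left', on=['PC_PLANNO'])
# Remove the row's own ID (the set difference does not preserve order)
df['Same_code'] = df[['PARCEL_SPI', 'Same_code']].apply(lambda x: list(set(x['Same_code']) - {x['PARCEL_SPI']}), axis=1)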

Output:

   PARCEL_SPI PC_PLANNO Same_code
0       23908   LP12345   [90435]
1       90435   LP12345   [23908]
2      329048   LP90803   [34534]
3        6409    LP2399        []
4       34534   LP90803  [329048]
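
If the order of the IDs should be preserved (the set difference above does not guarantee it), a dictionary lookup is another option. This is a sketch rather than a benchmarked solution; `ids_by_code` is a name made up for illustration, and the sample data is the six-row example from the question with integer IDs assumed:

```python
import pandas as pd

# Six-row sample from the question; integer IDs are an assumption.
df = pd.DataFrame({
    'PARCEL_SPI': [23908, 90435, 329048, 6409, 34534, 92824],
    'PC_PLANNO': ['LP12345', 'LP12345', 'LP90803', 'LP2399', 'LP90803', 'LP12345'],
})

# Map each PC_PLANNO to the list of its IDs, in original row order.
ids_by_code = df.groupby('PC_PLANNO')['PARCEL_SPI'].agg(list).to_dict()

# For each row, keep every ID of its group except its own.
df['Same_code'] = [
    [pid for pid in ids_by_code[code] if pid != own]
    for own, code in zip(df['PARCEL_SPI'], df['PC_PLANNO'])
]
print(df)
```

Rows with a unique code end up with an empty list, as in the output above.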
