Removing duplicates in pandas (Python) only removes them once



I have a CSV file from which I need to remove duplicate rows based on 3 columns. I tried the code below, but it only removes them once, not all of the possible duplicates.

ins.csv

sr,instrument_token,exchange_token,tradingsymbol,name
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf

python:

import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins[pd.DataFrame(np.sort(ins[['instrument_token','exchange_token','tradingsymbol']].values.astype(str),1)).duplicated()]
new_ins.to_csv('ins.csv', mode='w', header=new_ins.columns.tolist(), index=False)
The problem with the code above is that ins[...duplicated()] selects the rows flagged as duplicates (every occurrence after the first) rather than dropping them. Use drop_duplicates instead:

import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins.drop_duplicates(['instrument_token','exchange_token','tradingsymbol'], keep='first')
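
Applied to the ins.csv above, this keeps one row per (instrument_token, exchange_token, tradingsymbol) combination. A minimal sketch of the remaining step, writing the result back to the file as the question does:

# new_ins now holds one row per duplicate group:
#    sr  instrument_token  exchange_token tradingsymbol  name
# 0   4            367376            2112          nf50  nf50
# 1   9            361216            2127          nfbf  nfbf

# write the deduplicated rows back, as in the question's to_csv call
new_ins.to_csv('ins.csv', index=False)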

keep="first"实际上是默认值,所以不需要添加它。这意味着它将只保留第一次出现。

If you want to drop all of them, use keep=False.
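
A minimal sketch of the difference between the two settings, assuming the ins.csv shown above:

import pandas as pd

ins = pd.read_csv('ins.csv')
cols = ['instrument_token', 'exchange_token', 'tradingsymbol']

# keep='first' (the default): one row survives per duplicate group
print(ins.drop_duplicates(cols, keep='first'))

# keep=False: every row that has a duplicate anywhere is dropped;
# with the sample data, where each row appears six times, nothing is left
print(ins.drop_duplicates(cols, keep=False))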

Another approach, besides drop_duplicates, is to use groupby.nunique (here df is the DataFrame read from ins.csv):

df.groupby(['sr', 'instrument_token', 'exchange_token', 'tradingsymbol', 'name']).nunique().reset_index()
Out[24]:
   sr  instrument_token  exchange_token tradingsymbol  name
0   4            367376            2112          nf50  nf50
1   9            361216            2127          nfbf  nfbf
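
A self-contained sketch of this alternative, assuming the same ins.csv; note that it groups on all five columns, whereas drop_duplicates above used only three:

import pandas as pd

df = pd.read_csv('ins.csv')

# Grouping by every column collapses identical rows into one group each;
# nunique then leaves no value columns, and reset_index turns the group
# keys back into ordinary columns
deduped = df.groupby(['sr', 'instrument_token', 'exchange_token',
                      'tradingsymbol', 'name']).nunique().reset_index()
print(deduped)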
