I have a CSV file and I need to remove duplicate rows based on 3 columns. I tried the code below, but it only removed duplicates once, not all of the possible duplicates.
ins.csv:
sr,instrument_token,exchange_token,tradingsymbol,name
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
python:
import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins[pd.DataFrame(np.sort(ins[['instrument_token','exchange_token','tradingsymbol']].values.astype(str),1)).duplicated()]
new_ins.to_csv('ins.csv', mode='w', header=new_ins.columns.tolist(), index=False)
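A plausible cause of the behaviour above: `DataFrame.duplicated()` marks the repeated rows as `True`, so indexing with the mask directly selects the duplicates rather than removing them. Inverting the mask with `~` keeps one copy of each row. A minimal sketch on made-up data mirroring the structure of ins.csv:

```python
import pandas as pd

# Toy frame with the same columns as ins.csv (made-up values)
ins = pd.DataFrame({
    'sr': [4, 9, 4, 9],
    'instrument_token': [367376, 361216, 367376, 361216],
    'exchange_token': [2112, 2127, 2112, 2127],
    'tradingsymbol': ['nf50', 'nfbf', 'nf50', 'nfbf'],
})

# duplicated() marks every repeat after the first occurrence as True
mask = ins.duplicated(subset=['instrument_token', 'exchange_token', 'tradingsymbol'])

# ~mask drops the repeats and retains one copy of each row
deduped = ins[~mask]
print(len(deduped))  # 2 unique rows remain
```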
import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins.drop_duplicates(['instrument_token','exchange_token','tradingsymbol'], keep='first')
keep="first" is actually the default, so you don't need to add it. It means only the first occurrence of each duplicate will be kept.
If you want to drop all of them (including the first occurrence), use keep=False.
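To make the difference concrete, here is a small sketch (made-up data) comparing keep='first' with keep=False:

```python
import pandas as pd

df = pd.DataFrame({
    'instrument_token': [367376, 361216, 367376],
    'exchange_token': [2112, 2127, 2112],
    'tradingsymbol': ['nf50', 'nfbf', 'nf50'],
})

cols = ['instrument_token', 'exchange_token', 'tradingsymbol']

# keep='first' (the default): one copy of every duplicated row survives
first = df.drop_duplicates(cols, keep='first')
print(len(first))  # 2

# keep=False: every row that has any duplicate is dropped entirely
none = df.drop_duplicates(cols, keep=False)
print(len(none))  # 1 (only the nfbf row, which never repeats)
```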
An alternative to using drop_duplicates is groupby.nunique:
df.groupby(['sr', 'instrument_token', 'exchange_token', 'tradingsymbol', 'name']).nunique().reset_index()
Out[24]:
sr instrument_token exchange_token tradingsymbol name
0 4 367376 2112 nf50 nf50
1 9 361216 2127 nfbf nfbf
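If the goal is to overwrite ins.csv with the deduplicated rows, a full round-trip might look like the sketch below (the filename and columns follow the question; the sample file is recreated here so the snippet is self-contained):

```python
import pandas as pd

# Recreate a few rows of the question's ins.csv
csv_text = """sr,instrument_token,exchange_token,tradingsymbol,name
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
"""
with open('ins.csv', 'w') as f:
    f.write(csv_text)

ins = pd.read_csv('ins.csv')

# Keep the first occurrence of each (instrument_token, exchange_token, tradingsymbol) triple
new_ins = ins.drop_duplicates(['instrument_token', 'exchange_token', 'tradingsymbol'])

# mode='w' is already the default; header=True writes the existing column
# names, so no alias list is needed
new_ins.to_csv('ins.csv', index=False)

check = pd.read_csv('ins.csv')
print(len(check))  # 2
```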