如何使用重复值、排序值和透视表对数据w.r.t.单元格值进行分组,并保留多次出现的单元格值



我希望根据"name"列值对表中的数据进行分组,并保留表中"name"列值多次出现的信息。以下代码适用于给定的小数据表。

import pandas as pd
data={'Name':['Danny','Damny','Monny','Quony','Dimny','Danny'],
'Email':['danny@gmail.com','danny@gmail.com','monny@gmail.com','quony@gmail.com','danny@gmail.com','danny@gmail.com'],
'IBAN':['NLAMRO123456789','NLINGB126656723','BGFFEO128856754','NLAMRO123896763','DUDMRO567456722','NLRABO123456712']} #data with three columns
df=pd.DataFrame(data) #creation of dataframe
df['No Dutch Bank']=None #creation of extra column for analysis
df.loc[df['IBAN'].str.find('NL') == -1, 'No Dutch Bank']='ja'# to find rows, which contain non dutch bank numbers.
df_filt=df[['Name',"Email", "IBAN"]]#filtering columns needed in the final results
df_gb = df_filt[df_filt.duplicated(subset=['Name'], keep=False)].sort_values(by='Name', ascending=False).reset_index(drop=True)#filtering on a required column
piv_tab = pd.pivot_table(df_gb, index=['Name',"Email", "IBAN"])#applying pivot table
piv_tab

这适用于三列六行的原始数据。在实践中,我有三十列和三万(30000(行的数据。当我选择三列(名称、电子邮件和IBAN(并运行相同的代码时,该代码不会过滤出表中只出现过一次的行。

为什么?

除了透视逻辑之外的所有内容都是正确的,应该也适用于更大的数据集。

另一种方法是计算名称的计数,逐计数过滤>1,并通过左联接将名称合并回该表。

这种方式在按计数过滤>1,并且更灵活地支持计数过滤>2等

import pandas as pd
# count names
df_name = df_filt.groupby('Name')['Email'].count()
# get names with count > 1
df_name = df_name[df_name > 1].reset_index()
# merge filtered names back to original df to get filtered df
df_gb = pd.merge(df_name['Name'], df_filt, on=['Name'], how='left')
# sorting etc.
df_gb = df_gb.sort_values(by='Name', ascending=False).reset_index(drop=True)
# and some pivot stuff
...

最新更新