I am trying to find duplicates in a filtered DataFrame. My DataFrame:
   Document type  Application number
0  Return         1658
1  Sale           1658
2  Return         1659
3  Sale           1659
4  Return         1659
5  Return         1660
6  Return         1660
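I need to find duplicated application numbers only among the rows whose document type is "Return", and write the comment "Duplicate is found" in the Comment column of those rows. This is what I need: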
   Document type  Application number  Comment
0  Return         1658
1  Sale           1658
2  Return         1659                //Duplicate is found
3  Sale           1659
4  Return         1659                //Duplicate is found
5  Return         1660                //Duplicate is found
6  Return         1660                //Duplicate is found
But when I try to filter the DataFrame, I get the error TypeError: unhashable type: 'Series'. Here is my code:
def check_duplicated_app_nums(df,
                              col_app_num,
                              col_doc_type,
                              col_comments,
                              comment = 'Duplicate is found'):
    mask_doc_type = df[col_doc_type] == 'Return'
    mask_duplicate = df[mask_doc_type].duplicated(subset=col_app_num, keep=False)
    df.loc[mask_duplicate, col_comments] = df.apply(lambda x: '%s//%s' % (x[col_comments], comment), axis=1)
It works if I build mask_duplicate on the unfiltered DataFrame instead:
    mask_duplicate = df.duplicated(subset=col_app_num, keep=False)
But in that case it returns:
   Document type  Application number  Comment
0  Return         1658                //Duplicate is found
1  Sale           1658                //Duplicate is found
2  Return         1659                //Duplicate is found
3  Sale           1659                //Duplicate is found
4  Return         1659                //Duplicate is found
5  Return         1660                //Duplicate is found
6  Return         1660                //Duplicate is found
How can I flag duplicates only in the rows I need?
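Use duplicated on both columns, so that a row counts as a duplicate only when its document type and its application number both match another row:
m = df.duplicated(subset=['Document type', 'Application number'], keep=False)
df.loc[m, 'Comment'] = '//Duplicate is found'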
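To make the difference between the two masks concrete, here is a minimal sketch that rebuilds the sample data from the question and compares a mask on the application number alone with a mask on the (document type, application number) pair. The variable names by_number and by_pair are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Document type": ["Return", "Sale", "Return", "Sale", "Return", "Return", "Return"],
    "Application number": [1658, 1658, 1659, 1659, 1659, 1660, 1660],
})

# Mask on the application number alone: every number occurs at least twice,
# so every row is reported as a duplicate.
by_number = df.duplicated(subset="Application number", keep=False)

# Mask on the (document type, application number) pair: a row is a duplicate
# only if another row has the same type and the same number.
by_pair = df.duplicated(subset=["Document type", "Application number"], keep=False)

print(by_number.tolist())  # [True, True, True, True, True, True, True]
print(by_pair.tolist())    # [False, False, True, False, True, True, True]
```

Note that the pair-based mask would also flag repeated "Sale" rows if the data contained any; the sample data happens to have none.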
As a function:
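def check_duplicated_app_nums(df,
                              col_app_num,
                              col_doc_type,
                              col_comments,
                              comment = 'Duplicate is found'):
    # A row is a duplicate only if both its document type and its application number repeat
    m = df.duplicated(subset=[col_doc_type, col_app_num], keep=False)
    # Modifies df in place; rows that are not flagged get NaN if the column is new
    df.loc[m, col_comments] = f'//{comment}'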
check_duplicated_app_nums(df, 'Application number', 'Document type', 'Comment')
Output:
   Document type  Application number  Comment
0  Return         1658                NaN
1  Sale           1658                NaN
2  Return         1659                //Duplicate is found
3  Sale           1659                NaN
4  Return         1659                //Duplicate is found
5  Return         1660                //Duplicate is found
6  Return         1660                //Duplicate is found
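If the data may also contain repeated "Sale" rows, or the Comment column may already hold text that should be kept, a variant along the following lines might be closer to what the question asks for. This is only a sketch: the function name flag_duplicated_returns, the doc_type parameter, and the "checked" comment in the sample data are assumptions made for illustration, not part of the original code.

```python
import pandas as pd


def flag_duplicated_returns(df, col_app_num, col_doc_type, col_comments,
                            doc_type="Return", comment="Duplicate is found"):
    """Hypothetical helper: flag duplicates only among rows of one document type."""
    # Duplicates on the (type, number) pair, restricted to the requested type.
    is_dup = df.duplicated(subset=[col_doc_type, col_app_num], keep=False)
    mask = df[col_doc_type].eq(doc_type) & is_dup

    # Treat missing comments as empty strings; create the column if it is absent.
    if col_comments in df.columns:
        existing = df[col_comments].fillna("").astype(str)
    else:
        existing = pd.Series("", index=df.index)

    # Append to any existing comment instead of overwriting it.
    df[col_comments] = existing.where(~mask, existing + "//" + comment)
    return df


# Sample data from the question, plus one pre-existing comment (an assumption).
df = pd.DataFrame({
    "Document type": ["Return", "Sale", "Return", "Sale", "Return", "Return", "Return"],
    "Application number": [1658, 1658, 1659, 1659, 1659, 1660, 1660],
    "Comment": [None, None, "checked", None, None, None, None],
})

flag_duplicated_returns(df, "Application number", "Document type", "Comment")
print(df)
```

Unlike the function above, this version replaces NaN with an empty string in the rows that are not flagged, which keeps the column purely textual.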