pandas: compare columns row-wise and replace values that duplicate the first column

I have a DataFrame like the following:

import pandas as pd
data = {'name': ['the weather is good', ' we need fresh air','today is sunny', 'we are lucky'],
'name_1': ['we are lucky','the weather is good', ' we need fresh air','today is sunny'],
'name_2': ['the weather is good', 'today is sunny', 'we are lucky',' we need fresh air'],
'name_3': [ 'today is sunny','the weather is good',' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

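I want to compare the columns row-wise (that is, compare values that share the same index) and, wherever a column holds the same value as the first column, replace that value with the word "same". My desired output is: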

                  name               name_1              name_2               name_3
0  the weather is good         we are lucky                same       today is sunny
1    we need fresh air  the weather is good      today is sunny  the weather is good
2       today is sunny    we need fresh air        we are lucky    we need fresh air
3         we are lucky       today is sunny   we need fresh air                 same

To find these values, I tried the following:

import numpy as np
np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']))

But to replace them, I don't know how to formulate the (condition, x, y) arguments for np.where(). The following returns "same" for the whole "name" and "name_3" columns:

np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']),'same',df)
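A note on why whole columns get flagged: this looks like a NumPy broadcasting effect. The condition is one-dimensional (one boolean per row), and when it is broadcast against the 4x4 frame it lines up with the last axis, i.e. the columns. The sketch below reproduces that with the question's data; the variable names `cond`, `by_column` and `by_row` are only illustrative.

```python
import numpy as np
import pandas as pd

data = {'name': ['the weather is good', ' we need fresh air', 'today is sunny', 'we are lucky'],
        'name_1': ['we are lucky', 'the weather is good', ' we need fresh air', 'today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky', ' we need fresh air'],
        'name_3': ['today is sunny', 'the weather is good', ' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

# One boolean per ROW: True where 'name' matches any other column in that row
cond = (df['name'].eq(df['name_1'])
        | df['name'].eq(df['name_2'])
        | df['name'].eq(df['name_3'])).to_numpy()
print(cond.shape)  # a 1-D array with one entry per row

values = df.to_numpy(dtype=object)

# A 1-D condition broadcasts along the last axis, so entry j is applied
# to COLUMN j rather than to row j
by_column = np.where(cond, 'same', values)

# Reshaping to a column vector applies it per row instead, but that
# overwrites entire rows, which is not the desired result either
by_row = np.where(cond[:, None], 'same', values)

print(by_column)
print(by_row)
```

Either way a single row-level condition is too coarse here: the replacement needs one boolean per cell, which is what a per-column comparison against 'name' provides.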

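IIUC, you need to check which values in columns 'name_1', 'name_2' and 'name_3' are equal to the value in column 'name' on the same row, replace those values with 'same', and leave everything else as it is. You are right to use numpy.where, but try rewriting your statement as: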

import numpy as np
cols = ['name_1', 'name_2', 'name_3']
for c in cols:
    # keep the original value unless it equals 'name' on the same row
    df[c] = np.where(df['name'].eq(df[c]), 'same', df[c])

This gives you:

                  name               name_1              name_2               name_3
0  the weather is good         we are lucky                same       today is sunny
1    we need fresh air  the weather is good      today is sunny  the weather is good
2       today is sunny    we need fresh air        we are lucky    we need fresh air
3         we are lucky       today is sunny   we need fresh air                 same
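If you would rather avoid the explicit loop, one possible alternative (a sketch, not the method from the answer above) is to build the whole boolean mask at once with DataFrame.eq and apply it with DataFrame.mask. The name `is_dup` is only illustrative, and the data is the same as in the question.

```python
import pandas as pd

data = {'name': ['the weather is good', ' we need fresh air', 'today is sunny', 'we are lucky'],
        'name_1': ['we are lucky', 'the weather is good', ' we need fresh air', 'today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky', ' we need fresh air'],
        'name_3': ['today is sunny', 'the weather is good', ' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

cols = ['name_1', 'name_2', 'name_3']

# axis=0 aligns the 'name' Series with the row index, so every cell is
# compared with the 'name' value on its own row
is_dup = df[cols].eq(df['name'], axis=0)

# mask() replaces the cells where the condition is True and keeps the rest
df[cols] = df[cols].mask(is_dup, 'same')

print(df)
```

This should produce the same frame as the loop. Note that the comparison is exact, so a value with different leading or trailing whitespace would not count as a match.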
