Conditionally update values in a pandas DataFrame while moving across consecutive columns



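I have a pandas DataFrame of membership records that contains some data-entry errors (see the few examples below). Some members were erroneously identified as "Joined" when they were actually "Renewal", and/or are listed as "Joined" more than once. I want to correct these errors by turning "Joined" into "Renewal", or vice versa, depending on the year column.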

import numpy as np
import pandas as pd

df = pd.DataFrame({2012: {0: 'Renewal', 1: 'Joined', 2: np.nan, 3: np.nan},
                   2013: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: np.nan},
                   2014: {0: 'Joined', 1: 'Joined', 2: 'Renewal', 3: 'Renewal'},
                   2015: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: 'Renewal'}})
df
      2012     2013     2014     2015
0  Renewal  Renewal   Joined  Renewal
1   Joined  Renewal   Joined  Renewal
2      NaN   Joined  Renewal   Joined
3      NaN      NaN  Renewal  Renewal

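What works, but is inelegant and time-consuming: using np.where and a loop to update rows, I can make the corrections column by column. For example, to fix the rows with a repeated "Joined":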

years = list(df.columns)
for col in df[years[1:]]:
    df[col] = np.where(((df[years[0]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
for col in df[years[2:]]:
    df[col] = np.where(((df[years[1]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
for col in df[years[3:]]:
    df[col] = np.where(((df[years[2]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
df
      2012     2013     2014     2015
0  Renewal  Renewal   Joined  Renewal
1   Joined  Renewal  Renewal  Renewal
2      NaN   Joined  Renewal  Renewal
3      NaN      NaN  Renewal  Renewal

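This removes any repeated "Joined", but I am not skilled enough to do it for all columns at once. Right now I am updating each consecutive column by hand, and I have a million records and roughly 20 years of consecutive data.

I understand there are other things that need correcting, but if I can understand how to loop through and fix this issue as a first step, I think I can use the same approach to resolve the other problems.

Attempts at looping: I have tried several variations of more complex loops, but I either get no response, get an error, or accidentally overwrite all of the data.

For example,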

for x in range(len(years)):
    for col in df[years[x+1]]:
        df[col] = np.where(((df[years[x]] == 'Joined') & (df[col] == 'Joined')), 'Renewal', df[col])
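
A note on why this attempt probably fails (an observation, not part of the original post): `df[years[x+1]]` selects a single column, so the inner loop iterates over that column's cell values rather than over column labels, and `years[x+1]` runs past the last column once `x` reaches the final index. The sketch below generalizes the three manual passes shown earlier into one nested loop. It assumes the columns are already in chronological order, and the function name `fix_repeated_joined_loop` is made up for illustration.

```python
import numpy as np
import pandas as pd


def fix_repeated_joined_loop(df):
    """Return a copy in which a 'Joined' that follows an earlier
    'Joined' in the same row is rewritten as 'Renewal'."""
    out = df.copy()
    years = list(out.columns)
    # Stop one short of the end: the last column has no later columns.
    for i, earlier in enumerate(years[:-1]):
        earlier_joined = out[earlier] == 'Joined'
        # Slice the list of labels, not the DataFrame column.
        for later in years[i + 1:]:
            repeat = earlier_joined & (out[later] == 'Joined')
            out[later] = np.where(repeat, 'Renewal', out[later])
    return out


df = pd.DataFrame({2012: {0: 'Renewal', 1: 'Joined', 2: np.nan, 3: np.nan},
                   2013: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: np.nan},
                   2014: {0: 'Joined', 1: 'Joined', 2: 'Renewal', 3: 'Renewal'},
                   2015: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: 'Renewal'}})
fixed = fix_repeated_joined_loop(df)
print(fixed)
```

Working on a copy keeps the original frame intact, which guards against accidentally overwriting the data while experimenting.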

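Is there a way to update the records in bulk, first checking values against the first column and then moving on to the next column? It does not have to be a loop; I just assumed that would be the solution.

Thanks for any suggestions or examples.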

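If you want to replace every "Joined" except the first one in each row with "Renewal", you can create a copy of the DataFrame with all "Joined" values replaced by "Renewal", find the column where "Joined" first occurs in each row of the original, and then put that first "Joined" back into the copy.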

# Create a copy of the frame with every "Joined" replaced by "Renewal"
df2 = df.replace('Joined', 'Renewal')
# Column label of the first "Joined" in each row of the original
# (a row with no "Joined" returns its first column, hence the check below)
first_joined = df.eq('Joined').idxmax(axis=1)
# Restore the first "Joined" in the copy
for row, col in first_joined.items():
    if df.loc[row, col] == 'Joined':
        df2.loc[row, col] = 'Joined'

# print(df2)
      2012     2013     2014     2015
0  Renewal  Renewal   Joined  Renewal
1   Joined  Renewal  Renewal  Renewal
2      NaN   Joined  Renewal  Renewal
3      NaN      NaN  Renewal  Renewal
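
With around a million rows, restoring the first "Joined" row by row may be slow. A loop-free variant of the same idea is sketched below, assuming the columns are in chronological order: a running count of "Joined" along each row marks every occurrence after the first, and `DataFrame.mask` rewrites only those cells. The name `keep_first_joined` is illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd


def keep_first_joined(df):
    """Return a copy where only the first 'Joined' in each row is kept;
    later ones become 'Renewal'. NaN cells are left untouched."""
    is_joined = df.eq('Joined')  # NaN compares as False
    # Running count of "Joined" across the columns of each row.
    running = is_joined.astype(int).cumsum(axis=1)
    repeated = is_joined & (running > 1)
    return df.mask(repeated, 'Renewal')


df = pd.DataFrame({2012: {0: 'Renewal', 1: 'Joined', 2: np.nan, 3: np.nan},
                   2013: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: np.nan},
                   2014: {0: 'Joined', 1: 'Joined', 2: 'Renewal', 3: 'Renewal'},
                   2015: {0: 'Renewal', 1: 'Renewal', 2: 'Joined', 3: 'Renewal'}})
result = keep_first_joined(df)
print(result)
```

On the sample frame this should give the same result as `df2` above. The other corrections mentioned in the question would need their own boolean masks, built the same way.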
