I have the following DataFrame df, and I want to add a "distance" column to it, like this:
date | active | distance
---|---|---
01/09/2022 | 1 | 0
02/09/2022 | 0 | 1
05/09/2022 | 0 | 2
06/09/2022 | 0 | 3
07/09/2022 | 0 | 4
08/09/2022 | 1 | 0
09/09/2022 | 0 | 1
Create groups by comparing the column to 1 and taking Series.cumsum, then number the rows within each group with GroupBy.cumcount:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print (df)
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
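To see why this works, it can help to look at the intermediate group key. The following is a minimal sketch, assuming the sample data from the question is rebuilt by hand; the `group` name is only illustrative and not part of the answer.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "date": ["01/09/2022", "02/09/2022", "05/09/2022", "06/09/2022",
             "07/09/2022", "08/09/2022", "09/09/2022"],
    "active": [1, 0, 0, 0, 0, 1, 0],
})

# Each active row starts a new group: the running sum goes up by one there.
group = df["active"].eq(1).cumsum()

# cumcount numbers the rows inside each group, starting at 0.
df["distance"] = df.groupby(group).cumcount()

# Show the group key next to the result (the key is not stored in df).
print(df.assign(group=group))
```

One caveat: rows that come before the first active row all land in group 0 and are counted from 0 as well, so this presumably suits data that starts with an active row, as the sample does.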
Your column can be derived entirely from the "active" column. Your formula is equivalent to:
count_up = pd.Series(np.arange(len(df)), index=df.index)
# where() needs a boolean condition, so cast the 0/1 column first
distance = (count_up - count_up.where(df.active.astype(bool)).ffill()).astype(int)
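The intermediate values make the idea clearer: `count_up` is the row position, and the forward-filled series holds the position of the most recent active row. This is a sketch assuming the same sample data, rebuilt by hand; the name `last_active` is only illustrative.

```python
import numpy as np
import pandas as pd

# "active" column from the question, rebuilt by hand for illustration.
df = pd.DataFrame({"active": [1, 0, 0, 0, 0, 1, 0]})

# Row position of every row: 0, 1, 2, ...
count_up = pd.Series(np.arange(len(df)), index=df.index)

# Keep the position only on active rows (NaN elsewhere), then carry it forward.
last_active = count_up.where(df["active"].astype(bool)).ffill()

# Distance = current position minus the position of the last active row.
distance = (count_up - last_active).astype(int)
print(distance.tolist())
```

If the frame starts with inactive rows, `last_active` is NaN for them and the `astype(int)` cast would fail, so those rows would need separate handling.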
Use cumsum to label the groups that start at each active row.
g = (df['active']==1).cumsum()
# assign() returns a new DataFrame, so the result has to be stored
df = df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)
Result:
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
There are surely countless ways to get the same result. Here are six of them:
# ======================================================================
# ----------------------------------------------------------------------
# Provided in other answers (and fixed where necessary)
# Using only pandas' own methods:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
# nice pure pandas and short one - in my eyes the best choice
print(df)
# -------------------------------
cnt = pd.Series(np.arange(df.shape[0]), index=df.index)
distance = (cnt-cnt.where(df.active.astype(bool)).ffill()).astype(int)
df['distance'] = distance
# a much longer pure pandas one
print(df)
# -------------------------------
g = (df['active']==1).cumsum()
# assign() returns a new DataFrame, so the result has to be stored
df = df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
# using in addition a function as replacement for .cumcount()
print(df)
# ======================================================================
# ----------------------------------------------------------------------
# Using a loop over values in column 'active':
d = []
c = -1
for i in df['active']:
    c += 1
    if i:
        c = 0
    d.append(c)
df["distance"] = d
print(df)
# ----------------------------------------------------------------------
# Using a function
c = -1
def f(i):
    global c
    if i:
        c = 0
    else:
        c += 1
    return c
# -------------------------------
# with a list comprehension:
df['distance'] = [ f(i) for i in df['active'] ]
print(df)
# -------------------------------
# or pandas apply() function:
df['distance'] = df['active'].apply(f)
print(df)
Here is one of them, with complete code and data:
import pandas as pd
import numpy as np
df_print = """
date active
01/09/2022 1
02/09/2022 0
05/09/2022 0
06/09/2022 0
07/09/2022 0
08/09/2022 1
09/09/2022 0"""
with open('df_print', 'w') as fh:
    fh.write(df_print)
# split columns on any run of whitespace
df = pd.read_table('df_print', sep=r'\s+')
print(df)
distance = []
counter = -1
for index, row in df.iterrows():
    if row['active']:
        counter = 0
        distance.append(counter)
        continue
    counter += 1
    distance.append(counter)
df["distance"] = distance
print(df)
gives:
date active
0 01/09/2022 1
1 02/09/2022 0
2 05/09/2022 0
3 06/09/2022 0
4 07/09/2022 0
5 08/09/2022 1
6 09/09/2022 0
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
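As a quick sanity check, the grouped and the loop-based approaches can be compared on the same input. This is only a sketch, assuming the sample "active" values rebuilt by hand; the helper names `distance_cumcount` and `distance_loop` are hypothetical and do not appear in the answers.

```python
import pandas as pd

def distance_cumcount(active: pd.Series) -> pd.Series:
    """Group rows by the running count of active rows, then number them."""
    return active.groupby(active.eq(1).cumsum()).cumcount()

def distance_loop(active: pd.Series) -> pd.Series:
    """Plain Python loop: reset on an active row, otherwise count up."""
    out, counter = [], -1
    for flag in active:
        counter = 0 if flag else counter + 1
        out.append(counter)
    return pd.Series(out, index=active.index)

# "active" column from the question, rebuilt by hand for illustration.
active = pd.Series([1, 0, 0, 0, 0, 1, 0])

# Compare the two results element by element.
print(distance_cumcount(active).tolist() == distance_loop(active).tolist())
```

Comparing the plain lists sidesteps any integer dtype differences between the two results.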