I have an ordered DataFrame, df. It is grouped by 'ID' and sorted by 'order'.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
        'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
        'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9, np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
        'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1, 5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
    }
)
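Within each ID group, I need to sum col1 over the rows where col2 is NA. The sum also includes the col1 value of the next row in which col2 is present:
I would prefer a vectorized solution to keep it fast, but that may be difficult. I need to use this inside a groupby (col1_dynamic_sum should be grouped by ID).
What I have done so far is define a function that helps count the number of consecutive NAs in the preceding rows: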
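def count_prev_consec_na(input_col):
    """
    This function takes a dataframe Series (column) and outputs the number of consecutive missing values in previous rows
    """
    try:
        a1 = input_col.isna() + 0     ## missing
        a2 = ~input_col.isna() + 0    ## not missing
        b1 = a1.shift().fillna(0)     ## prev missing
        d = a1.cumsum()               ## running count of missing values
        e = b1 * a2                   ## 1 on a non-missing row that follows a missing row
        f = d * e                     ## running count, kept only on those rows
        g = f.replace(0, np.nan)
        h = g.ffill()
        h = h.fillna(0)
        i = h.shift()
        result = h - i                ## length of the NA run that has just ended
        result = result.fillna(0)
        return result
    except Exception as exc:
        ## Exception has no .message attribute in Python 3, so print the exception itself
        print(exc)
        return None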
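For comparison, here is a shorter sketch of the same idea: count the NA run that ends just before each non-missing row. The helper name `prev_na_run_length` and the small test Series are made up for illustration; this is one possible formulation under those assumptions, not a drop-in replacement that has been checked against every edge case.

```python
import numpy as np
import pandas as pd

def prev_na_run_length(s: pd.Series) -> pd.Series:
    """Hypothetical helper: length of the NA run that ends right before each non-missing row."""
    is_na = s.isna()
    # Every non-missing value starts a new block; the NA rows after it share that block.
    block = (~is_na).cumsum()
    # Running count of NAs inside each block.
    run = is_na.astype(int).groupby(block).cumsum()
    # Look one row back, and report the count only on non-missing rows.
    return run.shift(fill_value=0).where(~is_na, 0)

s = pd.Series([1.0, np.nan, 1.0, np.nan, np.nan, 3.0])
result = prev_na_run_length(s)
print(result.tolist())
```

It should be usable the same way as the function above, for example through `df.groupby('ID')['col2'].transform(prev_na_run_length)`.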
I think one solution is to use it to get the dynamic number of rows that the sum needs to roll back over:
df['roll_back_count'] = df.groupby(['ID'], as_index = False).col2.transform(count_prev_consec_na)
ID  order  col1  col2  roll_back_count
A       1   1.0   1.0              0.0
A       3   2.0   5.0              0.0
A       4   NaN   6.0              0.0
A       6   1.0   NaN              0.0
A       7   2.0   1.0              1.0  ## I want to sum col1 of order 6 and 7 and remove order 6 row
A       9   3.0   2.0              0.0
A      11   4.0   3.0              0.0
A      12   5.0   NaN              0.0
A      13   NaN   2.0              1.0  ## I want to sum col1 of order 12 and 13 and remove order 12 row
A      14   NaN   3.0              0.0
A      15   6.0   NaN              0.0
A      16   7.0   NaN              0.0
A      19   8.0   3.0              2.0  ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A      25   9.0   1.0              0.0
B       8   NaN   5.0              0.0
B      10   NaN   NaN              0.0
B      15   NaN   NaN              0.0
B      17  10.0   NaN              0.0
B      20  11.0   2.0              3.0  ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B      25  12.0   3.0              0.0
B      29   NaN   NaN              0.0
B      31  13.0   NaN              0.0
This is my desired output:
desired_output:
ID  order  col1_dynamic_sum  col2
A       1               1.0     1
A       3               2.0     5
A       4               NaN     6
A       7               3.0     1
A       9               3.0     2
A      11               4.0     3
A      13               5.0     2
A      14               NaN     3
A      19              21.0     3
A      25               9.0     1
B       8               NaN     5
B      20              21.0     2
B      25              12.0     3
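Note: the sum should ignore NAs. Again, I would prefer a vectorized solution, but because of the rolling effect that may not be possible.
Oops, I think I found a solution that doesn't involve rolling at all! I created a new grouping ID from the NA values in col2: each row gets the index of the next row where col2 is not missing (rows where col2 is present keep their own index). Then I aggregate using this grouping ID!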
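def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0    # 1 where col2 is present, 0 where it is missing
    b = a.replace(0, np.nan)        # 1 where present, NaN where missing
    c = rollback_col.index          # row labels of this ID group
    d = c * b                       # row label where present, NaN where missing
    d = d.bfill()                   # missing rows take the label of the next present row
    return d

df['na_group'] = df.groupby(['ID'], as_index=False).col2.transform(create_na_group)
df = df.loc[~df.na_group.isna()]    # drop trailing rows that have no later col2 value
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)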
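As an aside on the grouping key, the same na_group values can probably be built without a helper function, by masking the index where col2 is missing and back-filling within each ID. This is only a sketch of that alternative (it swaps `where` plus a grouped `bfill` in for create_na_group), run on a fresh copy of the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A'] * 14 + ['B'] * 8,
    'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
    'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9,
             np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
    'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1,
             5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
})

# Keep the row label where col2 is present, NaN elsewhere,
# then back-fill inside each ID so NA rows join the next present row.
df['na_group'] = (
    df.index.to_series()
      .where(df['col2'].notna())
      .groupby(df['ID'])
      .bfill()
)
print(df[['ID', 'order', 'col2', 'na_group']])
```

Rows whose na_group stays NaN (the last two rows of B here) have no later col2 value to attach to, so they are the ones that get dropped before aggregating.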
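I just need to find a way to make sure NaN does not turn into 0 in the aggregated result, as it does in rows 2, 7, and 10 of the output below.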
    ID  na_group  order  col1_dyn_sum  col2
0    A       0.0      1           1.0   1.0
1    A       1.0      3           2.0   5.0
2    A       2.0      4           0.0   6.0
3    A       4.0      7           3.0   1.0
4    A       5.0      9           3.0   2.0
5    A       6.0     11           4.0   3.0
6    A       8.0     13           5.0   2.0
7    A       9.0     14           0.0   3.0
8    A      12.0     19          21.0   3.0
9    A      13.0     25           9.0   1.0
10   B      14.0      8           0.0   5.0
11   B      18.0     20          21.0   2.0
12   B      19.0     25          12.0   3.0
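I will create two separate sum columns, one with lambda x: x.sum(skipna=False) and one with lambda x: x.sum(skipna=True). If the skipna=True sum is 0 and the skipna=False sum is NA, I will keep the final sum as NA; otherwise I will use the skipna=True sum as the final desired output.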
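Here is a rough sketch of that two-sum plan on the example data, next to a swapped-in alternative that I believe gives the same result here: passing `min_count=1` to the grouped sum, so a group with no non-missing col1 values stays NaN. The names `two_sum`, `min_count_sum`, `kept`, and `out` are just for illustration, and the grouping key is built with `where` plus a grouped `bfill` as a compact stand-in for create_na_group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A'] * 14 + ['B'] * 8,
    'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
    'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9,
             np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
    'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1,
             5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
})

# Grouping key: label of the next row (within ID) where col2 is present.
df['na_group'] = (
    df.index.to_series().where(df['col2'].notna()).groupby(df['ID']).bfill()
)
kept = df.dropna(subset=['na_group'])
g = kept.groupby(['ID', 'na_group'])['col1']

# The two-sum plan: compare a NaN-skipping sum with a strict one.
sum_skip = g.sum()                                  # all-NaN groups come out as 0
sum_strict = g.agg(lambda s: s.sum(skipna=False))   # any NaN makes the group NaN
two_sum = sum_skip.mask((sum_skip == 0) & sum_strict.isna())

# Alternative: require at least one non-missing value per group.
min_count_sum = g.sum(min_count=1)

# Assemble the final frame; both groupbys sort by the same keys, so rows line up.
out = kept.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col2=('col2', 'last'),   # each group ends on its only non-missing col2
)
out['col1_dyn_sum'] = min_count_sum.to_numpy()
print(out)
```

One caveat with the two-sum rule: a group whose col1 values are a genuine 0 plus a NaN would also be turned into NA, whereas `min_count=1` would keep the 0. That case does not occur in this data.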