Dynamically sum one column based on NA values in another column in Pandas



I have an ordered DataFrame, df. It is grouped by 'ID' and sorted by 'order'.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['A'] * 14 + ['B'] * 8,
        'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
        'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9, np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
        'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1, 5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
    }
)

Within each ID group, I need to sum col1 over the rows where col2 is NA. The sum also includes the col1 value of the next row where col2 is present.

I would prefer a vectorized solution so that it is fast, but that may be difficult. I need to use this inside a groupby (col1_dynamic_sum should be computed per ID).

What I have done so far is define a function that helps count the number of consecutive NAs in the rows just before a given row:

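def count_prev_consec_na(input_col):
    """
    Take a Series (column) and return, for each non-missing row, the number of
    consecutive missing values in the rows immediately before it (0 elsewhere).
    """
    try:
        a1 = input_col.isna() + 0  # 1 where missing
        a2 = ~input_col.isna() + 0  # 1 where not missing
        b1 = a1.shift().fillna(0)  # 1 where the previous row is missing
        d = a1.cumsum()  # running count of missing values
        e = b1 * a2  # 1 on the first non-missing row after a run of NAs
        f = d * e
        g = f.replace(0, np.nan)
        h = g.ffill()
        h = h.fillna(0)
        i = h.shift()
        result = h - i  # length of the NA run that just ended
        result = result.fillna(0)
        return result
    except Exception as err:
        print(err)  # exceptions have no .message attribute in Python 3
        return None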

I think one solution is to use this to get the dynamic number of rows over which the sum has to roll back:

df['roll_back_count'] = df.groupby('ID')['col2'].transform(count_prev_consec_na)
ID  order   col1    col2    roll_back_count
A   1       1.0     1.0         0.0
A   3       2.0     5.0         0.0
A   4       NaN     6.0         0.0
A   6       1.0     NaN         0.0  
A   7       2.0     1.0         1.0    ## I want to sum col1 of order 6 and 7 and remove order 6 row
A   9       3.0     2.0         0.0
A   11      4.0     3.0         0.0
A   12      5.0     NaN         0.0
A   13      NaN     2.0         1.0   ## I want to sum col1 of order 12 and 13 and remove order 12 row
A   14      NaN     3.0         0.0
A   15      6.0     NaN         0.0
A   16      7.0     NaN         0.0
A   19      8.0     3.0         2.0   ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A   25      9.0     1.0         0.0
B   8       NaN     5.0         0.0
B   10      NaN     NaN         0.0
B   15      NaN     NaN         0.0
B   17      10.0    NaN         0.0
B   20      11.0    2.0         3.0   ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B   25      12.0    3.0         0.0
B   29      NaN     NaN         0.0
B   31      13.0    NaN         0.0
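
For comparison, here is a minimal sketch of a shorter way to get the same kind of count, assuming the goal is "length of the NA run that ends just before each non-NA row". The name prev_na_run_length and the small demo Series are made up for illustration and are not part of my actual code.

```python
import numpy as np
import pandas as pd


def prev_na_run_length(s):
    """Hypothetical helper: length of the NA run that ends just before each non-NA row."""
    is_na = s.isna()
    # A new block starts at every non-NA row, so NA rows share a block
    # with the non-NA row that precedes them.
    block = (~is_na).cumsum()
    # Running count of NAs inside each block.
    run = is_na.astype(int).groupby(block).cumsum()
    # Look one row back, then zero out the NA rows themselves.
    return run.shift(fill_value=0).where(~is_na, 0)


demo = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan, 5.0])
counts = prev_na_run_length(demo)
print(counts.tolist())  # [0, 0, 0, 2, 0, 1]
```

It could presumably be applied per ID in the same way as the function above, through groupby('ID')['col2'].transform.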

This is my desired output:

desired_output:
ID  order   col1_dynamic_sum    col2
A   1       1.0                 1
A   3       2.0                 5
A   4       NaN                 6
A   7       3.0                 1
A   9       3.0                 2
A   11      4.0                 3
A   13      5.0                 2
A   14      NaN                 3
A   19      21.0                3
A   25      9.0                 1
B   8       NaN                 5
B   20      21.0                2
B   25      12.0                3

Note: the sum should ignore NAs. Again, I would prefer a vectorized solution, but that may not be possible because of the rolling effect.

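Well, I think I found a solution that does not involve rolling at all! I create a new grouping ID from the NA pattern in col2, using the row index of the rows where col2 is not NA. Then I aggregate by this grouping ID!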

def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0  # 1 where col2 has a value
    b = a.replace(0, np.nan)  # NaN where col2 is missing
    c = rollback_col.index
    d = c * b  # row index where col2 has a value, NaN elsewhere
    d = d.bfill()  # NA rows take the index of the next row with a value
    return d


df['na_group'] = df.groupby('ID')['col2'].transform(create_na_group)
df = df.loc[~df.na_group.isna()]  # drop trailing NA rows with no later col2 value
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)
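
To make the grouping label easier to follow, here is a minimal sketch on a made-up five-row column. It uses where instead of the multiplication in create_na_group, which I assume is equivalent for this purpose; the names marker and na_group are only illustrative.

```python
import numpy as np
import pandas as pd

col2 = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Row index where col2 has a value, NaN elsewhere.
marker = pd.Series(col2.index, index=col2.index).where(col2.notna())
# Back-fill so each NA row takes the label of the next row with a value.
na_group = marker.bfill()
print(na_group.tolist())  # [0.0, 3.0, 3.0, 3.0, nan]
```

Rows 1 to 3 end up sharing label 3.0, so they are aggregated together, while the trailing NA row gets no label and is dropped by the filter on na_group.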

I just need to find a way to make sure NaN does not turn into 0, as it does in rows 2, 7, and 10.

    ID  na_group    order   col1_dyn_sum    col2
0   A   0.0         1       1.0             1.0
1   A   1.0         3       2.0             5.0
2   A   2.0         4       0.0             6.0
3   A   4.0         7       3.0             1.0
4   A   5.0         9       3.0             2.0
5   A   6.0         11      4.0             3.0
6   A   8.0         13      5.0             2.0
7   A   9.0         14      0.0             3.0
8   A   12.0        19      21.0            3.0
9   A   13.0        25      9.0             1.0
10  B   14.0        8       0.0             5.0
11  B   18.0        20      21.0            2.0
12  B   19.0        25      12.0            3.0

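I will create two separate sum columns, one with lambda x: x.sum(skipna=False) and one with lambda x: x.sum(skipna=True). If the skipna=True sum is 0 and the skipna=False sum is NA, I will keep the final sum as NA; otherwise I will use the skipna=True sum as the final desired output.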
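
A minimal sketch of that plan, on a small made-up frame that already carries the na_group label. The frame, the helper names sum_skipna and sum_strict, and the column names are assumptions for illustration. Note that the rule cannot tell an all-NA group apart from a group that contains an NA plus real values adding up to 0, so the sketch also computes sum(min_count=1), a pandas option that returns NaN only when a group has no non-NA value at all and may be the simpler route.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that already carries the na_group label.
demo = pd.DataFrame({
    'ID': ['A'] * 6,
    'na_group': [0.0, 2.0, 2.0, 4.0, 4.0, 5.0],
    'order': [1, 3, 4, 6, 7, 9],
    'col1': [np.nan, 1.0, 2.0, np.nan, 5.0, np.nan],
    'col2': [6.0, np.nan, 1.0, np.nan, 2.0, 3.0],
})


def sum_skipna(x):
    return x.sum(skipna=True)   # an all-NA group gives 0


def sum_strict(x):
    return x.sum(skipna=False)  # any NA in the group gives NaN


out = demo.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    sum_skip=('col1', sum_skipna),
    sum_strict=('col1', sum_strict),
    col2=('col2', 'last'),  # the only non-NA col2 value in each group
)

# The rule described above: keep NaN when the skipping sum is 0 and the strict sum is NaN.
all_na = (out['sum_skip'] == 0) & out['sum_strict'].isna()
out['col1_dyn_sum'] = out['sum_skip'].mask(all_na)

# Alternative: min_count=1 yields NaN only for groups with no non-NA value.
out['col1_min_count'] = (
    demo.groupby(['ID', 'na_group'])['col1'].sum(min_count=1).to_numpy()
)
print(out[['order', 'col1_dyn_sum', 'col1_min_count', 'col2']])
```

On this demo both columns agree: NaN for the two groups whose col1 is entirely NA, and 3.0 and 5.0 for the other two.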
