An elegant way to compute information across 2 dataframes without a for loop



I'm looking for a more elegant way to do this without for loops. I have two dataframes, df1 and df2, in the following format:

df1:

Position   Sample   Time  Measurement Type   Measurement  Unique_ID    
1           1       1       A                   6.8        ID_1            
1           1       2       A                   7.2        ID_1            
1           1       3       A                   7.8        ID_1            
1           1       4       A                   9.2        ID_1            
2           2       1       A                   7.3        ID_2            
2           2       2       A                   8.5        ID_2            
2           2       3       A                   9.3        ID_2            
2           2       4       A                   9.5        ID_2            
3           1       1       B                   6.2        ID_3    
3           1       2       B                   7.0        ID_3    
3           1       3       B                   8.2        ID_3    
3           1       4       B                   8.0        ID_3

df2:

Position   Sample     Unique_ID    Threshold 
1           1         ID_1            8.3
2           2         ID_2            8.9
3           1         ID_3            7.1

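For each row in df2, I want to check whether the Measurement column exceeds the threshold. If it does, I want to compute the "Time" value at which the measurement would cross the threshold, assuming a straight line between measurements, and add that value to a new column "Cross_threshold".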

import numpy as np

for unique_ID in df2['Unique_ID']:
    y_last = 0
    threshold = float(df2.loc[df2['Unique_ID'] == unique_ID, 'Threshold'].tolist()[0])
    for x, y in zip(df1['Time'].tolist(), df1['Measurement'].tolist()):
        if y > threshold:
            val = (threshold - y_last)/(y-y_last) + x - 1 # straight line formula
            break
        else:
            y_last = y
            val = np.nan
    df2.loc[df2['Unique_ID'] == unique_ID, 'Cross_threshold'] = val

Result:

Position  Sample Unique_ID  Threshold    Cross_threshold
1       1      ID_1        8.3         3.357143
2       2      ID_2        8.9         3.785714
3       1      ID_3        7.1         1.750000

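I would also like to check whether the measurement crosses the threshold more than once, for example rising above the threshold at time 2 and falling below it at time 4. In the real data each Unique_ID will have many more measurements, and a series that crosses the threshold more than once needs to be handled differently, so I want to add a new column "Doublecross_plus" = True or False.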

Thanks!

Here is a different approach. I included some helper values for verification. I created df1 and df2 using the values above.

# pull the threshold into df1
thresh = df2[['Unique_ID', 'Threshold']].set_index('Unique_ID').squeeze()
df1['thresh'] = df1['Unique_ID'].map(thresh)
# pull previous values onto current row
df1['time_prev'] = df1['Time'].shift(1)
df1['meas_prev'] = df1['Measurement'].shift(1)
df1['id_prev'] = df1['Unique_ID'].shift(1)
# compare current and previous measurements versus threshold...
df1['over_curr'] = df1['Measurement']          > df1['thresh']
df1['over_prev'] = df1['Measurement'].shift(1) > df1['thresh']
# ... and see if we crossed the threshold (^ is XOR) for current ID
df1['cross'] = ((df1['over_curr'] ^ df1['over_prev']) &
                (df1['Unique_ID'] == df1['id_prev']))
# interpolate; must filter because this is only meaningful when 'cross' is True
df1['Cross_threshold'] = (
    (df1['thresh'] - df1['meas_prev']) / (df1['Measurement'] - df1['meas_prev'])
    * (df1['Time'] - df1['time_prev'])
    + df1['time_prev']
)
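
Written out, the interpolation above is the usual two-point line solved for time. The symbols are shorthand introduced here, not names from the code: t_0 and m_0 stand for the previous time and measurement, t_1 and m_1 for the current ones, and T for the threshold. The worked line plugs in the ID_1 values from the question.

```latex
t_{\text{cross}} = t_0 + \frac{T - m_0}{m_1 - m_0}\,(t_1 - t_0)

% ID_1: t_0 = 3,\ m_0 = 7.8,\ t_1 = 4,\ m_1 = 9.2,\ T = 8.3
t_{\text{cross}} = 3 + \frac{8.3 - 7.8}{9.2 - 7.8}\,(4 - 3) \approx 3.357
```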

Now filter and print. I transposed the result to avoid line wrapping:

fields = ['Position', 'Sample', 'Unique_ID',
          'Measurement', 'thresh', 'meas_prev',
          'time_prev', 'Cross_threshold', 'Time']
mask = df1['cross']
# filter and export selected fields
print(df1.loc[mask, fields].transpose())

                       3     6       10
Position               1     2        3
Sample                 1     2        1
Unique_ID           ID_1  ID_2     ID_3
Measurement          9.2   9.3      8.2
thresh               8.3   8.9      7.1
meas_prev            7.8   8.5        7
time_prev              3     2        2
Cross_threshold  3.35714   2.5  2.08333
Time                   4     3        3

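My "Cross_threshold" values differ from the expected result in the original post. The difference appears to come from the question's loop, which walks every row of df1 rather than only the rows for the current Unique_ID. To verify my values, the output above shows that:

  • the threshold lies between the current and previous measurements; and
  • the crossing time lies between the current and previous times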
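
The question also asked for the result as a column on df2, plus a "Doublecross_plus" flag. Below is a minimal sketch of one way that could be done. It is not part of the answer above: the helper name `summarize_crossings` and the intermediate column names `t_cross` and `n_cross` are made up for illustration, and it assumes "first crossing" means the earliest crossing in time for each Unique_ID. It shifts within each ID using `groupby`, so no `id_prev` comparison is needed.

```python
import pandas as pd

# Sample data copied from the question (only the columns the sketch needs).
df1 = pd.DataFrame({
    "Unique_ID": ["ID_1"] * 4 + ["ID_2"] * 4 + ["ID_3"] * 4,
    "Time": [1, 2, 3, 4] * 3,
    "Measurement": [6.8, 7.2, 7.8, 9.2, 7.3, 8.5, 9.3, 9.5, 6.2, 7.0, 8.2, 8.0],
})
df2 = pd.DataFrame({
    "Position": [1, 2, 3],
    "Sample": [1, 2, 1],
    "Unique_ID": ["ID_1", "ID_2", "ID_3"],
    "Threshold": [8.3, 8.9, 7.1],
})


def summarize_crossings(df1, df2):
    """Hypothetical helper: first crossing time and multi-crossing flag per ID."""
    d = df1.merge(df2[["Unique_ID", "Threshold"]], on="Unique_ID", how="left")
    d = d.sort_values(["Unique_ID", "Time"])

    # Shift within each ID so the first row of an ID never sees another ID's data.
    g = d.groupby("Unique_ID")
    t_prev = g["Time"].shift(1)
    m_prev = g["Measurement"].shift(1)

    over_curr = d["Measurement"] > d["Threshold"]
    over_prev = m_prev > d["Threshold"]  # comparison with NaN is False
    # A crossing is a change of side between consecutive rows of the same ID.
    cross = (over_curr ^ over_prev) & m_prev.notna()

    # Linear interpolation between the previous and current points.
    t_cross = t_prev + (
        (d["Threshold"] - m_prev) / (d["Measurement"] - m_prev) * (d["Time"] - t_prev)
    )
    d = d.assign(cross=cross, t_cross=t_cross.where(cross))

    per_id = d.groupby("Unique_ID").agg(
        Cross_threshold=("t_cross", "min"),  # earliest crossing; NaN if none
        n_cross=("cross", "sum"),            # number of crossings
    )
    per_id["Doublecross_plus"] = per_id["n_cross"] > 1
    return df2.merge(per_id.reset_index(), on="Unique_ID", how="left")


result = summarize_crossings(df1, df2)
print(result)
```

With the sample data, none of the three IDs crosses its threshold more than once, so "Doublecross_plus" is False for every row. A series that rises above and later falls below its threshold would be counted as two crossings.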
