如何在pandas DataFrame中的每一行下面添加计算行



我有一个dataframe_1,如下所示:

Index   Time          Label
0       0.000 ns      Segment 1
1       2.749 sec     baseline
2       3.459 min     begin test
3       7.009 min     end of test

我想在dataframe_1的每一行之间添加多个新行,其中每一新行的Time列将增加一分钟,直到到达dataframe_1的下一行的时间(以及相应的Label(。例如,上表最终应该是这样的:

Index     Time               Label
0         0.000 ns           Segment 1
1         2.749 sec          baseline
2         00:01:02.749000    baseline + 1min
3         00:02:02.749000    baseline + 2min
4         00:03:02.749000    baseline + 3min
5         3.459 min          begin test
6         00:04:27.540000    begin test + 1min
7         00:05:27.540000    begin test + 2min
8         00:06:27.540000    begin test + 3min
9         7.009 min          end of test

通过pd.to_timedelta()使用Timedelta类型是非常好的。

我认为最好的方法是将dataframe_1的每一行分解为自己的数据帧,然后为每增加一分钟添加行,然后concat将数据帧重新组合在一起。然而,我不确定如何做到这一点。

我应该使用嵌套的for循环来[首先]迭代dataframe_1的行,然后[第二]迭代计数器,这样我就可以创建添加了分钟数的新行吗?

我之前没有将单独的行拆分为新的数据帧,我正在进行第二次迭代,如下所示:

baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)

while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
new_row = baseline_row.copy()
new_row['Label'] = f'minute {interval_mins}'
new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
new_row.index = [baseline_index + interval_mins - 0.5]
df_legend = df_legend.append(new_row, ignore_index=False)
df_legend = df_legend.sort_index().reset_index(drop=True)
pdb.set_trace()
interval_mins += 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

但由于我想对原始dataframe_1中的每一行都这样做,所以我想把它拆分成单独的数据帧,然后再放回一起。我只是不确定最好的方法是什么,尤其是因为熊猫在各行中迭代的速度显然非常慢。

我真的很感谢你的指导。

这可能比您的解决方案更快。

df.Time = pd.to_timedelta(df.Time)
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df
Time        Label  counts
0        00:00:00    Segment 1       0
1 00:00:02.749000     baseline       3
2 00:03:27.540000   begin test       3
3 00:07:00.540000  end of test       0

然后使用iterrows来获得你想要的输出。

new_df = []
for _, row in df.iterrows():
val = row.counts
if val == 0:
new_df.append(row)
else:
new_df.append(row)
new_row = row.copy()
label = row.Label
for i in range(val):
new_row = new_row.copy()
new_row.Time += pd.Timedelta('1 min')
new_row.Label = f'{label} + {i+1}min'
new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
Time              Label  counts
0        00:00:00          Segment 1       0
1 00:00:02.749000           baseline       3
1 00:01:02.749000    baseline + 1min       3
1 00:02:02.749000    baseline + 2min       3
1 00:03:02.749000    baseline + 3min       3
2 00:03:27.540000         begin test       3
2 00:04:27.540000  begin test + 1min       3
2 00:05:27.540000  begin test + 2min       3
2 00:06:27.540000  begin test + 3min       3
3 00:07:00.540000        end of test       0

我假设您将Time列从";数字单元";格式化为字符串时间的表示。类似于:

Time        Label
Index                           
0      00:00:00.000    Segment 1
1      00:00:02.749     baseline
2      00:03:27.540   begin test
3      00:07:00.540  end of test

然后,得到你的结果:

  1. 计算timNxt-时间列移动1个位置并转换至日期时间

    timNxt = pd.to_datetime(df.Time.shift(-1))
    
  2. 定义以下";复制";功能:

    def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
    n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
    tbl.extend([ [(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
    row.Label + f' + {i}min'] for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
    
  3. 将其应用于df的每一行,并连接结果:

    result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
    

结果是:

Time              Label
0  00:00:00.000000          Segment 1
1  00:00:02.749000           baseline
2  00:01:02.749000    baseline + 1min
3  00:02:02.749000    baseline + 2min
4  00:03:02.749000    baseline + 3min
5  00:03:27.540000         begin test
6  00:04:27.540000  begin test + 1min
7  00:05:27.540000  begin test + 2min
8  00:06:27.540000  begin test + 3min
9  00:07:00.540000        end of test

生成的DataFrame具有Time列,也称为字符串,但在秒的小数部分至少有6位数字。

最新更新