How do I delete a row while iterating over a DataFrame?



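I am trying to do the following with an SRT (subtitle) file:

  • When a line has been on screen for less than 5 s
  • Append the next line's text to the current line, and replace the current End_Time with the next line's End_Time
  • Delete the next line
  • Move on to the next line

I have to do this on the DataFrame dfClean, which holds edited timestamp fields, and then do the same on dfSRTForm, the DataFrame in the original SRT time format, so that I can export the latter as an SRT file.

My code looks like this: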
for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']

        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']

        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)

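But I get this error:

KeyError: 3

UPDATE (keeping the earlier version above in case someone else has the same problem): I found a way to reset the index and avoid KeyError: 3.

My current code is: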

for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']

        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']

        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)

        dfClean = dfClean.reset_index()
        dfClean = dfClean.drop(columns='index')

        dfSRTForm = dfSRTForm.reset_index()
        dfSRTForm = dfSRTForm.drop(columns='index')

        dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).astype('timedelta64[s]')

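But now I get KeyError: 267, and I am fairly sure that is because the merging has condensed the rows down to 266.

Is there a way to put something like "or end of index" or "or last row" in the while loop without hard-coding 266 rows? I want to use this on other SRT files with different numbers of rows.

You can define an empty list, then iterate over the DataFrame rows and, whenever a row does not meet your condition, save its index in that list.

After that, do the following: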

df = df.drop(index=your_indices)
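
As a rough sketch of that idea applied to the subtitle case: the loop below merges each too-short line into the row being kept, records the labels of the absorbed rows, and drops them all in one call at the end. The function name merge_short_lines, the min_sec parameter and the demo data are assumptions made for illustration; the column names come from the question, and Start_Time and End_Time are assumed to be datetime columns.

```python
import pandas as pd


def merge_short_lines(df, min_sec=5.0):
    """Hypothetical helper: merge short subtitle lines, then drop once."""
    df = df.copy()
    labels = list(df.index)
    to_drop = []  # index labels of rows absorbed into an earlier row
    keep = 0      # position of the row currently being extended
    for pos in range(1, len(labels)):
        cur, nxt = labels[keep], labels[pos]
        duration = (df.at[cur, 'End_Time'] - df.at[cur, 'Start_Time']).total_seconds()
        if duration < min_sec:
            # Too short: absorb the next row and remember to drop it later.
            df.at[cur, 'Text'] = df.at[cur, 'Text'] + ' ' + df.at[nxt, 'Text']
            df.at[cur, 'End_Time'] = df.at[nxt, 'End_Time']
            to_drop.append(nxt)
        else:
            # Long enough: the next row becomes the one being extended.
            keep = pos
    return df.drop(index=to_drop).reset_index(drop=True)


# Made-up demo data, times given as seconds since the epoch.
demo = pd.DataFrame({
    'Start_Time': pd.to_datetime([0, 2, 4, 10, 11], unit='s'),
    'End_Time': pd.to_datetime([2, 4, 10, 11, 20], unit='s'),
    'Text': ['a', 'b', 'c', 'd', 'e'],
})
merged = merge_short_lines(demo)
print(merged['Text'].tolist())
```

Because nothing is dropped while the loop runs, every label it looks up still exists, so no KeyError can occur. A last line that is still too short is left as it is, since there is nothing after it to merge with.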

Without seeing your data I can't give an exact solution, but the following should serve as an example of how to accomplish what you are doing:

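import pandas as pd

# Duration of each line in seconds (assumes datetime columns).
dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).dt.total_seconds()
tmp_diff = 0
tmp_txt = ''
tmp_start = None
new_data = []
for i, row in dfClean.iterrows():
    if tmp_start is None:
        tmp_start = row['Start_Time']  # start of the merged line
    # Fold this row into the line being built.
    tmp_txt = ' '.join([tmp_txt, row['Text']]).strip()
    tmp_diff += row['Difference']
    if tmp_diff >= 5:
        # Long enough: emit the merged line and start over.
        new_row = dict(row)
        new_row['Start_Time'] = tmp_start
        new_row['Text'] = tmp_txt
        new_row['End_Time'] = row['End_Time']
        new_row['Difference'] = tmp_diff
        new_data.append(new_row)

        tmp_txt = ''
        tmp_diff = 0
        tmp_start = None
# Note: a trailing run that never reaches 5 s is not emitted.
new_df = pd.DataFrame(new_data)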

This is how I ended up fixing it:

indexKeep = len(dfClean.index)
minSec = 3 # min number of seconds of screen time per line of subtitles.
for i in range(0, indexKeep):
    try:
        while dfClean.at[i, 'Difference'] < minSec:
            dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
            dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']

            dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
            dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']

            dfClean = dfClean.drop(i+1)
            dfSRTForm = dfSRTForm.drop(i+1)

            dfClean = dfClean.reset_index()
            dfClean = dfClean.drop(columns='index')

            dfSRTForm = dfSRTForm.reset_index()
            dfSRTForm = dfSRTForm.drop(columns='index')

            dfClean['Difference'] = (dfClean['End_Time']-dfClean['Start_Time']).astype('timedelta64[s]')

        dfClean.at[i, 'ID'] = i+1
        dfSRTForm.at[i, 'ID'] = i+1
        indexKeep = len(dfClean.index)
    except KeyError: # Takes care of condensed number of rows
        pass

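This deletes the next row and resets the index numbers, so you don't get stuck on a KeyError in the middle, and then it handles the KeyError at the end. That last one is a consequence of how the for loop is initialized: it is set up to run over more than 800 rows, but the merging done inside the loop brings the total down to roughly 400 rows, which means it eventually cannot find row "401" when it gets there.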
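
One possible way to avoid the try/except altogether is to drive the loop with a while condition that re-reads the current length on every pass, which is roughly the "or end of index" test asked about in the question. The sketch below is only an illustration under assumptions: merge_in_place and its parameters are hypothetical names, the two DataFrames are assumed to have the same number of rows in the same order, and Start_Time and End_Time in df_clean are assumed to be datetime columns.

```python
import pandas as pd


def merge_in_place(df_clean, df_srt, min_sec=3.0):
    """Hypothetical helper: merge short lines in two parallel DataFrames."""
    df_clean = df_clean.reset_index(drop=True)
    df_srt = df_srt.reset_index(drop=True)
    i = 0
    # len(df_clean) is evaluated again on each pass, so the bound
    # shrinks together with the data and i + 1 is always a valid label.
    while i + 1 < len(df_clean):
        duration = (df_clean.at[i, 'End_Time'] - df_clean.at[i, 'Start_Time']).total_seconds()
        if duration >= min_sec:
            i += 1
            continue
        # Apply the same merge to both frames.
        for df in (df_clean, df_srt):
            df.at[i, 'Text'] = df.at[i, 'Text'] + ' ' + df.at[i + 1, 'Text']
            df.at[i, 'End_Time'] = df.at[i + 1, 'End_Time']
        df_clean = df_clean.drop(i + 1).reset_index(drop=True)
        df_srt = df_srt.drop(i + 1).reset_index(drop=True)
    # Renumber the subtitle IDs once, after all merging is done.
    df_clean['ID'] = range(1, len(df_clean) + 1)
    df_srt['ID'] = range(1, len(df_srt) + 1)
    return df_clean, df_srt


# Made-up demo data; df_srt keeps its times as SRT-style strings.
clean = pd.DataFrame({
    'Start_Time': pd.to_datetime([0, 2, 4, 10, 11], unit='s'),
    'End_Time': pd.to_datetime([2, 4, 10, 11, 20], unit='s'),
    'Text': ['a', 'b', 'c', 'd', 'e'],
})
srt = pd.DataFrame({
    'Start_Time': ['00:00:00,000', '00:00:02,000', '00:00:04,000',
                   '00:00:10,000', '00:00:11,000'],
    'End_Time': ['00:00:02,000', '00:00:04,000', '00:00:10,000',
                 '00:00:11,000', '00:00:20,000'],
    'Text': ['a', 'b', 'c', 'd', 'e'],
})
clean_out, srt_out = merge_in_place(clean, srt, min_sec=5.0)
print(srt_out[['End_Time', 'Text']])
```

Dropping and re-indexing on every merge copies the DataFrames each time, so this is likely to be slow on very long files; for typical subtitle files of a few hundred lines that should not matter much.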
