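I'm trying to do the following with an SRT (subtitle) file:
- when a line is on screen for less than 5 s,
- append the next line's text to the current line and replace the current End_Time with the next line's End_Time
- delete the next line
- move on to the next line
I have to do this on the dataframe dfClean, which uses edited timestamp fields, and then do the same on the dataframe dfSRTForm, which keeps the original SRT time format, so that I can export the latter as an SRT file.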
for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)
But I get this error:
KeyError: 3
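One likely cause, for readers hitting the same error: `drop(i+1)` removes a label, but the loop keeps using the old label numbers, both `i+1` inside the `while` body and the labels that `for i in dfClean.index` captured when the loop started. A minimal sketch with made-up data (the three-row frame is hypothetical, not the asker's data):

```python
import pandas as pd

# Hypothetical three-row frame standing in for dfClean.
df = pd.DataFrame({"Text": ["a", "b", "c"], "Difference": [2, 7, 9]})

labels = list(df.index)   # snapshot taken once, like `for i in df.index`
df = df.drop(1)           # label 1 is gone; labels 0 and 2 remain

missing = []
for i in labels:
    try:
        df.at[i, "Text"]
    except KeyError:
        missing.append(i)  # a stale label the loop still visits
```

Here `missing` ends up holding the dropped label, which is the same kind of lookup that fails in the loop above.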
UPDATE (keeping the earlier version above in case others have the same problem): I found a way to reset the index to avoid KeyError: 3.
My current code is:
for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)
        dfClean = dfClean.reset_index()
        dfClean = dfClean.drop(columns='index')
        dfSRTForm = dfSRTForm.reset_index()
        dfSRTForm = dfSRTForm.drop(columns='index')
        dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).astype('timedelta64[s]')
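But now I get KeyError: 267, and I'm pretty sure that's because the rows have been condensed down to 266.
Is there a way to put "or end of index" or "or last row" into the while loop without hard-coding row 266? I want to use this on other SRT files with different numbers of rows.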
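One way to express "or last row" without hard-coding a number is to compare the position against the current length of the frame on every pass. A minimal sketch, assuming a default 0..n-1 index and numeric seconds in Start_Time and End_Time (the sample data and the helper name `merge_short_rows` are made up for illustration):

```python
import pandas as pd

def merge_short_rows(df, min_sec=5):
    """Merge each row shorter than min_sec into the row that follows it."""
    df = df.reset_index(drop=True).copy()
    i = 0
    while i < len(df) - 1:            # stop at the current last row
        if df.at[i, "End_Time"] - df.at[i, "Start_Time"] < min_sec:
            df.at[i, "Text"] = df.at[i, "Text"] + " " + df.at[i + 1, "Text"]
            df.at[i, "End_Time"] = df.at[i + 1, "End_Time"]
            df = df.drop(i + 1).reset_index(drop=True)
        else:
            i += 1                    # row is long enough, move on
    return df

# Hypothetical sample data: three short lines followed by one more.
sample = pd.DataFrame({
    "Start_Time": [0, 2, 4, 10],
    "End_Time": [2, 4, 10, 12],
    "Text": ["a", "b", "c", "d"],
})
merged = merge_short_rows(sample)
```

Because `len(df)` is re-evaluated on each pass, the bound shrinks together with the frame. A short final row has nothing to merge into, so it is left as it is.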
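You can define an empty list, then iterate over the dataframe rows and, whenever a row does not meet your condition, save its index to that list.
After that, do the following:
df = df.drop(index=your_indices)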
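As a minimal sketch of that idea, with a made-up frame and the rule "Difference is below 5" standing in for whatever condition applies (`your_indices` is the placeholder name from the line above):

```python
import pandas as pd

# Hypothetical data; only the Difference column matters here.
df = pd.DataFrame({"Text": ["a", "b", "c", "d"], "Difference": [2, 7, 1, 9]})

your_indices = []
for idx, row in df.iterrows():
    if row["Difference"] < 5:       # row does not meet the condition
        your_indices.append(idx)

df = df.drop(index=your_indices)    # drop everything in one call
```

Dropping once after the loop means the index never changes while it is being iterated. Note that this only removes rows; merging their text into a neighbour would still need a separate step.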
Without seeing your data I can't give an exact solution, but the following should serve as an example of how to accomplish what you are doing:
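import pandas as pd

dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).astype('timedelta64[s]')
new_data = []
pending = None  # merged row that has not yet reached 5 seconds
for i, row in dfClean.iterrows():
    if pending is None:
        # start a new merged row from the current one
        pending = dict(row)
    else:
        # fold the current row into the pending one
        pending['Text'] = ' '.join([pending['Text'], row['Text']])
        pending['End_Time'] = row['End_Time']
        pending['Difference'] += row['Difference']
    if pending['Difference'] >= 5:
        # long enough now: emit it and start over
        new_data.append(pending)
        pending = None
if pending is not None:
    # keep a short trailing row instead of silently dropping it
    new_data.append(pending)
new_df = pd.DataFrame(new_data)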
This is how I ended up fixing it:
indexKeep = len(dfClean.index)
minSec = 3  # min number of seconds of screen time per line of subtitles.
for i in range(0, indexKeep):
    try:
        while dfClean.at[i, 'Difference'] < minSec:
            dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
            dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
            dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
            dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
            dfClean = dfClean.drop(i+1)
            dfSRTForm = dfSRTForm.drop(i+1)
            dfClean = dfClean.reset_index()
            dfClean = dfClean.drop(columns='index')
            dfSRTForm = dfSRTForm.reset_index()
            dfSRTForm = dfSRTForm.drop(columns='index')
            dfClean['Difference'] = (dfClean['End_Time']-dfClean['Start_Time']).astype('timedelta64[s]')
        dfClean.at[i, 'ID'] = i+1
        dfSRTForm.at[i, 'ID'] = i+1
        indexKeep = len(dfClean.index)
    except KeyError:  # Takes care of condensed number of rows
        pass
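This deletes the next row, resets the index so that you don't get stuck on a KeyError partway through, and then handles the KeyError at the end. That last one comes from how the for loop is initialized: it is set up to run over 800 rows, but the condensing done inside the loop brings the total down to around 400 rows, so it eventually can't find row "401" when it gets there.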
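An alternative that sidesteps the shrinking-index problem altogether is to decide first which rows belong together and then build both frames from those groups, so no `try`/`except` is needed. This is only a sketch under assumptions: both frames share the same default index, the times in dfClean are numeric seconds, and the helper names `group_ids` and `collapse` as well as the sample data are made up here:

```python
import pandas as pd

def group_ids(df, min_sec=3):
    """Give each row a group id; a group closes once it spans min_sec."""
    ids, current, start = [], 0, None
    for s, e in zip(df["Start_Time"], df["End_Time"]):
        if start is None:
            start = s                 # first row of a new group
        ids.append(current)
        if e - start >= min_sec:      # group is long enough, close it
            current += 1
            start = None
    return pd.Series(ids, index=df.index)

def collapse(df, ids):
    """Merge rows sharing an id: first start, last end, texts joined."""
    out = df.groupby(ids).agg(
        Start_Time=("Start_Time", "first"),
        End_Time=("End_Time", "last"),
        Text=("Text", lambda s: " ".join(s)),
    ).reset_index(drop=True)
    out.insert(0, "ID", range(1, len(out) + 1))   # renumber from 1
    return out

# Hypothetical sample frames with matching rows.
dfClean = pd.DataFrame({
    "Start_Time": [0, 1, 2, 8],
    "End_Time": [1, 2, 8, 9],
    "Text": ["a", "b", "c", "d"],
})
dfSRTForm = pd.DataFrame({
    "Start_Time": ["00:00:00,000", "00:00:01,000", "00:00:02,000", "00:00:08,000"],
    "End_Time": ["00:00:01,000", "00:00:02,000", "00:00:08,000", "00:00:09,000"],
    "Text": ["a", "b", "c", "d"],
})

ids = group_ids(dfClean)              # computed once from the numeric frame
clean_merged = collapse(dfClean, ids)
srt_merged = collapse(dfSRTForm, ids)
```

Computing the group ids once and reusing them for both frames keeps dfClean and dfSRTForm in step, and the row count of the input never has to be known in advance.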