How do I ensure reproducibility when parallelizing experiments that use a random seed?

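I'm using Mydia to extract random frames from videos. Since I have many videos, I'd like to parallelize this workflow while keeping it reproducible. mydia.Videos accepts a random seed, which is important for reproducibility. What remains is the parallelization.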

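Given n videos and a random seed r, how can I ensure that the frames extracted from each video are the same regardless of the number of workers? I'm particularly interested in the algorithmic component, not necessarily the code.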

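My initial thought was to use multiprocessing.Pool. However, because the completion time of each process is non-deterministic, there would be a race condition when sampling frames; i.e., if proc 1 takes longer than proc 0, the frames sampled by the Videos class will differ from those sampled when proc 0 takes longer than proc 1.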
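To make the order dependence concrete, here is a simplified model (not Mydia's actual internals): a single seeded generator from Python's random module is consumed by whichever video happens to be sampled first. The function name sample_in_order and the file names are hypothetical.

```python
import random


def sample_in_order(order, seed, total_frames=100, num_frames=3):
    """Draw frame indices for each video from ONE shared seeded stream.

    `order` stands in for the order in which workers happen to reach
    the sampling step; it is the only thing that varies between runs.
    """
    rng = random.Random(seed)
    sampled = {}
    for video in order:
        sampled[video] = sorted(rng.sample(range(total_frames), num_frames))
    return sampled


# Same seed, different arrival order.
first = sample_in_order(["a.mp4", "b.mp4"], seed=0)
second = sample_in_order(["b.mp4", "a.mp4"], seed=0)

# The first draw from the stream goes to whichever video arrives first,
# so the two videos simply swap their frame lists between the two runs.
print(first)
print(second)
```

In this model the seed fixes the stream of draws but not which video receives which draw, which is why a seed alone does not give per-video reproducibility.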

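My solution is a bit unorthodox because it is library-specific. Mydia allows passing in the frames to extract rather than forcing the Videos client to sample them directly. This gives me the opportunity to pre-compute the frames to sample in the parent process. By doing so, I can "mock" the randomness in the subprocesses by instantiating a new Videos with those frames. For example: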

    from pathlib import Path
    from typing import List

    from mydia import Videos

    class MySampler:
        def __init__(self, input_directory: Path, total_frames: int, num_frames: int, fps: int):
            self.input_directory = Path(input_directory)
            # One pre-computed list of frame numbers per video, drawn in the parent process
            self.frames_per_video = [
                self.__get_frame_numbers_for_each_video(total_frames, num_frames, fps)
                for _ in self.input_directory.glob("*.mp4")
            ]

        @staticmethod
        def get_reader(num_frames: int, frames: List[int]):
            # The mode callable ignores its inputs and returns the frames it was constructed with
            return Videos(target_size=(512, 512), num_frames=num_frames, mode=lambda *_: frames)

I can then simply parallelize it:

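    # Method of MySampler; needs `from multiprocessing import Pool`
    def sample_frames(self, number_of_workers: int):
        pool = Pool(processes=number_of_workers)
        videos = list(self.input_directory.glob("*.mp4"))
        # Each task receives its pre-computed frame list together with its video path
        result = pool.starmap_async(self.read_video, zip(self.frames_per_video, videos))
        pool.close()
        pool.join()
        result.get()  # re-raises any exception that occurred in a worker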

where read_video is the method that calls get_reader and performs the read.
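The same idea can be sketched without Mydia, assuming only the standard library. Here plan_frames, read_video, and run are hypothetical names, and read_video is a stand-in for the real reader. Sorting the file names is an added assumption, so that the draw order does not depend on the order in which the directory is listed.

```python
import random
from multiprocessing import Pool


def plan_frames(video_names, seed, total_frames, num_frames):
    """Draw every video's frame indices in the parent process.

    All randomness is consumed here, in sorted name order, so the plan
    depends only on the seed and the set of videos.
    """
    rng = random.Random(seed)
    return [
        (name, sorted(rng.sample(range(total_frames), num_frames)))
        for name in sorted(video_names)
    ]


def read_video(name, frames):
    # Stand-in for the real reader: a worker only receives fixed indices
    # and never touches a random number generator.
    return name, frames


def run(plan, number_of_workers):
    """Dispatch the fixed plan to a pool of workers and collect the results."""
    with Pool(processes=number_of_workers) as pool:
        return dict(pool.starmap(read_video, plan))
```

Under these assumptions, building the plan once with plan_frames and calling run(plan, 1) and run(plan, 4) from under an if __name__ == "__main__": guard should return the same mapping, since the worker count only affects scheduling and not which indices each video receives.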
