I am trying to reproduce the Mask R-CNN training from the following repository: https://github.com/maxkferg/metal-defect-detection
The training code snippet is as follows:
# Training - Stage 1
print("Training network heads")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='heads')

# Training - Stage 2
# Finetune layers from ResNet stage 4 and up
print("Fine tune Resnet stage 4 and up")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120,
            layers='4+')

# Training - Stage 3
# Fine tune all layers
print("Fine tune all layers")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160,
            layers='all')
Stage 1 runs fine, but it fails in Stage 2 with the following output:
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 123 Chunks of size 2048 totalling 246.0KiB
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2816 totalling 2.8KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 3072 totalling 18.0KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 387 Chunks of size 4096 totalling 1.51MiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6144 totalling 6.0KiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6656 totalling 6.5KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 60 Chunks of size 8192 totalling 480.0KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9216 totalling 18.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 12288 totalling 144.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 16384 totalling 32.0KiB
2020-08-17 15:53:10.690456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 21248 totalling 20.8KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 24064 totalling 23.5KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 24576 totalling 120.0KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37632 totalling 36.8KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 40960 totalling 40.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 49152 totalling 192.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 65536 totalling 384.0KiB
2020-08-17 15:53:10.694456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 81920 totalling 80.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 90624 totalling 88.5KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 131072 totalling 128.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 3 Chunks of size 147456 totalling 432.0KiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 262144 totalling 3.00MiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 327680 totalling 320.0KiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 11 Chunks of size 524288 totalling 5.50MiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 589824 totalling 2.25MiB
2020-08-17 15:53:10.698457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 17 Chunks of size 2097152 totalling 34.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2211840 totalling 2.11MiB
2020-08-17 15:53:10.700457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2360320 totalling 2.25MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2621440 totalling 2.50MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2698496 totalling 2.57MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 3670016 totalling 3.50MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 31 Chunks of size 4194304 totalling 124.00MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 4718592 totalling 27.00MiB
2020-08-17 15:53:10.704457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 8388608 totalling 40.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9438208 totalling 18.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 9441280 totalling 9.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 16138752 totalling 15.39MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 18874368 totalling 18.00MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37748736 totalling 36.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1.41GiB
2020-08-17 15:53:10.709457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats:
Limit:                  1613615104
InUse:                  1510723072
MaxInUse:               1510723072
NumAllocs:                    3860
MaxAllocSize:            119947776
Training runs on a Quadro K420 with 2 GB of memory. Is this simply an out-of-memory problem, or am I missing something? Is there any way to train on my hardware?
The problem is the GPU memory on your graphics card.
In Stage 1 you were able to train without problems because you only trained the "heads", which translates to a much smaller number of trainable parameters.
In Stage 2 you run out of memory because you are training many more layers, and they no longer fit on the GPU.
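For reference, the layers argument is turned into a regular expression that decides which layers stay trainable. Here is a rough sketch of that mapping, paraphrased from the matterport-style model.py this repository builds on (check your fork for the exact strings):

# Paraphrased sketch of the layer selection used by
# model.train(..., layers=...); not copied verbatim from this repo.
layer_regex = {
    # Stage 1: only the RPN, FPN and head layers are trainable
    "heads": r"(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
    # Stage 2: ResNet stage 4 and up, plus the heads
    "4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
    # Stage 3: every layer
    "all": ".*",
}

Training '4+' therefore adds all of ResNet stages 4 and 5 to the trainable set, which is where the extra activations and optimizer state stop fitting in 2 GB.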
For computer vision problems I would recommend a graphics card with at least 8 GB of VRAM.
Out-of-memory problems can sometimes be solved by reducing the batch size, but in your case the only practical solution is a bigger/better graphics card.
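If you still want to try squeezing it onto this card before upgrading, the Mask R-CNN Config class exposes a few knobs you can lower. A minimal sketch, assuming the matterport-style config.py this repo uses (attribute names come from that file; the values are only illustrative and the import path may differ in your fork):

# Low-memory config sketch; rebuild the model with this config before training.
from config import Config  # adjust to the repo's actual config module

class LowMemoryConfig(Config):
    NAME = "defects_lowmem"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1          # effective batch size = GPU_COUNT * IMAGES_PER_GPU
    IMAGE_MIN_DIM = 512         # smaller inputs -> smaller activations
    IMAGE_MAX_DIM = 512         # keep divisible by 64 for the FPN
    TRAIN_ROIS_PER_IMAGE = 100  # fewer sampled ROIs per image

Even with all of that, a 2 GB card is likely to remain too small for the '4+' and 'all' stages.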
This is most likely a memory issue. You can try reducing the batch size to 1 or simplifying the network. If either of those works, then the fix is more memory.
One way that sometimes fixes this is to put an upsampling layer in the model. So lower the target size in your image generator, then add an upsampling layer. It is a good way to trick it. If that works, then you know Colab can't handle
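A minimal sketch of that trick in plain Keras, not taken from the Mask R-CNN repo: the generator delivers half-resolution images and the first layer upsamples them back, so the input pipeline holds smaller tensors. Layer and generator names are standard Keras; the directory path, sizes and the tiny classifier after the upsampling are made up for illustration.

# Generic Keras illustration of the "feed smaller images, upsample inside the
# model" idea; not code from the metal-defect-detection repository.
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The generator delivers half-resolution images (256x256 instead of 512x512)
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train",                 # hypothetical dataset directory
    target_size=(256, 256),
    batch_size=1,
    class_mode="categorical")

# The model immediately upsamples back to the resolution the rest expects
model = models.Sequential([
    layers.UpSampling2D(size=(2, 2), input_shape=(256, 256, 3)),  # 256 -> 512
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")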