Update (October 4, 2021): This trick seemed to work at the time, but, when I returned to this work, multi-GPU training began to fail again. As always, your mileage may vary.
I’ve been working with Detectron 2 a lot recently, building object-detection models for work using PyTorch and Faster R-CNN. Detectron 2 is a delightful and extensible framework for computer-vision tasks¹, but it turns out not to offer a baked-in method for tracking evaluation losses during training—kind of a basic thing in the machine-learning world. In ML, evaluation losses and other tests against out-of-sample data are critical for estimating overfit and finding a suitable resting point on the bias-variance tradeoff curve, but I wonder if this isn’t a big concern for most computer-vision researchers, who are trying to learn from millions of images and billions of pixels.
I wasn’t the first to realize that this was missing, of course. Detectron 2’s GitHub repo contains a few issues like this one discussing how to implement evaluation loss tracking, and there’s also a related Medium post that uses Detectron 2’s hook system to solve the problem. In a nutshell, Detectron 2’s hook system works like so:
with EventStorage(start_iter) as self.storage:
    try:
        self.before_train()
        for self.iter in range(start_iter, max_iter):
            self.before_step()
            self.run_step()
            self.after_step()
    except Exception:
        logger.exception("Exception during training:")
        raise
    finally:
        self.after_train()
(Source code from Detectron 2 on GitHub.)
Custom training code can cleanly register for these hook methods, and this approach works well for single-GPU training. But I figured out pretty quickly that the hook-based system falls over when training with multiple GPUs (I’m often training this particular model with 4 V100s on AWS), probably because of communication errors among the GPUs. I saw a post suggesting that different GPUs might be getting stuck in different parts of the code, since the hook system is implemented across multiple functions, and this tracks with my experience.
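To make that registration concrete, here’s a minimal sketch. HookBase and register_hooks() are Detectron 2’s real entry points, but EvalLossHook, its eval_period parameter, and the loss computation itself are hypothetical stand-ins:

from detectron2.engine import DefaultTrainer, HookBase


class EvalLossHook(HookBase):
    """Hypothetical hook that computes evaluation losses periodically."""

    def __init__(self, eval_period: int) -> None:
        self.eval_period = eval_period

    def after_step(self) -> None:
        # register_hooks() hands each hook a reference to the trainer,
        # so the current iteration is available as self.trainer.iter.
        if (self.trainer.iter + 1) % self.eval_period == 0:
            pass  # run the model on held-out data and log the losses here


# Hypothetical usage, assuming a cfg built elsewhere:
# trainer = DefaultTrainer(cfg)
# trainer.register_hooks([EvalLossHook(eval_period=100)])
# trainer.train()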
One way around this multi-GPU issue is to bypass the hook system entirely, overriding run_step() directly, since we use a custom trainer descended from SimpleTrainer:
class OurFancyTrainer(DefaultTrainer):
    def run_step(self) -> None:
        super().run_step()
        # At a given number of iterations...
        self.calculate_test_losses()
This approach is similar in spirit to the hook-based system—we’ve only moved some code from, e.g., after_step() into the end of run_step()—but by subclassing the trainer we’re now able to deliver code of similar complexity that works just fine for both single and multiple GPUs.
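Fleshed out a bit, the override might gate the evaluation on the current iteration, roughly like the sketch below; eval_period, the modulo check, and the body of calculate_test_losses() are all hypothetical choices, not anything Detectron 2 prescribes.

from detectron2.engine import DefaultTrainer


class OurFancyTrainer(DefaultTrainer):
    def __init__(self, cfg, eval_period: int = 100) -> None:
        super().__init__(cfg)
        self.eval_period = eval_period  # hypothetical knob

    def run_step(self) -> None:
        super().run_step()
        # Every process hits this check at the same point in the same
        # function, which is what keeps multi-GPU training in sync.
        if (self.iter + 1) % self.eval_period == 0:
            self.calculate_test_losses()

    def calculate_test_losses(self) -> None:
        pass  # run the model on a held-out set and log the losses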
1. Detectron 2 is a great Python code base. It’s well organized, extensible, and uses type hints in many places. With only a few thousand lines of code, I’ve been able to write data loaders, evaluators, etc., without writing any models from scratch. Detectron 2 also uses the YACS config system for specifying and tracking experiments, which I really like.
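To give a flavor of that config flow, here’s a minimal sketch; get_cfg(), merge_from_file(), and merge_from_list() are the real entry points, while the file path and override values are made up.

from detectron2.config import get_cfg

cfg = get_cfg()  # Detectron 2's default config tree
cfg.merge_from_file("configs/our_experiment.yaml")  # hypothetical experiment file
cfg.merge_from_list(["SOLVER.BASE_LR", 0.0025])  # scriptable overrides
print(cfg.dump())  # YAML snapshot to store alongside results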