Parallel-in-Time Training for Deep Residual Neural Networks

Jacob Schroder (University of New Mexico)

Residual neural networks (ResNets) are a class of deep neural networks that exhibit excellent performance on many learning tasks, e.g., image classification and recognition. A ResNet architecture can be interpreted as a discretization of a time-dependent ordinary differential equation (ODE), so that the overall training process becomes an ODE-constrained optimal control problem: the time-dependent control variables are the network weights, and each network layer is associated with a time step. However, ResNet training often suffers from prohibitively long run-times because of the many sequential forward and backward sweeps across the layers (i.e., time steps) required to carry out the optimization. This work first investigates parallel-in-time methods as one possible remedy for these long run-times by demonstrating the multigrid-reduction-in-time (MGRIT) method for the efficient and effective training of deep ResNets. The proposed layer-parallel algorithm replaces the classical (sequential) forward and backward propagation through the network layers with a parallel nonlinear multigrid iteration applied to the layer domain.

However, the question remains of how to initialize networks with hundreds or thousands of layers, which leads to the second part of this work. Here, a multilevel initialization strategy is developed for deep networks, in which a refinement strategy is applied across the time domain, equivalent to refining in the layer dimension. The resulting refinements create progressively deeper networks whose parameters are well initialized from the coarser trained networks. We investigate this multilevel “nested iteration” initialization strategy for faster training times and for regularization benefits, e.g., reduced sensitivity to hyperparameters and to randomness in the initial network parameters.
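
To make the ResNet-as-ODE viewpoint concrete, the following minimal NumPy sketch treats each residual block as one forward Euler step x_{k+1} = x_k + h * F(x_k, W_k) of the ODE dx/dt = F(x(t), W(t)). The specific layer function F, the step size h, and the weight shapes are illustrative assumptions, not the architecture used in the work.

```python
import numpy as np

def layer_rhs(x, W):
    """Illustrative residual-block right-hand side F(x, W): one weight
    multiply followed by a tanh nonlinearity (an assumption, not the
    specific architecture from this work)."""
    return np.tanh(W @ x)

def resnet_forward(x0, weights, h):
    """Forward propagation viewed as forward Euler time-stepping:
    x_{k+1} = x_k + h * F(x_k, W_k), one step per layer."""
    x = x0
    states = [x]
    for W in weights:
        x = x + h * layer_rhs(x, W)   # one Euler step == one ResNet layer
        states.append(x)
    return states

# Toy usage: 8 layers (time steps), hidden state of width 4.
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(8)]
states = resnet_forward(rng.standard_normal(4), weights, h=0.25)
print(len(states), states[-1].shape)   # 9 states, each of shape (4,)
```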
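
The layer-parallel idea can be sketched with a two-level, parareal-style iteration, a simplified relative of the MGRIT method referenced above (MGRIT additionally uses FCF-relaxation and recursion to coarser levels). A cheap coarse propagator predicts the states at a few block boundaries, the fine propagator is then applied independently within each block (this is where parallelism over layers arises), and the boundary states are corrected. The coarse propagator, the block decomposition, and the serial execution of the parallelizable fine sweeps are illustrative assumptions.

```python
import numpy as np

def rhs(x, W):
    # Illustrative residual-block right-hand side (assumption).
    return np.tanh(W @ x)

def fine_prop(x, block_weights, h):
    # Fine propagator: forward Euler through every layer in the block.
    for W in block_weights:
        x = x + h * rhs(x, W)
    return x

def coarse_prop(x, block_weights, h):
    # Coarse propagator: a single large Euler step using the block's
    # first-layer weights (a crude, cheap approximation).
    return x + h * len(block_weights) * rhs(x, block_weights[0])

def parareal_forward(x0, weights, h, n_blocks, n_iters):
    """Two-level parareal-style iteration over the layer (time) domain.
    The fine sweeps inside each iteration are independent across blocks
    and could run in parallel; here they run serially."""
    blocks = np.array_split(np.arange(len(weights)), n_blocks)
    block_w = [[weights[i] for i in b] for b in blocks]

    # Initial guess at block boundaries from the coarse propagator alone.
    U = [x0]
    for bw in block_w:
        U.append(coarse_prop(U[-1], bw, h))

    for _ in range(n_iters):
        F = [fine_prop(U[b], block_w[b], h) for b in range(n_blocks)]  # parallelizable
        U_new = [x0]
        for b in range(n_blocks):
            G_new = coarse_prop(U_new[-1], block_w[b], h)
            G_old = coarse_prop(U[b], block_w[b], h)
            U_new.append(G_new + F[b] - G_old)   # parareal correction
        U = U_new
    return U

# Toy usage: 16 layers split into 4 blocks, 3 iterations.
rng = np.random.default_rng(1)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(16)]
U = parareal_forward(rng.standard_normal(4), weights, h=0.1, n_blocks=4, n_iters=3)
serial = fine_prop(U[0], weights, h=0.1)
print(np.linalg.norm(U[-1] - serial))   # iterates approach the serial sweep
```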
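
Finally, the multilevel “nested iteration” initialization can be sketched as follows: starting from a trained shallow network, the layer (time) grid is refined by inserting new layers whose weights are interpolated from neighboring coarse layers, and the resulting deeper network becomes the starting point for further training. The piecewise-linear interpolation, the halving of the step size, and the placeholder training step are illustrative assumptions.

```python
import numpy as np

def refine_in_layers(weights, h):
    """Refine the layer (time) grid: each coarse layer is split into two
    fine layers with half the step size, and each new in-between layer's
    weights are linearly interpolated from its coarse neighbors
    (piecewise-linear interpolation is an illustrative assumption)."""
    fine_weights = []
    for k, W in enumerate(weights):
        fine_weights.append(W.copy())
        if k + 1 < len(weights):
            fine_weights.append(0.5 * (W + weights[k + 1]))  # interpolated layer
        else:
            fine_weights.append(W.copy())                    # copy at the final layer
    return fine_weights, h / 2.0

# Nested iteration: train a coarse (shallow) network, refine, retrain, ...
rng = np.random.default_rng(2)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(4)]   # 4-layer network
h = 1.0
for level in range(3):
    # train(weights, h)  # placeholder: a few epochs of training at this level
    weights, h = refine_in_layers(weights, h)
    print(f"level {level + 1}: {len(weights)} layers, step size h = {h}")
# The final deep network (32 layers here) starts from coarse-trained weights.
```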