Ido Hakimi, M.Sc. Thesis Seminar
Monday, 24.12.2018, 11:00
Distributed training can significantly reduce the training time of neural networks. Despite its potential, however, distributed training has not been widely adopted due to the difficulty of scaling the training process. Existing methods suffer from slow convergence and low final accuracy when scaling to large clusters, and often require substantial re-tuning of hyper-parameters.
We propose DANA, a novel approach that scales to large clusters while maintaining similar final accuracy and convergence speed to that of a single worker. DANA estimates the future value of model parameters by adapting Nesterov Accelerated Gradient to a distributed setting, and so mitigates the effect of gradient staleness, one of the main difficulties in scaling SGD to more workers.
Evaluation on three state-of-the-art network architectures and three datasets shows that DANA scales as well as or better than existing work without having to tune any hyperparameters or tweak the learning schedule. For example, DANA achieves 75.73\% accuracy on ImageNet when training ResNet-50 with 16 workers, similar to the non-distributed baseline.