Saar Barkai (EE, Technion)
Monday, 16.12.2019, 13:30
Electrical Eng. Building 701
Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Cloud computing, an increasingly popular platform for distributed training of deep neural networks, is prone to straggler workers, which significantly increase training time in synchronous environments. Asynchronous distributed training is therefore preferable in terms of training time when using cloud computing. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness, the main difficulty in scaling stochastic gradient descent to large clusters. We introduce two orthogonal methods for mitigating gradient staleness, enabling the use of large numbers of asynchronous workers. The first method, called DANA, estimates the model's future parameters, thus mitigating the additional gradient staleness caused by momentum. The second method, called Gap-Aware, mitigates gradient staleness by reducing the size of incoming gradients based on a new measure of their staleness, which we refer to as the Gap. We show that both methods can be combined into a single algorithm, DANA-Gap-Aware (DANA-GA), which produces state-of-the-art results in terms of final accuracy and convergence rate. Contrary to prior belief, we show that when DANA-GA is applied, momentum becomes beneficial in asynchronous environments, even as the number of workers scales up. The evaluation is done on the well-known CIFAR, ImageNet, and WikiText-103 datasets.
* M.Sc. student under the supervision of Professor Assaf Schuster.