Refael Cohen, M.Sc. Thesis Seminar
Advisor: Prof. A. Schuster
As the field of deep learning progresses and models grow ever larger, training deep neural networks has become a demanding task. It requires an enormous amount of compute power and can still be very time consuming, especially when using a single GPU. To tackle this problem, distributed deep learning has come into play, with a variety of asynchronous training algorithms. However, most of these algorithms suffer from decreased accuracy as the number of workers increases. We introduce a new method, Single MomEntum Gradient Accumulation ASGD (SMEGA2), which outperforms existing methods in terms of final test accuracy and scales to as many as 64 asynchronous workers.