Nadav Cohen (Hebrew University of Jerusalem)
Tuesday, 22.12.2015, 11:30
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical architectures than with shallow ones. Despite the vast empirical evidence, formal arguments to date are limited and do not capture the kind of networks used in practice. Using tensor factorization, we derive a universal hypothesis space implemented by an arithmetic circuit over functions applied to local data structures (e.g. image patches). The resulting networks first pass the input through a representation layer, and then proceed with a sequence of layers comprising sum followed by product-pooling, where sum corresponds to the widely used convolution operator. The hierarchical structure of networks is born from factorizations of tensors based on the linear weights of the arithmetic circuits. We show that a shallow network corresponds to a rank-1 decomposition, whereas a deep network corresponds to a Hierarchical Tucker (HT) decomposition. Log-space computation for numerical stability transforms the networks into SimNets.
In its basic form, our main theoretical result shows that the set of polynomially sized rank-1 decomposable tensors has measure zero in the parameter space of polynomially sized HT decomposable tensors. In deep learning terminology, this amounts to saying that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require an exponential size if one wishes to implement (or approximate) them with a shallow network. Our construction and theory shed new light on various practices and ideas employed by the deep learning community, and in that sense bear a paradigmatic contribution as well.
Joint work with Or Sharir and Amnon Shashua.