Wednesday, 18.11.2015, 11:30
Large computer clusters are the infrastructure behind the Cloud and Web2 services. These clusters host highly parallel programs such as distributed databases and MapReduce. The parallelism is required to handle the Big Data stored on these clusters. But applications with a high degree of parallelism stress the cluster network, as they tend to generate correlated, high-bandwidth traffic bursts. This stress causes a long network latency tail, as well as incast and throughput collapse, which are known issues of cluster networks. These problems are intensified in the Cloud environment, where multiple such parallel applications run concurrently on the same cluster. Not only do the applications suffer from their own bursts of high-throughput traffic, they may also be attacked by the other applications. These attacks may create variation in the obtained application performance and reduce runtime predictability.
In this talk, I will present my PhD research on improving computer cluster network performance for the correlated, bursty and high-capacity traffic of parallel applications. My work has focused on optimizing packet forwarding for Fat Tree topologies. I will present the multiple approaches I studied and my contributions, and then focus on improving application runtime predictability in the Cloud.
Eitan Zahavi is a PhD student at the Faculty of Electrical Engineering at the Technion, under the supervision of Professors Avinoam Kolodny, Israel Cidon and Isaac Keslassy. He is also a Senior Principal Engineer at Mellanox Technologies. His research interests include theoretical and practical aspects of Data Center and High Performance Networks, mostly in scheduling, traffic engineering and network management.