Amir Watad (EE, Technion)
Thursday, 6.4.2017, 11:30
While augmenting a system with multiple GPUs is an appealing way to pack more compute power into a single machine, it is not without its challenges. We will talk about these challenges in the context of network servers, where short request-handling latency and throughput that scales with the number of GPUs are the main design goals.
We claim that the current GPU programming model, in which the GPU is a co-processor managed by the CPU, is an impediment to building scalable network servers. The primary observation is that CPUs cannot keep up with managing many GPUs and become the system's bottleneck. This problem becomes particularly pronounced when request processing involves accesses to a large dataset that is sharded across multiple GPUs in order to fit in their aggregate memory. Further, the programming complexity of such multi-GPU memory-intensive servers is prohibitively high, preventing broader use of GPUs in data centers.
GPUpipes is a framework for building memory-greedy multi-GPU network servers that promotes a CPU-free design, removing the CPU from GPU management. GPUpipes builds on the concept of a data-parallel pipeline, providing programming abstractions that hide the complexity of multi-GPU systems. The programmer breaks request processing into multiple pipeline stages, specifies the data shards each stage may access, and implements the application logic of each stage. GPUpipes orchestrates the execution of pipeline stages across multiple GPUs while maintaining data locality.
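To make the stage/shard abstraction concrete, here is a minimal sketch in Python. The names (`Stage`, `Pipeline`, the shard lists) are illustrative assumptions, not the actual GPUpipes API, and the scheduling a real framework would do across GPUs is reduced to running the stages in order on the CPU:

```python
# Hypothetical sketch of a data-parallel pipeline with per-stage shard
# affinity. Stage and Pipeline are invented names; the real GPUpipes
# API may differ.

class Stage:
    def __init__(self, fn, shards):
        self.fn = fn          # application logic for this stage
        self.shards = shards  # data shards (GPUs) this stage may access

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def handle(self, request):
        # A real framework would dispatch each stage to a GPU holding
        # one of its shards; here the stages simply run in sequence.
        result = request
        for stage in self.stages:
            result = stage.fn(result)
        return result

# Example: a two-stage pipeline (e.g. filter candidates, then rank them).
p = Pipeline([
    Stage(lambda q: [x for x in q if x % 2 == 0], shards=[0, 1]),
    Stage(lambda xs: sorted(xs, reverse=True),    shards=[2, 3]),
])
print(p.handle([5, 2, 8, 3, 6]))  # → [8, 6, 2]
```

The point of the abstraction is that the programmer only writes the per-stage logic and declares which shards a stage needs; routing requests to the GPU that holds the right shard is the framework's job.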
We evaluate our system on an Approximate K-Nearest-Neighbours server workload and compare it to a highly optimized CPU-driven multi-GPU server.
Finally, we will discuss the current limitations of the system and propose future work to remove them.