Time+Place: Monday 22/12/2003 14:30 Room 337-8 Taub Bld.
Title: Leveraging Modern Interconnects for Parallel Job Scheduling
Speaker: Eitan Frachtenberg http://www.cs.huji.ac.il/~etcs/
Affiliation: HUJI/LANL
Host: Assaf Schuster

Abstract:

The use of clusters and grids as high-capability and high-capacity computers 
is growing rapidly in industry, academia, and government. This 
growth is accompanied by fast-paced progress in cluster-aware hardware, 
and in particular in interconnection technology. Contemporary networks 
offer not only excellent performance as expressed by latency and 
bandwidth, but also advanced architectural features, such as 
programmable network interface cards, hardware support for collective 
communication operations, and support for modern communication protocols 
such as MPI and RDMA.
These network mechanisms pave the way to advances in system software for 
large-scale clusters and grids. Such machines are typically composed of 
loosely coupled, independent compute nodes, each running a local 
operating system such as Linux. Such an arrangement is inadequate for many 
large-scale system tasks, such as resource management, job scheduling, 
and fault tolerance.
Our research at Los Alamos National Laboratory has focused on leveraging 
the features of modern interconnects to address these issues from a 
global, cohesive view. As part of this work, we have implemented two 
novel job scheduling algorithms that make use of advanced collective 
communication capabilities. We have also implemented some of the more 
traditional job scheduling algorithms, and compared the performance of 
these algorithms in several scenarios and cluster architectures. This 
talk presents an overview of these job scheduling algorithms and the 
main experimental results. In particular, we show how issues such as 
load imbalance and resource overlapping can be addressed by novel 
job-scheduling techniques.

Joint work with Dror Feitelson (HUJI) and Fabrizio Petrini (LANL).