Technion - Israel Institute of Technology 


Computer Science Department 



Welcome to the HADJ Project Home Page

Highly Available Distributed Java  




The Goal

The goal of our research is to create a system that is able to execute a parallel application on a large number of available non-dedicated workstations despite possible failures and intervention of their users, while providing a complete single-system image to the programmer. Thus, we design a system that allows a high-level programming paradigm and completely hides the distributed and dynamic nature of the underlying large-scale non-dedicated environment. In order to achieve this, the desired system must efficiently deal with a high rate of nodes becoming unable to continue their part of the execution, as well as be highly scalable.


Motivation

Today, a common computational environment consists of a large number of commodity workstations, PCs or SMPs interconnected by high bandwidth networks. Previous work shows that on average only 30% of a workstation’s capacity is utilized; even during peak hours more than half of the workstations are idle. Moreover, the communication subsystem dedicated to the idle workstations may be also unused. This results in immense aggregate waste of computational and network resources.

Utilization of idle machines in clusters of workstations can yield a cost-effective substitute to dedicated parallel computers for executing computation-intensive applications. Message passing programming in this environment, however, is far from easy. The programmer’s task can be significantly simplified if the programming environment provides a shared memory abstraction.

The biggest obstacle to distributed computing in a non-dedicated environment is a participating workstation being unable to continue its part of the execution. This can occur due to a failure, e.g., a power outage, etc., or by user intervention, e.g., shutdown or restart. These events result in the immediate loss of the data and threads located on the node, which violates the integrity of the distributed execution. The probability of any of these events occurring increases as the system size and the application running time grows. In large-scale systems, multiple nodes may become unavailable simultaneously. 


Related Distributed Java Runtime Systems


People