Highly
Available Distributed Java


The goal of our research is to create a system that is
able to execute a parallel application on a large number of available
non-dedicated workstations despite possible failures and intervention of their
users, while providing a complete single-system image to the programmer. Thus,
we design a system that allows a high-level programming paradigm and completely
hides the distributed and dynamic nature of the underlying large-scale
non-dedicated environment. In order to achieve this, the desired system must efficiently
deal with a high rate of nodes becoming unable to continue their part of the
execution, as well as be highly scalable.
Today, a common computational environment consists of
a large number of commodity workstations, PCs or SMPs interconnected by high
bandwidth networks. Previous work shows that on average only 30% of a
workstation’s capacity is utilized; even during peak hours more than half of
the workstations are idle. Moreover, the communication subsystem dedicated to
the idle workstations may be also unused. This results in immense aggregate
waste of computational and network resources.
Utilization of idle machines in clusters of
workstations can yield a cost-effective substitute to dedicated parallel
computers for executing computation-intensive applications. Message passing
programming in this environment, however, is far from easy. The programmer’s
task can be significantly simplified if the programming environment provides a
shared memory abstraction.
The biggest obstacle to distributed computing in a
non-dedicated environment is a participating workstation being unable to
continue its part of the execution. This can occur due to a failure, e.g.,
a power outage, etc., or by user intervention, e.g., shutdown or
restart. These events result in the immediate loss of the data and threads
located on the node, which violates the integrity of the distributed execution.
The probability of any of these events occurring increases as the system size
and the application running time grows. In large-scale systems, multiple nodes
may become unavailable simultaneously.