Starfish is a highly available, fault-tolerant system for running parallel MPI programs on clusters of workstations and PCs.
Starfish provides a collection of checkpoint/restart protocols that facilitate fault tolerance, load balancing, and dynamicity for the applications running on top of it.
Starfish is based on the Ensemble group communication system for high availability.
Starfish is written mostly in OCaml and is therefore highly portable, although the current version runs only on Linux, with either Fast Ethernet or Myrinet.
Starfish uses the BIP drivers for Myrinet to achieve high performance.