Linux Magazine: Fault-Tolerant MPI
Feb 11, 2005, 05:30
(Other stories by Graham E. Fagg)
"Today's users of high-performance computing (HPC) systems have
access to larger machines with more processors than ever before.
Even discounting systems such as the Earth Simulator, the ASCI-Q
machine, or IBM's Blue Gene system--all of which consist of
thousands or even tens of thousands of processors--everyday
production clusters can easily consist of hundreds to a few
thousand processors. Future systems composed of a hundred thousand
processors are already on the drawing board and are expected to be
in service within the next few years.
"With such large systems, a critical issue is how to deal with
hardware and software faults that lead to process failures..."