Linux Magazine: Fault-Tolerant MPI, Feb 11, 2005, 05:30 UTC (by Graham E. Fagg)
"Today's users of high performance computing systems (HPC) have access to larger machines with more processors than ever before. Even discounting systems such as the Earth Simulator, the ASCI-Q machine, or IBM's Blue Gene system--all of which consist of thousands or even tens of thousands of processors--everyday production clusters can easily consist of hundreds to a few thousand processors. Future systems composed of a hundred thousand processors are already on the drawing board and are expected to be in service within the next few years.
"With such large systems, a critical issue is how to deal with hardware and software faults that lead to process failures..."