“Today’s users of high performance computing systems (HPC) have
access to larger machines with more processors than ever before.
Even discounting systems such as the Earth Simulator, the ASCI-Q
machine, or IBM’s Blue Gene system–all of which consist of
thousands or even tens of thousands of processors–everyday
production clusters can easily consist of hundreds to a few
thousand processors. Future systems composed of a hundred thousand
processors are already on the drawing board and are expected to be
in service within the next few years.

“With such large systems, a critical issue is how to deal with
hardware and software faults that lead to process failures…”