O'Reilly Network: Linux System Failure Post-Mortem
Nov 08, 2001, 16:01
(Other stories by Jennifer Vesperman)
"Your Linux machine has just died, and your high
up-time is wrecked. How do you tell what happened, and more
importantly, how do you prevent a recurrence?
This article doesn't discuss user space programs -- few of them
will crash the box without a chance of recovery; the only one I
know of that reliably does so is crashme. Most crashes are
caused by kernel "oopses" or hardware failures.
A kernel oops occurs when the kernel code gets into an
unrecoverable state. In most cases, the kernel can write its state
to the drive, which allows you to determine what happened if you
have the correct tools. In a few cases, such as the "Aiee, killing
interrupt handler" crash, the kernel is unable to write to the
drive. With no interrupt handler, interrupt-driven I/O is impossible.
Even in the worst cases, some data can be retrieved and the
cause can often be determined."
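
As a rough illustration of the first step the excerpt describes (finding what the kernel managed to write before it died), here is a minimal Python sketch that scans log lines for oops-like messages. The marker strings and the sample lines are assumptions made for illustration; the exact wording varies between kernel versions, and the helper name find_oops_lines is hypothetical rather than part of any existing tool.

```python
import re

# Assumed marker strings; different kernel versions word these messages differently.
OOPS_MARKERS = re.compile(
    r"Oops|Unable to handle kernel|Aiee, killing interrupt handler"
)


def find_oops_lines(lines):
    """Return (line_number, text) pairs for lines that look like oops output."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOPS_MARKERS.search(line)
    ]


# Made-up sample log lines, for illustration only.
sample = [
    "kernel: eth0: link up\n",
    "kernel: Unable to handle kernel NULL pointer dereference\n",
    "kernel: Oops: 0002\n",
    "kernel: Aiee, killing interrupt handler\n",
]

for number, text in find_oops_lines(sample):
    print(f"{number}: {text}")
```

To try this on a real machine, pass an open file object for the system log instead of the sample list. Where that log lives depends on the distribution and its syslog configuration, so the path is left to the reader.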