Linux Today: Linux News On Internet Time.

O'Reilly Network: Linux System Failure Post-Mortem

Nov 08, 2001, 16:01
By Jennifer Vesperman
"Your Linux machine has just died, and your high up-time is wrecked. How do you tell what happened, and more importantly, how do you prevent a recurrence?

This article doesn't discuss user space programs -- few of them will crash the box without a chance of recovery; the only one I know of which reliably does that is crashme. Most crashes are caused by kernel "oopses," or hardware failures.

A kernel oops occurs when the kernel code gets into an unrecoverable state. In most cases, the kernel can write its state to the drive, which allows you to determine what happened if you have the correct tools. In a few cases, such as the "Aiee, killing interrupt handler" crash, the kernel is unable to write to the drive: with no interrupt handler, interrupt-driven I/O is impossible.

Even in the worst cases, some data can be retrieved and the cause can often be determined."
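
As a rough illustration of the first step in such a post-mortem, the sketch below scans saved log text for lines that look like kernel oops reports. It is not taken from the article. The function name find_oops_lines and the marker strings in OOPS_PATTERN are assumptions chosen for this example, and the exact wording of kernel messages varies between kernel versions.

```python
import re

# Marker strings are assumptions for illustration only; real kernel
# messages differ between kernel versions and architectures.
OOPS_PATTERN = re.compile(
    r"Oops|Unable to handle kernel|Aiee, killing interrupt handler"
)


def find_oops_lines(lines):
    """Return (line_number, text) pairs for lines that match a marker.

    `lines` is any iterable of strings, such as an open log file.
    Line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if OOPS_PATTERN.search(line):
            # Strip only the trailing newline so the message stays intact.
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, pass an open file object for whichever log your system's logging daemon writes kernel messages to; that location depends on the distribution and its configuration. Where the kernel could not write to the drive at all, no such log entry exists, and the message has to be copied from the console by hand.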
