"One of the problems is at least somewhat understood: a call to
fsync() on an ext3 filesystem will force the filesystem journal
(and related file data) to be committed to disk. That operation can
create a lot of write activity which must be waited for. But
contemporary I/O schedulers tend to favor read operations over
writes. Most of the time, that is a rational choice: there is
usually a process waiting for a read to complete, but writes can be
done asynchronously. A journal commit is not asynchronous, though,
and it can cause a lot of things to wait while it is in progress.
So it would be better not to put journal I/O operations at the end
of the queue.
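
The cost of a single fsync() call can be observed from user space. The following is a minimal sketch, not a benchmark: the helper name timed_fsync and the 4 KiB payload are illustrative assumptions, and the number it reports depends heavily on the filesystem, the device, and whatever else the system is writing at the time.

```python
import os
import tempfile
import time


def timed_fsync(payload: bytes, directory=None) -> float:
    """Write payload to a temporary file and return the seconds spent in fsync().

    timed_fsync is a hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)  # blocks until the filesystem reports the data as written
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    elapsed = timed_fsync(b"x" * 4096)
    print(f"fsync of a 4 KiB write took {elapsed * 1000:.2f} ms")
```

Running this in a directory on an ext3 filesystem while another process generates heavy write traffic should, under the behavior described above, show noticeably longer waits than on an idle system.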
"In fact, it would be better not to make journal operations
contend with the rest of the system at all. To that end, Arjan van
de Ven has long maintained a simple patch which gives the kjournald
thread realtime I/O priority. According to Alan Cox, this patch
alone is sufficient to make a lot of the problems go away. The
patch has never made it into the mainline, though, because Andrew
Morton has blocked it. This patch, he says, does not address the
real problem, and it causes a lot of unrelated I/O traffic to
benefit from elevated priority as well. Andrew says the real fix is
harder:
"The bottom line is that someone needs to do some serious
rooting through the very heart of JBD transaction logic and nobody
has yet put their hand up. If we do that, and it turns out to be
just too hard to fix then yes, perhaps that's the time to start
looking at palliative bandaids.
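
For readers unfamiliar with Linux I/O priorities, the sketch below shows how a priority class and level are packed into the single integer that the ioprio_set() system call accepts. The constants mirror the kernel's I/O priority definitions (class shift of 13, classes 0 through 3, levels 0 through 7 with 0 the highest); the helper names are assumptions made for illustration, and the sketch does not reproduce the kjournald patch itself.

```python
# Constants as defined by the Linux I/O priority interface.
IOPRIO_CLASS_SHIFT = 13

IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1    # realtime: served ahead of every other class
IOPRIO_CLASS_BE = 2    # best-effort: the normal class
IOPRIO_CLASS_IDLE = 3  # served only when the disk is otherwise idle


def ioprio_value(io_class: int, level: int) -> int:
    """Pack a class and a level (0 = highest, 7 = lowest) into one integer.

    ioprio_value is a hypothetical helper name for this illustration.
    """
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return (io_class << IOPRIO_CLASS_SHIFT) | level


def ioprio_class(value: int) -> int:
    """Extract the class from a packed priority value."""
    return value >> IOPRIO_CLASS_SHIFT


def ioprio_level(value: int) -> int:
    """Extract the level from a packed priority value."""
    return value & ((1 << IOPRIO_CLASS_SHIFT) - 1)


if __name__ == "__main__":
    print(ioprio_value(IOPRIO_CLASS_RT, 0))  # 8192
    print(ioprio_value(IOPRIO_CLASS_BE, 4))  # 16388
```

Giving a thread a realtime value such as the first one normally requires elevated privileges, which is part of why it is an unusual thing to hand to a kernel thread that also carries unrelated I/O.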
"Bandaid or not, this approach has its adherents. The ext4
filesystem has a new mount option (journal_ioprio) which can be
used to set the I/O priority for journaling operations; it defaults
to something higher than normal (but not realtime). More recently,
Ted Ts'o has posted a series of ext3 patches which sets the
WRITE_SYNC flag on some journal writes. That flag marks the
operations as synchronous, which will keep them from being blocked
by a long series of read operations. According to Ted, this change
helps quite a bit, at least when there is a lot of read activity
going on. The ext3 changes have not yet been merged for 2.6.30 as
of this writing (none of Ted's trees have), but chances are they
will go in before 2.6.30-rc1."
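
The effect of marking a write as synchronous can be illustrated with a deliberately simplified model. This is a toy dispatcher written for this note, not the algorithm used by any real Linux I/O scheduler: it assumes all requests are queued at once, that reads and sync-flagged writes share one preferred FIFO, and that ordinary writes are served only after that FIFO drains. Real schedulers also bound how long writes may be starved, which the model ignores. The Request class and its sync field are hypothetical stand-ins for the WRITE_SYNC idea.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    is_write: bool = False
    sync: bool = False  # hypothetical stand-in for a WRITE_SYNC-style flag


def dispatch_order(requests):
    """Return request names in the order a toy read-preferring queue serves them."""
    # Reads and sync-flagged writes go to the preferred FIFO, in arrival order.
    preferred = deque(r for r in requests if not r.is_write or r.sync)
    # Ordinary (asynchronous) writes wait until the preferred FIFO is empty.
    deferred = deque(r for r in requests if r.is_write and not r.sync)
    order = []
    while preferred:
        order.append(preferred.popleft().name)
    while deferred:
        order.append(deferred.popleft().name)
    return order


if __name__ == "__main__":
    reads = [Request(f"read{i}") for i in range(3)]
    plain = [Request("journal", is_write=True)] + reads
    flagged = [Request("journal", is_write=True, sync=True)] + reads
    print(dispatch_order(plain))    # ['read0', 'read1', 'read2', 'journal']
    print(dispatch_order(flagged))  # ['journal', 'read0', 'read1', 'read2']
```

In the first case the journal write arrives first but is served last, behind every read; in the second, the flag lets it keep its place in line, which is the behavior the article attributes to the WRITE_SYNC change.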