On Sat, Apr 30, 2011 at 03:51:43PM +0100, Peter Grandi wrote:
> Been thinking about journals and RAID6s and SSDs.
> In particular for file system designs like JFS and XFS that do
> operation journaling (while ext do block journaling).
XFS is not an operation journalling filesystem. Most of the metadata
is dirty-region logged via buffers, just like ext3/4. Perhaps you
need to read some documentation like this:
> The issue is: journal size?
> It seems to me that adopting as guideline a percent of the
> filesystem is very wrong, and so I have been using a rule of
> thumb like one second of expected transfer rate, so "in flight"
> updates are never much behind.
How do you know what "one second" of "in flight" operations is going
to be?
I had to deal with this in XFS when implementing the delayed logging
code. It uses a number of operations or a percentage of log space to
determine when to checkpoint the modifications, and that is
typically load dependent as to when it triggers.
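
To make the "number of operations or a percentage of log space" idea
concrete, here is a minimal sketch of such a trigger. It is not the XFS
delayed logging code; the class name, the field names and both threshold
values are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical checkpoint trigger; thresholds are illustrative only."""

    log_size_bytes: int
    max_pending_ops: int = 10_000      # assumed limit on queued operations
    max_log_fraction: float = 0.125    # assumed fraction of the log in use

    def should_checkpoint(self, pending_ops: int, pending_bytes: int) -> bool:
        # Trigger on whichever limit is reached first: the count of
        # pending operations or the share of log space they consume.
        too_many_ops = pending_ops >= self.max_pending_ops
        too_much_space = (
            pending_bytes >= self.max_log_fraction * self.log_size_bytes
        )
        return too_many_ops or too_much_space


if __name__ == "__main__":
    policy = CheckpointPolicy(log_size_bytes=128 * 2**20)
    # Few, small changes: neither limit is hit.
    print(policy.should_checkpoint(pending_ops=10, pending_bytes=1024))
    # Many tiny changes: the operation count limit is hit.
    print(policy.should_checkpoint(pending_ops=10_000, pending_bytes=0))
```

Which of the two limits fires first depends on the workload, which is
why the trigger point ends up being load dependent.
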
And then you've got the problem of concurrency - one second of a
single threaded workload is much different to one second of the same
workload spread across 20 CPU cores. You need to have limits that
work well in both cases, and structures that scale to that level of
concurrency.
In reality, there's not much point in trying to calculate what one
second's worth of metadata is going to be - more often than not
you'll hit some other limitation in the journal subsystem, run out
of memory or have to put limits in place anyway to avoid latency
problems. Easiest and most reliable method seems to be to size your
journal appropriately in the first place and have your algorithms key
> But even at a single disk *sequential* transfer rate of say
> 80MB/s average, a journal that contains operation records could
> conceivably hold dozens if not hundreds of thousands of pending
> metadata updates, probably targeted at very widely scattered
> locations on disk, and playing a journal fully could take a long
> time.
17 minutes is my current record by crashing a VM during a chmod -R
operation over a 100 million inode filesystem. That was on a ~2GB
log (maximum supported size).
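
As a back-of-envelope check on those numbers, the sketch below works out
the effective replay rate they imply. It assumes the ~2GB log was
completely full and treats it as 2048 MB; the function names are
hypothetical, and the result says nothing about how much metadata I/O
recovery actually did, only how fast the log was consumed.

```python
def effective_replay_rate_mb_s(log_mb: float, replay_seconds: float) -> float:
    """Log size divided by recovery time, in MB/s."""
    return log_mb / replay_seconds


def estimated_replay_seconds(log_mb: float, rate_mb_s: float) -> float:
    """Crude recovery-time estimate for a full log at a given rate."""
    return log_mb / rate_mb_s


if __name__ == "__main__":
    # Assumption: full 2048 MB log, 17 minutes to replay.
    rate = effective_replay_rate_mb_s(2048, 17 * 60)
    print(f"effective replay rate: {rate:.2f} MB/s")
    # Same rate applied to a smaller, hypothetical 128 MB log.
    print(f"128 MB log: {estimated_replay_seconds(128, rate):.0f} s")
```

Under those assumptions the rate comes out at roughly 2 MB/s, i.e. well
below the sequential transfer rate of a single disk.
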
> So the idea would be that the relevant transfer rate would be
> the *random* one, and since that is around 4MB/s per single
> disk, journal sizes would end up pretty small. But many people
> allocate very large (at least compared to that) journals.
> This seems to me a fairly bad idea, because then the journal
> becomes a massive hot spot on the disk and draws the disk arm
> like a black hole. I suspect that operations should not stay on
That's why you can configure an external log....
> the journal for a long time. However if the journal is too small
> processes that do metadata updates start to hang on it.
Well, yes. The journal needs to be large enough to hold all the
transaction reservations for the active transactions. XFS, in the
worst case for a default filesystem config, needs about 100MB of log
space per 300 concurrent transactions. Increasing transaction
concurrency was the main reason we increased the log size...
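
For a rough feel of what that ratio means, here is a small sketch that
scales the "about 100MB per 300 concurrent transactions" figure
linearly. The linear scaling and the function names are assumptions made
for illustration; real reservations depend on the transaction type and
the filesystem geometry.

```python
# Worst-case figure quoted above for a default filesystem config.
LOG_MB_PER_BATCH = 100
TRANSACTIONS_PER_BATCH = 300


def log_mb_for_transactions(concurrent_transactions: int) -> float:
    """Log space (MB) to reserve, assuming linear scaling."""
    return concurrent_transactions * LOG_MB_PER_BATCH / TRANSACTIONS_PER_BATCH


def max_concurrent_transactions(log_mb: int) -> int:
    """Concurrent transactions a log of log_mb could hold, worst case."""
    return log_mb * TRANSACTIONS_PER_BATCH // LOG_MB_PER_BATCH


if __name__ == "__main__":
    print(log_mb_for_transactions(300))       # the quoted data point
    print(max_concurrent_transactions(2048))  # a ~2GB log, taken as 2048 MB
```
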
> So some questions for which I have guesses but not good answers:
> * What should journal size be proportional to?
> * What is the downside of a too small journal?
> * What is the downside of a too large journal other than space?
Recovery times too long, lots of outstanding metadata pinned in
memory (hello OOM-killer!), and other resource management related
problems.
> Again I expect answers to be very different for ext but I am
> asking for operation-journaling file system designs like JFS and
> XFS.
> BTW, another consideration is that for filesystems that are
> fairly journal-intensive, putting the journal on a low traffic
> storage device can have large benefits.
Yeah, nobody ever thought of an external log before.... :)
> But if they can be pretty small, I wonder whether putting the
> journals of several filesystems on the same storage device then
> becomes a sensible option as the locality will be quite narrow
> (e.g. a single physical cylinder) or it could be worthwhile like
> the database people do to journal to battery-backed RAM.
Got a supplier for the custom hardware you'd need? Just use a PCIe