Stan Hoeppner wrote:
I'm not familiar with Exanet, only that it was an Israeli company that
went belly up in late '09 or early '10. Was the hardware provided by them?
Is it totally proprietary, or are you able to wipe the OS and install a
fresh copy of your preferred Linux distro and a recent kernel?
It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
will be relatively painless.
I'm not sure which block size you're referring to here. Are you
referring to the kernel page size or the filesystem block size? AFAIK,
the default Linux kernel page size is still 8 KiB although there has
been talk for some time WRT changing it to 4 KiB, but IIRC some are
hesitant due to stack overruns with the 4 KiB page size. Regardless,
the kernel page size isn't a factor WRT throughput to disk.
If you're referring to the latter, XFS has a block size configurable per
filesystem from 512 bytes to 64 KiB, with 4 KiB being the default. Make
your XFS filesystems with "-b size=65536" and you should be good to go.
I thought that the filesystem block size was still limited to the kernel
page size, which is 4K on x86 systems.
"The maximum filesystem block size is the page size of the kernel, which
is 4K on x86 architecture."
Is this no longer true? It would be awesome news if it wasn't.
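
To make the question concrete, here is a minimal Python sketch of the check I have in mind. It assumes the limit quoted above still holds (filesystem block size must not exceed the kernel page size) and uses the 512 byte to 64 KiB range mentioned earlier. The helper name block_size_ok is made up for illustration; it is not part of any XFS tool.

```python
import os

# XFS block size range as stated earlier in this thread (512 bytes to 64 KiB).
XFS_MIN_BLOCK = 512
XFS_MAX_BLOCK = 64 * 1024


def block_size_ok(block_size, page_size):
    """Return True if block_size is a power of two within the XFS range
    and does not exceed page_size (the limit quoted above, ASSUMED to
    still apply to the running kernel)."""
    is_power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    in_xfs_range = XFS_MIN_BLOCK <= block_size <= XFS_MAX_BLOCK
    return is_power_of_two and in_xfs_range and block_size <= page_size


if __name__ == "__main__":
    # Ask the running kernel for its page size (Unix only).
    page_size = os.sysconf("SC_PAGE_SIZE")
    for candidate in (4096, 65536):
        verdict = "ok" if block_size_ok(candidate, page_size) else "too large or invalid"
        print(f"block size {candidate} with page size {page_size}: {verdict}")
```

If the limit has been lifted in newer kernels, the last condition in block_size_ok is the only part that would need to change.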
My quick calculations were based on worst-case random read, which is
what we were seeing with the Exastore. They had a 64K blocksize * 48
disks * 70 seeks per second = 215 MB/s, which is exactly what we were
seeing under load. Under heavy random load, I'm worried that XFS has to
either thrash the disks with 4K reads and writes ~or~ introduce
unnecessary latency by doing read-combining and write-combining and/or
predictive elevator hanky-panky.
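
For reference, this is the back-of-the-envelope arithmetic as a small Python sketch. It models only the worst case described above: every seek returns exactly one block, with no read-ahead or request merging. The function name is mine, and the 4K line is just the same formula applied to the default XFS block size.

```python
def worst_case_kib_per_second(block_kib, disks, seeks_per_second):
    """Worst-case random-read throughput in KiB/s, assuming each seek
    on each disk yields exactly one filesystem block."""
    return block_kib * disks * seeks_per_second


if __name__ == "__main__":
    # 64K blocks, 48 disks, 70 seeks per second per disk.
    exastore = worst_case_kib_per_second(64, 48, 70)
    # Same disks, but only one 4K block fetched per seek.
    xfs_4k = worst_case_kib_per_second(4, 48, 70)
    print(f"64K blocks: {exastore} KiB/s (roughly {exastore / 1000:.0f} MB/s)")
    print(f" 4K blocks: {xfs_4k} KiB/s (roughly {xfs_4k / 1000:.0f} MB/s)")
```

The 64K case gives 215040 KiB/s, which is where the 215 MB/s figure comes from. The 4K case is 16 times lower, which is the thrashing scenario I'm worried about.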
Are those 56 drives configured as a single large RAID stripe? RAID 10
or RAID 6? Or are they split up into multiple smaller arrays? Hardware
or software RAID? I ask as it will allow us to give you the exact
mkfs.xfs command line you need to make your XFS filesystem(s) for
I miscounted; it's 48 drives split into 6 hardware RAID-5 arrays.
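
In case it helps with the mkfs.xfs command line, here is a rough Python sketch of the stripe geometry that follows from those numbers. It assumes the 48 drives are split evenly (8 per array) and that each RAID-5 array has one drive's worth of parity. The 64 KiB stripe unit is only a placeholder; the real value has to come from the RAID controller configuration. The function name is hypothetical.

```python
def raid5_stripe_options(total_drives, arrays, stripe_unit_kib):
    """Build the mkfs.xfs '-d su=...,sw=...' option for one RAID-5 array.

    ASSUMES the drives are split evenly across arrays and that each
    array loses exactly one drive's capacity to parity.
    """
    if total_drives % arrays != 0:
        raise ValueError("drives do not divide evenly across arrays")
    drives_per_array = total_drives // arrays
    if drives_per_array < 3:
        raise ValueError("RAID-5 needs at least 3 drives per array")
    data_disks = drives_per_array - 1  # one drive's worth of parity
    return f"-d su={stripe_unit_kib}k,sw={data_disks}"


if __name__ == "__main__":
    # 48 drives in 6 arrays; the 64 KiB stripe unit is a placeholder.
    print(raid5_stripe_options(48, 6, 64))
```

With the placeholder stripe unit this prints "-d su=64k,sw=7", i.e. 7 data disks per 8-drive array.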
They're actually kind of independent of one another. For instance, 'rm
-rf' on a 50k file directory structure won't touch a single file, only
metadata. So you have zero files being written but 50k log write
transactions (which delaylog will coalesce into fewer larger actual disk
writes). Typically, the data being written into the log is only a
fraction of the size of the files themselves, especially in your case
where most files are > 1MB in size, so the log bandwidth required for
"normal" file write operations is pretty low. If you're nervous about
it, simply install a small (40 GB) fast SSD in the server and put one
external journal log on it for each filesystem. That'll give you about
40-50k random 4k IOPS throughput for the journal logs. Combined with
delaylog I think this would thoroughly eliminate any metadata
Is there a way to monitor log operations to find out how much is being
used at a given time?
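
One possible starting point, as far as I know, is the "log" line in /proc/fs/xfs/stat. The sketch below parses that line and diffs two samples. The five field names are my recollection of the XFS runtime statistics (log writes, log blocks, noiclogs, forces, force sleeps) and should be checked against the kernel documentation. I also believe these counters are summed over all mounted XFS filesystems rather than kept per filesystem.

```python
# Field order ASSUMED from the XFS runtime statistics; verify before relying on it.
LOG_FIELDS = ("writes", "blocks", "noiclogs", "force", "force_sleep")


def parse_xfs_log_counters(stat_text):
    """Extract the 'log' counters from the text of /proc/fs/xfs/stat.

    Returns a dict keyed by LOG_FIELDS, or None if no 'log' line exists.
    """
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "log":
            values = [int(v) for v in parts[1:1 + len(LOG_FIELDS)]]
            return dict(zip(LOG_FIELDS, values))
    return None


def counter_delta(before, after):
    """Difference between two samples, i.e. log activity in the interval."""
    return {name: after[name] - before[name] for name in LOG_FIELDS}


if __name__ == "__main__":
    import time

    def sample():
        with open("/proc/fs/xfs/stat") as stat_file:
            return parse_xfs_log_counters(stat_file.read())

    first = sample()
    time.sleep(5)
    second = sample()
    print(counter_delta(first, second))
```

Sampling every few seconds under load would at least show how many log writes and forces are happening, even if it doesn't answer how full the log is.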
The key to metadata performance isn't as much the size of the log device but
the throughput. If you have huge write cache on your hardware RAID
controllers and are using internal logs, or if you use a local SSD for
external logs, I would think you don't need the logs to be really huge,
as you're able to push the tail very fast, especially in the case of a
locally attached (SATA) SSD. Write cache on a big SAN array may be very
fast, but you typically have an FC switch hop or two to traverse,
increasing latency. Latency with a locally attached SSD is about as low
as you can get, barring use of a ramdisk, which no sane person would
ever use for a filesystem journal.
All the metadata eventually has to be written to the main array, so
doesn't that ultimately become the limiting factor on metadata
throughput under sustained load?
You're asking for a guarantee that no one can give you, or would dare
to. And this would have little to do with confidence in XFS, but the
sheer complexity of the Linux kernel, and not knowing exactly what
hardware you have. There could be a device driver bug in a newer kernel
that might panic your system. There's no way for us to know that kind
of thing, so, no guarantees. :(
Fair enough. Since I get yelled at if the server goes down, I'm drawn
to proven-track-record kernels as much as possible. (Well, okay... not
quite back to 2.4-proven-track-record kernels...)
WRT XFS, there were a number of patches up to 18.104.22.168 which address
the problems you mention above, but none in 22.214.171.124 or 126.96.36.199, all of
which are the currently available kernels at kernel.org. So, given that
the patches have slowed down dramatically recently and the bugs have
been squashed, WRT XFS, I think you should feel confident installing
And, as always, install it on a test rig first and pound the daylights
out of it with a test based on your actual real workload.