[ ... ]
mingz> when u say large parallel storage system, you mean
mingz> independent spindles right? but most people will have all
mingz> disks configured in one RAID5/6 and thus it is not parallel
mingz> any more.
cw> it depends, you might have 100s of spindles in groups, you
cw> don't make a giant raid5/6 array with that many disks, you
cw> make a number of smaller arrays
Perhaps you are underestimating the ''if it can be done''
mindset...
Also, if one does a number of smaller RAID5s, is each one a
separate filesystem or do they get aggregated, for example with
LVM with ''concat''? Either way, how likely is it that the
consequences have been thought through?
I would personally hesitate to recommend either, especially a
two-level arrangement where the base level is a RAID5.
[I am making an effort in this discussion to use euphemisms]
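To make the concat question concrete, here is a minimal sketch
of how a logical offset maps onto the underlying arrays in a
concat compared with a stripe. The function names, array sizes
and chunk size are made up for illustration; this is not how LVM
or MD implement it, just the arithmetic.

```python
def concat_locate(offset, array_sizes):
    """Map a logical offset on a concatenated volume to
    (array index, offset within that array)."""
    if offset < 0:
        raise ValueError("negative offset")
    for index, size in enumerate(array_sizes):
        if offset < size:
            return index, offset
        offset -= size
    raise ValueError("offset beyond end of volume")


def stripe_locate(offset, n_arrays, chunk):
    """Map a logical offset on a striped (RAID0-style) volume to
    (array index, offset within that array)."""
    if offset < 0:
        raise ValueError("negative offset")
    chunk_no, within = divmod(offset, chunk)
    row, index = divmod(chunk_no, n_arrays)
    return index, row * chunk + within


def arrays_touched(locate, start, length):
    """Set of arrays hit by a contiguous extent [start, start+length)."""
    return {locate(o)[0] for o in range(start, start + length)}


# Hypothetical layout: 4 arrays of 1000 units each, 16-unit chunks.
sizes = [1000] * 4
concat_hit = arrays_touched(lambda o: concat_locate(o, sizes), 0, 100)
stripe_hit = arrays_touched(lambda o: stripe_locate(o, 4, 16), 0, 100)
```

With these numbers a 100-unit contiguous extent sits entirely on
one array in the concat, and touches all four in the stripe.
Whether independent files end up on different arrays of a concat
depends on the filesystem's allocator, which is exactly the kind
of consequence that has to be thought through.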
mingz> i think with write barrier support, system without UPS
mingz> should be ok.
cw> with barrier support a UPS shouldn't be necessary
Sure, «should» and «shouldn't» are nice hopeful concepts.
But write barriers are difficult to achieve, and when achieved
they are often unreliable, except on enterprise level hardware,
because many disks/host adapters/... simply lie as to whether
they have actually started writing (never mind finished writing,
or written correctly) stuff.
To get reliable write barriers one often has to source special
cards or disks with custom firmware; or leave system integration
to the big expensive guys and buy an SGI Altix, or an equivalent
system from Sun or IBM.
Besides, I have seen many reports of ''corruption'' that cannot
be fixed by write barriers: many have the expectation that
*data* should not be lost, even if no 'fsync' is done, *as if*
the filesystem had been mounted with 'mount -o sync' or
'mount -o data=ordered'.
Of course that is a bit of an inflated expectation, but all that
the vast majority of sysadms care about is whether it ''just
works'', without ''wasting time'' figuring things out.
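For those who do want to figure things out, this is roughly what
an application has to do if it wants its *data* on disk rather
than merely hoping for it. It is a minimal sketch assuming a
POSIX system; 'durable_write' is a made-up helper name, and
'os.fsync' can only promise as much as the disks and host
adapters underneath honestly deliver.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace 'path' with 'data' so that, after a crash, the file
    holds either the old or the new contents in full."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so that the
    # final rename stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer to the kernel
            os.fsync(f.fileno())    # ask the kernel to push it to disk
        os.replace(tmp, path)       # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # The rename is a directory update: sync the directory as well.
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Without the two 'os.fsync' calls the write may sit in the page
cache for a while, and a crash in that window loses it, however
consistent the filesystem metadata is afterwards.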
mingz> considering even u have UPS, kernel oops in other parts
mingz> still can take the FS down.
cw> but a crash won't cause writes to be 'reordered' [ ... ]
The metadata will be consistent, but metadata and data may well
be lost. So the filesystem is still ''corrupted'', at least
from the point of view of a sysadm who just wants the filesystem
to be effortlessly foolproof. Anyhow, if a crash happens all
bets are off, because who knows *what* gets written.
Look at it from the point of view of a ''practitioner'' sysadm:
''who cares if the metadata is consistent, if my 3TiB
application database is unusable (and I don't do backups
because after all it is a concat of RAID5s, backups are not
necessary) as there is a huge gap in some data file, and my
users are yelling at me, and it is not my fault''
The tradeoff in XFS is that if you know exactly what you are
doing you get extra performance...