Re: vim file write mode on journaling fs.
OK, just picking a message to respond to from this thread.
1. It is not XFS which is deciding when to write the file data out
to disk, it is Linux. The bdflush daemon is responsible for this;
that 30 seconds is one of its control parameters.
2. An individual thread doing a write in XFS has no way of knowing
or predicting what else may just have happened, or be about to happen
on the system. You cannot say 'I will write my data now because
the system is idle'; there may be a couple of Gbytes of I/O about
to come in from another source.
3. O_SYNC is a fairly standard flag on file open; if FreeBSD does not
have it then it is missing a fairly major feature of filesystems.
Having said that, I do not recommend using O_SYNC, as it is more
expensive than an fsync.
4. Anyone who is shutting down their system by pulling the plug
rather than doing an orderly shutdown is asking for trouble.
Yes, it is a situation we want to deal with fairly gracefully,
but filesystem recovery in journaling filesystems and fsck in
others is there to 'recover' from problems, not to cater to
lazy users. umount is there for a reason.
5. The delayed write you talk about is the norm for ALL filesystems
operating on spinning disks. If you don't delay writes in a filesystem
then you will be here until Christmas responding to this email.
Now XFS has delayed allocation which is different.
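To make point 3 concrete, here is a minimal sketch of the two options
from an application's side. The function names are made up for
illustration, and the getattr fallback is only there because O_SYNC is
not defined on every platform Python runs on. With O_SYNC each write
call is synchronous; with fsync the writes are buffered and the
application pays for one flush when it asks for it.

```python
import os


def _write_all(fd, data):
    """Keep calling write() until every byte has been accepted."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_with_fsync(path, data):
    """Buffered writes, then one explicit fsync at the end."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)  # returns once the data is in memory
        os.fsync(fd)          # ask the kernel to push this file to disk
    finally:
        os.close(fd)


def write_with_o_sync(path, data):
    """Open with O_SYNC so each write() is synchronous."""
    o_sync = getattr(os, "O_SYNC", 0)  # assumption: fall back to 0 if absent
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | o_sync
    fd = os.open(path, flags, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)
```

An editor saving a file would normally want the first form: write the
whole buffer, then fsync once before telling the user it is saved.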
The normal process of a write in Linux is something like this:
o write system call comes in, looks for space in the file, if
there is none it asks the filesystem to allocate some, the
data is copied into a buffer which has the disk address of
this data and which is marked dirty.
The write then returns - the data is NOT written
to disk, nor in the case of ext2 would any of the metadata
changes be written to disk.
o The bdflush daemon comes along and sees buffers as being
suitable for flushing; the data and metadata get written
out to disk.
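The two steps above can be sketched as a toy model. This is not kernel
code; the class and method names are invented purely to show the
ordering: the disk address is assigned at write time, the buffer sits
dirty in memory, and nothing reaches the disk until the flush step runs.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    disk_addr: int      # assigned when the write comes in
    data: bytes
    dirty: bool = True  # not yet on disk


class ToyFilesystem:
    """Hypothetical model: allocate at write time, flush later."""

    def __init__(self):
        self.next_free = 0
        self.cache = []  # buffers held in memory
        self.disk = {}   # disk_addr -> data that really reached the disk

    def write(self, data):
        addr = self.next_free             # space is allocated immediately
        self.next_free += 1
        self.cache.append(Buffer(addr, data))
        return addr                       # write returns; disk is untouched

    def flush(self):
        """Stand-in for the bdflush daemon writing out dirty buffers."""
        for buf in self.cache:
            if buf.dirty:
                self.disk[buf.disk_addr] = buf.data
                buf.dirty = False

    def crash(self):
        """Pull the plug: memory is lost, only the disk contents survive."""
        self.cache.clear()
        return dict(self.disk)
```

Anything written after the last flush is simply gone after crash(),
which is the behaviour every delayed-write filesystem shares.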
With XFS it looks like this:
o write system call comes in, looks for space in the file, if
there is none it asks the filesystem for some, the filesystem
records the fact that space was requested at this point in the
file. A buffer is allocated as before, it is marked dirty, it
is also marked delayed allocate. The write returns.
o Possibly an inode flush or a log flush pushes the new inode
out to disk.
o The bdflush daemon comes along and sees the buffers as being
delayed allocate, it calls the filesystem to allocate the space.
The allocate is done, and the buffers are written out to disk.
The transaction which records the extents is still in memory
and will not be flushed for a few seconds yet.
This last sentence is the major difference, and is probably what is
biting here: the write has not really happened until this metadata
makes it out to disk. We may be having some issues with how long
this is taking in Linux.
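Here is the same kind of toy model for the delayed allocate case. Again,
none of these names come from XFS itself; they are placeholders to show
why data that is already on disk can still be unreachable until the
transaction recording the extents has been flushed.

```python
class ToyDelallocFs:
    """Hypothetical model of delayed allocation, not XFS internals."""

    def __init__(self):
        self.next_free = 0
        self.delalloc = []         # (file_offset, data): dirty, no disk address
        self.disk_blocks = {}      # disk_addr -> data written to disk
        self.pending_extents = {}  # file_offset -> disk_addr, still in memory
        self.disk_extents = {}     # extent records that reached the disk

    def write(self, offset, data):
        # Space is only reserved here; no disk address is chosen yet.
        self.delalloc.append((offset, data))

    def flush_data(self):
        """Stand-in for bdflush: allocate now, then write the data out."""
        for offset, data in self.delalloc:
            addr = self.next_free
            self.next_free += 1
            self.disk_blocks[addr] = data
            self.pending_extents[offset] = addr  # transaction not flushed
        self.delalloc.clear()

    def flush_log(self):
        """The transaction recording the extents finally reaches the disk."""
        self.disk_extents.update(self.pending_extents)
        self.pending_extents.clear()

    def readable_after_crash(self):
        """Only data reachable through on-disk extent records survives."""
        return {off: self.disk_blocks[addr]
                for off, addr in self.disk_extents.items()}
```

Between flush_data() and flush_log() the blocks are on disk but the file
does not point at them yet, so a crash in that window loses the write.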
So the upshot of all of this is that I suspect we do have an issue, and we
will get to it at some point. In the meantime there is no need to start
discarding filesystems which do not behave as you want them to if you
pull the plug.
Steve