I am trying to debug a problem that has bugged us for the last few months.
We have set up a large storage system using GlusterFS with XFS
underneath it. We have 5 RHEL 6.2 servers (running
2.6.32-220.7.1.el6.x86_64 when the problem last occurred), each with an
LSI 9285-8e RAID controller with a battery backup unit. System memory is 48GB.
In the few months we have had it running, we have experienced two complete
power outages where everything went down for a long period of time.
After the system came back up, we found some files (between 1-10GB)
truncated. By truncated I mean the file sizes shrank, and we lost the
tails of the files. Since the files were copied from another storage
system, we have the originals to compare against. Furthermore, we have a
cron job that collects the file sizes once a day.
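For context, the daily size-collection job is essentially a `find -printf` over the brick. Here is a runnable stand-in (the real brick path is hypothetical; a temp directory with one known file substitutes for it):

```shell
# Stand-in for the daily size-collection cron job. /gluster/brick1 would
# be the real brick path; here a temp dir with one known file replaces it.
brick=$(mktemp -d)
head -c 1234 /dev/zero > "$brick/sample.dat"

# Record the byte size and path of every regular file under the tree.
find "$brick" -type f -printf '%s\t%p\n' | sort > /tmp/filesizes.today
cat /tmp/filesizes.today
```

Diffing two days' worth of these listings is how we noticed the shrunken sizes.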
However, the troubling thing is that these files were all multiple days
old and were not being written to or accessed at the time of the power outage.
Last week, I sensed some problems with the OS on one of the machines, so
I shut it down cleanly. Right after that, I also upgraded the kernels on
the other 4 servers and rebooted them. After they all came back up, we
discovered truncated files again. I am sure the truncation occurred
within the 24 hours before or after the reboots, since the file sizes we
had collected before the reboot differ from what we collected a few hours
after it. The truncation occurred on the problematic machine, and on one
other, which I had rebooted cleanly.
I spent more time looking at the truncated files this time. I found that
some of the smaller files had actually been truncated to zero length.
I used xfs_bmap to look at the extent allocation, and saw that all of
them were using a single extent. So, from the original file size and the
start location of the truncated file, I tried to extract the bytes from
the raw device and save them to a different directory.
Something like this: dd if=/dev/hdc of=/u1/recovered bs=1
To my amazement, after I wrote the file out this way (assuming the
complete file had also occupied a single extent), the checksum matched the
original file residing on the server I had copied the file from.
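To make the recovery procedure concrete, here is a runnable sketch of it. A scratch file stands in for the raw device, and the device name, start sector, and original size are all hypothetical; in practice they come from `xfs_bmap -v`, which reports extents in 512-byte sectors. Using skip/count in sectors is also far faster than bs=1 on a multi-gigabyte file:

```shell
# Scratch file standing in for the raw device (e.g. /dev/hdc).
dev=$(mktemp)
head -c $((512 * 100)) /dev/urandom > "$dev"

start_sector=10        # first sector of the file's single extent (from xfs_bmap -v)
orig_size=4096         # size of the original, untruncated file

# Copy the extent with skip/count in 512-byte sectors instead of bs=1.
dd if="$dev" of=/tmp/recovered bs=512 skip="$start_sector" \
   count=$(( (orig_size + 511) / 512 )) 2>/dev/null
# Trim the tail of the last sector down to the exact original length.
truncate -s "$orig_size" /tmp/recovered

# Verify against the same byte range read directly (tail offsets are 1-based).
tail -c +$(( start_sector * 512 + 1 )) "$dev" | head -c "$orig_size" > /tmp/expected
cmp /tmp/recovered /tmp/expected && echo "match"
```

On the real system the final comparison was an md5sum against the original copy on the source server.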
These are my questions:
- Under what possible circumstances would the updated inode not be written
to disk when the contents of the file are already on disk?
- I tried to use block dump to debug while trying to reproduce the
problem on another test box. I noticed that xfssyncd and xfsbufd don't
cause data and inodes to be written to disk. It seems that after a file
is written, the data and the dirtied inode are written to disk only when
flush wakes up. Are xfssyncd/xfsbufd only responsible for moving things
into the system cache?
- Can all the flush processes die, or cease to work, on a system that
otherwise continues to function?
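As a control while trying to reproduce this, I've been writing test files with an explicit fsync: if the theory is that dirty inodes are being lost before writeback, files written this way should never show up truncated. A minimal sketch (the path is hypothetical):

```shell
# conv=fsync makes dd call fsync() before exiting, which forces both the
# data and the updated inode (including the file size) to stable storage.
dd if=/dev/zero of=/tmp/durable.test bs=1M count=4 conv=fsync 2>/dev/null

# The recorded size should now survive even an immediate power cut.
stat -c '%s %n' /tmp/durable.test
```

Files written without the fsync are the ones I'd expect to see truncated after a reset, which matches what I observe on the test box.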
I have been trying to reproduce the problem on a test box for the last
few days without success, except that I do see truncation of files that
were newly written and not yet flushed to disk when I reset the test box.
It seems XFS is doing everything right. I tried writing through the
Gluster layer as well as writing directly to the XFS file system, and see
no difference in behavior. I would really like some ideas on what else to
look at.