On 12/10/12 3:12 AM, Matthias Schniedermeyer wrote:
> On 10.12.2012 11:58, Dave Chinner wrote:
>> On Sat, Dec 08, 2012 at 08:29:27PM +0100, Matthias Schniedermeyer wrote:
>>> On 06.12.2012 09:51, Lin Li wrote:
>>>> Hi, Guys. I recently suffered a huge data loss on power cut on an XFS
>>>> partition. The problem was that I copied a lot of files (roughly 20Gb) to
>>>> an XFS partition, then 10 hours later, I got an unexpected power cut. As a
>>>> result, all these newly copied files disappeared as if they had never been
>>>> copied. I tried to check and repair the partition, but xfs_check reports no
>>>> error at all. So I guess the problem is that the meta data for these files
>>>> were all kept in the cache (64Mb) and were never committed to the hard
>>>> What is the cache flush policy for XFS? Does it always reserve some fixed
>>>> space in cache for metadata? I asked because I thought since I copied such
>>>> a huge amount of data, at least some of these files must be fully committed
>>>> to the hard disk, since the cache is only 64MB anyway. But the reality is all of
>>>> them were lost. The only possibility I can think of is that some part of the cache
>>>> was reserved for metadata, so even when the cache is completely full, this part
>>>> will not be written to the disk. Am I right?
>>> I have the same problem, several times.
>>> The latest just an hour ago.
>>> I'm copying a HDD onto another. Plain rsync -a /src/ /tgt/ Both HDDs are
>>> 3TB SATA-drives in a USB3-enclosure with a dm-crypt layer in between.
>>> About 45 minutes into copying the target HDD disconnects for a moment.
>>> 45 minutes means something over 200GB were copied, each file is about
>>> After remounting the filesystems there were exactly 0 files.
>> This sounds like an entirely different problem to what the OP
> For me it sounds only like different timing.
> Otherwise i don't see much difference between files vanishing after a few
> hours (of inactivity) and after a few minutes (while still being active).
>> Did the filesystem have an error returned?
>> i.e. did it shut down (what's in dmesg)?
> There's not much XFS could have done after the block-device vanished.
except to shut down...
> A dis-/reappearing block-device gets a new name because the old name is
> still "in use"; the block-device gets cleaned up after 'umount'ing and
> closing the dm-crypt device.
> When the USB3-HDD disconnected it reappeared a moment later under a new
> name, it bounced between sdc <-> sdf.
> In syslog it's a plain "USB disconnect, device number XX" message.
> Followed by a standard new device found message-bombardment. In between
> there are some error-messages, but as it's practically a yanked out and
> replugged cable, a little complaining by the kernel is to be expected.
Sure, but Dave asked if the filesystem shut down. XFS messages would
tell you that; *were* there messages from XFS in the log from the event?
Sometimes "a little complaining" can be quite informative. :)
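
To make that check concrete, here is a minimal sketch that pulls only the
XFS lines out of a kernel log. The function name xfs_messages and the idea
of feeding it saved dmesg/syslog lines are my own assumptions for
illustration; the only thing taken from this thread is the "XFS (dm-3): ..."
prefix format quoted further down.

```python
import re

# Matches kernel log lines emitted by XFS, e.g. "XFS (dm-3): Mounting Filesystem".
XFS_LINE = re.compile(r"XFS \(([^)]+)\): (.*)")

def xfs_messages(log_lines, device=None):
    """Return (device, message) pairs for XFS lines, optionally for one device.

    Hypothetical helper: log_lines is any iterable of strings, for example
    the lines of a saved dmesg or syslog capture.
    """
    found = []
    for line in log_lines:
        match = XFS_LINE.search(line)
        if match and (device is None or match.group(1) == device):
            found.append((match.group(1), match.group(2).strip()))
    return found
```

If the list comes back with only mount/recovery lines and nothing from the
time of the disconnect, that in itself is worth reporting.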
>> Did you run repair in between the shutdown and remount?
> XFS (dm-3): Mounting Filesystem
> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): Ending recovery (logdev: internal)
>> How many files in that 200GB of data?
> At 0.9GB/file at least 220.
>> Basically, you have an IO error situation, and you have dm-crypt
>> in-between buffering an unknown amount of changes. In my experience,
>> data loss events are rarely filesystem problems when USB drives or
>> dm-crypt is involved...
> I don't know the inner workings of dm-*, but shouldn't it behave
> transparently and rely on the block-layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
>>> After that i started a "while true; do sync ; done"-loop in the
>>> And just while i was writing this email the HDD disconnected a second
>>> time. But this time the files up until the last 'sync' were retained.
>> Exactly as I'd expect.
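
A system-wide sync loop works, but the same effect can be had per file.
Below is a rough sketch, not anything rsync itself does: copy_durably is a
hypothetical helper that copies one file and then calls fsync on both the
file and its parent directory. It assumes Linux/POSIX semantics, and it can
only help if every layer underneath (dm-crypt, the USB bridge, the drive's
write cache) actually honours the flush.

```python
import os
import shutil

def copy_durably(src, dst):
    """Copy src to dst, then ask the kernel to persist data and directory entry."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()              # push Python's userspace buffer to the kernel
        os.fsync(fout.fileno())   # request that the file contents reach the device
    # fsync the parent directory so the new file name is persisted as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Anything copied this way should behave like the files "up until the last
sync" in the second disconnect.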
>>> And something like this has happened to me at least a half dozen times in
>>> the last few months. I think the first time was with kernel 3.5.X, when i
>>> was actually booting into 3.6 with a plain "reboot" (filesystem might
>>> not have been umounted cleanly.), after the reboot the changes of about
>>> the last half hour were gone. e.g. i had renamed a directory about 15
>>> minutes before i rebooted and after the reboot the directory had its
>>> old name back.
>>> Kernel in all but (maybe)one case is between 3.6 and 3.6.2 (currently),
>>> the first time MIGHT have been something around 3.5.8 but i'm not sure.
>>> HDDs were either connected by plain SATA(AHCI) or by USB3 enclosure. All
>>> affected filesystems were/are with a dm-crypt layer inbetween.
>> Given that dm-crypt is the common factor here, I'd start by ruling
>> that out. i.e. reproduce the problem without dm-crypt being used.
> That's a slight problem for me, practically everything i have is
But this is an external drive; you could run a similar test with unencrypted
data on a different hard drive, to try to get to the bottom of this
> Now that i think about it, maybe dm-crypt really is to blame, up until a
> few months ago i was using loop-AES. After dm-crypt got the capability to
> emulate it i have moved over to dm-crypt because the loop-AES support in
> Debian got worse over time. I didn't have any problems until after i
> moved to dm-crypt, but OTOH i'm not the only one using dm-crypt. But
> OTOOH maybe not so many people use the loop-AES compatibility-mode.