> hi
>
> this mail covers two problems: the sync-reboot-data lost problem and
> the xfs_fsr kernel crash with fs corruption.
> i did not separate them, because they happened in the same test session.
>
>
> my system is a k6-500 with ide drive and via chipset (udma enabled).
> /dev/hda6 is xfs root.
> /dev/hda4 is an ext2 root for running xfs_check.
>
>
> i made my sync-reboot-data lost test again with the fix
> (TAKE - fix delalloc data not getting flushed to disk (page_buf.c - 1.53,
> page_buf_io.c - 1.51))
>
> the test was:
>
> cp -av /usr/src/linux/drivers/ drivers
> diff -r -u /usr/src/linux/drivers/ drivers/ (no differs)
> sync
> sync
> sync
>
> hit the reset button.
>
> after the reboot diff again.
>
>
> the first 6-8 tests succeeded.
> i cycled the tests without clean shutdowns in between (hmmm maybe one or two).
>
> then i made only one sync. after reboot some files differ. same problem:
> the size is ok, but there are no extents. the number of differences is small.

Have you tried this with ext2?

OK, the real issue here is that sync is not really doing a sync: the sync
system call starts the I/O, but does not wait for it to complete. At least
two syncs are probably necessary for this scenario right now. If you wait a
few seconds there will be a kernel-initiated sync anyway. And as Ananth
pointed out, there are some bits of I/O which will take a couple of passes
to get triggered.

Obviously we cannot immediately flush everything to disk or performance
would tank. The issue with XFS in this area has always been that the
inode size gets out to disk in advance of the file data. You are looking
at a consistent filesystem, it just has data missing! This is why xfs_check
does not complain.
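
Files in that state (size recorded, no data behind it) can be looked for
from user space. The following is only a rough sketch in Python, not
something from XFS or its tools: it assumes that a file with no extents
reports zero allocated blocks in stat, and the names looks_unwritten and
files_without_blocks are made up for this example. Files that were
created sparse on purpose will match as well.

```python
import os
import stat


def looks_unwritten(size, blocks):
    """True if a file claims a nonzero size but has no blocks allocated."""
    return size > 0 and blocks == 0


def files_without_blocks(root):
    """Return regular files under root whose size is set but blocks are 0.

    Assumption: st_blocks reflects allocated extents. Deliberately sparse
    files are reported too, so treat the result as a list of suspects.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Skip symlinks, devices and other non-regular entries.
            if not stat.S_ISREG(st.st_mode):
                continue
            if looks_unwritten(st.st_size, st.st_blocks):
                suspects.append(path)
    return sorted(suspects)
```

Running files_without_blocks("drivers") after the reboot should point at
roughly the same files that diff complains about, if the assumption holds.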

For apps which require a guarantee of data being down on disk immediately,
O_SYNC or fsync is supposed to be used; we are working on those too at the
moment ;-)
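
For reference, the two application-side options look roughly like this.
It is a minimal sketch in Python using the standard os module (os.fsync
and the os.O_SYNC open flag, Unix only); the helper names write_durably
and write_with_o_sync are hypothetical, and the guarantee is only as good
as the filesystem's fsync/O_SYNC support.

```python
import os


def _write_all(fd, data):
    """Write every byte of data to fd, handling short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_durably(path, data):
    """Write data, then block in fsync() until the kernel reports it on disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # unlike sync, this waits for the I/O to complete
    finally:
        os.close(fd)


def write_with_o_sync(path, data):
    """Same idea, but every write() is synchronous because of O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)
```

With O_SYNC each write pays the cost of going to disk, so a single fsync
after the last write is usually the cheaper of the two for a file copy.
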
Steve
>
> i played with the numbers of sync and the time between sync and hitting
> reset.
>
> when reset is hit just after the sync has finished i lose data.
>
> waiting about 10-20s after the sync finished, everything is ok, regardless
> of the number of syncs.
> after the first sync there is a little disk activity for about 10s.
>
> twice i checked the fs with xfs_check (ext2 root), no errors.
>
>
>
> then i got the idea to run xfs_fsr. the result was a kernel crash and fs
> corruption (this is the first time i got problems with fsr):
>
> kernel BUG at dcache.c:356!
>
> Entering kdb (current=0xc7fbc000, pid 3) Panic: invalid operand
> due to panic @ 0xc0141ec2
> eax = 0x0000001c ebx = 0xc7c870e0 ecx = 0x00000000 edx = 0x00000000
> esi = 0xc7c870c0 edi = 0xc61c3840 esp = 0xc7fbdf98 eip = 0xc0141ec2
> ebp = 0xffffff3b xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010292
> xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc7fbdf64
> kdb> bt
> EBP EIP Function(args)
> 0xffffff3b 0xc0141ec2 prune_dcache+0x76 (0x2a)
> kernel .text 0xc0100000 0xc0141e4c 0xc0141f98
> 0xc0142201 shrink_dcache_memory+0x21 (0x6, 0x4)
> kernel .text 0xc0100000 0xc01421e0 0xc0142210
> 0xc012ad3b do_try_to_free_pages+0x5f (0x4, 0x0)
> kernel .text 0xc0100000 0xc012acdc 0xc012ad58
> 0xc012adcb kswapd+0x73
> kernel .text 0xc0100000 0xc012ad58 0xc012ae68
> 0xc0107457 kernel_thread+0x23
> kernel .text 0xc0100000 0xc0107434 0xc0107464
> kdb> reboot
>
>
> i made some tests with xfs_fsr again. i will mail console capture only to
> Steve Lord <lord@xxxxxxx> because it is very very long (11238 lines).
>
>
> after that (xfs_repair eliminated the corruption) i made the
> sync-reboot-data lost tests again, with the same results as above.
>
> an xfs_check at the end of it reports no errors.
>
>
> btw: after this torture the system is running well. i noticed no corruption
> of old files (ok, not tested very well). i can not imagine what would have
> happened with ext2 or reiserfs. the fs was made about half a year ago.
>
>
> and now i will test the change from Rajagopal Ananthanarayanan.
>
> utz