
To: Eric Sandeen <sandeen@xxxxxxxxxxx>
Subject: Re: Corruption of in-memory data detected
From: Thomas Gutzler <thomas.gutzler@xxxxxxxxx>
Date: Wed, 11 Mar 2009 11:44:31 +0900
Cc: xfs@xxxxxxxxxxx
In-reply-to: <495D88EE.2040406@xxxxxxxxxxx>
References: <169670ec0901011846q1d370e6cu31514519afc8295d@xxxxxxxxxxxxxx> <495D88EE.2040406@xxxxxxxxxxx>
Hi,

a while ago I was having problems with the xfs module on Ubuntu feisty. I
have since upgraded to 8.10 intrepid and am still getting the occasional
bug. The good thing is that the system now keeps running after the bug
occurs, with the load slowly increasing as the random processes affected by
it turn into zombies.

On Fri, Jan 2, 2009 at 12:24, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> Thomas Gutzler wrote:
>> Hi,
>>
>> I've been running an 8x500G hardware SATA RAID5 on an adaptec 31605
>> controller for a while. The operating system is ubuntu feisty with the
>> 2.6.22-16-server kernel. Recently, I added a disk. After the array
>> rebuild was completed, I kept getting errors from the xfs module such
>> as this one:
>> Dec 30 22:55:39 io kernel: [21844.939832] Filesystem "sda": xfs_iflush: Bad inode 1610669723 magic number 0xec9d, ptr 0xe523eb00
>> Dec 30 22:55:39 io kernel: [21844.939879] xfs_force_shutdown(sda,0x8) called from line 3277 of file /build/buildd/linux-source-2.6.22-2.6.22/fs/xfs/xfs_inode.c.  Return address = 0xf8af263c
>> Dec 30 22:55:39 io kernel: [21844.939885] Filesystem "sda": Corruption of in-memory data detected.  Shutting down filesystem: sda
>>
>> My first thought was to run memcheck on the machine, which completed
>> several passes without error; the raid controller doesn't report any
>> SMART failures either.
>
> Both good ideas, but note that "Corruption of in-memory data detected"
> doesn't necessarily mean bad memory (though it might, so memcheck was
> prudent).  0xec9d is not the correct magic nr. for an on-disk inode, so
> that's why things went south.  Were there no storage related errors
> prior to this?
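
Just to be sure I understand the failure mode: the flush path compares the
in-core copy of the on-disk inode magic against what it expects and shuts
the filesystem down when they differ. A minimal sketch of that kind of check
(my own simplification, not the actual xfs_iflush code; the on-disk magic
XFS_DINODE_MAGIC is 0x494e, ASCII "IN"):

#include <stdio.h>
#include <stdint.h>

#define XFS_DINODE_MAGIC 0x494e  /* "IN", the expected on-disk inode magic */

/* Return non-zero if the magic read from the in-core inode looks sane. */
static int dinode_magic_ok(uint16_t magic)
{
        return magic == XFS_DINODE_MAGIC;
}

int main(void)
{
        uint16_t seen = 0xec9d;  /* the value reported in the log above */

        if (!dinode_magic_ok(seen))
                printf("Bad inode magic number 0x%x, expected 0x%x\n",
                       (unsigned)seen, (unsigned)XFS_DINODE_MAGIC);
        return 0;
}

So if I read the log right, the in-memory inode was already trampled by the
time XFS tried to write it back, which matches your reading.
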
>
>> After an xfs_repair, which fixed a few things,
>
> Knowing which things were fixed might lend some clues ...
>
>> I mounted the file
>> system but the error kept reappearing after a few hours unless I
>> mounted read-only. Since xfs_ncheck -i always exited with 'Out of
>> memory'
>
> xfs_check takes a ton of memory; xfs_repair much less so
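
(Side note for the archives: xfs_repair also has a no-modify mode,
'xfs_repair -n <device>', which only reports what it would fix without
writing anything; assuming memory is the limiting factor here, that may be a
more practical consistency check for me than xfs_check.)
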
>
>> I decided to reduce the max amount of inodes to 1% (156237488)
>> by running xfs_growfs -m 1 - the total amount of inodes used is still
>> less than 1%. Unfortunately, both xfs_check and xfs_ncheck still say
>> 'out of memory' with 2GB installed.
>
> the max inodes really have no bearing on check or repair memory usage;
> it's just an upper limit on how many inodes *could* be created.
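
(For reference, that cap should show up as the imaxpct= value in the
xfs_info output for the mount point, so I can at least confirm the
xfs_growfs -m 1 change took effect.)
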
>
>> After the modification, the file system survived for a day until the
>> following happened:
>> Jan  2 09:33:29 io kernel: [232751.699812] BUG: unable to handle
>> kernel paging request at virtual address 0003fffb
>> Jan  2 09:33:29 io kernel: [232751.699848]  printing eip:
>> Jan  2 09:33:29 io kernel: [232751.699863] c017d872
>> Jan  2 09:33:29 io kernel: [232751.699865] *pdpt = 000000003711e001
>> Jan  2 09:33:29 io kernel: [232751.699881] *pde = 0000000000000000
>> Jan  2 09:33:29 io kernel: [232751.699898] Oops: 0002 [#1]
>> Jan  2 09:33:29 io kernel: [232751.699913] SMP
>> Jan  2 09:33:29 io kernel: [232751.699931] Modules linked in: nfs nfsd
>> exportfs lockd sunrpc xt_tcpudp nf_conntrack_ipv4 xt_state
>> nf_conntrack nfnetlink iptable_filter ip_tables x_tables ipv6 ext2
>> mbcache coretemp w83627ehf i2c_isa i2c_core acpi_cpufreq
>> cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand
>> freq_table cpufreq_conservative psmouse serio_raw pcspkr shpchp
>> pci_hotplug evdev intel_agp agpgart xfs sr_mod cdrom pata_jmicron
>> ata_piix sg sd_mod ata_generic ohci1394 ieee1394 ahci libata e1000
>> aacraid scsi_mod uhci_hcd ehci_hcd usbcore thermal processor fan fuse
>> apparmor commoncap
>> Jan  2 09:33:29 io kernel: [232751.700180] CPU:    1
>> Jan  2 09:33:29 io kernel: [232751.700181] EIP: 0060:[__slab_free+50/672]    Not tainted VLI
>> Jan  2 09:33:29 io kernel: [232751.700182] EFLAGS: 00010046 (2.6.22-16-server #1)
>> Jan  2 09:33:29 io kernel: [232751.700234] EIP is at __slab_free+0x32/0x2a0
>
> Memory corruption perhaps?
>
>> Jan  2 09:33:29 io kernel: [232751.700252] eax: 0000ffff   ebx: ffffffff   ecx: ffffffff   edx: 000014aa
>> Jan  2 09:33:29 io kernel: [232751.700273] esi: c17fffe0   edi: e6b8e0c0   ebp: f8ac2c8c   esp: c21dfe44
>> Jan  2 09:33:29 io kernel: [232751.700293] ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
>> Jan  2 09:33:29 io kernel: [232751.700313] Process kswapd0 (pid: 198, ti=c21de000 task=c21f39f0 task.ti=c21de000)
>> Jan  2 09:33:29 io kernel: [232751.700334] Stack: 00000000 00000065 00000000 fffffffe ffffffff c17fffe0 00000287 e6b8e0c0
>> Jan  2 09:33:29 io kernel: [232751.700378]        00000001 c017e3fe f8ac2c8c cecb7d20 00000001 df2e2600 f8ac2c8c df2e2600
>> Jan  2 09:33:29 io kernel: [232751.700422]        f8d7559c e8247900 f8ac5224 df2e2600 f8d7559c e8247900 f8ae1606 00000001
>> Jan  2 09:33:29 io kernel: [232751.700466] Call Trace:
>> Jan  2 09:33:29 io kernel: [232751.700499]  [kfree+126/192] kfree+0x7e/0xc0
>> Jan  2 09:33:29 io kernel: [232751.700519]  [<f8ac2c8c>] xfs_idestroy_fork+0x2c/0xf0 [xfs]
>> Jan  2 09:33:29 io kernel: [232751.700561]  [<f8ac2c8c>] xfs_idestroy_fork+0x2c/0xf0 [xfs]
>> Jan  2 09:33:29 io kernel: [232751.700601]  [<f8ac5224>] xfs_idestroy+0x44/0xb0 [xfs]
>> Jan  2 09:33:29 io kernel: [232751.700640]  [<f8ae1606>] xfs_finish_reclaim+0x36/0x160 [xfs]
>> Jan  2 09:33:29 io kernel: [232751.700681]  [<f8af1c47>] xfs_fs_clear_inode+0x97/0xc0 [xfs]
>> Jan  2 09:33:29 io kernel: [232751.700721]  [clear_inode+143/320] clear_inode+0x8f/0x140
>> Jan  2 09:33:29 io kernel: [232751.700743]  [dispose_list+26/224] dispose_list+0x1a/0xe0
>> Jan  2 09:33:29 io kernel: [232751.700765]  [shrink_icache_memory+379/592] shrink_icache_memory+0x17b/0x250
>> Jan  2 09:33:29 io kernel: [232751.700789]  [shrink_slab+279/368] shrink_slab+0x117/0x170
>> Jan  2 09:33:29 io kernel: [232751.700815]  [kswapd+859/1136] kswapd+0x35b/0x470
>> Jan  2 09:33:29 io kernel: [232751.700842]  [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
>> Jan  2 09:33:29 io kernel: [232751.700867]  [kswapd+0/1136] kswapd+0x0/0x470
>> Jan  2 09:33:29 io kernel: [232751.700886]  [kthread+66/112] kthread+0x42/0x70
>> Jan  2 09:33:29 io kernel: [232751.700904]  [kthread+0/112] kthread+0x0/0x70
>> Jan  2 09:33:29 io kernel: [232751.700923]  [kernel_thread_helper+7/28] kernel_thread_helper+0x7/0x1c
>> Jan  2 09:33:29 io kernel: [232751.700946]  =======================
>> Jan  2 09:33:29 io kernel: [232751.700962] Code: 53 89 cb 83 ec 14 8b 6c 24 28 f0 0f ba 2e 00 19 c0 85 c0 74 0a 8b 06 a8 01 74 ef f3 90 eb f6 f6 06 02 75 48 0f b7 46 0a 8b 56 14 <89> 14 83 0f b7 46 08 89 5e 14 83 e8 01 f6 06 40 66 89 46 08 75
>> Jan  2 09:33:29 io kernel: [232751.701128] EIP: [__slab_free+50/672] __slab_free+0x32/0x2a0 SS:ESP 0068:c21dfe44
>>
>> Any thoughts what this could be or what could be done to fix it?
>
> seems like maybe something went wrong w/ the raid rebuild, if that's
> when things started going south.  Do you get any storage error related
> messages at all?

I couldn't find any errors other than the dump in dmesg (see below).
I also called Adaptec, and they said they have never had a memory failure
in their RAID cards.

> Ubuntu knows best what's in this oldish distro kernel, I guess; I don't
> know offhand what might be going wrong.  If they have a debug kernel
> variant, you could run that to see if you get earlier indications of
> problems.
>
> If you can reproduce on a more recent upstream kernel, that would be
> interesting.

Here's the dmesg output:
[1369713.678092] BUG: unable to handle kernel paging request at ffff87f947da5088
[1369713.682882] IP: [<ffffffffa01db24f>] xfs_dir2_block_lookup_int+0xcf/0x210 [xfs]
[1369713.687802] PGD 0
[1369713.688055] Oops: 0000 [1] SMP
[1369713.688055] CPU 1
[1369713.688055] Modules linked in: nls_cp437 cifs nfsd auth_rpcgss exportfs wmi video output sbs sbshc
pci_slot container battery ac xt_tcpudp nf_conntrack_ipv4 xt_state nf_conntrack nfs lockd nfs_acl sunrpc
ipv6 iptable_filter ip_tables x_tables ext3 jbd mbcache cpufreq_userspace cpufreq_stats cpufreq_powersave
cpufreq_ondemand cpufreq_conservative acpi_cpufreq freq_table sbp2 parport_pc lp parport loop evdev pcspkr
iTCO_wdt iTCO_vendor_support button shpchp pci_hotplug intel_agp xfs pata_jmicron sd_mod crc_t10dif sg
pata_acpi ata_piix ohci1394 ieee1394 aacraid ata_generic ahci libata scsi_mod e1000e dock uhci_hcd
ehci_hcd usbcore thermal processor fan fuse vesafb fbcon tileblit font bitblit softcursor
[1369713.688055] Pid: 5278, comm: smbd Not tainted 2.6.27-11-server #1
[1369713.688055] RIP: 0010:[<ffffffffa01db24f>]  [<ffffffffa01db24f>] xfs_dir2_block_lookup_int+0xcf/0x210 [xfs]
[1369713.688055] RSP: 0018:ffff88007120ba28  EFLAGS: 00010286
[1369713.688055] RAX: 00000005a072ded8 RBX: 00000000da072ded RCX: 0000000000000000
[1369713.688055] RDX: 00000000b40e5bda RSI: ffff88007a474c48 RDI: ffffffffda072ded
[1369713.688055] RBP: ffff88007120ba98 R08: ffff87fa77a0e120 R09: 00000000ffffffff
[1369713.688055] R10: 00000000db5b0eb4 R11: ffff88001813bff8 R12: ffff88007120bae8
[1369713.688055] R13: 000000000000002a R14: 0000000000000000 R15: ffff88007120bb74
[1369713.688055] FS:  00007f79bf44e700(0000) GS:ffff88007f802880(0000) knlGS:0000000000000000
[1369713.688055] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1369713.688055] CR2: ffff87f947da5088 CR3: 0000000053e88000 CR4: 00000000000006e0
[1369713.688055] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1369713.688055] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1369713.688055] Process smbd (pid: 5278, threadinfo ffff88007120a000, task ffff88000350ace0)
[1369713.688055] Stack:  ffff88007120bd68 ffffffff802fa04c ffff88007120bab4 ffff88007120baa8
[1369713.688055]  ffff88001813b000 ffff88007acb3000 0000000000000000 ffffffffa01d0289
[1369713.688055]  ffff88007a474c48 ffff880035a2c8c0 ffff88007120bae8 ffff88007120bae8
[1369713.688055] Call Trace:
[1369713.688055]  [<ffffffff802fa04c>] ? do_select+0x5bc/0x610
[1369713.688055]  [<ffffffffa01d0289>] ? xfs_bmbt_get_blockcount+0x9/0x20 [xfs]
[1369713.688055]  [<ffffffffa01db470>] xfs_dir2_block_lookup+0x20/0xc0 [xfs]
[1369713.688055]  [<ffffffffa01da7e5>] xfs_dir_lookup+0x195/0x1c0 [xfs]
[1369713.688055]  [<ffffffffa020a6cb>] xfs_lookup+0x7b/0xe0 [xfs]
[1369713.688055]  [<ffffffff802ff9f6>] ? __d_lookup+0x16/0x150
[1369713.688055]  [<ffffffffa02159a1>] xfs_vn_lookup+0x51/0x90 [xfs]
[1369713.688055]  [<ffffffff802f495e>] real_lookup+0xee/0x170
[1369713.688055]  [<ffffffff802f4a90>] do_lookup+0xb0/0x110
[1369713.688055]  [<ffffffff802f50fd>] __link_path_walk+0x60d/0xc20
[1369713.688055]  [<ffffffff80305ef6>] ? mntput_no_expire+0x36/0x160
[1369713.688055]  [<ffffffff802f5c4e>] path_walk+0x6e/0xe0
[1369713.688055]  [<ffffffff802f5e63>] do_path_lookup+0xe3/0x200
[1369713.688055]  [<ffffffff802f386a>] ? getname+0x4a/0xb0
[1369713.688055]  [<ffffffff802f6cbb>] user_path_at+0x7b/0xb0
[1369713.688055]  [<ffffffff802eda68>] ? cp_new_stat+0xe8/0x100
[1369713.688055]  [<ffffffff802ede7d>] vfs_stat_fd+0x2d/0x60
[1369713.688055]  [<ffffffff802edf5c>] sys_newstat+0x2c/0x50
[1369713.688055]  [<ffffffff8021285a>] system_call_fastpath+0x16/0x1b
[1369713.688055]
[1369713.688055]
[1369713.688055] Code: f8 31 c9 45 8b 13 4d 89 d8 44 89 d2 0f ca 89 d0 83 ea 01 48 c1 e0 03 49 29 c0 eb 07 8d 4b 01 39 ca 7c 1c 8d 1c 11 d1 fb 48 63 fb <41> 8b 04 f8 0f c8 41 39 c5 74 26 77 e4 8d 53 ff 39 ca 7d e4 48
[1369713.688055] RIP  [<ffffffffa01db24f>] xfs_dir2_block_lookup_int+0xcf/0x210 [xfs]
[1369713.688055]  RSP <ffff88007120ba28>
[1369713.688055] CR2: ffff87f947da5088
[1369714.057232] ---[ end trace 0735c8702d5e7899 ]---

What can I do to help get this fixed?

Tom
