Re: XFS: I/O Error Detected /

To: Piotr Kandziora <piotr.kandziora@xxxxxxxxxx>
Subject: Re: XFS: I/O Error Detected /
From: Emmanuel Florac <eflorac@xxxxxxxxxxxxxx>
Date: Wed, 17 Nov 2010 12:35:17 +0100
Cc: xfs@xxxxxxxxxxx, Artur Piechocki <artur.piechocki@xxxxxxxxxx>, Janusz Bak <jb@xxxxxxxxxx>, lukasz.wittig@xxxxxxxxxx
In-reply-to: <4CE3B1C0.3060308@xxxxxxxxxx>
Organization: Intellique
References: <4CE282DB.8060200@xxxxxxxxxx> <20101116214415.61ecb7cd@xxxxxxxxxxxxxx> <4CE3B1C0.3060308@xxxxxxxxxx>
On Wed, 17 Nov 2010 11:43:12 +0100,
Piotr Kandziora <piotr.kandziora@xxxxxxxxxx> wrote:

> - debian based distribution with a lot of modification,
> - architecture x86_64,

So this isn't a case of a 32-bit rsync running out of memory (case

> - xfs tools version 2.10.1
> - mount options for this LV: 
> rw,noatime,nodiratime,attr2,nobarrier,usrquota,prjquota,grpquota
> - NFS share is exported with following options: 
> rw,no_root_squash,insecure,insecure_locks,async,anonuid=65534,anongid=65534,subtree_check
> >> Unfortunately our system is freezing unexpectedly without reason.
> >>      
> > What are the symptoms ? does the whole system freeze up? Or does it
> > crash with kernel panic, or otherwise "Oops" messages?
> >    
> Symptoms are different. One time we've got a few oom-killers:

This is abnormal, but the kernel disk cache shouldn't interfere anyway.

> [kern.warning] kernel: load_average invoked oom-killer:

> [kern.warning] kernel: 3dm2 invoked oom-killer: gfp_mask=0xd0,

> 2010/11/11 10:56:21|Pid: 4324, comm: nfsd Not tainted 

> Filesystem "dm-37": xfs_log_force: error 5 returned.

This looks like an I/O error (error 5 is EIO), as Eric Sandeen said.
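
For anyone who wants to check what the number in such a message means, here is a
minimal sketch that pulls the device, function and error code out of the line
quoted above and looks the code up in the standard errno table. The regular
expression and the helper name decode_xfs_error are my own assumptions, written
against this one message format; other kernel versions may word it differently.

```python
import errno
import os
import re

# Assumed pattern, modelled only on the message quoted in this thread.
XFS_ERROR_RE = re.compile(
    r'Filesystem "(?P<dev>[^"]+)": (?P<func>\w+): error (?P<code>\d+) returned'
)


def decode_xfs_error(line):
    """Return (device, function, errno name, description), or None if no match."""
    match = XFS_ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    name = errno.errorcode.get(code, "unknown")  # symbolic name, e.g. EIO
    return match.group("dev"), match.group("func"), name, os.strerror(code)


line = 'Filesystem "dm-37": xfs_log_force: error 5 returned.'
print(decode_xfs_error(line))
```

The description string comes from the C library, so its exact wording depends
on the platform; the symbolic name is the part to rely on.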

> We've checked temperature on each disk using LSI/3ware CLI (tw_cli)
> and average is 35C.

It's the card itself that overheats and fails, not the disks. I had to
replace quite a number of these in the field after a year or two of use;
they work well for a while, but crash under heavy load.

However, this should cause a complete I/O freeze. Did you try to run
something like "tw_cli alarms" right after the failure?

If tw_cli can't communicate with the controller anymore, then that's a
case of a fried card.
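
A rough sketch of how that check could be scripted, so it can be run right after
a failure without hanging the shell if the controller no longer answers. The
command is the "tw_cli alarms" invocation mentioned above; the helper name
probe_controller, the 30-second timeout, and the idea that a non-zero exit
status means the CLI could not reach the card are all assumptions on my part,
not documented tw_cli behaviour.

```python
import subprocess


def probe_controller(cmd=("tw_cli", "alarms"), timeout=30):
    """Run the controller CLI and classify the outcome.

    Returns (status, output) where status is one of:
      "ok"      - the command ran and exited with status 0
      "failed"  - the command ran but exited non-zero (assumed to mean trouble)
      "timeout" - no answer within `timeout` seconds (controller may be hung)
      "missing" - the command is not installed or not on PATH
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return "missing", ""
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout


if __name__ == "__main__":
    status, output = probe_controller()
    print(status)
    print(output)
```

A "timeout" or "failed" result right after the freeze would point at the card
rather than the disks, while "ok" with alarm entries at least gives something
to read.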

> This problem occurred two times in the past, we repaired fs using 
> xfs_repair (and it showed errors). We simulated it using dumping
> cache yesterday ...

I don't understand how or why dumping cache would cause the system to
fail... except that it may increase actual disk I/O.

Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@xxxxxxxxxxxxxx>
                    |   +33 1 78 94 84 02