
Re: linux software RAID, 2.6.6, XFS, Postgres: corrupt files

Subject: Re: linux software RAID, 2.6.6, XFS, Postgres: corrupt files
From: "Thomas J. Teixeira" <tjt@xxxxxxxxxxxxxx>
Date: Thu, 28 Apr 2005 11:32:04 -0400
Cc: linux-xfs <linux-xfs@xxxxxxxxxxx>
In-reply-to: <20050418050023.GD721@frodo>
References: <1113594280.9697.103.camel@xxxxxxxxxxxxxxxxxxxxxxx> <20050418050023.GD721@frodo>
Sender: linux-xfs-bounce@xxxxxxxxxxx

On Mon, 2005-04-18 at 01:00, Nathan Scott wrote:

> Your post reminded me of another fix - if you are on 2.6.6 you will
> likely not have this... (see xfs_aops.c CVS revision history)
> date: 2004/09/30 01:37:19;  author: nathans;  state: Exp;  lines: +19 -12
> modid: xfs-linux-melb:xfs-kern:19622a
> Fix sync issues - use correct writepage page re-dirty interface, and do
> not clear dirty flag if page only partially written.
> IIRC, that was merged in 2.6.9 (but don't quote me on that, it was
> awhile ago).
> This may well resolve the issue you're seeing, it resolved all the
> known data sync problems at the time, although there now seems to be
> a harder-to-hit issue that I'm working through with Rich and James.

Thanks for the pointer. We have installed a SuSE 9.1 update kernel,
version 2.6.5-7.147, that includes this patch and works with the rest
of our software. Preliminary testing suggests this resolves the data
loss on reboot.

We're seeing a different problem which we at first thought was
unrelated, but which may just be a different symptom of the underlying
problem. We are getting system hangs after several days of operation,
typically 5 to 6 days. We don't have any information on most of these:
the systems seemed to be responding to local network traffic, but were
not writing to disk. That includes syslog on the root file system,
which is ext3, not xfs.

We did happen to be connected to one of our test systems and noticed it
stopped writing to disk, and the syslogd was in disk wait (according to
ps output). According to /proc/meminfo, there were lots of dirty pages
(about 400M out of 1G, or ~ 40%), and running sync did not clean out
these dirty pages. Looking at the kernel code, I can see where all
writes will cease if the dirty pages exceed /proc/sys/vm/dirty_ratio --
40%, and this would prevent writes to _all_ filesystems regardless of
which file system owns the dirty pages.
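
For anyone who wants to check for the same condition, here is a minimal
sketch of the arithmetic above: read the Dirty and MemTotal fields from
/proc/meminfo and compare the percentage against
/proc/sys/vm/dirty_ratio. The function names are hypothetical, and
dividing by MemTotal is a simplification that matches the rough
"400M out of 1G" estimate; as far as I can tell the kernel measures the
threshold against the memory it considers available for dirtying, so
treat the result as an approximation only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Values are the leading integer on each line (kB for most fields).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_percent(meminfo_text):
    """Dirty pages as a percentage of MemTotal (a simplification)."""
    fields = parse_meminfo(meminfo_text)
    return 100.0 * fields["Dirty"] / fields["MemTotal"]


def over_dirty_ratio(meminfo_text, dirty_ratio):
    """True if the approximate dirty percentage has reached dirty_ratio."""
    return dirty_percent(meminfo_text) >= dirty_ratio


def check_live_system():
    """Read the real /proc files (Linux only) and report the comparison."""
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open("/proc/sys/vm/dirty_ratio") as f:
        ratio = int(f.read().strip())
    return dirty_percent(meminfo), ratio, over_dirty_ratio(meminfo, ratio)
```

Calling check_live_system() every few minutes and logging the result
should show whether the dirty count climbs steadily toward the limit
over the days before a hang, or jumps there all at once.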

My question is whether anyone recalls that a symptom of the bug fixed
back in the 2.6.9 kernel was this growth in dirty pages. The checkin
comment looks more as if the symptom was losing a dirty page bit rather
than marking a page as dirty forever.

Thanks again,

- Tom
