
Re: trying to avoid a lengthy quotacheck by deleting all quota data

To: xfs@xxxxxxxxxxx
Subject: Re: trying to avoid a lengthy quotacheck by deleting all quota data
From: Harry <harry@xxxxxxxxxxxxxxxxxx>
Date: Thu, 05 Mar 2015 17:09:57 +0000
Cc: "developers@xxxxxxxxxxxxxxxxxx" <developers@xxxxxxxxxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <54F88CEC.4030009@xxxxxxxxxxxxxxxxxx>
References: <54EC958E.2000001@xxxxxxxxxxxxxxxxxx> <20150224215907.GA18360@dastard> <54EF1A8F.7030505@xxxxxxxxxxxxxxxxxx> <54F856E7.10006@xxxxxxxxxxxxxxxxxx> <54F87BF3.3000405@xxxxxxxxxxx> <54F88CEC.4030009@xxxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
PS.  We might be interested in getting a better estimate of how long a quotacheck would take.  From an old thread on the mailing list, we see this suggestion:

xfstests:src/bstat

We're a bit worried about running this on the live system, because it might impact performance substantially.  Is that an unfounded worry?  I presume it's a read-only operation, so it would be safe to kill it if we see performance degradation?

rgds,
Harry + the team.

On 05/03/15 17:05, Harry wrote:
Thanks for the reply Eric.

One of our problems is that we're limited in terms of what manipulations we can apply to the live system, and so instead we've been running our experiments against the backup system, and you're quite right that DRBD may be introducing some weirdness of its own, so those experiments may not be safe to draw conclusions from.

Here's what we know about the live system:
-> it had an outage, equivalent to having its power cable yanked, or doing an 'echo b > /proc/sysrq-trigger'
-> when it came back, it decided to mount the drive without quotas.
-> we saw a message in syslog saying "Failed to initialize disk quotas"
-> last time we had to run a quotacheck (several months ago) it took about 2 hours.

We can repro the quotacheck issue on our test clusters, as follows:
-> kick off a job that writes to the disk
-> hard reboot with "echo b > /proc/sysrq-trigger"
-> on next boot, see "Failed to initialize disk quotas" message, xfs mounts without quotas
-> soft reboot with "reboot"
-> on next boot, see "Quotacheck needed: Please wait." message.
-> Quotacheck completes some time later.
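
The sequence above can be checked mechanically by scanning the kernel log for the two messages quoted in this thread. What follows is a minimal sketch, not an XFS tool: the message texts are taken from the steps above, while the function name, the labels, and the assumption that each message appears verbatim on a single log line are illustrative only.

```python
# Message texts are quoted from the thread; the labels are hypothetical names
# chosen for this sketch.
QUOTA_MESSAGES = {
    "Failed to initialize disk quotas": "mounted_without_quotas",
    "Quotacheck needed: Please wait.": "quotacheck_started",
}


def classify_quota_events(lines):
    """Return the labels of quota-related messages, in the order they appear."""
    events = []
    for line in lines:
        for text, label in QUOTA_MESSAGES.items():
            # Plain substring match: assumes the message is not split
            # across log lines.
            if text in line:
                events.append(label)
    return events
```

Feeding it the lines of the syslog file across both boots (the file's location depends on the distribution) should show a "mounted_without_quotas" event from the hard reboot followed by a "quotacheck_started" event from the soft reboot, if the behaviour described above is being reproduced.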

So our best-case scenario is that, next time we reboot, we'll have an outage of about 2 hours.  And our paranoid worst-case scenario, induced by our experiments with our DRBD backup drives, is that the disk will actually turn out not to be mountable at all.

Is the "quotacheck always required after hard reboot" behaviour that we're observing something you expected?  You seemed to be saying that, because quotas are journaled, it shouldn't be needed?

HP

On 05/03/15 15:53, Eric Sandeen wrote:
On 3/5/15 7:15 AM, Harry wrote:
Update -- so far, we've not managed to gain any confidence that we'll
ever be able to re-mount that disk. The general consensus seems to be
to fish all the data off the disk using rsync, and then move off XFS
to ext4.

Not a very helpful message for y'all to hear, I know. But if it's any
help in prioritising your future work, I think the dealbreaker for us
was the inescapable quotacheck on mount, which means that any time a
fileserver goes down unexpectedly, we have an unavoidable,
indeterminate-but-long period of downtime...

hp
What you decide to use is up to you of course, and causes us no
heartbreak.  :)  But I think you fundamentally misunderstand the situation;
an unexpected fileserver failure should not result in a lengthy quotacheck
on xfs, because xfs quota is journaled, and will simply be replayed along with
the rest of the log.

I honestly don't know what has led you to the conclusion that remounting
the filesystem will lead to any quotacheck at all, let alone a lengthy one.

* We're even a bit worried the disk might be in a broken state, such
that the quotacheck won't actually complete successfully at all.
If your disk is broken, that's not a filesystem issue.  It seems possible
that whatever drbd manipulation you're doing is causing an issue, but because
you haven't really explained it in detail, I don't know.

We take DRBD offline, so it's no longer writing, then we take
snapshots of the drives, then remount those elsewhere so we can
experiment without disturbing the live system.
Did you quiesce the filesystem first with e.g. xfs_freeze?
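
For reference, quiescing around a snapshot could look roughly like the sketch below. It assumes the standard `xfs_freeze -f` (freeze) and `xfs_freeze -u` (unfreeze) invocations; the helper name `frozen`, the `run` parameter, and the mount point in the usage note are hypothetical, and the snapshot step itself is left to the caller because the thread does not say how the snapshots are taken.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen(mountpoint, run=subprocess.check_call):
    """Freeze an XFS filesystem for the duration of the with-block.

    `run` defaults to subprocess.check_call; it is a parameter only so the
    command sequence can be inspected without touching a real filesystem.
    """
    run(["xfs_freeze", "-f", mountpoint])  # suspend writes, flush to disk
    try:
        yield
    finally:
        # Always unfreeze, even if the snapshot step raised an error.
        run(["xfs_freeze", "-u", mountpoint])
```

Usage would be along the lines of `with frozen("/mnt/data"): take_snapshot()`, where `take_snapshot` stands in for whatever snapshot mechanism is in use. The `finally` clause matters because a filesystem left frozen blocks all writers until it is thawed.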

So far this thread has been long on prose and speculation, and short
on actual analysis, log messages, etc.  Feel free to use ext4 or whatever
suits you, but given that nothing in this thread has implicated misbehavior
by xfs, I don't think that switching filesystems will solve the perceived
problem.

-Eric

Rgds,
Harry + the PythonAnywhere team.

-- 
Harry Percival
Developer
harry@xxxxxxxxxxxxxxxxxx

PythonAnywhere - a fully browser-based Python development and hosting environment
<http://www.pythonanywhere.com/>

PythonAnywhere LLP
17a Clerkenwell Road, London EC1M 5RD, UK
VAT No.: GB 893 5643 79
Registered in England and Wales as company number OC378414.
Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK
