Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life

To: Hannes Reinecke <hare@xxxxxxx>
Subject: Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
From: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Mon, 11 Jan 2016 07:55:14 -0800
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>, XFS Developers <xfs@xxxxxxxxxxx>, linux-block@xxxxxxxxxxxxxxx, linux-nvdimm <linux-nvdimm@xxxxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>, Jens Axboe <axboe@xxxxxx>, Jan Kara <jack@xxxxxxxx>, linux-fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>, Matthew Wilcox <willy@xxxxxxxxxxxxxxx>, Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
In-reply-to: <5693C935.3060701@xxxxxxx>
References: <20160104181220.24118.96661.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20160104182005.24118.50361.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20160109075414.GA5008@xxxxxxxxxxxxxxxxxx> <5693C935.3060701@xxxxxxx>
On Mon, Jan 11, 2016 at 7:24 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> On 01/09/2016 08:54 AM, Al Viro wrote:
>>
>> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>>>
>>> Historically we have waited for filesystem specific heuristics to
>>> attempt to guess when a block device is gone.  Sometimes this works, but
>>> in other cases the system can hang waiting for the fs to trigger its
>>> shutdown protocol.
>>>
>>> The initial motivation for this investigation was to prevent DAX
>>> mappings (direct mmap access to persistent memory) from leaking past the
>>> lifetime of the hosting block device.  However, Dave points out that
>>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>>
>>>      For example, if we detect a free space corruption during allocation,
>>>      it is not safe to trust *any active mapping* because we can't trust
>>>      that we haven't handed out the same block to multiple owners. Hence
>>>      on such a filesystem shutdown, we have to prevent any new DAX
>>>      mapping from occurring and invalidate all existing mappings as we
>>>      cannot allow userspace to modify any data or metadata until we've
>>>      resolved the corruption situation.
>>>
>>> The current block device shutdown sequence of del_gendisk +
>>> blk_cleanup_queue is problematic.  We want to tell the fs after
>>> blk_cleanup_queue that there is no possibility of recovery, but by that
>>> time we have deleted partitions and lost the ability to find all the
>>> super-blocks on a block device.
>>>
>>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>>> notifications to all the filesystems hosted on the disk.  Here
>>> ->quiesce() is a set of 'shutdown' operations run while the bdev may still be alive,
>>> and ->bdi_gone() is a set of actions to take after the backing device
>>> is known to be permanently dead.
>>
>>
>>         Would you mind explaining what the hell is _the_ backing device
>> of a filesystem?  What does that translate into in case of e.g. btrfs
>> spanning several disks?  Or ext4 with journal on a different device, for
>> that matter?
>>
>>         If anything, I would argue that filesystem is out of place here -
>> general situation is "IO on X may require IO on device Y and X needs to do
>> something when Y goes away".  Consider e.g. /dev/loop backed by a device
>> that went away.  Or by a file on fs that has run down the curtain and joined
>> the bleedin choir invisible.  With another fs partially hosted by that
>> loopback device.  Or by RAID0 containing said device.
>>
>>         You are given Y and attempt to locate the affected X.  _Then_
>> you assume that X is a filesystem and has "something to be done" independent
>> from the role Y played for it, so you can pick that action from superblock
>> method.
>>
>>         IMO you are placing the burden in the wrong place.  _Recipient_
>> knows what it depends upon and what should be done for each source of
>> trouble.  So make it recipient's responsibility to request notifications.
>> At which point the superblock method goes away, along with the requirement
>> to handle all sources of trouble the same way, etc.
>>
>>         What's more, things like RAID5 (also interested in knowing when
>> a component has been ripped out) might or might not decide to propagate
>> the event further - after all, that's exactly the point of redundancy.
>>
>>         I'd look into something along the lines of notifier chain per
>> gendisk, with potential victims registering a callback when they decide
>> that from now on such and such device might screw them over...
>
>
> Fully support this. I was planning on something similar to transport device
> changes (resizing, topology change etc).
>
> And it might even be an idea to convert the block device events to a
> notifier chain, too.
>
> Dan, can you keep me in the loop here?

Yes, will do.
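
As a rough illustration of the "notifier chain per gendisk" idea quoted
above, here is a small userspace sketch. It is not kernel code and not part
of this patch series. Every name in it (fake_gendisk, disk_notifier,
disk_notify, fake_fs, the DISK_* events) is hypothetical and chosen only for
the example. The kernel has its own notifier infrastructure in
include/linux/notifier.h, which this sketch deliberately does not use so
that it can be compiled standalone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes and return values; not a kernel API. */
enum disk_event { DISK_QUIESCE, DISK_GONE };
enum chain_action { CHAIN_CONTINUE, CHAIN_STOP };

struct fake_gendisk;

/* One registered listener: a callback plus the object that depends on the disk. */
struct disk_notifier {
    enum chain_action (*call)(struct disk_notifier *nb, enum disk_event ev,
                              struct fake_gendisk *disk);
    void *owner;
    struct disk_notifier *next;
};

/* Stand-in for a gendisk that carries its own notifier chain. */
struct fake_gendisk {
    const char *name;
    struct disk_notifier *chain;
};

/* A dependent asks to be told about trouble on this particular disk. */
void disk_notifier_register(struct fake_gendisk *disk, struct disk_notifier *nb)
{
    assert(nb->call != NULL);
    nb->next = disk->chain;
    disk->chain = nb;
}

/* Returns 0 if nb was found and unlinked, -1 otherwise. */
int disk_notifier_unregister(struct fake_gendisk *disk, struct disk_notifier *nb)
{
    struct disk_notifier **pp;

    for (pp = &disk->chain; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == nb) {
            *pp = nb->next;
            nb->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Walk the chain for one event; returns how many callbacks ran. */
int disk_notify(struct fake_gendisk *disk, enum disk_event ev)
{
    struct disk_notifier *nb = disk->chain;
    int called = 0;

    while (nb != NULL) {
        struct disk_notifier *next = nb->next; /* callback may unregister nb */

        called++;
        if (nb->call(nb, ev, disk) == CHAIN_STOP)
            break;
        nb = next;
    }
    return called;
}

/* Example dependent: an fs with data on one disk and its journal on another. */
struct fake_fs {
    struct disk_notifier data_nb, journal_nb;
    int data_gone, journal_gone;
};

enum chain_action fake_fs_event(struct disk_notifier *nb, enum disk_event ev,
                                struct fake_gendisk *disk)
{
    struct fake_fs *fs = nb->owner;

    (void)disk;
    if (ev == DISK_GONE) {
        /* The fs itself knows which role the lost disk played for it. */
        if (nb == &fs->data_nb)
            fs->data_gone = 1;
        else
            fs->journal_gone = 1;
    }
    return CHAIN_CONTINUE;
}

/* Register the fs on both disks it depends on, one notifier per role. */
void fake_fs_attach(struct fake_fs *fs, struct fake_gendisk *data,
                    struct fake_gendisk *journal)
{
    fs->data_nb.call = fs->journal_nb.call = fake_fs_event;
    fs->data_nb.owner = fs->journal_nb.owner = fs;
    disk_notifier_register(data, &fs->data_nb);
    disk_notifier_register(journal, &fs->journal_nb);
}
```

The point of the sketch is the direction of registration: the dependent
registers one notifier per device it relies on, so a filesystem with an
external journal can react differently to losing the journal than to losing
its data device, and a RAID-like consumer could return CHAIN_STOP or simply
not propagate the event. No superblock method is involved, and the disk does
not need to know what kind of object is listening.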
