From owner-xfs@oss.sgi.com Tue May 1 07:21:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 07:21:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l41EL5fB015382 for ; Tue, 1 May 2007 07:21:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA04161; Wed, 2 May 2007 00:20:54 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l41EKqAf81627789; Wed, 2 May 2007 00:20:53 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l41EKn6U80735839; Wed, 2 May 2007 00:20:49 +1000 (AEST) Date: Wed, 2 May 2007 00:20:49 +1000 From: David Chinner To: Nicholas Miell Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177994346.3362.5.camel@entropy> User-Agent: Mutt/1.4.2.1i X-archive-position: 11237 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: > On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM 
-0600, Andreas Dilger wrote: > > > This is actually for future use. Any flags that are added into this > > > range must be understood by both sides or it should be considered an > > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need > > > to be supported. If it turns out that 8 bits is too small a range for > > > INCOMPAT flags, then we can make 0x01000000 an incompat flag that means > > > e.g. 0x00ff0000 are also incompat flags also. > > > > Ah, ok. So it's not really a set of "compatibility" flags, it's more a > > "compulsory" set. Under those terms, i don't really see why this is > > necessary - either the filesystem will understand the flags or it will > > return EINVAL or ignore them... > > > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > > will be understood by the implementations. Most filesystems can safely > > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > > that matter FLAG_SYNC is probably moot for most filesystems also because > > > they do block allocation at preprw time. > > > > Exactly my point - so why do we really need to encode a compulsory set of > > > > Because flags have meaning, independent of whether or not the filesystem > understands them. And if the filesystem chooses to ignore critically > important flags (instead of returning EINVAL), bad things may happen. > > So, either the filesystem will understand the flag or iff the unknown flag > is in the incompat set, it will return EINVAL or else the unknown flag will > be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 11:38:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:38:27 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41IcLfB004929 for ; Tue, 1 May 2007 11:38:23 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49945 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixE1-00068o-W4 (Exim 4.63) (return-path ); Tue, 01 May 2007 19:37:22 +0100 In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:37:20 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11238 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 05:22, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >> The FIBMAP ioctl is for privileged users >> only, and I wonder if FIEMAP 
should be the same, or at least >> disallow >> mapping files that the user can't access especially with >> FLAG_SYNC and/or >> FLAG_HSM_READ. > > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the machine. Perhaps for non-privileged users FIEMAP has to be read- only? As soon as any of the FLAG_* flags come into play you make it privileged. For example fancy any user being able to fill up your file system by calling FIEMAP with FLAG_HSM_READ on all files recursively? This should certainly not be simply dismissed as a non- issue without thinking about it first... Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 11:48:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:48:44 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41ImefB006913 for ; Tue, 1 May 2007 11:48:41 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49949 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixNG-0000gV-WA (Exim 4.63) (return-path ); Tue, 01 May 2007 19:46:55 +0100 In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> 
<1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:46:53 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11239 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 15:20, David Chinner wrote: > On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: >> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> This is actually for future use. Any flags that are added into >>>> this >>>> range must be understood by both sides or it should be >>>> considered an >>>> error. Flags outside the FIEMAP_FLAG_INCOMPAT do not >>>> necessarily need >>>> to be supported. If it turns out that 8 bits is too small a >>>> range for >>>> INCOMPAT flags, then we can make 0x01000000 an incompat flag >>>> that means >>>> e.g. 0x00ff0000 are also incompat flags also. >>> >>> Ah, ok. So it's not really a set of "compatibility" flags, it's >>> more a >>> "compulsory" set. Under those terms, i don't really see why this is >>> necessary - either the filesystem will understand the flags or it >>> will >>> return EINVAL or ignore them... >>> >>>> I'm assuming that all flags that will be in the original FIEMAP >>>> proposal >>>> will be understood by the implementations. 
Most filesystems can >>>> safely >>>> ignore FLAG_HSM_READ, for example, since they don't support HSM, >>>> and for >>>> that matter FLAG_SYNC is probably moot for most filesystems also >>>> because >>>> they do block allocation at preprw time. >>> >>> Exactly my point - so why do we really need to encode a >>> compulsory set of >> >> Because flags have meaning, independent of whether or not the >> filesystem >> understands them. And if the filesystem chooses to ignore critically >> important flags (instead of returning EINVAL), bad things may happen. >> >> So, either the filesystem will understand the flag or iff the >> unknown flag >> is in the incompat set, it will return EINVAL or else the unknown >> flag will >> be safely ignored. > > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. 
And applications that want to read/write the data locations that are obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ because they want to ensure the file is brought in if it is off line so they definitely want file systems that do not support this flag to ignore it. And vice versa, an application might specify some weird and funky yet to be developed feature that it expects the FS to perform and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS ignores it it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. So as you see you must support both voluntary and compulsory flags... Also consider what I said above about different kernels. A new feature is implemented in kernel 2.8.13 say that was not there before and an application is updated to use that feature. There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature it may be quite sensible to simply ignore in the kernel that the application set an unknown flag whilst for a different feature it may be the opposite. 
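The compulsory/voluntary split argued for above can be sketched roughly as follows. This is an illustration only, not code from the proposal: the 0xff000000 mask for FIEMAP_FLAG_INCOMPAT is inferred from the "8 bits" remark earlier in the thread, and the individual flag values and the helper name fiemap_check_flags() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed layout: the top 8 bits form the "incompat" (compulsory) range. */
#define FIEMAP_FLAG_INCOMPAT   0xff000000u

/* Hypothetical bit assignments, chosen only for this example. */
#define FIEMAP_FLAG_SYNC       0x00000001u  /* voluntary: safe to ignore */
#define FIEMAP_FLAG_HSM_READ   0x00000002u  /* voluntary: safe to ignore */
#define FIEMAP_FLAG_XATTR_FORK 0x02000000u  /* compulsory: must be honoured */

/*
 * Hypothetical helper. 'supported' is the set of flags this filesystem
 * implements. On success returns 0 and stores the flags to act on in
 * *effective; returns -EINVAL if an unsupported flag lies in the
 * incompat range, so it is never silently ignored.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported,
                              uint32_t *effective)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & FIEMAP_FLAG_INCOMPAT)
        return -EINVAL;

    /* Unknown voluntary flags are simply dropped. */
    *effective = requested & supported;
    return 0;
}
```

With this shape, an old kernel that has never heard of a new voluntary flag keeps working, while a new compulsory flag such as the XATTR_FORK example fails loudly instead of returning the wrong extent map.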
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 15:32:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:32:47 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MWgfB012145 for ; Tue, 1 May 2007 15:32:43 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id DFE564E4564; Tue, 1 May 2007 16:32:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AC4524179; Tue, 1 May 2007 15:32:36 -0700 (PDT) Date: Tue, 1 May 2007 15:32:36 -0700 From: Andreas Dilger To: David Chinner Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223236.GM5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 
X-archive-position: 11241 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 00:20 +1000, David Chinner wrote: > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... What you seem to be missing about my proposal is that the FLAG_INCOMPAT is for future use by that part of the specification we haven't thought of yet... Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4, and is much better than having version numbers for the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 15:30:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:30:53 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MUnfB011674 for ; Tue, 1 May 2007 15:30:50 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 459B44E4564; Tue, 1 May 2007 16:30:47 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id F06254179; Tue, 1 May 2007 15:30:40 -0700 (PDT) Date: Tue, 1 May 2007 15:30:40 -0700 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223040.GL5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, 
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11240 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 01, 2007 14:22 +1000, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > I disagree - why would you want to indicate the state is unknown when we know > very well that it is offline? If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a catch-all flag that indicates "this extent contains data but there is nothing sensible to be returned for the extent mapping." > Effectively, when your extent is offline in the HSM, it is inaccessible, and > you have to bring it back from tape so it becomes accessible again. i.e. some > action is necessary on behalf of the user to make it accessible. So I think > that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for "no mapping". There can be further flags that give refinements to UNMAPPED that some applications might care about (e.g. HSM_RESIDENT), but many users/apps will not if they just want the number of fragments in a given file. 
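To illustrate the "catch-all plus refinements" idea, here is a minimal sketch of how an application that does not care about HSM details might look only at the UNMAPPED bit. The flag names and values, the struct layout, and the function name are all hypothetical; nothing here is taken from an actual FIEMAP definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent flags, for illustration only. */
#define EXTENT_UNMAPPED     0x1u  /* catch-all: no sensible mapping to return */
#define EXTENT_HSM_RESIDENT 0x2u  /* refinement: data lives in an HSM archive */

/* Hypothetical per-extent record as an application might see it. */
struct extent_info {
    uint64_t logical;   /* byte offset within the file */
    uint64_t physical;  /* byte offset on disk; meaningless if UNMAPPED */
    uint64_t length;    /* extent length in bytes */
    uint32_t flags;
};

/*
 * Count the extents that have a usable physical mapping. Only the
 * catch-all bit is tested; refinement bits are deliberately ignored.
 */
static size_t count_mapped_extents(const struct extent_info *ext, size_t n)
{
    size_t mapped = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(ext[i].flags & EXTENT_UNMAPPED))
            mapped++;
    }
    return mapped;
}
```

An HSM-aware tool could additionally test the refinement bit on the unmapped extents, while simpler tools never need to know it exists.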
> Also, I don't think "secondary" is a good term because most large systems > have more than one tier of storage. One possibility is "HSM_RESIDENT" > which indicates the extent is current and resident with a HSM's archive.... Sure. > > Can you propose reasonable flag names for these (I can't think of anything > > very good) and a clear explanation of what they mean. I suspect it will > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > the concept of stripe unit and stripe width, but as yet they are not > > communicated between the two very well. I'd be much happier if this info > > could be queried in a standard way from the block layer instead of the > > user having to specify it and the filesystem having to track it. > > My preference is definitely for a separate ioctl to grab the > filesystem geometry so this stuff can be calculated in userspace. > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > bother trying to define names until we decide which approach we take > to implement this. Hmm, previously you wrote "This information could be easily passed up in the flags fields if the filesystem has geometry information". So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of what a "stripe unit" and a "stripe width" is filesystem specific? I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request "tell me the number of blocks from this offset until the next physical disk boundary" or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. 
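For a simple striped layout, the "number of blocks from this offset until the next boundary" query reduces to modular arithmetic. The sketch below assumes a single, fixed stripe unit expressed in filesystem blocks; the function name is hypothetical, and real volume-manager geometry (concatenations, nested RAID levels) would need more than this.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of blocks from 'offset' up to the next
 * stripe-unit boundary, where 'sunit' is the stripe unit in blocks.
 * An offset that already sits on a boundary yields a full stripe unit.
 */
static uint64_t blocks_to_next_boundary(uint64_t offset, uint64_t sunit)
{
    assert(sunit > 0);  /* a zero stripe unit makes the question meaningless */
    return sunit - (offset % sunit);
}
```

An allocator could use such a value to trim or pad an allocation so that it ends on a stripe-unit boundary, provided the stripe unit can be obtained from somewhere in the first place, which is the missing piece discussed here.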
> In XFS, mkfs.xfs does the work of getting this information > to see in the filesystem superblock. Here's the code for getting > sunit/swidth from the underlying block device: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > Not much in common there ;) It looks like this might be just what e2fsprogs needs also. > > It does make sense to specify zero for the fm_extent_count array and a > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > extent data itself, for the non-verbose mode of filefrag, and for > > pre-allocating a buffer large enough to hold the file if that is important. > > Rather than rely on implicit behaviour of "pass in extent count of > zero and a don't try to return any extents" to return the number of > extents on the file, why not just explicitly define this as a valid > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for "return no extents" and "return number of extents" is wasted :-/. > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > No, but we can return the extent map for the attribute fork (i.e. > extended attrs) if asked for (XFS_IOC_GETBMAPA). This seems like it would be a useful addition to the interface also, having FIEMAP_FLAG_METADATA request the return of metadata allocations too. > > - does XFS return preallocated extents beyond EOF? > > Yes - they are part of the extent map for the file. OK. > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > use by non-root users at all? > > Users can run xfs_bmap on any file they have permission to > open(O_RDONLY). > > > The FIBMAP ioctl is for privileged users > > only, and I wonder if FIEMAP should be the same, or at least disallow > > mapping files that the user can't access especially with FLAG_SYNC and/or > > FLAG_HSM_READ. 
> > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. I think I agree with Anton that allowing some of the flags for non-privileged users seems dangerous. I think this needs to be determined on a flag-by-flag basis, and -EPERM should be returned in some cases. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 17:07:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 17:07:22 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4207GfB029493 for ; Tue, 1 May 2007 17:07:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA19765; Wed, 2 May 2007 10:07:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4206xAf82132681; Wed, 2 May 2007 10:07:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4206tcJ81768258; Wed, 2 May 2007 10:06:55 +1000 (AEST) Date: Wed, 2 May 2007 10:06:54 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11242 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 05:22, David Chinner wrote: > >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > >> The FIBMAP ioctl is for privileged users > >> only, and I wonder if FIEMAP should be the same, or at least > >>disallow > >> mapping files that the user can't access especially with > >>FLAG_SYNC and/or > >> FLAG_HSM_READ. > > > >I see little reason for restricting FI[BE]MAP to privileged users - > >anyone should be able to determine if files they have permission to > >access are fragmented. > > Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the > machine. Perhaps for non-privileged users FIEMAP has to be read- > only? As soon as any of the FLAG_* flags come into play you make it > privileged. For example fancy any user being able to fill up your > file system by calling FIEMAP with FLAG_HSM_READ on all files > recursively? By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access either by manual command or by trying to read the file data. If that runs the filesystem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files. Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in a normal HSM-managed filesystem. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 19:27:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 19:27:06 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l422QxfB029690 for ; Tue, 1 May 2007 19:27:01 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA22695; Wed, 2 May 2007 12:26:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l422QkAf82214176; Wed, 2 May 2007 12:26:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l422Qisa78652236; Wed, 2 May 2007 12:26:44 +1000 (AEST) Date: Wed, 2 May 2007 12:26:44 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502022644.GO77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11243 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, 
David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > > > I disagree - why would you want to indicate the state is unknown when we know > > very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." Yes, I like that much more. Good suggestion. ;) > > Effectively, when your extent is offline in the HSM, it is inaccessible, and > > you have to bring it back from tape so it becomes accessible again. i.e. some > > action is necessary on behalf of the user to make it accessible. So I think > > that OFFLINE is a good name for this state because it really is inaccessible. > > What you are calling OFFLINE I would prefer to call UNMAPPED, since that > can be used by applications as a catch-all for "no mapping". There can > be further flags that give refinements to UNMAPPED that some applications > might care about (e.g. HSM_RESIDENT), but many users/apps will not > if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. > > > Can you propose reasonable flag names for these (I can't think of anything > > > very good) and a clear explanation of what they mean. I suspect it will > > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > > the concept of stripe unit and stripe width, but as yet they are not > > > communicated between the two very well. I'd be much happier if this info > > > could be queried in a standard way from the block layer instead of the > > > user having to specify it and the filesystem having to track it. > > > > My preference is definitely for a separate ioctl to grab the > > filesystem geometry so this stuff can be calculated in userspace. > > i.e. 
the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > > bother trying to define names until we decide which approach we take > > to implement this. > > Hmm, previously you wrote "This information could be easily passed up in the > flags fields if the filesystem has geometry information". So, I _think_ > what you are saying is that you want 4 flags to convey this start/end > alignment information, but the exact semantics of what a "stripe unit" and > a "stripe width" is filesystem specific? Right. > I definitely do NOT want to get into any issues of querying the block > device geometry here. I was just making a passing comment that ext4+mballoc > can already do RAID-specific allocation alignment, but it depends on the > admin to specify this information and it would be nice if there was some > easy way to get this from userspace/kernel interfaces. > > Having an API that can request "tell me the number of blocks from this > offset until the next physical disk boundary" or similar would be useful > to any allocator, and the block layer already needs to know this when > submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. > > In XFS, mkfs.xfs does the work of getting this information > > to see in the filesystem superblock. Here's the code for getting > > sunit/swidth from the underlying block device: > > > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > > > Not much in common there ;) > > It looks like this might be just what e2fsprogs needs also. More than likely. > > > It does make sense to specify zero for the fm_extent_count array and a > > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > > extent data itself, for the non-verbose mode of filefrag, and for > > > pre-allocating a buffer large enough to hold the file if that is important. 
> > > > Rather than rely on implicit behaviour of "pass in extent count of > > zero and a don't try to return any extents" to return the number of > > extents on the file, why not just explicitly define this as a valid > > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS > > That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my > clever-clever for "return no extents" and "return number of extents" > is wasted :-/. Too clever for an API, I think. ;) My point is mainly that if you are going to use an API for a specific function (e.g. query the number of extents) I think that the API should have an obvious method for executing that specific function. Using a command of "get no extents" to provide the query of "how many extents in this file" is kind of obscure. When you read the code it doesn't make a lot of sense, as opposed to seeing a clear statement of intent from the code itself. i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and the code that uses it... > > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > > > No, but we can return the extent map for the attribute fork (i.e. > > extended attrs) if asked for (XFS_IOC_GETBMAPA). > > This seems like it would be a useful addition to the interface also, having > FIEMAP_FLAG_METADATA request the return of metadata allocations too. Agreed. The different types of requests need to be mutually exclusive, though - returning the map of the attribute fork mixed with the map of the data fork is going to be confusing.... > > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > > use by non-root users at all? > > > > Users can run xfs_bmap on any file they have permission to > > open(O_RDONLY). > > > > > The FIBMAP ioctl is for privileged users > > > only, and I wonder if FIEMAP should be the same, or at least disallow > > > mapping files that the user can't access especially with FLAG_SYNC and/or > > > FLAG_HSM_READ. 
> > > > I see little reason for restricting FI[BE]MAP to privileged users - > > anyone should be able to determine if files they have permission to > > access are fragmented. > > I think I agree with Anton that allowing some of the flags for non-privileged > users seems dangerous. I think this needs to be determined on a flag-by-flag > basis, and -EPERM should be returned in some cases. Agreed, but I'm yet to see any flags where I think that is necessary yet. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 01:18:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:18:25 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428IKfB012099 for ; Wed, 2 May 2007 01:18:21 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49210) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA0M-0001Ma-PD (Exim 4.63) (return-path ); Wed, 02 May 2007 09:16:06 +0100 In-Reply-To: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> <20070502000654.GK77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <8464EA47-03AC-4162-A2D0-683517568640@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: 
Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:16:04 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11244 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 01:06, David Chinner wrote: > On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: >> On 1 May 2007, at 05:22, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> The FIBMAP ioctl is for privileged users >>>> only, and I wonder if FIEMAP should be the same, or at least >>>> disallow >>>> mapping files that the user can't access especially with >>>> FLAG_SYNC and/or >>>> FLAG_HSM_READ. >>> >>> I see little reason for restricting FI[BE]MAP to privileged users - >>> anyone should be able to determine if files they have permission to >>> access are fragmented. >> >> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the >> machine. Perhaps for non-privileged users FIEMAP has to be >> read-only? As soon as any of the FLAG_* flags come into play you make it >> privileged. For example fancy any user being able to fill up your >> file system by calling FIEMAP with FLAG_HSM_READ on all files >> recursively? > > By that reasoning, users should not be allowed to recall any files > without root privileges. HSMs don't work that way, though - any user > is allowed to recall any files they have permission to access either > by manual command or by trying to read the file data. > > If that runs the filesystem out of space, then the HSM either hasn't > been configured properly or it's failed to manage the space > correctly. Either way, that's not the fault of the user for > recalling their own files.
> > Hence allowing FIEMAP to be executed by the user does not open up > any DOS conditions that don't already exist in normal HSM-managed > filesystem. Sorry, it was not a great example. But the point still stands that there are/may be created flags that you do not want to allow everyone to use. I completely agree with Andreas that those can simply return -EPERM and the rest can be allowed through. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:25:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:25:15 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428P7fB013738 for ; Wed, 2 May 2007 01:25:08 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49214) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjA7h-0003fi-Mq (Exim 4.63) (return-path ); Wed, 02 May 2007 09:23:41 +0100 In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit 
From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:23:38 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11245 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 23:30, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, David Chinner wrote: >> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I >>> didn't >> >> I disagree - why would you want to indicate the state is unknown >> when we know >> very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." I like UNMAPPED. I even use it in NTFS internally for extents maps that have not been read into memory yet. (-: On a different issue, do you think it would be worth adding an option flags like FIEMAP_DONT_RELOCATE or something similar that would be a compulsory flag and if set the FS is not allowed to move the file around/change the block allocation of the file. My thinking is that the extent map is not terribly useful if the FS goes and relocates the file to somewhere else just after you have done the ioctl. For example HFS on OSX automatically defragments files whilst it is running... Linux file systems may one day do similar things. Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell the FS we want to access the actual raw blocks so the FS can make sure the data is on block aligned boundaries and if the FS does not support this (e.g. ZFS or a compressed or encrypted NTFS file) then it can return -ENOTSUP. 
Perhaps this is totally the wrong interface and such a "prepare file for direct access" API should be a different ioctl() or syscall or whatever. It just seems very simple and appropriate to combine it here as people who use FIEMAP are at least sometimes going to be wanting to access those blocks directly as well and it feels right to be able to communicate this to the FS in the same call, kind of like an "open intent" of "I want to use the data directly on disk"... What do you think? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 01:31:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 01:31:39 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l428VZfB015273 for ; Wed, 2 May 2007 01:31:36 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49220) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjAE7-0006gx-N5 (Exim 4.63) (return-path ); Wed, 02 May 2007 09:30:19 +0100 In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <69B76939-CAAD-4F43-BE9F-6C3CA3ECCF5E@cam.ac.uk> Cc: 
David Chinner , linux-ext4@vger.kernel.org, Linux Filesystems , xfs@oss.sgi.com, Christoph Hellwig Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 09:30:17 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11246 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 09:23, Anton Altaparmakov wrote: > On 1 May 2007, at 23:30, Andreas Dilger wrote: > >> On May 01, 2007 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but >>>> I didn't >>> >>> I disagree - why would you want to indicate the state is unknown >>> when we know >>> very well that it is offline? >> >> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a >> catch-all flag that indicates "this extent contains data but there is >> nothing sensible to be returned for the extent mapping." > > I like UNMAPPED. I even use it in NTFS internally for extents maps > that have not been read into memory yet. (-: Oops, I use NOT_MAPPED in NTFS rather than UNMAPPED but I still like UNMAPPED, too. 
(-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:15:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:15:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429FjfB025664 for ; Wed, 2 May 2007 02:15:47 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03102; Wed, 2 May 2007 19:15:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429FUAf82146138; Wed, 2 May 2007 19:15:31 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429FQYw81999881; Wed, 2 May 2007 19:15:26 +1000 (AEST) Date: Wed, 2 May 2007 19:15:26 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11247 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 15:20, David Chinner wrote: > >> > >>So, either the filesystem will understand the flag or iff the > >>unknown flag > >>is in the incompat set, it will return EINVAL or else the unknown > >>flag will > >>be safely ignored. > > > >My point was that there is a difference between specification and > >implementation - if the specification says something is compulsory, > >then they must be implemented in the filesystem. This is easy > >enough to ensure by code review - we don't need additional interface > >complexity for this.... > > You are wrong about this because you are missing the point that you > have no code to review. The users that will use those flags are > going to be applications that run in user space. Chances are you > will never see their code. Heck, they might not even be open source > applications... Ummm - the specification defines what is compulsory for *filesystems* to implement, not what applications can use. We don't need to see what the applications do - what we care about is that all filesystems implement the compulsory part of the specification. That's the code we review, and that's what I was referring to. > And all applications will run against a multitude of > kernels. So version X of the application will run on kernel 2.4.*, > 2.6.*, a.b.*, etc... For future expandability of the interface I > think it is important to have both compulsory and non-compulsory flags. Ah, so that's what you want - a mutable interface. i.e. versioning. So how do compulsory flags help here? What happens if a voluntary flag now becomes compulsory? Or vice versa? How is the application supposed to deal with this dynamically?
I suggested a version number for this right back at the start of this discussion and got told that we don't want versioned interfaces because we should make the effort to get it right the first time. I don't think this can be called "getting it right". > For example there is no reason why FIEMAP_HSM_READ needs to be > compulsory. Most filesystems do not support HSM so can safely ignore > it. They might be able to safely ignore it, but in reality it should be saying "I don't understand this". If the application *needs* to use a flag like this, then it should be told that the filesystem is not capable of doing what it was asked! OTOH if the application does not need to use the flag, then it shouldn't be using it and we shouldn't be silently ignoring incorrect usage of the provided API. What you are effectively saying about these "voluntary" flags is that their behaviour is _undefined_. That is, if you use these flags what you get on a successful call is undefined; it may or may not contain what you asked for but you can't tell if it really did what you want or returned the information you asked for. This is a really bad semantic to encode into an API. > And vice versa, an application might specify some weird and funky yet > to be developed feature that it expects the FS to perform and if the > FS cannot do it (either because it does not support it or because it > failed to perform the operation) the application expects the FS to > return an error and not to ignore the flag. An example could be the > asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS > ignores it it will return the extent map for the file data instead of > the XATTR_FORK! Not what the application wanted at all. Ouch! So > this is definitely a compulsory flag if I ever saw one. Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But we don't need a flag defined in the user visible API to tell us that we need to return an error here. 
> So as you see you must support both voluntary and compulsory flags... No, you've managed to convince me that they are not necessary and they are in fact a Bad Idea... ;) > Also consider what I said above about different kernels. A new > feature is implemented in kernel 2.8.13 say that was not there before > and an application is updated to use that feature. There will be > lots of instances where that application will still be run on older > kernels where this feature does not exist. This is *exactly* where silently ignoring flags really falls down. On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does something and it returns different structure contents for the same state. Now how does the application writer know which is correct or how to tell the difference? They have to guess or write detection code which is exactly what we want to avoid. I objected to the UNKNOWN flag because it wasn't explicit in its meaning - I'm doing the same thing here. An interface needs to be explicitly defined and should not have any undefined behaviour in it.... Cheers, Dave.
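For illustration, the strict policy argued for in the message above (a filesystem rejects any flag it does not implement instead of silently ignoring it) can be sketched in C as follows. This is only a sketch under stated assumptions: the EX_ flag names, their values, and the helper ex_check_flags_strict() are hypothetical and do not come from the RFC, and -EINVAL is just one of the two error codes suggested in the thread.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this illustration. */
#define EX_FIEMAP_FLAG_SYNC      0x00000001u
#define EX_FIEMAP_FLAG_HSM_READ  0x00000002u
#define EX_FIEMAP_FLAG_XATTR     0x00000004u

/*
 * Strict policy: every requested flag must be in the set the filesystem
 * implements. Any other bit is reported as an error, so a caller can
 * never mistake an ignored flag for an honoured one.
 */
static int ex_check_flags_strict(uint32_t requested, uint32_t supported)
{
    if (requested & ~supported)
        return -EINVAL;   /* at least one flag is not implemented */
    return 0;             /* all requested flags are implemented */
}
```

With this policy a filesystem without HSM support would fail a request carrying EX_FIEMAP_FLAG_HSM_READ, and it would be up to the application to retry without the flag.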
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:38:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:38:32 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429cPfB032340 for ; Wed, 2 May 2007 02:38:28 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49355) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBFu-0000Yj-Ne (Exim 4.63) (return-path ); Wed, 02 May 2007 10:36:14 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:36:12 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11248 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> On 1 May 2007, at 15:20, David Chinner wrote: >>>> >>>> So, either the filesystem will understand the flag or iff the >>>> unknown flag >>>> is in the incompat set, it will return EINVAL or else the unknown >>>> flag will >>>> be safely ignored. >>> >>> My point was that there is a difference between specification and >>> implementation - if the specification says something is compulsory, >>> then they must be implemented in the filesystem. This is easy >>> enough to ensure by code review - we don't need additional interface >>> complexity for this.... >> >> You are wrong about this because you are missing the point that you >> have no code to review. The users that will use those flags are >> going to be applications that run in user space. Chances are you >> will never see their code. Heck, they might not even be open source >> applications... > > Ummm - the specification defines what is compulsory for *filesystems* > to implement, not what applications can use. We don't need to see > what the applications do - what we care about is that all filesystems > implement the compulsory part of the specification. That's the code > we review, and that's what I was referring to. > >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how does compusory flags help here? What happens if a voluntary > flag now becomes compulsory? Or vice versa? How is the application > supposed to deal with this dynamically? 
> > I suggested a version number for this right back at the start of > this discussion and got told that we don't want versioned interfaces > because we should make the effort to get it right the first time. > I don't think this can be called "getting it right". Look at ext2/3/4. They do it that way and it works well. No versioning just compatible and incompatible flags... The proposal is to do the same here. >> For example there is no reason why FIEMAP_HSM_READ needs to be >> compulsory. Most filesystems do not support HSM so can safely ignore >> it. > > They might be able to safely ignore it, but in reality it should > be saying "I don't understand this". If the application *needs* to > use a flag like this, then it should be told that the filesystem is > not capable of doing what it was asked! That is where you are completely wrong! (-: Or rather you are wrong for my example, i.e. you are wrong/right depending on the type of flag in question. HSM_READ is definitely _NOT_ required because all it means is "if the file is OFFLINE, bring it ONLINE and then return the extent map". Clearly all file systems that do not support HSM can 100% ignore this flag as all files will ALWAYS be ONLINE so they will return the correct data ALWAYS so no need to do anything for HSM_READ. > OTOH if the application does not need to use the flag, then it > shouldn't be using it and we shouldn't be silently ignoring > incorrect usage of the provided API. > > What you are effectively saying about these "voluntary" flags > is that their behaviour is _undefined_. That is, if you use > these flags what you get on a successful call is undefined; > it may or may not contain what you asked for but you can't > tell if it really did what you want or returned the information > you asked for. > > This is a really bad semantic to encode into an API. That is your opinion. There is nothing undefined in the API at all. You just fail to understand it... 
>> And vice versa, an application might specify some weird and funky yet >> to be developed feature that it expects the FS to perform and if the >> FS cannot do it (either because it does not support it or because it >> failed to perform the operation) the application expects the FS to >> return an error and not to ignore the flag. An example could be the >> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS >> ignores it it will return the extent map for the file data instead of >> the XATTR_FORK! Not what the application wanted at all. Ouch! So >> this is definitely a compulsory flag if I ever saw one. > > Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But > we don't need a flag defined in the user visible API to tell us > that we need to return an error here. Heh? What are you talking about? You need a flag to specify that you want XATTR_FORK. If not how the hell does the application specify that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you of the opinion that FIEMAP should definitely not support XATTR_FORK. If the latter I fully agree. This should be a separate API with named streams and the FD of the named stream should be passed to FIEMAP without the silly XATTR_FORK flag... >> So as you see you must support both voluntary and compulsory flags... > > No, you've managed to convince me that they are not necessary and > they are in fact a Bad Idea... ;) We agree to disagree then. I think they are a very Good Idea(TM). (-; >> Also consider what I said above about different kernels. A new >> feature is implemented in kernel 2.8.13 say that was not there before >> and an application is updated to use that feature. There will be >> lots of instances where that application will still be run on older >> kernels where this feature does not exist. > > This is *exactly* where silently ignoring flags really falls down. It does not! > On 2.8.13, the flag is silently ignored. 
On 2.8.14, the flag does > something and it returns different structure contents for the same No it does not. You do NOT understand at all what we are talking about do you?!? If a flag would do something weird like returning different data then OBVIOUSLY you would make this a mandatory flag and it will NOT be ignored! You should know better than arguing with fallacies. Seriously... > state. Now how does the application writer know which is correct or > how to tell the difference? They have to guess or write detection > code which is exactly what we want to avoid. No they don't. It is then a compulsory flag so your argument is totally moot. > I objected to the UNKNOWN flag because it wasn't explicit > in it's meaning - I'm doing the same thing here. An interface > needs to be explicitly defined and should not have and undefined > behaviour in it.... That is exactly the point. It is explicitly defined and has NO undefined behaviour in it. (-: Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:48:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:48:16 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429mCfB003811 for ; Wed, 2 May 2007 02:48:14 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49362) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBPJ-0006HX-NQ (Exim 4.63) (return-path ); Wed, 02 May 2007 10:45:57 +0100 In-Reply-To: <20070502091526.GW77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> 
<20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1AFF1746-8313-4DC2-81D6-4271B5FB71A3@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:45:55 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11249 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:15, David Chinner wrote: > On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >> And all applications will run against a multitude of >> kernels. So version X of the application will run on kernel 2.4.*, >> 2.6.*, a.b.*, etc... For future expandability of the interface I >> think it is important to have both compulsory and non-compulsory >> flags. > > Ah, so that's what you want - a mutable interface. i.e. versioning. > > So how does compusory flags help here? A concrete example: Let's say that the FIEMAP interface goes live as is without any flags at all and just defined bits for "these are optional and those are compulsory". Then the next kernel adds support for optional flag HSM_READ and compulsory flag XATTR_READ. FS that do not support XATTR_READ will return -ENOTSUP as they cannot return the wanted data. 
FS that do not support HSM_READ will still return the correct data in the majority of cases (except when the FS supports HSM and the data is actually OFFLINE, which the application will need to be able to cope with anyway in case the FS failed to bring the file ONLINE even if it supports the HSM_READ flag, so there is no added complexity for handling this case). > What happens if a voluntary flag now becomes compulsory? Or vice > versa? How is the application supposed to deal with this dynamically? Forgot to answer this bit: This cannot happen. There cannot be flags that move from compulsory to non-compulsory or anything stupid like that. It would have to be a totally new flag otherwise it breaks backwards compatibility and hence this interface becomes useless crap. > I suggested a version number for this right back at the start of > this discussion and got told that we don't want versioned interfaces > because we should make the effort to get it right the first time. > I don't think this can be called "getting it right". So all applications end up doing: if (version X, do blah) else if (version Y, do blob) else if (version Z, do foo) else if (version A, do bar) else exit(1); Every time a new version is added? And abort for unknown versions? Now that is a great interface! Not.
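The compulsory/optional split described in the concrete example above could look roughly like this in C. It is a sketch, not the proposed implementation: the 0xff000000 mask is an assumption based on the "8 bits" of INCOMPAT flags mentioned earlier in the thread, and the EX_ names, the individual flag values, and ex_check_flags_incompat() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed layout: the top 8 bits hold compulsory ("incompat") flags. */
#define EX_FLAG_INCOMPAT_MASK  0xff000000u
#define EX_FLAG_HSM_READ       0x00000002u  /* optional: may be ignored */
#define EX_FLAG_XATTR_READ     0x01000000u  /* compulsory: must be understood */

/*
 * Unknown flags inside the incompat mask are an error; unknown flags
 * outside it are ignored, as with the ext2/3/4 feature flags.
 */
static int ex_check_flags_incompat(uint32_t requested, uint32_t supported)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & EX_FLAG_INCOMPAT_MASK)
        return -ENOTSUP;  /* cannot return what was asked for */
    return 0;             /* any remaining unknown flags are optional */
}
```

Under these assumptions a filesystem that implements neither flag accepts a request with EX_FLAG_HSM_READ and returns the data-fork map, but fails a request with EX_FLAG_XATTR_READ.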
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:49:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:49:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429nEfB004317 for ; Wed, 2 May 2007 02:49:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03843; Wed, 2 May 2007 19:49:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429mtAf82223314; Wed, 2 May 2007 19:48:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429mqgl82278699; Wed, 2 May 2007 19:48:52 +1000 (AEST) Date: Wed, 2 May 2007 19:48:51 +1000 From: David Chinner To: Anton Altaparmakov Cc: Andreas Dilger , David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11250 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote: > On a different issue, do you think it would be worth adding an option > flags like FIEMAP_DONT_RELOCATE or something similar that would be a > compulsory flag and if set the FS is not allowed to move the file > around/change the block allocation of the file. We already have an inode flag in XFS to say this - the defrag tool checks it and ignores the file if it is set. > Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell > the FS we want to access the actual raw blocks so the FS can make > sure the data is on block aligned boundaries and if the FS does not > support this (e.g. ZFS or a compressed or encrypted NTFS file) then > it can return -ENOTSUP. > > Perhaps this is totally the wrong interface and such a "prepare file > for direct access" API should be a different ioctl() or syscall or > whatever. It just seems very simple and appropriate to combine it > here as people who use FIEMAP are at least sometimes going to be > wanting to access those blocks directly as well and it feels right to > be able to communicate this to the FS in the same call, kind of like > an "open intent" of "I want to use the data directly on disk"... I think this is wrong interface for this. Sure, use it to get the mappings (that's what it's for) but what you do with the mappings after that is not part of FIEMAP.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:57:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:57:55 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429vnfB007855 for ; Wed, 2 May 2007 02:57:52 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49383) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBZ6-0001FA-PT (Exim 4.63) (return-path ); Wed, 02 May 2007 10:56:04 +0100 In-Reply-To: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> <20070502094851.GX77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:56:03 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11251 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:48, David Chinner wrote: > On Wed, May 02, 2007 at 09:23:38AM +0100, Anton 
Altaparmakov wrote: >> On a different issue, do you think it would be worth adding an option >> flags like FIEMAP_DONT_RELOCATE or something similar that would be a >> compulsory flag and if set the FS is not allowed to move the file >> around/change the block allocation of the file. > > We already have an inode flag in XFS to say this - the defrag > tool checks it and ignores the file if it is set. That is great for XFS but you control the metadata. NTFS, HFS, etc are cases where we cannot add such a flag because we cannot modify the metadata format (ok we could in some kludgy manner like storing an EA with an inode to say "com.linux.ntfs.immutable" or something but I would rather not if I can avoid it). >> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell >> the FS we want to access the actual raw blocks so the FS can make >> sure the data is on block aligned boundaries and if the FS does not >> support this (e.g. ZFS or a compressed or encrypted NTFS file) then >> it can return -ENOTSUP. >> >> Perhaps this is totally the wrong interface and such a "prepare file >> for direct access" API should be a different ioctl() or syscall or >> whatever. It just seems very simple and appropriate to combine it >> here as people who use FIEMAP are at least sometimes going to be >> wanting to access those blocks directly as well and it feels right to >> be able to communicate this to the FS in the same call, kind of like >> an "open intent" of "I want to use the data directly on disk"... > > I think this is wrong interface for this. Sure, use it to get the > mappings (that's what it's for) but what you do with the mappings > after that is not part of FIEMAP.... Thanks for the comments. I am not sure it is a good idea either, just thought it would be worth discussing in case people thought it a good idea. 
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:52:48 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42AqhfB021110 for ; Wed, 2 May 2007 03:52:44 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l42AqgmK016050 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 2 May 2007 12:52:42 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l42Aqfmi016048 for xfs@oss.sgi.com; Wed, 2 May 2007 12:52:41 +0200 Date: Wed, 2 May 2007 12:52:41 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: Re: [Bug 756] New: File data corruption when writing to files with DM_EVENT_WRITE enabled over NFS (2.4 kernel) Message-ID: <20070502105241.GA15399@lst.de> References: <200705012104.l41L4CI3029767@oss.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705012104.l41L4CI3029767@oss.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11252 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > by this recent change: > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h Seems like someone forgot to send TAKEs to the xfs list once again.. 
From owner-xfs@oss.sgi.com Wed May 2 03:58:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:58:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42AwCfB023745 for ; Wed, 2 May 2007 03:58:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id UAA05217; Wed, 2 May 2007 20:57:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42AvrAf82323358; Wed, 2 May 2007 20:57:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42AvnBI81446737; Wed, 2 May 2007 20:57:49 +1000 (AEST) Date: Wed, 2 May 2007 20:57:49 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11253 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: 
bulk X-list: xfs On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > On 2 May 2007, at 10:15, David Chinner wrote: > >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > >>And all applications will run against a multitude of > >>kernels. So version X of the application will run on kernel 2.4.*, > >>2.6.*, a.b.*, etc... For future expandability of the interface I > >>think it is important to have both compulsory and non-compulsory > >>flags. > > > >Ah, so that's what you want - a mutable interface. i.e. versioning. > > > >So how does compusory flags help here? What happens if a voluntary > >flag now becomes compulsory? Or vice versa? How is the application > >supposed to deal with this dynamically? > > > >I suggested a version number for this right back at the start of > >this discussion and got told that we don't want versioned interfaces > >because we should make the effort to get it right the first time. > >I don't think this can be called "getting it right". > > Look at ext2/3/4. They do it that way and it works well. No > versioning just compatible and incompatible flags... The proposal is > to do the same here. Just because it works for extN doesn't make it right for this interface. > >>For example there is no reason why FIEMAP_HSM_READ needs to be > >>compulsory. Most filesystems do not support HSM so can safely ignore > >>it. > > > >They might be able to safely ignore it, but in reality it should > >be saying "I don't understand this". If the application *needs* to > >use a flag like this, then it should be told that the filesystem is > >not capable of doing what it was asked! > > That is where you are completely wrong! (-: Or rather you are wrong > for my example, i.e. you are wrong/right depending on the type of > flag in question. And that is the crux of the argument. My point is that *any* flag returns an error if the filesystem does not support it. 
> HSM_READ is definitely _NOT_ required because all
> it means is "if the file is OFFLINE, bring it ONLINE and then return
> the extent map".

You've got the definition of HSM_READ wrong. If the flag is *not*
set, then we bring everything back online and return the full extent
map.

Specifying the flag indicates that we do *not* want the offline
extents brought back online, i.e. it is a HSM or a datamover (e.g. a
backup program) that is querying the extents and we want to know
*exactly* what the current state of the file is right now.

So, if the HSM_READ flag is set, then the application is expecting
the filesystem to be part of a HSM. Hence if it's not, it should return
an error because somebody has done something wrong.

> >OTOH if the application does not need to use the flag, then it
> >shouldn't be using it and we shouldn't be silently ignoring
> >incorrect usage of the provided API.
> >
> >What you are effectively saying about these "voluntary" flags
> >is that their behaviour is _undefined_. That is, if you use
> >these flags what you get on a successful call is undefined;
> >it may or may not contain what you asked for but you can't
> >tell if it really did what you want or returned the information
> >you asked for.
> >
> >This is a really bad semantic to encode into an API.
>
> That is your opinion. There is nothing undefined in the API at all.
> You just fail to understand it...

FIEMAP returned success. Did it do what I asked? I don't
know, because it's allowed to return success when it ignored me.

This is as silly an interface definition as saying you can
implement fsync() with { return 0; }. So, when fsync() succeeded
did it write my data to disk? I don't know; it's allowed to return
success when it ignored me.

It's crazy, isn't it? It makes writing applications portable
across operating systems a real PITA (ask the MySQL folk ;)
because POSIX really does allow fsync() to be implemented like this.

I use this example because the "allow some filesystems to silently
ignore flags they don't understand" is a portability problem for
applications - rather than a cross-OS issue it is a cross-filesystem
issue. That is, if different filesystems behave differently to
the same request they will have to be handled specifically by
the application. Every filesystem should behave in *exactly* the
same way to the FIEMAP ioctls - if they don't support something
they throw an error, if they do then they return the correct
data.

> >>And vice versa, an application might specify some weird and funky yet
> >>to be developed feature that it expects the FS to perform and if the
> >>FS cannot do it (either because it does not support it or because it
> >>failed to perform the operation) the application expects the FS to
> >>return an error and not to ignore the flag. An example could be the
> >>asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> >>ignores it it will return the extent map for the file data instead of
> >>the XATTR_FORK! Not what the application wanted at all. Ouch! So
> >>this is definitely a compulsory flag if I ever saw one.
> >
> >Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> >we don't need a flag defined in the user visible API to tell us
> >that we need to return an error here.
>
> Heh? What are you talking about? You need a flag to specify that you
> want XATTR_FORK. If not how the hell does the application specify
> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
> of the opinion that FIEMAP should definitely not support XATTR_FORK.
> If the latter I fully agree. This should be a separate API with
> named streams and the FD of the named stream should be passed to
> FIEMAP without the silly XATTR_FORK flag...

Ummmm - I think you misunderstood what I was saying. I was agreeing
with you that if a FS does not support FIEMAP_XATTR_FORK "the correct
answer is -EOPNOTSUPP or -EINVAL".

What I was saying is that we don't need a COMPAT flag bit to tell
us the obvious error return if the filesystem does not support this
functionality....

> >>Also consider what I said above about different kernels. A new
> >>feature is implemented in kernel 2.8.13 say that was not there before
> >>and an application is updated to use that feature. There will be
> >>lots of instances where that application will still be run on older
> >>kernels where this feature does not exist.
> >
> >This is *exactly* where silently ignoring flags really falls down.
>
> It does not!
>
> >On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
> >something and it returns different structure contents for the same
>
> No it does not. You do NOT understand at all what we are talking
> about do you?!?
>
> If a flag would do something weird like returning different data then
> OBVIOUSLY you would make this a mandatory flag and it will NOT be
> ignored!

You've just successfully argued my case for me.

By your reasoning, if we have voluntary flags 1, 2 and 3 and
filesystems A, B and C, and filesystem A is the only filesystem to
implement 1, then when B implements 1, it must become a compulsory flag
and hence C must now return an error despite being unchanged.
Likewise when C implements 3, 3 must become a compulsory flag and
A and B must now return an error despite being unchanged.

IOWs, whenever *any* filesystem implements a voluntary feature that
it didn't previously support, we have to make that a mandatory
feature and all other filesystems that don't support it now
must return an error. You're guaranteeing the application sees
changes in behaviour with this interface, not preventing them.

Can we simply mandate that filesystems return an error
to commands they don't support or don't understand and
drop this silly interface mutation thing?

Cheers,

Dave.
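For comparison, the stricter rule proposed in this message (every flag the filesystem does not implement is an error) needs no mask in the user-visible API at all. The sketch below is only an illustration: the flag values and the helper `ex_fiemap_check_flags_strict()` are hypothetical, and -EINVAL is picked from the two error codes mentioned in the thread.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define EX_FLAG_HSM_READ   0x00000001u
#define EX_FLAG_XATTR_FORK 0x00000002u

/*
 * Strict variant: a request is accepted only if every flag it carries
 * is implemented by the filesystem. Nothing is silently ignored, so a
 * successful return always means every requested flag was honoured.
 */
static int ex_fiemap_check_flags_strict(uint32_t requested, uint32_t supported)
{
    if (requested & ~supported)
        return -EINVAL;

    return 0;
}
```

The trade-off the two sides disagree about shows up directly here: an application that sets a flag purely as a hint has to retry without it on filesystems that reject it.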
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 04:19:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 04:19:31 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42BJPfB003965 for ; Wed, 2 May 2007 04:19:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49519) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjCpx-000625-Oy (Exim 4.63) (return-path ); Wed, 02 May 2007 12:17:33 +0100 In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 12:17:32 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11254 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
11:57, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >> On 2 May 2007, at 10:15, David Chinner wrote: >>> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >>>> And all applications will run against a multitude of >>>> kernels. So version X of the application will run on kernel 2.4.*, >>>> 2.6.*, a.b.*, etc... For future expandability of the interface I >>>> think it is important to have both compulsory and non-compulsory >>>> flags. >>> >>> Ah, so that's what you want - a mutable interface. i.e. versioning. >>> >>> So how does compusory flags help here? What happens if a voluntary >>> flag now becomes compulsory? Or vice versa? How is the application >>> supposed to deal with this dynamically? >>> >>> I suggested a version number for this right back at the start of >>> this discussion and got told that we don't want versioned interfaces >>> because we should make the effort to get it right the first time. >>> I don't think this can be called "getting it right". >> >> Look at ext2/3/4. They do it that way and it works well. No >> versioning just compatible and incompatible flags... The proposal is >> to do the same here. > > Just because it works for extN doesn't make it right for this > interface. > >>>> For example there is no reason why FIEMAP_HSM_READ needs to be >>>> compulsory. Most filesystems do not support HSM so can safely >>>> ignore >>>> it. >>> >>> They might be able to safely ignore it, but in reality it should >>> be saying "I don't understand this". If the application *needs* to >>> use a flag like this, then it should be told that the filesystem is >>> not capable of doing what it was asked! >> >> That is where you are completely wrong! (-: Or rather you are wrong >> for my example, i.e. you are wrong/right depending on the type of >> flag in question. > > And that is the crux of the argument. > > My point is that *any* flag returns an error if the filesystem > does not support it. 
Yes and my point is that it should not do so as there are flags where it is not necessary. >> HSM_READ is definitely _NOT_ required because all >> it means is "if the file is OFFLINE, bring it ONLINE and then return >> the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. Ah, sorry, I did indeed misunderstand what it was meant to mean. >>> OTOH if the application does not need to use the flag, then it >>> shouldn't be using it and we shouldn't be silently ignoring >>> incorrect usage of the provided API. >>> >>> What you are effectively saying about these "voluntary" flags >>> is that their behaviour is _undefined_. That is, if you use >>> these flags what you get on a successful call is undefined; >>> it may or may not contain what you asked for but you can't >>> tell if it really did what you want or returned the information >>> you asked for. >>> >>> This is a really bad semantic to encode into an API. >> >> That is your opinion. There is nothing undefined in the API at all. >> You just fail to understand it... > > FIEMAP returned success. Did it do what I asked? I don't > know because it's allowed to return success when it did ignored me. So what? > This is as silly an interface definition as saying you can > implement fsync() with { return 0; }. So, when fsync() succeeded > did it write my data to disk? I don't know; it's allowed to return > success when it ignored me. No it is not silly at all. There can be flags that fail but still the operation is a success. Example from admittedly unrelated area: when truncating a file to smaller size if the freeing of the allocated blocks fails it does not cause the truncate to fail, it just means some space is wasted/marked used when it is unused on the volume and running fsck fixes this. At least that is how I have implemented it for NTFS and I think this is the most sensible way to do it. 
The user does not care if some blocks could not be freed. All they care about is that the file is now truncated. The volume is then marked dirty, so running fsck/chkdsk will reclaim the lost space.

> It's crazy, isn't it? It makes writing applications portable
> across operating systems a real PITA (ask the MySQL folk ;)
> because POSIX really does allow fsync() to be implemented like this.
>
> I use this example because the "allow some filesystems to silently
> ignore flags they don't understand" is a portability problem for
> applications - rather than a cross-OS issue it is a cross-filesystem
> issue. That is, if different filesystems behave differently to
> the same request they will have to be handled specifically by
> the application. Every filesystem should behave in *exactly* the
> same way to the FIEMAP ioctls - if they don't support something
> they throw an error, if they do then they return the correct
> data.

It is only a problem if you do not choose wisely which flags may be ignored silently...

>>>> And vice versa, an application might specify some weird and funky yet
>>>> to be developed feature that it expects the FS to perform and if the
>>>> FS cannot do it (either because it does not support it or because it
>>>> failed to perform the operation) the application expects the FS to
>>>> return an error and not to ignore the flag. An example could be the
>>>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
>>>> ignores it it will return the extent map for the file data instead of
>>>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>>>> this is definitely a compulsory flag if I ever saw one.
>>>
>>> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
>>> we don't need a flag defined in the user visible API to tell us
>>> that we need to return an error here.
>>
>> Heh? What are you talking about? You need a flag to specify that you
>> want XATTR_FORK.
If not how the hell does the application specify >> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you >> of the opinion that FIEMAP should definitely not support XATTR_FORK. >> If the latter I fully agree. This should be a separate API with >> named streams and the FD of the named stream should be passed to >> FIEMAP without the silly XATTR_FORK flag... > > Ummmm - I think you misunderstood what I was saying. I was agreeing > with you that is a FS does not support FIEMAP_XATTR_FORK "the correct > answer is -EOPNOTSUPP or -EINVAL". > > What I was saying is that we don't need a COMPAT flag bit to tell > us the obvious error return if the filesystem does not support this > functionality.... But there is no COMPAT bit. I don't understand what you are saying... >>>> Also consider what I said above about different kernels. A new >>>> feature is implemented in kernel 2.8.13 say that was not there >>>> before >>>> and an application is updated to use that feature. There will be >>>> lots of instances where that application will still be run on older >>>> kernels where this feature does not exist. >>> >>> This is *exactly* where silently ignoring flags really falls down. >> >> It does not! >> >>> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does >>> something and it returns different structure contents for the same >> >> No it does not. You do NOT understand at all what we are talking >> about do you?!? >> >> If a flag would do something weird like returning different data then >> OBVIOUSLY you would make this a mandatory flag and it will NOT be >> ignored! > > You've just successfully argued my case for me. No I have not at all. > By your reasoning, if we have voluntary flags 1, 2 and 3 and > filesystems A, B and C and filesystem A is the only filesystem to > implement 1, when B implements 1 bit must become a compulsory flag WHY? It does not at all. Flags CANNOT move from voluntary to compulsory. Read my argument again... 
> and hence C must now return an error despite being unchanged. Nope. > Likewise when C implement 3, 3 must become a comulsory flag and > A and B must now return an error despite being unchanged. Again no. > IOWs, whenever *any* filesystem implements a voluntary feature that > it didn't previously support, we have to make that a mandatory > feature and all other filesystems that don't support it now This is total crap. > must return an error. You're guaranteeing th application sees > changes in behaviour with this interface, not preventing. > > Can we simply mandate that filesystems return an error > to commands they don't support or don't understand and > drop this silly interface mutation thing? Can we simply not and drop this silly argument? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 05:19:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:19:37 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CJWfB016412 for ; Wed, 2 May 2007 05:19:33 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HjDMP-0005ml-DC; Wed, 02 May 2007 12:51:05 +0100 Date: Wed, 2 May 2007 12:51:05 +0100 From: Christoph Hellwig To: Lachlan McIlroy Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070502115105.GA21031@infradead.org> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See 
http://www.infradead.org/rpr.html X-archive-position: 11255 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs

On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote:
> Add lockdep support for XFS

I don't think this is entirely correct, and it misses some of the most
interesting cases. I've Cc'ed -fsdevel and Al to get some comments on
the more tricky issues in the rename section at the end of the mail.

> Modid: xfs-linux-melb:xfs-kern:28485a
> fs/xfs/xfs_vnodeops.c - 1.695 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h

The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look
good. xfs_lock_dir_and_entry should go away and just become an
open-coded

	xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
	xfs_ilock(ip, XFS_ILOCK_EXCL);

in the two callers, once we have made sure to have a sufficient locking
protocol where we always lock the parent before the child.
xfs_lock_dir_and_entry can be totally removed and replaced with just
the two ilock calls if we sort out the locking as proposed in this mail.

>
> fs/xfs/xfs_vfsops.c - 1.518 - changed
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h

This looks a bit odd to me - the rt inodes are not connected to the
filesystem namespace so the root inode can't really be its parent.
Why are we locking the root inode so early? Is there a good reason we
don't delay the locking until we're done with the rt inodes?
If not, the parent annotation is probably safe because we never lock the rt inode at the same time as any other inode, but it at least needs a big comment describing what's going on.

Now what seems to be completely lacking is any kind of annotation in xfs_rename.c, which is the most difficult thing to get right for inode locking because we may have to lock up to four inodes. I suggest implementing the same locking protocol that the VFS uses for locking i_mutex, as documented in Documentation/filesystems/directory-locking:

Also xfs_lock_inodes lacks any kind of annotation.

Let's start with the xfs_lock_inodes callers that don't fall into rename or xfs_lock_dir_and_entry handled above:

 - xfs_swap_extents locks two inodes of the same type, but these could be directories, so there is a chance we can get into conflicts with the parent->child type locking
 - xfs_link locks the source inode and the target directory inode. The VFS locking rule is lock parent, lock source, and we should follow this as it's in line with the directory before child rule, except that the source doesn't always have to be a child, in which case we don't have a problem anyway

And now rename gets ugly. We should follow the VFS rules with the following required adjustments:

 - XFS needs both source and target inode (if existing) locked. Because both must be non-directories, sorting by inode number should be okay
 - Doing a lock_rename equivalent for locking the parent directories requires dentries, but only inodes are passed down from the VFS. On the other hand they are obviously guaranteed to be directories so i_dentry has exactly one dentry on which we can do the upwards walk. s_vfs_rename_mutex is already held by the VFS so we don't need to do that again.

I'd suggest having a copy of the directory-locking file with the XFS adjustments somewhere so all this is actually well documented.

 - case for source directory == parent directory is trivial.
lock parent From owner-xfs@oss.sgi.com Wed May 2 05:53:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:53:21 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CrGfB024777 for ; Wed, 2 May 2007 05:53:17 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l42CrBgT015874 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l42CrB9i554574 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l42CrAET015347 for ; Wed, 2 May 2007 08:53:10 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l42Cr9Ww015185; Wed, 2 May 2007 08:53:09 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3BC793BC1; Wed, 2 May 2007 18:23:13 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l42CrCw4025574; Wed, 2 May 2007 18:23:12 +0530 Date: Wed, 2 May 2007 18:23:12 +0530 From: "Amit K. 
Arora" To: Chris Wedgwood Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070502125312.GA5845@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430052559.GA13145@tuatara.stupidest.org> User-Agent: Mutt/1.4.1i X-archive-position: 11256 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > For FA_ALLOCATE, it's supposed to change the file size if we > > allocate past EOF, right? > > I would argue no. Use truncate for that. The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 2 06:12:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 06:12:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42DBvfB029629 for ; Wed, 2 May 2007 06:12:00 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA08194; Wed, 2 May 2007 23:11:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42DBkAf82475833; Wed, 2 May 2007 23:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42DBipP82488324; Wed, 2 May 2007 23:11:44 +1000 (AEST) Date: Wed, 2 May 2007 23:11:44 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Missing TAKE 958522 (was Re: [Bug 756] New: File data corruption.....) Message-ID: <20070502131144.GZ77450368@melbourne.sgi.com> References: <200705012104.l41L4CI3029767@oss.sgi.com> <20070502105241.GA15399@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105241.GA15399@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11257 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:52:41PM +0200, Christoph Hellwig wrote: > > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > > by this recent change: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h > > Seems like someone forgot to send TAKEs to the xfs list once again.. Hmmm - that was a bad one to miss considering the importance of the problem it fixes...... 
-----
TAKE 958522 - XFS has conflicting strategies between metadata and file data flushing

Fix to prevent the notorious 'NULL files' problem after a crash.

The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. Without the fix, the update of a file's size as a result of a write beyond eof is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount, the file's size will be set, but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes.

There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur.

The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that gets written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in-memory file size in the write operation. Later, when the I/O operation that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof.
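
As a rough illustration of the two-size scheme described above, the following C sketch models a write that extends the file and the later I/O completion. It is only a sketch under stated assumptions: the names sketch_inode, sketch_write and sketch_io_done are hypothetical and are not the identifiers used in the actual XFS change, and locking, logging and transactions are left out entirely.

```c
#include <stdint.h>

/* Hypothetical, simplified inode holding only the two sizes. */
struct sketch_inode {
	uint64_t in_memory_size;	/* size as seen by the user */
	uint64_t on_disk_size;		/* size that would be logged */
};

/* A write beyond eof only moves the in-memory size. */
void sketch_write(struct sketch_inode *ip, uint64_t offset, uint64_t len)
{
	uint64_t end = offset + len;

	if (end > ip->in_memory_size)
		ip->in_memory_size = end;
}

/*
 * I/O completion: the data in [offset, offset + len) is now on disk.
 * The on-disk size moves up to the end of the I/O, clamped to the
 * in-memory size when the I/O covers eof. It never moves backwards.
 */
void sketch_io_done(struct sketch_inode *ip, uint64_t offset, uint64_t len)
{
	uint64_t end = offset + len;

	if (end > ip->in_memory_size)
		end = ip->in_memory_size;
	if (end > ip->on_disk_size)
		ip->on_disk_size = end;
}
```

In this model a crash between sketch_write() and sketch_io_done() leaves on_disk_size at its old value, so log replay cannot expose a size that covers data which never reached disk.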
Date: Fri Mar 30 02:24:06 AEST 2007 Workarea: vpn-emea-sw-emea-160-18.emea.sgi.com:/home/lachlan/isms/2.6.x-null Inspected by: dgc,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28322a fs/xfs/xfsidbg.c - 1.312 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.312&r2=text&tr2=1.311&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_vnodeops.c - 1.693 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.693&r2=text&tr2=1.692&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iocore.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iocore.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.c - 1.463 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.c.diff?r1=text&tr1=1.463&r2=text&tr2=1.462&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.h - 1.219 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.h.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_bmap.c - 1.367 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_bmap.c.diff?r1=text&tr1=1.367&r2=text&tr2=1.366&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.h - 1.10 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.h.diff?r1=text&tr1=1.10&r2=text&tr2=1.9&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. 
fs/xfs/linux-2.6/xfs_lrw.c - 1.259 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_lrw.c.diff?r1=text&tr1=1.259&r2=text&tr2=1.258&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/linux-2.6/xfs_aops.c - 1.142 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_aops.c.diff?r1=text&tr1=1.142&r2=text&tr2=1.141&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/dmapi/xfs_dm.c - 1.34 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/dmapi/xfs_dm.c.diff?r1=text&tr1=1.34&r2=text&tr2=1.33&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 23:45:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 23:45:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l436jDfB003835 for ; Wed, 2 May 2007 23:45:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA04360; Thu, 3 May 2007 16:45:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l436j2Af82987621; Thu, 3 May 2007 16:45:02 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l436ixY983041938; Thu, 3 May 2007 16:44:59 +1000 (AEST) Date: Thu, 3 May 2007 16:44:59 +1000 From: David Chinner To: Christoph Hellwig Cc: Lachlan McIlroy , xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070503064459.GJ77450368@melbourne.sgi.com> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> <20070502115105.GA21031@infradead.org> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502115105.GA21031@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11258 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:51:05PM +0100, Christoph Hellwig wrote: > On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote: > > Add lockdep support for XFS > > I don't think this is entirely correct, and it misses some of the > most interesting cases. Yeah, we decided it was better to get something out there that fixes the obvious and frequently reported false positives than hold it up on the hard stuff.... > I've Cc'ed -fsdevel and Al to get some comments on the more tricky > issues in the rename section at the end of the mail. There are several other tricky cases that we're not sure how to handle as well - they are mainly due to *valid* lock inversions. i.e. we do "lock A, lock B" in most places, but in others we do "lock B, *trylock* A" to avoid deadlocks. I think the MOUNT_ILOCK/inode ilock is one of these pairs. > > > > Modid: xfs-linux-melb:xfs-kern:28485a > > fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good. > > xfs_lock_dir_and_entry should go away and just become and opencoded > > xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); > xfs_ilock(ip, XFS_ILOCK_EXCL); > > in the two callers, once we made sure to have a sufficient locking > protocol where we always lock the parent before the child.
> > xfs_lock_dir_and_entry can be totally removed and replaced with just > the two ilock calls if we sort out the locking as proposed in this > mail. I'm not sure it is that simple - we currently always group locking of multiple inodes in increasing inode number order. I don't know what deadlock that is protecting against. There's also the case that we can't sleep on the ilock if the inode is in the AIL while we hold the directory lock. Once again I'm not sure what the deadlock is, but given we are now in a transaction it's probably a tail-pushing deadlock that it is avoiding. Without knowing for certain what these are avoiding, I don't think we should be removing the code blindly.... > > fs/xfs/xfs_vfsops.c - 1.518 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > This looks a bit odd to me - the rt inodes are not connected to the > filesystem namespace so the root inode can't really be it's parent. > > Why are we locking the root inode so early. Is there a good reason we > don't delay the locking until we're done with the rt inodes? No idea - it's like that on IRIX too, and I don't have time right now to discover why.... > If not the parent annotation is probably safe beause we never lock > the rt inode at the same time as any other inode, but it at least needs > a big comment describing what's going on. > > > > Now what seems to be completely lacking is any kind of annotation in > xfs_rename.c, which is the most difficult thing to get right for > inode locking because we may have to lock up to four inodes. I suggest > to implement the same locking protocol the the VFS uses for locking > i_mutex, as document in Documentation/filesystems/directory-locking: > > Also xfs_lock_inodes lacks any kind of annotation. It calls xfs_lock_inumorder() to set up the annotation.
The inode number in the set of inodes to be locked drives the lock subclass for nesting. Also xfs_rename locking ends up calling xfs_lock_inodes() and so it does get annotated. > Let's start with the xfs_lock_inodes that don't fall into rename or > xfs_lock_dir_and_entry handled above: > > > - xfs_swap_extents locks two inodes of the same type, but these > could be directories, so there is a chance we can get into > conflicts with the parent->child type locking Uses xfs_lock_inodes() so subclass nesting is used instead of parent/child. > - xfs_link locks the source inode and the target directory > inode. vfs locking rule is lock parent, lock source and > we should follow this as it's in line with the directory > before child rule except that the source doesn't always > have to be a child, in which case we don't have a problem > anyway It locks in inode number order as per xfs_lock_dir_and_entry() and uses xfs_lock_inodes() for annotation. > And now rename gets ugly, we should follow the VFS rules with > the following required adjustments: > > - XFS needs both source and target inode (if existing) locked. > Because both must be non-directories sorting by inode number > should be okay > - Doing a lock_rename equivalent for locking the parent directories > requires dentries, but only inodes are passed down from the VFS. > On the other hand they are obviously guranteed to be directories > so i_dentry has exactly one dentry on which we can do the upwards > walk. This is a lot of churn that I don't really see as necessary - why should we risk deadlocks and difficult to diagnose problems when the current code works and is now annotated? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 00:49:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 00:49:17 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l437nCfB010685 for ; Thu, 3 May 2007 00:49:13 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id C2D3C4E456B; Thu, 3 May 2007 01:49:10 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 4864A406D; Thu, 3 May 2007 00:49:09 -0700 (PDT) Date: Thu, 3 May 2007 00:49:09 -0700 From: Andreas Dilger To: David Chinner Cc: Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070503074909.GA6220@schatzie.adilger.int> Mail-Followup-To: David Chinner , Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 
39F5 0D35 BED6 X-archive-position: 11259 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 20:57 +1000, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > > HSM_READ is definitely _NOT_ required because all > > it means is "if the file is OFFLINE, bring it ONLINE and then return > > the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. > > Specifying the flag indicates that we do *not* want the offline > extents brought back online. i.e. it is a HSM or a datamover > (e.g. backup program) that is querying the extents and we want to > known *exactly* what the current state of the file is right now. > > So, if the HSM_READ flag is set, then the application is > expecting the filesytem to be part of a HSM. Hence if it's not, > it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called "HSM_READ" instead of "HSM_NO_READ". The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? 
I have no problem with returning a flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something that should be a part of a separate HSM interface, which also needs to be able to do things like push specific files or parts thereof out to HSM, set the aging policy, and return information like "where does the HSM file live" and "how many copies are there".

Do you know the reasoning behind including this in XFS_IOC_GETBMAPX? Looking at the bmap.c comments, it appears it is simply because the API isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate there is data in HSM but it has no blocks allocated in the filesystem. I don't think it makes the operation significantly more efficient than, say, "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually needs the data to be present instead of just returning mapping info that includes "UNMAPPED".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
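
The flag semantics argued over in this thread can be sketched in C as follows. This is only an illustration under stated assumptions, not the proposed FIEMAP interface: every SKETCH_* name, every flag value, and the sketch_file structure are hypothetical, and "recalling" data from HSM is reduced to clearing a field. It models the default described above (leave offline data alone and report it as UNMAPPED|HSM_RESIDENT), recall only when the HSM_READ-style flag is set, and an error for any request flag that is not understood.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical request flags; the values are for illustration only. */
#define SKETCH_FLAG_SYNC	0x00000001u
#define SKETCH_FLAG_HSM_READ	0x00000002u
#define SKETCH_KNOWN_FLAGS	(SKETCH_FLAG_SYNC | SKETCH_FLAG_HSM_READ)

/* Hypothetical per-extent result flags. */
#define SKETCH_EXTENT_UNMAPPED		0x1u
#define SKETCH_EXTENT_HSM_RESIDENT	0x2u

/* Hypothetical file state: non-zero means the data lives only in HSM. */
struct sketch_file {
	int offline;
};

int sketch_map(struct sketch_file *f, uint32_t flags, uint32_t *extent_flags)
{
	/* Flags that are not understood are rejected, never ignored. */
	if (flags & ~SKETCH_KNOWN_FLAGS)
		return -EINVAL;

	*extent_flags = 0;
	if (f->offline) {
		if (flags & SKETCH_FLAG_HSM_READ) {
			/* Stand-in for bringing the data back online. */
			f->offline = 0;
		} else {
			/* Default: leave the data where it is. */
			*extent_flags = SKETCH_EXTENT_UNMAPPED |
					SKETCH_EXTENT_HSM_RESIDENT;
		}
	}
	return 0;
}
```

With the result flags available, a caller that only wants the mapping never triggers a recall, which is the least-surprise default argued for above.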
From owner-xfs@oss.sgi.com Thu May 3 01:24:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 01:24:31 -0700 (PDT) Received: from ppsw-2.csi.cam.ac.uk (ppsw-2.csi.cam.ac.uk [131.111.8.132]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l438OPfB031233 for ; Thu, 3 May 2007 01:24:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49510) by ppsw-2.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.152]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjWb8-0005kD-8u (Exim 4.63) (return-path ); Thu, 03 May 2007 09:23:34 +0100 In-Reply-To: <20070503074909.GA6220@schatzie.adilger.int> References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> <20070503074909.GA6220@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <13539C2E-16DA-4F86-9CBB-D16050EDDC44@cam.ac.uk> Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 3 May 2007 09:23:33 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11260 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 3 May 2007, at 
08:49, Andreas Dilger wrote: > On May 02, 2007 20:57 +1000, David Chinner wrote: >> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >>> HSM_READ is definitely _NOT_ required because all >>> it means is "if the file is OFFLINE, bring it ONLINE and then return >>> the extent map". >> >> You've got the definition of HSM_READ wrong. If the flag is *not* >> set, then we bring everything back online and return the full extent >> map. >> >> Specifying the flag indicates that we do *not* want the offline >> extents brought back online. i.e. it is a HSM or a datamover >> (e.g. backup program) that is querying the extents and we want to >> known *exactly* what the current state of the file is right now. >> >> So, if the HSM_READ flag is set, then the application is >> expecting the filesytem to be part of a HSM. Hence if it's not, >> it should return an error because somebody has done something wrong. > > In my original proposal I specifically pointed out that the > FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the > XFS_IOC_GETBMAPX > BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the > HSM_READ flag is set. That's why the flag is called "HSM_READ" > instead > of "HSM_NO_READ". Cool. I did not misunderstand after all then. (-: > The reason is that it seems bad if the default behaviour for calling > ioctl(FIEMAP) would be to force retrieval of data from HSM, and > this is > only disabled by specifying a flag. It makes a lot more sense to just > leave the data as it is and return the extent mapping by default (i.e. > this is the principle of least surprise). It would probably be > equally > surprising and undesirable if the default behaviour was to force all > data out to HSM. > > For that matter, I'm also beginning to wonder if the FLAG_HSM_READ > should > even be a part of this interface? I have no problem with returning a > flag that reports if the data is migrated to HSM and whether it is > UNMAPPED. 
> > Having FIEMAP force the retrieval of data from HSM strikes me as > something > that should be a part of a separate HSM interface, which also needs > to be > able to do things like push specific files or parts thereof out to > HSM, > set the aging policy, and return information like "where does the HSM > file live" and "how many copies are there". That would seem sensible to me also. Just like David argued that causing the data to be in a fixed location should be a separate interface rather than part of FIEMAP so by analogy the same should apply to touching HSM. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu May 3 03:34:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 03:34:34 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43AYRfB018100 for ; Thu, 3 May 2007 03:34:28 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 6152E7BA319; Thu, 3 May 2007 04:34:26 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid