From owner-xfs@oss.sgi.com Tue May 1 07:21:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 07:21:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l41EL5fB015382 for ; Tue, 1 May 2007 07:21:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA04161; Wed, 2 May 2007 00:20:54 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l41EKqAf81627789; Wed, 2 May 2007 00:20:53 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l41EKn6U80735839; Wed, 2 May 2007 00:20:49 +1000 (AEST) Date: Wed, 2 May 2007 00:20:49 +1000 From: David Chinner To: Nicholas Miell Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177994346.3362.5.camel@entropy> User-Agent: Mutt/1.4.2.1i X-archive-position: 11237 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: > On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM 
-0600, Andreas Dilger wrote: > > > This is actually for future use. Any flags that are added into this > > > range must be understood by both sides or it should be considered an > > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need > > > to be supported. If it turns out that 8 bits is too small a range for > > > INCOMPAT flags, then we can make 0x01000000 an incompat flag that means > > > e.g. 0x00ff0000 are also incompat flags also. > > > > Ah, ok. So it's not really a set of "compatibility" flags, it's more a > > "compulsory" set. Under those terms, i don't really see why this is > > necessary - either the filesystem will understand the flags or it will > > return EINVAL or ignore them... > > > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > > will be understood by the implementations. Most filesystems can safely > > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > > that matter FLAG_SYNC is probably moot for most filesystems also because > > > they do block allocation at preprw time. > > > > Exactly my point - so why do we really need to encode a compulsory set of > > > > Because flags have meaning, independent of whether or not the filesystem > understands them. And if the filesystem chooses to ignore critically > important flags (instead of returning EINVAL), bad things may happen. > > So, either the filesystem will understand the flag or iff the unknown flag > is in the incompat set, it will return EINVAL or else the unknown flag will > be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 11:38:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:38:27 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41IcLfB004929 for ; Tue, 1 May 2007 11:38:23 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49945 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixE1-00068o-W4 (Exim 4.63) (return-path ); Tue, 01 May 2007 19:37:22 +0100 In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:37:20 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11238 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 05:22, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >> The FIBMAP ioctl is for privileged users >> only, and I wonder if FIEMAP 
should be the same, or at least >> disallow >> mapping files that the user can't access especially with >> FLAG_SYNC and/or >> FLAG_HSM_READ. > > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the machine. Perhaps for non-privileged users FIEMAP has to be read- only? As soon as any of the FLAG_* flags come into play you make it privileged. For example fancy any user being able to fill up your file system by calling FIEMAP with FLAG_HSM_READ on all files recursively? This should certainly not be simply dismissed as a non- issue without thinking about it first... Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 11:48:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 11:48:44 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41ImefB006913 for ; Tue, 1 May 2007 11:48:41 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from altaparmakov.plus.com ([212.159.79.82]:49949 helo=[192.168.1.64]) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HixNG-0000gV-WA (Exim 4.63) (return-path ); Tue, 01 May 2007 19:46:55 +0100 In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> 
<1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Tue, 1 May 2007 19:46:53 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11239 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 1 May 2007, at 15:20, David Chinner wrote: > On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: >> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: >>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: >>>> This is actually for future use. Any flags that are added into >>>> this >>>> range must be understood by both sides or it should be >>>> considered an >>>> error. Flags outside the FIEMAP_FLAG_INCOMPAT do not >>>> necessarily need >>>> to be supported. If it turns out that 8 bits is too small a >>>> range for >>>> INCOMPAT flags, then we can make 0x01000000 an incompat flag >>>> that means >>>> e.g. 0x00ff0000 are also incompat flags also. >>> >>> Ah, ok. So it's not really a set of "compatibility" flags, it's >>> more a >>> "compulsory" set. Under those terms, i don't really see why this is >>> necessary - either the filesystem will understand the flags or it >>> will >>> return EINVAL or ignore them... >>> >>>> I'm assuming that all flags that will be in the original FIEMAP >>>> proposal >>>> will be understood by the implementations. 
Most filesystems can >>>> safely >>>> ignore FLAG_HSM_READ, for example, since they don't support HSM, >>>> and for >>>> that matter FLAG_SYNC is probably moot for most filesystems also >>>> because >>>> they do block allocation at preprw time. >>> >>> Exactly my point - so why do we really need to encode a >>> compulsory set of >> >> Because flags have meaning, independent of whether or not the >> filesystem >> understands them. And if the filesystem chooses to ignore critically >> important flags (instead of returning EINVAL), bad things may happen. >> >> So, either the filesystem will understand the flag or iff the >> unknown flag >> is in the incompat set, it will return EINVAL or else the unknown >> flag will >> be safely ignored. > > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. 
And applications that want to read/write the data locations obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ, because they want to ensure the file is brought in if it is offline, so they definitely want file systems that do not support this flag to ignore it. And vice versa, an application might specify some weird and funky yet-to-be-developed feature that it expects the FS to perform, and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked-for FIEMAP_XATTR_FORK flag. If that is implemented and the FS ignores it, it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. So as you see, you must support both voluntary and compulsory flags... Also consider what I said above about different kernels. Say a new feature that was not there before is implemented in kernel 2.8.13, and an application is updated to use that feature. There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature, it may be quite sensible for the kernel simply to ignore the unknown flag the application set, whilst for a different feature the opposite may be true. 
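To make the voluntary/compulsory distinction concrete, here is a minimal sketch in C. It is an illustration only, not code from any posted patch: the mask value 0xff000000 is an assumption inferred from the remark that the INCOMPAT range is 8 bits wide and contains 0x01000000, the individual flag values are invented for the example, and fiemap_check_flags() is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed layout: the top 8 bits hold the "incompat" (compulsory) flags. */
#define FIEMAP_FLAG_INCOMPAT   0xff000000u

/* Hypothetical flag values, chosen only for this example. */
#define FIEMAP_FLAG_SYNC       0x00000001u  /* voluntary */
#define FIEMAP_FLAG_HSM_READ   0x00000002u  /* voluntary */
#define FIEMAP_FLAG_XATTR_FORK 0x02000000u  /* compulsory */

/*
 * Validate the flags a caller passed against the set this filesystem
 * implements. Returns 0 and stores the flags to act on in *effective,
 * or -EINVAL if an unknown flag lies in the incompat range.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported,
                              uint32_t *effective)
{
	uint32_t unknown = requested & ~supported;

	/* An unknown compulsory flag must never be silently dropped. */
	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;

	/* Unknown voluntary flags are simply ignored. */
	*effective = requested & supported;
	return 0;
}
```

With this shape, an old kernel that has never heard of FIEMAP_FLAG_XATTR_FORK rejects the request instead of returning the data fork map, while an unknown FIEMAP_FLAG_HSM_READ is dropped without error.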
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Tue May 1 15:32:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:32:47 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MWgfB012145 for ; Tue, 1 May 2007 15:32:43 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id DFE564E4564; Tue, 1 May 2007 16:32:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AC4524179; Tue, 1 May 2007 15:32:36 -0700 (PDT) Date: Tue, 1 May 2007 15:32:36 -0700 From: Andreas Dilger To: David Chinner Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223236.GM5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501142049.GG77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 
X-archive-position: 11241 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 00:20 +1000, David Chinner wrote: > My point was that there is a difference between specification and > implementation - if the specification says something is compulsory, > then they must be implemented in the filesystem. This is easy > enough to ensure by code review - we don't need additional interface > complexity for this.... What you seem to be missing about my proposal is that the FLAG_INCOMPAT is for future use by that part of the specification we haven't thought of yet... Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4, and is much better than having version numbers for the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 15:30:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 15:30:53 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l41MUnfB011674 for ; Tue, 1 May 2007 15:30:50 -0700 Received: from localhost.adilger.int (72-254-21-136.client.stsn.net [72.254.21.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 459B44E4564; Tue, 1 May 2007 16:30:47 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id F06254179; Tue, 1 May 2007 15:30:40 -0700 (PDT) Date: Tue, 1 May 2007 15:30:40 -0700 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501223040.GL5722@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, 
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11240 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 01, 2007 14:22 +1000, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > I disagree - why would you want to indicate the state is unknown when we know > very well that it is offline? If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a catch-all flag that indicates "this extent contains data but there is nothing sensible to be returned for the extent mapping." > Effectively, when your extent is offline in the HSM, it is inaccessible, and > you have to bring it back from tape so it becomes accessible again. i.e. some > action is necessary on behalf of the user to make it accessible. So I think > that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for "no mapping". There can be further flags that refine UNMAPPED and that some applications might care about (e.g. HSM_RESIDENT), but many users/apps will not, if they just want the number of fragments in a given file. 
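As a rough illustration of the catch-all idea, the sketch below shows an application that tests only the UNMAPPED bit and never looks at the refinement flags. Everything here is hypothetical: the struct layout, the flag values, and the function name count_mapped() are assumptions made for the example; only the names UNMAPPED and HSM_RESIDENT come from the discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; only the names come from the thread. */
#define EXTENT_UNMAPPED     0x1u  /* data exists, but no usable mapping */
#define EXTENT_HSM_RESIDENT 0x2u  /* refinement: copy held by the HSM */

/* Hypothetical, simplified extent record. */
struct extent {
	uint64_t logical;  /* offset within the file */
	uint64_t length;   /* length of the extent */
	uint32_t flags;    /* EXTENT_* bits */
};

/*
 * Count the extents that carry a real mapping. Only the catch-all
 * bit is tested, so new refinement flags need no change here.
 */
static size_t count_mapped(const struct extent *ext, size_t n)
{
	size_t mapped = 0;

	for (size_t i = 0; i < n; i++)
		if (!(ext[i].flags & EXTENT_UNMAPPED))
			mapped++;
	return mapped;
}
```

Whether unmapped extents should count as fragments is left to the application; the point is only that a tool like this keeps working when further refinements of UNMAPPED are added later.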
> Also, I don't think "secondary" is a good term because most large systems > have more than one tier of storage. One possibility is "HSM_RESIDENT" > which indicates the extent is current and resident with a HSM's archive.... Sure. > > Can you propose reasonable flag names for these (I can't think of anything > > very good) and a clear explanation of what they mean. I suspect it will > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > the concept of stripe unit and stripe width, but as yet they are not > > communicated between the two very well. I'd be much happier if this info > > could be queried in a standard way from the block layer instead of the > > user having to specify it and the filesystem having to track it. > > My preference is definitely for a separate ioctl to grab the > filesystem geometry so this stuff can be calculated in userspace. > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > bother trying to define names until we decide which approach we take > to implement this. Hmm, previously you wrote "This information could be easily passed up in the flags fields if the filesystem has geometry information". So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of a "stripe unit" and a "stripe width" are filesystem-specific? I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request "tell me the number of blocks from this offset until the next physical disk boundary" or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. 
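For the simplest case, the "blocks until the next boundary" query reduces to modular arithmetic. The sketch below assumes plain striping with a fixed stripe unit expressed in filesystem blocks, which is an assumption of this example rather than anything specified in the thread; blocks_to_boundary() is a hypothetical helper, not an existing kernel or library interface.

```c
#include <stdint.h>

/*
 * Number of blocks from `offset` up to the next stripe-unit boundary.
 * Both arguments are in filesystem blocks. An offset that is already
 * aligned has a whole stripe unit ahead of it, so the result is then
 * `sunit`. Returns 0 when no geometry is known (sunit == 0).
 */
static uint64_t blocks_to_boundary(uint64_t offset, uint64_t sunit)
{
	if (sunit == 0)
		return 0;
	return sunit - (offset % sunit);
}
```

Real volume managers can have concatenated or nested layouts where a single modulus is not enough, which is one reason a common export interface from the block layer would be preferable to each allocator guessing.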
> In XFS, mkfs.xfs does the work of getting this information > to see in the filesystem superblock. Here's the code for getting > sunit/swidth from the underlying block device: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > Not much in common there ;) It looks like this might be just what e2fsprogs needs also. > > It does make sense to specify zero for the fm_extent_count array and a > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > extent data itself, for the non-verbose mode of filefrag, and for > > pre-allocating a buffer large enough to hold the file if that is important. > > Rather than rely on implicit behaviour of "pass in extent count of > zero and a don't try to return any extents" to return the number of > extents on the file, why not just explicitly define this as a valid > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for "return no extents" and "return number of extents" is wasted :-/. > > - does XFS return an extent for the metadata parts of the file (e.g. btree)? > > No, but we can return the extent map for the attribute fork (i.e. > extended attrs) if asked for (XFS_IOC_GETBMAPA). This seems like it would be a useful addition to the interface also, having FIEMAP_FLAG_METADATA request the return of metadata allocations too. > > - does XFS return preallocated extents beyond EOF? > > Yes - they are part of the extent map for the file. OK. > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or > > use by non-root users at all? > > Users can run xfs_bmap on any file they have permission to > open(O_RDONLY). > > > The FIBMAP ioctl is for privileged users > > only, and I wonder if FIEMAP should be the same, or at least disallow > > mapping files that the user can't access especially with FLAG_SYNC and/or > > FLAG_HSM_READ. 
> > I see little reason for restricting FI[BE]MAP to privileged users - > anyone should be able to determine if files they have permission to > access are fragmented. I think I agree with Anton that allowing some of the flags for non-privileged users seems dangerous. I think this needs to be determined on a flag-by-flag basis, and -EPERM should be returned in some cases. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 1 17:07:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 17:07:22 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4207GfB029493 for ; Tue, 1 May 2007 17:07:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA19765; Wed, 2 May 2007 10:07:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4206xAf82132681; Wed, 2 May 2007 10:07:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4206tcJ81768258; Wed, 2 May 2007 10:06:55 +1000 (AEST) Date: Wed, 2 May 2007 10:06:54 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502000654.GK77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1FA8E92B-954D-4624-A089-80D4AA7399FD@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11242 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 05:22, David Chinner wrote: > >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > >> The FIBMAP ioctl is for privileged users > >> only, and I wonder if FIEMAP should be the same, or at least > >>disallow > >> mapping files that the user can't access especially with > >>FLAG_SYNC and/or > >> FLAG_HSM_READ. > > > >I see little reason for restricting FI[BE]MAP to privileged users - > >anyone should be able to determine if files they have permission to > >access are fragmented. > > Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the > machine. Perhaps for non-privileged users FIEMAP has to be read- > only? As soon as any of the FLAG_* flags come into play you make it > privileged. For example fancy any user being able to fill up your > file system by calling FIEMAP with FLAG_HSM_READ on all files > recursively? By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access, either by manual command or by trying to read the file data. If that runs the filesystem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files. Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in a normal HSM-managed filesystem. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 1 19:27:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 May 2007 19:27:06 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l422QxfB029690 for ; Tue, 1 May 2007 19:27:01 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA22695; Wed, 2 May 2007 12:26:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l422QkAf82214176; Wed, 2 May 2007 12:26:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l422Qisa78652236; Wed, 2 May 2007 12:26:44 +1000 (AEST) Date: Wed, 2 May 2007 12:26:44 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502022644.GO77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070501223040.GL5722@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11243 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, 
David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > > > I disagree - why would you want to indicate the state is unknown when we know > > very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." Yes, I like that much more. Good suggestion. ;) > > Effectively, when your extent is offline in the HSM, it is inaccessible, and > > you have to bring it back from tape so it becomes accessible again. i.e. some > > action is necessary on behalf of the user to make it accessible. So I think > > that OFFLINE is a good name for this state because it really is inaccessible. > > What you are calling OFFLINE I would prefer to call UNMAPPED, since that > can be used by applications as a catch-all for "no mapping". There can > be further flags that refine UNMAPPED and that some applications > might care about (e.g. HSM_RESIDENT), but many users/apps will not, > if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. > > > Can you propose reasonable flag names for these (I can't think of anything > > > very good) and a clear explanation of what they mean. I suspect it will > > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > > the concept of stripe unit and stripe width, but as yet they are not > > > communicated between the two very well. I'd be much happier if this info > > > could be queried in a standard way from the block layer instead of the > > > user having to specify it and the filesystem having to track it. > > > > My preference is definitely for a separate ioctl to grab the > > filesystem geometry so this stuff can be calculated in userspace. > > i.e. 
the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > > bother trying to define names until we decide which approach we take > > to implement this. > > Hmm, previously you wrote "This information could be easily passed up in the > flags fields if the filesystem has geometry information". So, I _think_ > what you are saying is that you want 4 flags to convey this start/end > alignment information, but the exact semantics of a "stripe unit" and > a "stripe width" are filesystem-specific? Right. > I definitely do NOT want to get into any issues of querying the block > device geometry here. I was just making a passing comment that ext4+mballoc > can already do RAID-specific allocation alignment, but it depends on the > admin to specify this information and it would be nice if there was some > easy way to get this from userspace/kernel interfaces. > > Having an API that can request "tell me the number of blocks from this > offset until the next physical disk boundary" or similar would be useful > to any allocator, and the block layer already needs to know this when > submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. > > In XFS, mkfs.xfs does the work of getting this information > > to see in the filesystem superblock. Here's the code for getting > > sunit/swidth from the underlying block device: > > > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > > > Not much in common there ;) > > It looks like this might be just what e2fsprogs needs also. More than likely. > > > It does make sense to specify zero for the fm_extent_count array and a > > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > > extent data itself, for the non-verbose mode of filefrag, and for > > > pre-allocating a buffer large enough to hold the file if that is important. 
> >
> > Rather than rely on implicit behaviour of "pass in extent count of
> > zero and don't try to return any extents" to return the number of
> > extents on the file, why not just explicitly define this as a valid
> > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS
>
> That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my
> clever-clever for "return no extents" and "return number of extents"
> is wasted :-/.

Too clever for an API, I think. ;)

My point is mainly that if you are going to use an API for a specific
function (e.g. query the number of extents), I think that the API
should have an obvious method for executing that specific function.
Using a command of "get no extents" to provide the query of "how many
extents in this file" is kind of obscure.

When you read the code it doesn't make a lot of sense, as opposed to
seeing a clear statement of intent from the code itself. i.e.
FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and
the code that uses it...

> > > - does XFS return an extent for the metadata parts of the file (e.g. btree)?
> >
> > No, but we can return the extent map for the attribute fork (i.e.
> > extended attrs) if asked for (XFS_IOC_GETBMAPA).
>
> This seems like it would be a useful addition to the interface also, having
> FIEMAP_FLAG_METADATA request the return of metadata allocations too.

Agreed. The different types of requests need to be mutually exclusive,
though - returning the map of the attribute fork mixed with the map of
the data fork is going to be confusing....

> > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or
> > > use by non-root users at all?
> >
> > Users can run xfs_bmap on any file they have permission to
> > open(O_RDONLY).
> >
> > > The FIBMAP ioctl is for privileged users
> > > only, and I wonder if FIEMAP should be the same, or at least disallow
> > > mapping files that the user can't access especially with FLAG_SYNC and/or
> > > FLAG_HSM_READ.
> >
> > I see little reason for restricting FI[BE]MAP to privileged users -
> > anyone should be able to determine if files they have permission to
> > access are fragmented.
>
> I think I agree with Anton that allowing some of the flags for non-privileged
> users seems dangerous. I think this needs to be determined on a flag-by-flag
> basis, and -EPERM should be returned in some cases.

Agreed, but I have yet to see any flags where I think that is necessary.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Wed May 2 01:18:21 2007
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org
From:
Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 09:16:04 +0100
To: David Chinner

On 2 May 2007, at 01:06, David Chinner wrote:
> On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
>> On 1 May 2007, at 05:22, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> The FIBMAP ioctl is for privileged users
>>>> only, and I wonder if FIEMAP should be the same, or at least disallow
>>>> mapping files that the user can't access especially with FLAG_SYNC and/or
>>>> FLAG_HSM_READ.
>>>
>>> I see little reason for restricting FI[BE]MAP to privileged users -
>>> anyone should be able to determine if files they have permission to
>>> access are fragmented.
>>
>> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
>> machine. Perhaps for non-privileged users FIEMAP has to be read-only?
>> As soon as any of the FLAG_* flags come into play you make it
>> privileged. For example fancy any user being able to fill up your
>> file system by calling FIEMAP with FLAG_HSM_READ on all files
>> recursively?
>
> By that reasoning, users should not be allowed to recall any files
> without root privileges. HSMs don't work that way, though - any user
> is allowed to recall any files they have permission to access either
> by manual command or by trying to read the file data.
>
> If that runs the filesystem out of space, then the HSM either hasn't
> been configured properly or it's failed to manage the space
> correctly. Either way, that's not the fault of the user for
> recalling their own files.
>
> Hence allowing FIEMAP to be executed by the user does not open up
> any DOS conditions that don't already exist in a normal HSM-managed
> filesystem.

Sorry, it was not a great example. But the point still stands that
there are, or may later be created, flags that you do not want to
allow everyone to use. I completely agree with Andreas that those can
simply return -EPERM and the rest can be allowed through.

Best regards,

Anton
--
Anton Altaparmakov (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

From owner-xfs@oss.sgi.com Wed May 2 01:25:09 2007
Cc: David Chinner, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org
From: Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 09:23:38 +0100
To: Andreas Dilger

On 1 May 2007, at 23:30, Andreas Dilger wrote:
> On May 01, 2007 14:22 +1000, David Chinner wrote:
>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
>>
>> I disagree - why would you want to indicate the state is unknown when we know
>> very well that it is offline?
>
> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
> catch-all flag that indicates "this extent contains data but there is
> nothing sensible to be returned for the extent mapping."

I like UNMAPPED. I even use it in NTFS internally for extent maps
that have not been read into memory yet. (-:

On a different issue, do you think it would be worth adding an option
flag like FIEMAP_DONT_RELOCATE or something similar that would be a
compulsory flag and, if set, the FS is not allowed to move the file
around/change the block allocation of the file? My thinking is that
the extent map is not terribly useful if the FS goes and relocates the
file to somewhere else just after you have done the ioctl. For example
HFS on OSX automatically defragments files whilst it is running...
Linux file systems may one day do similar things.

Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
the FS we want to access the actual raw blocks so the FS can make
sure the data is on block aligned boundaries and if the FS does not
support this (e.g. ZFS or a compressed or encrypted NTFS file) then
it can return -ENOTSUP.
Perhaps this is totally the wrong interface and such a "prepare file
for direct access" API should be a different ioctl() or syscall or
whatever. It just seems very simple and appropriate to combine it
here as people who use FIEMAP are at least sometimes going to be
wanting to access those blocks directly as well and it feels right to
be able to communicate this to the FS in the same call, kind of like
an "open intent" of "I want to use the data directly on disk"...

What do you think?

Best regards,

Anton
--
Anton Altaparmakov (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

From owner-xfs@oss.sgi.com Wed May 2 01:31:37 2007
Cc:
David Chinner, linux-ext4@vger.kernel.org, Linux Filesystems, xfs@oss.sgi.com, Christoph Hellwig
From: Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 09:30:17 +0100
To: Andreas Dilger

On 2 May 2007, at 09:23, Anton Altaparmakov wrote:
> On 1 May 2007, at 23:30, Andreas Dilger wrote:
>
>> On May 01, 2007 14:22 +1000, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
>>>
>>> I disagree - why would you want to indicate the state is unknown when we know
>>> very well that it is offline?
>>
>> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
>> catch-all flag that indicates "this extent contains data but there is
>> nothing sensible to be returned for the extent mapping."
>
> I like UNMAPPED. I even use it in NTFS internally for extent maps
> that have not been read into memory yet. (-:

Oops, I use NOT_MAPPED in NTFS rather than UNMAPPED but I still like
UNMAPPED, too.
(-:

Best regards,

Anton
--
Anton Altaparmakov (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

From owner-xfs@oss.sgi.com Wed May 2 02:15:48 2007
Date: Wed, 2 May 2007 19:15:26 +1000
From: David Chinner
To: Anton Altaparmakov
Cc: David Chinner, Nicholas Miell, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 15:20, David Chinner wrote:
> >>
> >>So, either the filesystem will understand the flag or iff the unknown flag
> >>is in the incompat set, it will return EINVAL or else the unknown flag will
> >>be safely ignored.
> >
> >My point was that there is a difference between specification and
> >implementation - if the specification says something is compulsory,
> >then it must be implemented in the filesystem. This is easy
> >enough to ensure by code review - we don't need additional interface
> >complexity for this....
>
> You are wrong about this because you are missing the point that you
> have no code to review. The users that will use those flags are
> going to be applications that run in user space. Chances are you
> will never see their code. Heck, they might not even be open source
> applications...

Ummm - the specification defines what is compulsory for *filesystems*
to implement, not what applications can use. We don't need to see
what the applications do - what we care about is that all filesystems
implement the compulsory part of the specification. That's the code
we review, and that's what I was referring to.

> And all applications will run against a multitude of
> kernels. So version X of the application will run on kernel 2.4.*,
> 2.6.*, a.b.*, etc... For future expandability of the interface I
> think it is important to have both compulsory and non-compulsory flags.

Ah, so that's what you want - a mutable interface. i.e. versioning.

So how do compulsory flags help here? What happens if a voluntary
flag now becomes compulsory? Or vice versa? How is the application
supposed to deal with this dynamically?

I suggested a version number for this right back at the start of
this discussion and got told that we don't want versioned interfaces
because we should make the effort to get it right the first time.
I don't think this can be called "getting it right".

> For example there is no reason why FIEMAP_HSM_READ needs to be
> compulsory. Most filesystems do not support HSM so can safely ignore
> it.

They might be able to safely ignore it, but in reality it should
be saying "I don't understand this". If the application *needs* to
use a flag like this, then it should be told that the filesystem is
not capable of doing what it was asked!

OTOH if the application does not need to use the flag, then it
shouldn't be using it and we shouldn't be silently ignoring
incorrect usage of the provided API.

What you are effectively saying about these "voluntary" flags
is that their behaviour is _undefined_. That is, if you use
these flags what you get on a successful call is undefined;
it may or may not contain what you asked for but you can't
tell if it really did what you want or returned the information
you asked for.

This is a really bad semantic to encode into an API.

> And vice versa, an application might specify some weird and funky yet
> to be developed feature that it expects the FS to perform and if the
> FS cannot do it (either because it does not support it or because it
> failed to perform the operation) the application expects the FS to
> return an error and not to ignore the flag. An example could be the
> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> ignores it it will return the extent map for the file data instead of
> the XATTR_FORK! Not what the application wanted at all. Ouch! So
> this is definitely a compulsory flag if I ever saw one.

Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
we don't need a flag defined in the user visible API to tell us
that we need to return an error here.
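
A minimal sketch of the strict behaviour argued for above, assuming made-up flag names and values (the DEMO_* identifiers are hypothetical and are not part of any posted FIEMAP patch): the filesystem masks the request against the set of flags it actually implements and fails the call on anything else, so no flag is ever silently ignored.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, invented purely for this illustration. */
#define DEMO_FIEMAP_FLAG_SYNC        0x00000001u
#define DEMO_FIEMAP_FLAG_HSM_READ    0x00000002u
#define DEMO_FIEMAP_FLAG_XATTR_FORK  0x00000004u

/* Flags that this imaginary filesystem implements. */
#define DEMO_FS_SUPPORTED_FLAGS      (DEMO_FIEMAP_FLAG_SYNC)

/*
 * Return 0 if every requested flag is implemented by the filesystem,
 * otherwise a negative errno. Unsupported flags are never ignored.
 */
static int demo_check_flags_strict(uint32_t flags)
{
	if (flags & ~DEMO_FS_SUPPORTED_FLAGS)
		return -EOPNOTSUPP;
	return 0;
}
```

Whether the error should be -EOPNOTSUPP or -EINVAL is left open in the discussion; the sketch arbitrarily picks -EOPNOTSUPP.
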

> So as you see you must support both voluntary and compulsory flags...

No, you've managed to convince me that they are not necessary and
they are in fact a Bad Idea... ;)

> Also consider what I said above about different kernels. A new
> feature is implemented in kernel 2.8.13 say that was not there before
> and an application is updated to use that feature. There will be
> lots of instances where that application will still be run on older
> kernels where this feature does not exist.

This is *exactly* where silently ignoring flags really falls down.

On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
something and it returns different structure contents for the same
state. Now how does the application writer know which is correct or
how to tell the difference? They have to guess or write detection
code which is exactly what we want to avoid.

I objected to the UNKNOWN flag because it wasn't explicit
in its meaning - I'm doing the same thing here. An interface
needs to be explicitly defined and should not have any undefined
behaviour in it....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Wed May 2 02:38:29 2007
Cc: Nicholas Miell, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org
From: Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 10:36:12 +0100
To: David Chinner

On 2 May 2007, at
10:15, David Chinner wrote:
> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
>> On 1 May 2007, at 15:20, David Chinner wrote:
>>>>
>>>> So, either the filesystem will understand the flag or iff the unknown flag
>>>> is in the incompat set, it will return EINVAL or else the unknown flag will
>>>> be safely ignored.
>>>
>>> My point was that there is a difference between specification and
>>> implementation - if the specification says something is compulsory,
>>> then it must be implemented in the filesystem. This is easy
>>> enough to ensure by code review - we don't need additional interface
>>> complexity for this....
>>
>> You are wrong about this because you are missing the point that you
>> have no code to review. The users that will use those flags are
>> going to be applications that run in user space. Chances are you
>> will never see their code. Heck, they might not even be open source
>> applications...
>
> Ummm - the specification defines what is compulsory for *filesystems*
> to implement, not what applications can use. We don't need to see
> what the applications do - what we care about is that all filesystems
> implement the compulsory part of the specification. That's the code
> we review, and that's what I was referring to.
>
>> And all applications will run against a multitude of
>> kernels. So version X of the application will run on kernel 2.4.*,
>> 2.6.*, a.b.*, etc... For future expandability of the interface I
>> think it is important to have both compulsory and non-compulsory
>> flags.
>
> Ah, so that's what you want - a mutable interface. i.e. versioning.
>
> So how do compulsory flags help here? What happens if a voluntary
> flag now becomes compulsory? Or vice versa? How is the application
> supposed to deal with this dynamically?
>
> I suggested a version number for this right back at the start of
> this discussion and got told that we don't want versioned interfaces
> because we should make the effort to get it right the first time.
> I don't think this can be called "getting it right".

Look at ext2/3/4. They do it that way and it works well. No
versioning, just compatible and incompatible flags... The proposal is
to do the same here.

>> For example there is no reason why FIEMAP_HSM_READ needs to be
>> compulsory. Most filesystems do not support HSM so can safely ignore
>> it.
>
> They might be able to safely ignore it, but in reality it should
> be saying "I don't understand this". If the application *needs* to
> use a flag like this, then it should be told that the filesystem is
> not capable of doing what it was asked!

That is where you are completely wrong! (-: Or rather you are wrong
for my example, i.e. you are wrong/right depending on the type of flag
in question. HSM_READ is definitely _NOT_ required because all it
means is "if the file is OFFLINE, bring it ONLINE and then return the
extent map". Clearly all file systems that do not support HSM can
100% ignore this flag as all files will ALWAYS be ONLINE so they will
return the correct data ALWAYS so no need to do anything for HSM_READ.

> OTOH if the application does not need to use the flag, then it
> shouldn't be using it and we shouldn't be silently ignoring
> incorrect usage of the provided API.
>
> What you are effectively saying about these "voluntary" flags
> is that their behaviour is _undefined_. That is, if you use
> these flags what you get on a successful call is undefined;
> it may or may not contain what you asked for but you can't
> tell if it really did what you want or returned the information
> you asked for.
>
> This is a really bad semantic to encode into an API.

That is your opinion. There is nothing undefined in the API at all.
You just fail to understand it...

>> And vice versa, an application might specify some weird and funky yet
>> to be developed feature that it expects the FS to perform and if the
>> FS cannot do it (either because it does not support it or because it
>> failed to perform the operation) the application expects the FS to
>> return an error and not to ignore the flag. An example could be the
>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
>> ignores it it will return the extent map for the file data instead of
>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>> this is definitely a compulsory flag if I ever saw one.
>
> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> we don't need a flag defined in the user visible API to tell us
> that we need to return an error here.

Heh? What are you talking about? You need a flag to specify that you
want XATTR_FORK. If not, how the hell does the application specify
that it wants XATTR_FORK instead of DATA_FORK (default)?

Or are you of the opinion that FIEMAP should definitely not support
XATTR_FORK? If the latter I fully agree. This should be a separate
API with named streams and the FD of the named stream should be passed
to FIEMAP without the silly XATTR_FORK flag...

>> So as you see you must support both voluntary and compulsory flags...
>
> No, you've managed to convince me that they are not necessary and
> they are in fact a Bad Idea... ;)

We agree to disagree then. I think they are a very Good Idea(TM). (-;

>> Also consider what I said above about different kernels. A new
>> feature is implemented in kernel 2.8.13 say that was not there before
>> and an application is updated to use that feature. There will be
>> lots of instances where that application will still be run on older
>> kernels where this feature does not exist.
>
> This is *exactly* where silently ignoring flags really falls down.

It does not!

> On 2.8.13, the flag is silently ignored.
> On 2.8.14, the flag does
> something and it returns different structure contents for the same

No it does not. You do NOT understand at all what we are talking
about do you?!? If a flag would do something weird like returning
different data then OBVIOUSLY you would make this a mandatory flag and
it will NOT be ignored! You should know better than arguing with
fallacies. Seriously...

> state. Now how does the application writer know which is correct or
> how to tell the difference? They have to guess or write detection
> code which is exactly what we want to avoid.

No they don't. It is then a compulsory flag so your argument is
totally moot.

> I objected to the UNKNOWN flag because it wasn't explicit
> in its meaning - I'm doing the same thing here. An interface
> needs to be explicitly defined and should not have any undefined
> behaviour in it....

That is exactly the point. It is explicitly defined and has NO
undefined behaviour in it. (-:

Best regards,

Anton
--
Anton Altaparmakov (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

From owner-xfs@oss.sgi.com Wed May 2 02:48:14 2007
Cc: Nicholas Miell, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org
From: Anton Altaparmakov
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Date: Wed, 2 May 2007 10:45:55 +0100
To: David Chinner

On 2 May 2007, at 10:15, David Chinner wrote:
> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
>> And all applications will run against a multitude of
>> kernels. So version X of the application will run on kernel 2.4.*,
>> 2.6.*, a.b.*, etc... For future expandability of the interface I
>> think it is important to have both compulsory and non-compulsory
>> flags.
>
> Ah, so that's what you want - a mutable interface. i.e. versioning.
>
> So how do compulsory flags help here?

A concrete example: Let's say that the FIEMAP interface goes live as
is without any flags at all and just defines bits for "these are
optional and those are compulsory". Then the next kernel adds support
for optional flag HSM_READ and compulsory flag XATTR_READ.

FS that do not support XATTR_READ will return -ENOTSUP as they cannot
return the wanted data.

FS that do not support HSM_READ will still return the correct data in
the majority of cases (except when the FS supports HSM and the data is
actually OFFLINE, which the application will need to be able to cope
with anyway in case the FS failed to bring the file ONLINE even if it
supports the HSM_READ flag, so there is no added complexity for
handling this case).

> What happens if a voluntary flag now becomes compulsory? Or vice
> versa? How is the application supposed to deal with this dynamically?

Forgot to answer this bit: This cannot happen. There cannot be flags
that move from compulsory to non-compulsory or anything stupid like
that. It would have to be a totally new flag otherwise it breaks
backwards compatibility and hence this interface becomes useless crap.

> I suggested a version number for this right back at the start of
> this discussion and got told that we don't want versioned interfaces
> because we should make the effort to get it right the first time.
> I don't think this can be called "getting it right".

So all applications end up doing:

	if (version X, do blah)
	else if (version Y, do blob)
	else if (version Z, do foo)
	else if (version A, do bar)
	else exit(1);

Every time a new version is added? And abort for unknown versions?
Now that is a great interface! Not.
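
For comparison with the version chain above, a rough sketch of the compat/incompat scheme described in the concrete example: an unknown flag inside the compulsory range fails the call, while an unknown flag outside it is ignored. All DEMO_* names and bit values are hypothetical, and the 0xff000000 mask is only an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout: the top 8 bits hold compulsory flags. */
#define DEMO_FLAG_INCOMPAT_MASK  0xff000000u
#define DEMO_FLAG_HSM_READ       0x00000001u  /* optional flag */
#define DEMO_FLAG_XATTR_READ     0x01000000u  /* compulsory flag */

/*
 * `known` is the set of flags a given filesystem implements.
 * Unknown compulsory flags are refused with -EINVAL; unknown
 * optional flags are ignored and the call proceeds.
 */
static int demo_check_flags(uint32_t requested, uint32_t known)
{
	uint32_t unknown = requested & ~known;

	if (unknown & DEMO_FLAG_INCOMPAT_MASK)
		return -EINVAL;
	return 0;
}
```

With this layout an application written against a newer kernel needs no version probing: on an older filesystem the same call either succeeds, with optional flags ignored, or fails cleanly on a compulsory flag it does not implement.
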
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 02:49:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:49:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l429nEfB004317 for ; Wed, 2 May 2007 02:49:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA03843; Wed, 2 May 2007 19:49:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l429mtAf82223314; Wed, 2 May 2007 19:48:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l429mqgl82278699; Wed, 2 May 2007 19:48:52 +1000 (AEST) Date: Wed, 2 May 2007 19:48:51 +1000 From: David Chinner To: Anton Altaparmakov Cc: Andreas Dilger , David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11250 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote: > On a different issue, do you think it would be worth adding an option > flag like FIEMAP_DONT_RELOCATE or something similar that would be a > compulsory flag and if set the FS is not allowed to move the file > around/change the block allocation of the file. We already have an inode flag in XFS to say this - the defrag tool checks it and ignores the file if it is set. > Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell > the FS we want to access the actual raw blocks so the FS can make > sure the data is on block aligned boundaries and if the FS does not > support this (e.g. ZFS or a compressed or encrypted NTFS file) then > it can return -ENOTSUP. > > Perhaps this is totally the wrong interface and such a "prepare file > for direct access" API should be a different ioctl() or syscall or > whatever. It just seems very simple and appropriate to combine it > here as people who use FIEMAP are at least sometimes going to be > wanting to access those blocks directly as well and it feels right to > be able to communicate this to the FS in the same call, kind of like > an "open intent" of "I want to use the data directly on disk"... I think this is the wrong interface for this. Sure, use it to get the mappings (that's what it's for) but what you do with the mappings after that is not part of FIEMAP.... Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 02:57:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 02:57:55 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l429vnfB007855 for ; Wed, 2 May 2007 02:57:52 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49383) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjBZ6-0001FA-PT (Exim 4.63) (return-path ); Wed, 02 May 2007 10:56:04 +0100 In-Reply-To: <20070502094851.GX77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <20070501223040.GL5722@schatzie.adilger.int> <03C89173-3AD1-421F-B7A0-64C999BD9DAB@cam.ac.uk> <20070502094851.GX77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 10:56:03 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11251 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 10:48, David Chinner wrote: > On Wed, May 02, 2007 at 09:23:38AM +0100, Anton 
Altaparmakov wrote: >> On a different issue, do you think it would be worth adding an option >> flags like FIEMAP_DONT_RELOCATE or something similar that would be a >> compulsory flag and if set the FS is not allowed to move the file >> around/change the block allocation of the file. > > We already have an inode flag in XFS to say this - the defrag > tool checks it and ignores the file if it is set. That is great for XFS but you control the metadata. NTFS, HFS, etc are cases where we cannot add such a flag because we cannot modify the metadata format (ok we could in some kludgy manner like storing an EA with an inode to say "com.linux.ntfs.immutable" or something but I would rather not if I can avoid it). >> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell >> the FS we want to access the actual raw blocks so the FS can make >> sure the data is on block aligned boundaries and if the FS does not >> support this (e.g. ZFS or a compressed or encrypted NTFS file) then >> it can return -ENOTSUP. >> >> Perhaps this is totally the wrong interface and such a "prepare file >> for direct access" API should be a different ioctl() or syscall or >> whatever. It just seems very simple and appropriate to combine it >> here as people who use FIEMAP are at least sometimes going to be >> wanting to access those blocks directly as well and it feels right to >> be able to communicate this to the FS in the same call, kind of like >> an "open intent" of "I want to use the data directly on disk"... > > I think this is wrong interface for this. Sure, use it to get the > mappings (that's what it's for) but what you do with the mappings > after that is not part of FIEMAP.... Thanks for the comments. I am not sure it is a good idea either, just thought it would be worth discussing in case people thought it a good idea. 
Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:52:48 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42AqhfB021110 for ; Wed, 2 May 2007 03:52:44 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l42AqgmK016050 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 2 May 2007 12:52:42 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l42Aqfmi016048 for xfs@oss.sgi.com; Wed, 2 May 2007 12:52:41 +0200 Date: Wed, 2 May 2007 12:52:41 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: Re: [Bug 756] New: File data corruption when writing to files with DM_EVENT_WRITE enabled over NFS (2.4 kernel) Message-ID: <20070502105241.GA15399@lst.de> References: <200705012104.l41L4CI3029767@oss.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705012104.l41L4CI3029767@oss.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11252 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > by this recent change: > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h Seems like someone forgot to send TAKEs to the xfs list once again.. 
From owner-xfs@oss.sgi.com Wed May 2 03:58:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 03:58:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42AwCfB023745 for ; Wed, 2 May 2007 03:58:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id UAA05217; Wed, 2 May 2007 20:57:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42AvrAf82323358; Wed, 2 May 2007 20:57:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42AvnBI81446737; Wed, 2 May 2007 20:57:49 +1000 (AEST) Date: Wed, 2 May 2007 20:57:49 +1000 From: David Chinner To: Anton Altaparmakov Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11253 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: 
bulk X-list: xfs On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > On 2 May 2007, at 10:15, David Chinner wrote: > >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > >>And all applications will run against a multitude of > >>kernels. So version X of the application will run on kernel 2.4.*, > >>2.6.*, a.b.*, etc... For future expandability of the interface I > >>think it is important to have both compulsory and non-compulsory > >>flags. > > > >Ah, so that's what you want - a mutable interface. i.e. versioning. > > > >So how do compulsory flags help here? What happens if a voluntary > >flag now becomes compulsory? Or vice versa? How is the application > >supposed to deal with this dynamically? > > > >I suggested a version number for this right back at the start of > >this discussion and got told that we don't want versioned interfaces > >because we should make the effort to get it right the first time. > >I don't think this can be called "getting it right". > > Look at ext2/3/4. They do it that way and it works well. No > versioning just compatible and incompatible flags... The proposal is > to do the same here. Just because it works for extN doesn't make it right for this interface. > >>For example there is no reason why FIEMAP_HSM_READ needs to be > >>compulsory. Most filesystems do not support HSM so can safely ignore > >>it. > > > >They might be able to safely ignore it, but in reality it should > >be saying "I don't understand this". If the application *needs* to > >use a flag like this, then it should be told that the filesystem is > >not capable of doing what it was asked! > > That is where you are completely wrong! (-: Or rather you are wrong > for my example, i.e. you are wrong/right depending on the type of > flag in question. And that is the crux of the argument. My point is that *any* flag should return an error if the filesystem does not support it.
> HSM_READ is definitely _NOT_ required because all > it means is "if the file is OFFLINE, bring it ONLINE and then return > the extent map". You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online. i.e. it is a HSM or a datamover (e.g. backup program) that is querying the extents and we want to know *exactly* what the current state of the file is right now. So, if the HSM_READ flag is set, then the application is expecting the filesystem to be part of a HSM. Hence if it's not, it should return an error because somebody has done something wrong. > >OTOH if the application does not need to use the flag, then it > >shouldn't be using it and we shouldn't be silently ignoring > >incorrect usage of the provided API. > > > >What you are effectively saying about these "voluntary" flags > >is that their behaviour is _undefined_. That is, if you use > >these flags what you get on a successful call is undefined; > >it may or may not contain what you asked for but you can't > >tell if it really did what you want or returned the information > >you asked for. > > > >This is a really bad semantic to encode into an API. > > That is your opinion. There is nothing undefined in the API at all. > You just fail to understand it... FIEMAP returned success. Did it do what I asked? I don't know because it's allowed to return success when it ignored me. This is as silly an interface definition as saying you can implement fsync() with { return 0; }. So, when fsync() succeeded did it write my data to disk? I don't know; it's allowed to return success when it ignored me. It's crazy, isn't it? It makes writing applications portable across operating systems a real PITA (ask the MySQL folk ;) because POSIX really does allow fsync() to be implemented like this.
I use this example because the "allow some filesystems to silently ignore flags they don't understand" approach is a portability problem for applications - rather than a cross-OS issue it is a cross-filesystem issue. That is, if different filesystems behave differently to the same request they will have to be handled specifically by the application. Every filesystem should behave in *exactly* the same way to the FIEMAP ioctls - if they don't support something they throw an error, if they do then they return the correct data. > >>And vice versa, an application might specify some weird and funky yet > >>to be developed feature that it expects the FS to perform and if the > >>FS cannot do it (either because it does not support it or because it > >>failed to perform the operation) the application expects the FS to > >>return an error and not to ignore the flag. An example could be the > >>asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS > >>ignores it, it will return the extent map for the file data instead of > >>the XATTR_FORK! Not what the application wanted at all. Ouch! So > >>this is definitely a compulsory flag if I ever saw one. > > > >Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But > >we don't need a flag defined in the user visible API to tell us > >that we need to return an error here. > > Heh? What are you talking about? You need a flag to specify that you > want XATTR_FORK. If not how the hell does the application specify > that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you > of the opinion that FIEMAP should definitely not support XATTR_FORK. > If the latter I fully agree. This should be a separate API with > named streams and the FD of the named stream should be passed to > FIEMAP without the silly XATTR_FORK flag... Ummmm - I think you misunderstood what I was saying. I was agreeing with you that if a FS does not support FIEMAP_XATTR_FORK "the correct answer is -EOPNOTSUPP or -EINVAL".
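The uniform behaviour argued for here, where every filesystem rejects anything it does not implement, needs no reserved flag range at all. A minimal sketch, assuming a hypothetical fiemap_check_flags_strict() helper and a per-filesystem mask of implemented flags (neither is taken from the proposal):

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical strict check: any requested flag that the filesystem
 * does not implement fails the call, so a successful return always
 * means every flag was honoured.
 */
static int fiemap_check_flags_strict(uint32_t requested, uint32_t supported)
{
	if (requested & ~supported)
		return -EINVAL;
	return 0;
}
```

The trade-off in this sketch is that an application running on an older kernel or a less capable filesystem sees an error and has to retry without the flag, rather than getting a result whose meaning depends on what was silently ignored.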
What I was saying is that we don't need a COMPAT flag bit to tell us the obvious error return if the filesystem does not support this functionality.... > >>Also consider what I said above about different kernels. A new > >>feature is implemented in kernel 2.8.13 say that was not there before > >>and an application is updated to use that feature. There will be > >>lots of instances where that application will still be run on older > >>kernels where this feature does not exist. > > > >This is *exactly* where silently ignoring flags really falls down. > > It does not! > > >On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does > >something and it returns different structure contents for the same > > No it does not. You do NOT understand at all what we are talking > about do you?!? > > If a flag would do something weird like returning different data then > OBVIOUSLY you would make this a mandatory flag and it will NOT be > ignored! You've just successfully argued my case for me. By your reasoning, if we have voluntary flags 1, 2 and 3 and filesystems A, B and C and filesystem A is the only filesystem to implement 1, when B implements 1 it must become a compulsory flag and hence C must now return an error despite being unchanged. Likewise when C implements 3, 3 must become a compulsory flag and A and B must now return an error despite being unchanged. IOWs, whenever *any* filesystem implements a voluntary feature that it didn't previously support, we have to make that a mandatory feature and all other filesystems that don't support it now must return an error. You're guaranteeing the application sees changes in behaviour with this interface, not preventing them. Can we simply mandate that filesystems return an error to commands they don't support or don't understand and drop this silly interface mutation thing? Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 04:19:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 04:19:31 -0700 (PDT) Received: from ppsw-7.csi.cam.ac.uk (ppsw-7.csi.cam.ac.uk [131.111.8.137]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42BJPfB003965 for ; Wed, 2 May 2007 04:19:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49519) by ppsw-7.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.157]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjCpx-000625-Oy (Exim 4.63) (return-path ); Wed, 02 May 2007 12:17:33 +0100 In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> References: <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Wed, 2 May 2007 12:17:32 +0100 To: David Chinner X-Mailer: Apple Mail (2.752.3) X-archive-position: 11254 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 2 May 2007, at 
11:57, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >> On 2 May 2007, at 10:15, David Chinner wrote: >>> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: >>>> And all applications will run against a multitude of >>>> kernels. So version X of the application will run on kernel 2.4.*, >>>> 2.6.*, a.b.*, etc... For future expandability of the interface I >>>> think it is important to have both compulsory and non-compulsory >>>> flags. >>> >>> Ah, so that's what you want - a mutable interface. i.e. versioning. >>> >>> So how does compusory flags help here? What happens if a voluntary >>> flag now becomes compulsory? Or vice versa? How is the application >>> supposed to deal with this dynamically? >>> >>> I suggested a version number for this right back at the start of >>> this discussion and got told that we don't want versioned interfaces >>> because we should make the effort to get it right the first time. >>> I don't think this can be called "getting it right". >> >> Look at ext2/3/4. They do it that way and it works well. No >> versioning just compatible and incompatible flags... The proposal is >> to do the same here. > > Just because it works for extN doesn't make it right for this > interface. > >>>> For example there is no reason why FIEMAP_HSM_READ needs to be >>>> compulsory. Most filesystems do not support HSM so can safely >>>> ignore >>>> it. >>> >>> They might be able to safely ignore it, but in reality it should >>> be saying "I don't understand this". If the application *needs* to >>> use a flag like this, then it should be told that the filesystem is >>> not capable of doing what it was asked! >> >> That is where you are completely wrong! (-: Or rather you are wrong >> for my example, i.e. you are wrong/right depending on the type of >> flag in question. > > And that is the crux of the argument. > > My point is that *any* flag returns an error if the filesystem > does not support it. 
Yes and my point is that it should not do so as there are flags where it is not necessary. >> HSM_READ is definitely _NOT_ required because all >> it means is "if the file is OFFLINE, bring it ONLINE and then return >> the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. Ah, sorry, I did indeed misunderstand what it was meant to mean. >>> OTOH if the application does not need to use the flag, then it >>> shouldn't be using it and we shouldn't be silently ignoring >>> incorrect usage of the provided API. >>> >>> What you are effectively saying about these "voluntary" flags >>> is that their behaviour is _undefined_. That is, if you use >>> these flags what you get on a successful call is undefined; >>> it may or may not contain what you asked for but you can't >>> tell if it really did what you want or returned the information >>> you asked for. >>> >>> This is a really bad semantic to encode into an API. >> >> That is your opinion. There is nothing undefined in the API at all. >> You just fail to understand it... > > FIEMAP returned success. Did it do what I asked? I don't > know because it's allowed to return success when it ignored me. So what? > This is as silly an interface definition as saying you can > implement fsync() with { return 0; }. So, when fsync() succeeded > did it write my data to disk? I don't know; it's allowed to return > success when it ignored me. No it is not silly at all. There can be flags that fail but still the operation is a success. Example from an admittedly unrelated area: when truncating a file to a smaller size, if the freeing of the allocated blocks fails it does not cause the truncate to fail; it just means some space is wasted/marked used when it is unused on the volume and running fsck fixes this. At least that is how I have implemented it for NTFS and I think this is the most sensible way to do it.
The user does not care if some blocks could not be freed. All they care about is that the file is now truncated. The volume is then marked dirty, so running fsck/chkdsk will reclaim the lost space. > It's crazy, isn't it? It makes writing applications portable > across operating systems a real PITA (ask the MySQL folk ;) > because POSIX really does allow fsync() to be implemented like this. > > I use this example because the "allow some filesystems to silently > ignore flags they don't understand" is a portability problem for > applications - rather than a cross-OS issue it is a cross-filesystem > issue. That is, if different filesystems behave differently to > the same request they will have to be handled specifically by > the application. Every filesystem should behave in *exactly* the > same way to the FIEMAP ioctls - if they don't support something > they throw an error, if they do then they return the correct > data. It is only a problem if you do not choose wisely which flags may be ignored silently... >>>> And vice versa, an application might specify some weird and >>>> funky yet >>>> to be developed feature that it expects the FS to perform and if >>>> the >>>> FS cannot do it (either because it does not support it or >>>> because it >>>> failed to perform the operation) the application expects the FS to >>>> return an error and not to ignore the flag. An example could be >>>> the >>>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and >>>> the FS >>>> ignores it it will return the extent map for the file data >>>> instead of >>>> the XATTR_FORK! Not what the application wanted at all. Ouch! So >>>> this is definitely a compulsory flag if I ever saw one. >>> >>> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But >>> we don't need a flag defined in the user visible API to tell us >>> that we need to return an error here. >> >> Heh? What are you talking about? You need a flag to specify that you >> want XATTR_FORK.
If not how the hell does the application specify >> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you >> of the opinion that FIEMAP should definitely not support XATTR_FORK. >> If the latter I fully agree. This should be a separate API with >> named streams and the FD of the named stream should be passed to >> FIEMAP without the silly XATTR_FORK flag... > > Ummmm - I think you misunderstood what I was saying. I was agreeing > with you that is a FS does not support FIEMAP_XATTR_FORK "the correct > answer is -EOPNOTSUPP or -EINVAL". > > What I was saying is that we don't need a COMPAT flag bit to tell > us the obvious error return if the filesystem does not support this > functionality.... But there is no COMPAT bit. I don't understand what you are saying... >>>> Also consider what I said above about different kernels. A new >>>> feature is implemented in kernel 2.8.13 say that was not there >>>> before >>>> and an application is updated to use that feature. There will be >>>> lots of instances where that application will still be run on older >>>> kernels where this feature does not exist. >>> >>> This is *exactly* where silently ignoring flags really falls down. >> >> It does not! >> >>> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does >>> something and it returns different structure contents for the same >> >> No it does not. You do NOT understand at all what we are talking >> about do you?!? >> >> If a flag would do something weird like returning different data then >> OBVIOUSLY you would make this a mandatory flag and it will NOT be >> ignored! > > You've just successfully argued my case for me. No I have not at all. > By your reasoning, if we have voluntary flags 1, 2 and 3 and > filesystems A, B and C and filesystem A is the only filesystem to > implement 1, when B implements 1 bit must become a compulsory flag WHY? It does not at all. Flags CANNOT move from voluntary to compulsory. Read my argument again... 
> and hence C must now return an error despite being unchanged. Nope. > Likewise when C implement 3, 3 must become a comulsory flag and > A and B must now return an error despite being unchanged. Again no. > IOWs, whenever *any* filesystem implements a voluntary feature that > it didn't previously support, we have to make that a mandatory > feature and all other filesystems that don't support it now This is total crap. > must return an error. You're guaranteeing th application sees > changes in behaviour with this interface, not preventing. > > Can we simply mandate that filesystems return an error > to commands they don't support or don't understand and > drop this silly interface mutation thing? Can we simply not and drop this silly argument? Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Wed May 2 05:19:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:19:37 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CJWfB016412 for ; Wed, 2 May 2007 05:19:33 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HjDMP-0005ml-DC; Wed, 02 May 2007 12:51:05 +0100 Date: Wed, 2 May 2007 12:51:05 +0100 From: Christoph Hellwig To: Lachlan McIlroy Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070502115105.GA21031@infradead.org> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See 
http://www.infradead.org/rpr.html X-archive-position: 11255 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote: > Add lockdep support for XFS I don't think this is entirely correct, and it misses some of the most interesting cases. I've Cc'ed -fsdevel and Al to get some comments on the more tricky issues in the rename section at the end of the mail. > Modid: xfs-linux-melb:xfs-kern:28485a > fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good. xfs_lock_dir_and_entry should go away and just become and opencoded xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); xfs_ilock(ip, XFS_ILOCK_EXCL); in the two callers, once we made sure to have a sufficient locking protocol where we always lock the parent before the child. xfs_lock_dir_and_entry can be totally removed and replaced with just the two ilock calls if we sort out the locking as proposed in this mail. > > fs/xfs/xfs_vfsops.c - 1.518 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h This looks a bit odd to me - the rt inodes are not connected to the filesystem namespace so the root inode can't really be it's parent. Why are we locking the root inode so early. Is there a good reason we don't delay the locking until we're done with the rt inodes? 
If not, the parent annotation is probably safe because we never lock the rt inode at the same time as any other inode, but it at least needs a big comment describing what's going on.

Now what seems to be completely lacking is any kind of annotation in xfs_rename.c, which is the most difficult thing to get right for inode locking because we may have to lock up to four inodes. I suggest implementing the same locking protocol that the VFS uses for locking i_mutex, as documented in Documentation/filesystems/directory-locking.

Also, xfs_lock_inodes lacks any kind of annotation.

Let's start with the xfs_lock_inodes callers that don't fall into rename or the xfs_lock_dir_and_entry case handled above:

 - xfs_swap_extents locks two inodes of the same type, but these could be directories, so there is a chance we can get into conflicts with the parent->child type locking.
 - xfs_link locks the source inode and the target directory inode. The VFS locking rule is lock parent, lock source, and we should follow this as it's in line with the directory-before-child rule, except that the source doesn't always have to be a child, in which case we don't have a problem anyway.

And now rename gets ugly. We should follow the VFS rules with the following required adjustments:

 - XFS needs both source and target inode (if existing) locked. Because both must be non-directories, sorting by inode number should be okay.
 - Doing a lock_rename equivalent for locking the parent directories requires dentries, but only inodes are passed down from the VFS. On the other hand they are obviously guaranteed to be directories, so i_dentry has exactly one dentry on which we can do the upwards walk. s_vfs_rename_mutex is already held by the VFS so we don't need to do that again.

I'd suggest having a copy of the directory-locking file with the XFS adjustments somewhere so all this is actually well documented.

 - case for source directory == parent directory is trivial.
lock parent From owner-xfs@oss.sgi.com Wed May 2 05:53:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 05:53:21 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l42CrGfB024777 for ; Wed, 2 May 2007 05:53:17 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l42CrBgT015874 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l42CrB9i554574 for ; Wed, 2 May 2007 08:53:11 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l42CrAET015347 for ; Wed, 2 May 2007 08:53:10 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l42Cr9Ww015185; Wed, 2 May 2007 08:53:09 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3BC793BC1; Wed, 2 May 2007 18:23:13 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l42CrCw4025574; Wed, 2 May 2007 18:23:12 +0530 Date: Wed, 2 May 2007 18:23:12 +0530 From: "Amit K. 
Arora" To: Chris Wedgwood Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070502125312.GA5845@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430052559.GA13145@tuatara.stupidest.org> User-Agent: Mutt/1.4.1i X-archive-position: 11256 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > For FA_ALLOCATE, it's supposed to change the file size if we > > allocate past EOF, right? > > I would argue no. Use truncate for that. The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 2 06:12:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 06:12:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l42DBvfB029629 for ; Wed, 2 May 2007 06:12:00 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA08194; Wed, 2 May 2007 23:11:48 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l42DBkAf82475833; Wed, 2 May 2007 23:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l42DBipP82488324; Wed, 2 May 2007 23:11:44 +1000 (AEST) Date: Wed, 2 May 2007 23:11:44 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Missing TAKE 958522 (was Re: [Bug 756] New: File data corruption.....) Message-ID: <20070502131144.GZ77450368@melbourne.sgi.com> References: <200705012104.l41L4CI3029767@oss.sgi.com> <20070502105241.GA15399@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105241.GA15399@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11257 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:52:41PM +0200, Christoph Hellwig wrote: > > Note that a similar issue existed in the 2.6 SGI kernel up until it was resolved > > by this recent change: > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_lrw.c.diff?r1=1.258;r2=1.259;f=h > > Seems like someone forgot to send TAKEs to the xfs list once again.. Hmmm - that was a bad one to miss considering the importance of the problem it fixes...... 
-----

TAKE 958522 - XFS has conflicting strategies between metadata and file data flushing

Fix to prevent the notorious 'NULL files' problem after a crash.

The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. Without the fix, the update of a file's size as a result of a write beyond eof is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount, the file's size will be set, but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes.

There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur.

The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that gets written to the filesystem log, and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in-memory file size in the write operation. Later, when the I/O operation that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof.
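
[Editor's sketch, not part of the TAKE message.] The two-size scheme described above can be modelled in a few lines of C. This is only a toy under stated assumptions: struct toy_file, toy_write() and toy_io_done() are hypothetical names made up for illustration and are not the XFS structures or functions touched by this change. The model also ignores out-of-order completions and truncation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these names are illustrative, not the XFS fields. */
struct toy_file {
	uint64_t mem_size;   /* in-memory size: what the user sees */
	uint64_t disk_size;  /* on-disk size: what would be logged */
};

/* Write path: a write past eof moves only the in-memory size. */
static void toy_write(struct toy_file *f, uint64_t off, uint64_t len)
{
	uint64_t end = off + len;

	assert(f != NULL);
	if (end > f->mem_size)
		f->mem_size = end;
}

/* I/O completion: the data for [off, off + len) has reached disk. */
static void toy_io_done(struct toy_file *f, uint64_t off, uint64_t len)
{
	uint64_t end = off + len;

	assert(f != NULL);
	if (end > f->mem_size)      /* the I/O includes eof: clamp to it */
		end = f->mem_size;
	if (end > f->disk_size)     /* the on-disk size only grows here */
		f->disk_size = end;
}
```

In this model a crash between toy_write() and toy_io_done() leaves disk_size at its old value, so a log replay would not expose a size that covers data which never reached disk.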
Date: Fri Mar 30 02:24:06 AEST 2007 Workarea: vpn-emea-sw-emea-160-18.emea.sgi.com:/home/lachlan/isms/2.6.x-null Inspected by: dgc,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28322a fs/xfs/xfsidbg.c - 1.312 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.312&r2=text&tr2=1.311&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_vnodeops.c - 1.693 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.693&r2=text&tr2=1.692&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iocore.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iocore.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.c - 1.463 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.c.diff?r1=text&tr1=1.463&r2=text&tr2=1.462&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_inode.h - 1.219 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.h.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_bmap.c - 1.367 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_bmap.c.diff?r1=text&tr1=1.367&r2=text&tr2=1.366&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.h - 1.10 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.h.diff?r1=text&tr1=1.10&r2=text&tr2=1.9&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/xfs_iomap.c - 1.52 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iomap.c.diff?r1=text&tr1=1.52&r2=text&tr2=1.51&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. 
fs/xfs/linux-2.6/xfs_lrw.c - 1.259 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_lrw.c.diff?r1=text&tr1=1.259&r2=text&tr2=1.258&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/linux-2.6/xfs_aops.c - 1.142 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_aops.c.diff?r1=text&tr1=1.142&r2=text&tr2=1.141&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. fs/xfs/dmapi/xfs_dm.c - 1.34 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/dmapi/xfs_dm.c.diff?r1=text&tr1=1.34&r2=text&tr2=1.33&f=h - Fix to prevent the notorious 'NULL files' problem after a crash. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 2 23:45:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 02 May 2007 23:45:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l436jDfB003835 for ; Wed, 2 May 2007 23:45:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA04360; Thu, 3 May 2007 16:45:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l436j2Af82987621; Thu, 3 May 2007 16:45:02 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l436ixY983041938; Thu, 3 May 2007 16:44:59 +1000 (AEST) Date: Thu, 3 May 2007 16:44:59 +1000 From: David Chinner To: Christoph Hellwig Cc: Lachlan McIlroy , xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Subject: Re: TAKE 963965 - Add lockdep support for XFS Message-ID: <20070503064459.GJ77450368@melbourne.sgi.com> References: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> <20070502115105.GA21031@infradead.org> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502115105.GA21031@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11258 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 02, 2007 at 12:51:05PM +0100, Christoph Hellwig wrote: > On Fri, Apr 27, 2007 at 06:50:45PM +1000, Lachlan McIlroy wrote: > > Add lockdep support for XFS > > I don't think this is entirely correct, and it misses some of the > most interesting cases. Yeah, we decided it was better to get something out there that fixes the obvious and frequently reported false positives than hold it up on the hard stuff.... > I've Cc'ed -fsdevel and Al to get some comments on the more tricky > issues in the rename section at the end of the mail. There's several other tricky cases that we're not sure to handle as well - they are mainly due to *valid* lock inversions. i.e. we do "lock A, lock B" in most places, but in others we do "lock B, *trylock* A" to avoid deadlocks. I think the MOUNT_ILOCK/inode ilock is one of these pairs. > > > > Modid: xfs-linux-melb:xfs-kern:28485a > > fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h > > The XFS_ILOCK_PARENT uses in xfs_create, xfs_mkdir and xfs_symlink look good. > > xfs_lock_dir_and_entry should go away and just become and opencoded > > xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); > xfs_ilock(ip, XFS_ILOCK_EXCL); > > in the two callers, once we made sure to have a sufficient locking > protocol where we always lock the parent before the child. 
> > xfs_lock_dir_and_entry can be totally removed and replaced with just > the two ilock calls if we sort out the locking as proposed in this > mail. I'm not sure it is that simple - we currently always group locking of multiple inodes in increasing inode number order. i don't know what deadlock that is protecting against. There's also the case that we can't sleep on the ilock if the inode in the AIL while we hold the directory lock. Once again I'm not sure what the deadlock is, but given we are now in a transaction it's probably a tail-pushing deadlock that it is avoiding. Without knowing for certain what these are avoiding, I don't think we should be removing the code blindly.... > > fs/xfs/xfs_vfsops.c - 1.518 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/> > xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h > > This looks a bit odd to me - the rt inodes are not connected to the > filesystem namespace so the root inode can't really be it's parent. > > Why are we locking the root inode so early. Is there a good reason we > don't delay the locking until we're done with the rt inodes? No idea - it's like that on irix too, and I don't have time right now to discover why.... > If not the parent annotation is probably safe beause we never lock > the rt inode at the same time as any other inode, but it at least needs > a big comment describing what's going on. > > > > Now what seems to be completely lacking is any kind of annotation in > xfs_rename.c, which is the most difficult thing to get right for > inode locking because we may have to lock up to four inodes. I suggest > to implement the same locking protocol the the VFS uses for locking > i_mutex, as document in Documentation/filesystems/directory-locking: > > Also xfs_lock_inodes lacks any kind of annotation. It calls xfs_lock_inumorder() to set up the annotation. 
The inode number in the set of inodes to be locked drives the lock subclass for nesting. Also xfs_rename locking ends up calling xfs_lock_inodes() and so it does get annotated. > Let's start with the xfs_lock_inodes that don't fall into rename or > xfs_lock_dir_and_entry handled above: > > > - xfs_swap_extents locks two inodes of the same type, but these > could be directories, so there is a chance we can get into > conflicts with the parent->child type locking Uses xfs_lock_inodes() so subclass nesting is used instead of parent/child. > - xfs_link locks the source inode and the target directory > inode. vfs locking rule is lock parent, lock source and > we should follow this as it's in line with the directory > before child rule except that the source doesn't always > have to be a child, in which case we don't have a problem > anyway It locks in inode number order as per xfs_lock_dir_and_entry() and uses xfs_lock_inodes() for annotation. > And now rename gets ugly, we should follow the VFS rules with > the following required adjustments: > > - XFS needs both source and target inode (if existing) locked. > Because both must be non-directories sorting by inode number > should be okay > - Doing a lock_rename equivalent for locking the parent directories > requires dentries, but only inodes are passed down from the VFS. > On the other hand they are obviously guranteed to be directories > so i_dentry has exactly one dentry on which we can do the upwards > walk. This is a lot of churn that I don't really see as necessary - why should we risk deadlocks and difficult to diagnose problems when the current code works and is now annotated? Cheers, Dave. 
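
[Editor's sketch, not part of Dave's message.] The ordering rule discussed above (lock a group of inodes in increasing inode number order, and let the position in that order pick the lock subclass used for nesting) might look roughly like the following. Everything here is hypothetical: struct toy_inode and toy_lock_in_order() are illustration-only names, the "lock" is just a flag, and the trylock and AIL handling mentioned in the thread is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode with a lock; not the XFS inode. */
struct toy_inode {
	uint64_t ino;       /* inode number */
	int      locked;    /* 1 once "locked" */
	int      subclass;  /* nesting subclass assigned at lock time */
};

/* qsort comparator: order an array of inode pointers by inode number. */
static int toy_cmp_ino(const void *a, const void *b)
{
	const struct toy_inode *x = *(const struct toy_inode * const *)a;
	const struct toy_inode *y = *(const struct toy_inode * const *)b;

	return (x->ino > y->ino) - (x->ino < y->ino);
}

/*
 * Lock every distinct inode in ips[] in increasing inode number order.
 * The n-th inode locked gets subclass n, which is the kind of value a
 * lock validator could use to tell the nesting levels apart.
 * Returns the number of inodes locked.
 */
static int toy_lock_in_order(struct toy_inode **ips, int n)
{
	int i, nlocked = 0;

	qsort(ips, (size_t)n, sizeof(*ips), toy_cmp_ino);
	for (i = 0; i < n; i++) {
		if (i > 0 && ips[i] == ips[i - 1])
			continue;   /* same inode passed twice: lock it once */
		assert(!ips[i]->locked);
		ips[i]->locked = 1;
		ips[i]->subclass = nlocked++;
	}
	return nlocked;
}
```

Because every caller sorts the same way, two tasks locking overlapping sets always take the shared inodes in the same order, which is the property that rules out an ABBA deadlock between them.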
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 00:49:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 00:49:17 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l437nCfB010685 for ; Thu, 3 May 2007 00:49:13 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id C2D3C4E456B; Thu, 3 May 2007 01:49:10 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 4864A406D; Thu, 3 May 2007 00:49:09 -0700 (PDT) Date: Thu, 3 May 2007 00:49:09 -0700 From: Andreas Dilger To: David Chinner Cc: Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070503074909.GA6220@schatzie.adilger.int> Mail-Followup-To: David Chinner , Anton Altaparmakov , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502105749.GY77450368@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 
39F5 0D35 BED6 X-archive-position: 11259 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 20:57 +1000, David Chinner wrote: > On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > > HSM_READ is definitely _NOT_ required because all > > it means is "if the file is OFFLINE, bring it ONLINE and then return > > the extent map". > > You've got the definition of HSM_READ wrong. If the flag is *not* > set, then we bring everything back online and return the full extent > map. > > Specifying the flag indicates that we do *not* want the offline > extents brought back online. i.e. it is a HSM or a datamover > (e.g. backup program) that is querying the extents and we want to > known *exactly* what the current state of the file is right now. > > So, if the HSM_READ flag is set, then the application is > expecting the filesytem to be part of a HSM. Hence if it's not, > it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called "HSM_READ" instead of "HSM_NO_READ". The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? 
I have no problem with returning a flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something that should be a part of a separate HSM interface, which also needs to be able to do things like push specific files or parts thereof out to HSM, set the aging policy, and return information like "where does the HSM file live" and "how many copies are there".

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX? Looking at the bmap.c comments it appears it is simply because the API isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate there is data in HSM but it has no blocks allocated in the filesystem. I don't think it makes the operation significantly more efficient than, say, "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually needs the data to be present instead of just returning mapping info that includes "UNMAPPED".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
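
[Editor's sketch, not part of Andreas's message.] The semantics argued for above (by default leave offline data alone and just report it; recall from HSM only when the caller asks) can be written down as a toy model. All names and values below (TOY_FLAG_HSM_READ, TOY_EXTENT_UNMAPPED, TOY_EXTENT_HSM_RESIDENT, struct toy_extent, toy_map_extent) are hypothetical and chosen only to mirror the terms used in the thread; they are not a proposed or existing interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request flag and extent flags, for illustration only. */
#define TOY_FLAG_HSM_READ        0x1u  /* caller wants offline data recalled */
#define TOY_EXTENT_UNMAPPED      0x1u  /* no blocks allocated in the fs */
#define TOY_EXTENT_HSM_RESIDENT  0x2u  /* a copy of the data lives in HSM */

struct toy_extent {
	bool has_blocks;  /* blocks currently allocated in the filesystem */
	bool in_hsm;      /* data has been migrated to HSM */
};

/*
 * Report the state of one extent. Without TOY_FLAG_HSM_READ nothing is
 * changed and an offline extent is reported as UNMAPPED|HSM_RESIDENT.
 * With the flag, the extent is "recalled" first (modelled here by
 * simply marking it as having blocks).
 */
static uint32_t toy_map_extent(struct toy_extent *e, uint32_t req_flags)
{
	uint32_t out = 0;

	assert(e != NULL);
	if ((req_flags & TOY_FLAG_HSM_READ) && e->in_hsm && !e->has_blocks)
		e->has_blocks = true;   /* stand-in for an HSM recall */

	if (!e->has_blocks)
		out |= TOY_EXTENT_UNMAPPED;
	if (e->in_hsm)
		out |= TOY_EXTENT_HSM_RESIDENT;
	return out;
}
```

With a separate HSM interface, as suggested above, the recall branch would disappear from the mapping call entirely and only the reporting part would remain.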
From owner-xfs@oss.sgi.com Thu May 3 01:24:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 01:24:31 -0700 (PDT) Received: from ppsw-2.csi.cam.ac.uk (ppsw-2.csi.cam.ac.uk [131.111.8.132]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l438OPfB031233 for ; Thu, 3 May 2007 01:24:27 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49510) by ppsw-2.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.152]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HjWb8-0005kD-8u (Exim 4.63) (return-path ); Thu, 03 May 2007 09:23:34 +0100 In-Reply-To: <20070503074909.GA6220@schatzie.adilger.int> References: <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> <1177994346.3362.5.camel@entropy> <20070501142049.GG77450368@melbourne.sgi.com> <084192A9-D739-44F2-AD21-30BC30486F07@cam.ac.uk> <20070502091526.GW77450368@melbourne.sgi.com> <2604946E-CF10-426F-9720-DDABD10C8E0D@cam.ac.uk> <20070502105749.GY77450368@melbourne.sgi.com> <20070503074909.GA6220@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <13539C2E-16DA-4F86-9CBB-D16050EDDC44@cam.ac.uk> Cc: David Chinner , Nicholas Miell , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 3 May 2007 09:23:33 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11260 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 3 May 2007, at 
08:49, Andreas Dilger wrote: > On May 02, 2007 20:57 +1000, David Chinner wrote: >> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: >>> HSM_READ is definitely _NOT_ required because all >>> it means is "if the file is OFFLINE, bring it ONLINE and then return >>> the extent map". >> >> You've got the definition of HSM_READ wrong. If the flag is *not* >> set, then we bring everything back online and return the full extent >> map. >> >> Specifying the flag indicates that we do *not* want the offline >> extents brought back online. i.e. it is a HSM or a datamover >> (e.g. backup program) that is querying the extents and we want to >> known *exactly* what the current state of the file is right now. >> >> So, if the HSM_READ flag is set, then the application is >> expecting the filesytem to be part of a HSM. Hence if it's not, >> it should return an error because somebody has done something wrong. > > In my original proposal I specifically pointed out that the > FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the > XFS_IOC_GETBMAPX > BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the > HSM_READ flag is set. That's why the flag is called "HSM_READ" > instead > of "HSM_NO_READ". Cool. I did not misunderstand after all then. (-: > The reason is that it seems bad if the default behaviour for calling > ioctl(FIEMAP) would be to force retrieval of data from HSM, and > this is > only disabled by specifying a flag. It makes a lot more sense to just > leave the data as it is and return the extent mapping by default (i.e. > this is the principle of least surprise). It would probably be > equally > surprising and undesirable if the default behaviour was to force all > data out to HSM. > > For that matter, I'm also beginning to wonder if the FLAG_HSM_READ > should > even be a part of this interface? I have no problem with returning a > flag that reports if the data is migrated to HSM and whether it is > UNMAPPED. 
> > Having FIEMAP force the retrieval of data from HSM strikes me as > something > that should be a part of a separate HSM interface, which also needs > to be > able to do things like push specific files or parts thereof out to > HSM, > set the aging policy, and return information like "where does the HSM > file live" and "how many copies are there". That would seem sensible to me also. Just like David argued that causing the data to be in a fixed location should be a separate interface rather than part of FIEMAP so by analogy the same should apply to touching HSM. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu May 3 03:34:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 03:34:34 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43AYRfB018100 for ; Thu, 3 May 2007 03:34:28 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 6152E7BA319; Thu, 3 May 2007 04:34:26 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2F13B4153; Thu, 3 May 2007 03:34:25 -0700 (PDT) Date: Thu, 3 May 2007 03:34:25 -0700 From: Andreas Dilger To: "Amit K. Arora" Cc: Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070503103425.GE6220@schatzie.adilger.int> Mail-Followup-To: "Amit K. 
Arora" , Chris Wedgwood , David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> <20070502125312.GA5845@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070502125312.GA5845@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11261 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 02, 2007 18:23 +0530, Amit K. Arora wrote: > On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > > > For FA_ALLOCATE, it's supposed to change the file size if we > > > allocate past EOF, right? > > > > I would argue no. Use truncate for that. > > The patch I posted for ext4 *does* change the filesize after > preallocation, if required (i.e. when preallocation is after EOF). > I may have to change that, if we decide on not doing this. I think I'd agree - it may be useful to allow preallocation beyond EOF for some kinds of applications (e.g. PVR preallocating live TV in 10 minute segments or something, but not knowing in advance how long the show will actually be recorded or the final encoded size). 
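
[Editor's sketch, not part of Andreas's message.] The two behaviours being debated in this thread (preallocation past EOF leaves the file size alone, versus the posted ext4 patch which moves the file size) differ only in one step. A toy model, with hypothetical names (struct toy_pfile, toy_prealloc) that do not correspond to any real kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy file: size is what stat would report, alloc_end is how far
 * blocks have been reserved. Illustration only. */
struct toy_pfile {
	uint64_t size;
	uint64_t alloc_end;
};

/*
 * Reserve space for [off, off + len). If update_size is false the
 * reservation may extend past EOF without changing the file size
 * (the caller would use truncate to move EOF later). If it is true
 * the file size follows the reservation, as in the posted ext4 patch.
 */
static void toy_prealloc(struct toy_pfile *f, uint64_t off, uint64_t len,
			 bool update_size)
{
	uint64_t end = off + len;

	assert(f != NULL);
	if (end > f->alloc_end)
		f->alloc_end = end;
	if (update_size && end > f->size)
		f->size = end;
}
```

In the PVR example, the recorder would reserve each segment with update_size false and let ordinary writes (or a final truncate) set the size once the real length is known.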
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From owner-xfs@oss.sgi.com Thu May 3 08:01:58 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 08:02:00 -0700 (PDT)
Received: from smtp-ft6.fr.colt.net (smtp-ft6.fr.colt.net [213.41.78.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l43F1ufB005799 for ; Thu, 3 May 2007 08:01:57 -0700
Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft6.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l43EjJQV005258 for ; Thu, 3 May 2007 16:45:19 +0200
Date: Thu, 3 May 2007 16:45:21 +0200
From: Emmanuel Florac
To: xfs@oss.sgi.com
Subject: XFS crash on linux raid
Message-ID: <20070503164521.16efe075@harpe.intellique.com>
Organization: Intellique
X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 11262
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: eflorac@intellique.com
Precedence: bulk
X-list: xfs

Hello,

Apparently quite a lot of people encounter the same problem from time to time, but I couldn't find any solution.

When writing quite a lot to the filesystem (heavy load on the fileserver), the filesystem crashes when filled to 2.5~3 TB (varies from time to time). The filesystems tested were always running on a software RAID 0, with barriers disabled. I tend to think that the disabled write barriers are causing the crash, but I'll do some more tests to be sure.

I first met this problem on 12/23 (yup... merry christmas :) when a 13 TB filesystem went belly up:

Dec 23 01:38:10 storiq1 -- MARK --
Dec 23 01:58:10 storiq1 -- MARK --
Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp() returned an error 990 on md0. Returning error.
Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an error = 990 on md0
Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b
Dec 23 02:38:11 storiq1 -- MARK --
Dec 23 02:58:11 storiq1 -- MARK --

When mounting, it did this:

Filesystem "md0": Disabling barriers, not supported by the underlying device
XFS mounting filesystem md0
Starting XFS recovery on filesystem: md0 (logdev: internal)
Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr = 0xf7196600, dino bp = 0xf718e980, ino = 119318
Filesystem "md0": XFS internal error xlog_recover_do_inode_trans(1) at line 2352 of file fs/xfs/xfs_log_recover.c. Caller 0xc025d180
xlog_recover_do_inode_trans+0x93d/0xa00
xlog_recover_do_trans+0x140/0x160
xfs_buf_delwri_queue+0x2b/0xb0
xlog_recover_do_trans+0x140/0x160
kmem_zalloc+0x1f/0x50
xlog_recover_commit_trans+0x3f/0x50
xlog_recover_process_data+0xea/0x240
xlog_do_recovery_pass+0x39a/0xb70
hrtimer_run_queues+0x29/0x110
xlog_do_log_recovery+0x96/0xd0
xlog_do_recover+0x3b/0x170
xlog_recover+0xdd/0xf0
xfs_log_mount+0xa1/0x110
xfs_mountfs+0x825/0xf30
xfs_fs_cmn_err+0x27/0x30
xfs_ioinit+0x27/0x50
xfs_mount+0x2ff/0x520
vfs_mount+0x43/0x50
xfs_fs_fill_super+0x9a/0x200
debug_mutex_add_waiter+0x3d/0xd0
snprintf+0x27/0x30
disk_name+0xb4/0xc0
sb_set_blocksize+0x1f/0x50
get_sb_bdev+0x106/0x150
xfs_fs_get_sb+0x30/0x40
xfs_fs_fill_super+0x0/0x200
do_kern_mount+0x5f/0xe0
do_new_mount+0x77/0xc0
do_mount+0x18d/0x1f0
take_cpu_down+0xb/0x20
copy_mount_options+0x63/0xc0
sys_mount+0x9f/0xe0
syscall_call+0x7/0xb
XFS: log mount/recovery failed: error 990
XFS: log mount failed

xfs_repair (too old a version...) hosed the filesystem and destroyed most of the 2.6TB of data. Yes, there was no backup; I wrote a recovery tool to restore the video data from the raw device, but that is a different story.

The system was running vanilla 2.6.17.9, and md0 was made of 3 striped RAID-5 arrays on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB drives.

On similar hardware with 2 3Ware-9550 16x750GB striped together, but running 2.6.17.13, I had a similar fs crash last week. Unfortunately I don't have the logs at hand, but we were able to reproduce the crash several times at home:

Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282
xfs_btree_check_sblock+0x58/0xe0
xfs_alloc_lookup+0x142/0x400
xfs_alloc_lookup+0x142/0x400
kmem_zone_alloc+0x59/0xd0
xfs_btree_init_cursor+0x23/0x190
xfs_alloc_ag_vextent_near+0x54/0x9e0
xfs_bmap_add_extent+0x383/0x430
xfs_bmap_search_multi_extents+0x76/0xf0
xfs_alloc_ag_vextent+0x119/0x120
xfs_alloc_vextent+0x3db/0x4f0
xfs_bmap_btalloc+0x3ee/0x890
xfs_bmapi+0x1216/0x1690
xfs_dir2_grow_inode+0xf6/0x400
cache_alloc_refill+0xb6/0x1e0
xfs_idata_realloc+0x3b/0x130
xfs_dir2_sf_to_block+0xac/0x5d0
xfs_dir2_lookup+0x129/0x130
xfs_dir2_sf_addname+0x97/0x110
xfs_dir2_createname+0x144/0x150
xfs_trans_ijoin+0x2b/0x80
xfs_rename+0x354/0x9f0
xfs_access+0x3f/0x50
xfs_vn_rename+0x48/0xa0
__link_path_walk+0xc7c/0xc90
xfs_getattr+0x23f/0x2f0
mntput_no_expire+0x1b/0x80
cache_alloc_refill+0xb6/0x1e0
vfs_rename_other+0x96/0xd0
vfs_rename+0x258/0x2d0
do_rename+0x171/0x1a0
cache_grow+0x10b/0x160
cache_alloc_refill+0xb6/0x1e0
do_getname+0x4b/0x80
sys_renameat+0x47/0x80
sys_rename+0x28/0x30
syscall_call+0x7/0xb
Filesystem "md0": XFS internal error xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c.
Caller 0xc0245ec7
xfs_trans_cancel+0xd0/0x100
xfs_rename+0x6a7/0x9f0
xfs_rename+0x6a7/0x9f0
xfs_access+0x3f/0x50
xfs_vn_rename+0x48/0xa0
__link_path_walk+0xc7c/0xc90
xfs_getattr+0x23f/0x2f0
mntput_no_expire+0x1b/0x80
cache_alloc_refill+0xb6/0x1e0
vfs_rename_other+0x96/0xd0
vfs_rename+0x258/0x2d0
do_rename+0x171/0x1a0
cache_grow+0x10b/0x160
cache_alloc_refill+0xb6/0x1e0
do_getname+0x4b/0x80
sys_renameat+0x47/0x80
sys_rename+0x28/0x30
syscall_call+0x7/0xb
xfs_force_shutdown(md0,0x8) called from line 1151 of file fs/xfs/xfs_trans.c. Return address = 0xc025f7b9
Filesystem "md0": Corruption of in-memory data detected. Shutting down filesystem: md0
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9
xfs_force_shutdown(md0,0x1) called from line 338 of file fs/xfs/xfs_rw.c. Return address = 0xc025f7b9

After xfs_repair, the fs is fine. However, it crashes again when writing again a couple of GBs of data. It crashes again under 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36...

Out of curiosity, I've tried to use reiserfs (just to see how it compares regarding this). Reiserfs crashed before even writing 100MB!

So I tend to believe this is a "write barrier" problem and it looks really nasty!!! To sort this out I've started a test on a single 3Ware raid, without software raid.

Any idea on how to circumvent the problem to make software RAID/LVM usable?
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Thu May 3 16:02:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 16:02:10 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l43N25fB020477 for ; Thu, 3 May 2007 16:02:06 -0700 Received: (qmail 94083 invoked from network); 3 May 2007 23:02:04 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 3 May 2007 23:02:03 -0000 X-YMail-OSG: ArKZSuYVM1kqn6qAuVrrwBMH7q78gcbdZ1PV.SHTJD7BztaEkuYJYhv3Ob5ff5ZJrgc4r7nNHw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id B6EFB1827265; Thu, 3 May 2007 16:02:02 -0700 (PDT) Date: Thu, 3 May 2007 16:02:02 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070503230202.GA12747@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> X-archive-position: 11263 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under > 2.6.17.13, 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... 4K stacks? > So I tend to believe this is a "write barrier" problem and it looks > really nasty!!! You could try "mount -o nobarrier ...." 
From owner-xfs@oss.sgi.com Thu May 3 17:59:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 17:59:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l440xXfB009201 for ; Thu, 3 May 2007 17:59:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA29867; Fri, 4 May 2007 10:59:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l440xNAf83828843; Fri, 4 May 2007 10:59:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l440xMeV83970284; Fri, 4 May 2007 10:59:22 +1000 (AEST) Date: Fri, 4 May 2007 10:59:22 +1000 From: David Chinner To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504005922.GC32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503164521.16efe075@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11264 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 04:45:21PM +0200, Emmanuel Florac wrote: > > Hello, > Apparently quite a lot of people do encounter the same problem from > time to time, but I couldn't find any solution. > > When writing quite a lot to the filesystem (heavy load on the > fileserver), the filesystem crashes when filled at 2.5~3TB (varies from > time to time). The filesystems tested where always running on a software > raid 0, with disabled barriers. 
I tend to think that disabled write
> barriers are causing the crash but I'll do some more tests to get sure.
>
> I've met this problem for the first time on 12/23 (yup... merry
> christmas :) when a 13 TB filesystem went belly up :
>
> Dec 23 01:38:10 storiq1 -- MARK --
> Dec 23 01:58:10 storiq1 -- MARK --
> Dec 23 02:10:29 storiq1 kernel: xfs_iunlink_remove: xfs_itobp()
> returned an error 990 on md0. Returning error.
> Dec 23 02:10:29 storiq1 kernel: xfs_inactive:^Ixfs_ifree() returned an
> error = 990 on md0
> Dec 23 02:10:29 storiq1 kernel: xfs_force_shutdown(md0,0x1) called from
> line 1763 of file fs/xfs/xfs_vnodeops.c. Return address = 0xc027f78b
> Dec 23 02:38:11 storiq1 -- MARK --
> Dec 23 02:58:11 storiq1 -- MARK --

So, while trying to remove an inode, a corruption was found on disk and it shut the filesystem down. Were there any I/O errors reported before the shutdown?

> When mounting, it did that :
>
> Filesystem "md0": Disabling barriers, not supported by the underlying
> device XFS mounting filesystem md0
> Starting XFS recovery on filesystem: md0 (logdev: internal)
> Filesystem "md0": xfs_inode_recover: Bad inode magic number, dino ptr =
> 0xf7196600, dino bp = 0xf718e980, ino = 119318 Filesystem "md0": XFS

Which was found again during log recovery.

> The system was running vanilla 2.6.17.9, and md0 was made of 3 striped
> RAID-5 on 3 3Ware-9550 cards, each hardware RAID-5 made of 8 750 GB
> drives.
>
> On a similar hardware with 2 3Ware-9550 16x750GB striped together, but
> running 2.6.17.13, I had a similar fs crash last week. Unfortunately I
> don't have the logs at hand, but we where able to reproduce several
> times the crash at home :

Hmm - 750GB drives are brand new. I wouldn't rule out media issues at this point...

> Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 336
> of file fs/xfs/xfs_btree.c. Caller 0xc01fb282

Memory corruption?

> line 1151 of file fs/xfs/xfs_trans.c.
Return address = 0xc025f7b9 > Filesystem "md0": Corruption of in-memory data detected. Shutting down > filesystem: md0 Please umount the filesystem, and rectify the > problem(s) xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > xfs_force_shutdown(md0,0x1) called from line 338 of file > fs/xfs/xfs_rw.c. Return address = 0xc025f7b9 > > After xfs_repair, the fs is fine. However, it crashes again when > writing again a couple of GBs of data. It crashes again under 2.6.17.13, > 2.6.17.13 SMP, 2.6.18.8, 2.6.16.36... > > Out of curiosity, I've tried to use reiserfs (just to see how it > compares regarding this). Reiserfs crashed before even writing 100MB! That indicates there's something wrong other than the filesystem. I'd suggest making sure your raid arrays, memory, etc are all functioning correctly first. What platform are you running on? Are you running ia32 with 4k stacks? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 19:46:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 19:46:05 -0700 (PDT) Received: from mailsecure1.itc.griffith.edu.au (mailsecure1-out.itc.griffith.edu.au [132.234.242.61]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l442jwfB031706 for ; Thu, 3 May 2007 19:46:01 -0700 Received: from mailsecure1.itc.griffith.edu.au (unknown [127.0.0.1]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 04449286 for ; Fri, 4 May 2007 12:45:57 +1000 (EST) X-AuditID: 84eaf23c-af2f2bb000004912-c9-463a9e64a23b Received: from nox-1.itc.griffith.edu.au (sc2bigip02-242.nms.griffith.edu.au [132.234.242.254]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 4AF7730187 for ; Fri, 4 May 2007 12:45:56 +1000 (EST) Received: from [132.234.242.254] (helo=studentemail.griffith.edu.au) by nox-1.itc.griffith.edu.au with esmtp (Exim 4.63) (envelope-from ) id 1Hjnnw-0006gz-52 
for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 Received: from ss64.me.griffith.edu.au ([132.234.103.168]) by studentemail.griffith.edu.au (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0JHH002HWX0KTM40@studentemail.griffith.edu.au> for xfs@oss.sgi.com; Fri, 04 May 2007 12:45:56 +1000 (EST) Date: Fri, 04 May 2007 12:45:55 +1000 From: Stephen So Subject: Re: Slow performance when extracting tarballs In-reply-to: <20070430213538.GA30809@tuatara.stupidest.org> To: xfs@oss.sgi.com Message-id: <463A9E63.7010007@griffith.edu.au> Organization: Griffith School of Engineering, Griffith University, Australia MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.0 References: <4635DAA4.4070402@griffith.edu.au> <20070430213538.GA30809@tuatara.stupidest.org> User-Agent: Thunderbird 2.0.0.0 (X11/20070326) X-Brightmail-Tracker: AAAAAA== X-archive-position: 11265 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: S.So@griffith.edu.au Precedence: bulk X-list: xfs Hi, thanks for the reply xfs-bounce@oss.sgi.com wrote: > what does "vmstat 1" look like during this? 
>

I did a vmstat 1 and this is the output:

% vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 1002716 3540 745316 0 0 59 15 559 560 1 2 96 1 0
0 0 0 1002700 3540 745316 0 0 0 12 1091 1543 2 2 97 0 0
1 0 0 995464 3540 750300 0 0 2060 401 1134 2569 18 3 76 4 0
2 0 0 980884 3540 762652 0 0 3712 1376 1238 4850 43 7 43 8 0
1 0 0 968368 3540 776152 0 0 3968 1568 1224 5155 43 7 44 7 0
2 0 0 954660 3540 787264 0 0 3584 1344 1244 4542 38 6 45 11 0
1 0 0 942668 3540 797556 0 0 2944 1431 1224 4376 36 6 48 11 0
1 0 0 932852 3540 807304 0 0 3072 1312 1229 4164 33 6 46 15 0
3 0 0 922724 3540 817912 0 0 3072 1440 1215 4378 37 7 44 12 0
0 1 0 911612 3540 828552 0 0 3328 1568 1242 4558 37 5 46 12 0
1 0 0 900804 3540 839140 0 0 3072 1568 1222 4279 36 5 45 13 0
0 0 0 887824 3540 848788 0 0 3072 1427 1250 3862 35 5 46 14 0
1 0 0 880036 3540 857700 0 0 2560 1529 1229 3775 31 7 47 16 0
1 0 0 867552 3540 867548 0 0 3072 1632 1250 4035 36 5 46 14 0
0 1 0 859156 3540 877576 0 0 2944 1696 1239 4291 33 6 45 16 0
1 0 0 852904 3540 883628 0 0 1664 5403 1229 3111 23 4 48 25 0
0 1 0 846328 3540 888188 0 0 1536 5300 1188 2622 21 6 61 12 0
0 1 0 842076 3540 892752 0 0 1280 5383 1232 2478 21 5 62 12 0
1 1 0 837312 3540 897396 0 0 1408 5330 1211 2476 20 5 53 24 0
6 1 0 828876 3540 903572 0 0 1920 5771 1245 2904 24 5 46 25 0
1 0 0 822016 3540 912304 0 0 2304 1203 1216 3897 30 7 55 7 0
0 1 0 818404 3540 915628 0 0 1024 9446 1181 2028 14 5 63 17 0
0 1 0 809552 3540 923336 0 0 2432 1109 1228 3344 28 5 46 22 0
1 0 0 801124 3540 928892 0 0 1664 9195 1201 2821 22 6 59 13 0
0 0 0 794364 3540 935364 0 0 1792 5296 1218 3052 24 6 52 18 0
2 1 0 789784 3540 941564 0 0 2048 4992 1194 3116 23 4 51 23 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 781540 3540 947180 0 0 1536 6434 1226 2942 23 6 50 22 0
4 0 0 777300 3540 953628 0 0 1920 1088 1200 2970 25 5 56 14 0
0 1 0 772892 3540 957032 0 0 1152 9440 1201 2141 17 4 59 21 0
1 0 0 764432 3540 964572 0 0 2304 1253 1216 3198 29 4 46 22 0
2 0 0 756516 3540 970284 0 0 1664 9720 1222 2832 22 5 57 17 0
1 1 0 750880 3540 977204 0 0 2176 1100 1207 2973 25 5 49 20 0
0 0 0 745424 3540 980768 0 0 1024 9140 1200 2205 16 4 66 14 0
0 1 0 741928 3540 986200 0 0 1664 1376 1193 2746 20 5 61 15 0
0 1 0 734536 3540 992480 0 0 1920 5516 1226 2874 24 5 57 14 0
0 1 0 729072 3540 997168 0 0 1408 5328 1199 2473 21 5 62 13 0
0 1 0 723228 3540 1003288 0 0 1792 5509 1243 2959 24 6 54 15 0
2 0 0 717948 3540 1007752 0 0 1408 5308 1196 2418 20 4 59 18 0
4 0 0 709940 3540 1013564 0 0 1536 5568 1217 3145 25 4 55 16 0
0 0 0 701132 3540 1021948 0 0 2816 5612 1224 3562 32 6 47 16 0
0 1 0 702448 3540 1023140 0 0 256 5108 1203 1538 6 5 73 15 0
0 1 0 691688 3540 1032264 0 0 2688 1852 1239 3630 32 5 45 18 0
0 1 0 688292 3540 1034228 0 0 768 9348 1198 1671 10 3 60 27 0
1 0 0 682636 3540 1039248 0 0 1408 1069 1198 2729 20 5 47 29 0
1 0 0 676848 3540 1044456 0 0 1408 5704 1234 2897 20 5 59 16 0
1 0 0 672460 3540 1049428 0 0 1536 5484 1215 2813 19 5 55 22 0
1 0 0 663820 3540 1056108 0 0 2176 5258 1241 3245 27 5 49 20 0
1 0 0 660064 3540 1061708 0 0 1664 1688 1222 3100 22 6 60 11 0
0 0 0 653400 3540 1065924 0 0 1152 5496 1221 2495 17 4 51 28 0
0 1 0 651468 3540 1069324 0 0 1152 5278 1187 2157 16 3 67 14 0
2 0 0 645132 3540 1073620 0 0 1152 5466 1221 2714 19 5 61 17 0
3 0 0 640544 3540 1078720 0 0 1664 5587 1219 2830 21 6 51 21 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 634040 3540 1083872 0 0 1536 5223 1208 2996 20 3 64 14 0
0 1 0 629024 3540 1090772 0 0 2048 5342 1199 3141 26 5 49 20 0
0 0 0 621116 3540 1095840 0 0 1664 4410 1211 2631 22 4 52 22 0
0 0 0 615760 3540 1100840 0 0 1408 6032 1186 2601 20 6 61 14 0
0 0 0 608852 3540 1107448 0 0 1920 1192 1215 3228 24 6 50 21 0
0 1 0 605872 3540 1112248 0 0 1536 5424 1220 2779 22 4 63 12 0
0 0 0 598016 3540 1117476 0 0 1536 5603 1227 3016 23 4 53 21 0
2 0 0 592416 3540 1122576 0 0 1536 5407 1217 2671 22 7 56 16 0
0 0 0 587504 3540 1127404 0 0 1408 4624 1230 2599 19 5 55 21 0
2 1 0 585800 3540 1130704 0 0 1152 1880 1175 2431 15 2 53 30 0
0 1 0 582732 3540 1133696 0 0 896 5293 1210 2357 16 4 74 6 0
2 0 0 575528 3540 1138696 0 0 1536 5424 1214 2585 22 5 48 26 0
1 0 0 569992 3540 1145872 0 0 2176 1519 1245 3267 27 5 50 17 0
0 0 0 563568 3540 1149772 0 0 1152 8164 1189 2364 15 4 74 6 0
0 0 0 559936 3540 1153020 0 0 896 2483 1198 2145 16 6 64 15 0
1 0 0 556504 3540 1156720 0 0 1408 5248 1206 2152 17 6 62 14 0
0 1 0 553568 3540 1161280 0 0 1280 5716 1231 2620 19 4 59 18 0
1 0 0 544820 3540 1167580 0 0 2048 1545 1234 2947 26 5 51 18 0
1 0 0 541096 3540 1170748 0 0 1024 5272 1205 2107 17 3 73 8 0
0 1 0 535092 3540 1176848 0 0 1792 6132 1225 2861 25 6 49 19 0
0 1 0 531696 3540 1181220 0 0 1280 969 1215 2758 18 3 66 14 0
0 1 0 528920 3540 1184220 0 0 896 5268 1192 2248 16 4 71 10 0
0 0 0 520532 3540 1189884 0 0 1664 5425 1252 3008 21 4 64 10 0
0 0 0 514012 3540 1196084 0 0 1920 1920 1214 3110 25 6 60 10 0
1 0 0 511804 3540 1199608 0 0 1152 5336 1240 2224 20 6 60 15 0
0 0 0 503212 3540 1206108 0 0 2048 6516 1227 2963 26 4 58 12 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 500968 3540 1208380 0 0 512 4684 1214 2066 13 5 79 3 0
2 0 0 496216 3540 1212580 0 0 1408 5727 1214 2399 21 7 65 8 0
4 0 0 491268 3540 1217184 0 0 1408 4304 1243 2593 21 5 64 11 0
2 0 0 488856 3540 1219784 0 0 896 2058 1189 1849 15 4 70 11 0
0 1 0 483660 3540 1224340 0 0 1408 5824 1240 2571 23 4 52 22 0
0 0 0 477704 3540 1229940 0 0 1536 5170 1173 2855 21 6 52 21 0
0 1 0 474500 3540 1234952 0 0 1536 5163 1212 2629 20 3 55 23 0
1 0 0 465196 3540 1242552 0 0 2304 940 1204 3265 28 4 47 22 0
0 0 0 458280 3540 1247892 0 0 1664 9382 1211 2719 19 6 70 4 0
1 0 0 453276 3540 1252592 0 0 1408 5040 1176 2827 19 5 58 18 0
0 0 0 446840 3540 1258496 0 0 1792 5676 1221 3025 24 5 56 14 0
1 0 0 443180 3540 1264096 0 0 1664 932 1193 2680 21 5 56 19 0
1 0 0 435748 3540 1269060 0 0 1664 5182 1209 2635 21 4 49 27 0
0 1 0 432060 3540 1274860 0 0 1536 5376 1183 2860 21 7 51 20 0
0 1 0 426376 3540 1279492 0 0 1408 5177 1214 2480 19 3 51 27 0
0 1 0 422356 3540 1283992 0 0 1280 5256 1196 2516 18 4 55 23 0
0 0 0 410112 3540 1292848 0 0 2560 1916 1254 3839 32 5 49 14 0
1 0 0 407296 3540 1295448 0 0 896 8244 1203 1816 15 6 74 6 0
1 0 0 405256 3540 1297456 0 0 384 2276 1192 1729 10 5 78 7 0
1 0 0 401044 3540 1303756 0 0 1920 2004 1260 2779 29 5 56 11 0
1 0 0 397668 3540 1306976 0 0 1024 5432 1229 2264 18 4 68 9 0
1 0 0 393720 3540 1310076 0 0 1024 5520 1219 1983 17 6 66 11 0
1 0 0 384148 3540 1316896 0 0 2048 2224 1279 3279 33 5 54 9 0
1 0 0 384716 3540 1318996 0 0 336 5291 1194 2084 12 3 74 13 0
0 0 0 384716 3540 1319252 0 0 0 149 1115 1467 1 1 98 0 0
0 0 0 384716 3540 1319252 0 0 0 92 1065 1075 2 3 95 0 0

> have you also tried setting (increasing) logbsize? (i think you need
> > v2 logs to make that work)
>

I read in the man page for mount that the max logbsize is 32K and the default value for machines with more than 32 MB of memory is 32768, so I assumed it was already set to maximum.

Best regards, Steve.
--
__________________________________________________
Dr Stephen So, PhD, MIEEE
Griffith School of Engineering & Institute for Integrated and Intelligent Systems
Griffith University, Gold Coast Campus
PMB 50 Gold Coast Mail Centre
Gold Coast, QLD, 9726, Australia.
E-mail: s.so@griffith.edu.au Phone: +61 7 5552 8663 Fax: +61 7 5552 8065 __________________________________________________ From owner-xfs@oss.sgi.com Thu May 3 21:30:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:17 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444UDfB026817 for ; Thu, 3 May 2007 21:30:14 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444U3vS017825 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:30:04 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444U2BH028973; Thu, 3 May 2007 21:30:02 -0700 Date: Thu, 3 May 2007 21:30:02 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-Id: <20070503213002.eff696db.akpm@linux-foundation.org> In-Reply-To: <20070426181101.GC7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit 
X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11267 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" wrote: > +unsigned int ext4_ext_check_overlap(struct inode *inode, > + struct ext4_extent *newext, > + struct ext4_ext_path *path) > +{ > + unsigned long b1, b2; > + unsigned int depth, len1; > + > + b1 = le32_to_cpu(newext->ee_block); > + len1 = le16_to_cpu(newext->ee_len); > + depth = ext_depth(inode); > + if (!path[depth].p_ext) > + goto out; > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > + > + /* get the next allocated block if the extent in the path > + * is before the requested block(s) */ > + if (b2 < b1) { > + b2 = ext4_ext_next_allocated_block(path); > + if (b2 == EXT_MAX_BLOCK) > + goto out; > + } > + > + if (b1 + len1 > b2) { Are we sure that b1+len cannot wrap through zero here? > + newext->ee_len = cpu_to_le16(b2 - b1); > + return 1; > + } From owner-xfs@oss.sgi.com Thu May 3 21:30:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:30:11 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444U6fB026766 for ; Thu, 3 May 2007 21:30:06 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444Tu2f017820 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:29:57 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444TtUT028928; Thu, 3 May 2007 21:29:55 -0700 Date: Thu, 3 May 2007 21:29:55 -0700 From: Andrew Morton To: "Amit K. 
Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503212955.b1b6443c.akpm@linux-foundation.org> In-Reply-To: <20070426180332.GA7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11266 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > This patch implements the fallocate() system call and adds support for > i386, x86_64 and powerpc. > > ... > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. 
If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said? > +{ > + struct file *file; > + struct inode *inode; > + long ret = -EINVAL; > + > + if (len == 0 || offset < 0) > + goto out; The posix spec implies that negative `len' is permitted - presumably "allocate ahead of `offset'". How peculiar. > + ret = -EBADF; > + file = fget(fd); > + if (!file) > + goto out; > + if (!(file->f_mode & FMODE_WRITE)) > + goto out_fput; > + > + inode = file->f_path.dentry->d_inode; > + > + ret = -ESPIPE; > + if (S_ISFIFO(inode->i_mode)) > + goto out_fput; > + > + ret = -ENODEV; > + if (!S_ISREG(inode->i_mode)) > + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. > + ret = -EFBIG; > + if (offset + len > inode->i_sb->s_maxbytes) > + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind. > + if (inode->i_op && inode->i_op->fallocate) > + ret = inode->i_op->fallocate(inode, mode, offset, len); > + else > + ret = -ENOSYS; If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. If we're not going to handle negative `len' then we should check for it. > +out_fput: > + fput(file); > +out: > + return ret; > +} > +EXPORT_SYMBOL(sys_fallocate); I don't believe this needs to be exported to modules? > +/* > + * fallocate() modes > + */ > +#define FA_ALLOCATE 0x1 > +#define FA_DEALLOCATE 0x2 Now those aren't in posix. 
They should be documented, along with their expected semantics. > #ifdef __KERNEL__ > > #include > @@ -1125,6 +1131,7 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? From owner-xfs@oss.sgi.com Thu May 3 21:31:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:31:48 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444VhfB027433 for ; Thu, 3 May 2007 21:31:44 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444VY8K017921 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:31:35 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444VXbq029006; Thu, 3 May 2007 21:31:33 -0700 Date: Thu, 3 May 2007 21:31:33 -0700 From: Andrew Morton To: "Amit K. 
Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070503213133.d1559f52.akpm@linux-foundation.org> In-Reply-To: <20070426181332.GD7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11268 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > This patch has the ext4 implemtation of fallocate system call. > > ... > > + /* ext4_can_extents_be_merged should have checked that either > + * both extents are uninitialized, or both aren't. Thus we > + * need to check only one of them here. > + */ Please always format multiline comments like this: /* * ext4_can_extents_be_merged should have checked that either * both extents are uninitialized, or both aren't. Thus we * need to check only one of them here. */ > ... > > +/* > + * ext4_fallocate: > + * preallocate space for a file > + * mode is for future use, e.g. 
for unallocating preallocated blocks etc. > + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do . If the file is using bitmaps then we will do . But what? Here is where it should be described. > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > +{ > + handle_t *handle; > + ext4_fsblk_t block, max_blocks; > + int ret, ret2, nblocks = 0, retries = 0; > + struct buffer_head map_bh; > + unsigned int credits, blkbits = inode->i_blkbits; > + > + /* Currently supporting (pre)allocate mode _only_ */ > + if (mode != FA_ALLOCATE) > + return -EOPNOTSUPP; > + > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Also, posix says nothing about fallocate() returning ENOTTY. > + block = offset >> blkbits; > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > + - block; > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? 
> + handle=ext4_journal_start(inode, credits + Please always put spaces around "=" > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); And around "+" > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > +retry: > + ret = 0; > + while (ret >= 0 && ret < max_blocks) { > + block = block + ret; > + max_blocks = max_blocks - ret; > + ret = ext4_ext_get_blocks(handle, inode, block, > + max_blocks, &map_bh, > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > + BUG_ON(!ret); BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and ext4_error() would be safer and more useful here. > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) Use buffer_new() here. A separate patch which fixes the three existing instances of open-coded BH_foo usage would be appreciated. > + && ((block + ret) > (i_size_read(inode) << blkbits))) Check for wrap though the sign bit and through zero please. > + nblocks = nblocks + ret; > + } > + > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > + goto retry; > + > + /* Time to update the file size. > + * Update only when preallocation was requested beyond the file size. > + */ Fix comment layout. > + if ((offset + len) > i_size_read(inode)) { Both the lhs and the rhs here are signed. Please review for possible overflows through the sign bit and through zero. Perhaps a comment explaining why it's correct would be appropriate. > + if (ret > 0) { > + /* if no error, we assume preallocation succeeded completely */ > + mutex_lock(&inode->i_mutex); > + i_size_write(inode, offset + len); > + EXT4_I(inode)->i_disksize = i_size_read(inode); > + mutex_unlock(&inode->i_mutex); > + } else if (ret < 0 && nblocks) { > + /* Handle partial allocation scenario */ The above two comments should be indented one additional tabstop. 
> + loff_t newsize; > + mutex_lock(&inode->i_mutex); > + newsize = (nblocks << blkbits) + i_size_read(inode); > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > + EXT4_I(inode)->i_disksize = i_size_read(inode); > + mutex_unlock(&inode->i_mutex); > + } > + } > + ext4_mark_inode_dirty(handle, inode); > + ret2 = ext4_journal_stop(handle); > + if (ret > 0) > + ret = ret2; > + > + return ret > 0 ? 0 : ret; > +} > + > EXPORT_SYMBOL(ext4_mark_inode_dirty); > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > EXPORT_SYMBOL(ext4_ext_insert_extent); > EXPORT_SYMBOL(ext4_ext_walk_space); > EXPORT_SYMBOL(ext4_ext_find_goal); > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > +EXPORT_SYMBOL(ext4_fallocate); > > Index: linux-2.6.21/fs/ext4/file.c > =================================================================== > --- linux-2.6.21.orig/fs/ext4/file.c > +++ linux-2.6.21/fs/ext4/file.c > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > .removexattr = generic_removexattr, > #endif > .permission = ext4_permission, > + .fallocate = ext4_fallocate, > }; > > Index: linux-2.6.21/include/linux/ext4_fs.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs.h > +++ linux-2.6.21/include/linux/ext4_fs.h > @@ -102,6 +102,8 @@ > EXT4_GOOD_OLD_FIRST_INO : \ > (s)->s_first_ino) > #endif > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > + (~((1 << blkbits)-1))) Maybe a comment describing what this does? Probably it's obvious enough. I think it could use the standard ALIGN macro. Is blkbits sufficiently parenthesised here? Even if it is, adding the parens would be better practice. > /* > * Macro-instructions used to manage fragments > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > __u32 free_blocks_count; > }; > > +/* Following is used by preallocation logic to tell get_blocks() that we > + * want uninitialzed extents. 
> + */ Please convert all newly-added multiline comments to the preferred layout. > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > /* > * ioctl commands > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > extern void ext4_ext_truncate(struct inode *, struct page *); > extern void ext4_ext_init(struct super_block *); > extern void ext4_ext_release(struct super_block *); > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); argh. And feel free to give these args some useful names. > static inline int > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > unsigned long max_blocks, struct buffer_head *bh, > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > =================================================================== > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > @@ -125,6 +125,19 @@ struct ext4_ext_path { > #define EXT4_EXT_CACHE_EXTENT 2 > > /* > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > + * extents. Applications can issue an IOCTL for preallocation, which results > + * in assigning unitialized extents to the file. > + */ > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > + cpu_to_le16(0x8000)) > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x8000) > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > + 0x7FFF) inlined C functions are preferred, and I think these could be implemented that way. 
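A minimal userspace sketch of the two macro remarks in the review above (parenthesising the `blkbits` argument, and preferring inline C functions for the uninitialized-extent helpers). Everything prefixed `sketch_` or `SKETCH_` is a hypothetical stand-in and not part of the posted patch: `struct sketch_extent` models only the `ee_len` field, and the two byte-order helpers merely imitate what the kernel's `cpu_to_le16()`/`le16_to_cpu()` do. The 0x8000 and 0x7FFF masks are taken from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the on-disk extent; only ee_len is modelled. */
struct sketch_extent {
	uint16_t ee_len;	/* little-endian; top bit flags "uninitialized" */
};

#define SKETCH_EXT_UNINIT_BIT	0x8000u
#define SKETCH_EXT_LEN_MASK	0x7FFFu

/* Assumed userspace replacements for cpu_to_le16()/le16_to_cpu(). */
static inline uint16_t sketch_cpu_to_le16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

static inline uint16_t sketch_le16_to_cpu(uint16_t v)
{
	uint8_t b[2];

	memcpy(b, &v, sizeof(b));
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Set the "uninitialized" flag bit without touching the length bits. */
static inline void sketch_ext_mark_uninitialized(struct sketch_extent *ext)
{
	ext->ee_len |= sketch_cpu_to_le16(SKETCH_EXT_UNINIT_BIT);
}

/* True if the extent carries the "uninitialized" flag bit. */
static inline bool sketch_ext_is_uninitialized(const struct sketch_extent *ext)
{
	return (sketch_le16_to_cpu(ext->ee_len) & SKETCH_EXT_UNINIT_BIT) != 0;
}

/* Length in blocks with the flag bit masked off. */
static inline unsigned int
sketch_ext_get_actual_len(const struct sketch_extent *ext)
{
	return sketch_le16_to_cpu(ext->ee_len) & SKETCH_EXT_LEN_MASK;
}

/*
 * Round size up to a multiple of (1 << blkbits). As a function, the
 * arguments are evaluated once and need no defensive parentheses.
 */
static inline uint64_t sketch_block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t mask = ((uint64_t)1 << blkbits) - 1;

	return (size + mask) & ~mask;
}
```

Compared with the macros, the functions give type checking on the extent pointer and evaluate each argument exactly once; whether the real helpers can be written this way depends on header ordering in the ext4 tree, which this sketch does not address.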
From owner-xfs@oss.sgi.com Thu May 3 21:32:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:32:51 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444WmfB027913 for ; Thu, 3 May 2007 21:32:49 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444WdFD017959 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:32:40 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444Wc1E029024; Thu, 3 May 2007 21:32:39 -0700 Date: Thu, 3 May 2007 21:32:38 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-Id: <20070503213238.5cdb1585.akpm@linux-foundation.org> In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 
X-archive-position: 11269 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote: > This patch adds write support for preallocated (using fallocate system > call) blocks/extents. The preallocated extents in ext4 are marked > "uninitialized", hence they need special handling especially while > writing to them. This patch takes care of that. > > ... > > /* > + * ext4_ext_try_to_merge: > + * tries to merge the "ex" extent to the next extent in the tree. > + * It always tries to merge towards right. If you want to merge towards > + * left, pass "ex - 1" as argument instead of "ex". > + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > + * 1 if they got merged. OK. > + */ > +int ext4_ext_try_to_merge(struct inode *inode, > + struct ext4_ext_path *path, > + struct ext4_extent *ex) > +{ > + struct ext4_extent_header *eh; > + unsigned int depth, len; > + int merge_done=0, uninitialized = 0; space around "=", please. Many people prefer not to do the multiple-definitions-per-line, btw: int merge_done = 0; int uninitialized = 0; reasons: - It gives you some space for a nice comment - It makes patches much more readable, and it makes rejects easier to fix - standardisation. > + depth = ext_depth(inode); > + BUG_ON(path[depth].p_hdr == NULL); > + eh = path[depth].p_hdr; > + > + while (ex < EXT_LAST_EXTENT(eh)) { > + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) > + break; > + /* merge with next extent!
*/ > + if (ext4_ext_is_uninitialized(ex)) > + uninitialized = 1; > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > + + ext4_ext_get_actual_len(ex + 1)); > + if (uninitialized) > + ext4_ext_mark_uninitialized(ex); > + > + if (ex + 1 < EXT_LAST_EXTENT(eh)) { > + len = (EXT_LAST_EXTENT(eh) - ex - 1) > + * sizeof(struct ext4_extent); > + memmove(ex + 1, ex + 2, len); > + } > + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); Kernel convention is to put spaces around "-" > + merge_done = 1; > + BUG_ON(eh->eh_entries == 0); eek, scary BUG_ON. Do we really need to be that severe? Would it be better to warn and run ext4_error() here? > + } > + > + return merge_done; > +} > + > + > > ... > > +/* > + * ext4_ext_convert_to_initialized: > + * this function is called by ext4_ext_get_blocks() if someone tries to write > + * to an uninitialized extent. It may result in splitting the uninitialized > + * extent into multiple extents (upto three). Atleast one initialized extent > + * and atmost two uninitialized extents can result. There are some typos here > + * There are three possibilities: > + * a> No split required: Entire extent should be initialized. > + * b> Split into two extents: Only one end of the extent is being written to. > + * c> Split into three extents: Somone is writing in middle of the extent.
and here > + */ > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > + struct ext4_ext_path *path, > + ext4_fsblk_t iblock, > + unsigned long max_blocks) > +{ > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > + struct ext4_extent_header *eh; > + unsigned int allocated, ee_block, ee_len, depth; > + ext4_fsblk_t newblock; > + int err = 0, ret = 0; > + > + depth = ext_depth(inode); > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + ee_block = le32_to_cpu(ex->ee_block); > + ee_len = ext4_ext_get_actual_len(ex); > + allocated = ee_len - (iblock - ee_block); > + newblock = iblock - ee_block + ext_pblock(ex); > + ex2 = ex; > + > + /* ex1: ee_block to iblock - 1 : uninitialized */ > + if (iblock > ee_block) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* for sanity, update the length of the ex2 extent before > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > + * overlap of blocks. > + */ > + if (!ex1 && allocated > max_blocks) > + ex2->ee_len = cpu_to_le16(max_blocks); > + /* ex3: to ee_block + ee_len : uninitialised */ > + if (allocated > max_blocks) { > + unsigned int newdepth; > + ex3 = &newex; > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > + ext4_ext_mark_uninitialized(ex3); > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > + if (err) > + goto out; > + /* The depth, and hence eh & ex might change > + * as part of the insert above. 
> + */ > + newdepth = ext_depth(inode); > + if (newdepth != depth) > + { Use if (newdepth != depth) { > + depth=newdepth; spaces > + path = ext4_ext_find_extent(inode, iblock, NULL); > + if (IS_ERR(path)) { > + err = PTR_ERR(path); > + path = NULL; > + goto out; > + } > + eh = path[depth].p_hdr; > + ex = path[depth].p_ext; > + if (ex2 != &newex) > + ex2 = ex; > + } > + allocated = max_blocks; > + } > + /* If there was a change of depth as part of the > + * insertion of ex3 above, we need to update the length > + * of the ex1 extent again here > + */ > + if (ex1 && ex1 != ex) { > + ex1 = ex; > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > + ext4_ext_mark_uninitialized(ex1); > + ex2 = &newex; > + } > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > + ex2->ee_block = cpu_to_le32(iblock); > + ex2->ee_start = cpu_to_le32(newblock); > + ext4_ext_store_pblock(ex2, newblock); > + ex2->ee_len = cpu_to_le16(allocated); > + if (ex2 != ex) > + goto insert; > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > + goto out; The preferred style is err = ext4_ext_get_access(handle, inode, path + depth); if (err) goto out; > + /* New (initialized) extent starts from the first block > + * in the current extent. i.e., ex2 == ex > + * We have to see if it can be merged with the extent > + * on the left. > + */ > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > + * since it merges towards right _only_. > + */ > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + depth = ext_depth(inode); > + ex2--; > + } > + } > + /* Try to Merge towards right. This might be required > + * only when the whole extent is being written to. > + * i.e. ex2==ex and ex3==NULL. 
> + */ > + if (!ex3) { > + ret = ext4_ext_try_to_merge(inode, path, ex2); > + if (ret) { > + err = ext4_ext_correct_indexes(handle, inode, path); > + if (err) > + goto out; > + } > + } > + /* Mark modified extent as dirty */ > + err = ext4_ext_dirty(handle, inode, path + depth); > + goto out; > +insert: > + err = ext4_ext_insert_extent(handle, inode, path, &newex); > +out: > + return err ? err : allocated; > +} Sigh. I hope you guys know how all this works, because the extent code is a mystery to me. Is the on-disk layout and the allocation strategy described anywhere? > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); Again, I do think that sticking the identifiers in there helps readability. Although it is not as important in a boring old declaration as it is in, say, inode_operations, etc. Please try to keep the code looking nice in an 80-column display. From owner-xfs@oss.sgi.com Thu May 3 21:55:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 21:55:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l444tdfB002706 for ; Thu, 3 May 2007 21:55:40 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l444tTgs018661 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 21:55:31 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l444tSik029320; Thu, 3 May 2007 21:55:29 -0700 Date: Thu, 3 May 2007 21:55:28 -0700 From: Andrew Morton To: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503215528.d8ab4e47.akpm@linux-foundation.org> In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11270 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton wrote: > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. But it doesn't handle offset+len wrapping through zero. 
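One way to make the range check above explicit about overflow, as a userspace sketch only. `sketch_check_range()` and its `maxbytes` parameter are hypothetical names (the latter stands in for `inode->i_sb->s_maxbytes` and is assumed non-negative), `int64_t` stands in for `loff_t`, and rejecting `len <= 0` with EINVAL is an assumption based on the quoted patch and the man page wording, not a settled semantic. The point is that `offset + len` is never computed, so the sum cannot go negative or wrap.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a requested [offset, offset + len) byte range against the
 * largest file size the filesystem supports.
 *
 * Returns 0 if the range is acceptable, -EINVAL for a negative offset
 * or non-positive length, and -EFBIG if the range would extend past
 * maxbytes.
 */
static int sketch_check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Both operands are non-negative here, so maxbytes - offset
	 * cannot overflow once offset <= maxbytes has been established.
	 */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

Rearranging the comparison into a subtraction is a common idiom for this kind of check; a short comment stating why it cannot overflow would also cover the request for an explanation made elsewhere in this review.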
From owner-xfs@oss.sgi.com Thu May 3 22:16:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 22:16:49 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l445GifB007660 for ; Thu, 3 May 2007 22:16:45 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id 8F525DDFF5; Fri, 4 May 2007 15:16:43 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Date: Fri, 4 May 2007 14:41:50 +1000 From: Paul Mackerras To: Andrew Morton Cc: "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11271 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Andrew Morton writes: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... 
> > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. From owner-xfs@oss.sgi.com Thu May 3 23:08:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:08:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44687fB021573 for ; Thu, 3 May 2007 23:08:09 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA06552; Fri, 4 May 2007 16:07:46 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4467cAf83970051; Fri, 4 May 2007 16:07:38 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4467VZ384026819; Fri, 4 May 2007 16:07:31 +1000 (AEST) Date: Fri, 4 May 2007 16:07:31 +1000 From: David Chinner To: Andrew Morton Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11272 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. 
Hmmmm - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 3 23:28:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:28:37 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446STfB031053 for ; Thu, 3 May 2007 23:28:30 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l446SGLQ021546 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 May 2007 23:28:18 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l446SFXl030589; Thu, 3 May 2007 23:28:16 -0700 Date: Thu, 3 May 2007 23:28:15 -0700 From: Andrew Morton To: David Chinner Cc: "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-Id: <20070503232815.2f62a75e.akpm@linux-foundation.org> In-Reply-To: <20070504060731.GJ32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11273 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > +{ > > > + struct file *file; > > > + struct inode *inode; > > > + long ret = -EINVAL; > > > + > > > + if (len == 0 || offset < 0) > > > + goto out; > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > ahead of `offset'". How peculiar. 
> > I just checked the man page for posix_fallocate() and it says: > > EINVAL offset or len was less than zero. > > We should probably follow this lead. Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is referring to, so this would be a manpage test ;)). > > > + > > > + ret = -ENODEV; > > > + if (!S_ISREG(inode->i_mode)) > > > + goto out_fput; > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > seems a bit silly of them. > > Hmmmm - I thought that the intention of sys_fallocate() was to > be generic enough to eventually allow preallocation on directories. > If that is the case, then this check will prevent that.... The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? 
From owner-xfs@oss.sgi.com Thu May 3 23:57:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 03 May 2007 23:57:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l446uwfB004955 for ; Thu, 3 May 2007 23:57:00 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l446ujPL002724; Fri, 4 May 2007 02:56:46 -0400 Received: from devserv.devel.redhat.com (devserv.devel.redhat.com [172.16.58.1]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l446uef7021912; Fri, 4 May 2007 02:56:40 -0400 Received: from devserv.devel.redhat.com (localhost.localdomain [127.0.0.1]) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l446ueGH007487; Fri, 4 May 2007 02:56:40 -0400 Received: (from jakub@localhost) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11/Submit) id l446uQr9007476; Fri, 4 May 2007 02:56:26 -0400 Date: Fri, 4 May 2007 02:56:26 -0400 From: Jakub Jelinek To: Andrew Morton Cc: Ulrich Drepper , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504065626.GW355@devserv.devel.redhat.com> Reply-To: Jakub Jelinek References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11274 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jakub@redhat.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. That describes the current glibc implementation. > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. 
> > However it would be neat if someone could test $OTHER_OS and, perhaps more > importantly, the present glibc emulation (which I assume your manpage is > referring to, so this would be a manpage test ;)). int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; struct statfs f; /* `off_t' is a signed type. Therefore we can determine whether OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; if (offset + len < 0) return EFBIG; /* First thing we have to make sure is that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) return ESPIPE; if (! S_ISREG (st.st_mode)) return ENODEV; if (len == 0) { if (st.st_size < offset) { int ret = __ftruncate (fd, offset); if (ret != 0) ret = errno; return ret; } return 0; } ... is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is "shall fail". But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and "required storage for regular file data starting at offset and continuing for len bytes" doesn't make sense for negative size. And given the general "Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring." I believe returning EINVAL for len < 0 is not a POSIX violation. That doesn't mean the standard shouldn't be clarified, whether by saying EINVAL must be returned for non-positive len or saying that using negative len has undefined or implementation defined behavior. > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. 
And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. No, see above. Jakub From owner-xfs@oss.sgi.com Fri May 4 00:28:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:28:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447SHfB017704 for ; Fri, 4 May 2007 00:28:19 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08305; Fri, 4 May 2007 17:27:56 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447RnAf84055039; Fri, 4 May 2007 17:27:50 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Rg2j84042753; Fri, 4 May 2007 17:27:42 +1000 (AEST) Date: Fri, 4 May 2007 17:27:42 +1000 From: David Chinner To: Andrew Morton Cc: David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070504072742.GK32602149@melbourne.sgi.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11275 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > On Fri, 4 May 2007 16:07:31 +1000 David Chinner wrote: > > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > > > This patch implements the fallocate() system call and adds support for > > > > i386, x86_64 and powerpc. > > > > > > > > ... > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + > > > > + if (len == 0 || offset < 0) > > > > + goto out; > > > > > > The posix spec implies that negative `len' is permitted - presumably "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. 
> > > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate fallocate syscalls.... > > > > + ret = -ENODEV; > > > > + if (!S_ISREG(inode->i_mode)) > > > > + goto out_fput; > > > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > > seems a bit silly of them. > > > > Hmmmm - I thought that the intention of sys_fallocate() was to > > be generic enough to eventually allow preallocation on directories. > > If that is the case, then this check will prevent that.... > > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4.... Cheers, Dave. 
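As an aside for readers of this thread, the argument checks being debated (EINVAL for a negative offset or a non-positive len, EFBIG when offset + len does not fit in off_t) can be sketched in C. This is only an illustration under stated assumptions: check_fallocate_range is a hypothetical helper name, not glibc or kernel code; int64_t stands in for off_t; and it follows the strict reading in which len == 0 is also EINVAL. It tests for overflow before adding, rather than checking the sign of the sum.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not glibc or kernel code): validate a
 * preallocation request for the byte range [offset, offset + len).
 * Returns 0 if the range is acceptable, otherwise an errno value.
 */
int check_fallocate_range(int64_t offset, int64_t len)
{
    /* Assumption: negative offset, negative len and len == 0 are all EINVAL. */
    if (offset < 0 || len <= 0)
        return EINVAL;

    /* From here on both values are known to be in range for the subtraction. */
    assert(offset >= 0 && len > 0);

    /* offset + len would overflow exactly when len > INT64_MAX - offset. */
    if (len > INT64_MAX - offset)
        return EFBIG;

    return 0;
}
```

With checks like these in front of it, the lower layer only ever sees a forward range starting at offset, which matches the "allocate from offset to offset+len" semantics described above.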
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 4 00:29:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:29:41 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l447TbfB018149 for ; Fri, 4 May 2007 00:29:38 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id CD3ADFA8658 for ; Fri, 4 May 2007 08:06:38 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CD49517BA4; Fri, 4 May 2007 09:06:13 +0200 (CEST) Date: Fri, 4 May 2007 09:06:13 +0200 From: Emmanuel Florac To: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504090613.7c0f97d3@galadriel.home> In-Reply-To: <20070504005922.GC32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l447TcfB018166 X-archive-position: 11276 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 4 May 2007 10:59:22 +1000 you wrote: > Were there any I/O errors reported before the shutdown? > Nope. To be clear: the problem can be reproduced on several different systems, different motherboards, different drives, different RAID controllers... This isn't a hardware problem. > > On similar hardware with 2 3Ware-9550 16x750GB striped together, > > but running 2.6.17.13, I had a similar fs crash last week. 
> > Unfortunately I don't have the logs at hand, but we were able to > > reproduce the crash several times at home : > > Hmm - 750GB drives are brand new. I wouldn't rule out media issues > at this point... The problem is quite easily reproduced with 500GB drives too. > > Filesystem "md0": XFS internal error xfs_btree_check_sblock at line > > 336 of file fs/xfs/xfs_btree.c. Caller 0xc01fb282 > > Memory corruption? Tried with different RAM modules, and the problem occurs on ECC RAM too. > > > > Out of curiosity, I've tried to use reiserfs (just to see how it > > compares regarding this). Reiserfs crashed before even writing > > 100MB! > > That indicates there's something wrong other than the filesystem. > I'd suggest making sure your raid arrays, memory, etc are all > functioning correctly first. They are. I've tested 5 different machines so far (Supermicro or Tyan mobos, Kingston RAM, Intel or AMD CPUs, Hitachi and Seagate drives...) > What platform are you running on? Are you running ia32 with 4k stacks? Yes. I'll test 2.6.18.8 thoroughly this week, and 2.6.20.11 too. Then jfs, just to be sure. 
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 00:34:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 00:34:03 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l447XvfB019746 for ; Fri, 4 May 2007 00:33:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA08568; Fri, 4 May 2007 17:33:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l447XkAf83983180; Fri, 4 May 2007 17:33:46 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l447Xi8582990264; Fri, 4 May 2007 17:33:44 +1000 (AEST) Date: Fri, 4 May 2007 17:33:44 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504073344.GL32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504090613.7c0f97d3@galadriel.home> User-Agent: Mutt/1.4.2.1i X-archive-position: 11277 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 09:06:13AM +0200, Emmanuel Florac wrote: > Le Fri, 4 May 2007 10:59:22 +1000 vous écriviez: > > What platform are you running on? Are you running ia32 with 4k stacks? > > Yes. I'll try this week 2.6.18.8 thoroughly and 2.6.20.11 too. 
Then > jfs, just to be sure. Well, there's your problem. Stack overflows. IMO, if you use a filesystem, you shouldn't use 4k stacks. ;) If you remake your kernel with 8k stacks then your problems will most likely go away. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 4 06:25:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 06:25:53 -0700 (PDT) Received: from smtp-ft5.fr.colt.net (smtp-ft5.fr.colt.net [213.41.78.197]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44DPlfB025077 for ; Fri, 4 May 2007 06:25:49 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft5.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44DPhpu000578; Fri, 4 May 2007 15:25:43 +0200 Date: Fri, 4 May 2007 15:25:46 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504152546.614374ac@harpe.intellique.com> In-Reply-To: <20070504073344.GL32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44DPofB025089 X-archive-position: 11278 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 4 May 2007 17:33:44 +1000 David Chinner wrote: > Well, there's your problem. Stack overflows. IMO, if you use a > filesystem, you shouldn't use 4k stacks. ;) > > If you remake your kernel with 8k stacks then your problems will > most likely go away. 
Well, I've double-checked asm-i386/module.h, and it actually looks like 4K stacks are NOT the default, so I must be using 8K, right? I've run the same test on the same machine but WITHOUT software raid-0 (so write barriers are in use), and all went well, more than 3TB written without a glitch. I still think there's something related to the write barriers here. I'll try with another RAID controller, Adaptec for instance, to make sure the 3ware driver isn't involved. I'll also try again with an amd64 kernel. I'd really like to sort this out... -- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 07:55:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 07:55:38 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44EtXfB019895 for ; Fri, 4 May 2007 07:55:35 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C945B18022E01; Fri, 4 May 2007 09:55:30 -0500 (CDT) Message-ID: <463B4962.70904@sandeen.net> Date: Fri, 04 May 2007 09:55:30 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11279 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: 
xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 17:33:44 +1000 > David Chinner écrivait: > >> Well, there's your problem. Stack overflows. IMO, if you use a >> filesystem, you shouldn't use 4k stacks. ;) >> >> If you remake you kernel with 8k stacks then your problems will >> most likely go away. > > Well, I've double-checked the asm-i386/module.h, and it actually looks > like 4K stacks is NOT the default, so I must be using 8K, isn't it? Depends on how you config'd it, just look at the .config you built with, and search for CONFIG_4KSTACKS On Fedora at least (and I can't remember - I don't think this is a fedora-ism...) you can do "modinfo" on some module, and see: vermagic: 2.6.21 SMP mod_unload 686 4KSTACKS -Eric From owner-xfs@oss.sgi.com Fri May 4 08:30:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:30:51 -0700 (PDT) Received: from smtp-ft1.fr.colt.net (smtp-ft1.fr.colt.net [213.41.78.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FUlfB030646 for ; Fri, 4 May 2007 08:30:49 -0700 Received: from harpe.intellique.com (host.93.124.68.195.rev.coltfrance.com [195.68.124.93]) by smtp-ft1.fr.colt.net (8.13.8/8.13.8/Debian-3) with ESMTP id l44FUdlH008756; Fri, 4 May 2007 17:30:41 +0200 Date: Fri, 4 May 2007 17:30:49 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504173049.14606033@harpe.intellique.com> In-Reply-To: <463B4962.70904@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Antivirus: 
checked in 0.023sec at smtp-ft1.fr.colt.net ([213.41.78.210]) by smf-clamd Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l44FUnfB030656 X-archive-position: 11280 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Fri, 04 May 2007 09:55:30 -0500 Eric Sandeen wrote: > Emmanuel Florac wrote: > > On Fri, 4 May 2007 17:33:44 +1000 > > David Chinner wrote: > > > >> Well, there's your problem. Stack overflows. IMO, if you use a > >> filesystem, you shouldn't use 4k stacks. ;) > >> > >> If you remake your kernel with 8k stacks then your problems will > >> most likely go away. > > > > Well, I've double-checked the asm-i386/module.h, and it actually > > looks like 4K stacks is NOT the default, so I must be using 8K, > > isn't it? > > Depends on how you config'd it, just look at the .config you built > with, and search for CONFIG_4KSTACKS config-2.6.17.13: # CONFIG_4KSTACKS is not set So the problem lies elsewhere... 
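For anyone wanting to script the same check, here is a minimal C sketch that scans the text of a kernel .config for an enabled CONFIG_4KSTACKS. The function name config_has_4kstacks is made up for this illustration; the only assumption taken from the thread is that an enabled option appears as a line reading CONFIG_4KSTACKS=y, while a disabled one appears as the comment line shown above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the given .config text contains a
 * line that is exactly "CONFIG_4KSTACKS=y", and 0 otherwise.
 * A line such as "# CONFIG_4KSTACKS is not set" does not count.
 */
int config_has_4kstacks(const char *text)
{
    static const char key[] = "CONFIG_4KSTACKS=y";
    const size_t keylen = sizeof(key) - 1;
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, and only the whole line. */
        if (strncmp(line, key, keylen) == 0 &&
            (line[keylen] == '\n' || line[keylen] == '\0'))
            return 1;

        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return 0;
}
```

A result of 0 only says the option is not enabled in that file; it says nothing about whether the running kernel was actually built from it.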
-- ---------------------------------------- Emmanuel Florac | Intellique ---------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 08:58:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 08:58:29 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44FwOfB005594 for ; Fri, 4 May 2007 08:58:25 -0700 Received: from localhost (dslb-084-057-112-255.pools.arcor-ip.net [84.57.112.255]) by mail.lichtvoll.de (Postfix) with ESMTP id AF67E5AD3F for ; Fri, 4 May 2007 17:58:22 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Fri, 4 May 2007 17:58:21 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> (sfid-20070504_161005_263297_AD8C4AAD) In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705041758.21320.Martin@lichtvoll.de> X-archive-position: 11281 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Friday 04 May 2007, Emmanuel Florac wrote: > I've run the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well, more than 3TB > written without a glitch. I still think there's something related to > the write barriers here. I'll try with another RAID controller, Adaptec > for instance, to make sure the 3ware driver isn't involved. I'll also > try again with an amd64 kernel. Hello Emmanuel! 
When you can't use write barriers, as XFS tells you in the logs, you had better switch off write caching for the hard disks / RAID controller, unless you happen to have NVRAM or a safe power supply. But then, using the write cache without barriers should not make any difference unless you actually have a crash or power failure during a write operation. Did you test with ext3 as well? You wrote that it crashes with ReiserFS (version 3) even faster. When it crashes with several filesystems, it's unlikely to be a filesystem issue. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Fri May 4 15:12:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 15:12:33 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l44MCTfB015022 for ; Fri, 4 May 2007 15:12:30 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 48027FA5D2B for ; Fri, 4 May 2007 22:44:22 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 89A381838B; Fri, 4 May 2007 23:43:56 +0200 (CEST) Date: Fri, 4 May 2007 23:43:57 +0200 From: Emmanuel Florac To: Martin Steigerwald Cc: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504234357.24d22883@galadriel.home> In-Reply-To: <200705041758.21320.Martin@lichtvoll.de> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id 
l44MCVfB015030 X-archive-position: 11282 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 4 May 2007 17:58:21 +0200 vous écriviez: > Did you test with ext3 as well? You wrote it crashes with ReiserFS > (version 3) even faster. When it crashes with several filesystems its > unlikely to be a filesystem issue. Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's useless to me. I plan to test jfs, however. I think it's more a dm/md issue, but I'm not sure... -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Fri May 4 16:20:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 16:20:38 -0700 (PDT) Received: from smtp108.sbc.mail.mud.yahoo.com (smtp108.sbc.mail.mud.yahoo.com [68.142.198.207]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l44NKVfB031820 for ; Fri, 4 May 2007 16:20:32 -0700 Received: (qmail 71668 invoked from network); 4 May 2007 23:20:30 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp108.sbc.mail.mud.yahoo.com with SMTP; 4 May 2007 23:20:29 -0000 X-YMail-OSG: OPP1hd4VM1lfaVvz3tObISaM4S9Wsbmdmu7ru90QC85M5NGiDwRjeqFhPzMWSgDFVI.VQ1CzkQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 7B9EE1827261; Fri, 4 May 2007 16:20:28 -0700 (PDT) Date: Fri, 4 May 2007 16:20:28 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> 
<463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070504173049.14606033@harpe.intellique.com> X-archive-position: 11283 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 05:30:49PM +0200, Emmanuel Florac wrote: > # CONFIG_4KSTACKS is not set > > So the problem lies elsewhere... CONFIG_4KSTACKS is badly named. It means you have 4K process + 4K interrupt stacks. Without this set you have just a single 8K stack for processes and interrupts. One argument for 4K+4K stacks is that 8K+0K isn't really safer in many cases --- it just appears that way because the problems are harder to hit. Almost three years ago I posted patches to split the CONFIG_4KSTACKS option into two options. I ported that to 2.6.21 just now (very quickly, so I might have goofed fixing up the rejects). You could, if you have time, try this: enable CONFIG_I386_IRQSTACKS but don't enable CONFIG_I386_4KSTACKS, and see if that helps... diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug index 458bc16..f32fbec 100644 --- a/arch/i386/Kconfig.debug +++ b/arch/i386/Kconfig.debug @@ -56,15 +56,22 @@ config DEBUG_RODATA portion of the kernel code won't be covered by a 2MB TLB anymore. If in doubt, say "N". -config 4KSTACKS +config I386_4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. + on the VM subsystem for higher order allocations. 
+ +config I386_IRQSTACKS + bool "Allocate separate IRQ stacks" + depends on DEBUG_KERNEL + default y + help + If you say Y here the kernel will allocate and use separate + stacks for interrupts. config X86_FIND_SMP_CONFIG bool diff --git a/arch/i386/defconfig b/arch/i386/defconfig diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 8db8d51..f6224fd 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -47,7 +47,7 @@ void ack_bad_irq(unsigned int irq) #endif } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * per-CPU IRQ handling contexts (thread information and stack) */ @@ -58,7 +58,7 @@ union irq_ctx { static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly; static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * do_IRQ handles all normal device IRQ's (the special @@ -71,7 +71,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) /* high bit used in ret_from_ code */ int irq = ~regs->orig_eax; struct irq_desc *desc = irq_desc + irq; -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS union irq_ctx *curctx, *irqctx; u32 *isp; #endif @@ -99,7 +99,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS curctx = (union irq_ctx *) current_thread_info(); irqctx = hardirq_ctx[smp_processor_id()]; @@ -136,7 +136,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) : "memory", "cc" ); } else -#endif +#endif /* CONFIG_I386_IRQSTACKS */ desc->handle_irq(irq, desc); irq_exit(); @@ -144,7 +144,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) return 1; } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * These should really be __section__(".bss.page_aligned") as well, but @@ -234,7 +234,7 @@ asmlinkage void do_softirq(void) } EXPORT_SYMBOL(do_softirq); -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * Interrupt statistics: diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h index 11761cd..7db95e1 
100644 --- a/include/asm-i386/irq.h +++ b/include/asm-i386/irq.h @@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq) # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */ #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS extern void irq_ctx_init(int cpu); extern void irq_ctx_exit(int cpu); # define __ARCH_HAS_DO_SOFTIRQ -#else +#else /* !CONFIG_I386_IRQSTACKS */ # define irq_ctx_init(cpu) do { } while (0) # define irq_ctx_exit(cpu) do { } while (0) -#endif +#endif /* CONFIG_I386_IRQSTACKS */ #ifdef CONFIG_IRQBALANCE extern int irqbalance_disable(char *str); diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h index 02f8f54..7d5d2df 100644 --- a/include/asm-i386/module.h +++ b/include/asm-i386/module.h @@ -62,11 +62,11 @@ struct mod_arch_specific #error unknown processor family #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define MODULE_STACKSIZE "4KSTACKS " -#else +#else /* not using CONFIG_I386_4KSTACKS */ #define MODULE_STACKSIZE "" -#endif +#endif /* CONFIG_I386_4KSTACKS */ #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h index 4b187bb..f5268e0 100644 --- a/include/asm-i386/thread_info.h +++ b/include/asm-i386/thread_info.h @@ -53,7 +53,7 @@ struct thread_info { #endif #define PREEMPT_ACTIVE 0x10000000 -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define THREAD_SIZE (4096) #else #define THREAD_SIZE (8192) From owner-xfs@oss.sgi.com Fri May 4 22:21:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 04 May 2007 22:21:40 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l455LZfB010386 for ; Fri, 4 May 2007 22:21:36 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with 
ESMTP id 659901802EE36; Fri, 4 May 2007 23:49:31 -0500 (CDT) Message-ID: <463C0CD8.4090402@sandeen.net> Date: Fri, 04 May 2007 23:49:28 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> In-Reply-To: <20070504234357.24d22883@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11284 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 17:58:21 +0200 vous écriviez: > >> Did you test with ext3 as well? You wrote it crashes with ReiserFS >> (version 3) even faster. When it crashes with several filesystems its >> unlikely to be a filesystem issue. > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so that's > useless to me. I plan to test jfs, however. > I think it's more a dm/md issue, but I'm not sure... > Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from rhel5/centos5) can do up to 16T ext3 filesystems, so you should be able to test that if you like. 
-Eric From owner-xfs@oss.sgi.com Sat May 5 08:20:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:20:34 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FKUfB010573 for ; Sat, 5 May 2007 08:20:31 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 26851F15FD1 for ; Sat, 5 May 2007 17:20:30 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 19169182B0; Sat, 5 May 2007 17:20:27 +0200 (CEST) Date: Sat, 5 May 2007 17:19:31 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070504232028.GA19744@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> 
<20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FKVfB010597 X-archive-position: 11287 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 4 May 2007 16:20:28 -0700 vous écriviez: > You could if you have time try this and enable CONFIG_I386_IRQSTACKS > but don't enable CONFIG_I386_4KSTACKS and see if that helps... That sounds very interesting, I'll give it a try monday. 
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 08:18:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 08:18:29 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45FIOfB006549 for ; Sat, 5 May 2007 08:18:24 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 2E267F15D45 for ; Sat, 5 May 2007 17:18:23 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A524E17BFE; Sat, 5 May 2007 17:18:20 +0200 (CEST) Date: Sat, 5 May 2007 17:18:20 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505171820.6e92d437@galadriel.home> In-Reply-To: <463C0CD8.4090402@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45FIPfB006557 X-archive-position: 11286 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Fri, 04 May 2007 23:49:28 -0500 vous écriviez: > > Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from > rhel5/centos5) can do up to 16T ext3 filesystems, so you should be > able to 
test that if you like. Thanks, I'll try that too. Though it won't cover all my needs (I plan to set up 50 and 150TB systems really soon). -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 09:33:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:33:55 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GXofB000741 for ; Sat, 5 May 2007 09:33:51 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 8860BB02F5B2; Sat, 5 May 2007 12:33:49 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 849C85000166; Sat, 5 May 2007 12:33:49 -0400 (EDT) Date: Sat, 5 May 2007 12:33:49 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: linux-raid@vger.kernel.org cc: xfs@oss.sgi.com Subject: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11288 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Question, I currently have a 965 chipset-based motherboard, use 4 port onboard and several PCI-e x1 controller cards for a raid 5 of 10 raptor drives. 
I get pretty decent speeds:

user@host$ time dd if=/dev/zero of=100gb bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 247.134 seconds, 434 MB/s

real 4m7.164s
user 0m0.223s
sys 3m3.505s

user@host$ time dd if=100gb of=/dev/null bs=1M count=102400
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 172.588 seconds, 622 MB/s

real 2m52.631s
user 0m0.212s
sys 1m50.905s
user@host$

Also, when I run simultaneous dd's from all of the drives, I see 850-860MB/s. Is there some kind of limitation in software raid that explains why I am not getting better than 500MB/s for sequential write speed? With 7 disks, I got about the same speed; adding 3 more for a total of 10 did not seem to help with writes. However, read improved to 622MB/s from about 420-430MB/s.

However, if I want to upgrade to more than 12 disks, I am out of PCI-e slots, so I was wondering, does anyone on this list run a 16 port Areca or 3ware card and use it for JBOD? What kind of performance do you see when using mdadm with such a card? Or if anyone uses mdadm with less than a 16 port card, I'd like to hear what kind of experiences you have seen with that type of configuration.

Justin.
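As a sanity check on the dd figures above, the reported rates can be recomputed from the byte count and elapsed time. This is only a sketch: the helper name throughput_mb_s is hypothetical, and it assumes dd reports decimal megabytes (10^6 bytes), which is consistent with the "107 GB" it prints for 107374182400 bytes.

```python
def throughput_mb_s(nbytes, seconds):
    """Return throughput in decimal megabytes per second (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


# Figures copied from the dd output in the message above.
TOTAL_BYTES = 102400 * 1024 * 1024  # bs=1M count=102400

write_rate = throughput_mb_s(TOTAL_BYTES, 247.134)
read_rate = throughput_mb_s(TOTAL_BYTES, 172.588)

print(f"sequential write: {write_rate:.1f} MB/s")
print(f"sequential read:  {read_rate:.1f} MB/s")
```

Both values round to the rates dd printed, so the question about the gap between the 850-860MB/s aggregate raw-disk rate and the array write rate rests on consistent measurements.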
From owner-xfs@oss.sgi.com Sat May 5 09:48:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:48:03 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GlxfB005748 for ; Sat, 5 May 2007 09:48:00 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 93A4518022E01; Sat, 5 May 2007 11:47:58 -0500 (CDT) Message-ID: <463CB53E.8000202@sandeen.net> Date: Sat, 05 May 2007 11:47:58 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> In-Reply-To: <20070505171820.6e92d437@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11289 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 04 May 2007 23:49:28 -0500 vous écriviez: > >> Most recent kernels (2.6.19 or so IIRC) & cvs e2fsprogs (or that from >> rhel5/centos5) can do up to 16T ext3 filesystems, so you should be >> able to test that if you like. > > Thanks, I'll try that too. Though it won't cover all my needs (I plan > to set up 50 and 150TB systems really soon). > Sure, I understand - it may be helpful in figuring out what the problem is, though. I'll be curious to see how it goes... 
Oh, btw, you'll need the -F (force) flag for mkfs.ext3 -Eric From owner-xfs@oss.sgi.com Sat May 5 09:50:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 09:50:17 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45GoDfB006809 for ; Sat, 5 May 2007 09:50:15 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C4EDC18022E01; Sat, 5 May 2007 11:50:12 -0500 (CDT) Message-ID: <463CB5C4.7040803@sandeen.net> Date: Sat, 05 May 2007 11:50:12 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Emmanuel Florac CC: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> In-Reply-To: <20070505171931.6fe9b6f5@galadriel.home> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11290 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Emmanuel Florac wrote: > Le Fri, 4 May 2007 16:20:28 -0700 vous écriviez: > >> You could if you have time try this and enable CONFIG_I386_IRQSTACKS >> but don't enable CONFIG_I386_4KSTACKS and see if that helps... > > That sounds very interesting, I'll give it a try monday. 
> There are also stack debugging config options; one that will warn if you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that will print max stack depth in sysrq-t output (CONFIG_DEBUG_STACK_USAGE). -Eric From owner-xfs@oss.sgi.com Sat May 5 13:35:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:35:35 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KZSfB010426 for ; Sat, 5 May 2007 13:35:29 -0700 Received: (qmail 92356 invoked from network); 5 May 2007 20:35:28 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:35:27 -0000 X-YMail-OSG: NfkFI3wVM1l2KAzcA7Gpvf5kMfsvZM8GGJA_DL2tvbfn03E9cLQ8rwaGzn2fNG.7uUhguDxBvQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 887111827261; Sat, 5 May 2007 13:35:25 -0700 (PDT) Date: Sat, 5 May 2007 13:35:25 -0700 From: Chris Wedgwood To: Eric Sandeen Cc: Emmanuel Florac , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463CB5C4.7040803@sandeen.net> X-archive-position: 11291 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 11:50:12AM 
-0500, Eric Sandeen wrote: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). I was in such a hurry I don't think I tweaked that sanely. I'll go over the patch checking that and test it later today. Is there some preferred kernel version people would like? From owner-xfs@oss.sgi.com Sat May 5 13:55:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:55:02 -0700 (PDT) Received: from smtp104.sbc.mail.mud.yahoo.com (smtp104.sbc.mail.mud.yahoo.com [68.142.198.203]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45KsxfB015700 for ; Sat, 5 May 2007 13:54:59 -0700 Received: (qmail 62153 invoked from network); 5 May 2007 20:54:58 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp104.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:54:58 -0000 X-YMail-OSG: m97Te.QVM1mbh8aFrqo95Qk4qrjAE4R81UJBHQJ1y14F1mB3VHCW427ig.b06hW2BI2KGF6gBQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id C851A1827261; Sat, 5 May 2007 13:54:56 -0700 (PDT) Date: Sat, 5 May 2007 13:54:56 -0700 From: Chris Wedgwood To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505205456.GA17112@tuatara.stupidest.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-archive-position: 11292 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs

On Sat, May 05, 2007 at 12:33:49PM -0400, Justin Piszcz wrote:

> Also, when I run simultaneous dd's from all of the drives, I see
> 850-860MB/s, I am curious if there is some kind of limitation with
> software raid as to why I am not getting better than 500MB/s for
> sequential write speed?

What does "vmstat 1" output look like in both cases? My guess is that for large writes it's NOT CPU bound but it can't hurt to check.

> With 7 disks, I got about the same speed, adding 3 more for a total
> of 10 did not seem to help in regards to write. However, read
> improved to 622MB/s from about 420-430MB/s.

RAID is quirky.

It's worth fiddling with the stripe size as that can make a big difference in terms of performance --- it's far from clear why some values work well on some setups while other setups want very different values.

It would be good to know if anyone has ever studied stripe size and also controller interleave/layout issues to get a good understanding of why certain values are good and others are very poor and why it varies so much from one setup to the other.

Also, 'dd performance' varies between the start of a disk and the end. Typically you get better performance at the start of the disk so dd might not be a very good benchmark here.

> However, if I want to upgrade to more than 12 disks, I am out of
> PCI-e slots, so I was wondering, does anyone on this list run a 16
> port Areca or 3ware card and use it for JBOD? What kind of
> performance do you see when using mdadm with such a card?
Or if > anyone uses mdadm with less than a 16 port card, I'd like to hear > what kind of experiences you have seen with that type of > configuration. I've used some 2, 4 and 8 port 3ware cards. As JBODS they worked fine, as RAID cards I had no end of problems. I'm happy to test larger cards if someone wants to donate them :-) From owner-xfs@oss.sgi.com Sat May 5 13:56:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:56:58 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KuqfB016458 for ; Sat, 5 May 2007 13:56:53 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 5B167F2888F for ; Sat, 5 May 2007 22:56:48 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id B30BE15507; Sat, 5 May 2007 22:56:46 +0200 (CEST) Date: Sat, 5 May 2007 22:56:46 +0200 From: Emmanuel Florac To: Eric Sandeen Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225646.1e16b0c4@galadriel.home> In-Reply-To: <463CB53E.8000202@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <463CB53E.8000202@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KusfB016478 X-archive-position: 11293 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:47:58 -0500 vous écriviez: > Sure, I understand - it may be helpful in figuring out what the > problem is, though. I'll be curious to see how it goes... Sure, stay tuned! > Oh, btw, you'll need the -F (force) flag for mkfs.ext3 Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 13:57:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 13:57:31 -0700 (PDT) Received: from postfix2-g20.free.fr (postfix2-g20.free.fr [212.27.60.43]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45KvQfB016809 for ; Sat, 5 May 2007 13:57:28 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix2-g20.free.fr (Postfix) with ESMTP id 26B7EFBB29D for ; Sat, 5 May 2007 21:57:49 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 134EF17BA4; Sat, 5 May 2007 22:57:23 +0200 (CEST) Date: Sat, 5 May 2007 22:57:23 +0200 From: Emmanuel Florac To: xfs@oss.sgi.com Cc: Chris Wedgwood Subject: Re: XFS crash on linux raid Message-ID: <20070505225723.012cc38b@galadriel.home> In-Reply-To: <463CB5C4.7040803@sandeen.net> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit 
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45KvSfB016842 X-archive-position: 11294 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 05 May 2007 11:50:12 -0500 vous écriviez: > There are also stack debugging config options; one that will warn if > you are about to overflow (CONFIG_DEBUG_STACKOVERFLOW) and one that > will print max stack depth in sysrq-t output > (CONFIG_DEBUG_STACK_USAGE). Fine, I'll try that. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:00:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:00:25 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45L0LfB018275 for ; Sat, 5 May 2007 14:00:22 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 8F802F10C62 for ; Sat, 5 May 2007 22:58:21 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 9CFA618302; Sat, 5 May 2007 22:58:19 +0200 (CEST) Date: Sat, 5 May 2007 22:58:19 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505225819.0dd3c0fa@galadriel.home> In-Reply-To: <20070505203525.GA16477@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> 
<20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45L0MfB018282 X-archive-position: 11295 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 13:35:25 -0700 vous écriviez: > Is there some preferred kernel version people would like? > Well I prefer staying away from the very latest bleeding edge, so I stick to 2.6.20.11 for now. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:18:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:18:50 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LIkfB023937 for ; Sat, 5 May 2007 14:18:47 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 118A517BD2; Sat, 5 May 2007 23:18:45 +0200 (CEST) Date: Sat, 5 May 2007 23:18:45 +0200 From: Emmanuel Florac To: Justin Piszcz Cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? 
Message-ID: <20070505231845.7b1cbdc5@galadriel.home> In-Reply-To: References: Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l45LImfB023945 X-archive-position: 11296 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs

On Sat, 5 May 2007 12:33:49 -0400 (EDT), you wrote:

> However, if I want to upgrade to more than 12 disks, I am out of
> PCI-e slots, so I was wondering, does anyone on this list run a 16
> port Areca or 3ware card and use it for JBOD?

I don't use this setup in production, but I tried it with 8-port 3Ware cards. I didn't try the latest 9650 though.

> What kind of
> performance do you see when using mdadm with such a card?

Test system: 3GHz Supermicro P4D, 1 GB RAM, 3Ware 9550SX with 8x250GB 7200 RPM Seagate drives (8MB cache), RAID 0.

Tested XFS and reiserfs, with 64K and 256K stripes, under Linux 2.6.15.1, using bonnie++ in "fast mode" (-f option). Use bon_csv2html to translate the results, or see the bonnie++ documentation. Roughly: 2G is the file size tested, then the numbers on each line are: write speed (KB/s), CPU usage (%), rewrite speed (overwrite), CPU usage, read speed, CPU usage. Then follow the random seek rate and the sequential and random file create, read and delete rates with their CPU usage. "+++++" means "no significant value".

# XFS, stripe 256k
storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,10699,51,11502,59,+++++,+++,12158,61
storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,12700,58,13008,67,+++++,+++,9890,51
storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,10991,52,12947,67,+++++,+++,10580,52
storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,11085,53,14377,74,+++++,+++,10852,54
storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,10983,52,12161,63,+++++,+++,11752,61
storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,10802,52,13759,72,+++++,+++,9800,47
storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,10723,52,11880,61,+++++,+++,6659,33
storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,12202,57,10367,52,+++++,+++,9175,45
storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,11047,54,12156,62,+++++,+++,12372,60
storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,11090,52,11135,57,+++++,+++,11309,55
storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,11889,57,12027,61,+++++,+++,10960,54
storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,10596,55,12731,66,+++++,+++,10766,54
storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,9270,43,12455,63,+++++,+++,8878,44
storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,10104,48,13512,70,+++++,+++,9261,45
storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,9990,47,10098,52,+++++,+++,7544,38
storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,9189,43,11159,56,+++++,+++,11696,58

# XFS, stripe 64k
storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6639,31,7555,39,+++++,+++,6639,33
storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6790,34,8881,45,+++++,+++,6305,31
storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6255,29,8173,42,+++++,+++,6194,31
storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6443,30,9264,47,+++++,+++,6339,33
storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5387,26,6389,34,+++++,+++,6545,31
storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6492,30,8270,41,+++++,+++,6813,35
storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7777,36,8150,41,+++++,+++,7717,39
storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7572,35,9115,46,+++++,+++,7211,36

# reiser, stripe 256k
storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,+++,21887,99,26726,99,+++++,+++,20633,98
storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,+++,21926,100,26609,100,+++++,+++,20895,99
storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,+++,21757,100,26651,99,+++++,+++,20863,100
storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,+++,21989,99,26251,99,+++++,+++,20924,99
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++,21849,100,26757,99,+++++,+++,20845,99
storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++,22119,99,26516,100,+++++,+++,20934,100
storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++,21742,99,26500,99,+++++,+++,20922,100
storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++,21944,100,26777,100,+++++,+++,21042,99

# reiser, stripe 64k
storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++,22015,99,27021,100,+++++,+++,21028,99
storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++,21973,99,26810,99,+++++,+++,21008,100
storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,+++,22158,100,26707,100,+++++,+++,21106,100
storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,+++,22021,100,26778,100,+++++,+++,20629,98
storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,+++,21984,99,26786,100,+++++,+++,20994,99
storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++,21889,99,26998,99,+++++,+++,20952,100
storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++,21974,99,26746,100,+++++,+++,20786,99
storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++,21996,99,26811,100,+++++,+++,21009,100

Here are the tests I did with a similar system, but with 500GB drives, XFS only, 64KB stripe (3ware default). I compared software RAID 5 with hardware RAID 5 (3Ware 9550).

# software raid 5
storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3
storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1
storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1
storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2

# hardware raid 5
storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11
storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12
storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11
storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17

As you can see, hardware RAID-5 doesn't write significantly faster, but read throughput and rewrite performance are much better, and the file create and delete rates are several times higher, in places an order of magnitude. That's why I use striped 3Ware hardware RAID-5 to build high capacity systems instead of software RAID 5.
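To compare the software and hardware RAID 5 runs above without reading the raw rows, the four runs of each can be averaged per column. The sketch below is illustrative only: the column positions in COLUMNS are an assumption based on the field order described in the message (write, rewrite, read, then seeks) and should be checked against the bonnie++ version in use. The names parse_row and mean_field are hypothetical helpers, not part of bonnie++.

```python
import csv

# Assumed 0-based column positions in a bonnie++ CSV row, following the
# field order described in the message. The per-character fields are empty
# because the runs used "fast mode" (-f).
COLUMNS = {
    "write_kbs": 4,
    "write_cpu": 5,
    "rewrite_kbs": 6,
    "rewrite_cpu": 7,
    "read_kbs": 10,
    "read_cpu": 11,
    "seeks_per_s": 12,
}

# Rows copied verbatim from the message above.
SOFTWARE_RAID5 = [
    "storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3",
    "storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1",
    "storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1",
    "storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2",
]
HARDWARE_RAID5 = [
    "storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11",
    "storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12",
    "storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11",
    "storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17",
]


def parse_row(line):
    """Split one CSV row and pick out the block I/O and seek columns."""
    fields = next(csv.reader([line]))
    row = {"host": fields[0], "file_size": fields[1]}
    for name, index in COLUMNS.items():
        row[name] = float(fields[index])
    return row


def mean_field(lines, name):
    """Average one named column over several benchmark runs."""
    values = [parse_row(line)[name] for line in lines]
    return sum(values) / len(values)


for name in COLUMNS:
    sw = mean_field(SOFTWARE_RAID5, name)
    hw = mean_field(HARDWARE_RAID5, name)
    print(f"{name:12s} software={sw:10.1f} hardware={hw:10.1f}")
```

Averaging over runs smooths out the run-to-run scatter visible in the rows; with only four runs per configuration, the means are indicative rather than statistically firm.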
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sat May 5 14:23:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:23:05 -0700 (PDT) Received: from smtp107.sbc.mail.mud.yahoo.com (smtp107.sbc.mail.mud.yahoo.com [68.142.198.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45LN2fB026375 for ; Sat, 5 May 2007 14:23:03 -0700 Received: (qmail 55252 invoked from network); 5 May 2007 20:56:18 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp107.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 20:56:18 -0000 X-YMail-OSG: B34Ic84VM1nv8UOP1HdKHiretfaGEYgoETcAHAVgODfGX.6akLlrLcUL7d6IS85Oo1moL3QNRw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 624B21827261; Sat, 5 May 2007 13:56:17 -0700 (PDT) Date: Sat, 5 May 2007 13:56:17 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070504234357.24d22883@galadriel.home> X-archive-position: 11297 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > that's useless to me. I plan to test jfs, however. Is jfs supported by anyone right now? 
From owner-xfs@oss.sgi.com Sat May 5 14:32:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 14:32:53 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l45LWmfB030926 for ; Sat, 5 May 2007 14:32:49 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 5F5C4B02F5B2; Sat, 5 May 2007 17:32:47 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 5887A5000177; Sat, 5 May 2007 17:32:47 -0400 (EDT) Date: Sat, 5 May 2007 17:32:47 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Emmanuel Florac cc: linux-raid@vger.kernel.org, xfs@oss.sgi.com Subject: Re: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards? In-Reply-To: <20070505231845.7b1cbdc5@galadriel.home> Message-ID: References: <20070505231845.7b1cbdc5@galadriel.home> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1478584756-1178400767=:18820" X-archive-position: 11298 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs

This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools.

---1463747160-1478584756-1178400767=:18820 Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE

On Sat, 5 May 2007, Emmanuel Florac wrote:

> On Sat, 5 May 2007 12:33:49 -0400 (EDT), you wrote:
>
>> However, if I want to upgrade to more than 12 disks, I am out of
>> PCI-e slots, so I was wondering, does anyone on this list run a 16
>> port Areca or 3ware card and use it for JBOD?
>
> I don't use this setup in production, but I tried it with 8 ports 3Ware
> cards.
> I didn't try the latest 9650 though.
>
>> What kind of
>> performance do you see when using mdadm with such a card?
>
> 3Ghz Supermicro P4D 1 GB RAM, 3Ware 9550SX with 8x250GB 8MB cache 7200
> RPM Seagate drives, raid 0
>
> Tested XFS and reiserfs, with 64 and 256K stripes.
>
> tested under Linux 2.6.15.1, with bonnie++ in "fast mode" (-f option).
> use bon_csv2html to translate, or see bonnie++ documentation, roughly :
> 2G is the file size tested, then numbers on the first line are : write
> speed (KB/s), CPU usage (%), rewrite speed (overwrite), cpu usage, read
> speed, cpu usage. Then follow sequential and random seeks, reads,
> writes and delete with their cpu usage. "+++++" means "no significant
> value".
>
> # XFS, stripe 256k
> storiq,2G,,,353088,69,76437,17,,,197376,16,410.8,0,16,11517,57,+++++,+++,10699,51,11502,59,+++++,+++,12158,61
> storiq,2G,,,349166,71,75397,17,,,196057,16,433.3,0,16,12744,64,+++++,+++,12700,58,13008,67,+++++,+++,9890,51
> storiq,2G,,,336683,68,72581,16,,,191254,18,419.9,0,16,12377,62,+++++,+++,10991,52,12947,67,+++++,+++,10580,52
> storiq,2G,,,335646,65,77938,17,,,195350,17,397.4,0,16,14578,74,+++++,+++,11085,53,14377,74,+++++,+++,10852,54
> storiq,2G,,,330022,67,73004,17,,,197846,18,412.3,0,16,12534,65,+++++,+++,10983,52,12161,63,+++++,+++,11752,61
> storiq,2G,,,279454,55,75256,17,,,196065,18,412.7,0,16,13022,67,+++++,+++,10802,52,13759,72,+++++,+++,9800,47
> storiq,2G,,,314606,61,74883,16,,,194131,16,401.2,0,16,11665,58,+++++,+++,10723,52,11880,61,+++++,+++,6659,33
> storiq,2G,,,264382,53,72011,15,,,196690,18,411.5,0,16,10194,52,+++++,+++,12202,57,10367,52,+++++,+++,9175,45
> storiq,2G,,,360252,72,75845,17,,,199721,18,432.7,0,16,12067,61,+++++,+++,11047,54,12156,62,+++++,+++,12372,60
> storiq,2G,,,280746,57,74541,17,,,193562,19,414.0,0,16,12418,61,+++++,+++,11090,52,11135,57,+++++,+++,11309,55
> storiq,2G,,,309464,61,79153,18,,,191533,17,419.5,0,16,12705,62,+++++,+++,11889,57,12027,61,+++++,+++,10960,54
> storiq,2G,,,342122,67,68113,15,,,195572,16,413.5,0,16,13667,69,+++++,+++,10596,55,12731,66,+++++,+++,10766,54
> storiq,2G,,,329945,63,72183,15,,,193082,18,421.8,0,16,12627,62,+++++,+++,9270,43,12455,63,+++++,+++,8878,44
> storiq,2G,,,309570,63,69628,16,,,192415,19,413.1,0,16,13568,69,+++++,+++,10104,48,13512,70,+++++,+++,9261,45
> storiq,2G,,,298528,58,70029,15,,,193531,17,399.5,0,16,13028,64,+++++,+++,9990,47,10098,52,+++++,+++,7544,38
> storiq,2G,,,260341,52,66979,15,,,197199,18,393.1,0,16,10633,53,+++++,+++,9189,43,11159,56,+++++,+++,11696,58
> # XFS, stripe 64k
> storiq,2G,,,351241,70,90868,22,,,305222,29,408.7,0,16,8593,43,+++++,+++,6639,31,7555,39,+++++,+++,6639,33
> storiq,2G,,,340145,67,83790,19,,,297148,28,401.4,0,16,9132,46,+++++,+++,6790,34,8881,45,+++++,+++,6305,31
> storiq,2G,,,325791,65,81314,19,,,282439,26,395.5,0,16,9095,44,+++++,+++,6255,29,8173,42,+++++,+++,6194,31
> storiq,2G,,,266009,53,83362,20,,,308438,26,407.7,0,16,8362,43,+++++,+++,6443,30,9264,47,+++++,+++,6339,33
> storiq,2G,,,322776,65,76466,17,,,288001,26,399.7,0,16,8038,41,+++++,+++,5387,26,6389,34,+++++,+++,6545,31
> storiq,2G,,,309007,60,77846,18,,,290613,29,392.8,0,16,7183,37,+++++,+++,6492,30,8270,41,+++++,+++,6813,35
> storiq,2G,,,287662,58,72920,17,,,287911,26,398.4,0,16,8893,44,+++++,+++,7777,36,8150,41,+++++,+++,7717,39
> storiq,2G,,,288149,56,75743,17,,,300949,29,386.2,0,16,9545,47,+++++,+++,7572,35,9115,46,+++++,+++,7211,36
> # reiser, stripe 256k
> storiq,2G,,,289179,98,102775,26,,,188307,22,444.0,0,16,27326,100,+++++,+++,21887,99,26726,99,+++++,+++,20633,98
> storiq,2G,,,275847,93,101970,25,,,190551,21,450.2,0,16,27397,100,+++++,+++,21926,100,26609,100,+++++,+++,20895,99
> storiq,2G,,,289414,99,105080,26,,,189022,22,423.9,0,16,27212,100,+++++,+++,21757,100,26651,99,+++++,+++,20863,100
> storiq,2G,,,292746,99,103681,25,,,186303,21,431.5,0,16,27375,100,+++++,+++,21989,99,26251,99,+++++,+++,20924,99
>
storiq,2G,,,290222,99,104135,26,,,189656,22,449.7,0,16,27453,99,+++++,+++= ,21849,100,26757,99,+++++,+++,20845,99 > storiq,2G,,,291716,99,103872,26,,,187410,23,437.0,0,16,27419,99,+++++,+++= ,22119,99,26516,100,+++++,+++,20934,100 > storiq,2G,,,285545,99,101637,25,,,189788,21,422.1,0,16,27224,99,+++++,+++= ,21742,99,26500,99,+++++,+++,20922,100 > storiq,2G,,,293042,98,100272,24,,,185631,22,453.8,0,16,27268,99,+++++,+++= ,21944,100,26777,100,+++++,+++,21042,99 > # reiser stripe 64k > storiq,2G,,,295569,99,112563,29,,,282178,32,434.5,0,16,27631,99,+++++,+++= ,22015,99,27021,100,+++++,+++,21028,99 > storiq,2G,,,287830,98,112449,29,,,271047,33,425.1,0,16,27447,99,+++++,+++= ,21973,99,26810,99,+++++,+++,21008,100 > storiq,2G,,,271668,95,114410,30,,,282419,33,438.7,0,16,27495,100,+++++,++= +,22158,100,26707,100,+++++,+++,21106,100 > storiq,2G,,,282535,99,118620,30,,,272089,33,425.0,0,16,27569,100,+++++,++= +,22021,100,26778,100,+++++,+++,20629,98 > storiq,2G,,,294392,98,119654,32,,,273269,32,429.7,0,16,27591,100,+++++,++= +,21984,99,26786,100,+++++,+++,20994,99 > storiq,2G,,,296652,99,118420,31,,,279586,33,425.5,0,16,15007,78,+++++,+++= ,21889,99,26998,99,+++++,+++,20952,100 > storiq,2G,,,290551,98,124374,32,,,273852,32,424.0,0,16,27534,99,+++++,+++= ,21974,99,26746,100,+++++,+++,20786,99 > storiq,2G,,,287033,99,100559,26,,,204845,24,390.9,0,16,27620,99,+++++,+++= ,21996,99,26811,100,+++++,+++,21009,100 > > Here are the tests I did with a similar system, but with 500GB drives, > XFS only, 64KB stripe (3ware default).I tested RAID 5 software RAID > compared to RAID-5 hardware (3Ware 9550). 
> > # software raid 5 > storiq-5U,2G,,,155913,22,23390,4,,,84327,9,531.5,0,16,1323,3,+++++,+++,634,1,657,2,+++++,+++,903,3 > storiq-5U,2G,,,168104,24,23964,4,,,81666,8,534.2,0,16,605,2,+++++,+++,608,2,770,2,+++++,+++,706,1 > storiq-5U,2G,,,149516,21,22612,4,,,82111,9,571.3,0,16,606,2,+++++,+++,590,2,729,2,+++++,+++,450,1 > storiq-5U,2G,,,141883,20,22966,4,,,78116,8,568.5,0,16,615,2,+++++,+++,553,2,684,2,+++++,+++,508,2 > # hardware raid 5 > storiq-1,2G,,,148500,29,43043,9,,,148808,14,442.3,0,16,5953,27,+++++,+++,4408,20,4994,24,+++++,+++,2399,11 > storiq-1,2G,,,191440,37,38092,8,,,155494,15,420.9,0,16,3074,15,+++++,+++,3356,17,4246,21,+++++,+++,2513,12 > storiq-1,2G,,,150460,29,40018,9,,,144936,14,386.9,0,16,4206,20,+++++,+++,2497,11,5182,26,+++++,+++,2440,11 > storiq-1,2G,,,163132,34,34525,8,,,132131,13,369.7,0,16,6796,33,+++++,+++,10002,47,5475,28,+++++,+++,3652,17 > > As you can see, hardware RAID-5 doesn't perform significantly faster > at writing, but read thruput and rewrite performance is way better, and > seeks are an order of magnitude faster. That's why I use striped 3Ware > hardware RAID-5 to build high capacity systems instead of software RAID > 5. > > -- > -------------------------------------------------- > Emmanuel Florac www.intellique.com > -------------------------------------------------- > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Wow, very impressive benchmarks, thank you very much for this.
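For anyone who wants to compare the quoted bonnie++ rows without bon_csv2html, here is a minimal sketch in C that pulls the three sequential throughput figures out of one CSV row. The field positions (write at index 4, rewrite at index 6, read at index 10, counting from 0) are an assumption inferred from the description in the quoted message and from the column layout of the rows themselves, not taken from the bonnie++ documentation. The names `bonnie_seq`, `csv_field`, `field_to_long` and `parse_bonnie_row` are hypothetical helpers invented for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential throughput figures from one bonnie++ CSV row, in KB/s. */
struct bonnie_seq {
    long write_kbs;
    long rewrite_kbs;
    long read_kbs;
};

/*
 * Copy the idx-th (0-based) comma-separated field of line into buf.
 * Empty fields are preserved, which strtok() would not do.
 * Returns 0 on success, -1 if the field is missing or does not fit.
 */
static int csv_field(const char *line, int idx, char *buf, size_t buflen)
{
    const char *start = line;
    const char *end;
    size_t len;

    while (idx-- > 0) {
        start = strchr(start, ',');
        if (start == NULL)
            return -1;
        start++;                /* step past the comma */
    }
    end = strchr(start, ',');
    len = end ? (size_t)(end - start) : strcspn(start, "\r\n");
    if (len >= buflen)
        return -1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return 0;
}

/*
 * Parse one field as a non-negative integer.
 * Returns -1 for a missing field, an empty field, or a
 * placeholder such as "+++++" (no significant value).
 */
static long field_to_long(const char *line, int idx)
{
    char buf[32];
    char *endp;
    long v;

    if (csv_field(line, idx, buf, sizeof(buf)) != 0 || buf[0] == '\0')
        return -1;
    v = strtol(buf, &endp, 10);
    if (*endp != '\0' || v < 0)
        return -1;
    return v;
}

/*
 * Assumed layout (0-based): 4 = block write, 6 = rewrite, 10 = block read.
 * Returns 0 if all three figures were found, -1 otherwise.
 */
int parse_bonnie_row(const char *line, struct bonnie_seq *out)
{
    out->write_kbs   = field_to_long(line, 4);
    out->rewrite_kbs = field_to_long(line, 6);
    out->read_kbs    = field_to_long(line, 10);
    if (out->write_kbs < 0 || out->rewrite_kbs < 0 || out->read_kbs < 0)
        return -1;
    return 0;
}
```

Feeding each "software raid 5" and "hardware raid 5" row through `parse_bonnie_row()` and averaging the three fields would give a quick side-by-side of write, rewrite and read throughput. The CPU columns and the file creation figures are deliberately left out of this sketch.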
Justin.= ---1463747160-1478584756-1178400767=:18820-- From owner-xfs@oss.sgi.com Sat May 5 15:12:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 05 May 2007 15:13:00 -0700 (PDT) Received: from smtp113.sbc.mail.mud.yahoo.com (smtp113.sbc.mail.mud.yahoo.com [68.142.198.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l45MCrfB009680 for ; Sat, 5 May 2007 15:12:54 -0700 Received: (qmail 80976 invoked from network); 5 May 2007 22:12:52 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp113.sbc.mail.mud.yahoo.com with SMTP; 5 May 2007 22:12:52 -0000 X-YMail-OSG: kYmy1WoVM1lLSqHi8kH_YIg9mAqfQe1Fv.gpSI1oIdhpFszmQ05A3stLHe0TtQ_9tudApt0ekA-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 2A6251827261; Sat, 5 May 2007 15:12:50 -0700 (PDT) Date: Sat, 5 May 2007 15:12:50 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070505225819.0dd3c0fa@galadriel.home> X-archive-position: 11299 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sat, May 05, 2007 at 10:58:19PM +0200, Emmanuel Florac wrote: > Well I prefer staying away from the very latest bleeding edge, so I > stick to 2.6.20.11 for now. 
diff --git a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug index f68cc6f..908b755 100644 --- a/arch/i386/Kconfig.debug +++ b/arch/i386/Kconfig.debug @@ -56,15 +56,22 @@ config DEBUG_RODATA portion of the kernel code won't be covered by a 2MB TLB anymore. If in doubt, say "N". -config 4KSTACKS +config I386_4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure - on the VM subsystem for higher order allocations. This option - will also use IRQ stacks to compensate for the reduced stackspace. + on the VM subsystem for higher order allocations. + +config I386_IRQSTACKS + bool "Allocate separate IRQ stacks" + depends on DEBUG_KERNEL + default y + help + If you say Y here the kernel will allocate and use separate + stacks for interrupts. config X86_FIND_SMP_CONFIG bool diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 3201d42..0da8251 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -33,7 +33,7 @@ void ack_bad_irq(unsigned int irq) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * per-CPU IRQ handling contexts (thread information and stack) */ @@ -44,7 +44,7 @@ union irq_ctx { static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly; static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * do_IRQ handles all normal device IRQ's (the special @@ -57,7 +57,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) /* high bit used in ret_from_ code */ int irq = ~regs->orig_eax; struct irq_desc *desc = irq_desc + irq; -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS union irq_ctx *curctx, *irqctx; u32 *isp; #endif @@ -85,7 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) } #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS curctx = 
(union irq_ctx *) current_thread_info(); irqctx = hardirq_ctx[smp_processor_id()]; @@ -122,7 +122,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) : "memory", "cc" ); } else -#endif +#endif /* CONFIG_I386_IRQSTACKS */ desc->handle_irq(irq, desc); irq_exit(); @@ -130,7 +130,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) return 1; } -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS /* * These should really be __section__(".bss.page_aligned") as well, but @@ -220,7 +220,7 @@ asmlinkage void do_softirq(void) } EXPORT_SYMBOL(do_softirq); -#endif +#endif /* CONFIG_I386_IRQSTACKS */ /* * Interrupt statistics: diff --git a/include/asm-i386/irq.h b/include/asm-i386/irq.h index 11761cd..7db95e1 100644 --- a/include/asm-i386/irq.h +++ b/include/asm-i386/irq.h @@ -24,14 +24,14 @@ static __inline__ int irq_canonicalize(int irq) # define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */ #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_IRQSTACKS extern void irq_ctx_init(int cpu); extern void irq_ctx_exit(int cpu); # define __ARCH_HAS_DO_SOFTIRQ -#else +#else /* !CONFIG_I386_IRQSTACKS */ # define irq_ctx_init(cpu) do { } while (0) # define irq_ctx_exit(cpu) do { } while (0) -#endif +#endif /* CONFIG_I386_IRQSTACKS */ #ifdef CONFIG_IRQBALANCE extern int irqbalance_disable(char *str); diff --git a/include/asm-i386/module.h b/include/asm-i386/module.h index 02f8f54..7d5d2df 100644 --- a/include/asm-i386/module.h +++ b/include/asm-i386/module.h @@ -62,11 +62,11 @@ struct mod_arch_specific #error unknown processor family #endif -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define MODULE_STACKSIZE "4KSTACKS " -#else +#else /* not using CONFIG_I386_4KSTACKS */ #define MODULE_STACKSIZE "" -#endif +#endif /* CONFIG_I386_4KSTACKS */ #define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE diff --git a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h index 4b187bb..f5268e0 100644 --- a/include/asm-i386/thread_info.h +++ 
b/include/asm-i386/thread_info.h @@ -53,7 +53,7 @@ struct thread_info { #endif #define PREEMPT_ACTIVE 0x10000000 -#ifdef CONFIG_4KSTACKS +#ifdef CONFIG_I386_4KSTACKS #define THREAD_SIZE (4096) #else #define THREAD_SIZE (8192) From owner-xfs@oss.sgi.com Sun May 6 10:21:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:05 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HL1fB028324 for ; Sun, 6 May 2007 10:21:02 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CCD9D18762; Sun, 6 May 2007 19:21:00 +0200 (CEST) Date: Sun, 6 May 2007 19:21:04 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192104.3becdd81@galadriel.home> In-Reply-To: <20070505210002.GC17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HL2fB028330 X-archive-position: 11301 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 14:00:02 -0700 vous écriviez: > A 50TB filesystem might suck horrible on a 32-bit platform. I'm not > sure there is *ANY* way you coiuld fsck that should you need in some > cases. 
> > Is that what you're planning to do? Nope, I'll use an x86_64 system running an x86_64 kernel :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:21:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:21:46 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HLhfB028657 for ; Sun, 6 May 2007 10:21:44 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id CC0CF18718; Sun, 6 May 2007 19:21:42 +0200 (CEST) Date: Sun, 6 May 2007 19:21:46 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Eric Sandeen , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506192146.7f03cd4e@galadriel.home> In-Reply-To: <20070505221249.GA21960@tuatara.stupidest.org> References: <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <463B4962.70904@sandeen.net> <20070504173049.14606033@harpe.intellique.com> <20070504232028.GA19744@tuatara.stupidest.org> <20070505171931.6fe9b6f5@galadriel.home> <463CB5C4.7040803@sandeen.net> <20070505203525.GA16477@tuatara.stupidest.org> <20070505225819.0dd3c0fa@galadriel.home> <20070505221249.GA21960@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HLifB028674 X-archive-position: 11302 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 15:12:50 -0700 vous écriviez: > diff --git 
a/arch/i386/Kconfig.debug b/arch/i386/Kconfig.debug Thanks! -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:19:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:19:53 -0700 (PDT) Received: from postfix1-g20.free.fr (postfix1-g20.free.fr [212.27.60.42]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46HJmfB027935 for ; Sun, 6 May 2007 10:19:49 -0700 Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by postfix1-g20.free.fr (Postfix) with ESMTP id 582C9F34029 for ; Sun, 6 May 2007 19:19:47 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id E2A6818206; Sun, 6 May 2007 19:19:43 +0200 (CEST) Date: Sun, 6 May 2007 19:19:47 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: Martin Steigerwald , linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506191947.75a2058a@galadriel.home> In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46HJofB027943 X-archive-position: 11300 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sat, 5 May 2007 13:56:17 -0700 vous écriviez: > Is jfs supported by anyone right now? 
Huh, IBM I hope :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 10:26:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:26:14 -0700 (PDT) Received: from smtp114.sbc.mail.mud.yahoo.com (smtp114.sbc.mail.mud.yahoo.com [68.142.198.213]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l46HQ8fB030347 for ; Sun, 6 May 2007 10:26:09 -0700 Received: (qmail 92936 invoked from network); 6 May 2007 17:26:08 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp114.sbc.mail.mud.yahoo.com with SMTP; 6 May 2007 17:26:08 -0000 X-YMail-OSG: JpyOiz0VM1mldeiku.Hr8o32aTLyos4dDOQSFemrA1zdTVKzh2MZehYlzOHEUP1wl41_FWvOGg-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id DA4271827261; Sun, 6 May 2007 10:26:06 -0700 (PDT) Date: Sun, 6 May 2007 10:26:06 -0700 From: Chris Wedgwood To: Emmanuel Florac Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070506192104.3becdd81@galadriel.home> X-archive-position: 11303 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Sun, May 06, 2007 at 07:21:04PM +0200, Emmanuel Florac wrote: > Nope, I'll use an x86_64 system running an x86_64 kernel :) How much RAM? 
I think you'll want 10s of GBs possibly (well, it depends very much on what you're storing but you can fit a lot of small files in 150TB...) From owner-xfs@oss.sgi.com Sun May 6 10:56:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 10:56:08 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46Hu4fB005398 for ; Sun, 6 May 2007 10:56:05 -0700 Received: from localhost (dslb-084-057-122-104.pools.arcor-ip.net [84.57.122.104]) by mail.lichtvoll.de (Postfix) with ESMTP id 3A3FF5AD40 for ; Sun, 6 May 2007 19:56:03 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Date: Sun, 6 May 2007 19:56:02 +0200 User-Agent: KMail/1.9.6 References: <20070503164521.16efe075@harpe.intellique.com> <20070504234357.24d22883@galadriel.home> <20070505205617.GB17112@tuatara.stupidest.org> (sfid-20070506_174955_742323_AFBCDD13) In-Reply-To: <20070505205617.GB17112@tuatara.stupidest.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705061956.02375.Martin@lichtvoll.de> X-archive-position: 11304 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Saturday 05 May 2007, Chris Wedgwood wrote: > On Fri, May 04, 2007 at 11:43:57PM +0200, Emmanuel Florac wrote: > > Unfortunately ext3 doesn't support volumes bigger than 8TB, so > > that's useless to me. I plan to test jfs, however. > > Is jfs supported by anyone right now? David 'Dave' Kleikamp was still taking care of JFS when I asked him some questions about write barrier support back in July 2007. He concentrated on bug fixes though, not on new features.
Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Sun May 6 11:37:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 11:37:16 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l46IbCfB014325 for ; Sun, 6 May 2007 11:37:13 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 88D3D18737; Sun, 6 May 2007 20:37:11 +0200 (CEST) Date: Sun, 6 May 2007 20:36:49 +0200 From: Emmanuel Florac To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070506203649.1c4d9d14@galadriel.home> In-Reply-To: <20070506172606.GB4823@tuatara.stupidest.org> References: <20070503164521.16efe075@harpe.intellique.com> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <200705041758.21320.Martin@lichtvoll.de> <20070504234357.24d22883@galadriel.home> <463C0CD8.4090402@sandeen.net> <20070505171820.6e92d437@galadriel.home> <20070505210002.GC17112@tuatara.stupidest.org> <20070506192104.3becdd81@galadriel.home> <20070506172606.GB4823@tuatara.stupidest.org> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l46IbDfB014343 X-archive-position: 11305 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Sun, 6 May 2007 10:26:06 -0700 vous écriviez: > How much RAM? I think you'll want 10s of GBs possibly (well, it > depends very much on what you're storing but you can fit a lot of > small files in 150TB...) 
It will be video storage, big to huge file mainly. But I'll remember to stick as much RAM as I can :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Sun May 6 18:38:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 18:38:15 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l471c5fB018505 for ; Sun, 6 May 2007 18:38:08 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l471c2sH008659 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l471c2s11594 for xfs@oss.sgi.com; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l471c2O04063 for ; Mon, 7 May 2007 10:38:02 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070507.092351.98402312 for ; Mon, 7 May 2007 09:23:52 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Mon May 07 09:23:51 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 36BF6AE4B3; Mon, 7 May 2007 10:38:01 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l471c1ok001475; Mon, 7 May 2007 10:38:01 +0900 Message-Id: <200705070137.AA05294@TNESG9305.tnes.nec.co.jp> Date: Mon, 07 May 2007 10:37:56 +0900 To: xfs@oss.sgi.com Cc: tes@sgi.com Subject: [PATCH] Fix disable, enable, off and remove commands in xfs_quota. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: multipart/mixed; boundary="--------------------0751065352324900" X-archive-position: 11306 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs This is multipart message. ----------------------0751065352324900 Content-Type: text/plain; charset=iso-2022-jp Hi, I sent this mail 10 days ago but it got lost... The disable, enable, off and remove commands in xfs_quota don't work, because: 1) The argument type passed to quotactl() is wrong. "addr" is an fs_quota_stat_t structure in the original code, but it should be an unsigned int, as shown in the man page. (disable, enable, off and remove) 2) The wrong flag is used for the -ugp option check. (disable, enable, off and remove) 3) The accounting flag (XFS_QUOTA_*DQ_ACCT) is incorrectly used for disabling quota enforcement. (disable) 4) The accounting and enforcement flags are incorrectly used for removing space. (remove) 5) The quota types must be passed to quotactl() one at a time, but multiple quota types are passed in a single call when the -ug or -up options are specified. (remove) The attached patch fixes these problems.
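The flag handling described in points 2, 3 and 5 above can be sketched roughly as follows. This is only an illustration, not the attached patch: the `DEMO_*` constants are placeholder bit values made up for this example (the real definitions live in the XFS quota headers), and `disable_qflags()` and `split_quota_types()` are hypothetical helper names. For point 1, the patch itself passes the address of a plain unsigned int (`&qflags`) to the quotactl wrapper instead of an fs_quota_stat_t.

```c
#include <stddef.h>

/* Placeholder quota type bits, for illustration only. */
#define DEMO_USER_QUOTA   (1u << 0)
#define DEMO_GROUP_QUOTA  (1u << 1)
#define DEMO_PROJ_QUOTA   (1u << 2)

/* Placeholder accounting/enforcement bits, for illustration only. */
#define DEMO_UDQ_ACCT     (1u << 0)
#define DEMO_UDQ_ENFD     (1u << 1)
#define DEMO_GDQ_ACCT     (1u << 2)
#define DEMO_GDQ_ENFD     (1u << 3)
#define DEMO_PDQ_ACCT     (1u << 4)
#define DEMO_PDQ_ENFD     (1u << 5)

/*
 * Build the flag word for "disable".
 * Point 2: fall back to user quota when no quota *type* was selected,
 *          rather than testing the unrelated verbose flags.
 * Point 3: disabling enforcement sets only the *_ENFD bits and leaves
 *          accounting alone.
 */
unsigned int disable_qflags(unsigned int type)
{
    unsigned int qflags = 0;

    if (!type)
        type = DEMO_USER_QUOTA;
    if (type & DEMO_USER_QUOTA)
        qflags |= DEMO_UDQ_ENFD;
    if (type & DEMO_GROUP_QUOTA)
        qflags |= DEMO_GDQ_ENFD;
    if (type & DEMO_PROJ_QUOTA)
        qflags |= DEMO_PDQ_ENFD;
    return qflags;
}

/*
 * Point 5: split a combined type mask into single types so that
 * "remove" can issue one quotactl() call per type.
 * Writes up to three entries into out and returns how many were written.
 */
size_t split_quota_types(unsigned int type, unsigned int out[3])
{
    static const unsigned int all[3] = {
        DEMO_USER_QUOTA, DEMO_GROUP_QUOTA, DEMO_PROJ_QUOTA
    };
    size_t n = 0;
    size_t i;

    for (i = 0; i < 3; i++) {
        if (type & all[i])
            out[n++] = all[i];
    }
    return n;
}
```

A caller would loop over the array filled in by `split_quota_types()` and make one remove call per entry, stopping at the first failure. The sketch treats group and project quota as independent bits, which may not match how the real tool combines them.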
Signed-off-by: Utako Kusaka --- ----------------------0751065352324900 Content-Type: application/octet-stream; name="state.diff" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="state.diff" LS0tIHhmc3Byb2dzLTIuOC4yMC9xdW90YS9zdGF0ZS5vcmlnCTIwMDctMDQt MTkgMTM6MDc6MzguMDAwMDAwMDAwICswOTAwCisrKyB4ZnNwcm9ncy0yLjgu MjAvcXVvdGEvc3RhdGUuYwkyMDA3LTA0LTI2IDExOjQ2OjQ2LjAwMDAwMDAw MCArMDkwMApAQCAtMjUwLDEwICsyNTAsNiBAQCBlbmFibGVfZW5mb3JjZW1l bnQoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90CSptb3VudDsKLQlm c19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQotCXFzdGF0LnFzX3Zl cnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0LnFzX2ZsYWdzID0g cWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29rdXAoZGlyLCBGU19N T1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAtMjYxLDcgKzI1Nyw3 IEBAIGVuYWJsZV9lbmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIg PSBtb3VudC0+ZnNfbmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RB T04sIGRpciwgdHlwZSwgMCwgKHZvaWQgKikmcXN0YXQpIDwgMCkKKwlpZiAo eGZzcXVvdGFjdGwoWEZTX1FVT1RBT04sIGRpciwgdHlwZSwgMCwgKHZvaWQg KikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT04iKTsKIAll bHNlIGlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFm aWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTI3 NSwxMCArMjcxLDYgQEAgZGlzYWJsZV9lbmZvcmNlbWVudCgKIAl1aW50CQlm bGFncykKIHsKIAlmc19wYXRoX3QJKm1vdW50OwotCWZzX3F1b3RhX3N0YXRf dAlxc3RhdCA9IHsgMCB9OwotCi0JcXN0YXQucXNfdmVyc2lvbiA9IEZTX1FT VEFUX1ZFUlNJT047Ci0JcXN0YXQucXNfZmxhZ3MgPSBxZmxhZ3M7CiAKIAlt b3VudCA9IGZzX3RhYmxlX2xvb2t1cChkaXIsIEZTX01PVU5UX1BPSU5UKTsK IAlpZiAoIW1vdW50KSB7CkBAIC0yODYsNyArMjc4LDcgQEAgZGlzYWJsZV9l bmZvcmNlbWVudCgKIAkJcmV0dXJuOwogCX0KIAlkaXIgPSBtb3VudC0+ZnNf bmFtZTsKLQlpZiAoeGZzcXVvdGFjdGwoWEZTX1FVT1RBT0ZGLCBkaXIsIHR5 cGUsIDAsICh2b2lkICopJnFzdGF0KSA8IDApCisJaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxZmxhZ3Mp IDwgMCkKIAkJcGVycm9yKCJYRlNfUVVPVEFPRkYiKTsKIAllbHNlIGlmIChm bGFncyAmIFZFUkJPU0VfRkxBRykKIAkJc3RhdGVfcXVvdGFmaWxlX21vdW50 KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZsYWdzKTsKQEAgLTMwMCwxMCArMjky 
LDYgQEAgcXVvdGFvZmYoCiAJdWludAkJZmxhZ3MpCiB7CiAJZnNfcGF0aF90 CSptb3VudDsKLQlmc19xdW90YV9zdGF0X3QJcXN0YXQgPSB7IDAgfTsKLQot CXFzdGF0LnFzX3ZlcnNpb24gPSBGU19RU1RBVF9WRVJTSU9OOwotCXFzdGF0 LnFzX2ZsYWdzID0gcWZsYWdzOwogCiAJbW91bnQgPSBmc190YWJsZV9sb29r dXAoZGlyLCBGU19NT1VOVF9QT0lOVCk7CiAJaWYgKCFtb3VudCkgewpAQCAt MzExLDI0ICsyOTksMzEgQEAgcXVvdGFvZmYoCiAJCXJldHVybjsKIAl9CiAJ ZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3RsKFhGU19R VU9UQU9GRiwgZGlyLCB0eXBlLCAwLCAodm9pZCAqKSZxc3RhdCkgPCAwKQor CWlmICh4ZnNxdW90YWN0bChYRlNfUVVPVEFPRkYsIGRpciwgdHlwZSwgMCwg KHZvaWQgKikmcWZsYWdzKSA8IDApCiAJCXBlcnJvcigiWEZTX1FVT1RBT0ZG Iik7CiAJZWxzZSBpZiAoZmxhZ3MgJiBWRVJCT1NFX0ZMQUcpCiAJCXN0YXRl X3F1b3RhZmlsZV9tb3VudChzdGRvdXQsIHR5cGUsIG1vdW50LCBmbGFncyk7 CiB9CiAKK3N0YXRpYyBpbnQKK3JlbW92ZV9xdHlwZV9leHRlbnRzKAorCWNo YXIJCSpkaXIsCisJdWludAkJdHlwZSkKK3sKKwlpbnQJZXJyb3IgPSAwOwor CisJaWYgKChlcnJvciA9IHhmc3F1b3RhY3RsKFhGU19RVU9UQVJNLCBkaXIs IHR5cGUsIDAsICh2b2lkICopJnR5cGUpKSA8IDApCisJCXBlcnJvcigiWEZT X1FVT1RBUk0iKTsKKwlyZXR1cm4gZXJyb3I7Cit9CisKIHN0YXRpYyB2b2lk CiByZW1vdmVfZXh0ZW50cygKIAljaGFyCQkqZGlyLAogCXVpbnQJCXR5cGUs Ci0JdWludAkJcWZsYWdzLAogCXVpbnQJCWZsYWdzKQogewogCWZzX3BhdGhf dAkqbW91bnQ7Ci0JZnNfcXVvdGFfc3RhdF90CXFzdGF0ID0geyAwIH07Ci0K LQlxc3RhdC5xc192ZXJzaW9uID0gRlNfUVNUQVRfVkVSU0lPTjsKLQlxc3Rh dC5xc19mbGFncyA9IHFmbGFnczsKIAogCW1vdW50ID0gZnNfdGFibGVfbG9v a3VwKGRpciwgRlNfTU9VTlRfUE9JTlQpOwogCWlmICghbW91bnQpIHsKQEAg LTMzNiw5ICszMzEsMTggQEAgcmVtb3ZlX2V4dGVudHMoCiAJCXJldHVybjsK IAl9CiAJZGlyID0gbW91bnQtPmZzX25hbWU7Ci0JaWYgKHhmc3F1b3RhY3Rs KFhGU19RVU9UQVJNLCBkaXIsIHR5cGUsIDAsICh2b2lkICopJnFzdGF0KSA8 IDApCi0JCXBlcnJvcigiWEZTX1FVT1RBUk0iKTsKLQllbHNlIGlmIChmbGFn cyAmIFZFUkJPU0VfRkxBRykKKwlpZiAodHlwZSAmIFhGU19VU0VSX1FVT1RB KSB7CisJCWlmIChyZW1vdmVfcXR5cGVfZXh0ZW50cyhkaXIsIFhGU19VU0VS X1FVT1RBKSA8IDApIAorCQkJcmV0dXJuOworCX0KKwlpZiAodHlwZSAmIFhG U19HUk9VUF9RVU9UQSkgeworCQlpZiAocmVtb3ZlX3F0eXBlX2V4dGVudHMo ZGlyLCBYRlNfR1JPVVBfUVVPVEEpIDwgMCkgCisJCQlyZXR1cm47CisJfSBl 
bHNlIGlmICh0eXBlICYgWEZTX1BST0pfUVVPVEEpIHsKKwkJaWYgKHJlbW92 ZV9xdHlwZV9leHRlbnRzKGRpciwgWEZTX1BST0pfUVVPVEEpIDwgMCkgCisJ CQlyZXR1cm47CisJfQorCWlmIChmbGFncyAmIFZFUkJPU0VfRkxBRykKIAkJ c3RhdGVfcXVvdGFmaWxlX21vdW50KHN0ZG91dCwgdHlwZSwgbW91bnQsIGZs YWdzKTsKIH0KIApAQCAtMzc0LDcgKzM3OCw3IEBAIGVuYWJsZV9mKAogCWlm IChhcmdjICE9IG9wdGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJmVu YWJsZV9jbWQpOwogCi0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewog CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwogCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KQEAgLTM5 NSwxNSArMzk5LDE1IEBAIGRpc2FibGVfZigKIAkJc3dpdGNoIChjKSB7CiAJ CWNhc2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlx ZmxhZ3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhG U19RVU9UQV9HRFFfRU5GRDsKIAkJCWJyZWFrOwogCQljYXNlICdwJzoKIAkJ CXR5cGUgfD0gWEZTX1BST0pfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FV T1RBX1BEUV9BQ0NUOworCQkJcWZsYWdzIHw9IFhGU19RVU9UQV9QRFFfRU5G RDsKIAkJCWJyZWFrOwogCQljYXNlICd1JzoKIAkJCXR5cGUgfD0gWEZTX1VT RVJfUVVPVEE7Ci0JCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQkJcWZsYWdzIHw9IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFrOwog CQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAgLTQx Niw5ICs0MjAsOSBAQCBkaXNhYmxlX2YoCiAJaWYgKGFyZ2MgIT0gb3B0aW5k KQogCQlyZXR1cm4gY29tbWFuZF91c2FnZSgmZGlzYWJsZV9jbWQpOwogCi0J aWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19V U0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUOwor CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChm c19wYXRoLT5mc19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQpAQCAtNDU4LDcg KzQ2Miw3IEBAIG9mZl9mKAogCWlmIChhcmdjICE9IG9wdGluZCkKIAkJcmV0 dXJuIGNvbW1hbmRfdXNhZ2UoJm9mZl9jbWQpOwogCi0JaWYgKCFmbGFncykg eworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwog CQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NUIHwgWEZTX1FVT1RBX1VE UV9FTkZEOwogCX0KQEAgLTQ3MywyMSArNDc3LDE4IEBAIHJlbW92ZV9mKAog CWludAkJYXJnYywKIAljaGFyCQkqKmFyZ3YpCiB7Ci0JaW50CQljLCBmbGFn cyA9IDAsIHFmbGFncyA9IDAsIHR5cGUgPSAwOworCWludAkJYywgZmxhZ3Mg 
PSAwLCB0eXBlID0gMDsKIAogCXdoaWxlICgoYyA9IGdldG9wdChhcmdjLCBh cmd2LCAiZ3B1diIpKSAhPSBFT0YpIHsKIAkJc3dpdGNoIChjKSB7CiAJCWNh c2UgJ2cnOgogCQkJdHlwZSB8PSBYRlNfR1JPVVBfUVVPVEE7Ci0JCQlxZmxh Z3MgfD0gWEZTX1FVT1RBX0dEUV9BQ0NUIHwgWEZTX1FVT1RBX0dEUV9FTkZE OwogCQkJYnJlYWs7CiAJCWNhc2UgJ3AnOgogCQkJdHlwZSB8PSBYRlNfUFJP Sl9RVU9UQTsKLQkJCXFmbGFncyB8PSBYRlNfUVVPVEFfUERRX0FDQ1QgfCBY RlNfUVVPVEFfUERRX0VORkQ7CiAJCQlicmVhazsKIAkJY2FzZSAndSc6CiAJ CQl0eXBlIHw9IFhGU19VU0VSX1FVT1RBOwotCQkJcWZsYWdzIHw9IFhGU19R VU9UQV9VRFFfQUNDVCB8IFhGU19RVU9UQV9VRFFfRU5GRDsKIAkJCWJyZWFr OwogCQljYXNlICd2JzoKIAkJCWZsYWdzIHw9IFZFUkJPU0VfRkxBRzsKQEAg LTUwMCwxMyArNTAxLDEyIEBAIHJlbW92ZV9mKAogCWlmIChhcmdjICE9IG9w dGluZCkKIAkJcmV0dXJuIGNvbW1hbmRfdXNhZ2UoJnJlbW92ZV9jbWQpOwog Ci0JaWYgKCFmbGFncykgeworCWlmICghdHlwZSkgewogCQl0eXBlIHw9IFhG U19VU0VSX1FVT1RBOwotCQlxZmxhZ3MgfD0gWEZTX1FVT1RBX1VEUV9BQ0NU IHwgWEZTX1FVT1RBX1VEUV9FTkZEOwogCX0KIAogCWlmIChmc19wYXRoLT5m c19mbGFncyAmIEZTX01PVU5UX1BPSU5UKQotCQlyZW1vdmVfZXh0ZW50cyhm c19wYXRoLT5mc19kaXIsIHR5cGUsIHFmbGFncywgZmxhZ3MpOworCQlyZW1v dmVfZXh0ZW50cyhmc19wYXRoLT5mc19kaXIsIHR5cGUsIGZsYWdzKTsKIAly ZXR1cm4gMDsKIH0KIAo= ----------------------0751065352324900-- From owner-xfs@oss.sgi.com Sun May 6 19:11:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 06 May 2007 19:11:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l472BXfB024497 for ; Sun, 6 May 2007 19:11:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA17577; Mon, 7 May 2007 12:11:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l472BOAf86463303; Mon, 7 May 2007 12:11:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l472BMNV85439097; Mon, 7 May 2007 12:11:22 
+1000 (AEST) Date: Mon, 7 May 2007 12:11:22 +1000 From: David Chinner To: Emmanuel Florac Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070504152546.614374ac@harpe.intellique.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11307 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 03:25:46PM +0200, Emmanuel Florac wrote: > Le Fri, 4 May 2007 17:33:44 +1000 > David Chinner écrivait: > > > Well, there's your problem. Stack overflows. IMO, if you use a > > filesystem, you shouldn't use 4k stacks. ;) > > > > If you remake you kernel with 8k stacks then your problems will > > most likely go away. > > Well, I've double-checked the asm-i386/module.h, and it actually looks > like 4K stacks is NOT the default, so I must be using 8K, isn't it? Yes. > I've ran the same test on the same machine but WITHOUT software raid-0 > (so write barriers are in use), and all went well, more than 3TB > written without a glitch. I still think there's something related to > the write barriers here. I'll try with another RAID controller, Adaptec > for instance, to get sure the 3ware driver isn't involved. I'll also try > again with an amd64 kernel. So you use software raid and you get corruptions, right? I doubt this has anything to do with write barriers - if it does thats an indication of broken drivers or hardware..... 
Can you run with "-o nobarrier" and no software raid and see if you still have a problem? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 03:07:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 03:07:59 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47A7tfB005622 for ; Mon, 7 May 2007 03:07:56 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id A440617CFA; Mon, 7 May 2007 12:07:54 +0200 (CEST) Date: Mon, 7 May 2007 12:07:54 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com Subject: Re: XFS crash on linux raid Message-ID: <20070507120754.289deffd@galadriel.home> In-Reply-To: <20070507021122.GQ32602149@melbourne.sgi.com> References: <20070503164521.16efe075@harpe.intellique.com> <20070504005922.GC32602149@melbourne.sgi.com> <20070504090613.7c0f97d3@galadriel.home> <20070504073344.GL32602149@melbourne.sgi.com> <20070504152546.614374ac@harpe.intellique.com> <20070507021122.GQ32602149@melbourne.sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l47A7vfB005627 X-archive-position: 11308 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Mon, 7 May 2007 12:11:22 +1000 vous écriviez: > So you use software raid and you get corruptions, right? I doubt this > has anything to do with write barriers - if it does thats an > indication of broken drivers or hardware..... > > Can you run with "-o nobarrier" and no software raid and see if you > still have a problem? 
I tried on the same machine without software RAID and with barriers enabled, and it worked OK. I'll try today with nobarrier. Stay tuned :) -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Mon May 7 04:03:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:03:55 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47B3lfB017677 for ; Mon, 7 May 2007 04:03:49 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47B3hN7031521 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47B3hcJ515560 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47B3g21005021 for ; Mon, 7 May 2007 07:03:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47B3frL004964; Mon, 7 May 2007 07:03:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5AB9D94BBD; Mon, 7 May 2007 16:33:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47B3nwC010945; Mon, 7 May 2007 16:33:49 +0530 Date: Mon, 7 May 2007 16:33:48 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507110348.GA7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503212955.b1b6443c.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11309 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > Please add a comment over this function which specifies its behaviour. > Really it should be enough material from which a full manpage can be > written. > > If that's all too much, this material should at least be spelled out in the > changelog. Because there's no way in which this change can be fully > reviewed unless someone (ie: you) tells us what it is setting out to > achieve. 
> > If we 100% implement some standard then a URL for what we claim to > implement would suffice. Given that we're at least using different types from > posix I doubt if such a thing would be sufficient. > > And given the complexity and potential variability within the filesystem > implementations of this, I'd expect that _something_ additional needs to be > said? Ok. I will add a detailed comment here. > > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I think we should go ahead with the current glibc implementation (which Jakub pointed to) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. > > > + ret = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + if (!(file->f_mode & FMODE_WRITE)) > > + goto out_fput; > > + > > + inode = file->f_path.dentry->d_inode; > > + > > + ret = -ESPIPE; > > + if (S_ISFIFO(inode->i_mode)) > > + goto out_fput; > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. True. > > + ret = -EFBIG; > > + if (offset + len > inode->i_sb->s_maxbytes) > > + goto out_fput; > > This code does handle offset+len going negative, but only by accident, I > suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment > here would settle the reader's mind. Ok. I will add a check here for wrap through zero.
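For illustration only, here is a minimal user-space sketch of how the -EFBIG range check could avoid relying on `offset + len` wrapping. It is not the patch's code: the helper name `range_exceeds_max` is hypothetical, and `int64_t`/`uint64_t` stand in for `loff_t` and `s_maxbytes`. It assumes `offset >= 0` and `len > 0` have already been checked.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the range starting at offset
 * and spanning len bytes would end beyond maxbytes.
 * Assumes the caller already rejected offset < 0 and len <= 0.
 * The sum offset + len is never computed, so nothing can wrap.
 */
static bool range_exceeds_max(int64_t offset, int64_t len, uint64_t maxbytes)
{
	if ((uint64_t)offset > maxbytes)
		return true;
	/* maxbytes - offset cannot underflow because offset <= maxbytes. */
	return (uint64_t)len > maxbytes - (uint64_t)offset;
}
```

Comparing `len` against the remaining room (`maxbytes - offset`) gives the same answer as `offset + len > maxbytes` whenever the sum does not overflow, and still gives a well-defined answer when it would.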
> > + if (inode->i_op && inode->i_op->fallocate) > > + ret = inode->i_op->fallocate(inode, mode, offset, len); > > + else > > + ret = -ENOSYS; > > If we _are_ going to support negative `len', as posix suggests, I think we > should perform the appropriate sanity conversions to `offset' and `len' > right here, rather than expecting each filesystem to do it. > > If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). > > +out_fput: > > + fput(file); > > +out: > > + return ret; > > +} > > +EXPORT_SYMBOL(sys_fallocate); > > I don't believe this needs to be exported to modules? Ok. Will remove it. > > +/* > > + * fallocate() modes > > + */ > > +#define FA_ALLOCATE 0x1 > > +#define FA_DEALLOCATE 0x2 > > Now those aren't in posix. They should be documented, along with their > expected semantics. Will add a comment describing the role of these modes. > > #ifdef __KERNEL__ > > > > #include > > @@ -1125,6 +1131,7 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *, int, loff_t, loff_t); > > I really do think it's better to put the variable names in definitions such > as this. Especially when we have two identically-typed variables next to > each other like that. Quick: which one is the offset and which is the > length? Ok. Will add the variable names here. 
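As a rough sketch of the argument check described above (rejecting a negative or zero 'len' in the same place as a negative offset), something along these lines might do. The function name `fallocate_check_args` and the use of `int64_t` in place of `loff_t` are assumptions made for this user-space illustration; the parameters are named so that offset and length cannot be confused.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical up-front validation: both conditions are checked
 * together, before any file lookup happens.
 * Returns 0 when the arguments are acceptable, -EINVAL otherwise.
 */
static long fallocate_check_args(int64_t offset, int64_t len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	return 0;
}
```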
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:10:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:10:42 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BAbfB018894 for ; Mon, 7 May 2007 04:10:39 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47BBYno028901 for ; Mon, 7 May 2007 07:11:34 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BAbK5550866 for ; Mon, 7 May 2007 07:10:37 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47BAaUk020671 for ; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47BAZ5f020654; Mon, 7 May 2007 07:10:36 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B3CE494BBD; Mon, 7 May 2007 16:40:38 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BAcV7013746; Mon, 7 May 2007 16:40:38 +0530 Date: Mon, 7 May 2007 16:40:38 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: David Chinner , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070507111038.GB7012@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503232815.2f62a75e.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11310 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the above > check was relaxed? I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? 
One thing to consider in this case is the error code that the file system implementation should return in case it doesn't support preallocation for directories. Should it be -ENODEV (to match what posix says), or something else (which might make more sense in this case)? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 04:46:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 04:46:48 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47BkifB032153 for ; Mon, 7 May 2007 04:46:45 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47Bkh21026930 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47BkhAb550744 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47Bkh3g010826 for ; Mon, 7 May 2007 07:46:43 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47Bkg9x010807; Mon, 7 May 2007 07:46:42 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id ECD1694BBD; Mon, 7 May 2007 17:16:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47BknTn028767; Mon, 7 May 2007 17:16:49 +0530 Date: Mon, 7 May 2007 17:16:49 +0530 From: "Amit K.
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 3/5] ext4: Extent overlap bugfix Message-ID: <20070507114649.GC7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181101.GC7209@amitarora.in.ibm.com> <20070503213002.eff696db.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213002.eff696db.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11311 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. 
Arora" wrote: > > > +unsigned int ext4_ext_check_overlap(struct inode *inode, > > + struct ext4_extent *newext, > > + struct ext4_ext_path *path) > > +{ > > + unsigned long b1, b2; > > + unsigned int depth, len1; > > + > > + b1 = le32_to_cpu(newext->ee_block); > > + len1 = le16_to_cpu(newext->ee_len); > > + depth = ext_depth(inode); > > + if (!path[depth].p_ext) > > + goto out; > > + b2 = le32_to_cpu(path[depth].p_ext->ee_block); > > + > > + /* get the next allocated block if the extent in the path > > + * is before the requested block(s) */ > > + if (b2 < b1) { > > + b2 = ext4_ext_next_allocated_block(path); > > + if (b2 == EXT_MAX_BLOCK) > > + goto out; > > + } > > + > > + if (b1 + len1 > b2) { > > Are we sure that b1+len cannot wrap through zero here? No. Will add a check here for this. Thanks! > > + newext->ee_len = cpu_to_le16(b2 - b1); > > + return 1; > > + } -- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:11:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:11:58 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47CBqfB003761 for ; Mon, 7 May 2007 05:11:54 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7E7Y027576 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:07:15 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C3sWu029214 for ; Mon, 7 May 2007 08:03:54 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47C7CFZ184746 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id 
l47C7CKS012675 for ; Mon, 7 May 2007 06:07:12 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47C7Bi2012612; Mon, 7 May 2007 06:07:11 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2FC5D94BBD; Mon, 7 May 2007 17:37:19 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47C7Jq7004761; Mon, 7 May 2007 17:37:19 +0530 Date: Mon, 7 May 2007 17:37:19 +0530 From: "Amit K. Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11312 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > This patch has the ext4 implemtation of fallocate system call. > > > > ... 
> > > > + /* ext4_can_extents_be_merged should have checked that either > > + * both extents are uninitialized, or both aren't. Thus we > > + * need to check only one of them here. > > + */ > > Please always format multiline comments like this: > > /* > * ext4_can_extents_be_merged should have checked that either > * both extents are uninitialized, or both aren't. Thus we > * need to check only one of them here. > */ Ok. > > ... > > > > +/* > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. Ok. Will expand the description. > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. > Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. 
> > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > + - block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? You are right to say that the credits can not be fixed here. But, 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the "segment range" already being allocated case) the allocation request might touch. One solution I have thought is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the credits are falling short. Incase journal_extend fails, we call journal_restart. This will automatically take care of how much journal space we might need for any value of "len". > > + handle=ext4_journal_start(inode, credits + > > Please always put spaces around "="A Ok. > > > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); > > And around "+" Ok. > > > + if (IS_ERR(handle)) > > + return PTR_ERR(handle); > > +retry: > > + ret = 0; > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ok. Will do that. > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > Use buffer_new() here. A separate patch which fixes the three existing > instances of open-coded BH_foo usage would be appreciated. Ok. 
> > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > Check for wrap though the sign bit and through zero please. Ok. > > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > + /* Time to update the file size. > > + * Update only when preallocation was requested beyond the file size. > > + */ > > Fix comment layout. Ok. > > > + if ((offset + len) > i_size_read(inode)) { > > Both the lhs and the rhs here are signed. Please review for possible > overflows through the sign bit and through zero. Perhaps a comment > explaining why it's correct would be appropriate. Ok. > > > > + if (ret > 0) { > > + /* if no error, we assume preallocation succeeded completely */ > > + mutex_lock(&inode->i_mutex); > > + i_size_write(inode, offset + len); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } else if (ret < 0 && nblocks) { > > + /* Handle partial allocation scenario */ > > The above two comments should be indented one additional tabstop. Ok. > > > + loff_t newsize; > > + mutex_lock(&inode->i_mutex); > > + newsize = (nblocks << blkbits) + i_size_read(inode); > > + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); > > + EXT4_I(inode)->i_disksize = i_size_read(inode); > > + mutex_unlock(&inode->i_mutex); > > + } > > + } > > + ext4_mark_inode_dirty(handle, inode); > > + ret2 = ext4_journal_stop(handle); > > + if (ret > 0) > > + ret = ret2; > > + > > + return ret > 0 ? 
0 : ret; > > +} > > + > > EXPORT_SYMBOL(ext4_mark_inode_dirty); > > EXPORT_SYMBOL(ext4_ext_invalidate_cache); > > EXPORT_SYMBOL(ext4_ext_insert_extent); > > EXPORT_SYMBOL(ext4_ext_walk_space); > > EXPORT_SYMBOL(ext4_ext_find_goal); > > EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); > > +EXPORT_SYMBOL(ext4_fallocate); > > > > Index: linux-2.6.21/fs/ext4/file.c > > =================================================================== > > --- linux-2.6.21.orig/fs/ext4/file.c > > +++ linux-2.6.21/fs/ext4/file.c > > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ > > .removexattr = generic_removexattr, > > #endif > > .permission = ext4_permission, > > + .fallocate = ext4_fallocate, > > }; > > > > Index: linux-2.6.21/include/linux/ext4_fs.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs.h > > +++ linux-2.6.21/include/linux/ext4_fs.h > > @@ -102,6 +102,8 @@ > > EXT4_GOOD_OLD_FIRST_INO : \ > > (s)->s_first_ino) > > #endif > > +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ > > + (~((1 << blkbits)-1))) > > Maybe a comment describing what this does? Probably it's obvious enough. > > I think it could use the standard ALIGN macro. > > Is blkbits sufficiently parenthesised here? Even if it is, adding the > parens would be better practice. I agree. Will change it. > > > /* > > * Macro-instructions used to manage fragments > > @@ -225,6 +227,10 @@ struct ext4_new_group_data { > > __u32 free_blocks_count; > > }; > > > > +/* Following is used by preallocation logic to tell get_blocks() that we > > + * want uninitialzed extents. > > + */ > > Please convert all newly-added multiline comments to the preferred layout. Ok. 
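On the EXT4_BLOCK_ALIGN point above: one way to sidestep the question of whether `blkbits` is sufficiently parenthesised is to write the round-up as a function, since function arguments are evaluated once and need no extra parentheses. The following is only a user-space sketch under that assumption; the name `block_align` is hypothetical and `int64_t` stands in for `loff_t`. In the kernel itself, the standard ALIGN macro mentioned in the review may be the simpler choice.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for EXT4_BLOCK_ALIGN: round size up to the
 * next multiple of the block size, where block size = 1 << blkbits.
 * Assumes size >= 0 and that blkbits is small enough that
 * size + mask does not overflow.
 */
static inline int64_t block_align(int64_t size, unsigned int blkbits)
{
	int64_t mask = ((int64_t)1 << blkbits) - 1;

	return (size + mask) & ~mask;
}
```

With a 4 KiB block size (`blkbits` of 12), a size of 1 rounds up to 4096 and a size of 4097 rounds up to 8192, while an already aligned size is returned unchanged.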
> > > +#define EXT4_CREATE_UNINITIALIZED_EXT 2 > > > > /* > > * ioctl commands > > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t > > extern void ext4_ext_truncate(struct inode *, struct page *); > > extern void ext4_ext_init(struct super_block *); > > extern void ext4_ext_release(struct super_block *); > > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); > > argh. And feel free to give these args some useful names. Ok. > > > static inline int > > ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, > > unsigned long max_blocks, struct buffer_head *bh, > > Index: linux-2.6.21/include/linux/ext4_fs_extents.h > > =================================================================== > > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h > > +++ linux-2.6.21/include/linux/ext4_fs_extents.h > > @@ -125,6 +125,19 @@ struct ext4_ext_path { > > #define EXT4_EXT_CACHE_EXTENT 2 > > > > /* > > + * Macro-instructions to handle (mark/unmark/check/create) unitialized > > + * extents. Applications can issue an IOCTL for preallocation, which results > > + * in assigning unitialized extents to the file. > > + */ > > +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ > > + cpu_to_le16(0x8000)) > > +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x8000) > > +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ > > + 0x7FFF) > > inlined C functions are preferred, and I think these could be implemented > that way. Ok. Will convert them to inline functions. Thanks! 
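For illustration, the three macros above could be turned into inline functions roughly as follows. This is only a sketch under stated assumptions, not the code that was eventually merged: struct demo_extent and the demo_* byte-order helpers are hypothetical userspace stand-ins for struct ext4_extent, cpu_to_le16() and le16_to_cpu(), so that the example builds outside the kernel.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct ext4_extent; only the length field. */
struct demo_extent {
	uint16_t ee_len;	/* stored little-endian, as on disk */
};

/* Hypothetical stand-in for the kernel's cpu_to_le16(). */
static inline uint16_t demo_cpu_to_le16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Hypothetical stand-in for the kernel's le16_to_cpu(). */
static inline uint16_t demo_le16_to_cpu(uint16_t v)
{
	uint8_t b[2];

	memcpy(b, &v, sizeof(b));
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Set the top bit of ee_len to flag the extent as uninitialized. */
static inline void demo_ext_mark_uninitialized(struct demo_extent *ext)
{
	ext->ee_len |= demo_cpu_to_le16(0x8000);
}

/* Non-zero if the top bit of ee_len is set. */
static inline int demo_ext_is_uninitialized(const struct demo_extent *ext)
{
	return (demo_le16_to_cpu(ext->ee_len) & 0x8000) != 0;
}

/* Length in blocks with the uninitialized flag masked off. */
static inline uint16_t demo_ext_get_actual_len(const struct demo_extent *ext)
{
	return demo_le16_to_cpu(ext->ee_len) & 0x7FFF;
}
```

Compared with the macros, the functions give type checking on the argument and evaluate it exactly once.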
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 05:24:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 05:24:59 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47COsfB005440 for ; Mon, 7 May 2007 05:24:55 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CBCnQ029352 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 08:11:12 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l47C7pKq031566 for ; Mon, 7 May 2007 08:07:51 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47CB9dj171804 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47CB9VC024191 for ; Mon, 7 May 2007 06:11:09 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47CB7of024105; Mon, 7 May 2007 06:11:08 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2B6F894BBD; Mon, 7 May 2007 17:41:16 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47CBFgC006463; Mon, 7 May 2007 17:41:15 +0530 Date: Mon, 7 May 2007 17:41:15 +0530 From: "Amit K. 
Arora" To: Andrew Morton Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070507121115.GE7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> <20070503213238.5cdb1585.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213238.5cdb1585.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-archive-position: 11313 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" wrote: > > + */ > > +int ext4_ext_try_to_merge(struct inode *inode, > > + struct ext4_ext_path *path, > > + struct ext4_extent *ex) > > +{ > > + struct ext4_extent_header *eh; > > + unsigned int depth, len; > > + int merge_done=0, uninitialized = 0; > > space around "=", please. > > Many people prefer not to do the multiple-definitions-per-line, btw: > > int merge_done = 0; > int uninitialized = 0; Ok. Will make the change. > > reasons: > > - If gives you some space for a nice comment > > - It makes patches much more readable, and it makes rejects easier to fix > > - standardisation. 
> > > + depth = ext_depth(inode); > > + BUG_ON(path[depth].p_hdr == NULL); > > + eh = path[depth].p_hdr; > > + > > + while (ex < EXT_LAST_EXTENT(eh)) { > > + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) > > + break; > > + /* merge with next extent! */ > > + if (ext4_ext_is_uninitialized(ex)) > > + uninitialized = 1; > > + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) > > + + ext4_ext_get_actual_len(ex + 1)); > > + if (uninitialized) > > + ext4_ext_mark_uninitialized(ex); > > + > > + if (ex + 1 < EXT_LAST_EXTENT(eh)) { > > + len = (EXT_LAST_EXTENT(eh) - ex - 1) > > + * sizeof(struct ext4_extent); > > + memmove(ex + 1, ex + 2, len); > > + } > > + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); > > Kenrel convention is to put spaces around "-" Will fix this. > > > + merge_done = 1; > > + BUG_ON(eh->eh_entries == 0); > > eek, scary BUG_ON. Do we really need to be that severe? Would it be > better to warn and run ext4_error() here? Ok. > > > + } > > + > > + return merge_done; > > +} > > + > > + > > > > ... > > > > +/* > > + * ext4_ext_convert_to_initialized: > > + * this function is called by ext4_ext_get_blocks() if someone tries to write > > + * to an uninitialized extent. It may result in splitting the uninitialized > > + * extent into multiple extents (upto three). Atleast one initialized extent > > + * and atmost two uninitialized extents can result. > > There are some typos here > > > + * There are three possibilities: > > + * a> No split required: Entire extent should be initialized. > > + * b> Split into two extents: Only one end of the extent is being written to. > > + * c> Split into three extents: Somone is writing in middle of the extent. > > and here > Ok. Will fix them. 
> > + */ > > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, > > + struct ext4_ext_path *path, > > + ext4_fsblk_t iblock, > > + unsigned long max_blocks) > > +{ > > + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; > > + struct ext4_extent_header *eh; > > + unsigned int allocated, ee_block, ee_len, depth; > > + ext4_fsblk_t newblock; > > + int err = 0, ret = 0; > > + > > + depth = ext_depth(inode); > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + ee_block = le32_to_cpu(ex->ee_block); > > + ee_len = ext4_ext_get_actual_len(ex); > > + allocated = ee_len - (iblock - ee_block); > > + newblock = iblock - ee_block + ext_pblock(ex); > > + ex2 = ex; > > + > > + /* ex1: ee_block to iblock - 1 : uninitialized */ > > + if (iblock > ee_block) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* for sanity, update the length of the ex2 extent before > > + * we insert ex3, if ex1 is NULL. This is to avoid temporary > > + * overlap of blocks. > > + */ > > + if (!ex1 && allocated > max_blocks) > > + ex2->ee_len = cpu_to_le16(max_blocks); > > + /* ex3: to ee_block + ee_len : uninitialised */ > > + if (allocated > max_blocks) { > > + unsigned int newdepth; > > + ex3 = &newex; > > + ex3->ee_block = cpu_to_le32(iblock + max_blocks); > > + ext4_ext_store_pblock(ex3, newblock + max_blocks); > > + ex3->ee_len = cpu_to_le16(allocated - max_blocks); > > + ext4_ext_mark_uninitialized(ex3); > > + err = ext4_ext_insert_extent(handle, inode, path, ex3); > > + if (err) > > + goto out; > > + /* The depth, and hence eh & ex might change > > + * as part of the insert above. > > + */ > > + newdepth = ext_depth(inode); > > + if (newdepth != depth) > > + { > > Use > > if (newdepth != depth) { Ok. > > > + depth=newdepth; > > spaces Ok. 
> > > + path = ext4_ext_find_extent(inode, iblock, NULL); > > + if (IS_ERR(path)) { > > + err = PTR_ERR(path); > > + path = NULL; > > + goto out; > > + } > > + eh = path[depth].p_hdr; > > + ex = path[depth].p_ext; > > + if (ex2 != &newex) > > + ex2 = ex; > > + } > > + allocated = max_blocks; > > + } > > + /* If there was a change of depth as part of the > > + * insertion of ex3 above, we need to update the length > > + * of the ex1 extent again here > > + */ > > + if (ex1 && ex1 != ex) { > > + ex1 = ex; > > + ex1->ee_len = cpu_to_le16(iblock - ee_block); > > + ext4_ext_mark_uninitialized(ex1); > > + ex2 = &newex; > > + } > > + /* ex2: iblock to iblock + maxblocks-1 : initialised */ > > + ex2->ee_block = cpu_to_le32(iblock); > > + ex2->ee_start = cpu_to_le32(newblock); > > + ext4_ext_store_pblock(ex2, newblock); > > + ex2->ee_len = cpu_to_le16(allocated); > > + if (ex2 != ex) > > + goto insert; > > + if ((err = ext4_ext_get_access(handle, inode, path + depth))) > > + goto out; > > The preferred style is > > err = ext4_ext_get_access(handle, inode, path + depth); > if (err) > goto out; Right. Will change it. > > + /* New (initialized) extent starts from the first block > > + * in the current extent. i.e., ex2 == ex > > + * We have to see if it can be merged with the extent > > + * on the left. > > + */ > > + if (ex2 > EXT_FIRST_EXTENT(eh)) { > > + /* To merge left, pass "ex2 - 1" to try_to_merge(), > > + * since it merges towards right _only_. > > + */ > > + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); > > + if (ret) { > > + err = ext4_ext_correct_indexes(handle, inode, path); > > + if (err) > > + goto out; > > + depth = ext_depth(inode); > > + ex2--; > > + } > > + } > > + /* Try to Merge towards right. This might be required > > + * only when the whole extent is being written to. > > + * i.e. ex2==ex and ex3==NULL. 
> > + */ > > + if (!ex3) { > > + ret = ext4_ext_try_to_merge(inode, path, ex2); > > + if (ret) { > > + err = ext4_ext_correct_indexes(handle, inode, path); > > + if (err) > > + goto out; > > + } > > + } > > + /* Mark modified extent as dirty */ > > + err = ext4_ext_dirty(handle, inode, path + depth); > > + goto out; > > +insert: > > + err = ext4_ext_insert_extent(handle, inode, path, &newex); > > +out: > > + return err ? err : allocated; > > +} > > Sigh. I hope you guys know how all this works, because the extent code is > a mystery to me. Is the on-disk layout and the allocation strategy > described anywhere? > > > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); > > Again, I do think that sticking the identifiers in there helps > readability. Although it is not as important in a boring old declaration > as it is in, say, inode_operations, etc. > > Please try to keep the code looking nice in an 80-column display. Ok. Will make the required changes. Thanks again for your comments! 
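Regarding the question of how this works: the arithmetic behind the three cases (no split, split in two, split in three) can be sketched in isolation. The following is a simplified userspace model and not the ext4 code itself; struct demo_piece and demo_split_extent() are hypothetical names, and the tree insertion, journalling and merging done by the real function are left out. It only mirrors the "allocated = ee_len - (iblock - ee_block)" bookkeeping from the patch.

```c
#include <stdint.h>

/* Hypothetical description of one resulting extent. */
struct demo_piece {
	uint32_t start;		/* first logical block */
	uint32_t len;		/* length in blocks */
	int uninitialized;	/* 1 = still unwritten, 0 = initialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a write
 * of at most max_blocks blocks starting at iblock.  Fills out[] in logical
 * order and returns the number of pieces (1 to 3), or 0 if the write does
 * not start inside the extent.
 */
static int demo_split_extent(uint32_t ee_block, uint32_t ee_len,
			     uint32_t iblock, uint32_t max_blocks,
			     struct demo_piece out[3])
{
	uint32_t allocated;
	uint32_t written;
	int n = 0;

	if (iblock < ee_block || iblock - ee_block >= ee_len || max_blocks == 0)
		return 0;

	/* Blocks from iblock up to the end of the original extent. */
	allocated = ee_len - (iblock - ee_block);

	/* ex1: ee_block to iblock - 1, stays uninitialized. */
	if (iblock > ee_block)
		out[n++] = (struct demo_piece){ ee_block, iblock - ee_block, 1 };

	/* ex2: the part being written, becomes initialized. */
	written = allocated > max_blocks ? max_blocks : allocated;
	out[n++] = (struct demo_piece){ iblock, written, 0 };

	/* ex3: whatever is left after the write, stays uninitialized. */
	if (allocated > max_blocks)
		out[n++] = (struct demo_piece){ iblock + max_blocks,
						allocated - max_blocks, 1 };

	return n;
}
```

For example, a 10-block extent starting at block 100 with a 4-block write at block 103 yields three pieces: blocks 100-102 uninitialized, 103-106 initialized, 107-109 uninitialized.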
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:04:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:04:30 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D4PfB010686 for ; Mon, 7 May 2007 06:04:26 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47D4Odp031307 for ; Mon, 7 May 2007 09:04:24 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47D4ObH130260 for ; Mon, 7 May 2007 07:04:24 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47D4NSZ024479 for ; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47D4MAa024384; Mon, 7 May 2007 07:04:23 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 774D594BBD; Mon, 7 May 2007 18:34:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l47D4T2s028246; Mon, 7 May 2007 18:34:29 +0530 Date: Mon, 7 May 2007 18:34:29 +0530 From: "Amit K. 
Arora" To: Pekka Enberg Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070507130429.GA6681@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> User-Agent: Mutt/1.4.1i X-archive-position: 11314 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote: > On 4/26/07, Amit K. Arora wrote: > > /* > >+ * ext4_ext_try_to_merge: > >+ * tries to merge the "ex" extent to the next extent in the tree. > >+ * It always tries to merge towards right. If you want to merge towards > >+ * left, pass "ex - 1" as argument instead of "ex". > >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > >+ * 1 if they got merged. > >+ */ > >+int ext4_ext_try_to_merge(struct inode *inode, > >+ struct ext4_ext_path *path, > >+ struct ext4_extent *ex) > >+{ > > Please either use proper kerneldoc format or drop > "ext4_ext_try_to_merge" from the comment. Ok, Thanks. 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Mon May 7 06:07:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:07:09 -0700 (PDT) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.174]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47D75fB011315 for ; Mon, 7 May 2007 06:07:06 -0700 Received: by ug-out-1314.google.com with SMTP id t39so888009ugd for ; Mon, 07 May 2007 06:07:04 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=urrBzjEDw6lND4d8kP5iZTiYWtAhrbTo0ORHvRO5Ac0osqutU/p2ps7ovA3enA6g5I7Jm25FfmAzxoA7atZKmc+vZuTtqPfy9vd5MJb3PzeWa9bscWaYVyMm7LqyoXNAnbXxY1thOUP7Bbzn5Tcc1AexjGTILIVaW1ugvFG3vGM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=EMuKhDR2U7hdITRMtxB/Oej5BaFNrGc5hvurNCbd0H9fYPUePT20nmMki3NBZJPq657HruCk61mcjc92u/jpzZQ8RuphVoJtrLKhzBQT/CvM4E+FcNF5nfiW9ei7sZh0QH9smIIL1eDa9egvH4kK/9Z+4XVOjfUOcHV5JVm02qg= Received: by 10.67.90.19 with SMTP id s19mr3525671ugl.1178541626311; Mon, 07 May 2007 05:40:26 -0700 (PDT) Received: by 10.67.9.19 with HTTP; Mon, 7 May 2007 05:40:26 -0700 (PDT) Message-ID: <84144f020705070540tf3b1986yd4b1ab65e3a17d5e@mail.gmail.com> Date: Mon, 7 May 2007 15:40:26 +0300 From: "Pekka Enberg" To: "Amit K. 
Arora" Subject: Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070426181623.GE7209@amitarora.in.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181623.GE7209@amitarora.in.ibm.com> X-Google-Sender-Auth: 7ffddca7cb123766 X-archive-position: 11315 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: penberg@cs.helsinki.fi Precedence: bulk X-list: xfs On 4/26/07, Amit K. Arora wrote: > /* > + * ext4_ext_try_to_merge: > + * tries to merge the "ex" extent to the next extent in the tree. > + * It always tries to merge towards right. If you want to merge towards > + * left, pass "ex - 1" as argument instead of "ex". > + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns > + * 1 if they got merged. > + */ > +int ext4_ext_try_to_merge(struct inode *inode, > + struct ext4_ext_path *path, > + struct ext4_extent *ex) > +{ Please either use proper kerneldoc format or drop "ext4_ext_try_to_merge" from the comment. 
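As an illustration of the kerneldoc layout being asked for, here is a toy merge-towards-right routine with such a comment. It is a sketch only: struct demo_extent and demo_try_to_merge() are hypothetical, the function works on a plain array rather than the extent tree, and the merge condition (same state, logically and physically contiguous, combined length at most 0x7FFF) is an assumption, since ext4_can_extents_be_merged() is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified extent record. */
struct demo_extent {
	uint32_t lblock;	/* first logical block */
	uint64_t pblock;	/* first physical block */
	uint16_t len;		/* length in blocks */
	int uninit;		/* 1 if uninitialized */
};

/**
 * demo_try_to_merge - merge an extent with its right-hand neighbours
 * @ext: array of extents sorted by logical block
 * @count: number of valid entries in @ext; decremented for every merge
 * @idx: index of the extent to start merging from
 *
 * Merging only ever goes towards the right.  To merge towards the left,
 * pass @idx - 1 instead of @idx.
 *
 * Returns 1 if at least one merge was done, 0 otherwise.
 */
static int demo_try_to_merge(struct demo_extent *ext, size_t *count,
			     size_t idx)
{
	int merged = 0;

	while (idx + 1 < *count) {
		struct demo_extent *a = &ext[idx];
		struct demo_extent *b = &ext[idx + 1];

		/* Assumed merge condition, see the note above. */
		if (a->uninit != b->uninit ||
		    a->lblock + a->len != b->lblock ||
		    a->pblock + a->len != b->pblock ||
		    (uint32_t)a->len + b->len > 0x7FFF)
			break;

		a->len = (uint16_t)(a->len + b->len);

		/* Close the gap left by the extent that was absorbed. */
		memmove(b, b + 1, (*count - idx - 2) * sizeof(*ext));
		(*count)--;
		merged = 1;
	}

	return merged;
}
```

The comment opens with "/**", names the function with a one-line summary, and documents each parameter as "@name:", which is what the kerneldoc tooling picks up.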
From owner-xfs@oss.sgi.com Mon May 7 06:22:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 06:22:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47DMGfB013748 for ; Mon, 7 May 2007 06:22:17 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8pKs005754; Mon, 7 May 2007 09:08:51 -0400 Received: from lacrosse.corp.redhat.com (lacrosse.corp.redhat.com [172.16.52.154]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47D8otS020113; Mon, 7 May 2007 09:08:50 -0400 Received: from myware66.akkadia.org (vpn-14-5.rdu.redhat.com [10.11.14.5]) by lacrosse.corp.redhat.com (8.12.11.20060308/8.11.6) with ESMTP id l47D8med016906; Mon, 7 May 2007 09:08:49 -0400 Message-ID: <463F24DB.5040406@redhat.com> Date: Mon, 07 May 2007 06:08:43 -0700 From: Ulrich Drepper Organization: Red Hat, Inc. User-Agent: Thunderbird 2.0.0.0 (X11/20070419) MIME-Version: 1.0 To: Jakub Jelinek CC: Andrew Morton , David Chinner , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <20070504060731.GJ32602149@melbourne.sgi.com> <20070503232815.2f62a75e.akpm@linux-foundation.org> <20070504065626.GW355@devserv.devel.redhat.com> In-Reply-To: <20070504065626.GW355@devserv.devel.redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-archive-position: 11316 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: drepper@redhat.com Precedence: bulk X-list: xfs Jakub Jelinek wrote: > is what glibc does ATM. Seems we violate the case where len == 0, as > EINVAL in that case is "shall fail". But reading the standard to imply > negative len is ok is too much guessing, there is no word what it means > when len is negative and > "required storage for regular file data starting at offset and continuing for len bytes" > doesn't make sense for negative size. This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len<0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, CA ❖ From owner-xfs@oss.sgi.com Mon May 7 08:48:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 08:48:24 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47FmJfB009196 for ; Mon, 7 May 2007 08:48:21 -0700 Received: from e1.ny.us.ibm.com ([192.168.1.101]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOkpf023745 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 11:24:46 -0400 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l47FOffK009339 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l47FOfJk549588 for ; Mon, 7 May 2007 11:24:41 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l47FOewe027863 for ; Mon, 7 May 2007 11:24:40 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l47FOdo3027760; Mon, 7 May 2007 11:24:39 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070507120719.GD7012@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> Content-Type: text/plain Date: Mon, 07 May 2007 10:24:37 -0500 Message-Id: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11317 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > +{ > > > + handle_t *handle; > > > + ext4_fsblk_t block, max_blocks; > > > + int ret, ret2, nblocks = 0, retries = 0; > > > + struct buffer_head map_bh; > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > + > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > + if (mode != FA_ALLOCATE) > > > + return -EOPNOTSUPP; > > > + > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > + return -ENOTTY; > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > news. 
The changelog would be an appropriate place to communicate this, > > along with reasons why, or a description of the plan to fix it. > > Ok. Will add this in the function description as well. > > > Also, posix says nothing about fallocate() returning ENOTTY. > > Right. I don't seem to find any suitable error from posix description. > Can you please suggest an error code which might make more sense here ? > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Mon May 7 11:35:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:35:08 -0700 (PDT) Received: from tur.go2.pl (tur.go2.pl [193.17.41.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IZ0fB008787 for ; Mon, 7 May 2007 11:35:02 -0700 Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by tur.go2.pl (o2.pl Mailer 2.0.1) with ESMTP id CF0912349DA for ; Mon, 7 May 2007 20:04:28 +0200 (CEST) Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 07A2C58113 for ; Mon, 7 May 2007 20:04:26 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP for ; Mon, 7 May 2007 20:04:25 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: xfs@oss.sgi.com Subject: RESVSP problems Date: Mon, 7 May 2007 20:04:22 +0200 User-Agent: KMail/1.9.6 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072004.22848.lucke@o2.pl> 
X-archive-position: 11318 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs Hello, guys, I've been trying to implement RESVSP-based allocation in rtorrent. From the very beginning it has, alas, misbehaved, thus (also considering my very basic programming skills and experience and unfamiliarity with rtorrent's code) after hours of trying to determine what's wrong, I finally observed that blocks of files allocated with RESVSP (previously ftruncated to a proper size) and being downloaded in rtorrent don't have their unwritten flags removed (as confirmed by xfs_bmap -vp). In the effect downloaded file promptly corrupts (read: changes its md5sum). What is interesting, files RESVSP-allocated in ktorrent and then imported to rtorrent seem to download properly. Everything works properly with ALLOCSP (although I've noticed that while RESVSP worked with l_start = 0 and l_length = size, ALLOCSP worked with l_start = size and l_length = 0; is that intended?). I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself on "directly between file pages mapped to memory by the mmap() function and the network stack". I haven't been yet able to determine how it actually writes chunks to files (aforementioned lacks of skills, experience and familiarity). Perhaps it's somehow XFS's fault, hence my posting to this ML. Any help/suggestions would be appreciated. 
Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 11:46:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:46:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IkAfB010835 for ; Mon, 7 May 2007 11:46:12 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5aL021039; Mon, 7 May 2007 14:46:06 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik5VF013479; Mon, 7 May 2007 14:46:05 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l47Ik4LP014643; Mon, 7 May 2007 14:46:04 -0400 Message-ID: <463F7368.8090101@sandeen.net> Date: Mon, 07 May 2007 13:43:52 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: lucke@o2.pl CC: xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> In-Reply-To: <200705072004.22848.lucke@o2.pl> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-archive-position: 11319 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Łukasz Fibinger wrote: > Hello, guys, > > I've been trying to implement RESVSP-based allocation in rtorrent. From the > very beginning it has, alas, misbehaved, thus (also considering my very basic > programming skills and experience and unfamiliarity with rtorrent's code) > after hours of trying to determine what's wrong, I finally observed that > blocks of files allocated with RESVSP (previously ftruncated to a proper > size) and being downloaded in rtorrent don't have their unwritten flags > removed (as confirmed by xfs_bmap -vp). 
You've probably hit: http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 unwritten extents remain unwritten after mmap() modifies them Bug dchinner about it... ;-) > In the effect downloaded file > promptly corrupts (read: changes its md5sum). What is interesting, files > RESVSP-allocated in ktorrent and then imported to rtorrent seem to download > properly. > > Everything works properly with ALLOCSP (although I've noticed that while > RESVSP worked with l_start = 0 and l_length = size, ALLOCSP worked with > l_start = size and l_length = 0; is that intended?). yeah... ISTR that the arguments are funky. I can't remember if it's a bug or not. :) FWIW, allocsp just writes zeros to the file, so you could do it just as well from userspace w/ no fancy ioctls... ALLOCSP is a bit pointless if you ask me... though maybe someone knows why it's there :) -Eric > I'm not quite sure what's at fault here. Perhaps rtorrent, as it prides itself > on "directly between file pages mapped to memory by the mmap() function and > the network stack". I haven't been yet able to determine how it actually > writes chunks to files (aforementioned lacks of skills, experience and > familiarity). Perhaps it's somehow XFS's fault, hence my posting to this ML. > Any help/suggestions would be appreciated. 
> > Cheers, > > Luke > > From owner-xfs@oss.sgi.com Mon May 7 11:58:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 11:58:53 -0700 (PDT) Received: from poczta.o2.pl (mx12.go2.pl [193.17.41.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47IwjfB013062 for ; Mon, 7 May 2007 11:58:46 -0700 Received: from poczta.o2.pl (mx12 [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 1FBB83E81A6; Mon, 7 May 2007 20:58:37 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Mon, 7 May 2007 20:58:37 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: Eric Sandeen Subject: Re: RESVSP problems Date: Mon, 7 May 2007 20:58:32 +0200 User-Agent: KMail/1.9.6 References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> In-Reply-To: <463F7368.8090101@sandeen.net> Cc: xfs@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705072058.32679.lucke@o2.pl> X-archive-position: 11320 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs On Monday 07 of May 2007, you wrote: > You've probably hit: > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > unwritten extents remain unwritten after mmap() modifies them > > Bug dchinner about it... ;-) Dave, consider it a bugging from my humble self :-) > yeah... ISTR that the arguments are funky. I can't remember if it's a > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > is a bit pointless if you ask me... 
though maybe someone knows why it's > there :) Let me say that I have noticed that using ALLOCSP seems to create fewer extents than posix_fallocate/manual zeroing. Thanks for your answer. Incidentally, I'm really happy that XFS has been bestowed upon Linux users. Thanks for all your work, guys :-) Cheers, Luke From owner-xfs@oss.sgi.com Mon May 7 12:49:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 12:49:44 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47JnbfB023617 for ; Mon, 7 May 2007 12:49:39 -0700 Received: from localhost.adilger.int (dhcp215-19.nersc.gov [128.55.19.215]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EB3E17BA315; Mon, 7 May 2007 13:49:36 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 2DAA8406D; Mon, 7 May 2007 05:37:54 -0600 (MDT) Date: Mon, 7 May 2007 05:37:54 -0600 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507113753.GA5439@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070503213133.d1559f52.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11321 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 03, 2007 21:31 -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" wrote: > > + * ext4_fallocate: > > + * preallocate space for a file > > + * mode is for future use, e.g. for unallocating preallocated blocks etc. > > + */ > > This description is rather thin. What is the filesystem's actual behaviour > here? If the file is using extents then the implementation will do > . If the file is using bitmaps then we will do . > > But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > +{ > > + handle_t *handle; > > + ext4_fsblk_t block, max_blocks; > > + int ret, ret2, nblocks = 0, retries = 0; > > + struct buffer_head map_bh; > > + unsigned int credits, blkbits = inode->i_blkbits; > > + > > + /* Currently supporting (pre)allocate mode _only_ */ > > + if (mode != FA_ALLOCATE) > > + return -EOPNOTSUPP; > > + > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > + return -ENOTTY; > > So we don't implement fallocate on bitmap-based files! Well that's huge > news. The changelog would be an appropriate place to communicate this, > along with reasons why, or a description of the plan to fix it. > > Also, posix says nothing about fallocate() returning ENOTTY. I _think_ this is to convince glibc to do the zero-filling in userspace, but I'm not up on the API specifics. > > + block = offset >> blkbits; > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > + - block; > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > space, and that this disk space will require an arbitrary amount of > metadata, how can we work out how much journal space we'll be needing > without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Also, since @path is not being given then truncate_mutex is not needed. 
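The "340 extents/block * 128MB" figure above can be reproduced with a little arithmetic. The following is only a sketch under stated assumptions that the message itself does not spell out: a 4 KiB filesystem block, a 12-byte extent header and 12-byte extent entries in a leaf block, and at most 32768 blocks per extent (chosen because it matches the 128MB figure). The function names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Assumed on-disk sizes; not taken from the quoted patch. */
enum {
    EXTENT_HEADER_BYTES = 12,   /* assumed size of the per-block extent header */
    EXTENT_ENTRY_BYTES  = 12    /* assumed size of one extent entry */
};

/* Number of extent entries that fit in one leaf block of the given size. */
uint64_t extents_per_leaf(uint64_t block_size)
{
    return (block_size - EXTENT_HEADER_BYTES) / EXTENT_ENTRY_BYTES;
}

/* Bytes a single extent can map when it spans max_blocks filesystem blocks. */
uint64_t bytes_per_extent(uint64_t block_size, uint64_t max_blocks)
{
    return block_size * max_blocks;
}

/* Bytes that one full leaf block of maximum-length extents can map. */
uint64_t bytes_per_leaf(uint64_t block_size, uint64_t max_blocks)
{
    return extents_per_leaf(block_size) * bytes_per_extent(block_size, max_blocks);
}
```

With those assumptions, extents_per_leaf(4096) is 340 and bytes_per_extent(4096, 32768) is 128 MiB, so one leaf block maps 340 * 128 MiB = 43520 MiB, which is the roughly 43GB quoted above.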
> > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and > ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 13:58:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 13:58:46 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47KwffB028868 for ; Mon, 7 May 2007 13:58:42 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47KwQAQ005761 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 13:58:27 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47KwPPl005141; Mon, 7 May 2007 13:58:25 -0700 Date: Mon, 7 May 2007 13:58:25 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507135825.f8545a65.akpm@linux-foundation.org> In-Reply-To: <20070507113753.GA5439@schatzie.adilger.int> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11322 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger wrote: > > > + block = offset >> blkbits; > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > + - block; > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > space, and that this disk space will require an arbitrary amount of > > metadata, how can we work out how much journal space we'll be needing > > without at least looking at `len'? > > Good question. 
> > The uninitialized extent can cover up to 128MB with a single entry. > If @path isn't specified, then ext4_ext_calc_credits_for_insert() > function returns the maximum number of extents needed to insert a leaf, > including splitting all of the index blocks. That would allow up to 43GB > (340 extents/block * 128MB) to be preallocated, but it still needs to take > the size of the preallocation into account (adding 3 blocks per 43GB - a > leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() "succeeded" and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. Does the proposed implementation handle quotas correctly, btw? Has that been tested? Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. 
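The per-transaction loop sketched in the message above, and the partial-failure question it raises, can be modelled in plain userspace C. This is not kernel code: struct fake_fs, fake_do_fallocate() and chunked_fallocate() are hypothetical stand-ins invented for illustration, and the comments only mark where journal_start() and journal_stop() would sit in the real loop.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a filesystem with a fixed pool of free blocks. */
struct fake_fs {
    uint64_t free_blocks;    /* blocks still available */
    unsigned transactions;   /* how many start/stop pairs were used */
};

/*
 * Stand-in for do_fallocate(): allocates at most max_per_txn blocks,
 * fewer if the pool is nearly empty. Returns the number of blocks
 * allocated, or -ENOSPC when nothing is left.
 */
long fake_do_fallocate(struct fake_fs *fs, uint64_t want, uint64_t max_per_txn)
{
    uint64_t n = want < max_per_txn ? want : max_per_txn;

    if (fs->free_blocks == 0)
        return -ENOSPC;
    if (n > fs->free_blocks)
        n = fs->free_blocks;
    fs->free_blocks -= n;
    return (long)n;
}

/*
 * One bounded transaction per iteration. Returns 0 or a negative errno;
 * *done always reports how many blocks were allocated before returning,
 * so a caller can see (and undo) a partial allocation.
 */
int chunked_fallocate(struct fake_fs *fs, uint64_t len,
                      uint64_t max_per_txn, uint64_t *done)
{
    *done = 0;
    while (len > 0) {
        long got;

        fs->transactions++;                 /* journal_start() would go here */
        got = fake_do_fallocate(fs, len, max_per_txn);
        /* journal_stop() would go here */
        if (got < 0)
            return (int)got;
        len -= (uint64_t)got;
        *done += (uint64_t)got;
    }
    return 0;
}
```

In this model a request that runs out of space midway returns -ENOSPC with *done holding the blocks already consumed, which is exactly the "disk all filled up, failure returned" situation described above; what to do with that partial allocation is the open design question.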
From owner-xfs@oss.sgi.com Mon May 7 15:21:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:21:14 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47ML9fB005559 for ; Mon, 7 May 2007 15:21:10 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 99AB27BA306; Mon, 7 May 2007 16:21:08 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 6A5173F57; Mon, 7 May 2007 15:21:04 -0700 (PDT) Date: Mon, 7 May 2007 15:21:04 -0700 From: Andreas Dilger To: Andrew Morton Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507222103.GJ8181@schatzie.adilger.int> Mail-Followup-To: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11323 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 13:58 -0700, Andrew Morton wrote: > Final point: it's fairly disappointing that the present implementation is > ext4-only, and extent-only. I do think we should be aiming at an ext4 > bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
From owner-xfs@oss.sgi.com Mon May 7 15:39:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 15:39:43 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47MdcfB007821 for ; Mon, 7 May 2007 15:39:39 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47McuH6010334 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 15:38:58 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47Mcut9007194; Mon, 7 May 2007 15:38:56 -0700 Date: Mon, 7 May 2007 15:38:56 -0700 From: Andrew Morton To: Andreas Dilger Cc: "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507153856.d56a5133.akpm@linux-foundation.org> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11324 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: > > Final point: it's fairly disappointing that the present implementation is > > ext4-only, and extent-only. I do think we should be aiming at an ext4 > > bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and > ext4 with block-mapped files) the filesystem should return an error (e.g. > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. 
From owner-xfs@oss.sgi.com Mon May 7 16:31:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:31:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NVnfB016106 for ; Mon, 7 May 2007 16:31:52 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l47NVZ47012460 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 16:31:37 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l47NVZ6H008256; Mon, 7 May 2007 16:31:35 -0700 Date: Mon, 7 May 2007 16:31:35 -0700 From: Andrew Morton To: Theodore Tso Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507163135.cf455103.akpm@linux-foundation.org> In-Reply-To: <20070507231442.GA29907@thunk.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11325 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso wrote: > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > is that this is the only way to allocate space in the filesystem without > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > largely save us here, but kernel support might make it a bit better. > > Actually, the reservations code won't matter, since glibc will fall > back to its current behavior, which is it will do the preallocation by > explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even better in-kernel. But then, a smart application would bypass the glibc fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). > This will result in the same > layout as if we had done the persistent preallocation, but of course > it will mean the posix_fallocate() could potentially take a long time > if you're a PVR and you're reserving a gig or two for a two hour movie > at high quality. That seems suboptimal, granted, and ideally the > application should be warned about this before it calls > posix_fallocate().
On the other hand, it's what happens today, all > the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. > If we think applications programmers badly need to know in advance if > posix_fallocate() will be fast or slow, probably the right thing is to > define a new fpathconf() configuration option so they can query to see > whether a particular file will support a fast posix_fallocate(). I'm > not 100% convinced such complexity is really needed, but I'm willing > to be convinced.... what do folks think? > An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. From owner-xfs@oss.sgi.com Mon May 7 16:36:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:36:41 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NabfB016849 for ; Mon, 7 May 2007 16:36:38 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCrZ-00083r-GU; Mon, 07 May 2007 19:43:30 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCkg-0006Ub-B6; Mon, 07 May 2007 19:36:22 -0400 Date: Mon, 7 May 2007 19:36:22 -0400 From: Theodore Tso To: Jeff Garzik Cc: Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507233622.GB29907@thunk.org> Mail-Followup-To: Theodore Tso , Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11326 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >On May 07, 2007 13:58 -0700, Andrew Morton wrote: > >>Final point: it's fairly disappointing that the present implementation is > >>ext4-only, and extent-only. I do think we should be aiming at an ext4 > >>bitmap-based implementation and an ext3 implementation. > > > >Actually, this is a non-issue. The reason that it is handled for > >extent-only > >is that this is the only way to allocate space in the filesystem without > >doing the explicit zeroing. For other filesystems (including ext3 and > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. 
When reading from a block where the extent is marked as uninitialized, ext4 returns zeros, to avoid returning the uninitialized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitialized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by a factor of two (i.e., to 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted From owner-xfs@oss.sgi.com Mon May 7 16:44:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:44:13 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47Ni8fB017858 for ; Mon, 7 May 2007 16:44:09 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlCWZ-0007zC-Rg; Mon, 07 May 2007 19:21:48 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlCPi-0003Ie-6y; Mon, 07 May 2007 19:14:42 -0400 Date: Mon, 7 May 2007 19:14:42 -0400 From: Theodore Tso To: Andrew Morton Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070507231442.GA29907@thunk.org> Mail-Followup-To: Theodore Tso , Andrew Morton , Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070507153856.d56a5133.akpm@linux-foundation.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false X-archive-position: 11327 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tytso@mit.edu Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > Actually, this is a non-issue. The reason that it is handled for extent-only > > is that this is the only way to allocate space in the filesystem without > > doing the explicit zeroing. For other filesystems (including ext3 and > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > It can be a bit suboptimal from the layout POV. The reservations code will > largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. 
This will result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think application programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced.... what do folks think? - Ted From owner-xfs@oss.sgi.com Mon May 7 16:57:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 16:57:51 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l47NvkfB019596 for ; Mon, 7 May 2007 16:57:47 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlCDx-0000ze-FS; Mon, 07 May 2007 23:02:33 +0000 Message-ID: <463FB008.3080706@garzik.org> Date: Mon, 07 May 2007 19:02:32 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Andrew Morton , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> In-Reply-To: <20070507222103.GJ8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11328 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > On May 07, 2007 13:58 -0700, Andrew Morton wrote: >> Final point: it's fairly disappointing that the present implementation is >> ext4-only, and extent-only. I do think we should be aiming at an ext4 >> bitmap-based implementation and an ext3 implementation. > > Actually, this is a non-issue. The reason that it is handled for extent-only > is that this is the only way to allocate space in the filesystem without > doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. 
Jeff From owner-xfs@oss.sgi.com Mon May 7 17:16:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:16:15 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480GBfB022224 for ; Mon, 7 May 2007 17:16:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l480Fii3015034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 May 2007 17:15:46 -0700 Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l480FfvA009569; Mon, 7 May 2007 17:15:41 -0700 Date: Mon, 7 May 2007 17:15:41 -0700 From: Andrew Morton To: cmm@us.ibm.com Cc: Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-Id: <20070507171541.5370a36a.akpm@linux-foundation.org> In-Reply-To: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 
7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11329 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao wrote: > > + while (ret >= 0 && ret < max_blocks) { > > + block = block + ret; > > + max_blocks = max_blocks - ret; > > + ret = ext4_ext_get_blocks(handle, inode, block, > > + max_blocks, &map_bh, > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > + BUG_ON(!ret); > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > + nblocks = nblocks + ret; > > + } > > + > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + > > Now the interesting question is: what do we do if we get halfway through > > this loop and then run out of space? We could leave the disk all filled up > > and then return failure to the caller, but that's pretty poor behaviour, > > IMO. > > > The current code handles earlier ENOSPC by three times retries. After > that if we still run out of space, then it's propably right to notify > the caller there isn't much space left. > > We could extend the block reservation window size before the while loop > so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. 
So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. From owner-xfs@oss.sgi.com Mon May 7 17:20:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:20:05 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480K1fB022916 for ; Mon, 7 May 2007 17:20:02 -0700 Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800TK3031704 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 7 May 2007 20:00:30 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4800Rgk009304 for ; Mon, 7 May 2007 20:00:27 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4800RNf186946 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4800QhU006734 for ; Mon, 7 May 2007 18:00:27 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4800OlA006675; Mon, 7 May 2007 18:00:25 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:00:24 -0700 Message-Id: <1178582424.3933.39.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11330 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 05:37:54 -0600 > Andreas Dilger wrote: > > > > > + block = offset >> blkbits; > > > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > > > + - block; > > > > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > > > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL); > > > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); > > > > > > Now I'm mystified. Given that we're allocating an arbitrary amount of disk > > > space, and that this disk space will require an arbitrary amount of > > > metadata, how can we work out how much journal space we'll be needing > > > without at least looking at `len'? > > > > Good question. 
> >
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks. That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
>
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
>
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space. If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
>
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons. If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
>
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop:
>
>     while (len) {
>         journal_start();
>         len -= do_fallocate(len, ...);
>         journal_stop();
>     }
>

I agree. There is already a loop in Amit's current patch that calls
ext4_ext_get_blocks(), though. The question is how many credits ext4 should
ask for in each journal_start().

> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
....
> +    mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +    credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +    mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases we
may end up allocating several chunks of blocks (extents) for this single
preallocation request, when the fs is fragmented (or part of the
preallocation request is already fulfilled).

I think we should move this calculation inside the loop as well, and we
really do not need to grab the lock to calculate the credits if @path is
always NULL; all the function does is mathematics.

I can't think of any good way to estimate the total credits needed for this
whole preallocation request. I looked at ext4_get_block(), which the DIO code
uses to deal with large block allocations. The credit reservation is quite
weak there too. The DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32).

> +    handle=ext4_journal_start(inode, credits +
> +                    EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +    if (IS_ERR(handle))
> +        return PTR_ERR(handle);
> +retry:
> +    ret = 0;
> +    while (ret >= 0 && ret < max_blocks) {
> +        block = block + ret;
> +        max_blocks = max_blocks - ret;
> +        ret = ext4_ext_get_blocks(handle, inode, block,
> +                    max_blocks, &map_bh,
> +                    EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +        BUG_ON(!ret);
> +        if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +            && ((block + ret) > (i_size_read(inode) << blkbits)))
> +            nblocks = nblocks + ret;
> +    }
> +
> +    if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +        goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space? We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
>

The current code handles an earlier ENOSPC by retrying three times.
After that, if we still run out of space, then it's probably right to notify
the caller that there isn't much space left.

We could extend the block reservation window size before the while loop to
lower the chance of the allocation getting fragmented.

>
> Does the proposed implementation handle quotas correctly, btw? Has that
> been tested?
>

I think so. ext4_ext_get_blocks() will end up calling ext4_new_blocks() to do
the real block allocation; quota is handled there, and is therefore tested
already.

Mingming

From owner-xfs@oss.sgi.com Mon May 7 17:30:39 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:30:43 -0700 (PDT)
Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480UbfB024196 for ; Mon, 7 May 2007 17:30:38 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l480UWm5018735 for ; Mon, 7 May 2007 20:30:32 -0400
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480UWTX162186 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480UVng005071 for ; Mon, 7 May 2007 18:30:32 -0600
Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480UU96005046; Mon, 7 May 2007 18:30:30 -0600
Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4
From: Mingming Cao
Reply-To: cmm@us.ibm.com
To: Andrew Morton
Cc: Theodore Tso , Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507163135.cf455103.akpm@linux-foundation.org> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <20070507153856.d56a5133.akpm@linux-foundation.org> <20070507231442.GA29907@thunk.org> <20070507163135.cf455103.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:30:29 -0700 Message-Id: <1178584229.3933.60.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11331 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: > On Mon, 7 May 2007 19:14:42 -0400 > Theodore Tso wrote: > > > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: > > > > Actually, this is a non-issue. The reason that it is handled for extent-only > > > > is that this is the only way to allocate space in the filesystem without > > > > doing the explicit zeroing. For other filesystems (including ext3 and > > > > ext4 with block-mapped files) the filesystem should return an error (e.g. > > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. > > > > > > It can be a bit suboptimal from the layout POV. The reservations code will > > > largely save us here, but kernel support might make it a bit better. 
> >
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
>
> No! Reservations code is *critical* here. Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time. (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
>
> My point is that even though reservations save us, we could do even-better
> in-kernel.
>

In this case, since the number of blocks to preallocate (e.g. N=10GB) is
clear, we could improve the current reservation code to let callers
explicitly ask for a new window that has at least N free blocks for the
blocks to be preallocated (rather than just at least 1 free block). Before
ext4_fallocate() is called, the right reservation window size is set, with a
flag indicating "please spend time if needed to find a window that covers at
least N free blocks".

So for ext4 block-mapped files, later when glibc is doing the allocation and
zeroing, the ext4 block-mapped allocator will know to reserve the right
amount of free blocks before allocating and zeroing 10GB of space.

I am not sure whether this is worth the effort, though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
>
> > This will result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality. That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate(). On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> > A PVR implementor would take all this over and would do it themselves, for > sure. > > > If we think applications programmers badly need to know in advance if > > posix_fallocate() will be fast or slow, probably the right thing is to > > define a new fpathconf() configuration option so they can query to see > > whether a particular file will support a fast posix_fallocate(). I'm > > not 100% convinced such complexity is really needed, but I'm willing > > to be convinced.... what do folks think? > > > > An application could do sys_fallocate(one-byte) to work out whether it's > supported in-kernel, I guess. > From owner-xfs@oss.sgi.com Mon May 7 17:41:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:42:01 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l480ftfB025394 for ; Mon, 7 May 2007 17:41:56 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l480cMBk002369 for ; Mon, 7 May 2007 20:38:22 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l480fgcB102890 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l480ffZE025439 for ; Mon, 7 May 2007 18:41:42 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l480fePn025409; Mon, 7 May 2007 18:41:40 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070507171541.5370a36a.akpm@linux-foundation.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Mon, 07 May 2007 17:41:39 -0700 Message-Id: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11332 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: > On Mon, 07 May 2007 17:00:24 -0700 > Mingming Cao wrote: > > > > + while (ret >= 0 && ret < max_blocks) { > > > + block = block + ret; > > > + max_blocks = max_blocks - ret; > > > + ret = ext4_ext_get_blocks(handle, inode, block, > > > + max_blocks, &map_bh, > > > + EXT4_CREATE_UNINITIALIZED_EXT, 0); > > > + BUG_ON(!ret); > > > + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) > > > + && ((block + ret) > (i_size_read(inode) << blkbits))) > > > + nblocks = nblocks + ret; > > > + } > > > + > > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > > + goto retry; > > > + > > > Now the interesting question is: what do we do if we get halfway through > > > this loop and then run out of space? We could leave the disk all filled up > > > and then return failure to the caller, but that's pretty poor behaviour, > > > IMO. > > > > > The current code handles earlier ENOSPC by three times retries. After > > that if we still run out of space, then it's propably right to notify > > the caller there isn't much space left. 
> >
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
>
> yes, but my point is that the proposed behaviour is really quite bad.
>

I agree with your point; that's why I mentioned that it only helps with the
fragmentation issue, not the ENOSPC case.

> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
>

Not totally useless, I think. If only half of the space is preallocated
because we ran out of space, the application can decide whether that is good
enough to start using the preallocated space, or whether to wait for the fs
to have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.
>
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme which
> delayed allocation will need to introduce.
>
> Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before: whether it's okay to allow
partial preallocation. At the time, the feedback was that it's not much
different from the current zero-out preallocation behavior: people might
preallocate half-way and then later deal with ENOSPC.

We could check the total number of fs free blocks before preallocation
happens; if there isn't enough space left, there is no need to bother
preallocating. If there is enough free space, we could make a reservation
window that has at least N free blocks and mark it not stealable by other
files. So later we will not run into the ENOSPC error. The fs free blocks
count is just an estimate, though.
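For illustration only (this sketch is not from the thread): the userspace repair Andrew describes, truncating the file back after a failed preallocation, might look roughly like the following. It assumes the preallocation starts at the current end of file, so that everything past the old size is known to be newly added; prealloc_or_rollback is a hypothetical helper name, and only standard POSIX calls are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: extend fd by len bytes past its current size.
 * If the preallocation fails (e.g. with ENOSPC), truncate the file back
 * to its old size so no partially populated tail is left behind.
 * Returns 0 on success, or an errno value on failure.
 */
int prealloc_or_rollback(int fd, off_t len)
{
    struct stat st;
    off_t old_size;
    int err;

    if (fstat(fd, &st) != 0)
        return errno;
    old_size = st.st_size;

    /* posix_fallocate() returns the error number directly, not -1. */
    err = posix_fallocate(fd, old_size, len);
    if (err != 0) {
        /*
         * Safe only because the range began at the old EOF: nothing
         * beyond old_size can be live data belonging to this caller.
         */
        if (ftruncate(fd, old_size) != 0)
            return errno;
        return err;
    }
    return 0;
}
```

An application that can make use of a partial preallocation, as suggested above, would skip the ftruncate() step and inspect the resulting file size instead.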
Mingming

From owner-xfs@oss.sgi.com Mon May 7 17:59:39 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 17:59:41 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l480xYfB027310 for ; Mon, 7 May 2007 17:59:36 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA18644; Tue, 8 May 2007 10:59:27 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l480xOAf87607519; Tue, 8 May 2007 10:59:25 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l480xNoZ87644275; Tue, 8 May 2007 10:59:23 +1000 (AEST)
Date: Tue, 8 May 2007 10:59:23 +1000
From: David Chinner
To: =?iso-8859-1?Q?=C5=81ukasz?= Fibinger
Cc: Eric Sandeen , xfs@oss.sgi.com
Subject: Re: RESVSP problems
Message-ID: <20070508005923.GS77450368@melbourne.sgi.com>
References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <200705072058.32679.lucke@o2.pl>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11333
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Mon, May 07, 2007 at 08:58:32PM +0200, Łukasz Fibinger wrote:
> On Monday 07 of May 2007, you wrote:
> > You've probably hit:
> > http://oss.sgi.com/bugzilla/show_bug.cgi?id=418
> > unwritten extents remain unwritten after mmap() modifies them
> >
> > Bug dchinner about it...
;-) > > Dave, consider it a bugging from my humble self :-) Yeah, yeah ;) I'm waiting to see what happens with Nick's patches in .22 before going any further. If they are not merged into .22, then I think we should push the XFS specific fix in.... > > yeah... ISTR that the arguments are funky. I can't remember if it's a > > bug or not. :) FWIW, allocsp just writes zeros to the file, so you > > could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > > is a bit pointless if you ask me... though maybe someone knows why it's > > there :) > > Let me say that I have noticed that using ALLOCSP seems to create less extents > than posix_fallocate/manual zeroing. Yes, that's likely ;) There's work currently active to make posix_fallocate() do the same thing as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but that's a ways off yet... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 18:07:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:07:45 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4817efB028619 for ; Mon, 7 May 2007 18:07:42 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 7F4124E457A; Mon, 7 May 2007 19:07:38 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 1941F3F57; Mon, 7 May 2007 18:07:36 -0700 (PDT) Date: Mon, 7 May 2007 18:07:36 -0700 From: Andreas Dilger To: Jeff Garzik Cc: Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508010736.GO8181@schatzie.adilger.int> Mail-Followup-To: Jeff Garzik , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <463FB008.3080706@garzik.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11334 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 19:02 -0400, Jeff Garzik wrote: > Andreas Dilger wrote: > >Actually, this is a non-issue. The reason that it is handled for > >extent-only is that this is the only way to allocate space in the > >filesystem without doing the explicit zeroing. > > Precisely /how/ do you avoid the zeroing issue, for extents? > > If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, > otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. 
Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Mon May 7 18:26:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:26:10 -0700 (PDT) Received: from mail.dvmed.net (srv5.dvmed.net [207.36.208.214]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481Q5fB030567 for ; Mon, 7 May 2007 18:26:06 -0700 Received: from cpe-065-190-194-075.nc.res.rr.com ([65.190.194.75] helo=[10.10.10.10]) by mail.dvmed.net with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1HlESi-0001kC-7i; Tue, 08 May 2007 01:25:56 +0000 Message-ID: <463FD1A2.1020505@garzik.org> Date: Mon, 07 May 2007 21:25:54 -0400 From: Jeff Garzik User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeff Garzik , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org> <20070507222103.GJ8181@schatzie.adilger.int> <463FB008.3080706@garzik.org> <20070508010736.GO8181@schatzie.adilger.int> In-Reply-To: <20070508010736.GO8181@schatzie.adilger.int> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11335 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeff@garzik.org Precedence: bulk X-list: xfs Andreas Dilger wrote: > My comment was just that the extent doesn't have to be explicitly zero > filled on the disk, by virtue of the fact that the uninitialized flag > will cause reads to return zero. Agreed, thanks for the clarification. 
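For illustration only (not part of the original message): the guarantee discussed here is visible from userspace regardless of whether the filesystem marks extents uninitialized or glibc falls back to writing zeros. The minimal sketch below preallocates a fresh file with posix_fallocate() and checks that the range reads back as zeros; range_reads_as_zero and prealloc_and_check are hypothetical helper names, and the file is assumed to be newly created and empty.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the len bytes starting at off all read back as zero,
 * 0 if a non-zero byte is found, and -1 on a read error or short file.
 */
int range_reads_as_zero(int fd, off_t off, size_t len)
{
    unsigned char buf[4096];

    while (len > 0) {
        size_t want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t got = pread(fd, buf, want, off);
        ssize_t i;

        if (got <= 0)
            return -1;
        for (i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        off += got;
        len -= (size_t)got;
    }
    return 1;
}

/*
 * Preallocate len bytes at the start of an empty file, then verify that
 * the preallocated range reads as zeros. Returns the same values as
 * range_reads_as_zero(), or -1 if the preallocation itself fails.
 */
int prealloc_and_check(int fd, off_t len)
{
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;
    return range_reads_as_zero(fd, 0, (size_t)len);
}
```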
Jeff From owner-xfs@oss.sgi.com Mon May 7 18:43:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 18:43:59 -0700 (PDT) Received: from thunker.thunk.org (THUNK.ORG [69.25.196.29]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l481hrfB032410 for ; Mon, 7 May 2007 18:43:54 -0700 Received: from root (helo=candygram.thunk.org) by thunker.thunk.org with local-esmtps (tls_cipher TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50 #1 (Debian)) id 1HlEqi-0008QR-T3; Mon, 07 May 2007 21:50:45 -0400 Received: from tytso by candygram.thunk.org with local (Exim 4.63) (envelope-from ) id 1HlEjp-00009j-MT; Mon, 07 May 2007 21:43:37 -0400 Date: Mon, 7 May 2007 21:43:37 -0400 From: Theodore Tso To: Mingming Cao Cc: Andrew Morton , Andreas Dilger , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508014337.GA14072@thunk.org> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
X-SA-Exim-Connect-IP:
X-SA-Exim-Mail-From: tytso@thunk.org
X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false
X-archive-position: 11336
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tytso@mit.edu
Precedence: bulk
X-list: xfs

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will prevent
the obvious error case where someone tries to preallocate 10GB when there is
only 2GB left. But it won't help if there are multiple processes trying to
allocate blocks at the same time. On the other hand, that case is probably
relatively rare, and in that case, the filesystem was probably going to be
left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again. The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that it could simply release all
uninitialized extents back to the system.
The problem is distinguishing between the uninitialized extents that had
just been newly added, versus the ones that had been there from before. (On
the other hand, if the filesystem was completely full, releasing
uninitialized blocks wouldn't be the worst thing in the world to do,
although releasing previously fallocated blocks probably does violate the
principle of least surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window? When the filesystem is
almost full, the free extents are likely going to be scattered all over the
disk. The general principle of grabbing all of the extents and keeping them
in an in-memory data structure, and only then adding them to the extent
tree, would work, though; I'm just not sure we could do it using the
existing reservation window code, since it only supports a single
reservation window per file, yes?
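For illustration only (not part of the original message): the "check free blocks first" idea has a rough userspace analogue using fstatvfs(). The sketch below refuses a preallocation that obviously cannot fit; checked_prealloc is a hypothetical helper name. As noted above, the check is only an estimate and is racy: another process can consume the space between the check and the allocation, so a later ENOSPC is still possible.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/statvfs.h>
#include <sys/types.h>

/*
 * Hypothetical helper: preallocate len bytes at offset, but only after
 * checking that the filesystem reports enough free space.
 * Returns 0 on success, ENOSPC if the pre-check fails, or the error
 * number from fstatvfs() / posix_fallocate().
 */
int checked_prealloc(int fd, off_t offset, off_t len)
{
    struct statvfs vfs;
    unsigned long long avail;

    if (fstatvfs(fd, &vfs) != 0)
        return errno;

    /* f_bavail counts blocks available to unprivileged users,
     * in units of f_frsize bytes. */
    avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    if (len > 0 && (unsigned long long)len > avail)
        return ENOSPC;

    /* posix_fallocate() returns the error number directly. */
    return posix_fallocate(fd, offset, len);
}
```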
- Ted From owner-xfs@oss.sgi.com Mon May 7 22:03:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:03:13 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48539fB010772 for ; Mon, 7 May 2007 22:03:10 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 1C86018077E7E; Tue, 8 May 2007 00:03:08 -0500 (CDT) Message-ID: <4640048B.6070803@sandeen.net> Date: Tue, 08 May 2007 00:03:07 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: David Chinner CC: =?UTF-8?B?xYF1a2FzeiBGaWJpbmdlcg==?= , xfs@oss.sgi.com Subject: Re: RESVSP problems References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> In-Reply-To: <20070508005923.GS77450368@melbourne.sgi.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11337 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs David Chinner wrote: >>> yeah... ISTR that the arguments are funky. I can't remember if it's a >>> bug or not. :) FWIW, allocsp just writes zeros to the file, so you >>> could do it just as well from userspace w/ no fancy ioctls... ALLOCSP >>> is a bit pointless if you ask me... though maybe someone knows why it's >>> there :) >> Let me say that I have noticed that using ALLOCSP seems to create less extents >> than posix_fallocate/manual zeroing. > > Yes, that's likely ;) > > There's work currently active to make posix_fallocate() do the same thing > as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but > that's a ways off yet... 
Dave, doesn't ALLOCSP actually create actual zeroed space though? Pretty much as posix_fallocate from userspace does today, maybe with better allocation... And "smart stuff" would be *not* needing to write zeros.... i.e. what RESVSP does. -Eric From owner-xfs@oss.sgi.com Mon May 7 22:25:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 22:25:32 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l485PQfB014571 for ; Mon, 7 May 2007 22:25:28 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA24103; Tue, 8 May 2007 15:25:24 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l485PNAf83137079; Tue, 8 May 2007 15:25:23 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l485PLnZ87479850; Tue, 8 May 2007 15:25:21 +1000 (AEST) Date: Tue, 8 May 2007 15:25:21 +1000 From: David Chinner To: Eric Sandeen Cc: David Chinner , =?iso-8859-1?Q?=C5=81ukasz?= Fibinger , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070508052521.GH32602149@melbourne.sgi.com> References: <200705072004.22848.lucke@o2.pl> <463F7368.8090101@sandeen.net> <200705072058.32679.lucke@o2.pl> <20070508005923.GS77450368@melbourne.sgi.com> <4640048B.6070803@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4640048B.6070803@sandeen.net> User-Agent: Mutt/1.4.2.1i X-archive-position: 11338 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 12:03:07AM -0500, Eric Sandeen wrote: > David Chinner wrote: > > >>>yeah... ISTR that the arguments are funky. 
I can't remember if it's a > >>>bug or not. :) FWIW, allocsp just writes zeros to the file, so you > >>>could do it just as well from userspace w/ no fancy ioctls... ALLOCSP > >>>is a bit pointless if you ask me... though maybe someone knows why it's > >>>there :) > >>Let me say that I have noticed that using ALLOCSP seems to create less > >>extents than posix_fallocate/manual zeroing. > > > >Yes, that's likely ;) > > > >There's work currently active to make posix_fallocate() do the same thing > >as ALLOCSP (i.e. call into the filesystem and let it do smart stuff), but > >that's a ways off yet... > > Dave, doesn't ALLOCSP actually create actual zeroed space though? Ah, yes it does - I was sort of lumping allocsp/resvsp together as one there. > Pretty much as posix_fallocate from userspace does today, maybe with > better allocation... Better allocations and with no ENOSPC-after-partial-zeroing problems, either. > And "smart stuff" would be *not* needing to write > zeros.... i.e. what RESVSP does. Yup. I've implemented fallocate() with the equivalent of RESVSP. xfs_zero_eof() is smart enough to not try to zero unwritten extents so changing the filesize after preallocation is effectively a no-op ;) Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 7 23:51:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:51:37 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486pWfB030333 for ; Mon, 7 May 2007 23:51:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26143; Tue, 8 May 2007 16:51:28 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486pRAf87623879; Tue, 8 May 2007 16:51:27 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486pQsF87768021; Tue, 8 May 2007 16:51:26 +1000 (AEST) Date: Tue, 8 May 2007 16:51:26 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: <20070508065126.GK32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11339 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Back in 2.6.13, unwritten extent conversion was changed to be done via a workqueue because we can't do conversion in interrupt context (AIO issue). The problem was that this changed extent conversion to run asynchronously w.r.t. I/O completion. Under heavy load (e.g. 100 fsstress processes), a direct write into an unwritten extent can complete and return to userspace before the unwritten extent is converted. If that range of the file is then read immediately, it will return zeros - unwritten - instead of the data that was written and is present on disk.
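To make the symptom described above concrete, here is a minimal userspace sketch of the "preallocate, direct write, read straight back" sequence. It is an illustration under assumptions, not the QA test from this thread: the function name check_write_visible() is hypothetical, it preallocates with posix_fallocate() rather than XFS_IOC_RESVSP64, it assumes 4096-byte alignment is sufficient for direct I/O, and it checks the data by reading it back rather than by inspecting extent state with bmap. A single run on an idle system is unlikely to hit the race; the report above needed heavy concurrent load.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_DIRECT on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* not exposed here: fall back to buffered I/O */
#endif

/*
 * Hypothetical check: preallocate `len` bytes, write a pattern using
 * direct I/O where available, then read it straight back.
 * Returns 0 if the data read back matches, 1 if it does not (for
 * example stale zeros), and -1 if the check could not be set up.
 * `len` is assumed to be a non-zero multiple of 4096.
 */
int
check_write_visible(const char *path, size_t len)
{
	void *wbuf = NULL, *rbuf = NULL;
	int fd, ret = -1;

	fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0666);
	if (fd < 0 && errno == EINVAL)	/* filesystem without O_DIRECT */
		fd = open(path, O_RDWR | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (posix_memalign(&wbuf, 4096, len) || posix_memalign(&rbuf, 4096, len))
		goto out;
	memset(wbuf, 'Z', len);
	memset(rbuf, 0, len);

	if (posix_fallocate(fd, 0, (off_t)len) != 0)
		goto out;
	if (pwrite(fd, wbuf, len, 0) != (ssize_t)len)
		goto out;
	if (pread(fd, rbuf, len, 0) != (ssize_t)len)
		goto out;
	ret = memcmp(wbuf, rbuf, len) ? 1 : 0;
out:
	free(wbuf);
	free(rbuf);
	close(fd);
	return ret;
}
```

A return value of 1 would correspond to the failure mode described: the write has completed, yet the range still reads back as unwritten.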
A simple test case to show this is to run 100 fsstress processes, then loop doing prealloc, direct write, bmap; at some point during this time, the bmap will return an unwritten extent spanning a range that has already been written. The following patch fixes the synchronous direct I/O case by triggering a workqueue flush on detection of a sync direct I/O into an unwritten extent after queuing the conversion work. The other approach that could be taken is to simply do the conversion without passing it off to a work queue. Anyone have a preference on which would be the better method to choose? The patch below passes the QA test I wrote to exercise this bug. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-26 09:25:26.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-08 14:28:20.854616591 +1000 @@ -108,14 +108,19 @@ xfs_page_trace( /* * Schedule IO completion handling on a xfsdatad if this was - * the final hold on this ioend. + * the final hold on this ioend. If we are asked to wait, + * flush the workqueue.
*/ STATIC void xfs_finish_ioend( - xfs_ioend_t *ioend) + xfs_ioend_t *ioend, + int wait) { - if (atomic_dec_and_test(&ioend->io_remaining)) + if (atomic_dec_and_test(&ioend->io_remaining)) { queue_work(xfsdatad_workqueue, &ioend->io_work); + if (wait) + flush_workqueue(xfsdatad_workqueue); + } } /* @@ -334,7 +339,7 @@ xfs_end_bio( bio->bi_end_io = NULL; bio_put(bio); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); return 0; } @@ -470,7 +475,7 @@ xfs_submit_ioend( } if (bio) xfs_submit_ioend_bio(ioend, bio); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } while ((ioend = next) != NULL); } @@ -1408,6 +1413,13 @@ xfs_end_io_direct( * This is not necessary for synchronous direct I/O, but we do * it anyway to keep the code uniform and simpler. * + * Well, if only it were that simple. Because synchronous direct I/O + * requires extent conversion to occur *before* we return to userspace, + * we have to wait for extent conversion to complete. Look at the + * iocb that has been passed to use to determine if this is AIO or + * not. If it is synchronous, tell xfs_finish_ioend() to kick the + * workqueue and wait for it to complete. + * * The core direct I/O code might be changed to always call the * completion handler in the future, in which case all this can * go away. @@ -1415,9 +1427,9 @@ xfs_end_io_direct( ioend->io_offset = offset; ioend->io_size = size; if (ioend->io_type == IOMAP_READ) { - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } else if (private && size > 0) { - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, is_sync_kiocb(iocb) ? 1 : 0); } else { /* * A direct I/O write ioend starts it's life in unwritten @@ -1426,7 +1438,7 @@ xfs_end_io_direct( * handler. 
*/ INIT_WORK(&ioend->io_work, xfs_end_bio_written); - xfs_finish_ioend(ioend); + xfs_finish_ioend(ioend, 0); } /* From owner-xfs@oss.sgi.com Mon May 7 23:53:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 07 May 2007 23:53:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l486rYfB030943 for ; Mon, 7 May 2007 23:53:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA26229; Tue, 8 May 2007 16:53:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l486rSAf87718760; Tue, 8 May 2007 16:53:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l486rRnx87261508; Tue, 8 May 2007 16:53:27 +1000 (AEST) Date: Tue, 8 May 2007 16:53:27 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: XFSQA: unwritten extent conversion vs synchronous direct I/O Message-ID: <20070508065327.GL32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11340 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Test to exercise synchronous direct I/O into unwritten extents. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/167 | 65 ++++++++++++++++ xfstests/167.out | 3 xfstests/group | 1 xfstests/src/Makefile | 5 + xfstests/src/unwritten_sync.c | 167 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 240 insertions(+), 1 deletion(-) Index: xfs-cmds/xfstests/src/Makefile =================================================================== --- xfs-cmds.orig/xfstests/src/Makefile 2007-05-03 17:10:54.000000000 +1000 +++ xfs-cmds/xfstests/src/Makefile 2007-05-07 10:54:08.296322074 +1000 @@ -10,7 +10,7 @@ TARGETS = dirstress fill fill2 getpagesi mmapcat append_reader append_writer dirperf metaperf \ devzero feature alloc fault fstest t_access_root \ godown resvtest writemod makeextents itrash \ - multi_open_unlink dmiperf + multi_open_unlink dmiperf unwritten_sync LINUX_TARGETS = loggen xfsctl bstat t_mtab getdevicesize \ preallo_rw_pattern_reader preallo_rw_pattern_writer ftrunc trunc \ @@ -111,6 +111,9 @@ looptest: looptest.o locktest: locktest.o $(LINKTEST) +unwritten_sync: unwritten_sync.o + $(LINKTEST) + ifeq ($(PKG_PLATFORM),irix) fill2: fill2.o $(LINKTEST) -lgen Index: xfs-cmds/xfstests/src/unwritten_sync.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/src/unwritten_sync.c 2007-05-07 11:44:38.668980258 +1000 @@ -0,0 +1,167 @@ +#include +#include +#include +#include +#include +#include + +/* test thanks to judith@sgi.com */ + +#define IO_SIZE 1048576 + +void +print_getbmapx( + const char *pathname, + int fd, + int64_t start, + int64_t limit); + +int +main(int argc, char *argv[]) +{ + int i; + int fd; + char *buf; + struct dioattr dio; + xfs_flock64_t flock; + off_t offset; + char *file; + int loops; + + if(argc != 3) { + fprintf(stderr, "%s \n", argv[0]); + exit(1); + } + + errno = 0; + loops = strtoull(argv[1], NULL, 0); + if (errno) { + perror("strtoull"); + exit(errno); + } + file = argv[2]; + + while 
(loops-- > 0) { + sleep(1); + fd = open(file, O_RDWR|O_CREAT|O_DIRECT, 0666); + if (fd < 0) { + perror("open"); + exit(1); + } + if (xfsctl(file, fd, XFS_IOC_DIOINFO, &dio) < 0) { + perror("dioinfo"); + exit(1); + } + + if ((dio.d_miniosz > IO_SIZE) || (dio.d_maxiosz < IO_SIZE)) { + fprintf(stderr,"Test won't work. Sorry\n"); + exit(1); + } + buf = (char *)memalign(dio.d_mem , IO_SIZE); + if (buf == NULL) { + fprintf(stderr,"Can't get memory\n"); + exit(1); + } + memset(buf,'Z',IO_SIZE); + offset = 0; + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = IO_SIZE*21; + if (xfsctl(file, fd, XFS_IOC_RESVSP64, &flock) < 0) { + perror("xfsctl "); + exit(1); + } + for (i = 0; i < 21; i++) { + if (pwrite(fd, buf, IO_SIZE, offset) != IO_SIZE) { + perror("pwrite"); + exit(1); + } + offset += IO_SIZE; + } + + print_getbmapx(file, fd, 0, 0); + + flock.l_whence = 0; + flock.l_start= 0; + flock.l_len = 0; + xfsctl(file, fd, XFS_IOC_FREESP64, &flock); + print_getbmapx(file, fd, 0, 0); + close(fd); + } +} + + + +int +get_getbmapx( + const char *pathname, + int fd, + struct getbmapx *bmapx) +{ + int rc; + + rc = ioctl(fd, XFS_IOC_GETBMAPX, bmapx); + if (rc < 0) { + perror("xfs_ioc_getbmapx"); + exit(1); + } +} + +void +print_getbmapx( +const char *pathname, + int fd, + int64_t start, + int64_t limit) +{ + struct getbmapx bmapx[50]; + int array_size = sizeof(bmapx) / sizeof(bmapx[0]); + int x; + int foundone = 0; + int foundany = 0; + +again: + foundone = 0; + memset(bmapx, '\0', sizeof(bmapx)); + + bmapx[0].bmv_offset = start; + bmapx[0].bmv_length = -1; /* limit - start; */ + bmapx[0].bmv_count = array_size; + bmapx[0].bmv_entries = 0; /* no entries filled in yet */ + + bmapx[0].bmv_iflags = BMV_IF_PREALLOC; + + x = array_size; + for (;;) { + if (x > bmapx[0].bmv_entries) { + if (x != array_size) { + break; /* end of file */ + } + if (get_getbmapx(pathname, fd, bmapx) < 0) { + fprintf(stderr, "getbmapx failed\n"); + exit(1); + } + if (bmapx[0].bmv_entries == 0) { + break; + 
} + x = 1; /* back at first extent in buffer */ + } + if (bmapx[x].bmv_oflags & 1) { + fprintf(stderr, "FOUND ONE %lld %lld %x\n", + bmapx[x].bmv_offset, bmapx[x].bmv_length,bmapx[x].bmv_oflags); + foundone = 1; + foundany = 1; + } + x++; + } + if (foundone) { + sleep(1); + fprintf(stderr,"Repeat\n"); + goto again; + } + if (foundany) { + exit(1); + } +} + Index: xfs-cmds/xfstests/167 =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167 2007-05-07 16:02:58.993892587 +1000 @@ -0,0 +1,65 @@ +#! /bin/sh +# FSQA Test No. 167 +# +# unwritten extent conversion test +# +#----------------------------------------------------------------------- +# Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved. +#----------------------------------------------------------------------- +# +# creator +owner=dgc@sgi.com + +seq=`basename $0` +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +rm -f $seq.full +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + killall -q -TERM fsstress 2> /dev/null + _cleanup_testdir +} + +workout() +{ + procs=100 + nops=15000 + $FSSTRESS_PROG -d $SCRATCH_MNT -p $procs -n $nops $FSSTRESS_AVOID \ + >>$seq.full & + sleep 2 +} + +# get standard environment, filters and checks +. ./common.rc +. 
./common.filter + +# real QA test starts here +_supported_fs xfs +_supported_os Linux + +_setup_testdir +_require_scratch +_scratch_mkfs_xfs >/dev/null 2>&1 +_scratch_mount + +TEST_FILE=$SCRATCH_MNT/test_file +TEST_PROG=$here/src/unwritten_sync +LOOPS=100 + +echo "*** test unwritten extent conversion under heavy I/O" + +workout + +rm -f $TEST_FILE +$TEST_PROG $LOOPS $TEST_FILE +killall -q -TERM fsstress 2> /dev/null + +echo " *** test done" + +status=0 +exit Index: xfs-cmds/xfstests/167.out =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ xfs-cmds/xfstests/167.out 2007-05-07 11:46:46.560202917 +1000 @@ -0,0 +1,3 @@ +QA output created by 167 +*** test unwritten extent conversion under heavy I/O + *** test done Index: xfs-cmds/xfstests/group =================================================================== --- xfs-cmds.orig/xfstests/group 2007-04-23 16:22:06.000000000 +1000 +++ xfs-cmds/xfstests/group 2007-05-07 10:57:00.721817454 +1000 @@ -246,3 +246,4 @@ pattern ajones@sgi.com 164 rw pattern auto 165 rw pattern auto 166 rw metadata auto +167 rw metadata auto From owner-xfs@oss.sgi.com Tue May 8 01:08:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 01:08:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4888KfB019878 for ; Tue, 8 May 2007 01:08:22 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA27849; Tue, 8 May 2007 18:08:13 +1000 Date: Tue, 08 May 2007 18:11:37 +1000 From: Timothy Shimmin To: torvalds@linux-foundation.org cc: akpm@osdl.org, xfs@oss.sgi.com Subject: [GIT] XFS updates for 2.6.22 Message-ID: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11341 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Linus, Please pull from: git pull git://oss.sgi.com:8090/xfs/xfs-2.6 --Tim This will update the following files: fs/xfs/linux-2.6/mrlock.h | 12 +++ fs/xfs/linux-2.6/xfs_aops.c | 89 +++++++++++++++++++--- fs/xfs/linux-2.6/xfs_buf.c | 10 ++ fs/xfs/linux-2.6/xfs_buf.h | 3 + fs/xfs/linux-2.6/xfs_fs_subr.c | 21 +++-- fs/xfs/linux-2.6/xfs_fs_subr.h | 2 fs/xfs/linux-2.6/xfs_lrw.c | 163 +++++++++++++++++++++++----------------- fs/xfs/linux-2.6/xfs_vnode.h | 2 fs/xfs/quota/xfs_dquot.c | 3 - fs/xfs/quota/xfs_qm.c | 16 +++- fs/xfs/quota/xfs_qm_syscalls.c | 19 +++-- fs/xfs/quota/xfs_trans_dquot.c | 4 + fs/xfs/support/debug.c | 17 ---- fs/xfs/support/debug.h | 2 fs/xfs/xfs_alloc.c | 2 fs/xfs/xfs_attr.c | 12 +-- fs/xfs/xfs_attr_leaf.c | 2 fs/xfs/xfs_bmap.c | 28 +++---- fs/xfs/xfs_dfrag.c | 6 + fs/xfs/xfs_dir2_block.c | 14 +-- fs/xfs/xfs_dir2_data.c | 7 -- fs/xfs/xfs_dir2_data.h | 2 fs/xfs/xfs_dir2_leaf.c | 7 +- fs/xfs/xfs_dir2_node.c | 4 - fs/xfs/xfs_error.c | 2 fs/xfs/xfs_fsops.c | 4 - fs/xfs/xfs_iget.c | 15 ++-- fs/xfs/xfs_inode.c | 58 +++++++++++--- fs/xfs/xfs_inode.h | 65 ++++++++++++---- fs/xfs/xfs_iocore.c | 2 fs/xfs/xfs_iomap.c | 15 ++-- fs/xfs/xfs_iomap.h | 1 fs/xfs/xfs_log_recover.c | 15 +--- fs/xfs/xfs_mount.c | 2 fs/xfs/xfs_qmops.c | 2 fs/xfs/xfs_quota.h | 3 - fs/xfs/xfs_rename.c | 2 fs/xfs/xfs_rtalloc.c | 6 + fs/xfs/xfs_rw.c | 4 - fs/xfs/xfs_trans.c | 6 - fs/xfs/xfs_trans.h | 4 - fs/xfs/xfs_utils.c | 11 ++- fs/xfs/xfs_vfsops.c | 6 + fs/xfs/xfs_vnodeops.c | 125 ++++++++++++++++++------------- 44 files changed, 491 insertions(+), 304 deletions(-) through these commits: commit f7c66ce3f70d8417de0cfb481ca4e5430382ec5d Author: Lachlan McIlroy Date: Tue May 8 13:50:19 2007 +1000 [XFS] Add lockdep 
support for XFS SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 71dfd5a396d11512aa6c8ed0d35b268bc084bb9b Author: Lachlan McIlroy Date: Tue May 8 13:50:12 2007 +1000 [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 3a02ee1828915d6540b415a160344775e2a4f918 Author: Utako Kusaka Date: Tue May 8 13:50:06 2007 +1000 [XFS] Get rid of redundant "required" in msg. SGI-PV: 963466 SGI-Modid: xfs-linux-melb:xfs-kern:28416a Signed-off-by: Utako Kusaka Signed-off-by: Tim Shimmin Signed-off-by: Christoph Hellwig commit e6a0e9cdff79e1406e5653f759aaf9f59b7ce4c8 Author: Tim Shimmin Date: Tue May 8 13:49:59 2007 +1000 [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy commit f10bb2dad02a846966064a531ba6eec301bbb9e0 Author: Tim Shimmin Date: Tue May 8 13:49:53 2007 +1000 [XFS] Remove unused ilen variable and references. SGI-PV: 907752 SGI-Modid: xfs-linux-melb:xfs-kern:28344a Signed-off-by: Tim Shimmin Signed-off-by: Lachlan McIlroy Signed-off-by: Eric Sandeen commit ba87ea699ebd9dd577bf055ebc4a98200e337542 Author: Lachlan McIlroy Date: Tue May 8 13:49:46 2007 +1000 [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. 
Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit 2a32963130aec5e157b58ff7dfa3dfa1afdf7ca1 Author: Lachlan McIlroy Date: Tue May 8 13:49:39 2007 +1000 [XFS] Fix race condition in xfs_write(). This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. 
The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy Signed-off-by: David Chinner Signed-off-by: Tim Shimmin commit e6d29426bc8a5d07d0eebd0842fe0cf6ecc862cd Author: Kouta Ooizumi Date: Tue May 8 13:49:33 2007 +1000 [XFS] Fix uquota and oquota enforcement problems. When uquota and oquota (gquota/pquota) are enabled for accounting both are enforced if ether has enforcement active. Conditions: - Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled. - Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled. - The usage without enforce is reached at the soft limit. Problems: 1. "repquota" shows all grace time even if no enforcement. 2. we cannot make a file over a hard limits even if no enforcement. 
SGI-PV: 962291 SGI-Modid: xfs-linux-melb:xfs-kern:28272a Signed-off-by: Kouta Ooizumi Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit d3cf209476b72c83907a412b6708c5e498410aa7 Author: Lachlan McIlroy Date: Tue May 8 13:49:27 2007 +1000 [XFS] propogate return codes from flush routines This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy Signed-off-by: Stewart Smith Signed-off-by: Tim Shimmin commit 424ea91ba61c1cdc2dac68576c97030cbf47d84f Author: Donald Douwsma Date: Tue May 8 13:49:15 2007 +1000 [XFS] Fix quotaon syscall failures for group enforcement requests. xfs_qm_scall_quotaon was incorrectly failing requests to enable group quota enforcement. Fixes logic error in OQUOTA handling. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28227a Signed-off-by: Donald Douwsma Signed-off-by: Tim Shimmin commit 646d5bdab38c88f4b9088d4e517986a3f3b0edb9 Author: Donald Douwsma Date: Tue May 8 13:49:09 2007 +1000 [XFS] Invalidate quotacheck when mounting without a quota type. When quotas are mounted or remounted without a particular quota type the quota accounting for that type becomes invalid. Previously we were ignoring this leading to accounting errors. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28225a Signed-off-by: Donald Douwsma Signed-off-by: Utako Kusaka Signed-off-by: Vlad Apostolov Signed-off-by: Tim Shimmin commit e7a23a9b37c395a153a541d4c50e166eef6abe49 Author: Joe Perches Date: Tue May 8 13:49:03 2007 +1000 [XFS] reducing the number of random number functions. 
Patch provided by Joe Perches SGI-PV: 961696 SGI-Modid: xfs-linux-melb:xfs-kern:28209a Signed-off-by: Joe Perches Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit e9ed9d2240c71014a84043095af4465ffce61367 Author: Eric Sandeen Date: Tue May 8 13:48:56 2007 +1000 [XFS] remove more misc. unused args Patch provided by Eric Sandeen. SGI-PV: 961695 SGI-Modid: xfs-linux-melb:xfs-kern:28205a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit ef497f8a1eafe0447f0473940ff2e0f6c8519a14 Author: Eric Sandeen Date: Tue May 8 13:48:49 2007 +1000 [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. Patch provided by Eric Sandeen. SGI-PV: 961694 SGI-Modid: xfs-linux-melb:xfs-kern:28204a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin commit 1c72bf90037f32fc2b10e0a05dff2640abce8ee2 Author: Eric Sandeen Date: Tue May 8 13:48:42 2007 +1000 [XFS] The last argument "lsn" of xfs_trans_commit() is always called with NULL. Patch provided by Eric Sandeen. 
SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen Signed-off-by: Lachlan McIlroy Signed-off-by: Tim Shimmin From owner-xfs@oss.sgi.com Tue May 8 03:00:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:00:09 -0700 (PDT) Received: from atlas.informatik.uni-freiburg.de (atlas.informatik.uni-freiburg.de [132.230.150.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48A00fB010168 for ; Tue, 8 May 2007 03:00:01 -0700 Received: from login.informatik.uni-freiburg.de ([132.230.151.6]) by atlas.informatik.uni-freiburg.de with esmtps (TLSv1:DES-CBC3-SHA:168) (Exim 4.66) (envelope-from ) id 1HlLns-00067j-VF for xfs@oss.sgi.com; Tue, 08 May 2007 11:16:17 +0200 Received: from login.informatik.uni-freiburg.de (localhost [127.0.0.1]) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11) with ESMTP id l489GFYW008121 for ; Tue, 8 May 2007 11:16:15 +0200 (MEST) Received: (from zeisberg@localhost) by login.informatik.uni-freiburg.de (8.13.8+Sun/8.12.11/Submit) id l489GEui008120 for xfs@oss.sgi.com; Tue, 8 May 2007 11:16:14 +0200 (MEST) Date: Tue, 8 May 2007 11:16:14 +0200 From: Uwe =?iso-8859-1?Q?Kleine-K=F6nig?= To: xfs@oss.sgi.com Subject: Problems with XFS in a power failure Message-ID: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.14+cvs20070321 (2007-03-20) Organization: Universitaet Freiburg, Institut f. Informatik X-archive-position: 11342 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ukleinek@informatik.uni-freiburg.de Precedence: bulk X-list: xfs Hello, my machine suffered a power failure while doing a apt-get upgrade. This damaged several files. E.g. root@cepheus:~# xxd /var/lib/dpkg/info/myspell-en-us.postrm 0000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000040: 0000 0000 0000 0000 0000 0000 0000 .............. while the repaired file has a size of 78 (= 0x4e) bytes. Some other files got broken with random data. I checked with debsums -c and it reported: root@cepheus:~# debsums -c >=������2[j�b��nw�������� in md5sums for irssi-scripts: ����gV{ڛ� �N���L���Mg{����.�����`����ӈL���j$�kC1'� ��S� ���Ý�� debsums: invalid line (2) in md5sums for irssi-scripts: ����g�DPH�� ìˆï¿½ï¿½ï¿½}�]g�����N�ci�5�h �w�W{SZ��q��F_�sR�[���ie�A|��Sv��@�@��;�5�'#c��$��l%���� ��T���$�!d�B�y debsums: invalid line (3) in md5sums for irssi-scripts: �����-Ä‹E��yq�/7Ä‘>�������Ў������Vu����V �+É‹A�f��:%O��_l���}������}� ��1���ȴϘ��=?��&��������F���mT�trZ� ���1���enO%.�YN��=�k��@����\{8É”w�x����z��-P!g�j����QV9u������)�m���5�l�8l �Rk5�;M���R��� �fx��O gÑ��;����٠�HYfrc��9�����u�q���Ox߀`����~_�ƃ2"J;�Q$vl?�{�=V������ �[��\�d��n�!�UH��Y�D��j2I���*� [�c��G�������[��h*���������2A��m&����������ޥGЉ�;R�0��̦��� ... I don't know exactly, but I think the damaged file here was /var/lib/dpkg/info/irssi.md5sums and debsums repaired it!? In theory this should not happen with a journaled fs, does it? This is a 2.6.19.3 kernel, unfortunately tainted by madwifi. There was nothing logged in dmesg and/or syslog. root@cepheus:~# xfs_info /var meta-data=/dev/hda9 isize=256 agcount=8, agsize=91619 blks = sectsz=512 attr=0 data = bsize=4096 blocks=732952, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=2560, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 I have to shutdown that machine because of a power cut by my supplier, but probably I will use a boot cd to bring it up again ... 
Best regards Uwe -- Uwe Kleine-König http://www.google.com/search?q=1+degree+celsius+in+kelvin From owner-xfs@oss.sgi.com Tue May 8 03:25:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:25:56 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48APqfB015225 for ; Tue, 8 May 2007 03:25:53 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l48APnoq007956 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 03:25:50 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l48APkHi021147; Tue, 8 May 2007 03:25:47 -0700 Date: Tue, 8 May 2007 03:25:46 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508032546.0728ae95.akpm@linux-foundation.org> In-Reply-To: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11343 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > Please pull from: > git pull git://oss.sgi.com:8090/xfs/xfs-2.6 > I pull that regularly and it's always empty. Where did all this code suddenly come from? 
From owner-xfs@oss.sgi.com Tue May 8 03:52:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 03:52:49 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48AqhfB021631 for ; Tue, 8 May 2007 03:52:45 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48AqgVH002005 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Aqg6c518946 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Aqg30020671 for ; Tue, 8 May 2007 06:52:42 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48AqeM6020596; Tue, 8 May 2007 06:52:41 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id BDCF194C6E; Tue, 8 May 2007 16:22:48 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l48AqmIq011407; Tue, 8 May 2007 16:22:48 +0530 Date: Tue, 8 May 2007 16:22:47 +0530 From: "Amit K. 
Arora" To: Dave Kleikamp Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1178551477.12900.6.camel@kleikamp.austin.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11344 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. 
Arora" wrote: > > > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) > > > > +{ > > > > + handle_t *handle; > > > > + ext4_fsblk_t block, max_blocks; > > > > + int ret, ret2, nblocks = 0, retries = 0; > > > > + struct buffer_head map_bh; > > > > + unsigned int credits, blkbits = inode->i_blkbits; > > > > + > > > > + /* Currently supporting (pre)allocate mode _only_ */ > > > > + if (mode != FA_ALLOCATE) > > > > + return -EOPNOTSUPP; > > > > + > > > > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > > > > + return -ENOTTY; > > > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > news. The changelog would be an appropriate place to communicate this, > > > along with reasons why, or a description of the plan to fix it. > > > > Ok. Will add this in the function description as well. > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > Right. I don't seem to find any suitable error from posix description. > > Can you please suggest an error code which might make more sense here ? > > Will -ENOTSUPP be ok ? Since we want to say here that we don't support > > non-extent files. > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > here, and fall back to the current library code to do preallocation? > This way, the caller of fallocate() will never see this return code, so > it won't violate posix. You are right. But, we still need to "standardize" (and limit) the error codes which we should return from kernel when we want to fall back on the library implementation. The posix_fallocate() library function will have to look for a set of errors from fallocate() system call, upon receiving which it will do preallocation from user level; or else, it will return success/error-code returned by the system call to the user. 
I think we can make it fall back to the library implementation of fallocate whenever posix_fallocate() receives any of the following errors from the fallocate() system call: 1. ENOSYS 2. EOPNOTSUPP 3. ENOTTY (?) Now the question is - should we limit the set of errors for this purpose to just 1 & 2 above ? In that case I will need to change the error being returned here to -EOPNOTSUPP (from the current -ENOTTY). -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 8 06:28:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 06:29:00 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48DStfB001716 for ; Tue, 8 May 2007 06:28:57 -0700 Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id D1B2E18723; Tue, 8 May 2007 15:28:53 +0200 (CEST) Date: Tue, 8 May 2007 15:28:53 +0200 From: Emmanuel Florac To: Uwe =?ISO-8859-1?Q?Kleine-K=F6nig?= Cc: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508152853.1d387fea@galadriel.home> In-Reply-To: <20070508091613.GA5852@cepheus> References: <20070508091613.GA5852@cepheus> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l48DSvfB001742 X-archive-position: 11345 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs On Tue, 8 May 2007 11:16:14 +0200 you wrote: > In theory this should not happen with a journaled fs, should it? On the contrary, it's the expected behaviour. It's especially important to guarantee proper power when using journaling filesystems.
-- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Tue May 8 07:12:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:12:59 -0700 (PDT) Received: from amanpulo.fs3.ph (amanpulo.fs3.ph [72.51.42.241]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ECufB017002 for ; Tue, 8 May 2007 07:12:56 -0700 Received: from localhost (localhost [127.0.0.1]) by amanpulo.fs3.ph (Postfix) with ESMTP id 25E8E1E0D5967 for ; Tue, 8 May 2007 21:55:59 +0800 (PHT) Received: from amanpulo.fs3.ph ([127.0.0.1]) by localhost (amanpulo.fs3.ph [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 1NVf-U3wH6+s for ; Tue, 8 May 2007 21:55:56 +0800 (PHT) Received: from musang.fs3.ph (smtp01.globe.com.ph [203.177.91.252]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by amanpulo.fs3.ph (Postfix) with ESMTP id A97BD1E0D5953 for ; Tue, 8 May 2007 21:55:55 +0800 (PHT) Received: by musang.fs3.ph (Postfix, from userid 1000) id BD35A2017683; Tue, 8 May 2007 21:55:42 +0800 (PHT) Date: Tue, 8 May 2007 21:55:42 +0800 From: Federico Sevilla III To: xfs@oss.sgi.com Subject: Re: Problems with XFS in a power failure Message-ID: <20070508135542.GF5621@fs3.ph> Mail-Followup-To: xfs@oss.sgi.com References: <20070508091613.GA5852@cepheus> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070508091613.GA5852@cepheus> X-Personal-URL: http://jijo.free.net.ph User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11346 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jijo@fs3.ph Precedence: bulk X-list: xfs On Tue, May 08, 2007 at 11:16:14AM +0200, Uwe Kleine-König wrote: > my machine suffered a power failure while doing a apt-get upgrade. Uh-oh. 
You hit the (in)famous binary nulls "issue". You may want to read the FAQ entry: http://oss.sgi.com/projects/xfs/faq.html#nulls. -- Federico Sevilla III F S 3 Consulting Inc. http://www.fs3.ph From owner-xfs@oss.sgi.com Tue May 8 07:47:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 07:48:02 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48ElvfB024039 for ; Tue, 8 May 2007 07:47:58 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48EluvR013297 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48EluWL551990 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48ElujD011654 for ; Tue, 8 May 2007 10:47:56 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Eltlu011606; Tue, 8 May 2007 10:47:55 -0400 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Dave Kleikamp To: "Amit K. 
Arora" Cc: Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070508105247.GA1950@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507120719.GD7012@amitarora.in.ibm.com> <1178551477.12900.6.camel@kleikamp.austin.ibm.com> <20070508105247.GA1950@amitarora.in.ibm.com> Content-Type: text/plain Date: Tue, 08 May 2007 09:47:54 -0500 Message-Id: <1178635675.11344.10.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11347 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote: > On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: > > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: > > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: > > > > So we don't implement fallocate on bitmap-based files! Well that's huge > > > > news. The changelog would be an appropriate place to communicate this, > > > > along with reasons why, or a description of the plan to fix it. > > > > > > Ok. Will add this in the function description as well. > > > > > > > Also, posix says nothing about fallocate() returning ENOTTY. > > > > > > Right. I don't seem to find any suitable error from posix description. > > > Can you please suggest an error code which might make more sense here ? > > > Will -ENOTSUPP be ok ? 
Since we want to say here that we don't support > > > non-extent files. > > > > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned > > here, and fall back to the current library code to do preallocation? > > This way, the caller of fallocate() will never see this return code, so > > it won't violate posix. > > You are right. > > But, we still need to "standardize" (and limit) the error codes > which we should return from kernel when we want to fall back on the > library implementation. The posix_fallocate() library function will have > to look for a set of errors from fallocate() system call, upon receiving > which it will do preallocation from user level; or else, it will return > success/error-code returned by the system call to the user. > > I think we can make it fall back to library implementation of fallocate, > whenever posix_fallocate() receives any of the following errors from > fallocate() system call: > > 1. ENOSYS > 2. EOPNOTSUPP > 3. ENOTTY (?) > > Now the question is - should we limit the set of errors for this purpose > to just 1 & 2 above ? In that case I will need to change the error being > returned here to -EOPNOTSUPP (from current -ENOTTY). If you want my opinion, -EOPNOTSUPP is better than -ENOTTY. 
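[Illustrative sketch, not part of the thread.] The fallback rule under discussion can be written down as a small C predicate. This is only a sketch, assuming the set of "fall back to the library" errors is limited to items 1 and 2 of the quoted list (ENOSYS and EOPNOTSUPP), as suggested above. The name fallocate_should_fall_back is hypothetical; it does not come from the posted patches or from any libc.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (name and placement are assumptions, not taken
 * from the patch series or from glibc): decide whether an errno value
 * returned by the fallocate() system call means "preallocate from user
 * level instead" or "report this error to the caller".
 */
bool fallocate_should_fall_back(int err)
{
	switch (err) {
	case ENOSYS:		/* kernel has no fallocate() system call */
	case EOPNOTSUPP:	/* this file or filesystem cannot preallocate */
		return true;
	default:		/* e.g. ENOSPC, EBADF: pass through to the caller */
		return false;
	}
}
```

Under this assumption a posix_fallocate() wrapper would call the system call first, run its existing user-level preallocation only when the predicate returns true, and otherwise return the system call's result unchanged. With ENOTTY left out of the set, a filesystem returning -ENOTTY would have that error reported to the caller, which is the reason for switching the ext4 return value to -EOPNOTSUPP.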
Shaggy -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Tue May 8 09:53:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 09:53:09 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Gr2fB021897 for ; Tue, 8 May 2007 09:53:04 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id EA0D24E4557; Tue, 8 May 2007 10:53:00 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 28A3A3FB4; Tue, 8 May 2007 09:52:59 -0700 (PDT) Date: Tue, 8 May 2007 09:52:59 -0700 From: Andreas Dilger To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070508165259.GD6375@schatzie.adilger.int> Mail-Followup-To: Theodore Tso , Mingming Cao , Andrew Morton , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070508014337.GA14072@thunk.org> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11348 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 07, 2007 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that it could simply release all > uninitialized extents back to the system. The problem is distinguishing > between the uninitialized extents that had just been newly added, versus > the ones that had been there from before. (On the other hand, if the > filesystem was completely full, releasing uninitialized blocks wouldn't > be the worst thing in the world to do, although releasing previously > fallocated blocks probably does violate the principle of least > surprise, even if it's what the user would have wanted.) I tend to agree with this. Having fallocate() fill up the filesystem is exactly what the caller asked for. A write() that hits ENOSPC doesn't truncate off the whole write either, nor does "dd" delete the whole file when the filesystem is full.
Even checking the statfs() space before doing the fallocate() may be counter intuitive, since it will return ENOSPC but the filesystem will not actually be full. Some applications (e.g. database) may WANT to fill the filesystem and then get the actual file size back to avoid trusting statfs() because of metadata overhead (e.g. indirect blocks). One of the design goals for sys_fallocate() was to allow FA_DELALLOC to deallocate unwritten extents in a safe manner. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Tue May 8 10:46:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 10:46:15 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l48Hk9fB032642 for ; Tue, 8 May 2007 10:46:10 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l48Hk7kD027946 for ; Tue, 8 May 2007 13:46:07 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l48Hk5a2140062 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l48Hk44h029943 for ; Tue, 8 May 2007 11:46:05 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l48Hk2Vq029851; Tue, 8 May 2007 11:46:03 -0600 Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4 From: Mingming Cao Reply-To: cmm@us.ibm.com To: Theodore Tso Cc: Andrew Morton , Andreas Dilger , "Amit K. 
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070508014337.GA14072@thunk.org> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070507171541.5370a36a.akpm@linux-foundation.org> <1178584899.3933.73.camel@dyn9047017103.beaverton.ibm.com> <20070508014337.GA14072@thunk.org> Content-Type: text/plain Organization: IBM LTC Date: Tue, 08 May 2007 10:46:01 -0700 Message-Id: <1178646362.4135.17.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11349 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote: > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > We could check the total number of fs free blocks account before > > preallocation happens, if there isn't enough space left, there is no > > need to bother preallocating. > > Checking against the fs free blocks is a good idea, since it will > prevent the obvious error case where someone tries to preallocate 10GB > when there is only 2GB left. Thinking about it again, this check is useful when preallocating blocks at EOF. It's not much use when preallocating a range with holes: in that case 2GB of free space might be enough even if the application tries to preallocate a 10GB range. > But it won't help if there are multiple > processes trying to allocate blocks the same time. On the other hand, > that case is probably relatively rare, and in that case, the > filesystem was probably going to be left completely full in any case.
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: > > Userspace could presumably repair the mess in most situations by truncating > > the file back again. The kernel cannot do that because there might be live > > data in amongst there. > > Actually, the kernel could do it, in that it could simply release all > uninitialized extents back to the system. The problem is distinguishing > between the uninitialized extents that had just been newly added, versus > the ones that had been there from before. True. Since the new uninitialized extents can be merged with adjacent old uninitialized extents, there is no way to distinguish the just-added uninitialized extents from the merged ones. > (On the other hand, if the > filesystem was completely full, releasing uninitialized blocks wouldn't > be the worst thing in the world to do, although releasing previously > fallocated blocks probably does violate the principle of least > surprise, even if it's what the user would have wanted.) > > On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: > > If there is enough free space, we could make a reservation window that > > has at least N free blocks and mark it not stealable by other files. So > > later we will not run into the ENOSPC error. > > Could you really use a single reservation window? When the filesystem > is almost full, the free extents are likely going to be scattered all > over the disk. The general principle of grabbing all of the extents > and keeping them in an in-memory data structure, and only adding them > to the extent tree would work, though; I'm just not sure we could do > it using the existing reservation window code, since it only supports > a single reservation window per file, yes? > You are right. There is one reservation window per file (and there is a limit to the maximum window size).
So yeah this way it's not going to prevent ENOSPC for sure:( Mingming From owner-xfs@oss.sgi.com Tue May 8 18:06:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:06:36 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4916UfB010738 for ; Tue, 8 May 2007 18:06:32 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA24949; Wed, 9 May 2007 11:06:24 +1000 Date: Wed, 09 May 2007 11:09:51 +1000 From: Timothy Shimmin To: Andrew Morton cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-ID: In-Reply-To: <20070508032546.0728ae95.akpm@linux-foundation.org> References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11350 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Andrew, --On 8 May 2007 3:25:46 AM -0700 Andrew Morton wrote: > On Tue, 08 May 2007 18:11:37 +1000 Timothy Shimmin wrote: > >> Please pull from: >> git pull git://oss.sgi.com:8090/xfs/xfs-2.6 >> > > I pull that regularly and it's always empty. Where did all this > code suddenly come from? It came from our internal tree which is also mirrored in cvs on oss. Our internal tree gets updated (non-xfs) from mainline every so often and has latest kdb patches applied (and dmapi patches). The internal tree is where our changes are originated from before moving out to an absolutely ridiculous number of trees. 
I only update the git tree every so often (start of rc's, important fixes, when I remember:) for Linus. Should I be updating a git branch for you more often? Cheers, Tim. From owner-xfs@oss.sgi.com Tue May 8 18:44:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 18:44:16 -0700 (PDT) Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [65.172.181.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l491iCfB019017 for ; Tue, 8 May 2007 18:44:12 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp1.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l491iADi017404 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 8 May 2007 18:44:11 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l491i9Fx007238; Tue, 8 May 2007 18:44:09 -0700 Date: Tue, 8 May 2007 18:44:09 -0700 From: Andrew Morton To: Timothy Shimmin Cc: torvalds@linux-foundation.org, xfs@oss.sgi.com Subject: Re: [GIT] XFS updates for 2.6.22 Message-Id: <20070508184409.e6ad4c8b.akpm@linux-foundation.org> In-Reply-To: References: <82BEB52DDD1E753B4E8C2F81@timothy-shimmins-power-mac-g5.local> <20070508032546.0728ae95.akpm@linux-foundation.org> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.177 $ X-Scanned-By: MIMEDefang 2.53 on 65.172.181.25 X-archive-position: 11351 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Wed, 09 May 2007 11:09:51 +1000 Timothy Shimmin wrote: > Should I be updating a git branch for you more often? Only if you want it tested ;) Yes please. 
From owner-xfs@oss.sgi.com Tue May 8 22:59:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 08 May 2007 22:59:29 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l495xOfB008438 for ; Tue, 8 May 2007 22:59:26 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.195]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l495xLcV009966 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l495xL321403 for xfs@oss.sgi.com; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l495xLO26246 for ; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070509.145924.21802300 for ; Wed, 9 May 2007 14:59:24 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Wed May 09 14:59:24 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 07047AE4B3; Wed, 9 May 2007 14:59:21 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l495xKD3006884; Wed, 9 May 2007 14:59:20 +0900 Message-Id: <200705090559.AA05331@TNESG9305.tnes.nec.co.jp> Date: Wed, 09 May 2007 14:59:11 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_quota path command. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11352 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs

Hi,

In the path command of xfs_quota, when a list number is specified while the path list is empty, the range in the error message is incorrectly printed as 0 to -1. I think the message is unnecessary in this case, just as nothing is printed when no list number is specified.

Example:
# ./xfs_quota -x
xfs_quota> path
xfs_quota> path 0
value 0 is out of range (0--1)

Signed-off-by: Utako Kusaka
---

--- xfsprogs-2.8.20/quota/path.orig	2007-04-26 14:14:00.000000000 +0900
+++ xfsprogs-2.8.20/quota/path.c	2007-04-27 11:27:56.000000000 +0900
@@ -102,6 +102,9 @@ path_f(
 	if (argc <= 1)
 		return pathlist_f();
 
+	if (!fs_count)
+		return 0;
+
 	i = atoi(argv[1]);
 	if (i < 0 || i >= fs_count) {
 		printf(_("value %d is out of range (0-%d)\n"),

From owner-xfs@oss.sgi.com Wed May 9 03:52:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:52:33 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49AqPfB028412 for ; Wed, 9 May 2007 03:52:26 -0700 Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEJTa016628 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 9 May 2007 06:14:20 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49AECOf004909 for ; Wed, 9 May 2007 06:14:13 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49AECD9171760 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from
d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49AECFt015558 for ; Wed, 9 May 2007 04:14:12 -0600 Received: from qubit.in.ibm.com (wks184594wss.in.ibm.com [9.184.236.184]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49AEAHi014659; Wed, 9 May 2007 04:14:11 -0600 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id A8B2A67FFD; Wed, 9 May 2007 15:45:19 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l49AFDbE001436; Wed, 9 May 2007 15:45:13 +0530 Date: Wed, 9 May 2007 15:45:07 +0530 From: Suparna Bhattacharya To: Paul Mackerras Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509101507.GA26056@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17978.47502.786970.196554@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.11 X-archive-position: 11354 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote: > Andrew 
Morton writes: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" wrote: > > > > > This patch implements the fallocate() system call and adds support for > > > i386, x86_64 and powerpc. > > > > > > ... > > > > > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > > > Please add a comment over this function which specifies its behaviour. > > Really it should be enough material from which a full manpage can be > > written. > > This looks like it will have the same problem on s390 as > sys_sync_file_range. Maybe the prototype should be: > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2 http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2 Are there any better ideas ? Regards Suparna > > Paul. 
> - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Wed May 9 03:51:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 03:51:46 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49ApefB028095 for ; Wed, 9 May 2007 03:51:42 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id C15B7DDE44; Wed, 9 May 2007 20:51:39 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17985.42884.971318.859402@cargo.ozlabs.ibm.com> Date: Wed, 9 May 2007 20:50:44 +1000 From: Paul Mackerras To: suparna@in.ibm.com Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070509101507.GA26056@in.ibm.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11353 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs 
Suparna Bhattacharya writes: > > This looks like it will have the same problem on s390 as > > sys_sync_file_range. Maybe the prototype should be: > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > Yes, but the trouble is that there was a contrary viewpoint preferring that fd > first be maintained as a convention like other syscalls (see the following > posts) Of course the interface used by an application program would have the fd first. Glibc can do the translation. Paul. From owner-xfs@oss.sgi.com Wed May 9 04:08:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 04:08:58 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49B8rfB001993 for ; Wed, 9 May 2007 04:08:55 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49B8g6b005958 for ; Wed, 9 May 2007 07:08:42 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49B8g8B178256 for ; Wed, 9 May 2007 05:08:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49B8fSu020108 for ; Wed, 9 May 2007 05:08:42 -0600 Received: from qubit.in.ibm.com (wks184594wss.in.ibm.com [9.184.236.184]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49B8fFH020062; Wed, 9 May 2007 05:08:41 -0600 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id F2E5767FFD; Wed, 9 May 2007 16:40:17 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l49BAHY1024578; Wed, 9 May 2007 16:40:17 +0530 Date: Wed, 9 May 2007 16:40:11 +0530 From: Suparna Bhattacharya To: Paul Mackerras Cc: Andrew Morton , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509111011.GA21619@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17985.42884.971318.859402@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.11 X-archive-position: 11355 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > This looks like it will have the same problem on s390 as > > > sys_sync_file_range. Maybe the prototype should be: > > > > > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > > > Yes, but the trouble is that there was a contrary viewpoint preferring that fd > > first be maintained as a convention like other syscalls (see the following > > posts) > > Of course the interface used by an application program would have the > fd first. Glibc can do the translation. I think that was understood. Regards Suparna > > Paul. 
-- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Wed May 9 04:37:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 04:37:34 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49BbSfB013664 for ; Wed, 9 May 2007 04:37:29 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id EA6BDDDE1A; Wed, 9 May 2007 21:37:27 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> Date: Wed, 9 May 2007 21:37:22 +1000 From: Paul Mackerras To: suparna@in.ibm.com Cc: Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc In-Reply-To: <20070509111011.GA21619@in.ibm.com> References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-archive-position: 11356 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Suparna Bhattacharya writes: > > Of course the interface used by an application program would have the > > fd first. Glibc can do the translation. > > I think that was understood. 
OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). Paul. From owner-xfs@oss.sgi.com Wed May 9 05:05:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 05:06:03 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49C5wfB021602 for ; Wed, 9 May 2007 05:05:59 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49C5vv6004481 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49C5vn6531388 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49C5vNd025124 for ; Wed, 9 May 2007 08:05:57 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49C5umt024820; Wed, 9 May 2007 08:05:56 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5139793C0B; Wed, 9 May 2007 17:36:00 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l49C5xBc002968; Wed, 9 May 2007 17:35:59 +0530 Date: Wed, 9 May 2007 17:35:59 +0530 From: "Amit K. 
Arora" To: Paul Mackerras Cc: suparna@in.ibm.com, Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509120559.GA19430@amitarora.in.ibm.com> References: <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> <17985.45682.284634.969153@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11357 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > Of course the interface used by an application program would have the > > > fd first. Glibc can do the translation. > > > > I think that was understood. > > OK, then what does it matter what the glibc/kernel interface is, as > long as it works? > > It's only a minor point; the order of arguments can vary between > architectures if necessary, but it's nicer if they don't have to. > 32-bit powerpc will need to have the two int arguments adjacent in > order to avoid using more than 6 argument registers at the user/kernel > boundary, and s390 will need to avoid having a 64-bit argument last > (if I understand it correctly). You are right to say that. 
But, it may not be _that_ a minor point, especially for the arch which is getting affected. It has other implications like what Heiko noticed in his post below: http://lkml.org/lkml/2007/4/27/377 - implications like modifying glibc and *trace utilities for a particular arch. -- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 9 05:28:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 05:28:55 -0700 (PDT) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.248]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49CSmfB028386 for ; Wed, 9 May 2007 05:28:50 -0700 Received: by an-out-0708.google.com with SMTP id c25so33820ana for ; Wed, 09 May 2007 05:28:48 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=googlemail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=k82pNmCQG6xCtzvBGnA0DcnHTp3uogf8iXrFeu2ESrspAtjyuI5di7SXd8oVHD6YcfovSY0Cy+TEdHl51eX07BRjau7AZKfFVGpw9W7y77qimQUhMP2b/y1TXK/aAK9hiIz5EQqGaVAtcfl0erlPsKGkeEkzwi/ajHMuLybLOzI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=beta; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=a2VKPe4iK52E4mrndjj8ctLhkaJWneH5H3bKROdSv/dILYa6Il+YpwSYwjgNWXo8yMpsqv9m3mzVE7dcB55H+bm9QEqow1Aef2PRwGza9eo0XyC6lQL4WYeeflCjtZmrSVaXpmc66I5a2fCoiSavV49zRM9eb4UXNL4BTd/DSFs= Received: by 10.100.119.14 with SMTP id r14mr314081anc.1178712019528; Wed, 09 May 2007 05:00:19 -0700 (PDT) Received: by 10.100.44.6 with HTTP; Wed, 9 May 2007 05:00:19 -0700 (PDT) Message-ID: <6e0cfd1d0705090500u3423877u579ebace44100b77@mail.gmail.com> Date: Wed, 9 May 2007 14:00:19 +0200 From: "Martin Schwidefsky" To: "Paul Mackerras" Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Cc: suparna@in.ibm.com, "Andrew Morton" , 
"Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com In-Reply-To: <17985.45682.284634.969153@cargo.ozlabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20070418130600.GW5967@schatzie.adilger.int> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070503212955.b1b6443c.akpm@linux-foundation.org> <17978.47502.786970.196554@cargo.ozlabs.ibm.com> <20070509101507.GA26056@in.ibm.com> <17985.42884.971318.859402@cargo.ozlabs.ibm.com> <20070509111011.GA21619@in.ibm.com> <17985.45682.284634.969153@cargo.ozlabs.ibm.com> X-archive-position: 11358 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: schwidefsky@googlemail.com Precedence: bulk X-list: xfs On 5/9/07, Paul Mackerras wrote: > Suparna Bhattacharya writes: > > > > Of course the interface used by an application program would have the > > > fd first. Glibc can do the translation. > > > > I think that was understood. > > OK, then what does it matter what the glibc/kernel interface is, as > long as it works? > > It's only a minor point; the order of arguments can vary between > architectures if necessary, but it's nicer if they don't have to. > 32-bit powerpc will need to have the two int arguments adjacent in > order to avoid using more than 6 argument registers at the user/kernel > boundary, and s390 will need to avoid having a 64-bit argument last > (if I understand it correctly). Ah, almost but not quite the point. But I admit it is hard to understand.. The trouble started with the futex call which has been the first system call with 6 arguments. s390 supported only 5 arguments up to that point (%r2 - %r6). 
For futex we added a wrapper to the glibc that loaded the 6th argument to %r7. In entry.S we set up things so that %r7 gets stored to the kernel stack where normal C code expects the first overflow argument. This enabled us to use the standard futex system call with 6 arguments. fallocate now has an additional problem: the last argument is a 64-bit integer AND registers %r2-%r5 are already used. In this case the 64-bit number would have to be split into the high part in %r6 and the low part on the stack so that the glibc wrapper can load the low part to %r7. But the C compiler will skip %r6 and store the 64-bit number on the stack. If the order of the arguments is modified so that %r6 is assigned to a 32-bit argument, then the entry.S magic with %r7 would work. -- blue skies, Martin From owner-xfs@oss.sgi.com Wed May 9 09:01:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 09:01:23 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49G1IfB007772 for ; Wed, 9 May 2007 09:01:19 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l49FvoGr002654 for ; Wed, 9 May 2007 11:57:50 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49G1Aa6133268 for ; Wed, 9 May 2007 10:01:11 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49G19kR006076 for ; Wed, 9 May 2007 10:01:10 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l49G1835004970; Wed, 9 May 2007 10:01:09 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 
3EA4C29EB6E; Wed, 9 May 2007 21:31:03 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l49G12nc004261; Wed, 9 May 2007 21:31:02 +0530 Date: Wed, 9 May 2007 21:31:02 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509160102.GA30745@amitarora.in.ibm.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426180332.GA7209@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11359 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs I have the updated patches ready which take care of Andrew's comments. Will run some tests and post them soon. But, before submitting these patches, I think it will be better to finalize on certain things which might be worth some discussion here: 1) Should the file size change when preallocation is done beyond EOF ? - Andreas and Chris Wedgwood are in favor of not changing the file size in this case. I also tend to agree with them. Does anyone has an argument in favor of changing the filesize ? 
If not, I will remove the code which changes the filesize, before I resubmit the concerned ext4 patch. 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. 3) If above is true, the file size will need to be changed for "unallocation" when block holding the EOF gets unallocated. - If we do not "unallocate" normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. 4) Should we update mtime & ctime on a successful allocation/ unallocation ? - David Chinner raised this question in following post: http://lkml.org/lkml/2007/4/29/407 I think it makes sense to update the [mc]time for a successful preallocation/unallocation. Does anyone feel otherwise ? It will be interesting to know how XFS behaves currently. Does XFS update [mc]time for preallocation ? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Wed May 9 09:54:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 09:54:08 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49Gs4fB021251 for ; Wed, 9 May 2007 09:54:05 -0700 Received: from localhost.adilger.int (unknown [64.166.152.82]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 603794E4569; Wed, 9 May 2007 10:54:03 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 6891C3FB2; Wed, 9 May 2007 09:54:02 -0700 (PDT) Date: Wed, 9 May 2007 09:54:02 -0700 From: Andreas Dilger To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070509165402.GO6375@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11360 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 09, 2007 21:31 +0530, Amit K. Arora wrote: > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > of normal (non-preallocated) blocks (blocks allocated via > regular write/truncate operations) also (i.e. work as punch()) ? > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. I would only allow this on FA_ALLOCATE extents. 
That means it won't be possible to do this for filesystems that don't understand unwritten extents unless there are blocks allocated beyond EOF. > 3) If above is true, the file size will need to be changed > for "unallocation" when block holding the EOF gets unallocated. > - If we do not "unallocate" normal (non-preallocated) blocks and we > do not change the file size on preallocation, then this is a > non-issue. Not necessarily. That will just make the file sparse. If FA_ALLOCATE does not change the file size, why should FA_UNALLOCATE. > 4) Should we update mtime & ctime on a successfull allocation/ > unallocation ? I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the "this file was modified" point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Wed May 9 10:07:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 10:07:34 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49H7TfB023872 for ; Wed, 9 May 2007 10:07:31 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l49H7TOQ015642 for ; Wed, 9 May 2007 13:07:29 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l49H7SWv169794 for ; Wed, 9 May 2007 11:07:28 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l49H7RwY000330 for ; Wed, 9 May 2007 11:07:28 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id 
l49H7Qvq032702; Wed, 9 May 2007 11:07:26 -0600 Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc From: Mingming Cao Reply-To: cmm@us.ibm.com To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Content-Type: text/plain Organization: IBM LTC Date: Wed, 09 May 2007 10:07:25 -0700 Message-Id: <1178730446.3815.8.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11361 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? > - Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? 
> If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. > If we choose not to update the file size beyond EOF, then for filesystems without fallocate() support (ext2/3 currently), posix_fallocate() will take the hard way (zero-out) to do preallocation. Then we will get different behavior on filesystems without fallocate() support. It makes sense to be consistent, IMO. From my point of view, preallocation is just an efficient way of allocating blocks for files without zeroing them out; other than this, the new behavior should be consistent with the old way: file size update, mtime/ctime, ENOSPC etc. Mingming From owner-xfs@oss.sgi.com Wed May 9 16:16:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 16:16:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l49NGqfB008827 for ; Wed, 9 May 2007 16:16:54 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA26430; Thu, 10 May 2007 09:16:49 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l49NGlAf89734777; Thu, 10 May 2007 09:16:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l49NGibq89702380; Thu, 10 May 2007 09:16:44 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 09:16:43 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070509231643.GM85884050@sgi.com> References: <4642389E.4080804@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4642389E.4080804@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11362 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > I've had a couple of instances of a linux-2.6 mercurial repo getting > corrupted in some odd way this morning. It looks like files are being > truncated; not to size 0, but losing something off the end. > > This is on an xfs filesystem. I haven't had any crashes/oops, and I > don't think its the normal files getting filled with 0 problem. I saw > this before the most recent set of xfs updates, but it happened again > afterwards too. It looks like the latest XFS changes haven't been pulled yet, so it's not new code that is triggering this.... > Mercurial uses a strictly append-only model for updating its repo files, > but it looks like maybe an append operation didn't stick. > > I'm repulling a fresh copy of the repo; I'll be able to compare > before/after. Update: yep, definitely truncated: > > $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i > 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i > 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i > > also > 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i > 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ > > > where 00manifest.i~ is the broken one. The files are identical up to the > truncation point. Hmmm - that is bizarre. What is the output of xfs_bmap -vvp on each of those files? What happens to these files after they are downloaded? 
Does it only happen to append-only files or are other files affected as well? BTW, what's the 'xfs_info ' output for this filesystem? > The repo passed "hg verify" just after I pulled it, so this corruption > came about after a while. > > Hm, the other possibility is that nlinks is being misreported. When > cloning a repo, mercurial will generally hard-link files where possible, > and then break the link if it sees nlink > 1. If xfs is mis-reporting > the link count, then this will cause havok. Is that possible? Seems > unlikely, but it would also explain the symptoms. I just did a linking > clone with an older kernel, and the link count is as expected. I'd be surprised if it was a link count problem - that would cause all sorts of other problems as well.... > xfs_check passes without any output, which I presume is good. Yes, it means everything is ok. You only have to worry when xfs_check says something - it only brings bad news ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 16:59:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 16:59:48 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l49NxifB019372 for ; Wed, 9 May 2007 16:59:45 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 223212C8046; Wed, 9 May 2007 16:29:29 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 9922E2C803C; Wed, 9 May 2007 16:29:28 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 16:29:28 -0700 (PDT) Message-ID: <4642598E.3000607@goop.org> Date: Wed, 09 May 2007 16:30:22 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> In-Reply-To: <20070509231643.GM85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11363 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > >> I've had a couple of instances of a linux-2.6 mercurial repo getting >> corrupted in some odd way this morning. It looks like files are being >> truncated; not to size 0, but losing something off the end. >> >> This is on an xfs filesystem. I haven't had any crashes/oops, and I >> don't think its the normal files getting filled with 0 problem. I saw >> this before the most recent set of xfs updates, but it happened again >> afterwards too. >> > > It looks like the latest XFS changes haven't been pulled yet, so > it's not new code that is triggering this.... > A bunch of xfs changes appeared in git this morning, I thought. But all this first happened from a kernel compiled yesterday. >> Mercurial uses a strictly append-only model for updating its repo files, >> but it looks like maybe an append operation didn't stick. >> >> I'm repulling a fresh copy of the repo; I'll be able to compare >> before/after. 
Update: yep, definitely truncated: >> >> $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i >> 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i >> 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i >> >> also >> 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i >> 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ >> >> >> where 00manifest.i~ is the broken one. The files are identical up to the >> truncation point. >> > > Hmmm - that is bizarre. What is the output of xfs_bmap -vvp > on each of those files? > 00manifest.i~ is linux-2.6-broken/.hg/store/00manifest.i $ xfs_bmap -vvp linux-2.6/.hg/store/00manifest.i linux-2.6-broken/.hg/store/00manifest.i linux-2.6/.hg/store/00manifest.i: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..895]: 8135128..8136023 1 (270808..271703) 896 1: [896..1407]: 8207424..8207935 1 (343104..343615) 512 2: [1408..2047]: 8211520..8212159 1 (347200..347839) 640 3: [2048..3071]: 8212904..8213927 1 (348584..349607) 1024 4: [3072..4991]: 8215672..8217591 1 (351352..353271) 1920 5: [4992..6143]: 8344408..8345559 1 (480088..481239) 1152 6: [6144..6951]: 7930840..7931647 1 (66520..67327) 808 linux-2.6-broken/.hg/store/00manifest.i: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..383]: 27132064..27132447 3 (3539104..3539487) 384 1: [384..511]: 27132912..27133039 3 (3539952..3540079) 128 2: [512..895]: 27136216..27136599 3 (3543256..3543639) 384 3: [896..1151]: 27147816..27148071 3 (3554856..3555111) 256 4: [1152..1535]: 27148680..27149063 3 (3555720..3556103) 384 5: [1536..2175]: 27154152..27154791 3 (3561192..3561831) 640 6: [2176..3711]: 27158944..27160479 3 (3565984..3567519) 1536 7: [3712..4607]: 27161016..27161911 3 (3568056..3568951) 896 8: [4608..5247]: 27162880..27163519 3 (3569920..3570559) 640 9: [5248..5375]: 27164096..27164223 3 
(3571136..3571263) 128 10: [5376..5759]: 27165080..27165463 3 (3572120..3572503) 384 11: [5760..5887]: 27166664..27166791 3 (3573704..3573831) 128 12: [5888..6015]: 27171400..27171527 3 (3578440..3578567) 128 13: [6016..6399]: 27172904..27173287 3 (3579944..3580327) 384 14: [6400..6527]: 27173336..27173463 3 (3580376..3580503) 128 15: [6528..6911]: 27173784..27174167 3 (3580824..3581207) 384 16: [6912..6943]: 27174568..27174599 3 (3581608..3581639) 32 > what happens to these files after then are downloaded? Does it only > happen to append-only files or are other files affected as well? > I saw similar damage in another repo, but I was using the "mq" extension on that, which means the files are no longer append-only. I explicitly checked that repo was OK after I downloaded it. It became broken again after a while. It was as if the dirty inode data was dropped without being written to disk, so once it had to read back it got a stale file length. Or something like that - I'm just guessing. > BTW, what's the 'xfs_info ' output for this filesystem? 
> meta-data=/dev/vg00/homexfs isize=256 agcount=19, agsize=983040 blks = sectsz=512 attr=1 data = bsize=4096 blocks=18350080, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=7680, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 J From owner-xfs@oss.sgi.com Wed May 9 17:01:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:01:31 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A01OfB019947 for ; Wed, 9 May 2007 17:01:26 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA27818; Thu, 10 May 2007 10:01:22 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A01LAf88776765; Thu, 10 May 2007 10:01:21 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A01JGB89629843; Thu, 10 May 2007 10:01:19 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:01:19 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510000119.GO85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4642598E.3000607@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11364 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 04:30:22PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 02:09:50PM -0700, Jeremy Fitzhardinge wrote: > > > >> I've had a couple of instances of a linux-2.6 mercurial repo getting > >> corrupted in some odd way this morning. It looks like files are being > >> truncated; not to size 0, but losing something off the end. > >> > >> This is on an xfs filesystem. I haven't had any crashes/oops, and I > >> don't think its the normal files getting filled with 0 problem. I saw > >> this before the most recent set of xfs updates, but it happened again > >> afterwards too. > >> > > > > It looks like the latest XFS changes haven't been pulled yet, so > > it's not new code that is triggering this.... > > > > A bunch of xfs changes appeared in git this morning, I thought. But all > this first happened from a kernel compiled yesterday. Ah, yes so it did - damn browser caching.... > >> Mercurial uses a strictly append-only model for updating its repo files, > >> but it looks like maybe an append operation didn't stick. > >> > >> I'm repulling a fresh copy of the repo; I'll be able to compare > >> before/after. 
Update: yep, definitely truncated: > >> > >> $ ls -l .hg-new/store/data/_documentation/pi-futex.txt.i .hg-broken/store/data/_documentation/pi-futex.txt.i > >> 4 -rw-rw-r-- 1 jeremy jeremy 3309 May 9 09:43 .hg-broken/store/data/_documentation/pi-futex.txt.i > >> 4 -rw-rw-r-- 1 jeremy jeremy 3797 May 9 13:38 .hg-new/store/data/_documentation/pi-futex.txt.i > >> > >> also > >> 3476 -rw-rw-r-- 1 jeremy jeremy 3558208 May 9 13:55 00manifest.i > >> 3476 -rw-rw-r-- 1 jeremy jeremy 3555200 May 9 09:41 00manifest.i~ > >> > >> > >> where 00manifest.i~ is the broken one. The files are identical up to the > >> truncation point. > >> > > > > Hmmm - that is bizarre. What is the output of xfs_bmap -vvp > > on each of those files? > > > 00manifest.i~ is linux-2.6-broken/.hg/store/00manifest.i > > $ xfs_bmap -vvp linux-2.6/.hg/store/00manifest.i linux-2.6-broken/.hg/store/00manifest.i > linux-2.6/.hg/store/00manifest.i: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL ...... > 6: [6144..6951]: 7930840..7931647 1 (66520..67327) 808 > linux-2.6-broken/.hg/store/00manifest.i: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL ..... > 16: [6912..6943]: 27174568..27174599 3 (3581608..3581639) 32 Yeah, there's one extra filesystem block in the good case compared to the broken case. If that was once good, then something has had to truncate the file to remove that block.... > > what happens to these files after then are downloaded? Does it only > > happen to append-only files or are other files affected as well? > > > > I saw similar damage in another repo, but I was using the "mq" extension > on that, which means the files are no longer append-only. > > I explicitly checked that repo was OK after I downloaded it. It became > broken again after a while. > > It was as if the dirty inode data was dropped without being written to > disk, so once it had to read back it got a stale file length. Or > something like that - I'm just guessing. Seems very unlikely. 
Have you unmounted and mounted the filesystem (or rebooted or suspended) between the files being seen good and the files being seen bad? > > BTW, what's the 'xfs_info ' output for this filesystem? > > > > meta-data=/dev/vg00/homexfs isize=256 agcount=19, agsize=983040 blks > = sectsz=512 attr=1 > data = bsize=4096 blocks=18350080, imaxpct=25 > = sunit=0 swidth=0 blks, unwritten=1 > naming =version 2 bsize=4096 > log =internal bsize=4096 blocks=7680, version=1 > = sectsz=512 sunit=0 blks > realtime =none extsz=65536 blocks=0, rtextents=0 Ok, nothing unusual there. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 17:04:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:04:33 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4A04UfB021087 for ; Wed, 9 May 2007 17:04:30 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id A2C6B2C8046; Wed, 9 May 2007 17:03:43 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 11C732C803C; Wed, 9 May 2007 17:03:43 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 17:03:42 -0700 (PDT) Message-ID: <46426194.3040403@goop.org> Date: Wed, 09 May 2007 17:04:36 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> In-Reply-To: <20070510000119.GO85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11365 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Seems very unlikely. Have you unmounted and mounted the filesystem > (or rebooted or suspended) between the files being seen good and > the files being seen bad? > There was definitely a suspend-resume, and maybe a reboot. I'll try again later on. J From owner-xfs@oss.sgi.com Wed May 9 17:49:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:49:33 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A0nQfB029955 for ; Wed, 9 May 2007 17:49:29 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA28968; Thu, 10 May 2007 10:49:21 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A0nKAf89693614; Thu, 10 May 2007 10:49:20 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A0nIsn89739459; Thu, 10 May 2007 10:49:18 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:49:18 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510004918.GS85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46426194.3040403@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11366 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 05:04:36PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Seems very unlikely. Have you unmounted and mounted the filesystem > > (or rebooted or suspended) between the files being seen good and > > the files being seen bad? > > > > There was definitely a suspend-resume, and maybe a reboot. I'll try > again later on. Suspend-resume, eh? There's an immediate suspect. Can you test this specifically for us? i.e. download a known good file set, do some stuff, suspend, resume, then check the files? If it doesn't show up the first time, can you do it a few times just to rule it out? If suspend/resume does cause the problem, can you try again but this time please run 'xfs_freeze -f ' on the filesystem before suspend, and then 'xfs_freeze -u ' after the resume and see if the problem still occurs? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 17:54:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 17:54:07 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4A0s4fB031191 for ; Wed, 9 May 2007 17:54:05 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 0E7CE2C8048; Wed, 9 May 2007 17:53:18 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 0CC2D2C8043; Wed, 9 May 2007 17:53:17 -0700 (PDT) Received: from [192.168.1.25] (adsl-69-107-77-42.dsl.pltn13.pacbell.net [69.107.77.42]) by lurch.goop.org (Postfix) with ESMTP; Wed, 9 May 2007 17:53:16 -0700 (PDT) Message-ID: <46426D31.8070000@goop.org> Date: Wed, 09 May 2007 17:54:09 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> In-Reply-To: <20070510004918.GS85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11367 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Suspend-resume, eh? > > There's an immediate suspect. Can you test this specifically for us? > i.e. download a known good file set, do some stuff, suspend, resume, > then check the files? If it doesn't show up the first time, can > you do it a few times just to rule it out? 
> Well, I've been doing suspend-resume with xfs for a while without problems; the problems seem to be recent and easily repeatable. Which just means that it could be a new suspend-resume problem, of course. > If suspend/resume does cause the problem, can you try again but this > time please run 'xfs_freeze -f ' on the filesystem before > suspend, and then 'xfs_freeze -u ' after the resume and see if > the problem still occurs? OK, but I tend to find that xfs_freeze ends up locking up large parts of the system... (For example, I tried to do the xfs_freeze + lvm snapshot thing, but the lvm snapshot just blocked on the frozen filesystem until I unfroze it). But I'll try it out. Hm, is there some script I can stick it into? J From owner-xfs@oss.sgi.com Wed May 9 18:00:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 18:00:05 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A0xvfB000479 for ; Wed, 9 May 2007 17:59:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA29194; Thu, 10 May 2007 10:59:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A0xYAf89540312; Thu, 10 May 2007 10:59:35 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A0xQUX89504537; Thu, 10 May 2007 10:59:26 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 10:59:26 +1000 From: David Chinner To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070510005926.GT85884050@sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070509160102.GA30745@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11368 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? > - Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? > If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. I think there needs to be both. 
If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > of normal (non-preallocated) blocks (blocks allocated via > regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what I did for FA_UNALLOCATE as well. > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. > > 3) If above is true, the file size will need to be changed > for "unallocation" when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. > 4) Should we update mtime & ctime on a successfull allocation/ > unallocation ? > - David Chinner raised this question in following post: > http://lkml.org/lkml/2007/4/29/407 > I think it makes sense to update the [mc]time for a successfull > preallocation/unallocation. Does anyone feel otherwise ? > It will be interesting to know how XFS behaves currently. Does XFS > update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave.
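[Archive note, not part of the original message.] The file-size semantics discussed in the reply above can be observed from userspace with calls that already exist. What follows is only a minimal sketch under stated assumptions: it uses the standard posix_fallocate(), which POSIX defines to extend the file when the requested range runs past EOF, and ftruncate(), which changes the file size explicitly. It does not use the proposed FA_* modes or XFS_IOC_UNRESVSP. The helper name prealloc_then_truncate(), the scratch path under /tmp and the 8 KiB / 4 KiB sizes are hypothetical choices made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the current size of the open file, or -1 on error. */
static off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}

/*
 * Hypothetical helper (not from the thread): create a scratch file,
 * preallocate 8 KiB with posix_fallocate(), then shrink the file to
 * 4 KiB with ftruncate(). The size seen after each step is stored in
 * out[0..2]. Returns 0 on success, -1 on failure.
 */
int prealloc_then_truncate(off_t out[3])
{
    char path[] = "/tmp/prealloc-demo-XXXXXX";  /* assumed scratch location */
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                   /* the file goes away once fd is closed */

    out[0] = file_size(fd);         /* freshly created, empty file */

    /* posix_fallocate() returns an errno value directly, not -1. */
    if (posix_fallocate(fd, 0, 8192) != 0)
        goto done;
    out[1] = file_size(fd);         /* size was extended to cover the range */

    /* An explicit size change is a separate, deliberate step. */
    if (ftruncate(fd, 4096) != 0)
        goto done;
    out[2] = file_size(fd);

    rc = 0;
done:
    close(fd);
    return rc;
}
```

A caller would pass an array of three off_t values and compare them after the call. Whether a filesystem-level preallocation interface should also move EOF is exactly the open question in this thread, so the sketch only shows what the existing POSIX calls do.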
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 18:26:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 18:26:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A1QGfB005302 for ; Wed, 9 May 2007 18:26:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA29908; Thu, 10 May 2007 11:26:14 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A1QCAf89698910; Thu, 10 May 2007 11:26:12 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A1Q9Ad89731643; Thu, 10 May 2007 11:26:09 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 11:26:09 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? Message-ID: <20070510012609.GU85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46426D31.8070000@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11369 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Suspend-resume, eh? 
> > > > There's an immediate suspect. Can you test this specifically for us? > > i.e. download a known good file set, do some stuff, suspend, resume, > > then check the files? If it doesn't show up the first time, can > > you do it a few times just to rule it out? > > Well, I've been doing suspend-resume with xfs for a while without > problems; the problems seem to be recent and easily repeatable. Which > just means that it could be a new suspend-resume problem, of course. Ok. I'm just trying to find a relatively simple test case for the problem - seeing as you seem to be able to reliably reproduce this we should be able to work out the trigger... > > If suspend/resume does cause the problem, can you try again but this > > time please run 'xfs_freeze -f ' on the filesystem before > > suspend, and then 'xfs_freeze -u ' after the resume and see if > > the problem still occurs? > > OK, but I tend to find that xfs_freeze ends up locking up large parts of > the system... (For example, I tried to do the xfs_freeze + lvm snapshot > thing, but the lvm snapshot just blocked on the frozen filesystem until > I unfroze it). Yes, because LVM snapshot freezes the filesystem for you - if you've already frozen the filesystem the snapshot will block until you unfreeze it and then it will freeze it itself to take the snapshot. > But I'll try it out. Hm, is there some script I can > stick it into? No idea..... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 9 23:07:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 23:07:42 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A67afB002726 for ; Wed, 9 May 2007 23:07:37 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA05849; Thu, 10 May 2007 16:07:30 +1000 Date: Thu, 10 May 2007 16:11:01 +1000 From: Timothy Shimmin To: David Chinner , xfs-dev cc: xfs-oss Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: In-Reply-To: <20070508065126.GK32602149@melbourne.sgi.com> References: <20070508065126.GK32602149@melbourne.sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11370 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Dave, --On 8 May 2007 4:51:26 PM +1000 David Chinner wrote: > > Back in 2.6.13, unwritten extent conversion was changed to be done > via a workqueue because we can't do conversion in interrupt context > (AIO issue). The problem was that the changes extent conversion to > run asynchronously w.r.t I/o completion. Oh ok, and at the same time they used the workqueue also (apart from AIO) for synchronous direct writes even though they didn't have to. i.e the existing comment: * This is not necessary for synchronous direct I/O, but we do * it anyway to keep the code uniform and simpler. 
So you were tossing up whether to flush the queue as in the patch given or to effectively call the code of xfs_end_bio_unwritten to do the unwritten extent conversion straight away. Hmmm....I dunno :) Does it matter? What are the pros and cons? :) Does it matter if we flush the whole queue now or later? Is it nicer/simpler for this to always happen in the queue? Is it a bit silly to queue and immediately flush? * Possible typo in comment: s/passed to use to determine/passed to us to determine/ * Don't really need the "? 1 : 0" is_sync_kiocb(iocb) ? 1 : 0 => is_sync_kiocb(iocb) --Tim > > Under heavy load (e.g. 100 fsstress processes), a direct write into > an unwritten extent can complete and return to userspace before > the unwritten extent is converted. If that range of the file is > then read immediately, it will return zeros - unwritten - instead > of the data that was written and is present on disk. > > A simpl etest case to show this is to run 100 fsstress processes, > the loop doing: > > prealloc > direct write > bmap > > and at some point during this time, the bmap will return an > unwritten extent spanning a range that has already been written. > > The following patch fixes the synchronous direct I/O by triggering > a workqueue flush on detection of a sync direct I/O into an > unwritten extent after queuing the conversion work. The other > approach that could be taken is to simply do the conversion > without passing it off to a work queue. Anyone have a preference > on which would be the better method to choose? > > The patch below passes the QA test I wrote to exercise this > bug. > > Comments? > > Cheers, > > Dave. 
> -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > > > --- > fs/xfs/linux-2.6/xfs_aops.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c > =================================================================== > --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-26 09:25:26.000000000 +1000 > +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-08 14:28:20.854616591 +1000 > @@ -108,14 +108,19 @@ xfs_page_trace( > > /* > * Schedule IO completion handling on a xfsdatad if this was > - * the final hold on this ioend. > + * the final hold on this ioend. If we are asked to wait, > + * flush the workqueue. > */ > STATIC void > xfs_finish_ioend( > - xfs_ioend_t *ioend) > + xfs_ioend_t *ioend, > + int wait) > { > - if (atomic_dec_and_test(&ioend->io_remaining)) > + if (atomic_dec_and_test(&ioend->io_remaining)) { > queue_work(xfsdatad_workqueue, &ioend->io_work); > + if (wait) > + flush_workqueue(xfsdatad_workqueue); > + } > } > > /* > @@ -334,7 +339,7 @@ xfs_end_bio( > bio->bi_end_io = NULL; > bio_put(bio); > > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > return 0; > } > > @@ -470,7 +475,7 @@ xfs_submit_ioend( > } > if (bio) > xfs_submit_ioend_bio(ioend, bio); > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > } while ((ioend = next) != NULL); > } > > @@ -1408,6 +1413,13 @@ xfs_end_io_direct( > * This is not necessary for synchronous direct I/O, but we do > * it anyway to keep the code uniform and simpler. > * > + * Well, if only it were that simple. Because synchronous direct I/O > + * requires extent conversion to occur *before* we return to userspace, > + * we have to wait for extent conversion to complete. Look at the > + * iocb that has been passed to use to determine if this is AIO or > + * not. If it is synchronous, tell xfs_finish_ioend() to kick the > + * workqueue and wait for it to complete. 
> + * > * The core direct I/O code might be changed to always call the > * completion handler in the future, in which case all this can > * go away. > @@ -1415,9 +1427,9 @@ xfs_end_io_direct( > ioend->io_offset = offset; > ioend->io_size = size; > if (ioend->io_type == IOMAP_READ) { > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > } else if (private && size > 0) { > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, is_sync_kiocb(iocb) ? 1 : 0); > } else { > /* > * A direct I/O write ioend starts it's life in unwritten > @@ -1426,7 +1438,7 @@ xfs_end_io_direct( > * handler. > */ > INIT_WORK(&ioend->io_work, xfs_end_bio_written); > - xfs_finish_ioend(ioend); > + xfs_finish_ioend(ioend, 0); > } > > /* From owner-xfs@oss.sgi.com Wed May 9 23:51:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 09 May 2007 23:52:02 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A6pufB013695 for ; Wed, 9 May 2007 23:51:57 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA06774; Thu, 10 May 2007 16:51:55 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4A6psAf89700828; Thu, 10 May 2007 16:51:54 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4A6prj689784308; Thu, 10 May 2007 16:51:53 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 10 May 2007 16:51:53 +1000 From: David Chinner To: Timothy Shimmin Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: <20070510065153.GY85884050@sgi.com> References: <20070508065126.GK32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: 
text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-archive-position: 11371 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 04:11:01PM +1000, Timothy Shimmin wrote: > Hi Dave, > > --On 8 May 2007 4:51:26 PM +1000 David Chinner wrote: > > > > >Back in 2.6.13, unwritten extent conversion was changed to be done > >via a workqueue because we can't do conversion in interrupt context > >(AIO issue). The problem was that the changes extent conversion to > >run asynchronously w.r.t I/o completion. > > Oh ok, and at the same time they used the workqueue also (apart > from AIO) for synchronous direct writes even though they didn't have to. > i.e the existing comment: > * This is not necessary for synchronous direct I/O, but we do > * it anyway to keep the code uniform and simpler. Yes, exactly. > So you were tossing up whether to flush the queue as in the patch given > or to effectively call the code of xfs_end_bio_unwritten to > do the unwritten extent conversion straight away. > Hmmm....I dunno :) > Does it matter? What are the pros and cons? :) I think with async buffered writes we are doing I/O completion in IRQ context as well so it seems to me that we have to push the unwritten extent conversion off to a workqueue in that case. I don't think there's any great overhead from flushing only when we are doing sync dio writes - all that calling xfs_end_bio_unwritten() directly saves us is a couple of context switches. However, that could promote I/o completion ahead of other I/Os waiting in the workqueue.... I think I'm convincing myself that the workqueue flush is the correct thing to do here ;) > Does it matter if we flush the whole queue now or later? We have to wait for it to complete, and that's what the flush does; it waits for the queued work up to the flush entrance sequence to complete. 
It's really the only way we can wait for a specific item in a workqueue to be run. So yes, it needs to be run now, not later. > Is it nicer/simpler for this to always happen in the queue? I think so. > Is it a bit silly to queue and immediately flush? I think that's the way you're supposed to do things ;) > * Possible typo in comment: > s/passed to use to determine/passed to us to determine/ > > * Don't really need the "? 1 : 0" > is_sync_kiocb(iocb) ? 1 : 0 > => > is_sync_kiocb(iocb) Right - I'll fix that. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 00:22:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 00:22:08 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4A7M2fB023732 for ; Thu, 10 May 2007 00:22:04 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA07415; Thu, 10 May 2007 17:21:56 +1000 Date: Thu, 10 May 2007 17:25:28 +1000 From: Timothy Shimmin To: David Chinner cc: xfs-dev , xfs-oss Subject: Re: Review: unwritten extent conversion vs synchronous direct I/O Message-ID: <0072C73B201FC2AD6A7E01D0@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070510065153.GY85884050@sgi.com> References: <20070508065126.GK32602149@melbourne.sgi.com> <20070510065153.GY85884050@sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11372 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs --On 10 May 2007 4:51:53 PM +1000 David Chinner wrote: > On Thu, May 10, 2007 at 04:11:01PM +1000, Timothy Shimmin wrote: >> So 
you were tossing up whether to flush the queue as in the patch given >> or to effectively call the code of xfs_end_bio_unwritten to >> do the unwritten extent conversion straight away. >> Hmmm....I dunno :) >> Does it matter? What are the pros and cons? :) > > I think with async buffered writes we are doing I/O completion in > IRQ context as well so it seems to me that we have to push the > unwritten extent conversion off to a workqueue in that case. > > I don't think there's any great overhead from flushing only when > we are doing sync dio writes - all that calling > xfs_end_bio_unwritten() directly saves us is a couple of context > switches. However, that could promote I/o completion ahead of > other I/Os waiting in the workqueue.... That's true. > > I think I'm convincing myself that the workqueue flush is the > correct thing to do here ;) :) > > >> Does it matter if we flush the whole queue now or later? > > We have to wait for it to complete, and that's what the flush does; > it waits for the queued work up to the flush entrance sequence > to complete. It's really the only way we can wait for a specific > item in a workqueue to be run. So yes, it needs to be run now, > not later. > I was meaning for any i/o's previously existing in the queue which didn't need to do completion straight away - we are now handling those one's too. Not that it may matter but was just trying to see any differences in old behaviour. 
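The ordering question above (a flush also completes everything that was already sitting in the queue) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it is single-threaded, it is not the kernel workqueue API, and toy_queue_work(), toy_flush() and toy_finish_ioend() are hypothetical names made up for illustration. The one idea it borrows from the thread is that a flush waits for all work queued up to the point where the flush was entered.

```c
/*
 * Toy, single-threaded model of "queue the completion, then flush".
 * NOT the kernel workqueue API: every toy_* name is hypothetical and
 * exists only to illustrate the ordering discussed in this thread.
 */
#include <stddef.h>

#define TOY_MAX_WORK 16

struct toy_work {
	int id;		/* which I/O completion this item stands for */
	int done;	/* set once the modelled "conversion" has run */
};

struct toy_queue {
	struct toy_work *items[TOY_MAX_WORK];
	size_t head;	/* index of the next item to run */
	size_t tail;	/* index of the next free slot */
};

/* Append an item; returns 0 on success, -1 if the toy queue is full. */
int toy_queue_work(struct toy_queue *q, struct toy_work *w)
{
	if (q->tail == TOY_MAX_WORK)
		return -1;
	q->items[q->tail++] = w;
	return 0;
}

/*
 * Run, in FIFO order, everything that was queued before the flush was
 * entered. Returns the number of items that were run.
 */
size_t toy_flush(struct toy_queue *q)
{
	size_t barrier = q->tail;	/* snapshot taken at flush entry */
	size_t ran = 0;

	while (q->head < barrier) {
		q->items[q->head++]->done = 1;
		ran++;
	}
	return ran;
}

/* Model of the patched completion path: queue, then optionally wait. */
void toy_finish_ioend(struct toy_queue *q, struct toy_work *w, int wait)
{
	if (toy_queue_work(q, w) == 0 && wait)
		toy_flush(q);
}
```

In this model, finishing two items with wait == 0 and then a third with wait == 1 leaves all three marked done, so the earlier asynchronous completions are handled as a side effect of the synchronous one. Whether that matters in the real workqueue is exactly the open question in the mail above; the toy says nothing about cost or concurrency.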
--Tim From owner-xfs@oss.sgi.com Thu May 10 04:56:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 04:56:27 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ABuKfB026834 for ; Thu, 10 May 2007 04:56:21 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4ABuGoC001834 for ; Thu, 10 May 2007 07:56:16 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4ABuGkY522062 for ; Thu, 10 May 2007 07:56:16 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4ABuFsC021127 for ; Thu, 10 May 2007 07:56:16 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4ABuE3U021087; Thu, 10 May 2007 07:56:15 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id EF6C729EB6E; Thu, 10 May 2007 17:26:21 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4ABuKMX027980; Thu, 10 May 2007 17:26:20 +0530 Date: Thu, 10 May 2007 17:26:20 +0530 From: "Amit K. 
Arora" To: David Chinner Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070510115620.GB21400@amitarora.in.ibm.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510005926.GT85884050@sgi.com> User-Agent: Mutt/1.4.1i X-archive-position: 11373 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > I have the updated patches ready which take care of Andrew's comments. > > Will run some tests and post them soon. > > > > But, before submitting these patches, I think it will be better to finalize > > on certain things which might be worth some discussion here: > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > - Andreas and Chris Wedgwood are in favor of not changing the > > file size in this case. I also tend to agree with them. Does anyone > > has an argument in favor of changing the filesize ? > > If not, I will remove the code which changes the filesize, before I > > resubmit the concerned ext4 patch. > > I think there needs to be both. 
If we don't have a mechanism to > atomically change the file size with the preallocation, then > applications that use stat() to work out if they need to preallocate > more space will end up racing. By "both" above, do you mean we should give the user the flexibility to choose whether the filesize is changed or not? It can be done by having *two* modes for preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() will change the filesize if required (i.e. when allocation is beyond EOF) and also update [cm]time. This way, the application can decide what it wants. This will be helpful for the partial allocation scenario also. Think of the case when we do not change the filesize in fallocate() and expect applications/posix_fallocate() to do ftruncate() after fallocate() for this. Now if fallocate() results in a partial allocation with -ENOSPC error returned, applications/posix_fallocate() will not know for what length ftruncate() has to be called. :( Hence it may be a good idea to give the user the flexibility to choose whether to atomically change the file size with preallocation or not. But, with more flexibility there comes inconsistency in behavior, which is worth considering. > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation > > of normal (non-preallocated) blocks (blocks allocated via > > regular write/truncate operations) also (i.e. work as punch()) ? > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and > what i did for FA_UNALLOCATE as well. Ok. But, some people may not expect/like this. I think, we can keep it on the backburner for a while, till other issues are sorted out. > > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > > we need to finalize on the convention here as a general guideline > > to all the filesystems that implement fallocate. 
> > > > 3) If above is true, the file size will need to be changed > > for "unallocation" when block holding the EOF gets unallocated. > > No - we punch a hole. If you want the filesize to change, then > you use ftruncate() to remove the blocks at EOF and change the > file size atomically. Ok. > > > 4) Should we update mtime & ctime on a successful allocation/ > > unallocation ? > > - David Chinner raised this question in following post: > > http://lkml.org/lkml/2007/4/29/407 > > I think it makes sense to update the [mc]time for a successful > > preallocation/unallocation. Does anyone feel otherwise ? > > It will be interesting to know how XFS behaves currently. Does XFS > > update [mc]time for preallocation ? > > No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size > changes. If the filesize changes, it behaves exactly the same way that > ftruncate() behaves. Having an additional mode (of FA_PREALLOCATE) might help here too. Please see above. -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu May 10 07:46:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 07:46:40 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AEkTfB010149 for ; Thu, 10 May 2007 07:46:31 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 366E12C8048; Thu, 10 May 2007 07:45:42 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 453BB2C8043; Thu, 10 May 2007 07:45:41 -0700 (PDT) Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 07:45:41 -0700 (PDT) Message-ID: <46433049.4020003@goop.org> Date: Thu, 10 May 2007 07:46:33 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, 
michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> In-Reply-To: <20070510012609.GU85884050@sgi.com> Content-Type: multipart/mixed; boundary="------------000301090406040003030205" X-archive-position: 11374 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs This is a multi-part message in MIME format. --------------000301090406040003030205 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit David Chinner wrote: > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > >> David Chinner wrote: >> >>> Suspend-resume, eh? >>> >>> There's an immediate suspect. Can you test this specifically for us? >>> i.e. download a known good file set, do some stuff, suspend, resume, >>> then check the files? If it doesn't show up the first time, can >>> you do it a few times just to rule it out? >>> >> Well, I've been doing suspend-resume with xfs for a while without >> problems; the problems seem to be recent and easily repeatable. Which >> just means that it could be a new suspend-resume problem, of course. >> > > Ok. I'm just trying to find a relatively simple test case for the > problem - seeing as you seem to be able to reliably reproduce this > we should be able to work out the trigger... > OK, I was able to reproduce it reliably with a script which did basically: for i in `seq 20`; do hg clone -U --pull a b-$i hg verify b-$i # always OK umount /home sleep 5 mount /home hg verify b-$i # often found truncated files done No suspend/resumes involved. 
The trees are linux kernel ones, so fairly large, but small enough to fit entirely in core. My script also captured xfs_bmap before/after output for files which had tended to be corrupted in the past, but unfortunately none of them got corrupted in these tests. But I do have all the trees lying around to extract more detail for if you like. Interestingly, the corruption happened in each case around the same place in the tree, often in the sata drivers. I wonder if that was just related to the timing of this script. Attaching script and results. J --------------000301090406040003030205 Content-Type: text/plain; name="clonetest" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="clonetest" #!/bin/sh #set -x #set -e D=/home/jeremy/hg F=linux-clone-test function emit() { #echo "$@" > /dev/tty echo "$@" } function run() { emit " $@" if ! eval "$@"; then echo "Command failed" exit fi } function nofail() { emit " $@" eval "$@" } function validaterepo() { nofail hg -R "$1" verify run xfs_bmap -vvp $1/.hg/store/* run ls -ld $1/.hg/store/* } [ -d "$D" ] || nofail mount /home validaterepo $D/$F || exit for i in $(seq 20); do emit "Iteration $i" $(date) run hg clone -U --pull $D/$F $D/$F-$i validaterepo $D/$F-$i run umount /home #run xfs_check /dev/vg00/homexfs || exit run sleep 5 run mount /home nofail hg -R "$D/$F-$i" verify run xfs_bmap -vvp $D/$F-$i/.hg/store/* run ls -l $D/$F-$i/.hg/store/* emit done --------------000301090406040003030205 Content-Type: application/x-gzip; name="clonetest.log.gz" Content-Transfer-Encoding: base64 Content-Disposition: inline; filename="clonetest.log.gz" H4sICMcvQ0YCA2Nsb25ldGVzdC5sb2cA7F1bbxxHdn4XoP/QefMiplX3i96c xLtYxLsOdhUgQBAQfamWJiZnmJmhLf37/c6p6p5uyiSnm5aE2BRos9ic+qrq 3M/p082/t/vNzbE6HOv9MXXVblu9eXdbSVH9pf5QKSF8JeRraV/jyrd/qf7j 3968qKrqene7PVav3u2u00v6+d3b6uJv+edX/5v26frDq3dvX11ttrfvL9qr 3TZdHNPhWP2U9pv+w8sXf90dq+P+9nDcbN9W/eYqPTr11Te4+O7tvq36/e66 wuo0Hfu9PaR9led9Xb3d725vyk8vX7TvUvsjrdC+q7dv0yEdD5OL1/V20wOZ 
ru13h8P4C9rPodpsJ9OqetvNJsw++/KFstL6/MPXlbVOysnkryvjrNCmOu6O 9VW1Tz9tDpvd9sCEe98fLpvr+qa6+Omnm/OocDju9umVEHmFq93bb7q1EzfL Jg4UWLrgOG/Rel19rJd8/nbb7V6+WEvC12DHd//15nX1xz9//93FD3/849+/ e1Pxv3/5/od//feLv3371z99V5V/3/4JX7MP0b83P7z59nvmqnhd/bf45huj pIz/8zr/VkYnrDC4jJHx0saqUtVXOnhnBK7qKIOV8Q8VZimx+iCbT3EQF60c zlHJ4KIOIeAgwQsRdD6Iis4qg6sqeiUjDlJhmlp4kJOI3X+O8w/y8Umk1XY8 CY6itLLB0FGUNuBWPgrUNRpcxcAaT0epMNExkAQQ/UDsVbpgYbqKIboMBEg9 AEllHQMBUROQciEwkAIQILBMMMZPgKzzioEI0mYgE73QioGAaAmI5jKQBhAg iCEi+CJwNF0rFrcMKQcg40jcMqLkHRmtGMjQ0UQINEnLkA9H00XU+WiAHNht IvbnGMhYZneltMg7sgQEBCKsGXSAgYQoOwLkeDSpXd4REPloVSzEdgRkoA80 SYtxR9IGmXcESYvDjoKXPu9Iass7MtgDA3kCAgIxVU52JPWwI4IcdhRM2REh 8o4U1mOgACAg0I68xe8zUJSK/iPdxhaUL7qto1bESw0GO007skDO7I8AIgRm W3SaFQzTYQaykQDXYpEjbZxWbCTANc1yBGk1fDQpmGtAoOVBLE8CzjCBRI9G URcaaWOdIsoBUfhCI1uAWLIJwDFQ3hIDsYEqkHYAst7xjggxcw1DBlIZqOwo WOUzkIeGFCA1AYJgZyBTgDQ+x0CagQBARFIiqwkB+WJICXIwpNCAbEgJUWal jWRJ8Z0lmwAYSHuXgZjEJhM7QrwKENEYV7UVwjlWWq0LEEs2ATBQ0KwkEnOt 501GLeHqC5DFPom/GqIFZWNiuyxHkiWbAGh5wJgCJFXWfkiOAgcKEE5GxIRU g5OstBGkZCCWbAJgZfMgRwaCFcrE1iCrL8R2MENMbAcN0UzsOOwosK75zCKo SsxyBGApmEbax0H7Md8aRzRCaJO1X1uZlVbGomtsHRzokeVIR6HzjowLKg47 ctLzjryGA6EdeV+0X5FkMwAdHRfLjownBrAfFWRXM5A3gCDJDqRq2JHUyjmz 1v98Ej8qlZ86Hw0ahkxtGAJE2mx9nHIs3TSQzrMcq3DyPSRncAFjXEGTrVUF xhXbg9mWLRLj6TiFIf1kt6SDnsLoyGQlGFmUgWC0ZkYqgOspDGkny6+VcgYT WM0yoB1hlAgZRmfBG2BINy3tFGybwThnCoyPcoQR5DFpIK2cwpBmOqJbVu4T jNXlUAAcD2VCLIcSanYo0kvW7hBnJNZGDIeyejyU4Y9mvNmhSCtzPCDH0Ikm 22zLaOTDuBvH4RQGXpYgwZ68DRAUi87o2aFqZiAxEo9xNyoUEsOYzXYTs9yQ qBnhTzBwAUVuKHAYYWSRG7j7qdxkT5PpMIZRPNnJQmLhzXgoydLEeGJKYnmK oCQ4MsIYH1RmOFmxgeHsddiOwDZOGZ69TJYRyMcExuU4jAH1CCNsNkfAm+2G fQwsG7uTOIGxLhQYN0RhDq6U1Y/xpiTOHoZtK9zMicRwWaEcClHJeCgnTT4U sXwKQ1IMAFojO/IMA99UdoMQZtyNkD7vRsBHznbjOEph+g9RU4YpESED+hEm R4SMN7U37FtUDpqMPpFYwfkMuxGT3ehhN2a+G46ZTA5fbZzAhLIbAjztRg27 0fPdxGy2+JfBT2BkMVsnH47ZwmazNfrwAlO8CiscIr4TjA3F3sBwDvbGIpdh e4OBntkbRVKsOApV0ZxIDItdaCPdGL3bEIrZDkHNaKPYFnOyoYWXExhbDgXA 
4VA2eA7ECUbMD8W2WHCsiYh4AqNyLoHREAMSjMvMCL7EgAMMSXGJNLXyExiW 1wIoRxjLosF4M9qQFGuWBooTJzBCDLSReqSNZ9mmgZkZUUVSTPElR1ATEttQ vCYB+hGGZTvjTeVGkRRrzvK0lxMSG1MYDnkdGR5syAwPNs4ZTlIMAI5F3IQ2 o5+SSOFHEhc/ZSn6jROTrkiKNeuPEXFCG11yLOxGuJHhNudYGJiZn9IkxYZD rFMamicPtDHqRBuE+Jk21s5oo+WYhMJSTeRGqmLSEcefGK5YKGkQSuqYD6U5 lzW0MuUVExgRCsMBODJcsffKeFOGa5JiABBt/MTBIF8rfgq2wY8k1lpmKUbw PJViTVJsOBg0YeJgKEcbYPSJU5qDJxo4P4MhKTZsFEyMU7kZjCictTuROBtR DOZGVJMUA0BxnHQisYixeAYppBxJLHUuPASi+xTG5zAJa1g1cTAi+qKaIoYT p6QsqklHncKQFOc6DcKqE4lFtnM80oPzRYjCakMDP3O+mqTYchRu7cTBCC+K 8yVAPcKYXAQhvKkUG5Jiy1bEuomDESon05TyDwE+Zhs2ZjTQccopQ1Js2aZB oCYkzp6UR0MKhNk5lKOBnYVJhqQ4Z7WwMBMSy8EWC3myxV4VW+z13BYbkmLL ER182EhiBQ+SUyuMTBjkBp6MgwYMgpnKjSEpJt8MGHlyMCoEnWmDkRlpY6Pz xb/6OW04LmbzTaZ7AiNL3YsA5Qhjs/gR3pThhqQ4OxNyJCcYPVS94MtHhkOj MsMtkoEZbXyO0qmmaE8OBgbBlFKVgagMtIGhzzolbZzplCEpdhxKO3dyMJg8 0AZ4I22kLbQBH+e0ISkGAJHYnxwMJg+0IUA5whTaEN6UNpak2HGc5+LJwWAy x5wFUI8whoMGxpvSxpIUOxb8U72VJusS0CrKcO0IkwNaGpSimVlWNab6+qdI cicpLjlfRCqcD+YRVX109ZVRMpeQeVD2X4VF+6d6/+dJ0odwA5ZqkqSrIUnX ZyXp0LdcacQI0lxgLMUwBGMRLvkzknT4oyxUGI1lZgSLORLFIJpzkvSQxZ5H p0QU+lZggpzD3JOkh2iKg4cZGBy8gwdnjmMQZw7+viQ9xFLOxSiOGaSTOYPB QItfTtKlmJTgdRzTmzjJktyQ3rghSxqyay6dcZx2ytJpshxcKnLFwaUiicgu FYO5S70nSUcUV/QWo1MiCi3IiSg4NjON9yTpRhqZA0OM1FjAQESXCxiIO2aB 4T1JOiYXl2rkWDPD7GzSaTB3qfck6UZFpTMM1RgKjEeyRfENDbyMjyfpmGyy TmHkBp2im0msUxjEmU5Nk/RwcmJG56JSHoURxnHqTgOXXWoVuFZaXR2qi6vu +Tbr02+zXux/5q+LCyJ6njh8Q24MoTRRcwtBFSk7FWo11R9ZDHE6sjdhfpXF Ng8vphDPIasXdlxMvYZHW8ndMw4WIN+/xlo4V7f/+T19XbyvkPvcXYvDh+hW E5GkaH4e+fEaEW7lCWtkyfvzMe3r42a3xQrUqUJoUlCTihKvjaQOFW5bGfpS GKq6+M/q4uLm9urq8SaVxz5wIT9XA8s+/d9tymvU2HlpKXn5ou66u20t5dKk R6Vc4c1NZ2LJux0q1c+b47uhS6Vcro67atLWUn31z7J6l+ru8IcFDT8Xcmz5 +e0141zI1X7i3qmbpVMX+Ir7Zi5c8yx/MZtxXmPO/eT85K05MOsi3xbHyARd kiykW54zf6Rv0VGScmZrzv3s/fTNOdzVke9peymGKquO3nD2qhFp5yrrec05 94rbr3CSj48y3uQtXKG7ZRTJCeRgQy+EMXRbl1JHOEfuheBZbv1ZPglXJukW h6pBu1yLMhSBD4EzZZQcqlprc+CMJGvMIjnhQvoXxxRHDh0jGIVTv4opKbax 
IverUPF9zCI1R9XTojVNzmH8pIGCZ3MYP2mgmGaRHxWtKaTPmRLF5mOHkRX5 vgAG8/sCnEXmovWswgHIHNpDvYzxQxOO4KovBjoICuNVsEu5zDWOarur0vtj 2pK3WGy5PkOJAcce25nUmP+SoOR2JqQ38/x3LDGcsiFiB/UNcFYlYc1GGC19 bmaCXWMYSm+XpEPPjm6RoysRcAmC9zuEivy/eVpUQlb7BOo/sNQsKXr6Upv7 l5qnRGUpvZ67jxxqTIievNKQEl0MKdF0pWk6tI58k4ToF88yTYbWrZClDUi3 d7vxD1cp3VT25fpO/efA/dmePQfuz4H7c+D+HLh/ucD9k9+ctF46bgLKIyKE xgmk0pkQNMh9D2fdnPzd5Q7PrvY5dfh9pg6fPrBfZzBzdGzoEbNJdqPm2c3d 1MZU9b59N5sB9X94RnO1a398iA5BqDvkvmwvd5c3lx8uN5fbywclo3JSB/fR 7P1luuww+3h5uJO+Pbrddv/h5ribzXFuvl+40Dsrdrv29jptj3y76aG5H+eK 3X6DBOqwbFI//7wxj3x+s22vbru0hBI0aXNcOOOmfYjR1EB2h3A/Nrebq+4u j+wj6/yY9tt0tWzO1aa5f2+IcZBS3tnb9WXNAngs3xPE6vCg8UA0pMxZKA+I NJyzj/4uSv1jonx3mTTb6vp6Lijykc9v0/GB80l6WuzOzkjVaijb9WV62NzR A1y/MPcGqr4HdVjVLy6by9vLtx9pLaY8vO8Dv5tgrhXVY1pxSO3tfnP8MF8q PDZrB7P7EBdApjvu9/awJzN9uiGt7tyQ1va1sZ/hhrR6viF9bl1L/ZbrWmp9 sK3WB9tqdbCtVgfbanGwrZbXtdQXrGtx3cRwBSXK4WHMXDgx/FiD4ocxz61r qS9W1yoHKEcJojz2nU+QjxKNXFDXUl+wrhXoUWiRCwNm6MW2UirOnS0CBe7F PruupT5jXesOV7hcmgunduydzPVSLpyq0ju5gisrbrqqz1E4kfOXmNAjVVwD c8GND9FGmxv5MfD5IVrq7D0VTnKb76QHlRLY3ObrqIt5gHEiP/ISqYmZm469 WtSD+mzMFxnzJZUTo55A/WWVk6cstbByosN67i6rnDxhpYU3XZeSb/lN16Ur fNKbrl8iOC2FileH9rB5dQD9Lm+QDmwO6ZsWQF16X7W77bHebGG2pFdk0/d1 1Xw40uTb7U19QqsewupUa+kZ/EbRc5Xf7fe7fXVhq5/f8cTU7q5v9ulwIKzM xe2uS1UvopK+sbJBtNC1sia579vUyJh6DIIPfRK+ozOPh4TrOdKFB7ZT4G3T Oau0NE3defzrhe6ST6KniBxOPWpIC73OZSV8HVVnbS2N9qIxAsekpxQbU9fO Gvy2MU7FXqyEj43Tvmm0Bo3gfZqgOziZFFOojTCyp8d5ag3VXQdf+2DrFIJy EYTx5Mdk0q1trU49TKoDZQLYsBK+T8Z3utOqlwASsW1MbDqprGlberdQl2RS opYr4V0t664nQnilXKpbepAUMtSkRoSI/9eiad3q3Rtla+tqU+tWBEuP7bTO us4mi2C27+sA8dExtmuJI1Lf2N4b703TRh98gyWUiq1qatnoAE5oWa8VTI3o BrJiuz5E5T0EMUlIkxKm17prELWYVvdyLXGgsl3TQyL7VIfGtr2Jfexk0yvh nJACSYAIdbtWMG0XIX0BAtN5aUUrRXKdaaKG6ioDO5Nk3yq9lvaazl5rXYPK NWAgKUA2TtShNcnB8rikwP61rMWOtUwpyVbGTjt6fUUvZeOT7zola98G2Dq7 lvbOgPC97Pok6Ym3hKC3Cb424HQno6lhUBNMp10L3wp6oK1Xbeu17mNNT5N5 oxJUDcdSXR89jPNara271oVQqyYm3dWN8NCmVhlIkIud8qnTsY5QvJXwHYTF 
ux4mwaRawmx2gh64BXmiCbATTdfbularBVPTa31MA3ujPLTWp6aBjRbJArmW yVjZd41QayWncw7+qm1kLyAkdYSUkkutbepgEHq41gBdWCv3yGlVY5H00GOS urNk9RvZdnRf2dVBhxp8BoPW0r7t4MitgI1pRd1biTWavoE6mBbrpGhUsK3x y+HPqK9J9XF9TRkgH9NbqhVXiQKSQ5W2LYVuCNG6f3quwD1X4J4rcM8VuP+v FbinnuCR3iljdXnDG4+oyU9XX9FbHqkdib7n59LP65z6vRUAn13Jc/3v91n/ +/TVuXXGck3nlI5LO6cwY2nnFMj9hM6pPHtt5xS2u7hzilZc2zll5IrOKUxa 1Dll1IrOKZ60qHOKZizqnCLCreicwjqLO6cwZ1nnFO3t6Z1T96Ms6ZxilDWd Uzj1os4pfH5Z5xTtbG3n1DD3V+icwr6Xd07RpOWdUzRrYecUpnzUOaXvdE5Z 9Vp/js4p/dw5de7NKf1b7pzS64NtvT7Y1quDbb062NaLg229vG6jv2TdJpY3 rAlnxPDWz1zjoLQ6Sslv/Ty3bqO/YN2GD1CO4mV5Z2A+QT5KMGFBhUB/ubpN rtbQUaRTwZS/a5HLNfQO2RDh4+WCuo3+cnUb5gU/3OisGV5HnJlBj9lFZfPr iFdwZUXnlP78hZPhPX00kuOr0ulN6fyexeh1flX6I4UTel97vkZ1sPIQIr1k kd/hioEtf59oaefUszFfYsyXVE5sfAL1l1VOnrLUwsrJGW/Vu5e7yyonT1hp YefUUvIt75xausIn7Zz6EsFpvqsJB02v+nddsAlfVkTlml63KvUNfuOVhnEL 59403f403i+lrhLfIwJwzvmWHtDttbe2sU0XkjC2T30USprlyJ0Mfd31SenY NqpttWyMcYDznRcmWSXaxtDfylmx5xhb65tEf2xHB9O3RrfYvuyN1kG1TQi+ 02ff450gA8k3sa1DZ2STgmxcL+FSmh6eqHPJ9Tr0MkS3HNlSP0QXdVO3yTfC CJBYtVp4K9pO16Ktdejc2V1UUzpjohYqiM7KrnE+RZ+oG6NXsY6NUqlxiHb8 Cg4akikRpDW2aa3VNWLeVihnY+/wQ4+LIdVmxZ7bLkmvIbu1M6b1rUumi7ED 2+pGCg/UptUhNcuRdS9Uj5xVqDpYi80hjgVtEiJCnOIf7J1Zri23DUWnJIlq h0OK0vyHkMX6SZD8vKs0uIBjGO4AV9WRyM211c56VsqxSOvnTy7WzVfaMmqW 1RxcsxyXBVya4sSKlVKz2kM77xG9FD3Gr8+pi9jp4TdkTfelpGMStfXzJ/fa h1rXW8+x5qXTrj0voi/fvXyXuHlhrfOQKbTElDvGlq2HPG7nDE2XjCT9NqC9 R7fTHp5sedlJ2Kc7lA70VaXWmA+/d8edDEM2nfoQGznrXOXMm2wYyZ008/my CfArER7nTtfzEM/q0/bhG5G5lfoeWUuXo3FRyr03xSU9KNaLbhzTxv+6V/G0 6/JDscfOaMtKkLQs1vZ6UdFbVU5eEW61jkMQ96ZT45R0Qt3rkdT4rw/xfBCy YeKZ7pOC9letsdbKaJhSdfZ7T0TLQw5mKdpW34YwWVwhdFcmRdagagmdqZVU kfSgonpur0700rr54r207COUGD6Y7NbebpLxonWLENiqjgKVPq+nOwthcVY+ c6oh3MTGfPhmHEztq+qiQKlXQkKyH42Vome0uloluc8qD7rRTO6Weo2nivWT xkGPa+pZgZC+UjXfp7/o86YAZiph3Bfay4E24hARz/fOmZPK8CYzP2TKPKLd Nw5fd9xhtQnqXuSYV2qVm7c07YVkup5YVbZqnOKS6xyi1ACbjQo4tg7vJVHd H9o52Kjq7ljUuLGjDxDJZEnIiPUAJKrVny9h/UdGMlyBjOWxsj3XWvjxnusu 
126NSpX5D7099KDDQghzkarnAC+lIHOny4W4IrJXopL5iyLFHazj5rhP+KS4 TTJuZNF9EaImKVMH4zftB8rleXm3MC9DPYLXSMIKZlBjd1m5UcPjNpmHb+49 IXNLRvF2oQ2XpnmX0Y63nhFnuN3WQ3ajP76u6hCwpd0SADYchpzU73ES5Xvu qQ9Rt3PcuYg9zRnWBZ6BI2hXNq5iCMo/1536wkj89uh/1wlyZdOhY4wVjZK6 Qkhp9O33JZ6pTD2Lp0rMXoIvJ+d70VVKImyUqL0j8St+/uTlwAU5QgkpNyMe FR90zmr73nJiK0kCe+VBN9rQuD5nmOZZtzmSB0HmWL1PKbx9tu1uL7zRp8ym t2v7qGX0aNi79M5mCwRrq/JwfehBqipsiCw1v2IlblyCndPibcAzLOmQHrry 4lOQ4+ZtThzgbqcQGLH4l94cN46fS1OUbn7hZ8D7TPATKtfa8FQtD4Ka9NiZ Bqm8Aib7+ZMjfe0gdKveuAyVeLBxJo/EZe0M6Er13B5ycPCheEsCQi4xNihR lSSBm3LePR0SB184XuJ5jtj1M3iqa0k6UdWz66VE7cKTqZPd8gsVVMsCmOMl L0XV/VITvflOfkZf+0JgIGV6oVwvXuk2ArqR1iZ7xgLum8lB32uHu9h4z4d2 9nyiomxxVx6sOKBz0yiTCguIUHeVdz208ykUPR0XOponxgwOHx3D0tahkLha rpNJ5YWRejLFPCyeM8lFG0W/ZkjfvocqWtqetb/wRonhh1Pwaa0KykTWUBtL 7BjrO4bWOxz8EBurYB4CNw2tg+0xEoPPzHNro/XREZpd14Nr24SBB1SUQqyB WVocKBgxlhtXYVLFgKbz0IN6hFy7SZsfWpgyaJTUlS92M9P8K3ZXzHpfcrDn QlokvnLQ2DCT2xLlq3dFmGvsCJnzgTdgz1MdB4sP5gmJ/ENCyROkqRaBFw6o Wh7aGQK6VNVTqdaBu2hHXyscYEsD4Ci8bWKRHr5ZaQON6wXrqaMf4NHqtBm3 dWPW9NrIQewPutEpTpfMHqMOPzff1Sala/qF8BagOxK89KD8U8zpQRlxU2bk SDRDveJxyWEL+WgL5H35ZqOiELMF8Ig9TDeuIs47Oq2f5hR0r3OXh2+mESnU gis5Q04NEE0VN+/4nlZiu5xeJf0fandB4W+X1g6FKvWZcd0VY09TzHWwtBAS 2f/AGykuOgwe8IsjQeOiBHQ/FkMElkON9rjpxbXhjN3mmVV2RpcpgOpT/Bxg Rta5JykA8vBkAydw7GTKgjO0HKehyY7wiTFMgx9K6WXkJIB8JJIZjzZjuDFG B4iGGtZqxJhPyWjJeCFGW3iVmZChFnfeEYGHPtwHd7G32pRa+os+y7yktN8z 9W4kGh6Y202gyEp6Tp3CX9NDOwMpLdO4onOpix2RGLv7hlvn6pP+I2rSg/KX naQN2mJAXXxly2mfHRe8pWTXYYW9D0H5oBsLJIh8tt1uJzFq1RiJtpOE4p3F BLvyUmERJAyhb4SZPyx3cDzSJEHUe48zUKVOezzkYDRADGNsG5bUxO9F70l2 yXYB57pWw3G+jOWWBdAKBSrvvfvAeW/PORY1aPfJV6/Yq/tQrQSLemMZzozd lJkUpE08LcgrJvyHj5gaehkryHH38u7Fzj7EQ9xbbTcb3sWsrR4jjEfry7xV 2PY5kKC4uFvt8tMduTc5s8iJMV6rR+7T2JdZDGoMDGvauYFfi1oYC2Ob56o1 waP1IepI4gR7FoyJU7SLJpwPlhYgX2WjUafyyfaQKQcX4S0BXr3o6jG5lCBT ogJlRqB3hpbGy8iJ0BKWAFq+DMg6HebSnmJkI2s/u0Nn2cpDO4O2MXGF2gOi Bt03kXKstqJx4WUl6Ki44yHqBtKp6FnngasjEk13pVHOMuoUvwexq+XF0avg 
INL+jHEoKYbzYomICh/NlpGLmT59GIWonbwY/Hbd/Plt1d4EyxxC5Y2BJLgO +N0vYzKgRuULo0bFPnvgf9MohVf2ravQHhekfOjBz0dljAnPrYqsrS08bpea sJ4l5td7vi81ZTSgkVYWn5fWgAsU8kdVm9RMjhRyHoJ+qN2d+kFRigFiJGSD jjPFUdmGb0kHLPA7q/rLk8FxoAAHlCq03Le0W6BPbMqEnkARiC89kEy6ogQA IKAXe+z8a6QIMn01Lcfox35vf9FnbzFQmWY6DSccE4RDJTssgDp7QlJW7Ct/ 0LrTE5Dofd4wlL3vszSuMK+rgOfnuoKm+UGRJnw0ui/NduJBNwQPr7Zdgp5J xxMHt+yXdQUrDQIEM4/HWvx6cCCR7uBSzjhcOgFQeKlWnSxG3U0BmQnl94R4 Jgy4r3IIHIy99ZeRE9JCv3UJNHUsbkT1dgYfRzrJJ50KoU57odyZvwNBkMyY X2pt39l7uRbDpWvza6LYansZy00tDvyAO51Po0DFuS4tnU58tK+C5RYnsrzw BiG1bi1nyzi1rQrUEb9GDFJm+6pOJr2sskhSwGYljnGFql02eocej8Cxa7XU qOs/nE/5yUlOFO/Xg5xUYpVMJ29u//ODnAB6BThxqgcf1u+IwVBg5S5EA1XH P5AY/nqQE15sq2M5toObMUMXx5K0vfE8jWqnMedNhD0+fsTCCaeOdErIieEA kzHKooL7+BjfiJT6eiRM8Xo3ReOWdTGa3dRGJw/5G+ZqbZIl3+Sv54booj5h miB88JlydyhaTvQBpWfFKtuMfMp5PdgjTjoSNC6HJ1l4Yrk7rLE0SyXD5DmV P5/e/efHY4EL1Ttl/ETqOjQDMQB77bV1eoAfgKlYz22/cd+0gKtKwcTK3jZb cSsETIt1EZojVp8PcvqvHmB2baUckFcpCmvekS5uecVCgJiMzAVPk2d5/fqE lx9xzgXgH8saY6FINgw5zYMtkBo6pd1evz5XGqTHgUUmRa7bJOIbgUnX9nFj tKVJfn08SU+FBdZvDBrMhAkfoy1cfr7V+ZdCVtl6PsBst1vVjEA8FMhZC5gZ Z9tAQafHJCiEJTL+kmfTDQI+74rejEZDrzgZyUfMb7e2iFnSmZKaXzUnDs66 4CWaWUedLcfJRHXpyLxvmdRtqa3no5B6zW5jrY0nbaojTkQrBGscwwZ54tjj 7Ah9ffzZCTFrNkwwCZeIF0gQx7ZzvXlqGQPn+ty1ezdMDaG/mol7rLiVVjp8 gZbenFKsccryesrVvSkJZAj8uMd5OGoEfm7TCCZNKaqZ9Ofj1+D6C3Tt6EVP lPIjGQbXm+Y++DV4rE7w4bVxKqYdxLfcI3fp4nViZWDIDSJH8p7RfL2eMBZj JkOSyS3pFkSBv9Tisa6umCnVpka1tN955KNZKTSGKvWjozpHqLu8IjStrmBe QURr+Z0nShLWtnHcUmpe+BYf59Ja9HOSNjN+AwjK+VXSLHsDfrtCCtPO7ESo Tb4fGdso23USbtzXts9+7kAi0V4vMZ1qHU4YBJDdb0S89356fg1MUspP82zN ekPv+7U2PMbAO+8cFN+hgNr+/2me//nTPEcXw+cSiR9bHsqUCQUWaMCti+tR VH+/Pr7NGGTG9GsCkddE1M6IwF8WAxigiMgduzxDoM2R6U3vvcYRhHBD2fyT 4LqQfbp9dZLgtWvjLEML+ijU7JFjucOFa2OVwrytQwoLKs/PR50Wm3pGt9Lu iFW4NZbdDCIVELkrxxRbG6/uZPaIfMUlEI0+95x5lzNmajYX+sZPya398fzr v6TVkBqrFmVPgnDHkCjNP0EQSnyL2cd55c8XBP6Pj5ml64CnmIY40JScmSep m+CGSsfKkBgJ8P5MyCmsk2ACiXZaYRGjM4JT68LUzREydJ8bB9tHGBaEfSI2 
q+j4+J48K8vj10wsb6uvXz/WhbIpgtieby9JiSmWbScWCFSn5fIUbelZkDM+ Byw4fKY0ImYVgWVpGLQAugXe4BJ9lrStSBnBvvRuOjEy1ja5u2ccqDqKlNG0 /BtjCncDOKVksql5q4aiBYLcHmOFuafW22taHfe71qBaHwWfqFKYIRm33dhR pxu57/f9qNN9YzMr6URNrbq+jT3YlLR02vIZsxrz9PLa9lljpUcMI6+0YpQF IC8D9BvklhgA2sTWemXMUZ18vTR/yfBS17aG81S5PZdzck0p79WfbXOLAwMp eqjZjIWPAGDk074zzhMueryO90NyKw2bY58INWuQqhfAUVChVN+XHthF/IKE r4EpdcSeixqrH2IEECjkTbBmVbubV0SHSH0eUzgtlmZ7hwluuTInLzkT+7lg 2G53tnnys7dS6GkdsLj5hqYQtnSa0Sop9Yk5BEdWba8Iy7NITlCqoS+Ey54I M26OVjEglmaHZEt9HewqE8tzx+kHDUuUXVrr3vAOVMk7Er1byLTXyDlQwY69 niBBgpX3wIfCrXa/6lgIpxTC8+xrw/S7p1iP42POFasMNOFCwQYL0SQFxvyd R7prCBmMPXuDx1bsrqErfF4geR1cv5N1bffnWruIPkmIO5wD19878T4VRFZP 6MWOUzbTezn5++UTGwjcvmJmFqdFo+3ldUnGtOvvPO8+8dxtHjuT+ex1Yz36 sgHkp7qh2nvW0nRfNcdil2s5CQkrLfYB4Pe9n62x04JcC0kOZPidp/XvmWJ7 iJgWQdzzKklkzRQTpLEkck3iauZX+7D7FjMP+cXzy3YtiMMQ9NLw+q3FcsM5 n0cC11qx3mFcwsTb3isfeLBAVjF2TCIgdzryX/IqAxiBYph7VqogPnn4oRpS 273GyrIaryn+DIFC3Az/VnYFIkvxXNutM1a2LInd40Rpn6+Px0cB8x1dudua EaKu8yTpvADjb/MmNDW/Ts0IcRm7i2KEEcKxvnsT6LWdQtVaejoMJak9T0lK zGSMhRcM1MQjZ8FTzVgN13hf+NF+8++8peLEYutijq5Nby2PZXyt9CTYUCAz 5h+QDHs2nrPHJjqaKDYkL+ge24O11aaDvgbZxl7ntVotuo26EdspRg/LbHM1 n7HrHhmN6zCWYc/L8wh4bKiLIxhauztG6aASaTdEjiJDMcjY3vJqm2fBohEw y1DMPdRzjrWg0NnVKQL9XwM2X6vVuUb69JhtHmJJ5j6HoNyLCg8/FTP8xCj5 fUJ1rH5WnMiAUUNe+mqzxSgaDjSWFk7lDa+uPEafcK+9r6y649yKODFgutWF F0fi+j688FUxY+hDZoUxWz+Nv+OxqlAKFWy+NE2noqR+novh8TIrsDDFBOcW ZYoU47eULWGgNdu7fSAv59/YO7cePY7jDN8L0H/4cmchXqkP0yfdOYlsBHHs wFGAAEGw6CO5Mclllkvb+vd53lmKFimbh7YUGEiIBbn8dqdnprqr6n2rq6qP iu4YuHcM53Ep2txXUVRV4vcB1DR2/HUeXTNKspmlbQZM0CvdrbhcbU/qtwEK D4C0ZLbX/Y98Mg5sxKkpQatq45ZTKUH9Mc6O9iowGlDearcD7HF1DHyFMDDB 1qq6I4JHRpvHQGdXVvZo2963Kg0WiNmMYx0DnpLw5RbYJpkr1Si2AKfuuzgn 9+ChVHCrgUNnITacR8IomBrHPNZQBkpdfVv2mEdWfFcS+tHnaMGp0BupTw9b HKKhte4KJ0+B+mgb+DiGDo9aqy2bG75x1A6sAnr2uh8mBd8k11V5qmJOrVJZ 5NbNHKYB2WCOeWwfVWan8hMAyCz2WGYBbR9zKcyIhjWH3FYtc3s7GFsGSlAu hC/A14bMj15b6dh+23EHYE27vfug9gDIwoMwR4km4DyOFpT+VvJa1UXlZEEm 
/jrPs3LYsX6YBjlUOBqmE5yi06kNA/7s3A5N3gbgcJ2uEpypUNdZfglaVm7O isDOyVT7vNL2tljC4sQGrTfhOOAkEB3MZBIatBVdOA4VtfXtMGkvVU3eGmwq oKHafjMqeF8A/mkxOavCjOb2YV/KFrCpR6CIMS6oXA4PWQA/FkPnQ/ZY1e1D 7orqlqEMi5nlNib5oLRG65GRyv27zWpgtBuyUFo8nrAY+GDEtWbwcgQIOpaR i0XdgcBpu/g+26H+9Gf7Z9/x6EhkGcZf/lgq3h+KbLq1PTz8oShmmXFWIGNj j2G1X+VjdTWv0iZkIm4D8KCaGhCsqmhUXhIwYID9bo6ZUNiGW4fl/p88hU7T dwTsCuveKDh0YB7cimEFtYgD/hjdIv11HnJXZ1KvpCDqULJnmZ7B/O4GtByN in6oGeJ2TKGCC7wp4G6vO0Ul6XUcblAGuekjqZHSdgz5xzui78EodBgJTH+p 24/yc0as5lw+xxiwRVCyjXk7qy4k7FbKav0Ewkw2qVtMDyNVWH91GE/WUN8G Ihnh9DNmZqpLxbuacOwrphynML9ttQac7nY+ploxpAMgf0TTQo5MAq7d1qil qt5Qths3tmNpKaP/a8aDb9STwKtLSlZnzpbdoRiA7NIuUnC48gNK2CcaipCq yolQXdgJMHwNQJSLadfXmpqw8rkt81AlFoCsrdgDVe4zrOVVndfXxtR+/4cv bp6440+VTOSPLJn4dqCQYGoDTFB7/fCCCd/biWzHnN067RYfuU9XJWDmKhZl iX54C4xXD/O6GgOLBGlah/Z2wbs5YvdUNsGdjsMMe0SWm98aHIRS1MYKM854 yyVnlVXpiwqLjxQVXsrF7T05QBrKu4ZXpxiV+xweQI1xqqtChRseSKS+74kF HY/qJIoT1oYWsIh34KPcTEdCuQOt59x78qP6NsCzwanHKrdYOqpp9gIRKwGo lDCG9oP3oN4cvMKMbFVVLWbUYaMOX48wzFQyRgRhdPTe1z2xYHYqnsUeHSbA Mj4cZlxlmcrCi8Pk6XDTMW4NHjOYuQEelN2nkG1xiAGQkoO1FoFltYZrm0+u dMqosi31MlGRAT7NWMVJMhjGd/nL44PjwG9NKDTobMLoJh4enK6tbQvsDHA7 xQfQK1s/uO7lLbHI42bbUlLzRTWSBMJN6JHSSpxIn528UNgafOWkLNDMqgNv gp0T4j+qUufLUsWKB6+MDwagbw2uVPYJa4RMDCu+haWusc0YZppTy6VW49LW 4EZtSJL3xqL7+N6ubInSpEg2qDVt9abPvDehqxjM9HDNQdbtqTXwa8YOXi0M DBZYW5Vla/A0Whv8ZUpWVF+pdUA5vsWvQXRbKWcz012xlGGaoIgt6lW87FAv BGvlLJL2MEPPGMm9wW1f3Y+zAPKoAOSkIM/M6j00hPtXHhMd3hocuDcAgK7A /pdaj9tsnF2hISq0R9XqB+t878lzV0BQm4dwuTZMmKXM5EuaCcjQY4KnYAvm 1uB2oZ+45AzNDxgSn5v63ube1YGtaasAvBX3VstyBxrPSjdAZF9bjglXmkE/ wwAQQ0L1BwZgz7Y4JYuApRgq9wwhcSxFVavPOY0qosZqdW1CC5es0iSDEr5H Z1IdlO4Y6JDtQVXAWJf5wSGptzUUq93A8ANdZNk1VgueqUWrHBGWvQ3KqtzE Lao68yDvnvClyhZbWEWsezx7NMG25kx97C3FcBzFtWpMba5W5JvNsjATnJ3p K02vHob2gzu5vPXkNTPOUhdZbXG6HgISH9N3M5EMXDGqtMJtDQ5csWGACpOS CBb0GbU0x3HWQQVWIYJPM++5OWRssuqA1Wc3ZgA9fiO5oRa+FSCjFBHFH/e8 v9JRpzCo6naBnUrP5nYq2SgHQrGhAlb3VkuXGfesPBxabFjx6MEX5365V655 
OerA9m7all6Kes7gbWzvrteuniaH125q67wHYNK4XfCfspo9pGKKmr35wAyq RZjrLT4UuTd8/6b3V0VJNzCeUVR+6VXCHNWSTK3OsoodcNUxbypRaBUoB9ks wFHLuGmGs7YWJ5pbPeLiZcqeyT1cdOOIPYHMgUJlYmtan0DyqVYWKwaMTsl7 GupHHiUyjSYpqaojDNDGUp1AgCLbid/Gre7BOXuWtlfV0cOOvQU2huAz9kSh Lb9qLqn7sidzYTivxrLNZXXUBMOg8mdXv4ylUcmVV//cPbEsdfxxKOUqDeCi jmXyRFkhe+U4Hwgu97113lXmNBhENXcZnKiMCjA/rnQiIWtl3Hdxi7pyVvic sjpHA1Uki1FUNYyKh4pJx4KL9T04hxeuJgbtfQy1Ke4xpywCsLzKDipuKSiA vgfnsFPDHLUeg6lDXyyCPssyABwdKDBXOrDve54oDRWvWpXSNwV8UposS51G koqgObgG6lU2l+JABYESYFzf7WGVS9FVP6U9xs69iiKvcdNZVB0FYdVSpQWx iaZe58UNxeeri0OHwGwCUQdKUR1ohaakmgFCMUIV6+oqc2r6SI1690yuKuPC MsPKwwEnBrOrIqTp1X8TgtfVxWbsTehQ9+rmlLaZmFyjrZWaQsiYKwVxaz7b xOwZriLa2dNC1eGbxZqgPkfKQzhT/bCHYI+W66azSDN6/LyrduHY3IFzXgK2 BU5aFZg/OpO6xywsCxnNB5l3eKjxTAJUV63qW8C0Q3FFlnYNF9qdtcnKQ2MK UfVDbrpgEjIalHwyBRy5GW8xsa1qk6poGFaBWgb1rmOwYqwVv8Gd92Q+g1kY 9A6CcxFAClhU46ejNZ0U08+EjBI2ZX54f6hpdgVYoK04/66UGgQ0vYI8ajpv rN8M5oyMYcpALZAcXM6kGhP2V3EiX9Y0x3SlmU2rWMFT1WCijgNZsMYLw9YC Ph8AaBgwbHdsrhbjEhLAlrvgZoZ7dndGydXlKzsdvDWUDxc34dzZxQGU4gq0 3KvzlM53BJqqsX2CclUl/u4FLaEpcfUEuCrgTpusx+Q6rJU/Ezan2gKusRf6 g5pMFyNwaLHej+hxneqMcgwZA/VKUZLN2MMtani6FJ2YfRocB1KC2rVYOq9S i8NjN/PhXaTfGhwQ7pyOiYGY+6ojQkAuIP7sg1qHnYebpLU3oQ08rgKanuws UyXstSZlGTmDC/L8ZOaQk9mE0CwUTSVuAgdaje1ZZaoY3GjQn9GUNnvsiWWo YatqxdQcicWyRtZyPKBGII7IbJpD9RCbuAWQ4lYuSvxOOq/i0ElGceBGu1LL ivqN2L0nV0leVClzUa+GppReiApYrtQDiXQPLVLLvT2TC6718pdOpQTxMDrJ SXsJ9axVmFCl5Q636aCViAA4DICUUFo1qrdFXQ8Qi+nBGvUOn3nvyUsrOlOu ljrV9URHPGGvwoGkSmyKdbMeW9ib0KTONtpAU9/1Wm1RsWSyMUyl6Lhelkdw ZVOJJiZlACYg/rDoaIBHYF6PLi2Irve82WFr3oyIjhgCdBkNbVEZWGUpL8Sp oz5GRSlRzbZdqwiiKGDcVaHQaSWgo7DFGAF/lzG2yqUbezLvM/fDDgOBxtQm 0DP3KSHaqeMRhDhcV6vNvdXidJZmgqVk8Lhy+Ks1OqMF7tm9yV3tFTGRmyHu MZdHbax2zHqKadlDh9xZazHqGK+Ve9yMiBr5ndmUvjImMB/kEr0athSwEgC9 mT7ULHlv8DiAtkFtWSMkIzv1A7BNxSzKi8dRzyrntOeg+1RSbtcmyFzYdjXU qwMEpzO858A0DLXJ2LPnbsRyNBPd0ProXvUs1geLSQSjB4+2zr65lTN6BrUB TxKPrlMIoFpNCT7LrxVmagUvV+0mVoTw5KKMjHn2Sgg6ivQAWTC6unEXnZRb 
5h5WrCs3FptV1jisH0ObUfalBvmA3QSfQc1C2ZtQcx65gyLqrDvs7zxrbuJU +USBSrdso8ow90huG44nraM5cXQ73XnmojFwdgc6h/9Dinb35ipuzSAVRKze Smc72FEwhKMyH7WqWWfbZBaLpVCxIKqLy+rwKO6PWQE/OtDcPBNVx+ZmK5pT VUPC1wEnz61nSJbJccETffQqGBDd2/P+KkCqqlJDPZsyot1Kh9rYWu0s2NS8 FeDdVP9o0BasB0x6GBXe4UrBLDkIsHelfSunYXP3fBzHAhpJFUfRyT+tKj10 qHl+ACDZqPLZPe+PAODKHUMb8wxiXE79gTGQPoLxoulLx9PsrXP17RH+B/kX BN2VKSpyno8DaQelQvQVNu25omdJ02iPMyS1UvFqIq2duenccvjBMPcc9Frd HgqfT/As3r+KozcU1JmjNHVtA9AEuzbDUEp9ZJ04+GFGYSCKCSt+rCGHpHNq MMF5b0JVtJqT9moRgQ6vMObIwFIdCr3SgiqhtGbsyRyWmI2yrQBAPSCcqlSO fq5LiLPHAZp07O77Q8+znLzOVlVBJRBamTqzLWFqPFyovEDdTFiq7Twcjwf3 nVU3ILqlDjmk1KG+IF00qW1OqJ8zVt90hLeKqcEqRhu6qka3AD0ciWJFm/kt h06CgHIGHVfsWdUlF52p4ora4K8ZHa50T0O73NBKbmIFbTLybkpxOaxqYqPO 9zbgvc1ci1qVj8DoY5izwlmyD5iWkKPB7aurzoLkbW4TN55U+0EptO5OByEr XLwHdiniqMO+N9U/4z2DCpLPptYRh6ztuV6PyoMrngEJAKvGzdUSKxw0ZCBc d43phGyBAMKBoQEZRJjGtJtiiSpSx27ZjhXHcThvYP3w3N6qd1ahLpa93+RE ylFkztQ1oEsgeeSsil5sYjk37ZqKWPesImv40KERyy418soDdB68KdoJtdpD 1+kgzuzRlhSPFVx7AMrF4qlxn12wt43gnJJFI7xlc7clLLx7UZXIVEqYB17J ruSpDVw17QUFx004B7MPBuO3dN5mQYUmXg79Tx2KlKFyONaS2x4QHSqpTdCT YQCebTKVSuSEZmUDADtA1bzW5oRGfGSFMPsOCWqw5gjTYFifYhzJ+AVm8TaY zdw5BT4zOLxWpfchFWz7sFoiTrEzHKmid3uGS515wEQzTXj5DJDeHoZRHH0o K1wHW5bk96xiskxYwYcaEJ1T93rsNy4Ca67MJXgLKz98fEhECZvpzLB+8dNL wHrZS39cnz2aL+Y9n+jg5xQu97f39cnlbv7u5sXN7bMXn37iz1OO7+eju5v7 by5TOdcvLvNZv33Jh3dz/M2nn1wulz+sF9ftaX1+ufrd755fvnh8+3R+8V/8 +Ok3Xzx+9MWTm2cv/3DVn9w+m1f3PO6V/+JzPn5xf3s3vzDm4TGe3D76fOxf evOxl34ru4+/6esrP/Keykn/uCtePhu3n36yL84vmZyv/v3rLy8//8dffnX1 65///F+/+vpy/vm7X/767//p6jc/+9Uvvrq8+vOzX/D1xi/pz9e//vpnvzzn 2Hx5+Q/z+eeHs7b855cPP3U6ncqE+PnnMAyxysBnl58cORYwAr9cLEY6fHbh Kmf+gle5+TFeBTJpv32Ty6sXePUqybr08CrnGzy8Sj5y+uxy4TL30a/yx+X2 A7zJ919FYfUSXr0MsvZH0kMDuhVgsuergHqAKnyKUyigis8u51Vx/13+d2aF ucjnrAgBlW9nhcnImhUXQCx7syKF/DHeIL1+fL1BsVDm4HiDh+94A6/JQPiF Tx++0RvwJ3/kG8hA/BhvwJr/4xzYnIzJhz7mO1vKw3JyRQ0OjN4LnhpYThcu c+cwlmH0n/OSVyPZnL2OrdNn8JSYHobRSWPe8inf6AQNCSLJVvDvkxeXqyf/ 
705+WHdydff7q7srfTEnl7tbkMP5F2YjRJiAv/xz/eZizcXYL1ms+9J/x60u Xkcbm+MHutXNn7+VMzonpJjw3VuF/dl9z0tlZ/IPc6ebd93pInpY4l8gvoe1 sGcwH0Ai9ldVf7//w92VvjB0bzziYd58vnipd/3xG1eg/u++oj257b99lxyy cW+J+7pf314/v/7m+ub62fU7V8ZFtSPxe1ffXc/rwdX31y909cc8br/75vn9 7RvXxPjm8+pQ+jfvOG77y6fz2X29B3C/69q375e/Rfsfd9F68/eP452/L8bb n7wc82MkoYtu7j/yiuf9XRPtcBlvCe637eXNk/H2HIX33Oe38+7ZfPJx1zy5 aX/+2XRWtn3Lulw/va7nArx/9e9kWb14p/E4zzH9oFHesaQtrrikt0epv52i fR+3msvl6dM3F4p9z+8/m/fveD9Ig397BqVqFWV7ej3fbe58jPZPXPscVb9D OqeqX12365fXj76ntVzy7ud+0e9unt+/qRWX92nFi9lfihG/eav8vqtgzeNd s4CY3nK/L1/cyUz/I2T7tBA82dePX77+HfelKV/acvmXf/gaV2fSCZseP7qc Bvxy9W+Xq6vnL5+8H0O933kcn37yK57x/u7li/vXtdrvu+r0HY8f3fXLurt9 enn57Lx8Dl5r3l0ervvp5dHd7cvnr/736Sd3879fzod7VJ78VZDi00/qGPrs j0GL1x+9DoO8/uR8uO9eyS3fjnlcfn9z//hynK1Pvv34cn97+U6g5PKTv7WX x7OOF599K9ir33yAqC6Y5ZvFu/TH86Gy/btP/frD7zx3/x/2rm3Jltu2/ooe /eTiBQBJ/UyK4KXsqlh26SROPj9r9R5JMzpKtJtjaVzxnAdpbpvdTYLAAhpc 6/u/fvny4y8eF//zd6/vtn8333zgzd8+Ud65HvOr8s792o2cg205B9tyDLbl GGzLbbAt92s38oG1GybUNVypdTPkdS+pNTJqJnvSrAClP1+7kY+s3TR75KKh IIfU+PIoxn5JPkpBkI7PVwnkw2o3OaTEEjh+HIxsUvEqF5jlcFXUSkgS453a jXxc7eZai3StSgwx5JdV4WIkrkqNYvlsVa5U5Jvv/koqEaDnLzc//vsUTlrG Q4vgx/gqSntURymhVQJ+ii9SVf3VwknLNV3lPH6VRV+G4fZlOY9f1KhfFU7m pzP/xzrz5ysn6dto75j9O5WT913qVuUEl0rnq3uncvKuK/0M/78tS/wcl9+f vge10PO1mftXeFgbRvrPv/A12+PD1w++/Ptaf/tGr69//rt/YnD6wsSw8so5 pzbmRJ6qJXscg0oGc3edwaPkuJ/uyPj7f/30hp19pGZryMgpeSzUEuh91ujk VA5lik3dfjA0z1pMHm9ZY/FUKgYbzkM7de812ooj8JhJOBi6zLbX9DBb6Gp5 dwtBUkek36uOtgsb18rzLYevhm5ECzNsMxs8pdYiJoUHmDDhMkNaW8mFmw+G 9t77Zv9PDEOne9rWM6XyehfSnpttNnyWg6HxxMOqUEl375GttL5TXpYs9211 1L7LtnEydI9uW2XYtFTnCrsvtqkskgtUHr+0ukmRdGJ8a4fWcnEqQo66B3vp esiBR7BycKuZLeR6Yny05ZLybjOnpSWmPJq1uUNQTBAsT7LvdTTXdqk/etWB GxWr3i6DLnVYttlhhjP0p9nfXw89u/mkHE+fq22fWMueUw+xxJqiUQhVMNnz ZMuwDRcTITV4v+SD4TjYydhg0FJbZvt1y+1gaEwlnvmSdLfFLYLNudz33Eh/ nLRu7t2HHllIHT0OXdIz1WB2SjKWpl4w22RJ04JVPrOQMJLsmFTZxSlWYG0h hZYKD9KLDtWgdZ5YSAghu8ORwGP0sqhVFskWo72wua1MFQ12dNcdg7SojY2P 
WhabzzCWINHMkcK45P/xUU6GXnHhZpetlqZnp6ElHpWf3kdnW5R3+MSnSVXf 3DXCy0QQGErd5pRtY4tQ/6vKlkpqgYXU5nkukdcWkklOGeCIBmYglDUyNhDQ /u5uxc0RGlZ//kD+a7ueMw2Lc1X6KF+L/J17eqnktGqZx/5II39ifJQD6dZG ZwICG1wW+0jXUUgvYVIHHX77xKnC6VM+E0tYKAw0l/VGXeNNQrUgFqoUylef BDCpkxoFccdShbr28NZIsChgiv/sYpp2CTfn+hd+8d+/QH6p5S335Q8CN4hI iKVtDx0tYs15QBNBlnqQpGQuo7anJcGuC/+wRMYGY6nX2YDRqHzlPLMvbHhs g5p1VtfzO+310MgPsDaXCGkP++r1XiZAd0h1F2DBNC/PM1O/HnrFsuaKsKSR VnLYrkdq8FJeFm6f3ZTwHU9Lbr4eWq3hgWcbsrKweY9ayzzhFEibq/PS6r7R F/x6Qhy+suRBmnTqTrPTXReslZobExAJOAm/Phm6wySXkOjKOoKzAswChbG9 1uCHqQWWyziba+8FSV6ITV2Av3JzPEZIPAZf2og81BDXfFqS4fXQ1M+7JpUS sp3kUdXJzJt3pSodkr5FyvpyMDSArJNykZqOFBcIw7UpAjWP6wN+8syrxKcV Tl6G/voXf/8yfoHBVtpNBtvHML4QpjdtebVvv/lb/4/xp29G/4535Q8C27nm j4SbE8bOzphZkBiIU7yhrjWAUo1dvdjTGp72UtcN/Kj+GC7ikZoApsgfvxHa gyH0dCAFfMPzsSuNg6HJJosnBEgFAkwFuHXALnctKQrvHlAYv8npYGh2kfeK zQ879+amWF51nezyXwGLvxvw+9OCP6+HXoVctTCbVtOk3tfYBu+D+x+ZSjHC 3nWY28HQgDSN5FpIwgLTsOFGDbYNf0/R00IGqP38Wfs3Q8OBsU0xiwSZi3xD 2FsN+xceLIYUyTDpTyvevR4aqZYBM0SqEvEAXyCLB9vXkQUjoiPiA2PXMk+W EYGs4h6RNmtWeMMMbIKcBkE5r22N54Vrs5O5Dm0s5ovNqDUKN6/mgFCNSjC2 XHVLh4nIwdAOuNGi7AAkz5wRiS8SXUoU+S5DEKFM1i4nyzhTLsz2Z7AoyEin VhiJwXNOntSsI44GgHYydI+YaCeHVlXSlSNrlNI1Sk9I0r3QxWuyky2DpBng tC2ykRd2MK5ZW8P3UneYNDs4GNWjLZOx5RTRspO5aJccnTIRbQBeI0elhmxN z/OkvR66FUTOnpmBIWnEVcjFNgTxDZfawiO7FYmDn9y1booFUtN8ZWbViHaX 2tNeEVOfZiSh3i4n/jry4BITfeRzAosWErrGhZ3DZbVBcSBbJ3bdEIn7bkKp A98YtjpQIiK0GrIcTQ2bNeV9MtcC1OrA2FkmeTOzEmRTqm3AteSdEAg64Fc/ GJryD0GwraPy+J/m0HnyyHgWOFWAw5AyQOjJ0AFzjEyx2AjaCZknfODGBsLX CeG28LQkkpETz2fiyKCd9Yo8x+ZhHaSRJDG8JMI6/ACPZh8MDc+BGIL0wUhB H8qsXe06jJ76TL2RRtOfV9B5u2V6Xb725vHfJDwhRtILSzZHmEj3giGDjyfu qSZZwpV0ihYjEibkjHB+yE0nQrrPuHN1OZnraT3a3jnGALddNia52gYqJ09M pRZsM741Oxma/ieTwpbyvlT3IOLfZUaKqkSsw7BS9AjihIjVoshVi8uQ+yLp XeWS7DM4WA0G1C9WjuyaTFO6Gtwyq06UpQSwlASHVRAzO6wQkE+PPB/2I0au 2DhKqlUeWoIPGSUCpi5Si2cJ5QROWpvaaqeATRs82VpmAfBo62Is7ll2DgiX J0NTB7ogz+bpUHL6AVLubkDmS8g7BwxIQckZj5YxbawdFk5JH6hwrtHiQkTA 
/gdGqVWAXFc7GPrBMFEGz58jecVuHPgHtxpKaIIwRhqXsygjSAMpSbY3Kcgx 5205PMeaVIFoa8Wawpp2EmXEHGkepgAZWpgbgct5tn2mi1SlIduspC082TIJ SXysvUbWN4r0kpEhlwWQCcfV0phpkHLtZEKoMrRhCYCn14sp7JxIktxRERKA GvhGQrOHo4huhbKrwSU1sp0prTEgfGnyANQTItB8PzI+uLUesCeLASmQ+dHd gEdmoIRDE8A05EpyMtcFU8myTuzYPN5K4cEoTEfCbiT3bAfcnllPfAh5lAXB Za5Wlg8ytPIMeC6NIt7R+eZqt3G0jMBMowND1klRXvLtw88J5dMoimoAOh1/ k4/QUwHG6Ql4YSCVlkhu3FxK6o6HKfBMoZF24mQZkS+bGrZKbrZn9n4Jz+6I u28w7ZQMedlRlAmjCWZ59Qn/RkxTPc3dG2Bq0rgiI0L1cGLXgEme3WtKA64C CdhAaq3d4J0UW3xQ+q3OerKMnuGCAutJE1YBh7Qz9rc7blYoMjGANK3nk2XM itlGJj46BZ322ghohshC3S6AvT4z4IQcYT5k4IDnEkm2vXfFnYdtwMGKa2Fi xka81zJPtkySvZKTWGOnCtRbKI3pzKcRu+BbJgBmwFqf2HV3ql4ECnJPvm6k cHBXiiZijoEuL8mEo5SUpjeQ2nWyYStufO5COaZICY+k8HvIygCRT+Y6A36Q 8iJQihHL58IT3qod2ehI1p1kB+Mk20Ue0+CsyaanpNtJRt3LCkBJakO4U2FD 1ziJjTmOmTd8duyWkZlmsl3CJZEyZOICoyI87HgSCpRv4o0NaNNT2pPlwp5m QnAPEQaJeY7heYWnN3OtGbGx8UUFhUHHWgUepYkhL0B0qAUuwL3vI39NgsFu Ywv2uO6khLyCKAk/3rDVK0XF5WTogS0eAltNghjcaYVpUMkU+0QrvAn8akmt nxgfYoDm6MJGgjzKnsjmNoJCt+1KRknKVpieLGO+SEYQAhAfVQYAE6aeLwPD RtAFHO6XFNaJD0G6sZt21uCVxK5w270nahi52OhCFQxg1aOKme2GtGhh+ToW cfbGHhES1cVCtn7gCG95nNSvyT2pW0snCyrwOlxpXeyxCFRCha0DS1rQkyij q8xRTTKMGqF7WAJs7ySliQGLiYw9GZOdk6ELcUdUeJKkOwZZcFC57lXDEAQ1 pEzZo9YjH9KB3eHgdE4AaZU5AKJyg9FkbWEh44Dj8xN/vWBjexevI7PhWAtZ /zpsUZGSrcz6QvGSjpwq2Yp3GCTmqTFFivyypQg/Ipu7+2Qt204mhBROVUNd ojqNvJAKFA+kylIFefWQNIR+VGpBhE25seqRlBxusBhgSN9jIEfKbpSTr4j5 J9nuhNPE/kMWlpAedJJOdyAPbspBHve5CFpPhgbsH51d8mQOMbYWsxQaG9Aa 6eLcDAnpnidDsxYeAe2Qd3g0Ix18jMsGsSUegCXnHHJaR3ByFJIWUFepAgYD 3sUQ8Q1fJiKiBxabYztxqgWZF0z66okj/G8b+BczHCl8gBRpxFUa4MmJhTQA mh0mzHvx1b0HUsEDZWdsJmQE12ZPcRzl6HGHNMYm+XTNYfJN84oAqCNkr405 AqtPJwFsR4qqsKoHwKt8v6iZOpNGtkKgBKTUmKaT3VjhIPhylg0NmJsqyH4R 2an308nMPzIA5VlsbKSSv45JsPmkL+zs2pB9EMUDos1F13g21/D38Jp1ygJA T2GwmQ0IZLQ+8XPOPnxi20dFi4j8rc+EvJRSU4Jkq/iKVavxRQEeAkbe0gl0 7xQG9gSnQU756MWQboxmcOK1eVobecyWo7KWbL49kS67YgOOiHhuoUVEB0kJ M1ULfIAdvYBVZOfsXkOCC7+xANcRwspV9iOhkm5Hpo2wf2J8FEzfJWFj0EVR 
Iktw59SAK/CwyPiQkvqR8UVWgRrQqGw3XzAMzDVZ2zLQSCiNNcAJmHL0Dsxg HFjC3SZVMbenVr07LHpL36HT+uI+mZDZ1HotJuy4yz6GK3wpO7N7h/vjG2VN O9vRRk9w0ogvggUErBaAaqQbwPFOEkutgDjxeeqqN7GxIpSXHdeCHcQKpFRn WHuROTjWgHS3YXQ/QaowOXaPh2FI6bLxiH5afbXdEztf8vJKiz/a6KukBIwT 4aDh84GUSLuv1P6uZPBNiDm+11Gdr/a64DSZxcEDpklwrZH1oS5IpEkF3XI+ mWtA6pCaZpsLOIcCx5J3cDik2kgFD6BABsETC9GEe54rUWUAWR0yDkTZjEzA NjtcapQ+SWJ/MvTFGY8U3Zz9nQvfAgLi5inMVrE1SwcCXEcQB0g9kpHSnbVV DGPB9mJDoiDE50YC6HW00XFnQXsZQF9X+Z0NwmFRHhXzArdtytMkRzk6bhn7 3BO7anPXghsXpeolYkBjA2/JorOdgAVgUbgPvjfJZQVCVUbaLEWEggYRuwf7 Rk52I/BA2A2bvAmZxis1DWbe8BsDyEdN3NkZevZS0Ed0KgptinXOxs3iE5EY mImktxiecpInmA8Yr12aVMB2MW8EGGQvSPixlNJjwlZC0G0nPgRrN0dCNFRk t9vY2Lwx6xpW7YvCdDNRE/3krmG2MGXfsTQq6TFDNVJsxg1Ez6Z6w3zZ0Zu7 4vDW+NdLgOkV5AaYIMCyLAhsCohK8bQQTiAOW+d7ERvA0o23y17QPChzh+QR Wx7L3OpRSqpsowS0k9a91m1I5cjO3ykvlEjihcuE2I9ewOZCmBpIDs31BFCC 4S3nmQ6EBSVcILHuUdNd91FGi2xhJkcavF5P8NOZotcLkC1jJx29Jc0TUcom RlYsHwAPmX9HqrmqI9yOjsgGKz8JBQvItJSymODyDQ3teDk7Zfj+ZDak5wg0 R1V34DviGzw1wCMwA9wHkF6xMIGw5/RRL+W1I2BWPSDUbqTO5pOV1TphFbvM EcccQp0kXOjE8wE6zbSGduxsqsGPgccYWrD3KfW3G8tctR2F3QLPoYJAWxd7 H3IFuEQ0u4RnkYPlphMo+yQ2ZiRdie8zKtJnCWw3b5SGgJXDl461oiyqjBwN HSQ17OU8hbiyRFgh8wykTwBTHfiGhcCbE/LrPKgwla+JMqKFfzwP6ufx608u jU8ujU8ujX8GLo3fgQeVe+TiC72+Ki88qCVdW+TxRXmeB/Vfjc7jM5x8snn8 a7J5/PZcG2cO8z4PKu4v3+NBvT5xjwf1mu5jHtQfPn3Gg3rd7k0e1McVz3hQ 8Vm9zYN6fegGD+r193d5UF8+dIMH9fGJGzyoj4m7zYN6bZKbPKjXZ+7woD7u 7b08qP/XKM/zoL6Mcp8H9XrqGzyo19/f4UF93NkZD+pPn303D+p133d5UB8f usuD+vjULR7U6yNf8aDqz3hQk32L4Pbb86DqJw/qs1RT+v+ZB1XPwbaeg209 Btt6DLb1NtjW+7Ub/cDaDRLq9pJaaw7lh4IHT5lSoCMknsN/vnajH1i7KaVG 5qKJZ37DS2qpfPVnwkdRkyu1fLJKoB9WuxEjIU5haaOEGIsV4pM/lNR4mvaP fyzWQrRyo3ajH1i7QZL3UBYqeKwXZSE4onApCylPWN1RFtL38aDq71E4CSm/ KpxQPJrFw6u/hNriV8Wjsa3iWvjKTjwWTrDXfiqc4JvXhZOXype81MDaj5VJ lr7kqoFhx14CMlXqLR7UT2d+x5nfqZwA7J7P/r3KyXsudbNygmz7eHXvVU7e caWbPKh3p+8+D+rdK/ymPKj6UTyoI8Y4VukuGqL0zPeyyVfabcUSbOTurvCI v/5as38//vLl+5f//dufvP/0cj2VDUNKcbtP8VLNbdfYeQp0ycjUTU72jELl 
/36NsVrclltiy5Vsts70VkfEE1Up0dnxNkqY77nGTmGG/T/sXduOJbcN/JV9 zFNCiZIo5WcCURcgQGIDcR42f5+qnl3H8M7Gp3W8niAewEZmJm6pj0SRRR1W cZAlx++SO+lh1a01TJILyUBRRjN56nPUaXjl3cdoMlizuPsqM5qH6ps10H15 foQV9PU5Qt9NJNksq60226o1I7DZrLnBZQJ25sxmvM/MEbNhuRsrqa9C5S6U Z91UiPTg2xvAh7W5n7KrRrUi6d1aqsHynMMJLnvAsuUGgNx78EeUN/7Lnred ABFk4Vywdfgwo+oTCzTrwiHBn2utPT0zh66We0jkq5FZnHpKrfG7tLL7jpiq XbVW+tznSGmHRU28lPpK2VqMuXTY615dLAkLWvdT+wHcOCkmUj0mqRtbkOYs 8Lq9aW7dxdMMHuMzc5CvxJofmRZhYUBwc6xiJbH/essjliWy81NnsPiSYor3 ZQ3rXmwzKHGOsvqsK2pf2JuHaPxfn4PlbbDZujDZqDjULK2sOI4OTL6s2gwa 53Nz7GppapkWYGGNMoM7UrKypDZZKpRmbL23h3zJ3//Ux/f/+I6//WedcthK ZeQ00sorDHdnFbT0mGXxO26ZwMJ6On4m334uawIXxfbM+5LxkJ51BE0RJ6VR W+B4/OS7kMkDY2ph7lrhp7wLi9vhRaovRER4ktPxty/zuMQ9tTLdio+M3zUj osIH7h6tiO/j9a+CQHeJvATKPG6caU2XpoeOMqkeJyusnI7HL4MaMvBvJr1N pcRfQCK/xbH2TS5xbAnxdPylGxGCslcs2kJg1Z7TvCiJNbJrb982lvTT8eO1 KnFgK3NevZIqY9rSpub2EvZRXynZ8ftTKMiGUyacdwVrzaoUowjC0m6zSqZc S3I6vsyRsCqJZekV7ijShcKQLloDMvcRtbk+Uo79+vheqOFVytXiPHshEwpe Isc68lg6ekwaQz8ev20PADNkS8hsG8gsAvyxyBJhICQ2EN8a1vH6i3pPfTO0 eGbleplAgAB+wALrUkYZ7PccHht/rR++//iTvR1SEgBe67AdvDShS1fFQhlg slZqGj4kKffl2LB144JPSlFGBYAUOPt4MXFtG7Ycjh8n7mTsNPBiDoSKz43o EjICcICxCLy+jqQFITqFsY/GBiwlpxJ5ARWVyd+k6CnwA9sajFBCL+Q7n4w9 LbF3PbD6proXhXvbKBv+flD0Qi45BRM5GRvDUP8d/pBoxNZEQpAbSU1U6UQo Hz0/JhXz5dgsIocXJ2EfHovSclYQVAYLciQAWSUsvM+j9e4UY0SQomC4pGWl zRpK7HPWTozTe6oIjXZkg8lFdyP3eJNolSnip0Xy1QhjN7hOig+Ws/fOindO SB8bIBlJ5YBNSGoA1YDVTRCaKOh0ZIOWggBhCCU0F3mbFguhE/ArZdookmq1 HZ0dizTvIXNvrHeDB2cCI5SO6aP3CUwAnJmPbFDbzK6B0mRlwPgyTn70sTAl nOEucGMDn+1oTRaSQl0UpiTpbwhWvzfqbA5AcaoHbu0aVz8Zm1rvgxlWJVsz 9g1XgpxXE9JS5vQ9bfibVc7WhA0CcAwL/KDJdqpsso59G+D81VOhaI5Ha4II jAVQxDPd2D6G4l3TFJuUQEt7U7o45Adj2+4/xFB+ItdmI/SqbeGkzKWCUMae DC0n2sxasBg49Hw0uANR8XsFUheZO+/OYlfkBtsHRcpqG3o8eNrUccU/JSHh 6HCydVFd1yaSQ7hEDA9A8QhR75XBS9k7Iwcvu8nqvQpCsQ58HKXq1YIbbPRj 6WhwgeVV8qiH4ohWYJYiMI+odcLdUsw7bntIOeuVwUnLYLcYwBSTiWygh+0A o0zGkVc6sBASttyPBkfEwfMC3AnHQuIfgk0COESgnrXGEs2XBrOjwe1SQVJg 
wQJMTqH60pbqHgOp0xSS9hxpztmyFPJp6mzeNjtWpIzNQ1qEbUCCZwPRGZFZ RjkanN+TtN1JlHRKxlP6rFChSygFQVmLGYLY2ZoDyQK6Ob15swCDHMgoRukV f4zUqSl4921yZopDHacG4Wy2hoRLpsNILMLZjEoFCmTZPc0z3xItIOJ0BWij egASbATSTAHOBdO0lRGwY2hnG9qxLHCpa/ctAIIwmUmuE46WJZ0T+HMOq/Fs cKOSbslR4EToAnemfChwHdwA8NcIFmD/ZR8NHvquMahNOHEfSCMEWGiXud2H bniYnlazGs6WJcBRebhWF+dy5tDCQOoTm13gjl/IU47jzBSrKfJPZFmBknEG 95iwmwB2WqnoWABkZn40C/35m6+ABEpK7tQBW4vyuhU5c1kUhAAoygoUMP3M K+pOC0EHeWFd8LcLEBexHmEaqW7GJHP1jRB4ZuesU5+CHNYoClnIaAZEGqIN Pr2oIH4s13kWQ8MSiusuKmoBL8POqXAfBgWaR01+6cH42YauumUZ1Z0owBGc AA5OEFECiCJRnBlQ5iFhgVcGH5Q4RZrsSM0X0jci51SUrN3eNlmwOL4P6ei/ tubOSxUgIy/IPgG2pE0sccS615gjttgQp9UOfYuRvhdynAUhk9IeXmvxyb9O 3gaP4aHc3tBfpsghyH1ZQ3UdqF+ZIff+xfx7ldV7ldV7ldX/QpXVb8CQq1SJ vBhyvMHVTwy5Fl4+AX+4yqQeZMj93gq93sPJe53X77PO69tXYZ05zBOGnMpd hhyeuMuQwy9PMORenj5lyOF1bzPksGTHDDkNBww5jfcYcldPt7sMueuhWww5 PnGLIceFO2DIYZ7bDDk8c48hx3d7niH39VHuMOSuUU4YcvjUtxhy+O/vMeT4 ZqcMuc/P/goMObzTfYYcH7rPkONTNxlyeOQLhlz5GUMu8Z/fgCFX3hlyjxYh l/9nhlw5B9vlHGyXY7BdjsF2uQ22y/27m/J2dzehaDUKn+AnNtl7SeuIy68s jZKmMerjdzflLe9uKOv9klq2Zvb57ubS+uMtATPYG3c35Zvd3fzCzc3VpLFW 3tzU2K5Uu3z4A7IyK/hjSXrJAT18b1Pe7N4mlFiTxsu4NH++M8gVu9N4Mdik vNwZHOzIATuu/NaXJry4YhXAy82o6afbjizsaFmvm9Fq123HL12a8C6yfLqV bJ8Uol4uI6/7r2zxUoi6zY57d+R3HPmdW5PUnlj9e7cmz0x189Yk5fPdvXdr 8sRMN9lxd5fvPjvu7gzflB1X3oodJ7HLZoMVkzX2SH3kPkv0omotLDbZpqjt IzUqfx328eNH/i/S8p/ozkqoM3mxXoJt2b21GTW1NLpmdgGJV8uV4wk8winX ufvaXljwVuHUx/CVLC98rEnmlD7STf4rE+AIyNxhByniOXsNLGzukqioPDWW HdkDvR5PoKtOJEIjtrRHiFitli2z62XN02XvtqTs/MQERUNxdrTFYiFY1cZi z97ZlMZWGHNh8v1QGfzrE2TBore9jMsPVEF92lo1hVXYnIc1SbOV7scTVM+a BjvMsTFX3lmae+h51lk7DHfBdBG95/EEQ+uKO6ZLKd/EeyTFJAVlLSErHyUF TzUfT9Bhlxqopl6HhrxwHGqJsTRPhrM22Ue1PaS//JUJDEsEdLJGaOIWR6sA cjmw83ws1P9mB7MRzjeZ7dVSXTk0pfB1TF2C1zFLDfhEcQbPyds63wP2oFLJ sbNqa+NElYaBtTjOclrDsT0x6U7HE2BlUqfgP9sY9aSAvHW3qtqIivvGuVs7 PFS/9foEc5GIM/GeFrCdy2mu+BA4eMkcGcRSdgQ490XwFH2SBBhEdRil2cdq Jn3BVFngevE12xOfoDoLFIVt6SeWac4+SLcKLOkaA5a1W2/z3EwD+0gvJIIj 
z16jq9iw2jaPgJOOUmzG9YhS/dd8UesDkaaQ2l2k7h6RWwObBbHtrsFirOJP fIK0pcM8sc55h9pS7jFtHDyFE8+xLNLIsHZ2PMEuITVW00bzFhQBM6SV1RGN qVTKpn5jwwLOfZF26fClkTWAlG5nW+GN2C8+SJH3/EIVPJ6gtAa0NXadksYc PjoWBUa7c2MPphIBxrz4+SYjiilcsi9FrE+xsfJNYZsYOqSJoD+0ZZXzTWZJ cN8aFRG4exkjSFiBBYgyEoI+CcAl5fXEJ1jd2OnVisJrbGOzuVlJ3mNlqXbv CdHtfAIkCnW0gLNqVhfeF9NhPyrOcchwg/DmMmyee1P45QSrRMyKOvow/L7X hENFvIGP6KOlENMTyA4h3bCLKZH4FhxHwqLa8MqGvrVXksg8SjkPmUnZDoiL YnCgAgRDMQVA0ih5A5FhNwQn8GSC1/7vv3z/t/lHnqq5Pn4Y33/3z/7X7374 EHlb8o/+wf/1zx9hua40PQzrqwCfpzjoKgsr9TsC2ITrNDGNj8fal6l/lHlA 3BswYKCERDqXw4FtGSHHCFAowM3kX+nZ6MYgPbZ1KUggFxwjkRmbm86cUxEc XAmIXmejV4A9JdWFl3DkH0uNOiusrxriYHRWu7UuZ6MH2TMAr1bjS5dSkBqt HJdWnZ46q1/ZFeLw3SMWdsKhszgdWY+OAMxUG1v+IZjMILLy3vFwdKQIgIAa OhKG1nodQN9YkJp3g/vKkw1QkQCcjs4mjRJhhizrRtI223RsaQtuvjMb3rML UTwb3dlGq+PgKbuCIatiI+KVFlE+NlZxSCMS00OLFJLyyPVeBjyPbUXSGz3A Mgdy24bAgWhrp6dJYRTDETeRv7EfcZgI12QAwjXKFgTSuJfsw5XBwTHspO+Z 2GZbEYZKWBMrXh3uzyNApmTzs9GRuBGnNiT/DRCg4JVT1RUHbbMD2JfMQG2H 7458A3uYFXGfyis1sQ4bi2XcW4Q9iowMPTyrbHEfEo5MncgQKjtLsTdnLp7g JRuVWXoCRDjcVeR9DRknwBFbqjg2t+6AY4SQWunHOmYadrirWFJAsEjOR8+V MRNLAqDUkejmRMjBL3vK4bsjVASbQFpk6BtvP+KlgiSld1dj2b0MTHvoI2OD H/SEbA04eOH94ZG19JKnzjUjfP8iveds9D4HO6bVUeAWJ7yXyoTTbWFHiYv8 ZpVVvB+++5BOpZmeLSPPob+cU5bD6YeqvMl3yaWkQz/T2XJ2w+CrxDQRQjts JIW1ZsLKr5RT9tgP132l0W0kQIMSMkwltUqYtfjVjYzRI6xxxXW47jASDAtH ArC4t+BwjRnGbAngrq2cM6BwA4g8RByAmrbGsKvQOWg19k0NFDMqsMSm2Gjg hMPThAHhW1JCAg/0UhCimTB1NorgvUECmmsD/55aJMxCvMKVb0DqnJFGItUA ioFfYCNSV2Sv+XDd1wQakoDYR4+esUg5U2JmkNGHaIJfKaBz6AmGZSw1PCN7 nwG/4CeVsiyZh+zCTSb78xSLARsDY6y6MkUkSkL6BYs3r6nDyZirszPf4cqw UeoEhJTZ40KOh6VAXFq1VSxYDiWXvbbNU5uheFaBZQNwxLnCxZdGSte1XuJw oQJ/13x4VpkEsf98VLLJ2gT48DjZFR2n1kmR32HBWxyue8ZYk2S1KeTxtmwA GRP+BSkrjbKyq28sh5EPZzP1aYA1SJCWwx/GCUev87oIcmAaJfY4tMgJmCiN TSuVF2PFYeQwc2lR1uw1+yppxENMwO/ms1/0sj3hEwG9Qpmpqe5iA6gbCd0A xDncVXgWoPaG0wRXy5ZuJS+2VQQyXhMbkvGB4PNPz+ooAa4mZkpUDKp/kcNn CdEuXkkD3Nvoh2eV8gkk9S6AC4Uf146UN0ef1UtE0MO0LWL8s9ERJnYFQjJK 
iRFzlDWuDG2tZiPCOwLShNN1r8AtlNlySxIaoumOszKYhD0JimOghso6tEjg Op+IPqqpJoSJSSAfNxtzYlfha+A7Q06HK8P+kAo/tkuPIogcVVJEWhz5nRhc fFXy5PXQv/OaajGcJmTbpGsGNaR/8IoB4Cw2qogBex9GPqxvRiIfssaJ1K9v /JBmQoCC2wQkjm3B4ZTDdd8JeLTljdO/yxZHwPaGQzXhGWr1sQf29TGhu9ds BicpbTiCFlW755HwusbsfntFPrWQ8TWkaYdR2xMCKjV2dgNuN8PuRhbDUnpn YdhWKc14ek+QAE9XDXFiefMKTTGgM4YsKv0UH73nHk7xDIU35VJQ42UMzK9G mDs1bJoO+N65zfs8RXpwih3eUR3ZgDMz9ezefCDtnki590S8je0wX61INzJy p8xe8kJhoxH5ZW9GLp+RUNnCub3zBcvn0R/g4ALhfVGlWdOvz8F9L/95r+N8 r+N8r+N86zrOb8+/jYKUnzZ1/aQv7ULhe6VcvG7+YOFx/u3vrZT0PZS8V5L+ PitJv32d55nDPOHfpnKXf4sn7vJvsdxP8G9fnj7l3+J1b/NvOeMp/zbVA/4t HrrFv03tgH97PXSLf8snbvFvuXAH/FvMc5t/i2fu8W/5bs/zb78+yh3+7TXK Cf8Wn/oW/xb//T3+Ld/slH/7+dlfgX+L977Pv+VD9/m3fOom/xaPfMG/tZ/x b3P7c7TfgH9r7/zbR2kO9v/Mv7VzsG3nYNuOwbYdg227Dbbt/r2NveW9DeAI JaBisJh+FBxrWSPlonJr+RIce/Text7u3ublA7x8FA3wg58+Cj8B/loE+Wy5 odJlb6adZpoMYAdptQGdlswLj/bhD7G0JIK/qrAA8M7djb2ddlq9ki8m/FVq Du1Twm+xxevewP7N3rXu3HEjx1fRCyzAazfb//McAclmIwtkEyNGktdP1fmk WIqEzTdUbC+y8gJeAfJw5pB9qSLZ1TbU7lblogZXf4+Nk/w/JiC9DPDjVHwS j9Os+trx4EzQLPFY+XXjJFPQ8PONk1FKsrdhShn5066kUZ2fw1jLyl3JD0pH fVCD+yOYPwnm7985qT8l+Y7Zf7Jz8n2verRzgleV+9V9snPyXW96VIP7fPqe 1uA+f8NvWoP7R4DTL44kf8H8/ePPoAN//uV8XSbwJ6TyLysF/v1ffp6/jvbh r43lZXcYrKxS+08f/oEHlR/+1D/85z+9Hjz7X//y87+dX37hWG+r+PHWngE2 rJ5XYfVXnrT72GdlO4E/DB1xkr7j3vSXn/PpAu9y6aXmtqYr/olU/ehh45fa AFuswlrqe7qrfXv4acU7+w5WTasl/Ezu1rMHh/SGv11NikW6HN6WVF3rVTw4 ZKxRHUnm2BmzIbcFEEqa9T0XDL45/GQ3hsPuCYaJ0Yw8xvvqffd6AiFVMDPD 3nOD7JvDx2nq1SvvR1VM9V7NludCxe/eVfxQdX3my+Fl5unBidBS5MxdUzWK wp/FqwxnzbS2XH99K312mY3NNwCfykhbung/xOURc7BQ9F2Xmb49OenE6qG8 ALe26dCFV5Riu6xJFXSsRM3z1jCrdYWtsMgOIEdhiCfDmkpqUasvigHvGvl2 cuCyvgIWybZVq+9oFsYWPCXxTnUCn0lj7lvD7LzDKAMG45p74tU38baswnVL Q5w5OXgp43Zy+NtnrROzzMIyWApbLEiaY7cjiDwsz3pPo5JvLy2+uOZzTt7Z vCIAZI2clx5ek81T2UZD+u3cS8PER/ZgraDkA9C7hs6GlfZsbSKgHoTOfjv8 Tnm0FGVvrTUMAZkyy+XA1V412GGK4HzrtdO3jDHLslN9rqTwpl0aLEjMix6v vHDe5XJ4h7GoBEJCOzMjbHoSywXTw5uDpSyPPme5NswKsASHRbwpCq/VsxZi 
dDq85TjzaT2Hr1RuLcdFkK/2ypFgJGxbXJlSZz+OgBBIrQO+cGv3uSA1sW1W Hr1X74z6K29njw6ZvCWIdZ5+O7xvRyLvYF1sFRc94x0rFtyhsQXmsVYG+3M9 H/7rv/zlz//8NXzp3ypz5IVFOTszmrKAwQ98r7LApiJDl5nGEmbqd37V68Wf bgIn9np7BSlEu8r5dETbnfpcAYtWR4541+3Cr4aGJ+TsZsG7xEXWyHkbPV5r gonJWXvzsy+GXqGhSI9DJDpvXkZg4DrV6krSeXkxiIxuJqRUGlMZlVXeEhXB BMFiInizP2aCzw+4Zb4YuuEjZQeA1uhjRj21wKhq7jOSHMC4DcfWvC+Gnsgm +DYYbYJ7n4acyR5ziaVkPc4QhA+b1S+GRvI9xTzgE7xCmLRG7V4Hoh1mx7i2 bI/TLoY+W/LmfWr2OHN9wz0H9i2ZE5SXaDOt62JoCcyoARamM+rh/dvDa9bK +5YARMkny3jXzVd3947wCQeZPlj6Q0sRxLaGyCEbIH3pPDouho4w2EQn/ah7 sPxsBiBiZU9fGM9prQ0KitzMNZcR+dsZmhObsM92wgmCJoDs0iGNzWUuhs5j gqOANSG3BtDN2NNy8wMkEisjQu21ucd0MbQ7qINuxDvXDishehos+3cs5PIS 7DvoO26GLiNHZw+sLqILbAdfWUah0onaiIklbj3fxJA0JBfWXEdriN2jpODB PbsMw0DyRJLqW/vNhAD+YgZmDsAYNnwe1GYpVjrryA0mWNmf/N1k8Au7NuTV JC3VXnr3fepMfchcDUZT60JIDXbvuRjaSplrw1nSOSxdXpQJAiGhFkJQUStH gGb1G+NTeEwEy590K0zQ2RncdhwsQJ6WSrC5w83QcyK2eQKSqx1GlmElgAFn gMcCfgzAvrpyjnw1IUkB5ZoLTcQdEHsu3bKsVFr0UQESmTe5ERkkKTijy5RU WFrW0xodKQYBoAdlOsDb9k1QXQWYY9clNk9h67wEL0EcTO5IBxvJizXM5Sby wR8aYilCpzmwtYqCeACN9E5m1qxg3D7lJjw1fOA5nrqehnSrsw9nqf90pESe 95xVY+tNbkTYKKC902pLMDsfoBlsTdcxJsvNluVVdtabCUHM6OCPTeAXZBrq BpiXbbKrLvh13oUVpzcZna3tzHcPBItXyW0dCEgnKtvmsuF3qliKm6F5ICOH 6lAzNcFYyA0NPHQjYTprVoBLEGNvLIS7F+xUBtzYcqGUUwU8ADtaQt0ZGA5l kuod5otSJM4+7EzaKQAGzDdbk6YrJij1gs3Um2UscciluxoiKv4Hfi5JwFuA fGXYTFNGbfUG8yVlBdNcDgwNCv1qsQiY56vTqIW9+QAD042jl1SQAQ+bQQOH APAtLCAAZgUmG0F8kxBd5Gau8+zcOzAsl9fNHcZylAiwIq8Xlh8icK+48ca5 ESJqU8AE8FogMAXzcPxfFkN+OctaTpiSm3i92ma9NMs7DbEE/oiItdsgEFTw Ahr6LuUGmAHiICoBdHgxJNsOPJbzwOrBCIWKifD9dJUKQFaRq2ALAGAHrHWp cJN32pkbLwRMK3u3eYOeWoIJI/2xoxVFvkI7zG9b3utU0AGYI2DJHcQZxbKs lPKEFU7wvLIWoDpy1iKIKk2BpvRqQl7gv4M/T0rwrcMM4LUh0vazaO0nOzDU FQOrHiLdTpFdDXwDb+G+Sj5BCZ4Dy1O74uj44QeYkWMsTw3zmoayCnhaC4Rb ijNUWTcxBPA8HzwMc+MFDkPUZxPRIVWcndemdbBdu7EQJVSHb++GeWBncCA/ YXIBnudHY+KBpewq7UYHODqjJ0md9KCnOgBHnJxX8QIHsBxXdt1hxHARryml omnYyVtZvUh+VF797A87rF5hPswrWyZjqZBrbDby/U0hvIP0MNbxAqQdV8uI 
oN8S6EoqaSOyGo1uMYWBibGmHolRr7ZaNJfYmZvAbLC7GwJGa4UqY5EAHZw4 G25zE/lyBSeV6ArMrsdAB3RMGwkgoTHSghYAeF8ZX9YBWlETshcw8JY98k5w IeB58Gh4j2fY5Tw3y+iZW0BItSm7gI3vCRxPHV0EPVE1/BrnHYvnQwPplXMs KJsLQ8yUs8VQMsnAYCtNu5SerrZa2D7ejiSdu2Xp1Tc3EpC7XiXRgcSbQO5u 7Hp15Ku1wLqQs46zTFlLGmSRRtmSNsD83tVj92vjQ158qTV4qmbstN25wx5w dVabg/pZZC03yygrA/8n6tltjT28xCSx3oPddWF2ubacr4AZPhRJ2zife76a aQcDnwCUwKBh2R0osJabCQF3OYmyuFSH7uCgh6Ahl9pg1eAwHPn0KxyCSYjQ sZSnyUV1v5ivDzUAP26wDypnXBFpPJhnHKkAfli0MRmZkOWR3ROVo6wHgvi8 SQX9gCCCv+Uy10k8iczgNchqbYEhgJn2QIBpN9CdGrK5AT4tfCq4HeDSWtK4 ZYm/2EAjufsdMAOmho+zN/cCcoeDTNj5Cp5i94MhNzUWV9zs860Jn/GxEu/5 wOkBl9KpFAtJWUAHEmIAaNKNhewtxQg7QEkTD3SAokBC12jIaAi33gwI/GqD CDwmAfzWfcRAk1RixqIqD8DUyXD7uoreWcjgdnvuddYJU+CSYgErGMFLv71L Ro7v8yqodmCyngrlSOKAUANWOzLjSbUBN2Tu58NQ1o2FhDZKeHnmZjXmfIOZ CwlkX6w6UuQfcIUhV5S0UQ8DSMkXEgqcHgBn4UUVGb0huIKPIaTY1YQAHYBL gy5nRD4AsUBKkBm2geoxdAU+3uJXW+OLQWnIapTwAiLuTTP3ggMJmHdaR7N6 dQyx8MGBmOQdwOYUZBuYR84Frzjw8o0f4gAkV97YT7RGwbPUuPM2KOwwCuhG EWp4IJQkhLCbLAOqsuF1jblqUX4EQH4qSKR2fKyODUQyLN8YXy2lNDBmsFzm k3QQ/g7v7rTF44eqeRdDgrhhYK2SMzaE5oTfLwFamgBZjVrd4OodbAzLqFc0 yWC4ncd0PEejaAd+vrpa6guYZOZTYCo3y6hCj6ADMoUD4lE5ORqg6gE9Gk0F Flm8XBEOEC3uUA4KGW6LbgtThPBx9mjAEXMSxd5sWvCscbMNQ+q8ZoS5hXlw t1xNzhlIDvB/+NUNxKF+CVBe4zULN8udKsQr6YDJjAQAr2DV8+qUdHpL5pM7 +lRj9Uy5GJ5vHgA1oNjGZgbzxkJapYxOr4nidMaLRQk2l4G2c6oUX8l4a4ub GFJtZCDgolQzA+StKUDZGfdefyqxeCtwXaEnflVykLpU01IkKyAPA7Ish/dc wKWzVDs3IJhCi+yb0ZEheWLQDxijU/YOXgnQo9TGrueGgQ144mosbuSNM4QM UC+qJcLA4eq9HUrkH5Wr7VogEBiFIdEakmBBBJ15yeS5Dxy8IhxyY//q2LhL A7YGC5suVADiFjPFhLiYll9iEbKuHB15BJAd5GLjUw+iReY1S7NaxKeAKWXi qxuXEd0TwQKYMlOWD8BAkciQdljVgjeWmo0Xrm7AAjf4AMAUsbNhylfige4h HNN20ipSQaf7lTfOGWnX0nmxDD7v1OcLdaJJNtgBLlm92dUy9g5qASsAu4Pf ba8zA1AFuBKYzNACoBLr6lZLtQx87aMsL9UBAC1RebkA/eoSapiNCgjfrk7u 4qhOOAq+ffDYalerm4qrxVfpa+6DKHWzO8ldyARnR/4aDeTgWFuygdfhNgmm UjNi9h1Nqra7IUchMxrikgOFIC9o2z31emYJqnz5usrooOI8SDN8tpOHTQy7 krxkr8HbhVc6wNNvgupC3LTTdl0eiERS9fCQcVkMOqojQM1qN+jpRF48FOyI 
Fci9yLVUdeUiRhkNWQCBFvN9tYUIa0YOnx0ZRnngYMi5ZwGcyRHt6/TK47Ub R98ZJN3A0a1RFQ/wI0lv3GM2kxCk9j0l243xBfi5sonDQDj1qK/UAFq33HnD APCMeTPfJLAY5+TBG/JAYcGs0KKuigU8E3ZyzuGt8321oZ9ON5sKSgp4Wl+n R3nKykolW+RNULK80w10B5IEpAYjhRkM4KQyXte3KOgMslrGMEK1qyzD+Gat IX+BJHFQMSBIBNLFO7shm52J5tXWuIvDvhhBk6a+MQuttlzq4H3v8ESpXAQR vdtZWLMgKCXww64Z5BaUyXtBrp2J5+l7co/n2dD/u1jdMPm6pDm3/H+vVvej UO5H1fOPqucfVc9/C1XPv4NiXSnKGuSPf7KPinXlo+be6w/2fsW6v7fC6x/p 5Efd9d9n3fVvXxV9FzCfK9bh++ozxbrXE88U617Tfa1Y9+npO8W61+c+VKx7 e+OdYh2e7Y8V614PPVCse1nVU8W6jw89UKx7e+KBYt3bxD1WrHu956Fi3euZ J4p1b9/2vYp1f22U9yvWfRzluWLd61c/UKx7/fdPFOvevuxOse7XZ79bse71 3U8V694eeqpY9/bUI8W61yNfKdaNLxTr6k9ZfkIe/e0V68YPxbr3ioKM/8+K deMebI97sD2uwfa4BtvjMdgez/duxh+4d6O1vVTFC3lcbm8K4pJSr5Tnl9S0 pwedBsYfuHczciv60rUfrFf+9FMGxVv4U3jd+IGu/fgD926MVejcu2lVWpK3 vZuaLTfu6FSeTo8nezfjj+s2oKy4eDOwMT4Rft5SL+VlYJLeCP/Fqlwo1o3f Y+Pkv7c7PnpYxhq+bSnWnvMnD+OF+teWYhuvZgV4rDySmvsRhZ9E4SdbHogd 97P/bMvje171cMsD0PR6dZ9teXzHmx5KzT2dvudSc0/f8JtKzY2/Cam5v/zH N1TmBGH+oczcaxwtbWTKM8W29yvM5VkpntaLhNtuyUvzsaqz5IxFI74PyPR6 7yVBfsmnk9fIvA1+fO2ImQdvceR0is+Cn1i8UnFov/umzGcjUzHFD/sdvu7w aRqLXQl3Xy31sFZ2KYDo8nxktbME3Nq7+IzcQ0opGQOfgr/S1GZokaPPR66y c+rDc/Zlc8xlkdxYMchu0OyEvKv5WhcjbywuCy57iuytpJ3YZB4DlkDkyu59 l+X+fOTiSSV2V8pLlJGp7hTD02DZlhY3nVXPLBfzPC0l66v7woxUcRmwN2q1 bPwDHM2bqu+X0frc6qrrkumnN0slaXQW2oXt2BW4nBpShb3wLuY5nDeMe8W/ ZDR8dTSNsqbHklc9/pkaMp+P7GVm1kz1ES1ONWmHUl/ipY4aWa1ollRvvvmA ikweLJupqLD8SLUF9XxOP1jK1U68u8rp83nGqGmLwO5y4009zDvmtfmszr6v mKxwfXfN/OdWt0LxfcJ7IIs6d80mjXAWYHgWKGGe04p6ETeU0ih7bNYGydx2 6mpWju01qZXUx6B2xsUKTrhYWT3rDF50Chba0KJZlF9g0T335ho3K9jX6rWr ySzb2CpyHhPPO+CPmCPkWC9l3nxz1gSnkDIiR/dqY2FoHQgf5m5WBtxeUlzY s0bLYyHgtS4NnvJf7J3djly5kYRfxZe+5D+Zfhv+Yq8GC9gw9vH3i6ORIXu0 2D7UeMbe7QEGaEldrFNkMjKCxYwMKtclsvMOuv8fmgB8zvcj532A5EIeGbum rg5yvfSd6ixbHWlLTodwvEDRrmtNZWxCgK3ha4msYFVBBUCEtmgqianjApE6 GcXbWCooTCRXK0hNC2Be6SqI83MTKOEiD1bm1HhOEu2epvjrprohGKtyYGH6 13TlYg9unlOVlWF2eQQQG8e5lZfFNdUknchTYUG+mA1yUo+qEUCxxmpyELMe 
cpY7l4r5OiCy61VshATqI4g7hKUuGZ/pjrytcLw/aUa2YToXz+x0IuJUetBL jKmsOCuJJMUN1smsh6Sb4B4Xu3s4fVwrG1DKM7ERn9K9akk+PUqTBQJ2kWEj STqdIsML3zTFrjuzzrKFo6INtyEi01+gqLd0jkp3HIu1t0sxN7dhkCSCsVUf mNoiSC6i7kCN9mytpTjyhtltv6toBvnEIBy5gs/uYmSgQvZ3hy2SpoOBHWIa Bmby/9H9HW/5zH6xU2Qwwif31Q+ANAvsR+3l2Mm6ftnTOqTxeMPrRhzDp1pc OAc6fqyXmddyS+ncNxD0MCM3yH82n9WSzw68HGjOMlfOYGhyW9QJBIScXozs oJ2W3OmQgRiaM/aFSiM3oRFIYIEtREp8N8+//PufvqOb6nfcLRc4np2ss1sm Qnd24C7UYZKNRtFVmxgjDOODj/PTX7+pCEtQPtBxJcC3+VHAmOeWKhKq7HLY f153QF+PrDp9pFjUlXy5gjlYcpjR1exUbOBmj22VD7tafzMyimtFx6qsLGuj SrLfcsdV7YKNEJA+sVhN70dOmkXXVL8xZgaAraFC0GDQOP5w+Mu2e7p45vnU 4rNaHSIxleYT1ATQXX141YSRLmLb4/3IkLRw1gJce8vqZq26TIAMssynaICy 87J5eD9yGGUsc+gNqI+RNJlkLwOlw1RsOQgH8ty4mOdZtUpaMT69J3/GsYu+ hyBBQ9nUmdtFpOX7kUsqFaLST9obVRYK81q8bk37M22xcXNLZvtipzATTbf0 yb8wtVIkd7sDHsuAY6BLZiVN5YuRB7Rqk+/RNJ0FlF9mSo7/z5k1RatxsqgX sdHJYlO11hNRDjhW31WNhEgA285xAmAUycXuPgny401Ll1KV/w1MCkKkdvUD Vryjy/ztRWxsQKGOuEjKKQaYGzlPPtKDxw9S1nL53B9OF9/GMwS+ZytzsMlH DDAhnSqwckeedDP2RNhFd4FIHeqb5DCkby9O9coRG1olQ052Si/5uFhvcMNY qNl1PJRCaWfB3gKLt81D8fsABFnBdvHMJcvTL5I3EexxQNfqY9riiu8kmqKU t+YuN4g0gXwP9pM50GUQ75zIYMvDt5p3PdaVY/MXOQUS38uaqlrmU1ebhF4J cY+VQOc1VpZfxEXUIb7ka20JAIEHo5g6qCdTyAFa97pKcOSz8H5kVEYLCISl DhgIp8AjLvmSnoG6BEFhL63ki3le5GgAI8Qk+joln6OXao+pKkpUdqdCqvcj ixPPrJP12pcCYUg0PKbOcA7zmQyQc7jAOl+KY2NbFAs+OtWLWYdGNe+Vix95 wnOGXcSzTL/t9F4fP88TlL4rpNUa6F838kkeO/1iBadH7INszqumeIJKXgI7 TlhYjWBds9P6TYbls2uVVketbz967bVW06S40smvTh6nZ17gBlhcfFwuEVmH EPEOsedAkmRL5rLkhOr4FO9HtkVqIpIBTfQoGzHV2fa2PM8JW41hHKQpXuxB NIA6ldfREQVzLOAD/uHViwPwR2+rPpiJv9jdLbaMYu8idvsgRZhYdMhp0Hwd Alhi8H6zU4bsilTguU4cumjQYF7Oshy6IHUy+gf/7IblAm0ZAdqsrpl1cK9M hSaBgNXE7myxs8zvR9YmGxvQMB0yIe52HnU3ZgImPVHZrN7y+WKnqBw+yeMW 4tgDAskjVpDvQKeOcTb4XYZuu76PDdWbIh1YwhxlCT+bGhActPBZ06bY2ISr vx+5Lr8hBmlG1VqOqIObfZBULZr8kXP2nfe62CmxuNEhW0bGasTIqKE/D+ue 7hop9pBnSxfIH2KQqNpB7twpsmNa7WB2UF+iMkOQc8gpNysYoHEw5D3Yg7As KF3lMb2OdZgj4pvJQWVdoCiLtZSSQpBzNtQ/rD5P1Rfg0bzQNaWxL+a57+hn 
--------------000301090406040003030205--
From owner-xfs@oss.sgi.com Thu May 10 08:55:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 08:55:10 -0700 (PDT) Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AFt4fB031181 for ; Thu, 10 May 2007 08:55:06 -0700 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4AFcYjQ028027 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 10 May 2007 10:38:34 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4AFcXV8028020; Thu, 10 May 2007 10:38:33 -0500 Date: Thu, 10 May 2007
10:38:33 -0500 From: Matt Mackall To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? Message-ID: <20070510153832.GQ11115@waste.org> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46433049.4020003@goop.org> User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11375 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > >> David Chinner wrote: > >> > >>> Suspend-resume, eh? > >>> > >>> There's an immediate suspect. Can you test this specifically for us? > >>> i.e. download a known good file set, do some stuff, suspend, resume, > >>> then check the files? If it doesn't show up the first time, can > >>> you do it a few times just to rule it out? > >>> > >> Well, I've been doing suspend-resume with xfs for a while without > >> problems; the problems seem to be recent and easily repeatable. Which > >> just means that it could be a new suspend-resume problem, of course. > >> > > > > Ok. I'm just trying to find a relatively simple test case for the > > problem - seeing as you seem to be able to reliably reproduce this > > we should be able to work out the trigger... 
> >
>
> OK, I was able to reproduce it reliably with a script which did basically:
>
>     for i in `seq 20`; do
>         hg clone -U --pull a b-$i
>         hg verify b-$i    # always OK
>         umount /home
>         sleep 5
>         mount /home
>         hg verify b-$i    # often found truncated files
>     done
>
>
> No suspend/resumes involved. The trees are linux kernel ones, so fairly
> large, but small enough to fit entirely in core. My script also
> captured xfs_bmap before/after output for files which had tended to be
> corrupted in the past, but unfortunately none of them got corrupted in
> these tests. But I do have all the trees lying around to extract more
> detail if you like.
>
> Interestingly, the corruption happened in each case around the same
> place in the tree, often in the sata drivers. I wonder if that was just
> related to the timing of this script.

I guess this pins it as an XFS problem pretty solidly.

This test looks like it should consist solely of open-for-append and
write on about 20k files in the target directory. Because of the
--pull, no hardlinks are involved. It shouldn't be all that different
from doing tar cf - a | tar xf - b.

The files get visited in alphabetical order, so the start of the
corruption may be telling.

--
Mathematics is the supreme nostalgia of our time.
From owner-xfs@oss.sgi.com Thu May 10 14:14:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:14:15 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4ALE7fB007064 for ; Thu, 10 May 2007 14:14:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id HAA29144; Fri, 11 May 2007 07:13:58 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4ALDsAf89781116; Fri, 11 May 2007 07:13:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4ALDm4I90595453; Fri, 11 May 2007 07:13:48 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 07:13:48 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510211348.GC86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46433049.4020003@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11376 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > >> David Chinner wrote: > >> > >>> Suspend-resume, eh? > >>> > >>> There's an immediate suspect. Can you test this specifically for us? > >>> i.e. download a known good file set, do some stuff, suspend, resume, > >>> then check the files? If it doesn't show up the first time, can > >>> you do it a few times just to rule it out? > >>> > >> Well, I've been doing suspend-resume with xfs for a while without > >> problems; the problems seem to be recent and easily repeatable. Which > >> just means that it could be a new suspend-resume problem, of course. > >> > > > > Ok. I'm just trying to find a relatively simple test case for the > > problem - seeing as you seem to be able to reliably reproduce this > > we should be able to work out the trigger... > > > > OK, I was able to reproduce it reliably with a script with did basically: > > for i in `seq 20`; do > hg clone -U --pull a b-$i > hg verify b-$i # always OK > umount /home > sleep 5 > mount /home > hg verify b-$i # often found truncated files > done > > > No suspend/resumes involved. The trees are linux kernel ones, so fairly > large, but small enough to fit entirely in core. 
My script also
> captured xfs_bmap before/after output for files which had tended to be
> corrupted in the past, but unfortunately none of them got corrupted in
> these tests. But I do have all the trees lying around to extract more
> detail if you like.

Ok, so most of the integrity errors are processed by an
error like this:

drivers/scsi/sata_sil24.c index contains -98 extra bytes
unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data

That's an -EIO and not a normal error to report. Are there any
errors in dmesg or syslog corresponding to this?

The errors tend to imply problems decompressing and patching files,
not that truncates are occurring once the files have been patched.
Can you check that what is being pulled from the repository is correct
before it gets uncompressed?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Thu May 10 14:23:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:23:42 -0700 (PDT) Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALNcfB008822 for ; Thu, 10 May 2007 14:23:39 -0700 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4ALNOYW012939 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 10 May 2007 16:23:24 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4ALNNob012937; Thu, 10 May 2007 16:23:23 -0500 Date: Thu, 10 May 2007 16:23:23 -0500 From: Matt Mackall To: David Chinner Cc: Jeremy Fitzhardinge , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
Message-ID: <20070510212323.GS11115@waste.org> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510211348.GC86004887@sgi.com> User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11377 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 07:13:48AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 07:46:33AM -0700, Jeremy Fitzhardinge wrote: > > David Chinner wrote: > > > On Wed, May 09, 2007 at 05:54:09PM -0700, Jeremy Fitzhardinge wrote: > > > > > >> David Chinner wrote: > > >> > > >>> Suspend-resume, eh? > > >>> > > >>> There's an immediate suspect. Can you test this specifically for us? > > >>> i.e. download a known good file set, do some stuff, suspend, resume, > > >>> then check the files? If it doesn't show up the first time, can > > >>> you do it a few times just to rule it out? > > >>> > > >> Well, I've been doing suspend-resume with xfs for a while without > > >> problems; the problems seem to be recent and easily repeatable. Which > > >> just means that it could be a new suspend-resume problem, of course. > > >> > > > > > > Ok. I'm just trying to find a relatively simple test case for the > > > problem - seeing as you seem to be able to reliably reproduce this > > > we should be able to work out the trigger... 
> > >
> >
> > OK, I was able to reproduce it reliably with a script which did basically:
> >
> >     for i in `seq 20`; do
> >         hg clone -U --pull a b-$i
> >         hg verify b-$i    # always OK
> >         umount /home
> >         sleep 5
> >         mount /home
> >         hg verify b-$i    # often found truncated files
> >     done
> >
> >
> > No suspend/resumes involved. The trees are linux kernel ones, so fairly
> > large, but small enough to fit entirely in core. My script also
> > captured xfs_bmap before/after output for files which had tended to be
> > corrupted in the past, but unfortunately none of them got corrupted in
> > these tests. But I do have all the trees lying around to extract more
> > detail if you like.
>
> Ok, so most of the integrity errors are processed by an
> error like this:
>
> drivers/scsi/sata_sil24.c index contains -98 extra bytes
> unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data
>
> That's an -EIO and not a normal error to report. Are there any
> errors in dmesg or syslog corresponding to this?
>
> The errors tend to imply problems decompressing and patching files,
> not that truncates are occurring once the files have been patched.
> Can you check that what is being pulled from the repository is correct
> before it gets uncompressed?

Notice that verify gets run twice. Before unmount, it's fine; after
remount, it's not.

That message saying that the file contains -98 extra bytes is Mercurial
detecting the truncation before it tries to read and decompress the
truncated bit.

--
Mathematics is the supreme nostalgia of our time.
From owner-xfs@oss.sgi.com Thu May 10 14:32:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:32:28 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALWPfB010384 for ; Thu, 10 May 2007 14:32:26 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id AF1BB2C8048; Thu, 10 May 2007 14:31:38 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 51A9E2C8043; Thu, 10 May 2007 14:31:38 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:31:38 -0700 (PDT) Message-ID: <46438F67.9060503@goop.org> Date: Thu, 10 May 2007 14:32:23 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> In-Reply-To: <20070510211348.GC86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11378 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs

David Chinner wrote:
> Ok, so most of the integrity errors are processed by an
> error like this:
>
> drivers/scsi/sata_sil24.c index contains -98 extra bytes
> unpacking file drivers/scsi/sata_sil24.c 5715cdfceaca: Error -5 while decompressing data
>
> That's an -EIO and not a normal error to report. Are there any
> errors in dmesg or syslog corresponding to this?
>

No, that's an error code from zlib:

    #define Z_BUF_ERROR    (-5)

I think it means it got a truncated buffer while decompressing.

> The errors tend to imply problems decompressing and patching files,
> not that truncates are occurring once the files have been patched.
> Can you check that what is being pulled from the repository is correct
> before it gets uncompressed?
>

The hg verify checks the integrity of all the files by decompressing
them and making sure their sha1 hashes are correct. The fact that the
first hg verify passed is a very strong check that the whole repo's
integrity is sound, both in structure and content. The second failing
hg verify's messages are all related to truncation. I haven't checked
this comprehensively, but in every instance I've checked the files are
identical up to the truncation point.

All the error messages are consistent with pure truncation, not content
differences or IO errors.
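
The Z_BUF_ERROR reading is easy to confirm outside Mercurial. What follows
is a minimal sketch, assuming Python's standard zlib module (whose error
text begins the same way as the one in the hg output) and a made-up
payload. try_decompress() is a hypothetical helper, and chopping 98 bytes
off a valid stream only mirrors the "-98 extra bytes" message; it does not
reproduce the XFS problem itself.

```python
import hashlib
import zlib


def try_decompress(blob):
    """Return (True, payload_length) on success, (False, zlib_message) on failure."""
    try:
        return True, len(zlib.decompress(blob))
    except zlib.error as exc:
        return False, str(exc)


# Deterministic, poorly compressible payload (assumption: any data would do),
# so the compressed stream is comfortably longer than 98 bytes.
payload = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(400))
stream = zlib.compress(payload)

# The intact stream decompresses; the same stream with its tail missing
# fails with zlib's Z_BUF_ERROR (-5), not an errno from the kernel.
print(try_decompress(stream))
print(try_decompress(stream[:-98]))
```

So a "-5" from the decompressor is what a short file looks like from
userspace, and it need not have anything to do with EIO (errno 5).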
J From owner-xfs@oss.sgi.com Thu May 10 14:46:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:46:34 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALkVfB012587 for ; Thu, 10 May 2007 14:46:32 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 934AE2C8048; Thu, 10 May 2007 14:45:44 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 0EE002C8043; Thu, 10 May 2007 14:45:43 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:45:42 -0700 (PDT) Message-ID: <464392B4.3070009@goop.org> Date: Thu, 10 May 2007 14:46:28 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Chuck Ebbert CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> In-Reply-To: <46439185.5060207@redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11379 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Chuck Ebbert wrote: > What CPU architecture is this happening on? Not i686 with PAE by > any chance? Yes. Why? 
J From owner-xfs@oss.sgi.com Thu May 10 14:49:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:49:38 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALnZfB013092 for ; Thu, 10 May 2007 14:49:36 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id E76F82C804A; Thu, 10 May 2007 14:48:48 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id BFAA32C8043; Thu, 10 May 2007 14:48:46 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:48:46 -0700 (PDT) Message-ID: <4643936B.8060708@goop.org> Date: Thu, 10 May 2007 14:49:31 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510211348.GC86004887@sgi.com> <46438F67.9060503@goop.org> In-Reply-To: <46438F67.9060503@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11380 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > I haven't checked > this comprehensively I just did. They're all pure truncations. 
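
A "pure truncation" check of this kind can be scripted roughly as below.
This is a sketch only, not the script used in the thread: classify() and
compare_trees() are hypothetical helper names, and it assumes a known-good
copy of the tree is still around to compare against and that each file
fits in memory.

```python
import os
import tempfile


def classify(before, after):
    """Compare a known-good file body with the suspect one."""
    if after == before:
        return "identical"
    if len(after) < len(before) and before.startswith(after):
        return "truncated"  # suspect is a strict prefix of the good copy
    return "differs"        # content mismatch, or suspect is longer


def compare_trees(good_root, suspect_root):
    """Return [(relative_path, verdict)], visiting directories and files in sorted order."""
    results = []
    for dirpath, dirnames, filenames in os.walk(good_root):
        dirnames.sort()  # make os.walk descend alphabetically
        for name in sorted(filenames):
            good_path = os.path.join(dirpath, name)
            rel = os.path.relpath(good_path, good_root)
            suspect_path = os.path.join(suspect_root, rel)
            if not os.path.isfile(suspect_path):
                results.append((rel, "missing"))
                continue
            with open(good_path, "rb") as g, open(suspect_path, "rb") as s:
                results.append((rel, classify(g.read(), s.read())))
    return results


# Tiny self-contained demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as good, tempfile.TemporaryDirectory() as bad:
    for root, body in ((good, b"0123456789"), (bad, b"012345")):
        with open(os.path.join(root, "a.i"), "wb") as f:
            f.write(body)
    for rel, verdict in compare_trees(good, bad):
        print(rel, verdict)
```

Walking in sorted order keeps the report in the same order the files were
written, which makes it easier to see where the damage starts.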
J From owner-xfs@oss.sgi.com Thu May 10 14:51:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:51:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALpXfB013510 for ; Thu, 10 May 2007 14:51:34 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALpVx1013624; Thu, 10 May 2007 17:51:31 -0400 Received: from mail.boston.redhat.com (mail.boston.redhat.com [172.16.76.12]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALpUUE007437; Thu, 10 May 2007 17:51:30 -0400 Received: from [172.16.83.145] (dhcp83-145.boston.redhat.com [172.16.83.145]) by mail.boston.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l4ALpTJA025798; Thu, 10 May 2007 17:51:29 -0400 Message-ID: <464393E1.3050705@redhat.com> Date: Thu, 10 May 2007 17:51:29 -0400 From: Chuck Ebbert Organization: Red Hat User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> In-Reply-To: <464392B4.3070009@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11381 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cebbert@redhat.com Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > Chuck Ebbert wrote: >> What CPU architecture is this happening on? Not i686 with PAE by >> any chance? > > Yes. Why? I have a bug report where NFS files are corrupted only with PAE clients. 
Corruption is at the end of the (newly untarred) files. Doesn't happen without PAE. From owner-xfs@oss.sgi.com Thu May 10 14:54:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 14:54:32 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ALsRfB014138 for ; Thu, 10 May 2007 14:54:29 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 8762C2C8048; Thu, 10 May 2007 14:53:40 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 182372C8043; Thu, 10 May 2007 14:53:40 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 14:53:39 -0700 (PDT) Message-ID: <46439491.9010604@goop.org> Date: Thu, 10 May 2007 14:54:25 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Chuck Ebbert CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> In-Reply-To: <464393E1.3050705@redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11382 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Chuck Ebbert wrote: > Jeremy Fitzhardinge wrote: > >> Chuck Ebbert wrote: >> >>> What CPU architecture is this happening on? Not i686 with PAE by >>> any chance? >>> >> Yes. Why? >> > > I have a bug report where NFS files are corrupted only with PAE clients. 
> Corruption is at the end of the (newly untarred) files. Doesn't happen
> without PAE.
>

Hm, suggestive, but I'm not convinced. Two differences to this situation:

   1. Immediately after the clone ("untar"), the contents are completely
      OK; it's only after a umount/mount cycle that problems appear
   2. There's no corruption as such; the files are just too short. And
      it seems they're at a previously OK length, not some random size.

    J

From owner-xfs@oss.sgi.com Thu May 10 15:05:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:05:19 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AM5FfB015355 for ; Thu, 10 May 2007 15:05:16 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALfQWk007310; Thu, 10 May 2007 17:41:26 -0400 Received: from mail.boston.redhat.com (mail.boston.redhat.com [172.16.76.12]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l4ALfPM4004474; Thu, 10 May 2007 17:41:26 -0400 Received: from [172.16.83.145] (dhcp83-145.boston.redhat.com [172.16.83.145]) by mail.boston.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l4ALfPUP024073; Thu, 10 May 2007 17:41:25 -0400 Message-ID: <46439185.5060207@redhat.com> Date: Thu, 10 May 2007 17:41:25 -0400 From: Chuck Ebbert Organization: Red Hat User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> In-Reply-To: <46426194.3040403@goop.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11383 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cebbert@redhat.com Precedence: bulk X-list: xfs Jeremy Fitzhardinge wrote: > David Chinner wrote: >> Seems very unlikely. Have you unmounted and mounted the filesystem >> (or rebooted or suspended) between the files being seen good and >> the files being seen bad? >> > > There was definitely a suspend-resume, and maybe a reboot. I'll try > again later on. > What CPU architecture is this happening on? Not i686 with PAE by any chance? From owner-xfs@oss.sgi.com Thu May 10 15:40:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:40:27 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AMeKfB018533 for ; Thu, 10 May 2007 15:40:22 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA02051; Fri, 11 May 2007 08:40:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AMdvAf90673838; Fri, 11 May 2007 08:39:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AMdpXS90607975; Fri, 11 May 2007 08:39:51 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 08:39:50 +1000 From: David Chinner To: "Amit K. 
Arora" Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070510223950.GD86004887@sgi.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510115620.GB21400@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11384 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > I have the updated patches ready which take care of Andrew's comments. > > > Will run some tests and post them soon. > > > > > > But, before submitting these patches, I think it will be better to > > > finalize on certain things which might be worth some discussion here: > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > in this case. I also tend to agree with them. Does anyone has an > > > argument in favor of changing the filesize ? 
If not, I will remove the > > > code which changes the filesize, before I resubmit the concerned ext4 > > > patch. > > > > I think there needs to be both. If we don't have a mechanism to atomically > > change the file size with the preallocation, then applications that use > > stat() to work out if they need to preallocate more space will end up > > racing. > > By "both" above, do you mean we should give user the flexibility if it wants > the filesize changed or not ? It can be done by having *two* modes for > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > will change the filesize if required (i.e. when allocation is beyond EOF) > and also update [cm]time. This way, the application can decide what it > wants. Yes, that's right. > This will be helpfull for the partial allocation scenario also. Think of the > case when we do not change the filesize in fallocate() and expect > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > Now if fallocate() results in a partial allocation with -ENOSPC error > returned, applications/posix_fallocate() will not know for what length > ftruncate() has to be called. :( Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely preallocated the space asked.... > Hence it may be a good idea to give user the flexibility if it wants to > atomically change the file size with preallocation or not. But, with more > flexibility there comes inconsistency in behavior, which is worth > considering. We've got different modes to specify different behaviour. 
That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation behaviours.... > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > normal (non-preallocated) blocks (blocks allocated via regular > > > write/truncate operations) also (i.e. work as punch()) ? > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > i did for FA_UNALLOCATE as well. > > Ok. But, some people may not expect/like this. I think, we can keep it on > the backburner for a while, till other issues are sorted out. How can it be a "backburner" issue when it defines the implementation? I've already implemented some thing in XFS that sort of does what I think that the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation.... 
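As an illustration only (not taken from any of the posted patches), the two modes discussed above could be modelled with a toy in-memory inode. FA_ALLOCATE and FA_PREALLOCATE are the mode names proposed in this thread; `toy_inode`, `toy_fallocate` and the `free_space` pool are hypothetical, and the sketch assumes space is always reserved contiguously from offset 0 and that a request is all-or-nothing, so a failed call never changes the file size:

```c
#include <assert.h>
#include <errno.h>

/* Mode names as proposed in this thread; the enum values are placeholders. */
enum toy_fa_mode { FA_ALLOCATE, FA_PREALLOCATE };

/* Hypothetical stand-in for an inode, tracking only what matters here. */
struct toy_inode {
	long size;       /* file size as reported by stat() */
	long allocated;  /* bytes of backing space reserved, from offset 0 */
};

/*
 * Toy fallocate(): reserve space for [offset, offset + len) out of a pool
 * of *free_space bytes. Either the whole range is reserved or nothing
 * changes, so the size is never extended over a partial allocation.
 */
static int toy_fallocate(struct toy_inode *ip, long *free_space,
			 enum toy_fa_mode mode, long offset, long len)
{
	long end = offset + len;
	long need = end > ip->allocated ? end - ip->allocated : 0;

	if (need > *free_space)
		return -ENOSPC;	/* no partial allocation, size untouched */

	*free_space -= need;
	ip->allocated += need;

	/* Only FA_ALLOCATE moves EOF; FA_PREALLOCATE leaves the size alone. */
	if (mode == FA_ALLOCATE && end > ip->size)
		ip->size = end;
	return 0;
}
```

In this model the size update happens in the same call as the reservation, which is the atomicity that an application polling stat() would rely on.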
Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 15:58:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 15:58:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AMwhfB019892 for ; Thu, 10 May 2007 15:58:45 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA02503; Fri, 11 May 2007 08:58:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AMwbAf90654675; Fri, 11 May 2007 08:58:38 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AMwYWr90638489; Fri, 11 May 2007 08:58:34 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 08:58:34 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510225834.GF86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46439491.9010604@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11385 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 02:54:25PM -0700, Jeremy Fitzhardinge wrote: > Chuck Ebbert wrote: > > Jeremy Fitzhardinge wrote: > > > >> Chuck Ebbert wrote: > >> > >>> What CPU architecture is this happening on? Not i686 with PAE by > >>> any chance? > >>> > >> Yes. Why? > >> > > > > I have a bug report where NFS files are corrupted only with PAE clients. > > Corruption is at the end of the (newly untarred) files. Doesn't happen > > without PAE. > > > > Hm, suggestive, but I'm not convinced. Two differences to this situation: > > 1. Immediately after the clone ("untar"), the contents are completely > OK; it's only after a umount/mount cycle to problems appear > 2. There's no corruption as such; the files are just too short. And > it seems they're at a previously OK length, not some random size. Just to confirm this isn't a result of a recent change, can you reproduce this on a 2.6.20 or 2.6.21 kernel? (sorry if you've already done this - I've juggling some many things at once it's easy to forget little things). Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:07:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:07:40 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4AN7WfB020556 for ; Thu, 10 May 2007 16:07:33 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 750E12C8048; Thu, 10 May 2007 16:06:45 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 1DF2F2C8043; Thu, 10 May 2007 16:06:45 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 16:06:45 -0700 (PDT) Message-ID: <4643A5B2.3060906@goop.org> Date: Thu, 10 May 2007 16:07:30 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> In-Reply-To: <20070510225834.GF86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11386 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Just to confirm this isn't a result of a recent change, can you reproduce > this on a 2.6.20 or 2.6.21 kernel? 
(sorry if you've already done this - I've juggling > some many things at once it's easy to forget little things). It is the result of a recent change. I had seen no problem until around 2.6.21-git8-11. I will try again with a plain 2.6.21 kernel, just to confirm. J From owner-xfs@oss.sgi.com Thu May 10 16:08:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:08:12 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4AN82fB020669 for ; Thu, 10 May 2007 16:08:04 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA02814; Fri, 11 May 2007 09:08:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4AN7vAf90414844; Fri, 11 May 2007 09:07:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4AN7t2o89605496; Fri, 11 May 2007 09:07:55 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 09:07:55 +1000 From: David Chinner To: Chuck Ebbert Cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510230755.GG86004887@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <464393E1.3050705@redhat.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11387 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 05:51:29PM -0400, Chuck Ebbert wrote: > Jeremy Fitzhardinge wrote: > > Chuck Ebbert wrote: > >> What CPU architecture is this happening on? Not i686 with PAE by > >> any chance? > > > > Yes. Why? > > I have a bug report where NFS files are corrupted only with PAE clients. > Corruption is at the end of the (newly untarred) files. Doesn't happen > without PAE. Chuck, can you post a pointer to this thread? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:27:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:27:47 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4ANRafB025565 for ; Thu, 10 May 2007 16:27:39 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA03331; Fri, 11 May 2007 09:27:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4ANRWAf90008022; Fri, 11 May 2007 09:27:32 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4ANRT8U90616905; Fri, 11 May 2007 09:27:29 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 09:27:29 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070510232729.GH86004887@sgi.com> References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4643A5B2.3060906@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11388 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 04:07:30PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Just to confirm this isn't a result of a recent change, can you reproduce > > this on a 2.6.20 or 2.6.21 kernel? (sorry if you've already done this - I've juggling > > some many things at once it's easy to forget little things). > > It is the result of a recent change. I had seen no problem until around > 2.6.21-git8-11. I will try again with a plain 2.6.21 kernel, just to > confirm. Ok, this is important to know because we merged a mod around that time that changes the way we handle the updates to the file size i.e. the fix for the NULL-files-on-crash problem: http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 and that means the size of the file is not updated to the incore cached inode until after the data write is complete. The symptoms being seen would match with an inode-not-being-written-after-last-data-write-bug in this mod.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 16:49:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 16:49:43 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4ANnefB028196 for ; Thu, 10 May 2007 16:49:41 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id D3E782C804B; Thu, 10 May 2007 16:48:51 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 965EA2C8047; Thu, 10 May 2007 16:48:50 -0700 (PDT) Received: from [10.100.2.2] (207.47.60.4.static.nextweb.net [207.47.60.4]) by lurch.goop.org (Postfix) with ESMTP; Thu, 10 May 2007 16:48:50 -0700 (PDT) Message-ID: <4643AF8F.5040705@goop.org> Date: Thu, 10 May 2007 16:49:35 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> In-Reply-To: <20070510232729.GH86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11389 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > Ok, this is important to kow becase we merged a mod around that time > that changes the way we handle the updates to the file size i.e. 
the > fix for the NULL-files-on-crash problem: > > http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 > > and that means the size of the file is not updated to the incore > cached inode until after the data write is complete. The symptoms > being seen would match with a inode-not-being-written-after-last- > data-write-bug in this mod.... > Yes, that does look like a good candidate. Should I try to before-and-after this change? J From owner-xfs@oss.sgi.com Thu May 10 17:33:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 17:33:12 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B0X7fB001179 for ; Thu, 10 May 2007 17:33:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA04793; Fri, 11 May 2007 10:33:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B0X0Af90668740; Fri, 11 May 2007 10:33:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B0WvOp90586679; Fri, 11 May 2007 10:32:57 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 10:32:57 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070511003257.GL86004887@sgi.com> References: <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4643AF8F.5040705@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11390 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 10, 2007 at 04:49:35PM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > Ok, this is important to kow becase we merged a mod around that time > > that changes the way we handle the updates to the file size i.e. the > > fix for the NULL-files-on-crash problem: > > > > http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba87ea699ebd9dd577bf055ebc4a98200e337542 > > > > and that means the size of the file is not updated to the incore > > cached inode until after the data write is complete. The symptoms > > being seen would match with a inode-not-being-written-after-last- > > data-write-bug in this mod.... > > > > Yes, that does look like a good candidate. Should I try to > before-and-after this change? Yes please! Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 10 17:36:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 17:36:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B0aEfB001875 for ; Thu, 10 May 2007 17:36:16 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA04854; Fri, 11 May 2007 10:36:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B0a7Af90635377; Fri, 11 May 2007 10:36:08 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B0a7LC90645469; Fri, 11 May 2007 10:36:07 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 10:36:07 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: Concurrent Multi-File Data Streams Message-ID: <20070511003606.GB85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11391 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Concurrent Multi-File Data Streams In media spaces, video is often stored in a frame-per-file format. When dealing with uncompressed realtime HD video streams in this format, it is crucial that files do not get fragmented and that multiple files are placed contiguously on disk. 
When multiple streams are being ingested and played out at the same time, it is critical that the filesystem does not cross the streams and interleave them together as this creates seek and readahead cache miss latency and prevents both ingest and playout from meeting frame rate targets.

This patch introduces a "stream of files" concept into the allocator to place all the data from a single stream contiguously on disk so that RAID array readahead can be used effectively. Each additional stream gets placed in a different allocation group within the filesystem, thereby ensuring that we don't cross any streams. When an AG fills up, we select a new AG for the stream that is not in use.

The core of the functionality is the stream tracking - each inode that we create in a directory needs to be associated with the directory's stream. Hence every time we create a file, we look up the directory's stream object and associate the new file with that object.

Once we have a stream object for a file, we use the AG that the stream object points to for allocations. If we can't allocate in that AG (e.g. it is full) we move the entire stream to another AG. Other inodes in the same stream are moved to the new AG on their next allocation (i.e. lazy update).

Stream objects are kept in a cache and hold a reference on the inode. Hence the inode cannot be reclaimed while there is an outstanding stream reference. This means that on unlink we need to remove the stream association and we also need to flush all the associations on certain events that want to reclaim all unreferenced inodes (e.g. filesystem freeze).

The following patch survives XFSQA with timeouts set to minimum, default, 500s and maximum. The patch has not had a great deal of low memory testing, and the object cache may need a shrinker interface to work in low memory conditions.

Comments?
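The stream tracking described above can be sketched roughly as follows. This is a simplified userspace model and not the code in the patch: every `toy_*` name, `NUM_AGS` and `NO_AG` are hypothetical, an AG is reduced to a free-block counter plus a count of streams using it, and the MRU cache, inode references and locking are left out entirely.

```c
#include <assert.h>

#define NUM_AGS 4	/* hypothetical number of allocation groups */
#define NO_AG   (-1)	/* stands in for "no AG assigned yet" */

/* Toy per-AG state: free blocks and how many streams currently use it. */
struct toy_ag {
	long free_blocks;
	int  streams;
};

/* Toy stream object: one per directory, shared by the files created in it. */
struct toy_stream {
	int ag;
};

/* Toy inode: its stream, plus the AG it last allocated from. */
struct toy_file {
	struct toy_stream *stream;
	int last_ag;
};

/* Find an AG that no stream is using and that can hold 'len' blocks. */
static int toy_pick_unused_ag(const struct toy_ag *ags, long len)
{
	for (int i = 0; i < NUM_AGS; i++)
		if (ags[i].streams == 0 && ags[i].free_blocks >= len)
			return i;
	return NO_AG;
}

/* On create, associate the new file with its parent directory's stream. */
static void toy_file_create(struct toy_file *f, struct toy_stream *dir_stream)
{
	f->stream = dir_stream;
	f->last_ag = NO_AG;
}

/*
 * Allocate 'len' blocks for a file from its stream's AG. If the stream has
 * no AG yet, or its AG cannot satisfy the request, the whole stream moves
 * to an unused AG. Returns the AG used, or NO_AG if none fits.
 */
static int toy_alloc(struct toy_ag *ags, struct toy_file *f, long len)
{
	struct toy_stream *s = f->stream;

	if (s->ag == NO_AG || ags[s->ag].free_blocks < len) {
		int new_ag = toy_pick_unused_ag(ags, len);

		if (new_ag == NO_AG)
			return NO_AG;
		if (s->ag != NO_AG)
			ags[s->ag].streams--;
		ags[new_ag].streams++;
		s->ag = new_ag;
	}

	assert(ags[s->ag].free_blocks >= len);
	ags[s->ag].free_blocks -= len;

	/* Lazy update: a file only notices a moved stream when it allocates. */
	f->last_ag = s->ag;
	return s->ag;
}
```

In this model a second stream never lands in an AG that another stream already occupies, and a file created earlier in a stream keeps its old `last_ag` until its next allocation, which is the lazy update behaviour described above.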
Credits: The original filestream allocator on Irix was written by Glen Overby, the Linux port and rewrite by Nathan Scott and Sam Vaughan (none of whom work at SGI any more). I just picked the pieces and beat it repeatedly with a big stick until it passed XFSQA. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/Makefile-linux-2.6 | 2 fs/xfs/linux-2.6/xfs_globals.c | 1 fs/xfs/linux-2.6/xfs_linux.h | 1 fs/xfs/linux-2.6/xfs_sysctl.c | 11 fs/xfs/linux-2.6/xfs_sysctl.h | 2 fs/xfs/quota/xfs_qm.c | 3 fs/xfs/xfs_ag.h | 1 fs/xfs/xfs_bmap.c | 337 +++++++++++++++++ fs/xfs/xfs_clnt.h | 2 fs/xfs/xfs_dinode.h | 4 fs/xfs/xfs_filestream.c | 777 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_filestream.h | 59 +++ fs/xfs/xfs_fs.h | 1 fs/xfs/xfs_fsops.c | 2 fs/xfs/xfs_inode.c | 17 fs/xfs/xfs_mount.c | 11 fs/xfs/xfs_mount.h | 4 fs/xfs/xfs_mru_cache.c | 607 ++++++++++++++++++++++++++++++++ fs/xfs/xfs_mru_cache.h | 225 +++++++++++ fs/xfs/xfs_vfsops.c | 25 + fs/xfs/xfs_vnodeops.c | 28 + 21 files changed, 2114 insertions(+), 6 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/Makefile-linux-2.6 =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/Makefile-linux-2.6 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/Makefile-linux-2.6 2007-05-10 17:24:12.975025602 +1000 @@ -54,6 +54,7 @@ xfs-y += xfs_alloc.o \ xfs_dir2_sf.o \ xfs_error.o \ xfs_extfree_item.o \ + xfs_filestream.o \ xfs_fsops.o \ xfs_ialloc.o \ xfs_ialloc_btree.o \ @@ -67,6 +68,7 @@ xfs-y += xfs_alloc.o \ xfs_log.o \ xfs_log_recover.o \ xfs_mount.o \ + xfs_mru_cache.o \ xfs_rename.o \ xfs_trans.o \ xfs_trans_ail.o \ Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_globals.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_globals.c 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_globals.c 2007-05-10 17:24:12.987024029 +1000 @@ -49,6 +49,7 @@ xfs_param_t 
xfs_params = { .inherit_nosym = { 0, 0, 1 }, .rotorstep = { 1, 1, 255 }, .inherit_nodfrg = { 0, 1, 1 }, + .fstrm_timer = { 1, 50, 3600*100}, }; /* Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_linux.h 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_linux.h 2007-05-10 17:24:12.991023505 +1000 @@ -132,6 +132,7 @@ #define xfs_inherit_nosymlinks xfs_params.inherit_nosym.val #define xfs_rotorstep xfs_params.rotorstep.val #define xfs_inherit_nodefrag xfs_params.inherit_nodfrg.val +#define xfs_fstrm_centisecs xfs_params.fstrm_timer.val #define current_cpu() (raw_smp_processor_id()) #define current_pid() (current->pid) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_sysctl.c 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.c 2007-05-10 17:24:12.991023505 +1000 @@ -243,6 +243,17 @@ static ctl_table xfs_table[] = { .extra1 = &xfs_params.inherit_nodfrg.min, .extra2 = &xfs_params.inherit_nodfrg.max }, + { + .ctl_name = XFS_FILESTREAM_TIMER, + .procname = "filestream_centisecs", + .data = &xfs_params.fstrm_timer.val, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &xfs_params.fstrm_timer.min, + .extra2 = &xfs_params.fstrm_timer.max, + }, /* please keep this the last entry */ #ifdef CONFIG_PROC_FS { Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_sysctl.h 2007-05-10 17:22:43.486754830 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_sysctl.h 2007-05-10 17:24:12.991023505 +1000 @@ -50,6 +50,7 @@ typedef struct xfs_param { xfs_sysctl_val_t inherit_nosym; /* Inherit the "nosymlinks" flag. 
*/ xfs_sysctl_val_t rotorstep; /* inode32 AG rotoring control knob */ xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */ + xfs_sysctl_val_t fstrm_timer; /* Filestream dir-AG assoc'n timeout. */ } xfs_param_t; /* @@ -89,6 +90,7 @@ enum { XFS_INHERIT_NOSYM = 19, XFS_ROTORSTEP = 20, XFS_INHERIT_NODFRG = 21, + XFS_FILESTREAM_TIMER = 22, }; extern xfs_param_t xfs_params; Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2007-05-10 17:24:12.995022981 +1000 @@ -196,6 +196,7 @@ typedef struct xfs_perag lock_t pagb_lock; /* lock for pagb_list */ #endif xfs_perag_busy_t *pagb_list; /* unstable blocks */ + atomic_t pagf_fstrms; /* # of filestreams active in this AG */ /* * inode allocation search lookup optimisation. Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-05-10 17:24:13.011020884 +1000 @@ -52,6 +52,7 @@ #include "xfs_quota.h" #include "xfs_trans_space.h" #include "xfs_buf_item.h" +#include "xfs_filestream.h" #ifdef DEBUG @@ -171,6 +172,14 @@ xfs_bmap_alloc( xfs_bmalloca_t *ap); /* bmap alloc argument struct */ /* + * xfs_bmap_filestreams is the underlying allocator when filestreams are + * enabled. + */ +STATIC int /* error */ +xfs_bmap_filestreams( + xfs_bmalloca_t *ap); /* bmap alloc argument struct */ + +/* * Transform a btree format file with only one leaf node, where the * extents list will fit in the inode, into an extents format file. 
* Since the file extents are already in-core, all we have to do is @@ -2968,10 +2977,338 @@ xfs_bmap_alloc( { if ((ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata) return xfs_bmap_rtalloc(ap); + if ((ap->ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (ap->ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM)) + return xfs_bmap_filestreams(ap); return xfs_bmap_btalloc(ap); } /* + * xfs_filestreams called by xfs_bmapi for multi-file data stream filesystems. + * + * Allocate files in a directory all in the same AG. When an AG fills, pick + * a new AG. + */ +int /* error */ +xfs_bmap_filestreams( + xfs_bmalloca_t *ap) /* bmap alloc argument struct */ +{ + xfs_alloctype_t atype; /* type for allocation routines */ + int error; /* error return value */ + xfs_agnumber_t fb_agno; /* ag number of ap->firstblock */ + xfs_mount_t *mp; /* mount point structure */ + int nullfb; /* true if ap->firstblock isn't set */ + int rt; /* true if inode is realtime */ + xfs_extlen_t align; /* minimum allocation alignment */ + xfs_agnumber_t ag; + xfs_alloc_arg_t args; + xfs_extlen_t blen; + xfs_extlen_t delta; + int isaligned; + xfs_extlen_t longest; + xfs_extlen_t need; + xfs_extlen_t nextminlen = 0; + int notinit; + xfs_perag_t *pag; + xfs_agnumber_t startag; + int tryagain; + + /* + * Set up variables. + */ + mp = ap->ip->i_mount; + rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata; + align = (ap->userdata && ap->ip->i_d.di_extsize && + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) ? + ap->ip->i_d.di_extsize : 0; + if (align) { + error = xfs_bmap_extsize_align(mp, ap->gotp, ap->prevp, + align, rt, + ap->eof, 0, ap->conv, + &ap->off, &ap->alen); + ASSERT(!error); + ASSERT(ap->alen); + } + nullfb = ap->firstblock == NULLFSBLOCK; + fb_agno = nullfb ? NULLAGNUMBER : XFS_FSB_TO_AGNO(mp, ap->firstblock); + if (nullfb) { + ag = xfs_filestream_get_ag(ap->ip); + ag = (ag != NULLAGNUMBER) ? ag : 0; + ap->rval = (ap->userdata) ? 
XFS_AGB_TO_FSB(mp, ag, 0) : + XFS_INO_TO_FSB(mp, ap->ip->i_ino); + } else { + ap->rval = ap->firstblock; + } + + xfs_bmap_adjacent(ap); + + /* + * If allowed, use ap->rval; otherwise must use firstblock since + * it's in the right allocation group. + */ + if (nullfb || XFS_FSB_TO_AGNO(mp, ap->rval) == fb_agno) + ; + else + ap->rval = ap->firstblock; + /* + * Normal allocation, done through xfs_alloc_vextent. + */ + tryagain = isaligned = 0; + args.tp = ap->tp; + args.mp = mp; + args.fsbno = ap->rval; + args.maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks); + blen = 0; + if (nullfb) { + /* _vextent doesn't pick an AG */ + args.type = XFS_ALLOCTYPE_NEAR_BNO; + args.total = ap->total; + /* + * Find the longest available space. + * We're going to try for the whole allocation at once. + */ + startag = ag = XFS_FSB_TO_AGNO(mp, args.fsbno); + if (startag == NULLAGNUMBER) { + startag = ag = 0; + } + notinit = 0; + /* + * Search for an allocation group with a single extent + * large enough for the request. + * + * If one isn't found, then adjust the minimum allocation + * size to the largest space found. + */ + down_read(&mp->m_peraglock); + while (blen < ap->alen) { + pag = &mp->m_perag[ag]; + if (!pag->pagf_init && + (error = xfs_alloc_pagf_init(mp, args.tp, + ag, XFS_ALLOC_FLAG_TRYLOCK))) { + up_read(&mp->m_peraglock); + return error; + } + /* + * See xfs_alloc_fix_freelist... + */ + if (pag->pagf_init) { + need = XFS_MIN_FREELIST_PAG(pag, mp); + delta = need > pag->pagf_flcount ? + need - pag->pagf_flcount : 0; + longest = (pag->pagf_longest > delta) ? + (pag->pagf_longest - delta) : + (pag->pagf_flcount > 0 || + pag->pagf_longest > 0); + if (blen < longest) + blen = longest; + } else { + notinit = 1; + } + + if (blen >= ap->alen) + break; + + if (ap->userdata) { + if (startag == NULLAGNUMBER) { + /* + * If startag is an invalid AG, + * we've come here once before and + * xfs_filestream_new_ag picked the best + * currently available. 
+ * + * Don't continue looping, since we + * could loop forever. + */ + break; + } + + if ((error = xfs_filestream_new_ag(ap, &ag))) { + up_read(&mp->m_peraglock); + return error; + } + + startag = NULLAGNUMBER; + + /* Go around the loop once more to set 'blen'*/ + } else { + if (++ag == mp->m_sb.sb_agcount) + ag = 0; + + if (ag == startag) + break; + } + } + up_read(&mp->m_peraglock); + /* + * Since the above loop did a BUF_TRYLOCK, it is + * possible that there is space for this request. + */ + if (notinit || blen < ap->minlen) + args.minlen = ap->minlen; + /* + * If the best seen length is less than the request + * length, use the best as the minimum. + */ + else if (blen < ap->alen) + args.minlen = blen; + /* + * Otherwise we've seen an extent as big as alen, + * use that as the minimum. + */ + else + args.minlen = ap->alen; + ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0); + } else if (ap->low) { + args.type = XFS_ALLOCTYPE_FIRST_AG; + args.total = args.minlen = ap->minlen; + } else { + args.type = XFS_ALLOCTYPE_NEAR_BNO; + args.total = ap->total; + args.minlen = ap->minlen; + } + if (ap->userdata && ap->ip->i_d.di_extsize && + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) { + args.prod = ap->ip->i_d.di_extsize; + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) + args.mod = (xfs_extlen_t)(args.prod - args.mod); + } else if (mp->m_sb.sb_blocksize >= NBPP) { + args.prod = 1; + args.mod = 0; + } else { + args.prod = NBPP >> mp->m_sb.sb_blocklog; + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) + args.mod = (xfs_extlen_t)(args.prod - args.mod); + } + /* + * If we are not low on available data blocks, and the + * underlying logical volume manager is a stripe, and + * the file offset is zero then try to allocate data + * blocks on stripe unit boundary. + * NOTE: ap->aeof is only set if the allocation length + * is >= the stripe unit and the allocation offset is + * at the end of file. 
+ */ + atype = args.type; + if (!ap->low && ap->aeof) { + if (!ap->off) { + args.alignment = mp->m_dalign; + atype = args.type; + isaligned = 1; + /* + * Adjust for alignment + */ + if (blen > args.alignment && blen <= ap->alen) + args.minlen = blen - args.alignment; + args.minalignslop = 0; + } else { + /* + * First try an exact bno allocation. + * If it fails then do a near or start bno + * allocation with alignment turned on. + */ + atype = args.type; + tryagain = 1; + args.type = XFS_ALLOCTYPE_THIS_BNO; + args.alignment = 1; + /* + * Compute the minlen+alignment for the + * next case. Set slop so that the value + * of minlen+alignment+slop doesn't go up + * between the calls. + */ + if (blen > mp->m_dalign && blen <= ap->alen) + nextminlen = blen - mp->m_dalign; + else + nextminlen = args.minlen; + if (nextminlen + mp->m_dalign > args.minlen + 1) + args.minalignslop = + nextminlen + mp->m_dalign - + args.minlen - 1; + else + args.minalignslop = 0; + } + } else { + args.alignment = 1; + args.minalignslop = 0; + } + args.minleft = ap->minleft; + args.wasdel = ap->wasdel; + args.isfl = 0; + args.userdata = ap->userdata; + if ((error = xfs_alloc_vextent(&args))) + return error; + if (tryagain && args.fsbno == NULLFSBLOCK) { + /* + * Exact allocation failed. Now try with alignment + * turned on. + */ + args.type = atype; + args.fsbno = ap->rval; + args.alignment = mp->m_dalign; + args.minlen = nextminlen; + args.minalignslop = 0; + isaligned = 1; + if ((error = xfs_alloc_vextent(&args))) + return error; + } + if (isaligned && args.fsbno == NULLFSBLOCK) { + /* + * allocation failed, so turn off alignment and + * try again. 
+ */ + args.type = atype; + args.fsbno = ap->rval; + args.alignment = 0; + if ((error = xfs_alloc_vextent(&args))) + return error; + } + if (args.fsbno == NULLFSBLOCK && nullfb && + args.minlen > ap->minlen) { + args.minlen = ap->minlen; + args.type = XFS_ALLOCTYPE_START_BNO; + args.fsbno = ap->rval; + if ((error = xfs_alloc_vextent(&args))) + return error; + } + if (args.fsbno == NULLFSBLOCK && nullfb) { + args.fsbno = 0; + args.type = XFS_ALLOCTYPE_FIRST_AG; + args.total = ap->minlen; + args.minleft = 0; + if ((error = xfs_alloc_vextent(&args))) + return error; + ap->low = 1; + } + if (args.fsbno != NULLFSBLOCK) { + ap->firstblock = ap->rval = args.fsbno; + ASSERT(nullfb || fb_agno == args.agno || + (ap->low && fb_agno < args.agno)); + ap->alen = args.len; + ap->ip->i_d.di_nblocks += args.len; + xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE); + if (ap->wasdel) + ap->ip->i_delayed_blks -= args.len; + /* + * Adjust the disk quota also. This was reserved + * earlier. + */ + if (XFS_IS_QUOTA_ON(mp) && + ap->ip->i_ino != mp->m_sb.sb_uquotino && + ap->ip->i_ino != mp->m_sb.sb_gquotino) { + XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip, + ap->wasdel ? + XFS_TRANS_DQ_DELBCOUNT : + XFS_TRANS_DQ_BCOUNT, + (long)args.len); + } + } else { + ap->rval = NULLFSBLOCK; + ap->alen = 0; + } + return 0; +} + +/* * Transform a btree format file with only one leaf node, where the * extents list will fit in the inode, into an extents format file. 
* Since the file extents are already in-core, all we have to do is Index: 2.6.x-xfs-new/fs/xfs/xfs_clnt.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_clnt.h 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_clnt.h 2007-05-10 17:24:13.011020884 +1000 @@ -99,5 +99,7 @@ struct xfs_mount_args { */ #define XFSMNT2_COMPAT_IOSIZE 0x00000001 /* don't report large preferred * I/O size in stat(2) */ +#define XFSMNT2_FILESTREAMS 0x00000002 /* enable the filestreams + * allocator */ #endif /* __XFS_CLNT_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_dinode.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_dinode.h 2007-05-10 17:22:43.494753782 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_dinode.h 2007-05-10 17:24:13.015020360 +1000 @@ -257,6 +257,7 @@ typedef enum xfs_dinode_fmt #define XFS_DIFLAG_EXTSIZE_BIT 11 /* inode extent size allocator hint */ #define XFS_DIFLAG_EXTSZINHERIT_BIT 12 /* inherit inode extent size */ #define XFS_DIFLAG_NODEFRAG_BIT 13 /* do not reorganize/defragment */ +#define XFS_DIFLAG_FILESTREAM_BIT 14 /* use filestream allocator */ #define XFS_DIFLAG_REALTIME (1 << XFS_DIFLAG_REALTIME_BIT) #define XFS_DIFLAG_PREALLOC (1 << XFS_DIFLAG_PREALLOC_BIT) #define XFS_DIFLAG_NEWRTBM (1 << XFS_DIFLAG_NEWRTBM_BIT) @@ -271,12 +272,13 @@ typedef enum xfs_dinode_fmt #define XFS_DIFLAG_EXTSIZE (1 << XFS_DIFLAG_EXTSIZE_BIT) #define XFS_DIFLAG_EXTSZINHERIT (1 << XFS_DIFLAG_EXTSZINHERIT_BIT) #define XFS_DIFLAG_NODEFRAG (1 << XFS_DIFLAG_NODEFRAG_BIT) +#define XFS_DIFLAG_FILESTREAM (1 << XFS_DIFLAG_FILESTREAM_BIT) #define XFS_DIFLAG_ANY \ (XFS_DIFLAG_REALTIME | XFS_DIFLAG_PREALLOC | XFS_DIFLAG_NEWRTBM | \ XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \ XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \ XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \ - XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG) + 
XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM) #endif /* __XFS_DINODE_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_filestream.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_filestream.c 2007-05-10 17:24:13.019019836 +1000 @@ -0,0 +1,777 @@ +/* + * Copyright (c) 2000-2005 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#include "xfs.h" +#include "xfs_bmap_btree.h" +#include "xfs_inum.h" +#include "xfs_dir2.h" +#include "xfs_dir2_sf.h" +#include "xfs_attr_sf.h" +#include "xfs_dinode.h" +#include "xfs_inode.h" +#include "xfs_ag.h" +#include "xfs_dmapi.h" +#include "xfs_log.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_mount.h" +#include "xfs_bmap.h" +#include "xfs_alloc.h" +#include "xfs_utils.h" +#include "xfs_mru_cache.h" +#include "xfs_filestream.h" + +#ifdef DEBUG_FILESTREAMS +#define dprint(fmt, args...) do { \ + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ + current_pid(), __FUNCTION__, ##args); \ +} while(0) +#else +#define dprint(args...) do {} while (0) +#endif + +static kmem_zone_t *item_zone; + +/* + * Per-mount point data structure to maintain its active filestreams. 
Currently + * only contains a single pointer, but set up and allocated as a structure to + * ease future expansion, if any. + */ +typedef struct fstrm_mnt_data +{ + struct xfs_mru_cache *fstrm_items; +} fstrm_mnt_data_t; + +/* + * Structure for associating a file or a directory with an allocation group. + * The parent directory pointer is only needed for files, but since there will + * generally be vastly more files than directories in the cache, using the same + * data structure simplifies the code with very little memory overhead. + */ +typedef struct fstrm_item +{ + xfs_agnumber_t ag; /* AG currently in use for the file/directory. */ + xfs_inode_t *ip; /* inode self-pointer. */ + xfs_inode_t *pip; /* Parent directory inode pointer. */ +} fstrm_item_t; + +/* + * Allocation group filestream associations are tracked with per-ag atomic + * counters. These counters allow _xfs_filestream_pick_ag() to tell whether a + * particular AG already has active filestreams associated with it. The mount + * point's m_peraglock is used to protect these counters from per-ag array + * re-allocation during a growfs operation. When xfs_growfs_data_private() is + * about to reallocate the array, it calls xfs_filestream_flush() with the + * m_peraglock held in write mode. + * + * Since xfs_mru_cache_flush() guarantees that all the free functions for all + * the cache elements have finished executing before it returns, it's safe for + * the free functions to use the atomic counters without m_peraglock protection. + * This allows the implementation of xfs_fstrm_free_func() to be agnostic about + * whether it was called with the m_peraglock held in read mode, write mode or + * not held at all. The race condition this addresses is the following: + * + * - The work queue scheduler fires and pulls a filestream directory cache + * element off the LRU end of the cache for deletion, then gets pre-empted. 
+ * - A growfs operation grabs the m_peraglock in write mode, flushes all the + * remaining items from the cache and reallocates the mount point's per-ag + * array, resetting all the counters to zero. + * - The work queue thread resumes and calls the free function for the element + * it started cleaning up earlier. In the process it decrements the + * filestreams counter for an AG that now has no references. + * + * With a shrinkfs feature, the above scenario could panic the system. + * + * All other uses of the following macros should be protected by either the + * m_peraglock held in read mode, or the cache's internal locking exposed by the + * interval between a call to xfs_mru_cache_lookup() and a call to + * xfs_mru_cache_done(). In addition, the m_peraglock must be held in read mode + * when new elements are added to the cache. + * + * Combined, these locking rules ensure that no associations will ever exist in + * the cache that reference per-ag array elements that have since been + * reallocated. + */ +#define GET_AG_REF(mp, ag) atomic_read(&(mp)->m_perag[ag].pagf_fstrms) +#define INC_AG_REF(mp, ag) atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms) +#define DEC_AG_REF(mp, ag) atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms) + +#define XFS_PICK_USERDATA 1 +#define XFS_PICK_LOWSPACE 2 + +/* + * Scan the AGs starting at startag looking for an AG that isn't in use and has + * at least minlen blocks free. + */ +static int +_xfs_filestream_pick_ag( + xfs_mount_t *mp, + xfs_agnumber_t startag, + xfs_agnumber_t *agp, + int flags, + xfs_extlen_t minlen) +{ + int err, trylock, nscan; + xfs_extlen_t delta, longest, need, free, minfree, maxfree = 0; + xfs_agnumber_t ag, max_ag = NULLAGNUMBER; + struct xfs_perag *pag; + + /* 2% of an AG's blocks must be free for it to be chosen. */ + minfree = mp->m_sb.sb_agblocks / 50; + + ag = startag; + *agp = NULLAGNUMBER; + + /* For the first pass, don't sleep trying to init the per-AG. 
*/ + trylock = XFS_ALLOC_FLAG_TRYLOCK; + + for (nscan = 0; 1; nscan++) { + + //dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag)); + + pag = mp->m_perag + ag; + + if (!pag->pagf_init && + (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) && + !trylock) { + dprint("xfs_alloc_pagf_init returned %d", err); + return err; + } + + /* Might fail sometimes during the 1st pass with trylock set. */ + if (!pag->pagf_init) { + dprint("!pagf_init"); + goto next_ag; + } + + /* Keep track of the AG with the most free blocks. */ + if (pag->pagf_freeblks > maxfree) { + maxfree = pag->pagf_freeblks; + max_ag = ag; + } + + /* + * The AG reference count does two things: it enforces mutual + * exclusion when examining the suitability of an AG in this + * loop, and it guards against two filestreams being established + * in the same AG as each other. + */ + if (INC_AG_REF(mp, ag) > 1) { + DEC_AG_REF(mp, ag); + goto next_ag; + } + + need = XFS_MIN_FREELIST_PAG(pag, mp); + delta = need > pag->pagf_flcount ? need - pag->pagf_flcount : 0; + longest = (pag->pagf_longest > delta) ? + (pag->pagf_longest - delta) : + (pag->pagf_flcount > 0 || pag->pagf_longest > 0); + + if (((minlen && longest >= minlen) || + (!minlen && pag->pagf_freeblks >= minfree)) && + (!pag->pagf_metadata || !(flags & XFS_PICK_USERDATA) || + (flags & XFS_PICK_LOWSPACE))) { + + /* Break out, retaining the reference on the AG. */ + free = pag->pagf_freeblks; + *agp = ag; + break; + } + + /* Drop the reference on this AG, it's not usable. */ + DEC_AG_REF(mp, ag); +next_ag: + /* Move to the next AG, wrapping to AG 0 if necessary. */ + if (++ag >= mp->m_sb.sb_agcount) + ag = 0; + + /* If a full pass of the AGs hasn't been done yet, continue. */ + if (ag != startag) + continue; + + /* Allow sleeping in xfs_alloc_pagf_init() on the 2nd pass. */ + if (trylock != 0) { + trylock = 0; + continue; + } + + /* Finally, if lowspace wasn't set, set it for the 3rd pass. 
*/ + if (!(flags & XFS_PICK_LOWSPACE)) { + flags |= XFS_PICK_LOWSPACE; + continue; + } + + /* + * Take the AG with the most free space, regardless of whether + * it's already in use by another filestream. + */ + if (max_ag != NULLAGNUMBER) { + INC_AG_REF(mp, max_ag); + dprint("using max_ag %d[1] with maxfree %d", max_ag, + maxfree); + + free = maxfree; + *agp = max_ag; + break; + } + + dprint("giving up, returning AG 0"); + *agp = 0; + return 0; + } + + /* + dprint("mp %p startag %d newag %d[%d] free %d minlen %d minfree %d " + "scanned %d trylock %d flags 0x%x", mp, startag, *agp, + GET_AG_REF(mp, *agp), free, minlen, minfree, nscan, trylock, + flags); + */ + + return 0; +} + +/* + * Set the allocation group number for a file or a directory, updating inode + * references and per-AG references as appropriate. Must be called with the + * m_peraglock held in read mode. + */ +static int +_xfs_filestream_set_ag( + xfs_inode_t *ip, + xfs_inode_t *pip, + xfs_agnumber_t ag) +{ + int err = 0; + xfs_mount_t *mp; + xfs_mru_cache_t *cache; + fstrm_item_t *item; + xfs_agnumber_t old_ag; + xfs_inode_t *old_pip; + + /* + * Either ip is a regular file and pip is a directory, or ip is a + * directory and pip is NULL. + */ + ASSERT(ip && (((ip->i_d.di_mode & S_IFREG) && pip && + (pip->i_d.di_mode & S_IFDIR)) || + ((ip->i_d.di_mode & S_IFDIR) && !pip))); + + mp = ip->i_mount; + cache = mp->m_filestream->fstrm_items; + + if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) { + ASSERT(item->ip == ip); + old_ag = item->ag; + item->ag = ag; + old_pip = item->pip; + item->pip = pip; + xfs_mru_cache_done(cache); + + /* + * If the AG has changed, drop the old ref and take a new one, + * effectively transferring the reference from old to new AG. + */ + if (ag != old_ag) { + DEC_AG_REF(mp, old_ag); + INC_AG_REF(mp, ag); + } + + /* + * If ip is a file and its pip has changed, drop the old ref and + * take a new one. 
+ */ + if (pip && pip != old_pip) { + IRELE(old_pip); + IHOLD(pip); + } + + if (ag != old_ag) + dprint("found ip %p ino %lld, AG %d[%d] -> %d[%d]", ip, + ip->i_ino, old_ag, GET_AG_REF(mp, old_ag), ag, + GET_AG_REF(mp, ag)); + else + dprint("found ip %p ino %lld, AG %d[%d]", ip, ip->i_ino, + ag, GET_AG_REF(mp, ag)); + + return 0; + } + + if (!(item = (fstrm_item_t*)kmem_zone_zalloc(item_zone, KM_SLEEP))) + return ENOMEM; + + item->ag = ag; + item->ip = ip; + item->pip = pip; + + if ((err = xfs_mru_cache_insert(cache, ip->i_ino, item))) { + kmem_zone_free(item_zone, item); + return err; + } + + /* Take a reference on the AG. */ + INC_AG_REF(mp, ag); + + /* + * Take a reference on the inode itself regardless of whether it's a + * regular file or a directory. + */ + IHOLD(ip); + + /* + * In the case of a regular file, take a reference on the parent inode + * as well to ensure it remains in-core. + */ + if (pip) + IHOLD(pip); + + dprint("put ip %p ino %lld into AG %d[%d]", ip, ip->i_ino, ag, + GET_AG_REF(mp, ag)); + + return 0; +} + +/* xfs_fstrm_free_func(): callback for freeing cached stream items. */ +void +xfs_fstrm_free_func( + xfs_ino_t ino, + fstrm_item_t *item) +{ + xfs_inode_t *ip = item->ip; + int ref; + + ASSERT(ip->i_ino == ino); + + /* Drop the reference taken on the AG when the item was added. */ + ref = DEC_AG_REF(ip->i_mount, item->ag); + + ASSERT(ref >= 0); + + /* + * _xfs_filestream_set_ag() always takes a reference on the inode + * itself, whether it's a file or a directory. Release it here. + */ + IRELE(ip); + + /* + * In the case of a regular file, _xfs_filestream_set_ag() also takes a + * ref on the parent inode to keep it in-core. Release that too. + */ + if (item->pip) + IRELE(item->pip); + + if (ip->i_d.di_mode & S_IFDIR) + dprint("deleting dip %p ino %lld, AG %d[%d]", ip, ip->i_ino, + item->ag, GET_AG_REF(ip->i_mount, item->ag)); + else + dprint("deleting file %p ino %lld, pip %p ino %lld, AG %d[%d]", + ip, ip->i_ino, item->pip, + item->pip ? 
item->pip->i_ino : 0, item->ag, + GET_AG_REF(ip->i_mount, item->ag)); + + /* Finally, free the memory allocated for the item. */ + kmem_zone_free(item_zone, item); +} + +/* + * xfs_filestream_init() is called at xfs initialisation time to set up the + * memory zone that will be used for filestream data structure allocation. + */ +void +xfs_filestream_init(void) +{ + item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item"); + ASSERT(item_zone); +} + +/* + * xfs_filestream_uninit() is called at xfs termination time to destroy the + * memory zone that was used for filestream data structure allocation. + */ +void +xfs_filestream_uninit(void) +{ + if (item_zone) { + kmem_zone_destroy(item_zone); + item_zone = NULL; + } +} + +/* + * xfs_filestream_mount() is called when a file system is mounted with the + * filestream option. It is responsible for allocating the data structures + * needed to track the new file system's file streams. + */ +int +xfs_filestream_mount( + xfs_mount_t *mp) +{ + int err = 0; + unsigned int lifetime, grp_count; + fstrm_mnt_data_t *md; + + if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP))) + return ENOMEM; + + /* + * The filestream timer tunable is currently fixed within the range of + * one second to four minutes, with five seconds being the default. The + * group count is somewhat arbitrary, but it'd be nice to adhere to the + * timer tunable to within about 10 percent. This requires at least 10 + * groups. + */ + lifetime = xfs_fstrm_centisecs * 10; + grp_count = 10; + + if ((err = xfs_mru_cache_create(&md->fstrm_items, lifetime, grp_count, + (xfs_mru_cache_free_func_t)xfs_fstrm_free_func))) { + kmem_free(md, sizeof(*md)); + return err; + } + + mp->m_filestream = md; + + dprint("created fstrm_items %p for mount %p", md->fstrm_items, mp); + + return 0; +} + +/* + * xfs_filestream_unmount() is called when a file system that was mounted with + * the filestream option is unmounted. 
It drains the data structures created + * to track the file system's file streams and frees all the memory that was + * allocated. + */ +void +xfs_filestream_unmount( + xfs_mount_t *mp) +{ + xfs_mru_cache_destroy(mp->m_filestream->fstrm_items); + kmem_free(mp->m_filestream, sizeof(*mp->m_filestream)); +} + +/* + * If the mount point's m_perag array is going to be reallocated, all + * outstanding cache entries must be flushed to avoid accessing reference count + * addresses that have been freed. The call to xfs_filestream_flush() must be + * made inside the block that holds the m_peraglock in write mode to do the + * reallocation. + */ +void +xfs_filestream_flush( + xfs_mount_t *mp) +{ + /* point in time flush, so keep the reaper running */ + xfs_mru_cache_flush(mp->m_filestream->fstrm_items, 1); +} + +/* + * Return the AG of the filestream the file or directory belongs to, or + * NULLAGNUMBER otherwise. + */ +xfs_agnumber_t +xfs_filestream_get_ag( + xfs_inode_t *ip) +{ + xfs_mru_cache_t *cache; + fstrm_item_t *item; + xfs_agnumber_t ag; + int ref; + + ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR)); + if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR))) + return NULLAGNUMBER; + + cache = ip->i_mount->m_filestream->fstrm_items; + if (!(item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) { + dprint("lookup on %s ip %p ino %lld failed, returning %d", + ip->i_d.di_mode & S_IFREG ? "file" : "dir", ip, + ip->i_ino, NULLAGNUMBER); + return NULLAGNUMBER; + } + + ASSERT(ip == item->ip); + ag = item->ag; + ref = GET_AG_REF(ip->i_mount, ag); + xfs_mru_cache_done(cache); + + if (ip->i_d.di_mode & S_IFREG) + dprint("lookup on file ip %p ino %lld dir %p dino %lld got AG " + "%d[%d]", ip, ip->i_ino, item->pip, item->pip->i_ino, ag, + ref); + else + dprint("lookup on dir ip %p ino %lld got AG %d[%d]", ip, + ip->i_ino, ag, ref); + + return ag; +} + +/* + * xfs_filestream_associate() should only be called to associate a regular file + * with its parent directory. 
Calling it with a child directory isn't + * appropriate because filestreams don't apply to entire directory hierarchies. + * Creating a file in a child directory of an existing filestream directory + * starts a new filestream with its own allocation group association. + */ +int +xfs_filestream_associate( + xfs_inode_t *pip, + xfs_inode_t *ip) +{ + xfs_mount_t *mp; + xfs_mru_cache_t *cache; + fstrm_item_t *item; + xfs_agnumber_t ag, rotorstep, startag; + int err = 0; + + ASSERT(pip->i_d.di_mode & S_IFDIR); + ASSERT(ip->i_d.di_mode & S_IFREG); + if (!(pip->i_d.di_mode & S_IFDIR) || !(ip->i_d.di_mode & S_IFREG)) + return EINVAL; + + mp = pip->i_mount; + cache = mp->m_filestream->fstrm_items; + down_read(&mp->m_peraglock); + xfs_ilock(pip, XFS_IOLOCK_EXCL); + + /* If the parent directory is already in the cache, use its AG. */ + if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, pip->i_ino))) { + ASSERT(item->ip == pip); + ag = item->ag; + xfs_mru_cache_done(cache); + + dprint("got cached dir %p ino %lld with AG %d[%d]", pip, + pip->i_ino, ag, GET_AG_REF(mp, ag)); + + if ((err = _xfs_filestream_set_ag(ip, pip, ag))) + dprint("_xfs_filestream_set_ag(%p, %p, %d) -> err %d", + ip, pip, ag, err); + + goto exit; + } + + /* + * Set the starting AG using the rotor for inode32, otherwise + * use the directory inode's AG. + */ + if (mp->m_flags & XFS_MOUNT_32BITINODES) { + rotorstep = xfs_rotorstep; + startag = (mp->m_agfrotor / rotorstep) % mp->m_sb.sb_agcount; + mp->m_agfrotor = (mp->m_agfrotor + 1) % + (mp->m_sb.sb_agcount * rotorstep); + } else + startag = XFS_INO_TO_AGNO(mp, pip->i_ino); + + /* Pick a new AG for the parent inode starting at startag. */ + if ((err = _xfs_filestream_pick_ag(mp, startag, &ag, 0, 0)) || + ag == NULLAGNUMBER) + goto exit_did_pick; + + /* Associate the parent inode with the AG. 
*/ + if ((err = _xfs_filestream_set_ag(pip, NULL, ag))) { + dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d", + pip, pip->i_ino, ag, err); + goto exit_did_pick; + } + + /* Associate the file inode with the AG. */ + if ((err = _xfs_filestream_set_ag(ip, pip, ag))) { + dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> " + "err %d", ip, ip->i_ino, pip, pip->i_ino, ag, err); + goto exit_did_pick; + } + + dprint("pip %p ino %lld and ip %p ino %lld given ag %d[%d]", + pip, pip->i_ino, ip, ip->i_ino, ag, GET_AG_REF(mp, ag)); + +exit_did_pick: + /* + * If _xfs_filestream_pick_ag() returned a valid AG, remove the + * reference it took on it, since the file and directory will have taken + * their own now if they were successfully cached. + */ + if (ag != NULLAGNUMBER) + DEC_AG_REF(mp, ag); + else + dprint("_pick_ag() returned invalid AG %d, no stream set", ag); + +exit: + xfs_iunlock(pip, XFS_IOLOCK_EXCL); + up_read(&mp->m_peraglock); + return err; +} + +/* + * Pick a new allocation group for the current file and its file stream. This + * function is called by xfs_bmap_filestreams() with the mount point's per-ag + * lock held. + */ +int +xfs_filestream_new_ag( + xfs_bmalloca_t *ap, + xfs_agnumber_t *agp) +{ + int flags, err; + xfs_inode_t *ip, *pip = NULL; + xfs_mount_t *mp; + xfs_mru_cache_t *cache; + xfs_extlen_t minlen; + fstrm_item_t *dir, *file; + xfs_agnumber_t ag = NULLAGNUMBER; + + ip = ap->ip; + mp = ip->i_mount; + cache = mp->m_filestream->fstrm_items; + minlen = ap->alen; + *agp = NULLAGNUMBER; + + /* + * Look for the file in the cache, removing it if it's found. Doing + * this allows it to be held across the dir lookup that follows. + */ + if ((file = (fstrm_item_t*)xfs_mru_cache_remove(cache, ip->i_ino))) { + ASSERT(ip == file->ip); + + /* Save the file's parent inode and old AG number for later. */ + pip = file->pip; + ag = file->ag; + + /* Look for the file's directory in the cache. 
*/ + dir = (fstrm_item_t*)xfs_mru_cache_lookup(cache, pip->i_ino); + if (dir) { + ASSERT(pip == dir->ip); + + /* + * If the directory has already moved on to a new AG, + * use that AG as the new AG for the file. Don't + * forget to twiddle the AG refcounts to match the + * movement. + */ + if (dir->ag != file->ag) { + DEC_AG_REF(mp, file->ag); + INC_AG_REF(mp, dir->ag); + *agp = file->ag = dir->ag; + } + + xfs_mru_cache_done(cache); + } + + /* + * Put the file back in the cache. If this fails, the free + * function needs to be called to tidy up in the same way as if + * the item had simply expired from the cache. + */ + if ((err = xfs_mru_cache_insert(cache, ip->i_ino, file))) { + xfs_fstrm_free_func(ip->i_ino, file); + return err; + } + + /* + * If the file's AG was moved to the directory's new AG, there's + * nothing more to be done. + */ + if (*agp != NULLAGNUMBER) { + dprint("dir %p ino %lld for file %p ino %lld has " + "already moved %d[%d] -> %d[%d]", pip, + pip->i_ino, ip, ip->i_ino, ag, + GET_AG_REF(mp, ag), *agp, GET_AG_REF(mp, *agp)); + return 0; + } + } + + /* + * If the file's parent directory is known, take its iolock in exclusive + * mode to prevent two sibling files from racing each other to migrate + * themselves and their parent to different AGs. + */ + if (pip) + xfs_ilock(pip, XFS_IOLOCK_EXCL); + + /* + * A new AG needs to be found for the file. If the file's parent + * directory is also known, it will be moved to the new AG as well to + * ensure that files created inside it in future use the new AG. + */ + ag = (ag == NULLAGNUMBER) ? 0 : (ag + 1) % mp->m_sb.sb_agcount; + flags = (ap->userdata ? XFS_PICK_USERDATA : 0) | + (ap->low ? XFS_PICK_LOWSPACE : 0); + + if ((err = _xfs_filestream_pick_ag(mp, ag, agp, flags, minlen)) || + *agp == NULLAGNUMBER) + goto exit; + + /* + * If the file wasn't found in the file cache, then its parent directory + * inode isn't known. 
For this to have happened, the file must either + * be pre-existing, or it was created long enough ago that its cache + * entry has expired. This isn't the sort of usage that the filestreams + * allocator is trying to optimise, so there's no point trying to track + * its new AG somehow in the filestream data structures. + */ + if (!pip) { + dprint("gave ag %d to orphan ip %p ino %lld", *agp, ip, + ip->i_ino); + goto exit; + } + + /* Associate the parent inode with the AG. */ + if ((err = _xfs_filestream_set_ag(pip, NULL, *agp))) { + dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d", + pip, pip->i_ino, *agp, err); + goto exit; + } + + /* Associate the file inode with the AG. */ + if ((err = _xfs_filestream_set_ag(ip, pip, *agp))) { + dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> " + "err %d", ip, ip->i_ino, pip, pip->i_ino, *agp, err); + goto exit; + } + + dprint("pip %p ino %lld and ip %p ino %lld moved to new ag %d[%d]", + pip, pip->i_ino, ip, ip->i_ino, *agp, GET_AG_REF(mp, *agp)); + +exit: + /* + * If _xfs_filestream_pick_ag() returned a valid AG, remove the + * reference it took on it, since the file and directory will have taken + * their own now if they were successfully cached. + */ + if (*agp != NULLAGNUMBER) + DEC_AG_REF(mp, *agp); + else { + dprint("_pick_ag() returned invalid AG %d, using AG 0", *agp); + *agp = 0; + } + + if (pip) + xfs_iunlock(pip, XFS_IOLOCK_EXCL); + + return err; +} + +/* + * Remove an association between an inode and a filestream object. + * Typically this is done on last close of an unlinked file. 
+ */ +void +xfs_filestream_deassociate( + xfs_inode_t *ip) +{ + xfs_mru_cache_t *cache = ip->i_mount->m_filestream->fstrm_items; + + xfs_mru_cache_delete(cache, ip->i_ino); +} Index: 2.6.x-xfs-new/fs/xfs/xfs_filestream.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_filestream.h 2007-05-10 17:24:13.107008304 +1000 @@ -0,0 +1,59 @@ +/* + * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#ifndef __XFS_FILESTREAM_H__ +#define __XFS_FILESTREAM_H__ + +#ifdef __KERNEL__ + +struct xfs_mount; +struct xfs_inode; +struct xfs_perag; +struct xfs_bmalloca; + +void +xfs_filestream_init(void); + +void +xfs_filestream_uninit(void); + +int +xfs_filestream_mount(struct xfs_mount *mp); + +void +xfs_filestream_unmount(struct xfs_mount *mp); + +void +xfs_filestream_flush(struct xfs_mount *mp); + +xfs_agnumber_t +xfs_filestream_get_ag(struct xfs_inode *ip); + +int +xfs_filestream_associate(struct xfs_inode *dip, + struct xfs_inode *ip); + +void +xfs_filestream_deassociate(struct xfs_inode *ip); + +int +xfs_filestream_new_ag(struct xfs_bmalloca *ap, + xfs_agnumber_t *agp); + +#endif /* __KERNEL__ */ + +#endif /* __XFS_FILESTREAM_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h 
=================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h 2007-05-10 17:24:13.123006207 +1000 @@ -66,6 +66,7 @@ struct fsxattr { #define XFS_XFLAG_EXTSIZE 0x00000800 /* extent size allocator hint */ #define XFS_XFLAG_EXTSZINHERIT 0x00001000 /* inherit inode extent size */ #define XFS_XFLAG_NODEFRAG 0x00002000 /* do not defragment */ +#define XFS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */ #define XFS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_fsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fsops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fsops.c 2007-05-10 17:24:13.131005159 +1000 @@ -44,6 +44,7 @@ #include "xfs_trans_space.h" #include "xfs_rtalloc.h" #include "xfs_rw.h" +#include "xfs_filestream.h" /* * File system operations @@ -163,6 +164,7 @@ xfs_growfs_data_private( new = nb - mp->m_sb.sb_dblocks; oagcount = mp->m_sb.sb_agcount; if (nagcount > oagcount) { + xfs_filestream_flush(mp); down_write(&mp->m_peraglock); mp->m_perag = kmem_realloc(mp->m_perag, sizeof(xfs_perag_t) * nagcount, Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c 2007-05-10 17:24:13.143003586 +1000 @@ -48,6 +48,7 @@ #include "xfs_dir2_trace.h" #include "xfs_quota.h" #include "xfs_acl.h" +#include "xfs_filestream.h" kmem_zone_t *xfs_ifork_zone; @@ -817,6 +818,8 @@ _xfs_dic2xflags( flags |= XFS_XFLAG_EXTSZINHERIT; if (di_flags & XFS_DIFLAG_NODEFRAG) flags |= XFS_XFLAG_NODEFRAG; + if (di_flags & XFS_DIFLAG_FILESTREAM) + flags |= XFS_XFLAG_FILESTREAM; } return flags; @@ -1099,7 +1102,7 @@ xfs_ialloc( * Call the space management code to pick * the on-disk inode to be 
allocated. */ - error = xfs_dialloc(tp, pip->i_ino, mode, okalloc, + error = xfs_dialloc(tp, pip ? pip->i_ino : 0, mode, okalloc, ialloc_context, call_again, &ino); if (error != 0) { return error; @@ -1153,7 +1156,7 @@ xfs_ialloc( if ( (prid != 0) && (ip->i_d.di_version == XFS_DINODE_VERSION_1)) xfs_bump_ino_vers2(tp, ip); - if (XFS_INHERIT_GID(pip, vp->v_vfsp)) { + if (pip && XFS_INHERIT_GID(pip, vp->v_vfsp)) { ip->i_d.di_gid = pip->i_d.di_gid; if ((pip->i_d.di_mode & S_ISGID) && (mode & S_IFMT) == S_IFDIR) { ip->i_d.di_mode |= S_ISGID; @@ -1195,8 +1198,14 @@ xfs_ialloc( flags |= XFS_ILOG_DEV; break; case S_IFREG: + if (unlikely(pip && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (pip->i_d.di_flags & XFS_DIFLAG_FILESTREAM)) && + (error = xfs_filestream_associate(pip, ip)))) + return error; + /* fall through */ case S_IFDIR: - if (unlikely(pip->i_d.di_flags & XFS_DIFLAG_ANY)) { + if (unlikely(pip && (pip->i_d.di_flags & XFS_DIFLAG_ANY))) { uint di_flags = 0; if ((mode & S_IFMT) == S_IFDIR) { @@ -1233,6 +1242,8 @@ xfs_ialloc( if ((pip->i_d.di_flags & XFS_DIFLAG_NODEFRAG) && xfs_inherit_nodefrag) di_flags |= XFS_DIFLAG_NODEFRAG; + if (pip->i_d.di_flags & XFS_DIFLAG_FILESTREAM) + di_flags |= XFS_DIFLAG_FILESTREAM; ip->i_d.di_flags |= di_flags; } /* FALLTHROUGH */ Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.h 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.h 2007-05-10 17:24:13.147003062 +1000 @@ -66,6 +66,7 @@ struct xfs_bmbt_irec; struct xfs_bmap_free; struct xfs_extdelta; struct xfs_swapext; +struct xfs_filestream; extern struct bhv_vfsops xfs_vfsops; extern struct bhv_vnodeops xfs_vnodeops; @@ -436,6 +437,7 @@ typedef struct xfs_mount { struct notifier_block m_icsb_notifier; /* hotplug cpu notifier */ struct mutex m_icsb_mutex; /* balancer sync lock */ #endif + struct fstrm_mnt_data *m_filestream; /* per-mount filestream data */ } 
xfs_mount_t; /* @@ -475,6 +477,8 @@ typedef struct xfs_mount { * I/O size in stat() */ #define XFS_MOUNT_NO_PERCPU_SB (1ULL << 23) /* don't use per-cpu superblock counters */ +#define XFS_MOUNT_FILESTREAMS (1ULL << 24) /* enable the filestreams + allocator */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.c 2007-05-10 17:24:13.151002538 +1000 @@ -0,0 +1,607 @@ +/* + * Copyright (c) 2000-2002,2006 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +//#define DEBUG_MRU_CACHE 1 +#include "xfs.h" +#include "xfs_mru_cache.h" + +/* + * An MRU Cache is a dynamic data structure that stores its elements in a way + * that allows efficient lookups, but also groups them into discrete time + * intervals based on insertion time. This allows elements to be efficiently + * and automatically reaped after a fixed period of inactivity. + */ + +#ifdef DEBUG_MRU_CACHE +#define dprint(fmt, args...) 
do { \ + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ + current_pid(), __FUNCTION__, ##args); \ +} while(0) + +#define DEBUG_DECL_CACHE_FIELDS \ + unsigned int *list_elems; \ + unsigned int reap_elems; \ + unsigned long allocs; \ + unsigned long frees; + +#define DEBUG_INIT_CACHE(mru) \ + ((mru)->list_elems = (unsigned int*) \ + kmem_zalloc((mru)->grp_count * sizeof(*(mru)->list_elems), \ + KM_SLEEP)) + +#define DEBUG_UNINIT_CACHE(mru) \ + kmem_free((mru)->list_elems, \ + (mru)->grp_count * sizeof(*(mru)->list_elems)) + +#define DEBUG_INC_ALLOCS(mru) (mru)->allocs++ +#define DEBUG_INC_FREES(mru) (mru)->frees++ + +STATIC int +_xfs_mru_cache_print(struct xfs_mru_cache *mru, char *buf); + +#define DEBUG_PRINT_STACK_VARS \ + char buf[256]; \ + char *bufp = buf; + +#define DEBUG_PRINT_BEFORE_REAP \ + bufp += _xfs_mru_cache_print(mru, bufp) + +#define DEBUG_PRINT_AFTER_REAP \ + bufp += sprintf(bufp, " -> "); \ + bufp += _xfs_mru_cache_print(mru, bufp); \ + dprint("[%p]: %s", mru, buf) +#else /* !defined DEBUG_MRU_CACHE */ +#define dprint(args...) do {} while (0) +#define DEBUG_DECL_CACHE_FIELDS +#define DEBUG_INIT_CACHE(mru) 1 +#define DEBUG_UNINIT_CACHE(mru) do {} while (0) +#define DEBUG_INC_ALLOCS(mru) do {} while (0) +#define DEBUG_INC_FREES(mru) do {} while (0) +#define DEBUG_PRINT_STACK_VARS +#define DEBUG_PRINT_BEFORE_REAP do {} while (0) +#define DEBUG_PRINT_AFTER_REAP do {} while (0) +#endif /* DEBUG_MRU_CACHE */ + + +/* + * When a client data pointer is stored in the MRU Cache it needs to be added to + * both the data store and to one of the lists. It must also be possible to + * access each of these entries via the other, i.e. to: + * + * a) Walk a list, removing the corresponding data store entry for each item. + * b) Look up a data store entry, then access its list entry directly. + * + * To achieve both of these goals, each entry must contain both a list entry and + * a key, in addition to the user's data pointer. 
Note that it's not a good + * idea to have the client embed one of these structures at the top of their own + * data structure, because inserting the same item more than once would most + * likely result in a loop in one of the lists. That's a sure-fire recipe for + * an infinite loop in the code. + */ +typedef struct xfs_mru_cache_elem +{ + struct list_head list_node; + unsigned long key; + void *value; +} xfs_mru_cache_elem_t; + +static kmem_zone_t *elem_zone; +static struct workqueue_struct *reap_wq; + +/* + * When inserting, destroying or reaping, it's first necessary to update the + * lists relative to a particular time. In the case of destroying, that time + * will be well in the future to ensure that all items are moved to the reap + * list. In all other cases though, the time will be the current time. + * + * This function enters a loop, moving the contents of the LRU list to the reap + * list again and again until either a) the lists are all empty, or b) time zero + * has been advanced sufficiently to be within the immediate element lifetime. + * + * Case a) above is detected by counting how many groups are migrated and + * stopping when they've all been moved. Case b) is detected by monitoring the + * time_zero field, which is updated as each group is migrated. + * + * The return value is the earliest time that more migration could be needed, or + * zero if there's no need to schedule more work because the lists are empty. + */ +STATIC unsigned long +_xfs_mru_cache_migrate( + xfs_mru_cache_t *mru, + unsigned long now) +{ + unsigned int grp; + unsigned int migrated = 0; + struct list_head *lru_list; + + /* Nothing to do if the data store is empty. */ + if (!mru->time_zero) + return 0; + + /* While time zero is older than the time spanned by all the lists. */ + while (mru->time_zero <= now - mru->grp_count * mru->grp_time) { + + /* + * If the LRU list isn't empty, migrate its elements to the tail + * of the reap list. 
+ */ + lru_list = mru->lists + mru->lru_grp; + if (!list_empty(lru_list)) + list_splice_init(lru_list, mru->reap_list.prev); + + /* + * Advance the LRU group number, freeing the old LRU list to + * become the new MRU list; advance time zero accordingly. + */ + mru->lru_grp = (mru->lru_grp + 1) % mru->grp_count; + mru->time_zero += mru->grp_time; + + /* + * If reaping is so far behind that all the elements on all the + * lists have been migrated to the reap list, it's now empty. + */ + if (++migrated == mru->grp_count) { + mru->lru_grp = 0; + mru->time_zero = 0; + return 0; + } + } + + /* Find the first non-empty list from the LRU end. */ + for (grp = 0; grp < mru->grp_count; grp++) { + + /* Check the grp'th list from the LRU end. */ + lru_list = mru->lists + ((mru->lru_grp + grp) % mru->grp_count); + if (!list_empty(lru_list)) + return mru->time_zero + + (mru->grp_count + grp) * mru->grp_time; + } + + /* All the lists must be empty. */ + mru->lru_grp = 0; + mru->time_zero = 0; + return 0; +} + +/* + * When inserting or doing a lookup, an element needs to be inserted into the + * MRU list. The lists must be migrated first to ensure that they're + * up-to-date, otherwise the new element could be given a shorter lifetime in + * the cache than it should. + */ +STATIC void +_xfs_mru_cache_list_insert( + xfs_mru_cache_t *mru, + xfs_mru_cache_elem_t *elem) +{ + unsigned int grp = 0; + unsigned long now = jiffies; + + /* + * If the data store is empty, initialise time zero, leave grp set to + * zero and start the work queue timer if necessary. Otherwise, set grp + * to the number of group times that have elapsed since time zero. + */ + if (!_xfs_mru_cache_migrate(mru, now)) { + mru->time_zero = now; + if (!mru->next_reap) + mru->next_reap = mru->grp_count * mru->grp_time; + } else { + grp = (now - mru->time_zero) / mru->grp_time; + grp = (mru->lru_grp + grp) % mru->grp_count; + } + + /* Insert the element at the tail of the corresponding list. 
*/ + list_add_tail(&elem->list_node, mru->lists + grp); +} + +/* + * When destroying or reaping, all the elements that were migrated to the reap + * list need to be deleted. For each element this involves removing it from the + * data store, removing it from the reap list, calling the client's free + * function and deleting the element from the element zone. + */ +STATIC void +_xfs_mru_cache_clear_reap_list( + xfs_mru_cache_t *mru) +{ + xfs_mru_cache_elem_t *elem, *next; + struct list_head tmp; + + INIT_LIST_HEAD(&tmp); + list_for_each_entry_safe(elem, next, &mru->reap_list, list_node) { + + /* Remove the element from the data store. */ + radix_tree_delete(&mru->store, elem->key); + + /* + * Move to a temporary list so it can be freed without + * needing to hold the lock. + */ + list_move(&elem->list_node, &tmp); + } + mutex_spinunlock(&mru->lock, 0); + + list_for_each_entry_safe(elem, next, &tmp, list_node) { + + /* Remove the element from the temporary list. */ + list_del_init(&elem->list_node); + + /* Call the client's free function with the key and value pointer. */ + mru->free_func(elem->key, elem->value); + + /* Free the element structure. */ + kmem_zone_free(elem_zone, elem); + DEBUG_INC_FREES(mru); + } + + mutex_spinlock(&mru->lock); +} + +/* + * We fire the reap timer every group expiry interval so + * we always have a reaper ready to run. This makes shutdown + * and flushing of the reaper easy to do. Hence we need to + * keep track of when the next reap must occur so we can determine + * at each interval whether there is anything we need to do.
+ */ +STATIC void +_xfs_mru_cache_reap( + struct work_struct *work) +{ + xfs_mru_cache_t *mru = container_of(work, xfs_mru_cache_t, work.work); + unsigned long now, next; + DEBUG_PRINT_STACK_VARS; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return; + + mutex_spinlock(&mru->lock); + now = jiffies; + if (mru->reap_all || + (mru->next_reap && time_after(now, mru->next_reap))) { + DEBUG_PRINT_BEFORE_REAP; + if (mru->reap_all) + now += mru->grp_count * mru->grp_time * 2; + mru->next_reap = _xfs_mru_cache_migrate(mru, now); + _xfs_mru_cache_clear_reap_list(mru); + DEBUG_PRINT_AFTER_REAP; + } + + /* + * The process that triggered the reap_all is responsible + * for restarting the periodic reap if it is required. + */ + if (!mru->reap_all) + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + mru->reap_all = 0; + mutex_spinunlock(&mru->lock, 0); +} + +int +xfs_mru_cache_init(void) +{ + if (!(elem_zone = kmem_zone_init(sizeof(xfs_mru_cache_elem_t), + "xfs_mru_cache_elem"))) + return ENOMEM; + + if (!(reap_wq = create_singlethread_workqueue("xfs_mru_cache"))) { + kmem_zone_destroy(elem_zone); + elem_zone = NULL; + return ENOMEM; + } + + return 0; +} + +void +xfs_mru_cache_uninit(void) +{ + if (reap_wq) { + destroy_workqueue(reap_wq); + reap_wq = NULL; + } + + if (elem_zone) { + kmem_zone_destroy(elem_zone); + elem_zone = NULL; + } +} + +int +xfs_mru_cache_create( + xfs_mru_cache_t **mrup, + unsigned int lifetime_ms, + unsigned int grp_count, + xfs_mru_cache_free_func_t free_func) +{ + xfs_mru_cache_t *mru = NULL; + int err = 0, grp; + unsigned int grp_time; + + if (mrup) + *mrup = NULL; + + if (!mrup || !grp_count || !lifetime_ms || !free_func) + return EINVAL; + + if (!(grp_time = msecs_to_jiffies(lifetime_ms) / grp_count)) + return EINVAL; + + if (!(mru = kmem_zalloc(sizeof(*mru), KM_SLEEP))) + return ENOMEM; + + /* An extra list is needed to avoid reaping up to a grp_time early.
*/ + mru->grp_count = grp_count + 1; + mru->lists = (struct list_head*) + kmem_alloc(mru->grp_count * sizeof(*mru->lists), KM_SLEEP); + + if (!mru->lists || !DEBUG_INIT_CACHE(mru)) { + err = ENOMEM; + goto exit; + } + + for (grp = 0; grp < mru->grp_count; grp++) + INIT_LIST_HEAD(mru->lists + grp); + + /* + * We use GFP_KERNEL radix tree preload and do inserts under a + * spinlock so GFP_ATOMIC is appropriate for the radix tree itself. + */ + INIT_RADIX_TREE(&mru->store, GFP_ATOMIC); + INIT_LIST_HEAD(&mru->reap_list); + spinlock_init(&mru->lock, "xfs_mru_cache"); + INIT_DELAYED_WORK(&mru->work, _xfs_mru_cache_reap); + + mru->grp_time = grp_time; + mru->free_func = free_func; + + /* start up the reaper event */ + mru->next_reap = 0; + mru->reap_all = 0; + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + + *mrup = mru; + +exit: + if (err && mru && mru->lists) + kmem_free(mru->lists, mru->grp_count * sizeof(*mru->lists)); + if (err && mru) + kmem_free(mru, sizeof(*mru)); + + return err; +} + +/* + * When flushing, we stop the periodic reaper from running first + * so we don't race with it. If we are flushing on unmount, we + * don't want to restart the reaper again, so the restart is conditional. + * + * Because reaping can drop the last refcount on inodes which can free + * extents, we have to push the reaping off to the workqueue thread + * because we could be called holding locks that extent freeing requires. 
+ */ +void +xfs_mru_cache_flush( + xfs_mru_cache_t *mru, + int restart) +{ + DEBUG_PRINT_STACK_VARS; + + if (!mru || !mru->lists) + return; + + cancel_rearming_delayed_workqueue(reap_wq, &mru->work); + + mutex_spinlock(&mru->lock); + mru->reap_all = 1; + mutex_spinunlock(&mru->lock, 0); + + queue_work(reap_wq, &mru->work.work); + flush_workqueue(reap_wq); + + mutex_spinlock(&mru->lock); + WARN_ON_ONCE(mru->reap_all != 0); + mru->reap_all = 0; + if (restart) + queue_delayed_work(reap_wq, &mru->work, mru->grp_time); + mutex_spinunlock(&mru->lock, 0); +} + +void +xfs_mru_cache_destroy( + xfs_mru_cache_t *mru) +{ + if (!mru || !mru->lists) + return; + + /* we don't want the reaper to restart here */ + xfs_mru_cache_flush(mru, 0); + + DEBUG_UNINIT_CACHE(mru); + kmem_free(mru->lists, mru->grp_count * sizeof(*mru->lists)); + kmem_free(mru, sizeof(*mru)); +} + +int +xfs_mru_cache_insert( + xfs_mru_cache_t *mru, + unsigned long key, + void *value) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return EINVAL; + + elem = (xfs_mru_cache_elem_t*)kmem_zone_zalloc(elem_zone, KM_SLEEP); + if (!elem) + return ENOMEM; + + if (radix_tree_preload(GFP_KERNEL)) { + kmem_zone_free(elem_zone, elem); + return ENOMEM; + } + + INIT_LIST_HEAD(&elem->list_node); + elem->key = key; + elem->value = value; + + mutex_spinlock(&mru->lock); + + radix_tree_insert(&mru->store, key, elem); + radix_tree_preload_end(); + + _xfs_mru_cache_list_insert(mru, elem); + + DEBUG_INC_ALLOCS(mru); + + mutex_spinunlock(&mru->lock, 0); + + return 0; +} + +void* +xfs_mru_cache_remove( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + void *value = NULL; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_delete(&mru->store, key); + if (elem) { + value = elem->value; + list_del(&elem->list_node); + DEBUG_INC_FREES(mru); + } + + 
mutex_spinunlock(&mru->lock, 0); + + if (elem) + kmem_zone_free(elem_zone, elem); + + return value; +} + +void +xfs_mru_cache_delete( + xfs_mru_cache_t *mru, + unsigned long key) +{ + void *value; + + if ((value = xfs_mru_cache_remove(mru, key))) + mru->free_func(key, value); +} + +void* +xfs_mru_cache_lookup( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_lookup(&mru->store, key); + if (elem) { + list_del(&elem->list_node); + _xfs_mru_cache_list_insert(mru, elem); + } + else + mutex_spinunlock(&mru->lock, 0); + + return elem ? elem->value : NULL; +} + +void* +xfs_mru_cache_peek( + xfs_mru_cache_t *mru, + unsigned long key) +{ + xfs_mru_cache_elem_t *elem; + + ASSERT(mru && mru->lists); + if (!mru || !mru->lists) + return NULL; + + mutex_spinlock(&mru->lock); + elem = (xfs_mru_cache_elem_t*)radix_tree_lookup(&mru->store, key); + if (!elem) + mutex_spinunlock(&mru->lock, 0); + + return elem ? 
elem->value : NULL; +} + +void +xfs_mru_cache_done( + xfs_mru_cache_t *mru) +{ + mutex_spinunlock(&mru->lock, 0); +} + +#ifdef DEBUG_MRU_CACHE +STATIC int +_xfs_mru_cache_print( + xfs_mru_cache_t *mru, + char *buf) +{ + unsigned int grp; + struct list_head *node; + char *bufp = buf; + + for (grp = 0; grp < mru->grp_count; grp++) { + mru->list_elems[grp] = 0; + list_for_each(node, mru->lists + grp) + mru->list_elems[grp]++; + } + mru->reap_elems = 0; + list_for_each(node, &mru->reap_list) + mru->reap_elems++; + + bufp += sprintf(bufp, "(%d) ", mru->reap_elems); + + for (grp = 0; grp < mru->grp_count; grp++) + { + if (grp == mru->lru_grp) + *bufp++ = '*'; + + bufp += sprintf(bufp, "%u", mru->list_elems[grp]); + + if (grp == mru->lru_grp) + *bufp++ = '*'; + + if (grp < mru->grp_count - 1) + *bufp++ = ' '; + } + + bufp += sprintf(bufp, " [%lu/%lu]", mru->allocs, mru->frees); + + return bufp - buf; +} +#endif /* DEBUG_MRU_CACHE */ Index: 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mru_cache.h 2007-05-10 17:24:13.155002014 +1000 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2000-2002,2006 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#ifndef __XFS_MRU_CACHE_H__ +#define __XFS_MRU_CACHE_H__ + +/* + * The MRU Cache data structure consists of a data store, an array of lists and + * a lock to protect its internal state. At initialisation time, the client + * supplies an element lifetime in milliseconds and a group count, as well as a + * function pointer to call when deleting elements. A data structure for + * queueing up work in the form of timed callbacks is also included. + * + * The group count controls how many lists are created, and thereby how finely + * the elements are grouped in time. When reaping occurs, all the elements in + * all the lists whose time has expired are deleted. + * + * To give an example of how this works in practice, consider a client that + * initialises an MRU Cache with a lifetime of ten seconds and a group count of + * five. Five internal lists will be created, each representing a two second + * period in time. When the first element is added, time zero for the data + * structure is initialised to the current time. + * + * All the elements added in the first two seconds are appended to the first + * list. Elements added in the third second go into the second list, and so on. + * If an element is accessed at any point, it is removed from its list and + * inserted at the head of the current most-recently-used list. + * + * The reaper function will have nothing to do until at least twelve seconds + * have elapsed since the first element was added. The reason for this is that + * if it were called at t=11s, there could be elements in the first list that + * have only been inactive for nine seconds, so it still does nothing. If it is + * called anywhere between t=12 and t=14 seconds, it will delete all the + * elements that remain in the first list. 
It's therefore possible for elements + * to remain in the data store even after they've been inactive for up to + * (t + t/g) seconds, where t is the inactive element lifetime and g is the + * number of groups. + * + * The above example assumes that the reaper function gets called at least once + * every (t/g) seconds. If it is called less frequently, unused elements will + * accumulate in the reap list until the reaper function is eventually called. + * The current implementation uses work queue callbacks to carefully time the + * reaper function calls, so this should happen rarely, if at all. + * + * From a design perspective, the primary reason for the choice of a list array + * representing discrete time intervals is that it's only practical to reap + * expired elements in groups of some appreciable size. This automatically + * introduces a granularity to element lifetimes, so there's no point storing an + * individual timeout with each element that specifies a more precise reap time. + * The bonus is a saving of sizeof(long) bytes of memory per element stored. + * + * The elements could have been stored in just one list, but an array of + * counters or pointers would need to be maintained to allow them to be divided + * up into discrete time groups. More critically, the process of touching or + * removing an element would involve walking large portions of the entire list, + * which would have a detrimental effect on performance. The additional memory + * requirement for the array of list heads is minimal. + * + * When an element is touched or deleted, it needs to be removed from its + * current list. Doubly linked lists are used to make the list maintenance + * portion of these operations O(1). Since reaper timing can be imprecise, + * inserts and lookups can occur when there are no free lists available. When + * this happens, all the elements on the LRU list need to be migrated to the end + * of the reap list. 
To keep the list maintenance portion of these operations + * O(1) also, list tails need to be accessible without walking the entire list. + * This is the reason why doubly linked list heads are used. + */ + +/* Function pointer type for callback to free a client's data pointer. */ +typedef void (*xfs_mru_cache_free_func_t)(void*, void*); + +typedef struct xfs_mru_cache +{ + struct radix_tree_root store; /* Core storage data structure. */ + struct list_head *lists; /* Array of lists, one per grp. */ + struct list_head reap_list; /* Elements overdue for reaping. */ + spinlock_t lock; /* Lock to protect this struct. */ + unsigned int grp_count; /* Number of discrete groups. */ + unsigned int grp_time; /* Time period spanned by grps. */ + unsigned int lru_grp; /* Group containing time zero. */ + unsigned long time_zero; /* Time first element was added. */ + unsigned long next_reap; /* Time that the reaper should + next do something. */ + unsigned int reap_all; /* if set, reap all lists */ + xfs_mru_cache_free_func_t free_func; /* Function pointer for freeing. */ + struct delayed_work work; /* Workqueue data for reaping. */ +#ifdef DEBUG_MRU_CACHE + unsigned int *list_elems; + unsigned int reap_elems; + unsigned long allocs; + unsigned long frees; +#endif +} xfs_mru_cache_t; + +/* + * xfs_mru_cache_init() prepares memory zones and any other globally scoped + * resources. + */ +int +xfs_mru_cache_init(void); + +/* + * xfs_mru_cache_uninit() tears down all the globally scoped resources prepared + * in xfs_mru_cache_init(). + */ +void +xfs_mru_cache_uninit(void); + +/* + * To initialise a struct xfs_mru_cache pointer, call xfs_mru_cache_create() + * with the address of the pointer, a lifetime value in milliseconds, a group + * count and a free function to use when deleting elements. This function + * returns 0 if the initialisation was successful. 
+ */ +int +xfs_mru_cache_create(struct xfs_mru_cache **mrup, + unsigned int lifetime_ms, + unsigned int grp_count, + xfs_mru_cache_free_func_t free_func); + +/* + * Call xfs_mru_cache_flush() to flush out all cached entries, calling their + * free functions as they're deleted. When this function returns, the caller is + * guaranteed that all the free functions for all the elements have finished + * executing. + * + * While we are flushing, we stop the periodic reaper event from triggering. + * Normally, we want to restart this periodic event, but if we are shutting + * down the cache we do not want it restarted. Hence the restart parameter, + * where 0 = do not restart reaper and 1 = restart reaper. + */ +void +xfs_mru_cache_flush( + xfs_mru_cache_t *mru, + int restart); + +/* + * Call xfs_mru_cache_destroy() with the MRU Cache pointer when the cache is no + * longer needed. + */ +void +xfs_mru_cache_destroy(struct xfs_mru_cache *mru); + +/* + * To insert an element, call xfs_mru_cache_insert() with the data store, the + * element's key and the client data pointer. This function returns 0 on + * success or ENOMEM if memory for the data element couldn't be allocated. + */ +int +xfs_mru_cache_insert(struct xfs_mru_cache *mru, + unsigned long key, + void *value); + +/* + * To remove an element without calling the free function, call + * xfs_mru_cache_remove() with the data store and the element's key. On success + * the client data pointer for the removed element is returned, otherwise this + * function will return a NULL pointer. + */ +void* +xfs_mru_cache_remove(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To remove an element and call the free function, call xfs_mru_cache_delete() + * with the data store and the element's key. + */ +void +xfs_mru_cache_delete(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To look up an element using its key, call xfs_mru_cache_lookup() with the + * data store and the element's key.
If found, the element will be moved to the + * head of the MRU list to indicate that it's been touched. + * + * The internal data structures are protected by a spinlock that is STILL HELD + * when this function returns. Call xfs_mru_cache_done() to release it. Note + * that it is not safe to call any function that might sleep in the interim. + * + * The implementation could have used reference counting to avoid this + * restriction, but since most clients simply want to get, set or test a member + * of the returned data structure, the extra per-element memory isn't warranted. + * + * If the element isn't found, this function returns NULL and the spinlock is + * released. xfs_mru_cache_done() should NOT be called when this occurs. + */ +void* +xfs_mru_cache_lookup(struct xfs_mru_cache *mru, + unsigned long key); + +/* + * To look up an element using its key, but leave its location in the internal + * lists alone, call xfs_mru_cache_peek(). If the element isn't found, this + * function returns NULL. + * + * See the comments above the declaration of the xfs_mru_cache_lookup() function + * for important locking information pertaining to this call. + */ +void* +xfs_mru_cache_peek(struct xfs_mru_cache *mru, + unsigned long key); +/* + * To release the internal data structure spinlock after having performed an + * xfs_mru_cache_lookup() or an xfs_mru_cache_peek(), call xfs_mru_cache_done() + * with the data store pointer. 
+ */ +void +xfs_mru_cache_done(struct xfs_mru_cache *mru); + +#endif /* __XFS_MRU_CACHE_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2007-05-10 17:24:13.163000966 +1000 @@ -51,6 +51,8 @@ #include "xfs_acl.h" #include "xfs_attr.h" #include "xfs_clnt.h" +#include "xfs_mru_cache.h" +#include "xfs_filestream.h" #include "xfs_fsops.h" STATIC int xfs_sync(bhv_desc_t *, int, cred_t *); @@ -81,6 +83,8 @@ xfs_init(void) xfs_dabuf_zone = kmem_zone_init(sizeof(xfs_dabuf_t), "xfs_dabuf"); xfs_ifork_zone = kmem_zone_init(sizeof(xfs_ifork_t), "xfs_ifork"); xfs_acl_zone_init(xfs_acl_zone, "xfs_acl"); + xfs_mru_cache_init(); + xfs_filestream_init(); /* * The size of the zone allocated buf log item is the maximum @@ -164,6 +168,8 @@ xfs_cleanup(void) xfs_cleanup_procfs(); xfs_sysctl_unregister(); xfs_refcache_destroy(); + xfs_filestream_uninit(); + xfs_mru_cache_uninit(); xfs_acl_zone_destroy(xfs_acl_zone); #ifdef XFS_DIR2_TRACE @@ -320,6 +326,9 @@ xfs_start_flags( else mp->m_flags &= ~XFS_MOUNT_BARRIER; + if (ap->flags2 & XFSMNT2_FILESTREAMS) + mp->m_flags |= XFS_MOUNT_FILESTREAMS; + return 0; } @@ -518,6 +527,9 @@ xfs_mount( if (mp->m_flags & XFS_MOUNT_BARRIER) xfs_mountfs_check_barriers(mp); + if ((error = xfs_filestream_mount(mp))) + goto error2; + error = XFS_IOINIT(vfsp, args, flags); if (error) goto error2; @@ -575,6 +587,13 @@ xfs_unmount( */ xfs_refcache_purge_mp(mp); + /* + * Blow away any referenced inode in the filestreams cache. + * This can and will cause log traffic as inodes go inactive + * here. 
+ */ + xfs_filestream_unmount(mp); + XFS_bflush(mp->m_ddev_targp); error = xfs_unmount_flush(mp, 0); if (error) @@ -682,6 +701,7 @@ xfs_mntupdate( mp->m_flags &= ~XFS_MOUNT_BARRIER; } } else if (!(vfsp->vfs_flag & VFS_RDONLY)) { /* rw -> ro */ + xfs_filestream_flush(mp); bhv_vfs_sync(vfsp, SYNC_FSDATA|SYNC_BDFLUSH|SYNC_ATTR, NULL); xfs_quiesce_fs(mp); xfs_log_sbcount(mp, 1); @@ -909,6 +929,9 @@ xfs_sync( { xfs_mount_t *mp = XFS_BHVTOM(bdp); + if (flags & SYNC_IOWAIT) + xfs_filestream_flush(mp); + return xfs_syncsub(mp, flags, NULL); } @@ -1869,6 +1892,8 @@ xfs_parseargs( } else if (!strcmp(this_char, "irixsgid")) { cmn_err(CE_WARN, "XFS: irixsgid is now a sysctl(2) variable, option is deprecated."); + } else if (!strcmp(this_char, "filestreams")) { + args->flags2 |= XFSMNT2_FILESTREAMS; } else { cmn_err(CE_WARN, "XFS: unknown mount option [%s].", this_char); Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-05-10 17:24:13.170999917 +1000 @@ -51,6 +51,7 @@ #include "xfs_refcache.h" #include "xfs_trans_space.h" #include "xfs_log_priv.h" +#include "xfs_filestream.h" STATIC int xfs_open( @@ -94,6 +95,19 @@ xfs_close( return 0; /* + * If we are using filestreams, and we have an unlinked + * file that we are processing the last close on, then nothing + * will be able to reopen and write to this file. Purge this + * inode from the filestreams cache so that it doesn't delay + * teardown of the inode. + */ + if ((ip->i_d.di_nlink == 0) && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM))) { + xfs_filestream_deassociate(ip); + } + + /* * If we previously truncated this file and removed old data in * the process, we want to initiate "early" writeout on the last * close. 
This is an attempt to combat the notorious NULL files @@ -820,6 +834,8 @@ xfs_setattr( di_flags |= XFS_DIFLAG_PROJINHERIT; if (vap->va_xflags & XFS_XFLAG_NODEFRAG) di_flags |= XFS_DIFLAG_NODEFRAG; + if (vap->va_xflags & XFS_XFLAG_FILESTREAM) + di_flags |= XFS_DIFLAG_FILESTREAM; if ((ip->i_d.di_mode & S_IFMT) == S_IFDIR) { if (vap->va_xflags & XFS_XFLAG_RTINHERIT) di_flags |= XFS_DIFLAG_RTINHERIT; @@ -2564,6 +2580,18 @@ xfs_remove( */ xfs_refcache_purge_ip(ip); + /* + * If we are using filestreams, kill the stream association. + * If the file is still open it may get a new one but that + * will get killed on last close in xfs_close() so we don't + * have to worry about that. + */ + if (link_zero && + ((ip->i_mount->m_flags & XFS_MOUNT_FILESTREAMS) || + (ip->i_d.di_flags & XFS_DIFLAG_FILESTREAM))) { + xfs_filestream_deassociate(ip); + } + vn_trace_exit(XFS_ITOV(ip), __FUNCTION__, (inst_t *)__return_address); /* Index: 2.6.x-xfs-new/fs/xfs/quota/xfs_qm.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/quota/xfs_qm.c 2007-05-10 17:22:43.506752209 +1000 +++ 2.6.x-xfs-new/fs/xfs/quota/xfs_qm.c 2007-05-10 17:24:13.186997821 +1000 @@ -65,7 +65,6 @@ kmem_zone_t *qm_dqtrxzone; static struct shrinker *xfs_qm_shaker; static cred_t xfs_zerocr; -static xfs_inode_t xfs_zeroino; STATIC void xfs_qm_list_init(xfs_dqlist_t *, char *, int); STATIC void xfs_qm_list_destroy(xfs_dqlist_t *); @@ -1415,7 +1414,7 @@ xfs_qm_qino_alloc( return error; } - if ((error = xfs_dir_ialloc(&tp, &xfs_zeroino, S_IFREG, 1, 0, + if ((error = xfs_dir_ialloc(&tp, NULL, S_IFREG, 1, 0, &xfs_zerocr, 0, 1, ip, &committed))) { xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT); From owner-xfs@oss.sgi.com Thu May 10 18:11:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 18:11:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id 
l4B1BpfB008635 for ; Thu, 10 May 2007 18:11:53 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA05863; Fri, 11 May 2007 11:11:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4B1BkAf88912650; Fri, 11 May 2007 11:11:47 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4B1BjMD90624383; Fri, 11 May 2007 11:11:45 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 11 May 2007 11:11:45 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: fix b0rked test 030 behaviour. Message-ID: <20070511011145.GN86004887@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.4.2.1i X-archive-position: 11392 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Test 030 is not testing things as it should. Specifically, corrupting the AGFL with "-1" is a no-op on a freshly repaired filesystem, because xfs_repair rebuilds the AGF btrees and AGFL from scratch and does not populate the AGFL. The current test does: repair mount create file remove file umount And it does the filesystem twiddling to check that the filesystem is usable after repair. The problem is that this doesn't dirty the filesystem - the create is followed by a remove, so nothing is actually allocated and so the AGFL lists do not get modified. Hence after a repair/check/corruption cycle, writing "-1" to the AGFL is a no-op because it is already full of "-1" fields (NULL blocks).
With filestreams, the create/remove pair *does* modify the filesystem and so when we write "-1" to the AGFL, we get different output because the filesystem detects new corruptions and the test "fails". So, to make behaviour consistent, dirty the filesystem before corrupting it on each cycle. Hence it doesn't matter if we are using filestreams or not, we'll really test out corrupting the AGFL with NULL blocks (-1) now. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/030.out.irix | 4 ++++ xfstests/030.out.linux | 4 ++++ xfstests/common.repair | 10 ++++++++++ 3 files changed, 18 insertions(+) Index: xfs-cmds/xfstests/030.out.irix =================================================================== --- xfs-cmds.orig/xfstests/030.out.irix 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/030.out.irix 2007-05-03 17:10:54.227585189 +1000 @@ -262,6 +262,10 @@ Wrote X.XXKb (value 0xffffffff) Phase 1 - find and verify superblock... Phase 2 - zero log... - scan filesystem freespace and inode maps... +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... Index: xfs-cmds/xfstests/030.out.linux =================================================================== --- xfs-cmds.orig/xfstests/030.out.linux 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/030.out.linux 2007-05-03 17:10:54.231584667 +1000 @@ -270,6 +270,10 @@ Phase 1 - find and verify superblock... Phase 2 - using log - zero log... - scan filesystem freespace and inode maps... +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... 
Index: xfs-cmds/xfstests/common.repair =================================================================== --- xfs-cmds.orig/xfstests/common.repair 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/common.repair 2007-05-03 17:10:54.231584667 +1000 @@ -72,8 +72,18 @@ _check_repair() { value=$1 structure="$2" + + #ensure the filesystem has been dirtied since last repair + _scratch_mount + POSIXLY_CORRECT=yes \ + dd if=/bin/sh of=$SCRATCH_MNT/sh 2>&1 |_filter_dd + sync + rm -f $SCRATCH_MNT/sh + umount $SCRATCH_MNT + _zero_position $value "$structure" _scratch_xfs_repair 2>&1 | _filter_repair + # some basic sanity checks... _check_scratch_fs _scratch_mount #mount From owner-xfs@oss.sgi.com Thu May 10 20:39:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 20:39:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B3dDfB032370 for ; Thu, 10 May 2007 20:39:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA08907; Fri, 11 May 2007 13:39:07 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 423A158CA530; Fri, 11 May 2007 13:39:06 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964546 - Only use refcounted pages for I/O Message-Id: <20070511033907.423A158CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 13:39:06 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11393 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Only use refcounted pages for I/O Many block drivers (aoe, iscsi) really want refcountable pages in bios, which is what almost everyone sends down.
XFS unfortunately has a few places where it sends down buffers that may come from kmalloc, which breaks them. Fix the places that use kmalloc()d buffers. Signed-Off-By: Christoph Hellwig Date: Fri May 11 13:37:22 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: dgc,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28562a fs/xfs/xfs_log.c - 1.328 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.328&r2=text&tr2=1.327&f=h - Convert log buffers to use xfs_buf_get_noaddr rather than using kmem_alloc()d buffers. fs/xfs/linux-2.6/xfs_buf.h - 1.120 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.h.diff?r1=text&tr1=1.120&r2=text&tr2=1.119&f=h - Use alloc_page() rather than kmem_alloc() for buffers that do not use page cache backed pages. fs/xfs/linux-2.6/xfs_buf.c - 1.236 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.236&r2=text&tr2=1.235&f=h - Use alloc_page() rather than kmem_alloc() for buffers that do not use page cache backed pages. 
From owner-xfs@oss.sgi.com Thu May 10 21:01:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 21:01:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B41ZfB003462 for ; Thu, 10 May 2007 21:01:37 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA09369; Fri, 11 May 2007 14:01:31 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id C7E7A58CA530; Fri, 11 May 2007 14:01:31 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 957886 - xfs_growfs should refuse to grow fs past 16Tb on a 32 bit system Message-Id: <20070511040131.C7E7A58CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 14:01:31 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11394 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Don't grow filesystems past the size they can index. When growing a filesystem we don't check to see if the new size overflows the page cache index range, so we can do silly things like grow a filesystem past 16TB on a 32-bit system. Check new filesystem sizes against the limits the kernel can support. Signed-Off-By: Nathan Scott Date: Fri May 11 14:00:21 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: dgc The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28563a fs/xfs/xfs_rtalloc.c - 1.107 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_rtalloc.c.diff?r1=text&tr1=1.107&r2=text&tr2=1.106&f=h - Check new rt volume size against the maximum the system can support.
fs/xfs/xfs_mount.h - 1.235 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.h.diff?r1=text&tr1=1.235&r2=text&tr2=1.234&f=h - Factor maximum supported filesystem size checks to allow other callers to use it. fs/xfs/xfs_mount.c - 1.394 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=text&tr1=1.394&r2=text&tr2=1.393&f=h - Factor maximum supported filesystem size checks to allow other callers to use it. fs/xfs/xfs_fsops.c - 1.123 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c.diff?r1=text&tr1=1.123&r2=text&tr2=1.122&f=h - Check new volume size against the maximum the system can support. From owner-xfs@oss.sgi.com Thu May 10 22:03:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:03:50 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B53ifB014004 for ; Thu, 10 May 2007 22:03:46 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA10567; Fri, 11 May 2007 15:03:40 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id A25B858CA530; Fri, 11 May 2007 15:03:40 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 963674 - Don't hold ilock when calling vn_iowait. Message-Id: <20070511050340.A25B858CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:03:40 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11395 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Sleeping with the ilock waiting for I/O completion is Bad. Recent fixes to the filesystem freezing code introduced a vn_iowait call in the middle of the sync code. Unfortunately, at the point where this call was added we are holding the ilock. 
The ilock is needed by I/O completion for unwritten extent conversion and now updating the file size. Hence I/O cannot complete if we hold the ilock while waiting for I/O completion. Fix up the bug and clean the code up around it. Date: Fri May 11 15:02:29 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org,tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28566a fs/xfs/xfs_vfsops.c - 1.519 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.519&r2=text&tr2=1.518&f=h - Drop the ilock before calling vn_iowait() when doing a SYNC_IOWAIT sync operation. Make the code easier to understand as well. From owner-xfs@oss.sgi.com Thu May 10 22:25:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:25:28 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B5POfB018617 for ; Thu, 10 May 2007 22:25:26 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA10961; Fri, 11 May 2007 15:25:20 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 2957D58CA530; Fri, 11 May 2007 15:25:20 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964545 - use-after-free of xfs_buf_t during log unmount Message-Id: <20070511052520.2957D58CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:25:20 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11396 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix use-after-free during log unmount.
Don't reference the log buffer after running the callbacks as the callback can trigger the log buffers to be freed during unmount. Date: Fri May 11 15:24:46 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28567a fs/xfs/xfs_log.c - 1.329 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.329&r2=text&tr2=1.328&f=h - Don't reference the log buffer after running the callbacks as it may have been freed during the unmount. From owner-xfs@oss.sgi.com Thu May 10 22:35:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 10 May 2007 22:35:54 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4B5ZnfB020332 for ; Thu, 10 May 2007 22:35:51 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA11162; Fri, 11 May 2007 15:35:45 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 54D9758CA530; Fri, 11 May 2007 15:35:45 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964544 - Barriers need to be dynamically checked and switched off Message-Id: <20070511053545.54D9758CA530@chook.melbourne.sgi.com> Date: Fri, 11 May 2007 15:35:45 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11397 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Barriers need to be dynamically checked and switched off If the underlying block device suddenly stops supporting barriers, we need to handle the -EOPNOTSUPP error in a sane manner rather than shutting down the filesystem.
If we get this error, clear the barrier flag, reissue the I/O, and tell the world bad things are occurring. Date: Fri May 11 15:35:19 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28568a fs/xfs/xfs_log.c - 1.330 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.330&r2=text&tr2=1.329&f=h - If we have barriers enabled and we see a barrier log write come back without the barrier flag on it, then we need to stop issuing barriers on the log writes. Make noise about it, too. fs/xfs/linux-2.6/xfs_super.c - 1.380 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_super.c.diff?r1=text&tr1=1.380&r2=text&tr2=1.379&f=h - We shouldn't peer down into the backing device to see if barriers are supported or not - the test I/O is sufficient to tell us this. fs/xfs/linux-2.6/xfs_buf.c - 1.237 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.237&r2=text&tr2=1.236&f=h - If the buffer gets a EOPNOTSUPP I/O error and it is a barrier write, clear the barrier and reissue the I/O. 
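The fallback described in the TAKE message above (clear the barrier flag on -EOPNOTSUPP, reissue the I/O, and make noise about it) can be sketched in user-space C roughly as follows. This is only an illustrative model under stated assumptions: the sim_device and sim_mount structures and the sim_submit_write() and sim_log_write() helpers are hypothetical names invented for this sketch and do not appear in the XFS sources or in the checked-in change.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for a block device and a mount; not XFS types. */
struct sim_device {
	bool supports_barriers;	/* may turn false while the fs is mounted */
	int  writes_issued;	/* counts every submission, failed or not */
};

struct sim_mount {
	bool barrier_enabled;	/* models a mount-wide "use barriers" flag */
	struct sim_device *dev;
};

/* Simulated submission: a barrier write fails if the device lacks support. */
static int sim_submit_write(struct sim_device *dev, bool barrier)
{
	dev->writes_issued++;
	if (barrier && !dev->supports_barriers)
		return -EOPNOTSUPP;
	return 0;
}

/*
 * Issue a log write. If a barrier write is rejected with -EOPNOTSUPP,
 * switch barriers off for the mount, warn, and reissue the same write
 * without the barrier instead of treating the error as fatal.
 */
static int sim_log_write(struct sim_mount *mp)
{
	int error = sim_submit_write(mp->dev, mp->barrier_enabled);

	if (error == -EOPNOTSUPP && mp->barrier_enabled) {
		fprintf(stderr, "sim: barrier write rejected, disabling barriers\n");
		mp->barrier_enabled = false;
		error = sim_submit_write(mp->dev, false);
	}
	return error;
}
```

In this model the first rejected write costs two submissions; once the flag is cleared, later writes go out without a barrier and are submitted once. Any other error is passed back to the caller unchanged.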
From owner-xfs@oss.sgi.com Fri May 11 04:01:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 04:01:44 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BB1bfB020236 for ; Fri, 11 May 2007 04:01:39 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4BB1ak9030146 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4BB1aIw544592 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4BB1ZDE031697 for ; Fri, 11 May 2007 07:01:36 -0400 Received: from qubit.in.ibm.com ([9.124.219.214]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4BB1XPB031541; Fri, 11 May 2007 07:01:34 -0400 Received: from qubit.in.ibm.com (localhost.localdomain [127.0.0.1]) by qubit.in.ibm.com (Postfix) with ESMTP id C81A667FFD; Fri, 11 May 2007 16:33:11 +0530 (IST) Received: (from suparna@localhost) by qubit.in.ibm.com (8.13.1/8.13.1/Submit) id l4BB37tF003708; Fri, 11 May 2007 16:33:07 +0530 Date: Fri, 11 May 2007 16:33:01 +0530 From: Suparna Bhattacharya To: David Chinner Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070511110301.GB28425@in.ibm.com> Reply-To: suparna@in.ibm.com References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> <20070510223950.GD86004887@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070510223950.GD86004887@sgi.com> User-Agent: Mutt/1.5.11 X-archive-position: 11398 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: suparna@in.ibm.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > > I have the updated patches ready which take care of Andrew's comments. > > > > Will run some tests and post them soon. > > > > > > > > But, before submitting these patches, I think it will be better to > > > > finalize on certain things which might be worth some discussion here: > > > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > > in this case. I also tend to agree with them. Does anyone has an > > > > argument in favor of changing the filesize ? 
If not, I will remove the > > > > code which changes the filesize, before I resubmit the concerned ext4 > > > > patch. > > > > > > I think there needs to be both. If we don't have a mechanism to atomically > > > change the file size with the preallocation, then applications that use > > > stat() to work out if they need to preallocate more space will end up > > > racing. > > > > By "both" above, do you mean we should give user the flexibility if it wants > > the filesize changed or not ? It can be done by having *two* modes for > > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > > will change the filesize if required (i.e. when allocation is beyond EOF) > > and also update [cm]time. This way, the application can decide what it > > wants. > > Yes, that's right. > > > This will be helpfull for the partial allocation scenario also. Think of the > > case when we do not change the filesize in fallocate() and expect > > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > > Now if fallocate() results in a partial allocation with -ENOSPC error > > returned, applications/posix_fallocate() will not know for what length > > ftruncate() has to be called. :( > > Well, posix_fallocate() either gets all the space or it fails. If > you truncate to extend the file size after an ENOSPC, then that is > a buggy implementation. > > The same could be said for any application, or even the fallocate() > call itself if it changes the filesize without having completely > preallocated the space asked.... > > > Hence it may be a good idea to give user the flexibility if it wants to > > atomically change the file size with preallocation or not. But, with more > > flexibility there comes inconsistency in behavior, which is worth > > considering. 
> > We've got different modes to specify different behaviour. That's > what the mode field was put there for in the first place - the > interface is *designed* to support different preallocation > behaviours.... > > > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > > normal (non-preallocated) blocks (blocks allocated via regular > > > > write/truncate operations) also (i.e. work as punch()) ? > > > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > > i did for FA_UNALLOCATE as well. > > > > Ok. But, some people may not expect/like this. I think, we can keep it on > > the backburner for a while, till other issues are sorted out. > > How can it be a "backburner" issue when it defines the > implementation? I've already implemented some thing in XFS that > sort of does what I think that the interface is supposed to do, but > I need that interface to be nailed down before proceeding any > further. > > All I'm really interested in right now is that the fallocate > _interface_ can be used as a *complete replacement* for the > pre-existing XFS-specific ioctls that are already used by > applications. What ext4 can or can't do right now is irrelevant to > this discussion - the interface definition needs to take priority > over implementation.... Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. 
Regards Suparna > > Cheers, > > Dave, > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India From owner-xfs@oss.sgi.com Fri May 11 07:48:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 07:48:33 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BEmSfB026547 for ; Fri, 11 May 2007 07:48:29 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id F047B2C804B; Fri, 11 May 2007 07:47:38 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id B19862C8043; Fri, 11 May 2007 07:47:38 -0700 (PDT) Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Fri, 11 May 2007 07:47:38 -0700 (PDT) Message-ID: <4644823A.8090104@goop.org> Date: Fri, 11 May 2007 07:48:26 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
References: <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> <20070511003257.GL86004887@sgi.com> In-Reply-To: <20070511003257.GL86004887@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11399 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: >> Yes, that does look like a good candidate. Should I try to >> before-and-after this change? >> > > Yes please! > OK, definite result. Before ba87ea699ebd9dd577bf055ebc4a98200e337542: all OK. After: truncated files. I also got a bmap of a particular truncated file, linux-clone-test-1/.hg/store/00manifest.i, diffing before with after: --rw-r--r-- 1 root root 3558208 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i +-rw-r--r-- 1 root root 3541760 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i 16: [6144..6271]: 18141808..18141935 2 (2413168..2413295) 128 17: [6272..6399]: 18140608..18140735 2 (2411968..2412095) 128 18: [6400..6911]: 18136464..18136975 2 (2407824..2408335) 512 - 19: [6912..6951]: 18136336..18136375 2 (2407696..2407735) 40 + 19: [6912..6919]: 18136336..18136343 2 (2407696..2407703) 8 J From owner-xfs@oss.sgi.com Fri May 11 09:52:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 11 May 2007 09:52:30 -0700 (PDT) Received: from gab.dneg.com (mail.dneg.com [193.203.82.196]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4BGqNfB019117 for ; Fri, 11 May 2007 09:52:25 -0700 Received: from localhost (localhost.localdomain [127.0.0.1]) by gab.dneg.com (Postfix) with ESMTP id DF7594D5F86 for ; Fri, 11 May 2007 
17:34:53 +0100 (BST) Received: from gab.dneg.com ([127.0.0.1]) by localhost (gab.dneg.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id TpQDzJTEYiUv for ; Fri, 11 May 2007 17:34:52 +0100 (BST) Received: from [172.16.10.24] (spinach.dneg.com [172.16.10.24]) by gab.dneg.com (Postfix) with ESMTP id E13BE4D5EFD for ; Fri, 11 May 2007 17:34:52 +0100 (BST) Message-ID: <46449B2C.30208@dneg.com> Date: Fri, 11 May 2007 17:34:52 +0100 From: Evan Fraser User-Agent: Thunderbird 2.0.0.0 (X11/20070326) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: updatedb triggers XFS internal error Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11400 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: evan@dneg.com Precedence: bulk X-list: xfs Hello, I'm having a problem with one of my Linux servers. Whenever updatedb is run, the following errors occur in the system log. 0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9 Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f Call Trace:{:xfs:xfs_da_do_buf+1513} {:xfs:xfs_da_read_buf+36} {do_lookup+83} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_da_read_buf+36} {:xfs:xfs_dir2_block_getdents+183} {:xfs:xfs_dir2_block_getdents+183} {:xfs:xfs_dir2_put_dirent64_direct+0} {link_path_walk+196} {:xfs:xfs_bmap_last_offset+226} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_dir2_getdents+222} {:xfs:xfs_readdir+84} {:xfs:linvfs_readdir+213} {filldir64+0} {filldir64+0} {vfs_readdir+154} {sys_getdents64+116} {tracesys+209} 0x0: 28 ab a5 ec 3e 42 55 1f 76 9e 01 72 72 ee bd f1 Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of file fs/xfs/xfs_da_btree.c.
Caller 0xffffffff8817957f Call Trace:{:xfs:xfs_da_do_buf+1513} {:xfs:xfs_da_read_buf+36} {find_or_create_page+30} {:xfs:xfs_da_read_buf+36} {:xfs:xfs_dir2_leaf_getdents+1107} {:xfs:xfs_dir2_leaf_getdents+1107} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_bmap_last_offset+226} {:xfs:xfs_dir2_put_dirent64_direct+0} {:xfs:xfs_dir2_getdents+250} {:xfs:xfs_readdir+84} {:xfs:linvfs_readdir+213} {filldir64+0} {filldir64+0} {vfs_readdir+154} {sys_getdents64+116} {tracesys+209} Its a dual opteron system running Fedora Core 4 and running the fedora packaged 2.6.12-1.1456_FC4smp kernel. The filesystem in question is on a md stripe raid running across an Infortrend 1.4TB hardware SCSI raid. The output from xfs_info is: meta-data=/user_data isize=256 agcount=32, agsize=11180624 blks = sectsz=512 data = bsize=4096 blocks=357779520, imaxpct=25 = sunit=16 swidth=32 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=32768, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=131072 blocks=0, rtextents=0 Any help will be gratefully received! Cheers, Evan. 
-- evan@dneg.com Linux Systems Administrator Double Negative tel: +44 (0)20 7534 4400 fax: +44 (0)20 7534 4452 77 shaftesbury avenue, w1d 5du, London From owner-xfs@oss.sgi.com Sat May 12 00:56:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 00:56:43 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4C7uVfB013039 for ; Sat, 12 May 2007 00:56:32 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA11477; Sat, 12 May 2007 17:56:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4C7uMAf91189703; Sat, 12 May 2007 17:56:23 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4C7uIhK87536032; Sat, 12 May 2007 17:56:18 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 17:56:18 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070512075618.GE85884050@sgi.com> References: <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070510225834.GF86004887@sgi.com> <4643A5B2.3060906@goop.org> <20070510232729.GH86004887@sgi.com> <4643AF8F.5040705@goop.org> <20070511003257.GL86004887@sgi.com> <4644823A.8090104@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4644823A.8090104@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11401 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 07:48:26AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > >> Yes, that does look like a good candidate. Should I try to > >> before-and-after this change? > >> > > > > Yes please! > > > > OK, definite result. Before ba87ea699ebd9dd577bf055ebc4a98200e337542: > all OK. After: truncated files. > > I also got a bmap of a particular truncated file, > linux-clone-test-1/.hg/store/00manifest.i, diffing before with after: > > --rw-r--r-- 1 root root 3558208 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i > +-rw-r--r-- 1 root root 3541760 May 11 01:16 /home/jeremy/hg/linux-clone-test-1/.hg/store/00manifest.i > > 16: [6144..6271]: 18141808..18141935 2 (2413168..2413295) 128 > 17: [6272..6399]: 18140608..18140735 2 (2411968..2412095) 128 > 18: [6400..6911]: 18136464..18136975 2 (2407824..2408335) 512 > - 19: [6912..6951]: 18136336..18136375 2 (2407696..2407735) 40 > + 19: [6912..6919]: 18136336..18136343 2 (2407696..2407703) 8 Ok, thanks for confirming the cause of the regression. I'll post a patch when I've got something for you to try. Cheers, Dave. 
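The bmap listing quoted above can be cross-checked with a little arithmetic. What follows is a minimal sketch, not code from XFS: xfs_bmap reports offsets in 512-byte blocks, while the 4096-byte filesystem block size and the helper name bbs_covering() are assumptions made here for illustration (the thread does not state the block size of the affected filesystem).

```c
#include <assert.h>
#include <stdint.h>

/* xfs_bmap prints file offsets and extent lengths in 512-byte blocks. */
#define BB_SIZE 512u

/*
 * Hypothetical helper: number of 512-byte blocks needed to hold `size`
 * bytes when space is handed out in whole filesystem blocks of
 * `fsb_size` bytes.  fsb_size = 4096 is an assumption, see above.
 */
uint64_t bbs_covering(uint64_t size, uint64_t fsb_size)
{
	uint64_t fsbs;

	/* The filesystem block must be a whole number of 512-byte blocks. */
	assert(fsb_size >= BB_SIZE && fsb_size % BB_SIZE == 0);

	fsbs = (size + fsb_size - 1) / fsb_size;	/* round up */
	return fsbs * (fsb_size / BB_SIZE);
}
```

Under that assumption, the "before" size of 3558208 bytes needs 6952 blocks, which matches an extent list ending at offset 6951, and the "after" size of 3541760 bytes needs 6920 blocks, which matches the list ending at 6919. The byte sizes differ by 16448 and the final extents by 32 blocks (16384 bytes), so the on-disk file size appears to have fallen back to an earlier, shorter value and the blocks beyond it are no longer mapped.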
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sat May 12 01:02:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 01:02:32 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4C82NfB014463 for ; Sat, 12 May 2007 01:02:24 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA11674; Sat, 12 May 2007 18:02:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4C824Af90572564; Sat, 12 May 2007 18:02:04 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4C81vTq91481671; Sat, 12 May 2007 18:01:57 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 18:01:57 +1000 From: David Chinner To: Suparna Bhattacharya Cc: David Chinner , "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070512080157.GF85884050@sgi.com> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426180332.GA7209@amitarora.in.ibm.com> <20070509160102.GA30745@amitarora.in.ibm.com> <20070510005926.GT85884050@sgi.com> <20070510115620.GB21400@amitarora.in.ibm.com> <20070510223950.GD86004887@sgi.com> <20070511110301.GB28425@in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070511110301.GB28425@in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11402 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > > All I'm really interested in right now is that the fallocate > > _interface_ can be used as a *complete replacement* for the > > pre-existing XFS-specific ioctls that are already used by > > applications. What ext4 can or can't do right now is irrelevant to > > this discussion - the interface definition needs to take priority > > over implementation.... > > Would you like to write up an interface definition description (likely > man page) and post it for review, possibly with a mention of apps using > it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short.... 
> One reason for introducing the mode parameter was to allow the interface to
> evolve incrementally as more options / semantic questions are proposed, so
> that we don't have to make all the decisions right now.
> So it would be good to start with a *minimal* definition, even just one mode.
> The rest could follow as subsequent patches, each being reviewed and debated
> separately. Otherwise this discussion can drag on for a long time.

The minimal definition needed to replace what applications use on XFS and
to support posix_fallocate is the three modes that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all
in a man page...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Sat May 12 05:46:56 2007
Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 05:47:04 -0700 (PDT)
Received: from waste.org (waste.org [66.93.16.53]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CCksfB023990 for ; Sat, 12 May 2007 05:46:56 -0700
Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.8/8.13.8/Debian-3) with ESMTP id l4CCkgpd012114 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Sat, 12 May 2007 07:46:43 -0500
Received: (from oxymoron@localhost) by waste.org (8.13.8/8.13.8/Submit) id l4CCkf7o012113; Sat, 12 May 2007 07:46:41 -0500
Date: Sat, 12 May 2007 07:46:41 -0500
From: Matt Mackall
To: Jan Engelhardt
Cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com
Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
Message-ID: <20070512124641.GZ11115@waste.org> References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.13 (2006-08-11) X-archive-position: 11403 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mpm@selenic.com Precedence: bulk X-list: xfs On Sat, May 12, 2007 at 01:21:41PM +0200, Jan Engelhardt wrote: > > On May 10 2007 10:38, Matt Mackall wrote: > >> > >> for i in `seq 20`; do > >> hg clone -U --pull a b-$i > >> hg verify b-$i # always OK > >> umount /home > >> sleep 5 > >> mount /home > >> hg verify b-$i # often found truncated files > >> done > >> > [...] > > > >This test looks like it should consist solely of open-for-append and > >write on about 20k files in the target directory. Because of the > >--pull, no hardlinks are involved. It shouldn't be all that different > >from doing tar cf - a | tar xf - b. > > > >The files get visited in alphabetical order, so the start of the > >corruption may be telling. > > You should not assume alphabetical order. Filesystems may be free to > reorder things and return them (1) randomly like in a hash (2) by > creation time during readdir(). There is no assumption. Mercurial explicitly visits files in alphabetical order for the above commands. -- Mathematics is the supreme nostalgia of our time. 
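The distinction drawn in this exchange, between relying on readdir() order and imposing an order explicitly, can be sketched in a few lines of C. This is an illustration only, not Mercurial's code (Mercurial is not written in C); the names cmp_names() and sort_names() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: byte-wise ascending order of two C strings. */
static int cmp_names(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return strcmp(*sa, *sb);
}

/*
 * Hypothetical helper: sort an array of file names in place so that the
 * caller visits them in a fixed order, whatever order the filesystem
 * happened to return them in from readdir().
 */
void sort_names(const char **names, size_t n)
{
	qsort(names, n, sizeof(*names), cmp_names);
}
```

A tool that collects names first and sorts them like this gets the same visiting order on every filesystem, which is why the position of the first damaged file can still be informative. POSIX scandir() with alphasort() does the collecting and sorting in one call, although alphasort() compares with strcoll() and so follows the current locale.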
From owner-xfs@oss.sgi.com Sat May 12 06:02:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:02:22 -0700 (PDT) Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CD2AfB028338 for ; Sat, 12 May 2007 06:02:12 -0700 Received: from linux01.gwdg.de ([134.76.13.21]) by mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hmpj7-0000iM-5t; Sat, 12 May 2007 13:25:29 +0200 Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4CBLgW1023744; Sat, 12 May 2007 13:21:44 +0200 Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4CBLfOW023691; Sat, 12 May 2007 13:21:41 +0200 Date: Sat, 12 May 2007 13:21:41 +0200 (MEST) From: Jan Engelhardt To: Matt Mackall cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? In-Reply-To: <20070510153832.GQ11115@waste.org> Message-ID: References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 11405 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jengelh@linux01.gwdg.de Precedence: bulk X-list: xfs On May 10 2007 10:38, Matt Mackall wrote: >> >> for i in `seq 20`; do >> hg clone -U --pull a b-$i >> hg verify b-$i # always OK >> umount /home >> sleep 5 >> mount /home >> hg verify b-$i # often found truncated files >> done >> [...] 
> >This test looks like it should consist solely of open-for-append and >write on about 20k files in the target directory. Because of the >--pull, no hardlinks are involved. It shouldn't be all that different >from doing tar cf - a | tar xf - b. > >The files get visited in alphabetical order, so the start of the >corruption may be telling. You should not assume alphabetical order. Filesystems may be free to reorder things and return them (1) randomly like in a hash (2) by creation time during readdir(). Jan -- From owner-xfs@oss.sgi.com Sat May 12 06:02:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:02:22 -0700 (PDT) Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CD2AfC028338 for ; Sat, 12 May 2007 06:02:15 -0700 Received: from linux01.gwdg.de ([134.76.13.21]) by mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hmpkn-00010V-61; Sat, 12 May 2007 13:27:13 +0200 Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4CBNRWc003792; Sat, 12 May 2007 13:23:30 +0200 Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4CBNRpV003734; Sat, 12 May 2007 13:23:27 +0200 Date: Sat, 12 May 2007 13:23:27 +0200 (MEST) From: Jan Engelhardt To: Jeremy Fitzhardinge cc: Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
In-Reply-To: <46439491.9010604@goop.org> Message-ID: References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 11404 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jengelh@linux01.gwdg.de Precedence: bulk X-list: xfs On May 10 2007 14:54, Jeremy Fitzhardinge wrote: >>>> What CPU architecture is this happening on? Not i686 with PAE by >>>> any chance? >>>> >>> Yes. Why? >> >> I have a bug report where NFS files are corrupted only with PAE clients. >> Corruption is at the end of the (newly untarred) files. Doesn't happen >> without PAE. > >Hm, suggestive, but I'm not convinced. Two differences to this situation: > > 1. Immediately after the clone ("untar"), the contents are completely > OK; it's only after a umount/mount cycle to problems appear And if you do a "sync" rather than umount/mount? > 2. There's no corruption as such; the files are just too short. And > it seems they're at a previously OK length, not some random size. 
Jan -- From owner-xfs@oss.sgi.com Sat May 12 06:52:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 06:52:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4CDpvfB012422 for ; Sat, 12 May 2007 06:51:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA16432; Sat, 12 May 2007 23:51:51 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4CDpmAf91517894; Sat, 12 May 2007 23:51:49 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4CDpiT591658269; Sat, 12 May 2007 23:51:44 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Sat, 12 May 2007 23:51:43 +1000 From: David Chinner To: Jan Engelhardt Cc: Jeremy Fitzhardinge , Chuck Ebbert , David Chinner , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? 
Message-ID: <20070512135143.GG85884050@sgi.com> References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-archive-position: 11406 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sat, May 12, 2007 at 01:23:27PM +0200, Jan Engelhardt wrote: > > On May 10 2007 14:54, Jeremy Fitzhardinge wrote: > >>>> What CPU architecture is this happening on? Not i686 with PAE by > >>>> any chance? > >>>> > >>> Yes. Why? > >> > >> I have a bug report where NFS files are corrupted only with PAE clients. > >> Corruption is at the end of the (newly untarred) files. Doesn't happen > >> without PAE. > > > >Hm, suggestive, but I'm not convinced. Two differences to this situation: > > > > 1. Immediately after the clone ("untar"), the contents are completely > > OK; it's only after a umount/mount cycle to problems appear > > And if you do a "sync" rather than umount/mount? I doubt it will matter - I don't think we are marking the inode dirty at the right point. The change that was at fault modifies the way we update the file size on the inode. We added an in-memory copy of the file size to the in-memory copy of the disk inode's file size that we already keep. We now only update the disk inode's (in memory copy) file size on I/O completion. Because the generic code writes the inode out before waiting for I/O to complete, the old file size gets written out instead of the new one. If the write was to extending the file into an existing block there would be no delalloc transaction to redirty the inode (happens on log I/O completion). 
Hence when the I/O completes and the file size gets updated to the in-core disk inode (which is marked dirty), the linux inode remains clean. As a result, a sync will never flush the inode to get the updated file size to disk. What I don't understand is that on unmount dirty xfs inodes get written out. Clearly this is not happening - either there's a hole in the writeback logic (unlikely - it was unchanged) or we've missed some case where we need to update the filesize and mark the inode dirty. Hmmmm - if the write was just a short append to the file, then the block that was written to should already be mapped. Then we'll just look up the extent by doing a BMAPI_READ lookup, set the type to IOMAP_READ and add the block to ioend we are building. The type IOMAP_READ determines the I/O completion behaviour - in this case it is xfs_end_bio_read(), which fails to update the file size.... Bingo. A patch for you to try, Jeremy. I've just started a test run on it... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-11 16:03:59.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-12 23:35:42.691464799 +1000 @@ -973,8 +973,9 @@ xfs_page_state_convert( bh = head = page_buffers(page); offset = page_offset(page); - flags = -1; - type = IOMAP_READ; + iomap_valid = 0; + flags = BMAPI_READ; + type = IOMAP_NEW; /* TODO: cleanup count and page_dirty */ @@ -1004,14 +1005,14 @@ xfs_page_state_convert( * * Third case, an unmapped buffer was found, and we are * in a path where we need to write the whole page out. 
- */ + */ if (buffer_unwritten(bh) || buffer_delay(bh) || ((buffer_uptodate(bh) || PageUptodate(page)) && !buffer_mapped(bh) && (unmapped || startio))) { - /* + /* * Make sure we don't use a read-only iomap */ - if (flags == BMAPI_READ) + if (flags == BMAPI_READ) iomap_valid = 0; if (buffer_unwritten(bh)) { @@ -1060,7 +1061,7 @@ xfs_page_state_convert( * That means it must already have extents allocated * underneath it. Map the extent by reading it. */ - if (!iomap_valid || type != IOMAP_READ) { + if (!iomap_valid || flags != BMAPI_READ) { flags = BMAPI_READ; size = xfs_probe_cluster(inode, page, bh, head, 1); @@ -1071,7 +1072,15 @@ xfs_page_state_convert( iomap_valid = xfs_iomap_valid(&iomap, offset); } - type = IOMAP_READ; + /* + * We set the type to IOMAP_NEW in case we are doing a + * small write at EOF that is extending the file but + * without needing an allocation. We need to update the + * file size on I/O completion in this case so it is + * the same case as having just allocated a new extent + * that we are writing into for the first time. 
+ */ + type = IOMAP_NEW; if (!test_and_set_bit(BH_Lock, &bh->b_state)) { ASSERT(buffer_mapped(bh)); if (iomap_valid) From owner-xfs@oss.sgi.com Sat May 12 07:56:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 07:56:27 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CEuJfB029390 for ; Sat, 12 May 2007 07:56:20 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 4A9C62C8042; Sat, 12 May 2007 07:55:30 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 252522C803B; Sat, 12 May 2007 07:55:30 -0700 (PDT) Received: from [192.168.28.126] (outer-dhcp-126.goop.org [192.168.28.126]) by lurch.goop.org (Postfix) with ESMTP; Sat, 12 May 2007 07:55:30 -0700 (PDT) Message-ID: <4645D594.4070801@goop.org> Date: Sat, 12 May 2007 07:56:20 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Jan Engelhardt , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com> In-Reply-To: <20070512135143.GG85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11407 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > What I don't understand is that on unmount dirty xfs inodes get > written out. 
Clearly this is not happening - either there's a hole > in the writeback logic (unlikely - it was unchanged) or we've missed > some case where we need to update the filesize and mark the inode > dirty. > > Hmmmm - if the write was just a short append to the file, then the > block that was written to should already be mapped. Then we'll just > look up the extent by doing a BMAPI_READ lookup, set the type to > IOMAP_READ and add the block to ioend we are building. > Well, that result I mailed you showed that the difference was just over 16k, and that there was a 32 block difference in the final extent length. Does that fit with this theory? > The type IOMAP_READ determines the I/O completion behaviour - in this case > it is xfs_end_bio_read(), which fails to update the file size.... > > Bingo. > > A patch for you to try, Jeremy. I've just started a test run on it... > Thanks, I'll give it a spin. Have you reproduced the bug yourself? J From owner-xfs@oss.sgi.com Sat May 12 10:49:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 10:49:24 -0700 (PDT) Received: from mx1.suse.de (cantor.suse.de [195.135.220.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4CHn2fB010718 for ; Sat, 12 May 2007 10:49:04 -0700 Received: from Relay1.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.suse.de (Postfix) with ESMTP id 9DA3D122E4; Sat, 12 May 2007 19:49:01 +0200 (CEST) To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams References: <20070511003606.GB85884050@sgi.com> From: Andi Kleen Date: 12 May 2007 20:46:19 +0200 In-Reply-To: <20070511003606.GB85884050@sgi.com> Message-ID: Lines: 21 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 11408 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs David Chinner writes: > > The following patch survives XFSQA with timeouts set to minimum, > default, 500s and maximum. The patch has not had a great > deal of low memory testing, and the object cache may need a shrinker > interface to work in low memory conditions. > > Comments? It seems to be an optimization for a relatively small number of streams. When you do a large number on average you should get similar readahead benefits from round robing the streams over some AGs vs keeping it in a single AG, right? The fallback to AG 0 if nstreams>AGs seems pretty lousy. Wouldn't it be better to do the normal XFS allocation algorithm then? I think right now it will go into low space mode in this case, which might give worse results. Also centisecs is a really ugly unit whose use should be probably not propagated. -Andi From owner-xfs@oss.sgi.com Sat May 12 20:02:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 20:02:21 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4D32GfB025910 for ; Sat, 12 May 2007 20:02:17 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 7BEE81806FDAD; Sat, 12 May 2007 22:02:14 -0500 (CDT) Message-ID: <46467FB5.8080301@sandeen.net> Date: Sat, 12 May 2007 22:02:13 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Evan Fraser CC: xfs@oss.sgi.com Subject: Re: updatedb triggers XFS internal error References: <46449B2C.30208@dneg.com> In-Reply-To: <46449B2C.30208@dneg.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11409 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: 
xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Evan Fraser wrote: > Hello, > I'm having a problem with one of my linux servers. whenever updatedb is > run, the following errors occur in the system log. > > 0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9 > Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of > file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f This indicates bad metadata magic read from disk. You'll probably want to run xfs_repair; you can run it with -n to see what it *would* do first, to get an idea of how drastic the repair might be. Repairing 1.4T won't probably be lots of fun, but you've got corruption in there somewhere... -Eric From owner-xfs@oss.sgi.com Sat May 12 20:08:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 12 May 2007 20:09:07 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4D38vfB027310 for ; Sat, 12 May 2007 20:08:58 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 5DE311806FDAD; Sat, 12 May 2007 22:08:57 -0500 (CDT) Message-ID: <46468148.7000708@sandeen.net> Date: Sat, 12 May 2007 22:08:56 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Andi Kleen CC: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams References: <20070511003606.GB85884050@sgi.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11410 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Andi Kleen wrote: > Also centisecs is a really ugly unit whose use should be 
probably not propagated.
>
> -Andi

Hmm, at one point I thought the preferred unit for this sort of tuneable
*was* centisecs. What's the unit du jour?

[root@neon ~]# sysctl -a |grep cent
vm.dirty_expire_centisecs = 2999
vm.dirty_writeback_centisecs = 499
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000

I think xfs was following the vm lead at one point.

-Eric

From owner-xfs@oss.sgi.com Sun May 13 14:19:38 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 13 May 2007 14:19:40 -0700 (PDT)
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4DLJafB015043 for ; Sun, 13 May 2007 14:19:37 -0700
Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HnLAX-0003eh-NS; Sun, 13 May 2007 21:59:53 +0100
Date: Sun, 13 May 2007 21:59:53 +0100
From: Christoph Hellwig
To: David Chinner
Cc: xfs-dev , xfs-oss
Subject: Re: Review: Concurrent Multi-File Data Streams
Message-ID: <20070513205953.GA14030@infradead.org>
References: <20070511003606.GB85884050@sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=unknown-8bit
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20070511003606.GB85884050@sgi.com>
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html
X-archive-position: 11411
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

I already had some comments on this when discussing it with Sam in person,
but it seems like they didn't make it to you.

First, the mru cache, while being quite nice code, is heavily overengineered
for this case. Unless there are many hundreds of filestreams per filesystem
it will be a lot faster to just have a simple wrap-around array of linked lists.
We don't want to feed the argument that xfs has lots of useless bloated code,
do we? :)

All the pip != NULL checks are superfluous in Linux. A regular file can
never have a NULL parent inode, and a directory can only have a NULL
parent inode in very odd corner cases involving NFS exports, but it has to be
connected again once you start doing namespace-modifying operations on it.

There is some naming confusion: xfs_mount.h forward-declares struct
xfs_filestream but everything else uses struct fstrm_mnt_data. The former
is very non-descriptive and the latter a bit ugly. I'd suggest just putting
the mru-cache replacement directly in there as xfs_filestream_cache instead
of the wrapping.

The xfs_zeroino changes look good but should be a separate commit.

Some comments on the actual code in xfs_filestream.c:

> +#ifdef DEBUG_FILESTREAMS
> +#define dprint(fmt, args...)	do {	\
> +	printk(KERN_DEBUG "%4d %s: " fmt "\n",	\
> +	       current_pid(), __FUNCTION__, ##args);	\
> +} while(0)
> +#else
> +#define dprint(args...)	do {} while (0)
> +#endif

This should probably be killed entirely.

> +#define GET_AG_REF(mp, ag)	atomic_read(&(mp)->m_perag[ag].pagf_fstrms)
> +#define INC_AG_REF(mp, ag)	atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms)
> +#define DEC_AG_REF(mp, ag)	atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms)

These should be inlines with more descriptive lower case names.

> +#define XFS_PICK_USERDATA 1
> +#define XFS_PICK_LOWSPACE 2

These should be an enum.

> +
> +/*
> + * Scan the AGs starting at startag looking for an AG that isn't in use and has
> + * at least minlen blocks free.
> + */
> +static int
> +_xfs_filestream_pick_ag(
> +	xfs_mount_t	*mp,
> +	xfs_agnumber_t	startag,
> +	xfs_agnumber_t	*agp,
> +	int		flags,
> +	xfs_extlen_t	minlen)
> +{
> +	int		err, trylock, nscan;
> +	xfs_extlen_t	delta, longest, need, free, minfree, maxfree = 0;
> +	xfs_agnumber_t	ag, max_ag = NULLAGNUMBER;
> +	struct xfs_perag *pag;
> +
> +	/* 2% of an AG's blocks must be free for it to be chosen.
*/
> + minfree = mp->m_sb.sb_agblocks / 50;
> +
> + ag = startag;
> + *agp = NULLAGNUMBER;
> +
> + /* For the first pass, don't sleep trying to init the per-AG. */
> + trylock = XFS_ALLOC_FLAG_TRYLOCK;
> +
> + for (nscan = 0; 1; nscan++) {
> +
> + //dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag));

Please don't leave commented-out debug code in.

> + pag = mp->m_perag + ag;
> +
> + if (!pag->pagf_init &&
> + (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) &&
> + !trylock) {
> + dprint("xfs_alloc_pagf_init returned %d", err);
> + return err;
> + }

	if (!pag->pagf_init) {
		err = xfs_alloc_pagf_init(mp, NULL, ag, trylock);
		if (err && !trylock)
			return err;
	}

> +static int
> +_xfs_filestream_set_ag(
> + xfs_inode_t *ip,
> + xfs_inode_t *pip,
> + xfs_agnumber_t ag)
> +{
> + int err = 0;
> + xfs_mount_t *mp;
> + xfs_mru_cache_t *cache;
> + fstrm_item_t *item;
> + xfs_agnumber_t old_ag;
> + xfs_inode_t *old_pip;
> +
> + /*
> + * Either ip is a regular file and pip is a directory, or ip is a
> + * directory and pip is NULL.
> + */

We have parent information for parents as well, so this should probably be made more regular.

> + ASSERT(ip && (((ip->i_d.di_mode & S_IFREG) && pip &&
> + (pip->i_d.di_mode & S_IFDIR)) ||
> + ((ip->i_d.di_mode & S_IFDIR) && !pip)));
> + mp = ip->i_mount;
> + cache = mp->m_filestream->fstrm_items;
> +
> + if ((item = (fstrm_item_t*)xfs_mru_cache_lookup(cache, ip->i_ino))) {

Assignment and conditional on separate lines please (also elsewhere in the code), and no needless casts from void * either (also in various places).

> +void
> +xfs_filestream_init(void)
> +{
> + item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item");
> + ASSERT(item_zone);

Please check for errors instead and propagate them.

> +/*
> + * xfs_filestream_uninit() is called at xfs termination time to destroy the
> + * memory zone that was used for filestream data structure allocation.
> + */
> +void
> +xfs_filestream_uninit(void)
> +{
> + if (item_zone) {
> + kmem_zone_destroy(item_zone);
> + item_zone = NULL;
> + }
> +}

No need for the NULL check or for setting it to NULL.

> + if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP)))

Please use KM_MAYFAIL for all new code outside of transactions.

> + ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR));
> + if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR)))
> + return NULLAGNUMBER;

Either the assert or the if clause checking for it, please.

Now comes the worst part: the new allocator function. If we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc we see that it's a pretty bad cut & paste job:

--- btalloc 2007-05-12 12:43:03.000000000 +0200
+++ fsalloc 2007-05-12 12:42:28.000000000 +0200
@@ -1,44 +1,54 @@

> + rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata;

xfs_bmap_alloc() never calls xfs_bmap_filestreams if this is true, so all code guarded by if (rt) is dead.

> - if (unlikely(align)) {
> + if (align) {

Align should have the same likelihood for both.

> - if (nullfb)
> - ap->rval = XFS_INO_TO_FSB(mp, ap->ip->i_ino);
> - else
> + if (nullfb) {
> + ag = xfs_filestream_get_ag(ap->ip);
> + ag = (ag != NULLAGNUMBER) ? ag : 0;
> + ap->rval = (ap->userdata) ? XFS_AGB_TO_FSB(mp, ag, 0) :
> + XFS_INO_TO_FSB(mp, ap->ip->i_ino);
> + } else {
> ap->rval = ap->firstblock;
> + }

Some real changes :) But this could be just a third if case for the filestream case.

> - args.firstblock = ap->firstblock;

Backout of parts of rev1.349

 blen = 0;
 if (nullfb) {
- args.type = XFS_ALLOCTYPE_START_BNO;
+ /* _vextent doesn't pick an AG */
+ args.type = XFS_ALLOCTYPE_NEAR_BNO;
 /*
> @@ -117,18 +167,19 @@
> */
> else
> args.minlen = ap->alen;
> + ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0);
> } else if (ap->low) {
> - args.type = XFS_ALLOCTYPE_START_BNO;
> + args.type = XFS_ALLOCTYPE_FIRST_AG;
> args.total = args.minlen = ap->minlen;

Why is this different?
}
> - if (unlikely(ap->userdata && ap->ip->i_d.di_extsize &&
> - (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) {
> + if (ap->userdata && ap->ip->i_d.di_extsize &&
> + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) {
 args.prod = ap->ip->i_d.di_extsize;
> - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod)))
> + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod))))

Gratuitous difference.

 * is >= the stripe unit and the allocation offset is
 * at the end of file.
 */
> + atype = args.type;

I don't quite understand why we'd need this in one, but not the other.

 if (!ap->low && ap->aeof) {
 if (!ap->off) {
 args.alignment = mp->m_dalign;
> - * First try an exact bno allocation.
> + * First try an exact bno allocation.
> * If it fails then do a near or start bno
> * allocation with alignment turned on.
> - */
> + */

Backout of whitespace adjustments.

> - XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip,
> - ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> + if (XFS_IS_QUOTA_ON(mp) &&
> + ap->ip->i_ino != mp->m_sb.sb_uquotino &&
> + ap->ip->i_ino != mp->m_sb.sb_gquotino) {
> + XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip,
> + ap->wasdel ?
> + XFS_TRANS_DQ_DELBCOUNT :
> XFS_TRANS_DQ_BCOUNT,
> - (long) args.len);
> + (long)args.len);
> + }

Gratuitous differences, but okay because there won't be file streams for quota inodes.

Based on that, my conclusion is that xfs_bmap_filestreams and xfs_bmap_btalloc should be merged to avoid further maintenance overhead.
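
As a rough illustration of three of the style points raised in this review (inline helpers instead of the GET_AG_REF-style macros, an enum instead of the XFS_PICK_* defines, and keeping the assignment out of the conditional), here is a minimal userspace sketch. All names in it (perag_sketch, fstrm_ag_ref_get/inc/dec, fstrm_pick_flags, fstrm_ag_busy) are hypothetical, and C11 atomics stand in for the kernel's atomic_t; it is not the code from the patch under review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-AG data; not the real struct xfs_perag. */
struct perag_sketch {
	atomic_int	pagf_fstrms;	/* filestreams currently using this AG */
};

/* An enum instead of "#define XFS_PICK_USERDATA 1" and friends. */
enum fstrm_pick_flags {
	FSTRM_PICK_USERDATA	= 1,
	FSTRM_PICK_LOWSPACE	= 2,
};

/* Lower-case inline helpers instead of GET_AG_REF/INC_AG_REF/DEC_AG_REF. */
static inline int fstrm_ag_ref_get(struct perag_sketch *pag)
{
	return atomic_load(&pag->pagf_fstrms);
}

static inline int fstrm_ag_ref_inc(struct perag_sketch *pag)
{
	/* atomic_fetch_add() returns the old value; add one for the new count */
	return atomic_fetch_add(&pag->pagf_fstrms, 1) + 1;
}

static inline int fstrm_ag_ref_dec(struct perag_sketch *pag)
{
	int	count;

	count = atomic_fetch_sub(&pag->pagf_fstrms, 1) - 1;
	assert(count >= 0);	/* a negative count means an unbalanced put */
	return count;
}

/* Assignment and test on separate lines rather than "if ((refs = ...))". */
static inline bool fstrm_ag_busy(struct perag_sketch *pag)
{
	int	refs;

	refs = fstrm_ag_ref_get(pag);
	if (refs > 0)
		return true;
	return false;
}
```

The helpers take the per-AG structure directly rather than (mp, ag), which is one possible reading of "more descriptive"; the macro argument order could equally be kept.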

From owner-xfs@oss.sgi.com Sun May 13 22:32:04 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 13 May 2007 22:32:08 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4E5W1fB012122 for ; Sun, 13 May 2007 22:32:03 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA23946; Mon, 14 May 2007 15:31:44 +1000
Date: Mon, 14 May 2007 15:35:31 +1000
From: Timothy Shimmin
To: Eric Sandeen , Andi Kleen
cc: David Chinner , xfs-dev , xfs-oss
Subject: Re: Review: Concurrent Multi-File Data Streams - centisecs
Message-ID:
In-Reply-To: <46468148.7000708@sandeen.net>
References: <20070511003606.GB85884050@sgi.com> <46468148.7000708@sandeen.net>
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11412
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

Yeah, I thought we were told off in the past for not using centisecs and so Nathan changed stuff so it was in centisecs.
Looking in logs and bug db....

----------------
xfs_sysctl.c
revision 1.28
date: 2004/05/14 03:13:52; author: nathans; state: Exp; lines: +7 -7
modid: xfs-linux:xfs-kern:171825a
Export/import tunable time intervals as centisecs not jiffies.

Description:
Not sure what we were smoking when we made these interfaces converse with userspace in terms of jiffies, I guess it was just more expedient at the time. Time to clean this up so regular humans know what time intervals they're asking for, and so that the interface works consistently for different HZ values.
The kernel pdflush daemon in 2.6 uses centisecs, so we may as well make our units consistent with that (since that guy plays a big role in flushing our data & it is likely to be tuned along with any XFS-specific parameter changes).

cheers.

On Tue, May 11, 2004 at 03:40:57PM -0700, Andrew Morton wrote:
> bart@samwel.tk wrote:
> >
> > The laptop mode control script incorrectly guesses XFS_HZ=1000.
>
> aargh. XFS is broken. It shouldn't be exposing jiffy-based tunables into
> /proc, or `mount -o remount' or whatever.
>
> It would be much better to rework XFS so that these user-visible tunables
> are in units of milliseconds, centiseconds or whatever.
>
> Is this possible, please?
>
> If so, please make the /proc filename reflect the tunable's units:
>
> /proc/sys/fs/xfs/lm_sync_centisecs
> /proc/sys/fs/xfs/age_buffer_centisecs
> etc.
>
> thanks.
----------------------------

--Tim

--On 12 May 2007 10:08:56 PM -0500 Eric Sandeen wrote:

> Andi Kleen wrote:
>
>> Also centisecs is a really ugly unit whose use should be probably not propagated.
>>
>> -Andi
>
> Hmm at one point I thought the preferred unit for this sort of tuneable *was* centisecs. What's
> the unit du jour?
>
> [root@neon ~]# sysctl -a |grep cent
> vm.dirty_expire_centisecs = 2999
> vm.dirty_writeback_centisecs = 499
> fs.xfs.age_buffer_centisecs = 1500
> fs.xfs.xfsbufd_centisecs = 100
> fs.xfs.xfssyncd_centisecs = 3000
>
> I think xfs was following the vm lead at one point.
>
> -Eric

From owner-xfs@oss.sgi.com Mon May 14 01:17:06 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 01:17:13 -0700 (PDT)
Received: from gab.dneg.com (mail.dneg.com [193.203.82.196]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4E8H4fB026858 for ; Mon, 14 May 2007 01:17:06 -0700
Received: from localhost (localhost.localdomain [127.0.0.1]) by gab.dneg.com (Postfix) with ESMTP id 94AD64D6074; Mon, 14 May 2007 09:17:03 +0100 (BST)
Received: from gab.dneg.com ([127.0.0.1]) by localhost (gab.dneg.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id YlGqSKEZ+p2z; Mon, 14 May 2007 09:17:02 +0100 (BST)
Received: from [172.16.10.24] (spinach.dneg.com [172.16.10.24]) by gab.dneg.com (Postfix) with ESMTP id 0F4A44D5F20; Mon, 14 May 2007 09:17:02 +0100 (BST)
Message-ID: <46481AFD.8090109@dneg.com>
Date: Mon, 14 May 2007 09:17:01 +0100
From: Evan Fraser
User-Agent: Thunderbird 2.0.0.0 (X11/20070326)
MIME-Version: 1.0
To: Eric Sandeen
CC: xfs@oss.sgi.com
Subject: Re: updatedb triggers XFS internal error
References: <46449B2C.30208@dneg.com> <46467FB5.8080301@sandeen.net>
In-Reply-To: <46467FB5.8080301@sandeen.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 11413
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: evan@dneg.com
Precedence: bulk
X-list: xfs

Thanks for your help.

Cheers,
Evan.

Eric Sandeen wrote:
> Evan Fraser wrote:
>> Hello,
>> I'm having a problem with one of my linux servers. whenever updatedb
>> is run, the following errors occur in the system log.
>>
>> 0x0: c9 00 5a f1 3a 7f 66 be a3 c1 d4 7f e8 1d 6b c9
>> Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2271 of
>> file fs/xfs/xfs_da_btree.c. Caller 0xffffffff8817957f
>
> This indicates bad metadata magic read from disk.
You'll probably
> want to run xfs_repair; you can run it with -n to see what it *would*
> do first, to get an idea of how drastic the repair might be.
> Repairing 1.4T won't probably be lots of fun, but you've got
> corruption in there somewhere...
>
> -Eric
>

--
evan@dneg.com
Linux Systems Administrator
Double Negative
tel: +44 (0)20 7534 4400
fax: +44 (0)20 7534 4452
77 shaftesbury avenue, w1d 5du, London

From owner-xfs@oss.sgi.com Mon May 14 06:29:41 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 06:29:46 -0700 (PDT)
Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EDTdfB021279 for ; Mon, 14 May 2007 06:29:41 -0700
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EDTZQv003673 for ; Mon, 14 May 2007 09:29:35 -0400
Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EDTZNT268006 for ; Mon, 14 May 2007 07:29:35 -0600
Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EDTYRa015478 for ; Mon, 14 May 2007 07:29:35 -0600
Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EDTX0M014657; Mon, 14 May 2007 07:29:34 -0600
Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 0A54F29EBD3; Mon, 14 May 2007 18:59:28 +0530 (IST)
Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EDTRVE003530; Mon, 14 May 2007 18:59:27 +0530
Date: Mon, 14 May 2007 18:59:26 +0530
From: "Amit K.
Arora"
To: torvalds@osdl.org, akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: [PATCH 0/5][TAKE2] fallocate system call
Message-ID: <20070514132926.GA30768@amitarora.in.ibm.com>
References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11414
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This is the new set of patches which take care of the review comments received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose.
Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
 FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime/mtime in the inode of the corresponding file, marking a successful allocation.
 FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime.

* New modes might get added in future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time. Since the semantics of this new mode are not clear and agreed upon yet, this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should start.

len: This is the number of bytes requested for preallocation (from offset).

sys_fallocate() on s390:
-----------------------
There is a problem with s390 ABI to implement sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel.
This will require special handling of this system call on s390 in glibc as well. But, this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents is a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See:

Since there is talk of ->fault() replacing ->page_mkwrite(), and a generic block_page_mkwrite() implementation has already been posted, we can implement this some time later. See:

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3> Changes to glibc,
 a) to support fallocate() system call
 b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
---------
Each post will have an individual changelog for the particular patch.

The posts with the patches follow:

Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora

From owner-xfs@oss.sgi.com Mon May 14 07:01:00 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:01:07 -0700 (PDT)
Received: from atrey.karlin.mff.cuni.cz (atrey.karlin.mff.cuni.cz [195.113.31.123]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EE0wfB028645 for ; Mon, 14 May 2007 07:01:00 -0700
Received: by atrey.karlin.mff.cuni.cz (Postfix, from userid 4043) id 1130BC7D2C; Mon, 14 May 2007 15:34:46 +0200 (CEST)
Date: Mon, 14 May 2007 15:34:46 +0200
From: Jan Kara
To: Andrew Morton
Cc: Andreas Dilger , "Amit K.
Arora" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: Re: [PATCH 4/5] ext4: fallocate support in ext4
Message-ID: <20070514133445.GA28875@atrey.karlin.mff.cuni.cz>
References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070426181332.GD7209@amitarora.in.ibm.com> <20070503213133.d1559f52.akpm@linux-foundation.org> <20070507113753.GA5439@schatzie.adilger.int> <20070507135825.f8545a65.akpm@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070507135825.f8545a65.akpm@linux-foundation.org>
User-Agent: Mutt/1.5.9i
X-archive-position: 11415
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jack@suse.cz
Precedence: bulk
X-list: xfs

> On Mon, 7 May 2007 05:37:54 -0600
>
> Does the proposed implementation handle quotas correctly, btw? Has that
> been tested?

It seems to handle quotas fine - the block allocation itself does not differ from the usual case, just the extents in the tree are marked as uninitialized... The only question is whether DQUOT_PREALLOC_BLOCK() shouldn't be called instead of DQUOT_ALLOC_BLOCK(). Then fallocate() won't be able to allocate anything after the softlimit has been reached, which makes some sense, but the current behavior is probably less surprising.
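
The difference between the two quota calls discussed here can be sketched roughly as follows. This is a simplified userspace model, not the kernel's quota code: struct quota_sketch and quota_charge() are hypothetical names and grace-time handling is reduced to a flag. The only behaviour taken from the message is that a preallocation-style charge is refused once the soft limit would be exceeded; that an ordinary allocation may pass the soft limit up to the hard limit is an assumption about usual quota semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified quota state; not the kernel's struct dquot. */
struct quota_sketch {
	uint64_t	used;		/* blocks currently charged */
	uint64_t	softlimit;	/* 0 means "no soft limit" */
	uint64_t	hardlimit;	/* 0 means "no hard limit" */
	bool		grace_expired;	/* soft-limit grace period is over */
};

/*
 * Try to charge nblocks to the quota. With prealloc set, the charge is
 * refused as soon as it would exceed the soft limit; otherwise only the
 * hard limit (or an expired grace period) stops it.
 * Overflow of used + nblocks is ignored in this sketch.
 */
static bool quota_charge(struct quota_sketch *q, uint64_t nblocks,
			 bool prealloc)
{
	uint64_t	want = q->used + nblocks;

	if (q->hardlimit && want > q->hardlimit)
		return false;
	if (q->softlimit && want > q->softlimit) {
		if (prealloc || q->grace_expired)
			return false;
	}
	q->used = want;
	return true;
}
```

With used = 90, softlimit = 100 and hardlimit = 200, a 20-block charge fails in the preallocation style and succeeds in the ordinary style, which is the trade-off raised above for fallocate().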
Honza -- Jan Kara SuSE CR Labs From owner-xfs@oss.sgi.com Mon May 14 07:45:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:45:26 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEjJfB010035 for ; Mon, 14 May 2007 07:45:21 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEjIg8009223 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEjIje554160 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEjHM5029882 for ; Mon, 14 May 2007 10:45:18 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEjF8e029685; Mon, 14 May 2007 10:45:16 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id DC29929EBD3; Mon, 14 May 2007 20:15:24 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEjOuJ007049; Mon, 14 May 2007 20:15:24 +0530 Date: Mon, 14 May 2007 20:15:24 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070514144524.GA31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11416 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: --------- Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 
7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/kernel/functionlist | 1 fs/open.c | 89 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 4 + include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 120 insertions(+), 3 deletions(-) Index: linux-2.6.21/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.21/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_fallocate /* 320 */ Index: linux-2.6.21/arch/x86_64/kernel/functionlist =================================================================== --- linux-2.6.21.orig/arch/x86_64/kernel/functionlist +++ linux-2.6.21/arch/x86_64/kernel/functionlist @@ -931,6 +931,7 @@ *(.text.sys_getitimer) *(.text.sys_getgroups) *(.text.sys_ftruncate) +*(.text.sys_fallocate) *(.text.sysfs_lookup) *(.text.sys_exit_group) *(.text.stub_fork) Index: linux-2.6.21/fs/open.c =================================================================== --- linux-2.6.21.orig/fs/open.c +++ linux-2.6.21/fs/open.c @@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. 
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. 
+ */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + + /* + * Update [cm]time. + * Partial allocation will not result in the time stamp changes, + * since ->fallocate will return error (say, -ENOSPC) in this case. + */ + if (!ret) + file_update_time(file); +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. Index: linux-2.6.21/include/asm-i386/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-i386/unistd.h +++ linux-2.6.21/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages 317 #define __NR_getcpu 318 #define __NR_epoll_pwait 319 +#define __NR_fallocate 320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/systbl.h +++ linux-2.6.21/include/asm-powerpc/systbl.h @@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list) COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) +COMPAT_SYS(fallocate) Index: linux-2.6.21/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/unistd.h +++ linux-2.6.21/include/asm-powerpc/unistd.h @@ -326,10 +326,11 @@ #define __NR_move_pages 301 #define __NR_getcpu 302 #define __NR_epoll_pwait 303 +#define __NR_fallocate 304 #ifdef 
__KERNEL__ -#define __NR_syscalls 304 +#define __NR_syscalls 305 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.21/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-x86_64/unistd.h +++ linux-2.6.21/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_fallocate 280 +__SYSCALL(__NR_fallocate, sys_fallocate) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_fallocate #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/linux/fs.h =================================================================== --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -264,6 +264,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. 
+ */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1125,6 +1136,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; Index: linux-2.6.21/include/linux/syscalls.h =================================================================== --- linux-2.6.21.orig/include/linux/syscalls.h +++ linux-2.6.21/include/linux/syscalls.h @@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.21.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c @@ -777,6 +777,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { From owner-xfs@oss.sgi.com Mon May 14 07:48:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:48:37 -0700 (PDT) Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.143]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEmQfB010838 for ; Mon, 14 May 2007 07:48:28 -0700 Received: from 
d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e3.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EDkcKw025338 for ; Mon, 14 May 2007 09:46:38 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEmQ4f517452 for ; Mon, 14 May 2007 10:48:26 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEmPPX005575 for ; Mon, 14 May 2007 10:48:25 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEmOEJ005493; Mon, 14 May 2007 10:48:24 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 0DF5529EBD3; Mon, 14 May 2007 20:18:34 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEmY3x008439; Mon, 14 May 2007 20:18:34 +0530 Date: Mon, 14 May 2007 20:18:34 +0530 From: "Amit K. 
Arora"
To: torvalds@osdl.org, akpm@linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com
Subject: [PATCH 2/5][TAKE2] fallocate() on s390
Message-ID: <20070514144833.GB31748@amitarora.in.ibm.com>
References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com>
User-Agent: Mutt/1.4.1i
X-archive-position: 11417
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This is the patch suggested by Martin Schwidefsky. Here are the comments and patch from him.

-------------
From: Martin Schwidefsky

This patch implements support for the fallocate system call on the s390(x) platform. A wrapper is added to address the issue that the s390 ABI has with the arguments of this system call.
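
To make the argument issue concrete: on a 32-bit ABI each 64-bit loff_t argument arrives as two 32-bit words, and a wrapper has to join them again before calling sys_fallocate(). The following is only a minimal user-space sketch of that step, assuming the high word is passed first (as the compat code in this series does). All function names here are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit value, as a 32-bit ABI would pass them. */
uint32_t high_half(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/* Lower 32 bits of a 64-bit value. */
uint32_t low_half(uint64_t v)
{
	return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Rebuild the 64-bit value from its two halves (high word first). */
uint64_t join_halves(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/*
 * Hypothetical stand-in for a compat entry point: it receives offset
 * and len as register pairs and returns the end of the byte range, so
 * that the sketch has something observable to check.
 */
uint64_t compat_range_end(uint32_t off_hi, uint32_t off_lo,
			  uint32_t len_hi, uint32_t len_lo)
{
	uint64_t offset = join_halves(off_hi, off_lo);
	uint64_t len = join_halves(len_hi, len_lo);

	/* Sketch-only guard: the range must not wrap past 2^64. */
	assert(offset + len >= offset);
	return offset + len;
}
```

The s390 patch itself joins the halves through a union rather than a shift, which gives the same result only on a big-endian machine; the shift form above does not depend on byte order.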
Signed-off-by: Martin Schwidefsky --- arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ arch/s390/kernel/sys_s390.c | 29 +++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + include/asm-s390/unistd.h | 3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr %r2,%r2 # int + lgfr %r3,%r3 # int + sllg %r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg %r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.21/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.21/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.21/arch/s390/kernel/sys_s390.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.21/arch/s390/kernel/sys_s390.c @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, "d" (__arg3) : "memory"); return __svcres; } + +#ifndef CONFIG_64BIT +/* + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. 
The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.21/include/asm-s390/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/unistd.h +++ linux-2.6.21/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some From owner-xfs@oss.sgi.com Mon May 14 07:50:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:50:07 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEo1fB011383 for ; Mon, 14 May 2007 07:50:02 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEo0N1016384 for ; Mon, 14 May 2007 10:50:00 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEo0mQ256682 for ; Mon, 14 May 2007 08:50:00 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by 
d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEnxeB025855 for ; Mon, 14 May 2007 08:50:00 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEnwDG025737; Mon, 14 May 2007 08:49:59 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 6CC0529EBD3; Mon, 14 May 2007 20:20:08 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEo8Ad009129; Mon, 14 May 2007 20:20:08 +0530 Date: Mon, 14 May 2007 20:20:08 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix Message-ID: <20070514145008.GC31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11418 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. 
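
The trimming rule described above can be sketched in user space as follows. This is a simplified illustration rather than the patch's code: it works on plain 32-bit block numbers, the names trimmed_len, was_trimmed and SKETCH_MAX_BLOCK are hypothetical, and unlike the patch it applies the wrap check even when no later extent exists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for EXT_MAX_BLOCK: the largest 32-bit logical block. */
#define SKETCH_MAX_BLOCK 0xFFFFFFFFu

/*
 * Return the length that a new extent starting at logical block `start`
 * may keep without wrapping through zero and without running into the
 * next allocated extent. `next_start` is the first logical block of the
 * nearest extent after `start`, or SKETCH_MAX_BLOCK if there is none.
 */
uint32_t trimmed_len(uint32_t start, uint32_t len, uint32_t next_start)
{
	assert(len > 0 && next_start > start);

	/* Wrap through zero: start + len overflowed 32 bits. */
	if ((uint32_t)(start + len) < start)
		len = SKETCH_MAX_BLOCK - start;

	/* Overlap: the range would run into the next extent. */
	if (next_start != SKETCH_MAX_BLOCK &&
	    (uint32_t)(start + len) > next_start)
		len = next_start - start;

	return len;
}

/* 1 if the request had to be cut short, 0 otherwise. */
int was_trimmed(uint32_t start, uint32_t len, uint32_t next_start)
{
	return trimmed_len(start, len, next_start) != len;
}
```

For example, a request for 20 blocks at block 10 with the next extent starting at block 15 is cut to 5 blocks, while a request that ends exactly where the next extent begins is left alone.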
Changelog:
---------
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora
---
 fs/ext4/extents.c               | 60 ++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h | 1
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s)
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated =
max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Mon May 14 07:52:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:52:29 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEqNfB012317 for ; Mon, 14 May 2007 07:52:25 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e4.ny.us.ibm.com 
(8.13.8/8.13.8) with ESMTP id l4EEqNxj004164 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEqNe6552736 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEqMmZ024628 for ; Mon, 14 May 2007 10:52:23 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EEqKDR024534; Mon, 14 May 2007 10:52:20 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 2ADCE29EBD3; Mon, 14 May 2007 20:22:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEqUK8010124; Mon, 14 May 2007 20:22:30 +0530 Date: Mon, 14 May 2007 20:22:30 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 4/5][TAKE2] ext4: fallocate support in ext4 Message-ID: <20070514145230.GD31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11419 
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This patch implements the ->fallocate() inode operation in ext4. With this patch, users of ext4 file systems will be able to use the fallocate() system call for persistent preallocation. The current implementation only supports preallocation for regular files with extent maps (directories are not supported as of this date). This patch does not currently support block-mapped files. Only FA_ALLOCATE mode is supported as of now; supporting FA_DEALLOCATE mode is a "To Do" item.

Changelog:
---------
Here are the changes from the previous post:
1) Added more description for ext4_fallocate().
2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
3) Moved journal_start & journal_stop inside the while loop.
4) Replaced BUG_ON with WARN_ON & ext4_error.
5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
6) Added variable names in the function declaration of ext4_fallocate().
7) Converted macros that handle uninitialized extents into inline functions.
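
Item 7 refers to helpers that keep the "uninitialized" state of a preallocated extent in the top bit of the 16-bit ee_len field, leaving 15 bits for the real length. The following user-space sketch illustrates that encoding under simplifying assumptions: it works on host-byte-order values (the patch converts with le16_to_cpu() and cpu_to_le16()), and every name in it is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_FLAG 0x8000u	/* top bit of the 16-bit length field */
#define LEN_MASK    0x7FFFu	/* remaining 15 bits hold the real length */

/* Set the "uninitialized" bit on a raw length value. */
uint16_t mark_uninitialized(uint16_t raw_len)
{
	return (uint16_t)(raw_len | UNINIT_FLAG);
}

/* 1 if the raw length carries the "uninitialized" bit, 0 otherwise. */
int is_uninitialized(uint16_t raw_len)
{
	return (raw_len & UNINIT_FLAG) != 0;
}

/* Length in blocks with the flag bit stripped. */
uint16_t actual_len(uint16_t raw_len)
{
	return (uint16_t)(raw_len & LEN_MASK);
}

/*
 * Combine the lengths of two adjacent extents and carry the flag over.
 * The caller must already have checked that both extents are in the
 * same state and that the sum fits in 15 bits.
 */
uint16_t merged_raw_len(uint16_t a, uint16_t b)
{
	uint16_t sum;

	assert(is_uninitialized(a) == is_uninitialized(b));
	assert((unsigned int)actual_len(a) + actual_len(b) <= LEN_MASK);

	sum = (uint16_t)(actual_len(a) + actual_len(b));
	return is_uninitialized(a) ? mark_uninitialized(sum) : sum;
}
```

Adding the actual lengths and then re-applying the flag, instead of adding the raw fields, is what keeps the flag bit from being corrupted when two uninitialized extents are merged.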
Here is the updated patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 241 +++++++++++++++++++++++++++++++++------- fs/ext4/file.c | 1 include/linux/ext4_fs.h | 8 + include/linux/ext4_fs_extents.h | 12 + 4 files changed, 221 insertions(+), 41 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1107,7 +1107,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* + * Make sure that either both extents are uninitialized, or + * both are _not_. 
+ */ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1192,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1201,14 +1214,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) return err; - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len) - + 
le16_to_cpu(newext->ee_len)); + + /* + * ext4_can_extents_be_merged should have checked that either + * both extents are uninitialized, or both aren't. Thus we + * need to check only one of them here. + */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(newext)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); eh = path[depth].p_hdr; nearex = ex; goto merge; @@ -1264,7 +1287,7 @@ has_space: ext_debug("first extent in the leaf: %d:%llu:%d\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len)); + ext4_ext_get_actual_len(newext)); path[depth].p_ext = EXT_FIRST_EXTENT(eh); } else if (le32_to_cpu(newext->ee_block) > le32_to_cpu(nearex->ee_block)) { @@ -1277,7 +1300,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 2, nearex + 1, len); } @@ -1290,7 +1313,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 1, nearex, len); path[depth].p_ext = nearex; @@ -1309,8 +1332,13 @@ merge: if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) break; /* merge with next extent! 
*/ - nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len) - + le16_to_cpu(nearex[1].ee_len)); + if (ext4_ext_is_uninitialized(nearex)) + uninitialized = 1; + nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) + + ext4_ext_get_actual_len(nearex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(nearex); + if (nearex + 1 < EXT_LAST_EXTENT(eh)) { len = (EXT_LAST_EXTENT(eh) - nearex - 1) * sizeof(struct ext4_extent); @@ -1380,8 +1408,8 @@ int ext4_ext_walk_space(struct inode *in end = le32_to_cpu(ex->ee_block); if (block + num < end) end = block + num; - } else if (block >= - le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) { + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { /* need to allocate space after found extent */ start = block; end = block + num; @@ -1393,7 +1421,8 @@ int ext4_ext_walk_space(struct inode *in * by found extent */ start = block; - end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len); + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); if (block + num < end) end = block + num; exists = 1; @@ -1409,7 +1438,7 @@ int ext4_ext_walk_space(struct inode *in cbex.ec_type = EXT4_EXT_CACHE_GAP; } else { cbex.ec_block = le32_to_cpu(ex->ee_block); - cbex.ec_len = le16_to_cpu(ex->ee_len); + cbex.ec_len = ext4_ext_get_actual_len(ex); cbex.ec_start = ext_pblock(ex); cbex.ec_type = EXT4_EXT_CACHE_EXTENT; } @@ -1482,15 +1511,15 @@ ext4_ext_put_gap_in_cache(struct inode * ext_debug("cache gap(before): %lu [%lu:%lu]", (unsigned long) block, (unsigned long) le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len)); + (unsigned long) ext4_ext_get_actual_len(ex)); } else if (block >= le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len)) { + + ext4_ext_get_actual_len(ex)) { lblock = le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len); + + ext4_ext_get_actual_len(ex); len = ext4_ext_next_allocated_block(path); ext_debug("cache gap(after): [%lu:%lu] %lu", (unsigned long) 
le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len), + (unsigned long) ext4_ext_get_actual_len(ex), (unsigned long) block); BUG_ON(len == lblock); len = len - lblock; @@ -1620,12 +1649,12 @@ static int ext4_remove_blocks(handle_t * unsigned long from, unsigned long to) { struct buffer_head *bh; + unsigned short ee_len = ext4_ext_get_actual_len(ex); int i; #ifdef EXTENTS_STATS { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - unsigned short ee_len = le16_to_cpu(ex->ee_len); spin_lock(&sbi->s_ext_stats_lock); sbi->s_ext_blocks += ee_len; sbi->s_ext_extents++; @@ -1639,12 +1668,12 @@ static int ext4_remove_blocks(handle_t * } #endif if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { /* tail removal */ unsigned long num; ext4_fsblk_t start; - num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from; - start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num; + num = le32_to_cpu(ex->ee_block) + ee_len - from; + start = ext_pblock(ex) + ee_len - num; ext_debug("free last %lu blocks starting %llu\n", num, start); for (i = 0; i < num; i++) { bh = sb_find_get_block(inode->i_sb, start + i); @@ -1652,12 +1681,12 @@ static int ext4_remove_blocks(handle_t * } ext4_free_blocks(handle, inode, start, num); } else if (from == le32_to_cpu(ex->ee_block) - && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } else { printk("strange request: removal(2) %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } return 0; } @@ -1672,6 +1701,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc unsigned a, b, block, num; unsigned long ex_ee_block; 
unsigned short ex_ee_len; + unsigned uninitialized = 0; struct ext4_extent *ex; ext_debug("truncate since %lu in leaf\n", start); @@ -1686,7 +1716,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex = EXT_LAST_EXTENT(eh); ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex_ee_len = ext4_ext_get_actual_len(ex); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -1754,6 +1786,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); if (err) @@ -1763,7 +1797,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc ext_pblock(ex)); ex--; ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + ex_ee_len = ext4_ext_get_actual_len(ex); } if (correct_index && eh->eh_entries) @@ -2039,7 +2073,7 @@ int ext4_ext_get_blocks(handle_t *handle if (ex) { unsigned long ee_block = le32_to_cpu(ex->ee_block); ext4_fsblk_t ee_start = ext_pblock(ex); - unsigned short ee_len = le16_to_cpu(ex->ee_len); + unsigned short ee_len; /* * Allow future support for preallocated extents to be added @@ -2047,8 +2081,9 @@ int ext4_ext_get_blocks(handle_t *handle * Uninitialized extents are treated as holes, except that * we avoid (fail) allocating new blocks during a write. 
*/ - if (ee_len > EXT_MAX_LEN) + if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) goto out2; + ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { newblock = iblock - ee_block + ee_start; @@ -2056,8 +2091,11 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); - ext4_ext_put_in_cache(inode, ee_block, ee_len, - ee_start, EXT4_EXT_CACHE_EXTENT); + /* Do not put uninitialized extent in the cache */ + if (!ext4_ext_is_uninitialized(ex)) + ext4_ext_put_in_cache(inode, ee_block, + ee_len, ee_start, + EXT4_EXT_CACHE_EXTENT); goto out; } } @@ -2099,6 +2137,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(&newex); err = ext4_ext_insert_extent(handle, inode, path, &newex); if (err) goto out2; @@ -2110,8 +2150,10 @@ int ext4_ext_get_blocks(handle_t *handle newblock = ext_pblock(&newex); __set_bit(BH_New, &bh_result->b_state); - ext4_ext_put_in_cache(inode, iblock, allocated, newblock, - EXT4_EXT_CACHE_EXTENT); + /* Cache only when it is _not_ an uninitialized extent */ + if (create!=EXT4_CREATE_UNINITIALIZED_EXT) + ext4_ext_put_in_cache(inode, iblock, allocated, newblock, + EXT4_EXT_CACHE_EXTENT); out: if (allocated > max_blocks) allocated = max_blocks; @@ -2215,10 +2257,127 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +/* + * preallocate space for a file. This implements ext4's fallocate inode + * operation, which gets called from sys_fallocate system call. + * Currently only FA_ALLOCATE mode is supported on extent based files. 
+ * We may have more modes supported in future - like FA_DEALLOCATE, which + * tells fallocate to unallocate previously (pre)allocated blocks. + * For block-mapped files, posix_fallocate should fall back to the method + * of writing zeroes to the required new blocks (the same behavior which is + * expected for file systems which do not support fallocate() system call). + */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + ext4_fsblk_t nblocks = 0; + int ret = 0; + int ret2 = 0; + int retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode->i_blkbits; + + /* + * currently supporting (pre)allocate mode for extent-based + * files _only_ + */ + if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return -EOPNOTSUPP; + + /* preallocation to directories is currently not supported */ + if (S_ISDIR(inode->i_mode)) + return -ENODEV; + + block = offset >> blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) + - block; + + /* + * credits to insert 1 extent into extent tree + buffers to be able to + * modify 1 super block, 1 block bitmap and 1 group descriptor. + */ + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3; +retry: + while (ret >= 0 && ret < max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + handle = ext4_journal_start(inode, credits); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + break; + } + + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, &map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + WARN_ON(!ret); + if (!ret) { + ext4_error(inode->i_sb, "ext4_fallocate", + "ext4_ext_get_blocks returned 0! 
inode#%lu" + ", block=%llu, max_blocks=%llu", + inode->i_ino, block, max_blocks); + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (ret > 0) { + /* check wrap through sign-bit/zero here */ + if ((block + ret) < 0 || (block + ret) < block) { + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (buffer_new(&map_bh) && ((block + ret) > + (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits) + >> blkbits))) + nblocks = nblocks + ret; + } + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + if (ret2) + break; + } + + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry; + + /* + * Time to update the file size. + * Update only when preallocation was requested beyond the file size. + */ + if ((offset + len) > i_size_read(inode)) { + if (ret > 0) { + /* + * if no error, we assume preallocation succeeded + * completely + */ + mutex_lock(&inode->i_mutex); + i_size_write(inode, offset + len); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } else if (ret < 0 && nblocks) { + /* Handle partial allocation scenario */ + loff_t newsize; + + mutex_lock(&inode->i_mutex); + newsize = (nblocks << blkbits) + i_size_read(inode); + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } + } + + return ret > 0 ? 
ret2 : ret; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); EXPORT_SYMBOL(ext4_ext_walk_space); EXPORT_SYMBOL(ext4_ext_find_goal); EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); +EXPORT_SYMBOL(ext4_fallocate); Index: linux-2.6.21/fs/ext4/file.c =================================================================== --- linux-2.6.21.orig/fs/ext4/file.c +++ linux-2.6.21/fs/ext4/file.c @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ .removexattr = generic_removexattr, #endif .permission = ext4_permission, + .fallocate = ext4_fallocate, }; Index: linux-2.6.21/include/linux/ext4_fs.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs.h +++ linux-2.6.21/include/linux/ext4_fs.h @@ -102,6 +102,7 @@ EXT4_GOOD_OLD_FIRST_INO : \ (s)->s_first_ino) #endif +#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size),(1 << (blkbits))) /* * Macro-instructions used to manage fragments @@ -225,6 +226,11 @@ struct ext4_new_group_data { __u32 free_blocks_count; }; +/* + * Following is used by preallocation code to tell get_blocks() that we + * want uninitialzed extents. 
+ */ +#define EXT4_CREATE_UNINITIALIZED_EXT 2 /* * ioctl commands @@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t extern void ext4_ext_truncate(struct inode *, struct page *); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); +extern int ext4_fallocate(struct inode *inode, int mode, loff_t offset, + loff_t len); static inline int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode * EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO; } +static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) { + ext->ee_len |= cpu_to_le16(0x8000); +} + +static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x8000); +} + +static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF); +} + extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Mon May 14 07:54:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 07:54:08 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EEs2fB013001 for ; Mon, 14 May 2007 07:54:04 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EEs1i4020276 for ; Mon, 14 May 2007 10:54:01 
-0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EEs1Fk200374 for ; Mon, 14 May 2007 08:54:01 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EEs08G028241 for ; Mon, 14 May 2007 08:54:01 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EErx5F028145; Mon, 14 May 2007 08:53:59 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 1264029EBD3; Mon, 14 May 2007 20:24:09 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EEs97l010839; Mon, 14 May 2007 20:24:09 +0530 Date: Mon, 14 May 2007 20:24:09 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks Message-ID: <20070514145408.GE31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514142820.GA31468@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11420 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: --------- 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Here is the updated patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 234 +++++++++++++++++++++++++++++++++++----- include/linux/ext4_fs_extents.h | 3 2 files changed, 210 insertions(+), 27 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! 
*/ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1328,25 +1376,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). 
+ * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Someone is writing in the middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0; + int ret = 0; + + depth = ext_depth(inode); + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex->ee_block); + ee_len = ext4_ext_get_actual_len(ex); + allocated = ee_len - (iblock - ee_block); + newblock = iblock - ee_block + ext_pblock(ex); + ex2 = ex; + + /* ex1: ee_block to iblock - 1 : uninitialized */ + if (iblock > ee_block) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* for sanity, update the length of the ex2 extent before + * we insert ex3, if ex1 is NULL. This is to avoid temporary + * overlap of blocks. + */ + if (!ex1 && allocated > max_blocks) + ex2->ee_len = cpu_to_le16(max_blocks); + /* ex3: to ee_block + ee_len : uninitialised */ + if (allocated > max_blocks) { + unsigned int newdepth; + ex3 = &newex; + ex3->ee_block = cpu_to_le32(iblock + max_blocks); + ext4_ext_store_pblock(ex3, newblock + max_blocks); + ex3->ee_len = cpu_to_le16(allocated - max_blocks); + ext4_ext_mark_uninitialized(ex3); + err = ext4_ext_insert_extent(handle, inode, path, ex3); + if (err) + goto out; + /* The depth, and hence eh & ex might change + * as part of the insert above. 
+ */ + newdepth = ext_depth(inode); + if (newdepth != depth) { + depth = newdepth; + path = ext4_ext_find_extent(inode, iblock, NULL); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + goto out; + } + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + if (ex2 != &newex) + ex2 = ex; + } + allocated = max_blocks; + } + /* If there was a change of depth as part of the + * insertion of ex3 above, we need to update the length + * of the ex1 extent again here + */ + if (ex1 && ex1 != ex) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* ex2: iblock to iblock + maxblocks-1 : initialised */ + ex2->ee_block = cpu_to_le32(iblock); + ex2->ee_start = cpu_to_le32(newblock); + ext4_ext_store_pblock(ex2, newblock); + ex2->ee_len = cpu_to_le16(allocated); + if (ex2 != ex) + goto insert; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* New (initialized) extent starts from the first block + * in the current extent. i.e., ex2 == ex + * We have to see if it can be merged with the extent + * on the left. + */ + if (ex2 > EXT_FIRST_EXTENT(eh)) { + /* To merge left, pass "ex2 - 1" to try_to_merge(), + * since it merges towards right _only_. + */ + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + depth = ext_depth(inode); + ex2--; + } + } + /* Try to Merge towards right. This might be required + * only when the whole extent is being written to. + * i.e. ex2 == ex and ex3 == NULL. + */ + if (!ex3) { + ret = ext4_ext_try_to_merge(inode, path, ex2); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + } + } + /* Mark modified extent as dirty */ + err = ext4_ext_dirty(handle, inode, path + depth); + goto out; +insert: + err = ext4_ext_insert_extent(handle, inode, path, &newex); +out: + return err ? 
err : allocated; +} + int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, int create, int extend_disksize) { struct ext4_ext_path *path = NULL; + struct ext4_extent_header *eh; struct ext4_extent newex, *ex; ext4_fsblk_t goal, newblock; - int err = 0, depth; + int err = 0, depth, ret; unsigned long allocated = 0; __clear_bit(BH_New, &bh_result->b_state); @@ -2068,6 +2235,7 @@ int ext4_ext_get_blocks(handle_t *handle * this is why assert can't be put in ext4_ext_find_extent() */ BUG_ON(path[depth].p_ext == NULL && depth != 0); + eh = path[depth].p_hdr; ex = path[depth].p_ext; if (ex) { @@ -2076,13 +2244,9 @@ int ext4_ext_get_blocks(handle_t *handle unsigned short ee_len; /* - * Allow future support for preallocated extents to be added - * as an RO_COMPAT feature: * Uninitialized extents are treated as holes, except that - * we avoid (fail) allocating new blocks during a write. + * we split out initialized portions during a write. 
*/ - if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) - goto out2; ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { @@ -2091,12 +2255,27 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); + /* Do not put uninitialized extent in the cache */ - if (!ext4_ext_is_uninitialized(ex)) + if (!ext4_ext_is_uninitialized(ex)) { ext4_ext_put_in_cache(inode, ee_block, ee_len, ee_start, EXT4_EXT_CACHE_EXTENT); - goto out; + goto out; + } + if (create == EXT4_CREATE_UNINITIALIZED_EXT) + goto out; + if (!create) + goto out2; + + ret = ext4_ext_convert_to_initialized(handle, inode, + path, iblock, + max_blocks); + if (ret <= 0) + goto out2; + else + allocated = ret; + goto outnew; } } @@ -2148,6 +2327,7 @@ int ext4_ext_get_blocks(handle_t *handle /* previous routine could use block we allocated */ newblock = ext_pblock(&newex); +outnew: __set_bit(BH_New, &bh_result->b_state); /* Cache only when it is _not_ an uninitialized extent */ Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, 
void *); From owner-xfs@oss.sgi.com Mon May 14 08:33:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 08:33:32 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EFXSfB023882 for ; Mon, 14 May 2007 08:33:29 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4EFXSGR025359 for ; Mon, 14 May 2007 11:33:28 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4EFXR0K219552 for ; Mon, 14 May 2007 09:33:27 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4EFXQqb012695 for ; Mon, 14 May 2007 09:33:27 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4EFXPoT012613; Mon, 14 May 2007 09:33:26 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 9342629EBD3; Mon, 14 May 2007 21:03:30 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4EFXURX028658; Mon, 14 May 2007 21:03:30 +0530 Date: Mon, 14 May 2007 21:03:30 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper Message-ID: <20070514153330.GA25249@amitarora.in.ibm.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144833.GB31748@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514144833.GB31748@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11421 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Mon, May 14, 2007 at 08:18:34PM +0530, Amit K. Arora wrote: > This is the patch suggested by Martin Schwidefsky. Here are the comments > and patch from him. Martin also suggested a wrapper in glibc to handle this system call on s390. Posting it here so that we get feedback for this too. Here it is: .globl __fallocate ENTRY(__fallocate) stm %r6,%r7,28(%r15) /* save %r6/%r7 on stack */ cfi_offset (%r7, -68) cfi_offset (%r6, -72) lm %r6,%r7,96(%r15) /* load loff_t len from stack */ svc SYS_ify(fallocate) lm %r6,%r7,28(%r15) /* restore %r6/%r7 from stack */ br %r14 PSEUDO_END(__fallocate) -- Regards, Amit Arora > ------------- > From: Martin Schwidefsky > > This patch implements support of fallocate system call on s390(x) > platform. 
A wrapper is added to address the issue which s390 ABI has > with the arguments of this system call. > > Signed-off-by: Martin Schwidefsky > --- > > arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ > arch/s390/kernel/sys_s390.c | 29 +++++++++++++++++++++++++++++ > arch/s390/kernel/syscalls.S | 1 + > include/asm-s390/unistd.h | 3 ++- > 4 files changed, 42 insertions(+), 1 deletion(-) > > Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S > =================================================================== > --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S > +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S > @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: > llgtr %r2,%r2 # char * > llgtr %r3,%r3 # struct compat_timeval * > jg compat_sys_utimes > + > + .globl sys_fallocate_wrapper > +sys_fallocate_wrapper: > + lgfr %r2,%r2 # int > + lgfr %r3,%r3 # int > + sllg %r4,%r4,32 # get high word of 64bit loff_t > + lr %r4,%r5 # get low word of 64bit loff_t > + sllg %r5,%r6,32 # get high word of 64bit loff_t > + l %r5,164(%r15) # get low word of 64bit loff_t > + jg sys_fallocate > Index: linux-2.6.21/arch/s390/kernel/syscalls.S > =================================================================== > --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S > +++ linux-2.6.21/arch/s390/kernel/syscalls.S > @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * > SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) > SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) > SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) > +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) > Index: linux-2.6.21/arch/s390/kernel/sys_s390.c > =================================================================== > --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c > +++ linux-2.6.21/arch/s390/kernel/sys_s390.c > @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, > "d" (__arg3) : "memory"); > return __svcres; > } > + > +#ifndef CONFIG_64BIT > +/* > + * 
This is a wrapper to call sys_fallocate(). For 31 bit s390 the last > + * 64 bit argument "len" is split into the upper and lower 32 bits. The > + * system call wrapper in the user space loads the value to %r6/%r7. > + * The code in entry.S keeps the values in %r2 - %r6 where they are and > + * stores %r7 to 96(%r15). But the standard C linkage requires that > + * the whole 64 bit value for len is stored on the stack and doesn't > + * use %r6 at all. So s390_fallocate has to convert the arguments from > + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len > + * to > + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len > + */ > +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, > + u32 len_high, u32 len_low) > +{ > + union { > + u64 len; > + struct { > + u32 high; > + u32 low; > + }; > + } cv; > + cv.high = len_high; > + cv.low = len_low; > + return sys_fallocate(fd, mode, offset, cv.len); > +} > +#endif > Index: linux-2.6.21/include/asm-s390/unistd.h > =================================================================== > --- linux-2.6.21.orig/include/asm-s390/unistd.h > +++ linux-2.6.21/include/asm-s390/unistd.h > @@ -251,8 +251,9 @@ > #define __NR_getcpu 311 > #define __NR_epoll_pwait 312 > #define __NR_utimes 313 > +#define __NR_fallocate 314 > > -#define NR_syscalls 314 > +#define NR_syscalls 315 > > /* > * There are some system calls that are not present on 64 bit, some > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From owner-xfs@oss.sgi.com Mon May 14 13:20:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 13:20:08 -0700 (PDT) Received: from mailer.gwdg.de (mailer.gwdg.de [134.76.10.26]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EKK3fB012622 for ; Mon, 14 May 2007 13:20:04 -0700 Received: from linux01.gwdg.de ([134.76.13.21]) by 
mailer.gwdg.de with esmtps (TLSv1:AES256-SHA:256) (Exim 4.66) (envelope-from ) id 1Hnh1I-0001dN-FS; Mon, 14 May 2007 22:19:48 +0200 Received: from linux01.gwdg.de (localhost [127.0.0.1]) by linux01.gwdg.de (8.13.3/8.13.3/SuSE Linux 0.7) with ESMTP id l4EKG1gG018684; Mon, 14 May 2007 22:16:03 +0200 Received: from localhost (jengelh@localhost) by linux01.gwdg.de (8.13.3/8.13.3/Submit) with ESMTP id l4EKG0JO018678; Mon, 14 May 2007 22:16:00 +0200 Date: Mon, 14 May 2007 22:16:00 +0200 (MEST) From: Jan Engelhardt To: Matt Mackall cc: Jeremy Fitzhardinge , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? In-Reply-To: <20070512124641.GZ11115@waste.org> Message-ID: References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> <20070512124641.GZ11115@waste.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 11422 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jengelh@linux01.gwdg.de Precedence: bulk X-list: xfs On May 12 2007 07:46, Matt Mackall wrote: >> >> You should not assume alphabetical order. Filesystems may be free to >> reorder things and return them (1) randomly like in a hash (2) by >> creation time during readdir(). > >There is no assumption. Mercurial explicitly visits files in >alphabetical order for the above commands. But who says that for i in {a..z}; do ## {..} is a bash3 extension touch $i; done; actually makes readdir() return them in the same order? 
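One way to sidestep that question is to sort the names after reading them, so the visiting order no longer depends on what readdir() happens to return. A minimal sketch in C, assuming plain strcmp() byte ordering is the "alphabetical" order wanted; sort_names() and sorted_dir_names() are hypothetical helpers made up for illustration, not code from Mercurial or from any filesystem:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C strings (char *). */
static int cmp_names(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort an array of names in strcmp() order. */
void sort_names(char **names, size_t n)
{
	if (n > 1)
		qsort(names, n, sizeof(*names), cmp_names);
}

/*
 * Hypothetical helper: read every entry of `path` except "." and "..",
 * in whatever order readdir() returns them, then sort the copies.
 * On success returns 0 and hands back a malloc'd array of malloc'd
 * names plus its length; on failure returns -1.
 */
int sorted_dir_names(const char *path, char ***names_out, size_t *n_out)
{
	DIR *dir;
	struct dirent *de;
	char **names = NULL;
	size_t n = 0, cap = 0;

	assert(path && names_out && n_out);

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		if (n == cap) {
			/* grow the array geometrically */
			size_t ncap = cap ? cap * 2 : 16;
			char **tmp = realloc(names, ncap * sizeof(*names));

			if (!tmp)
				goto fail;
			names = tmp;
			cap = ncap;
		}
		names[n] = strdup(de->d_name);
		if (!names[n])
			goto fail;
		n++;
	}
	closedir(dir);

	/* The explicit sort is what fixes the order, not readdir(). */
	sort_names(names, n);
	*names_out = names;
	*n_out = n;
	return 0;

fail:
	while (n > 0)
		free(names[--n]);
	free(names);
	closedir(dir);
	return -1;
}
```

A caller would walk the returned array in index order, then free() each name and finally the array itself.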
Jan -- From owner-xfs@oss.sgi.com Mon May 14 13:27:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 13:27:49 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EKRkfB013776 for ; Mon, 14 May 2007 13:27:47 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 902FD2C8046; Mon, 14 May 2007 13:26:57 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id EB7702C803C; Mon, 14 May 2007 13:26:54 -0700 (PDT) Received: from [75.210.82.29] (29.sub-75-210-82.myvzw.com [75.210.82.29]) by lurch.goop.org (Postfix) with ESMTP; Mon, 14 May 2007 13:26:54 -0700 (PDT) Message-ID: <4648C63F.7020800@goop.org> Date: Mon, 14 May 2007 13:27:43 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jan Engelhardt CC: Matt Mackall , David Chinner , Linux Kernel Mailing List , xfs@oss.sgi.com, michal.k.k.piotrowski@gmail.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <20070510004918.GS85884050@sgi.com> <46426D31.8070000@goop.org> <20070510012609.GU85884050@sgi.com> <46433049.4020003@goop.org> <20070510153832.GQ11115@waste.org> <20070512124641.GZ11115@waste.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11423 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs Jan Engelhardt wrote: > On May 12 2007 07:46, Matt Mackall wrote: > >>> You should not assume alphabetical order. Filesystems may be free to >>> reorder things and return them (1) randomly like in a hash (2) by >>> creation time during readdir(). 
>>> >> There is no assumption. Mercurial explicitly visits files in >> alphabetical order for the above commands. >> > > But who says that > > for i in {a..z}; do ## {..} is a bash3 extension > touch $i; > done; > > actually makes readdir() return them in the same order? Nobody. But doing a readdir, sorting the results and visiting the files in that order does mean you'll visit them in alphabetical order. Hence "explicitly visits". J From owner-xfs@oss.sgi.com Mon May 14 16:06:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 16:06:07 -0700 (PDT) Received: from one.firstfloor.org (one.firstfloor.org [213.235.205.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4EN61fB008398 for ; Mon, 14 May 2007 16:06:05 -0700 Received: by one.firstfloor.org (Postfix, from userid 503) id 8529B18902A2; Tue, 15 May 2007 00:39:46 +0200 (CEST) Date: Tue, 15 May 2007 00:39:46 +0200 From: Andi Kleen To: David Chatterton Cc: "'Andi Kleen'" , "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'" Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070514223946.GA19487@one.firstfloor.org> References: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP> User-Agent: Mutt/1.4.2.1i X-archive-position: 11424 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs > So yes this is designed for a workload where the number of AGs is a multiple > of the number of streams since mixing streams in the one AG is the problem > it tries to avoid. Sounds like a awful special case. Is that common? 
-Andi From owner-xfs@oss.sgi.com Mon May 14 17:05:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:05:45 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F05efB021310 for ; Mon, 14 May 2007 17:05:41 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA22854; Tue, 15 May 2007 10:05:33 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F05VAf93938633; Tue, 15 May 2007 10:05:32 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F05NdH93919177; Tue, 15 May 2007 10:05:23 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 15 May 2007 10:05:23 +1000 From: David Chinner To: Andi Kleen Cc: David Chatterton , "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'" Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070515000523.GQ86004887@sgi.com> References: <000001c79544$44076ac0$0501010a@DCHATTERTONLAPTOP> <20070514223946.GA19487@one.firstfloor.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514223946.GA19487@one.firstfloor.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11425 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 12:39:46AM +0200, Andi Kleen wrote: > > So yes this is designed for a workload where the number of AGs is a multiple > > of the number of streams since mixing streams in the one AG is the problem > > it tries to avoid. > > Sounds like a awful special case. Is that common? 
Common enough to be a serious problem when running multiple 2k ingest and playout streams (320MB/s each). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 14 17:12:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:12:39 -0700 (PDT) Received: from smtps.tip.net.au (chilli.pcug.org.au [203.10.76.44]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0CYfB022809 for ; Mon, 14 May 2007 17:12:35 -0700 Received: from localhost (bh02i525f01.au.ibm.com [202.81.18.30]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by smtps.tip.net.au (Postfix) with ESMTP id B0EFB368012; Tue, 15 May 2007 09:44:40 +1000 (EST) Date: Tue, 15 May 2007 09:44:36 +1000 From: Stephen Rothwell To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Message-Id: <20070515094436.d441098f.sfr@canb.auug.org.au> In-Reply-To: <20070514144524.GA31748@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144524.GA31748@amitarora.in.ibm.com> X-Mailer: Sylpheed 2.4.0 (GTK+ 2.10.12; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg="PGP-SHA1"; boundary="Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk" 
X-archive-position: 11426 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sfr@canb.auug.org.au Precedence: bulk X-list: xfs --Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk Content-Type: text/plain; charset=US-ASCII Content-Disposition: inline Content-Transfer-Encoding: 7bit On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" wrote: > > This patch implements sys_fallocate() and adds support on i386, x86_64 > and powerpc platforms. This patch no longer applies to Linus' tree - for a start there is no file arch/x86_64/kernel/functionlist any more. Can you rebase it, please? -- Cheers, Stephen Rothwell sfr@canb.auug.org.au --Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFGSPRpFdBgD/zoJvwRAuHCAJsEB8TyYfKxqEtWnHM7smTPNqRiPwCfYj2B kUd5qmBLOd+TYg003bKAuVw= =ap96 -----END PGP SIGNATURE----- --Signature=_Tue__15_May_2007_09_44_36_+1000_FmQQEkgMAA8ZAwtk-- From owner-xfs@oss.sgi.com Mon May 14 17:15:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:15:08 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F0F3fB023732 for ; Mon, 14 May 2007 17:15:05 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA22997; Tue, 15 May 2007 10:14:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F0EsAf91589765; Tue, 15 May 2007 10:14:54 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F0Epqv93878086; Tue, 15 May 2007 10:14:51 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com 
using -f Date: Tue, 15 May 2007 10:14:50 +1000 From: David Chinner To: Jeremy Fitzhardinge Cc: David Chinner , Jan Engelhardt , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? Message-ID: <20070515001450.GS86004887@sgi.com> References: <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com> <4645D594.4070801@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4645D594.4070801@goop.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11427 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sat, May 12, 2007 at 07:56:20AM -0700, Jeremy Fitzhardinge wrote: > David Chinner wrote: > > What I don't understand is that on unmount dirty xfs inodes get > > written out. Clearly this is not happening - either there's a hole > > in the writeback logic (unlikely - it was unchanged) or we've missed > > some case where we need to update the filesize and mark the inode > > dirty. > > > > Hmmmm - if the write was just a short append to the file, then the > > block that was written to should already be mapped. Then we'll just > > look up the extent by doing a BMAPI_READ lookup, set the type to > > IOMAP_READ and add the block to the ioend we are building. > > > > Well, that result I mailed you showed that the difference was just over > 16k, and that there was a 32 block difference in the final extent > length. Does that fit with this theory? Yes - because we do speculative allocation of 64k beyond EOF by default on appends....
> > The type IOMAP_READ determines the I/O completion behaviour - in this case > > it is xfs_end_bio_read(), which fails to update the file size.... > > > > Bingo. > > > > A patch for you to try, Jeremy. I've just started a test run on it... > > > > Thanks, I'll give it a spin. Have you reproduced the bug yourself? No, not yet. I haven't had chance because I'm travelling at the moment.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 14 17:15:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:15:54 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0FmfB024021 for ; Mon, 14 May 2007 17:15:50 -0700 Received: from DCHATTERTONLAPTOP (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id BCDA592C5DE; Tue, 15 May 2007 10:15:46 +1000 (EST) From: "David Chatterton" To: "'Andi Kleen'" Cc: "'xfs-dev'" , "'xfs-oss'" , "'David Chinner'" Subject: RE: Review: Concurrent Multi-File Data Streams Date: Tue, 15 May 2007 10:15:50 +1000 Message-ID: <00f501c79686$319a0530$0501010a@DCHATTERTONLAPTOP> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <20070514223946.GA19487@one.firstfloor.org> Thread-Index: AceWeMpxeBXomGDQT3ag+pGdmo367AACszxA X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 X-archive-position: 11428 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dchatterton@aconex.com Precedence: bulk X-list: xfs Andi, Dave just beat me to it, this represents the workload used by all post-production houses since they moved to digital where each stream is 320MB/s (2K format) or 1.3GB/s (4K format). 
Making sure those files are written sequentially on disk and do not overlap other streams has a huge benefit when supporting multiple streams. There is no reason why other workloads that would benefit from files in the same directory being written sequentially into their "own AG" would not use this feature. Post-production just tends to push the filesystem to the limits earlier than some other workloads. David > -----Original Message----- > From: Andi Kleen [mailto:andi@firstfloor.org] > Sent: Tuesday, 15 May 2007 8:40 AM > To: David Chatterton > Cc: 'Andi Kleen'; 'xfs-dev'; 'xfs-oss'; 'David Chinner' > Subject: Re: Review: Concurrent Multi-File Data Streams > > > So yes this is designed for a workload where the number of AGs is a > > multiple of the number of streams since mixing streams in > the one AG > > is the problem it tries to avoid. > > Sounds like a awful special case. Is that common? > > -Andi > From owner-xfs@oss.sgi.com Mon May 14 17:53:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 17:53:18 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F0rFfB031079 for ; Mon, 14 May 2007 17:53:15 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 3EADE2C8046; Mon, 14 May 2007 17:52:26 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 4DD452C803C; Mon, 14 May 2007 17:52:24 -0700 (PDT) Received: from [75.210.82.29] (29.sub-75-210-82.myvzw.com [75.210.82.29]) by lurch.goop.org (Postfix) with ESMTP; Mon, 14 May 2007 17:52:23 -0700 (PDT) Message-ID: <46490478.3010409@goop.org> Date: Mon, 14 May 2007 17:53:12 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: xfs@oss.sgi.com, Linux Kernel Mailing List Subject: 2.6.22-rc1 xfs lockdep messages Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit 
X-archive-position: 11429 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs I tend to get this when doing unlinks or rms in xfs: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.22-rc1-paravirt #1382 ------------------------------------------------------- rm/1451 is trying to acquire lock: (&(&ip->i_lock)->mr_lock/1){--..}, at: [] xfs_ilock+0x64/0x8d [xfs] but task is already holding lock: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x64/0x8d [xfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&ip->i_lock)->mr_lock){----}: [] __lock_acquire+0xa1f/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_iget_core+0x2bd/0x605 [xfs] [] xfs_iget+0xac/0x133 [xfs] [] xfs_trans_iget+0xdc/0x142 [xfs] [] xfs_ialloc+0xa5/0x457 [xfs] [] xfs_dir_ialloc+0x6d/0x260 [xfs] [] xfs_create+0x2f4/0x5a6 [xfs] [] xfs_vn_mknod+0x130/0x1e5 [xfs] [] xfs_vn_create+0x12/0x14 [xfs] [] vfs_create+0x9b/0xe5 [] open_namei+0x176/0x593 [] do_filp_open+0x26/0x3b [] do_sys_open+0x43/0xc7 [] sys_open+0x1c/0x1e [] syscall_call+0x7/0xb [] 0xffffffff -> #0 (&(&ip->i_lock)->mr_lock/1){--..}: [] __lock_acquire+0x903/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_lock_inodes+0x11d/0x12f [xfs] [] xfs_lock_dir_and_entry+0xc2/0xcc [xfs] [] xfs_remove+0x213/0x425 [xfs] [] xfs_vn_unlink+0x1c/0x44 [xfs] [] vfs_unlink+0x75/0xb3 [] do_unlinkat+0x96/0x12c [] sys_unlink+0x13/0x15 [] syscall_call+0x7/0xb [] 0xffffffff other info that might help us debug this: 3 locks held by rm/1451: #0: (&inode->i_mutex/1){--..}, at: [] do_unlinkat+0x5e/0x12c #1: (&inode->i_mutex){--..}, at: [] mutex_lock+0x1f/0x23 #2: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x64/0x8d [xfs] stack 
backtrace: [] show_trace_log_lvl+0x1a/0x30 [] show_trace+0x12/0x14 [] dump_stack+0x16/0x18 [] print_circular_bug_tail+0x5f/0x68 [] __lock_acquire+0x903/0xbab [] lock_acquire+0x7b/0x9f [] down_write_nested+0x3d/0x58 [] xfs_ilock+0x64/0x8d [xfs] [] xfs_lock_inodes+0x11d/0x12f [xfs] [] xfs_lock_dir_and_entry+0xc2/0xcc [xfs] [] xfs_remove+0x213/0x425 [xfs] [] xfs_vn_unlink+0x1c/0x44 [xfs] [] vfs_unlink+0x75/0xb3 [] do_unlinkat+0x96/0x12c [] sys_unlink+0x13/0x15 [] syscall_call+0x7/0xb ======================= J From owner-xfs@oss.sgi.com Mon May 14 23:23:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 23:23:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4F6NWfB020802 for ; Mon, 14 May 2007 23:23:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA00542; Tue, 15 May 2007 16:23:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4F6NSAf93905926; Tue, 15 May 2007 16:23:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4F6NRxD93828017; Tue, 15 May 2007 16:23:27 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 15 May 2007 16:23:27 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070515062327.GI85884050@sgi.com> References: <20070511003606.GB85884050@sgi.com> <20070513205953.GA14030@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070513205953.GA14030@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11430 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sun, May 13, 2007 at 09:59:53PM +0100, Christoph Hellwig wrote: > I already had some comments on this when discussing it with Sam in person, > but it seems like they didn't make it to you. Some people vaguely remembered some stuff (I did ask around) but no-one knew the exact details of what you and Sam talked about. > First, the mru cache, while being quite nice code, is heavily overengineered > for this case. Unless there are many hundred filestreams per filesystem > it will be a lot faster to just have a simple wrap-around array of > linked lists. Well.... The mru cache is a wrap-around array of linked lists. i.e. there's a linked list for each time quantum group, and an array that holds the head of each list. As each time quantum expires, we reclaim the oldest list and move the head pointer to the just emptied list for the new or newly referenced entries. I guess then you're commenting on the fact that it is also indexed by a radix tree? Given that during QA I've seen the cache grow to over 30,000 elements (one mru cache entry per cached inode), this cache can grow very large. In that particular test (083 - multiple fsstress at ENOSPC) each AG had around 2,000 stream references. That's far too large to search based on linked lists and the cache size variation pretty much rules out a hashing based solution. A radix tree gives pretty good lookup performance in these cases.... So the issue here is not that we have hundreds of streams but that we have the possibility of having to search hundreds of thousands of cache objects to find the association for a given inode..... > We don't want to feed the argument that xfs has lots of > useless bloated code, do we? :) I've got two or three other things lined up that will use the mru cache so I don't think this is an issue at all...
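For readers following the description above, here is a minimal userspace sketch of a "wrap-around array of linked lists" keyed by time quantum. It is not the XFS mru cache code: the names (mru_sketch, mru_insert, mru_lookup, mru_expire), the group count and the linear lookup are all assumptions made for illustration. The radix tree index discussed above is deliberately left out, so the lookup here is exactly the slow linked-list search that the radix tree is meant to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MRU_GROUPS 4	/* number of time quanta kept alive (assumed value) */

struct mru_elem {
	struct mru_elem *prev, *next;
	unsigned long key;	/* e.g. an inode number */
	int value;		/* e.g. an AG number */
};

struct mru_sketch {
	struct mru_elem heads[MRU_GROUPS];	/* one circular list per quantum */
	unsigned int newest;			/* group taking new/referenced entries */
	unsigned int count;			/* total entries in the cache */
};

static void list_init(struct mru_elem *h) { h->prev = h->next = h; }

static void list_del(struct mru_elem *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_add(struct mru_elem *h, struct mru_elem *e)
{
	e->next = h->next;
	e->prev = h;
	h->next->prev = e;
	h->next = e;
}

void mru_init(struct mru_sketch *m)
{
	for (unsigned int i = 0; i < MRU_GROUPS; i++)
		list_init(&m->heads[i]);
	m->newest = 0;
	m->count = 0;
}

/* Add a new entry to the newest group. Returns 0, or -1 on allocation failure. */
int mru_insert(struct mru_sketch *m, unsigned long key, int value)
{
	struct mru_elem *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	e->value = value;
	list_add(&m->heads[m->newest], e);
	m->count++;
	return 0;
}

/* Linear search over every group; a hit moves the entry to the newest group. */
struct mru_elem *mru_lookup(struct mru_sketch *m, unsigned long key)
{
	for (unsigned int i = 0; i < MRU_GROUPS; i++) {
		struct mru_elem *h = &m->heads[i];

		for (struct mru_elem *e = h->next; e != h; e = e->next) {
			if (e->key == key) {
				list_del(e);
				list_add(&m->heads[m->newest], e);
				return e;
			}
		}
	}
	return NULL;
}

/*
 * Called once per time quantum: reclaim the oldest list, then reuse the
 * just-emptied slot as the newest group. Returns the number of entries freed.
 */
unsigned int mru_expire(struct mru_sketch *m)
{
	unsigned int oldest = (m->newest + 1) % MRU_GROUPS;
	struct mru_elem *h = &m->heads[oldest];
	unsigned int freed = 0;

	while (h->next != h) {
		struct mru_elem *e = h->next;

		list_del(e);
		free(e);
		freed++;
	}
	assert(m->count >= freed);
	m->count -= freed;
	m->newest = oldest;
	return freed;
}
```

In this sketch an entry that is never referenced again is reclaimed on the MRU_GROUPS-th call to mru_expire() after it was inserted, while each successful lookup restarts that clock.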
> All the pip != NULL checks are superfluous in Linux. A regular > file can never have a non-null parent inode, and a directory can only > have a non-NULL parent inode in very odd corner cases involving NFS > exports, but it has to be connected again once you start doing > namespace modifying operations on it. Yes - I was told you'd said that about the code but I couldn't understand how or why it was even relevant because the code has nothing at all to do with dentries or looking up parent inodes. Now I have the full context.... So, we do this:

	/* Pick a new AG for the parent inode starting at startag. */
	if ((err = _xfs_filestream_pick_ag(mp, startag, &ag, 0, 0)) ||
	    ag == NULLAGNUMBER)
		goto exit_did_pick;

	/* Associate the parent inode with the AG. */
	if ((err = _xfs_filestream_set_ag(pip, NULL, ag))) {
		dprint("_xfs_filestream_set_ag(%p (%lld), NULL, %d) -> err %d",
		       pip, pip->i_ino, ag, err);
		goto exit_did_pick;
	}

	/* Associate the file inode with the AG. */
	if ((err = _xfs_filestream_set_ag(ip, pip, ag))) {
		dprint("_xfs_filestream_set_ag(%p (%lld), %p (%lld), %d) -> "
		       "err %d", ip, ip->i_ino, pip, pip->i_ino, ag, err);
		goto exit_did_pick;
	}

_xfs_filestream_set_ag() is called in two cases here - once without a parent inode, and once with. When we associate a directory with an AG, we don't care what its parent association is - we want that directory to be associated with the ag we got from _xfs_filestream_pick_ag(), not its parent's association. With regular file inodes we want it to be associated with the parent inode's AG so we need to pass in a pip. Hence all the checks for pip being/not being NULL are required in this function. It really has nothing to do with whether an inode has a parent connected to it in the dentry tree or not.... > There's some naming confusion: xfs_mount.h forward-declares struct
> The former is very non-descriptive and the latter but ugly, I'd > suggestjust putting the mru-cache replacement directly in there > as xfs_filestream_cache instead of the wrapping. I'll look at changing names to something more sensible, but at this point I don't see that the mru cache going away... > The xfs_zeroino changes looks good but should be a separate commit. Ok, I'll extract that out.... > Some comments on the actual code in xfs_filestream.c > > > +#ifdef DEBUG_FILESTREAMS > > +#define dprint(fmt, args...) do { \ > > + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ > > + current_pid(), __FUNCTION__, ##args); \ > > +} while(0) > > +#else > > +#define dprint(args...) do {} while (0) > > +#endif > > This should probably be killed entirely. I think it needs to be replaced with real tracing code rather than printk()s - this stuff is pretty much impossible to debug in a finite time period without some form of tracing telling us what happened. Is converting this to ktrace infrastructure acceptible? > > +#define GET_AG_REF(mp, ag) atomic_read(&(mp)->m_perag[ag].pagf_fstrms) > > +#define INC_AG_REF(mp, ag) atomic_inc_return(&(mp)->m_perag[ag].pagf_fstrms) > > +#define DEC_AG_REF(mp, ag) atomic_dec_return(&(mp)->m_perag[ag].pagf_fstrms) > > These should be inlines with more descriptive lower case names. *nod* > > +#define XFS_PICK_USERDATA 1 > > +#define XFS_PICK_LOWSPACE 2 > > enum. Yup. > > + for (nscan = 0; 1; nscan++) { > > + > > + //dprint("scanning AG %d[%d]", ag, GET_AG_REF(mp, ag)); > > please don't leave commented out debug code in. I missed that one :/ > > + pag = mp->m_perag + ag; > > + > > + if (!pag->pagf_init && > > + (err = xfs_alloc_pagf_init(mp, NULL, ag, trylock)) && > > + !trylock) { > > + dprint("xfs_alloc_pagf_init returned %d", err); > > + return err; > > + } > > if (!pag->pagf_init) { > err = xfs_alloc_pagf_init(mp, NULL, ag, trylock); > if (err && !trylock) > return err; > } Yup, I'll convert all those. 
> > +static int > > +_xfs_filestream_set_ag( > > + xfs_inode_t *ip, > > + xfs_inode_t *pip, > > + xfs_agnumber_t ag) > > +{ > > + int err = 0; > > + xfs_mount_t *mp; > > + xfs_mru_cache_t *cache; > > + fstrm_item_t *item; > > + xfs_agnumber_t old_ag; > > + xfs_inode_t *old_pip; > > + > > + /* > > + * Either ip is a regular file and pip is a directory, or ip is a > > + * directory and pip is NULL. > > + */ > > We have parent information for parents aswell so this should probably > be made more regular. As explained above, the association of the parent of a directory is irrelevant which is why we do not use it... > > +void > > +xfs_filestream_init(void) > > +{ > > + item_zone = kmem_zone_init(sizeof(fstrm_item_t), "fstrm_item"); > > + ASSERT(item_zone); > > Please check for errors instead and propagate them. Ooo. I missed that one. > > +/* > > + * xfs_filestream_uninit() is called at xfs termination time to destroy the > > + * memory zone that was used for filestream data structure allocation. > > + */ > > +void > > +xfs_filestream_uninit(void) > > +{ > > + if (item_zone) { > > + kmem_zone_destroy(item_zone); > > + item_zone = NULL; > > + } > > +} > > no need for the NULL check or setting it to NULL. *nod* > > + if (!(md = (fstrm_mnt_data_t*)kmem_zalloc(sizeof(*md), KM_SLEEP))) > > Please use KM_MAYFAIL for all new code otside of transactions. Yeah - that is pretty silly - checking if a KM_SLEEP allocation failed.... > > + ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR)); > > + if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR))) > > + return NULLAGNUMBER; > > either the assert or the if clause checking gor it, please. Purely defensive - on a production system we'll return NULLAGNUMBER if we get called for the wrong type so teh system will silently continue without issues. On a debug kernel we'll get an assert failure so we can debug why we got here incorrectly. 
This is a common way of handling should-not-happen-but-not-fatal error conditions in XFS - look at all the places where we have "ASSERT(0)" in error cases where a non-debug kernel will just return an error. What is the accepted way of coding this? > Now comes the worst part, the new allocator function > > If we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc > we see that it's a pretty bad cut & paste job: FWIW, it was done that way originally so that it didn't perturb the existing allocator code. > > --- btalloc 2007-05-12 12:43:03.000000000 +0200 > +++ fsalloc 2007-05-12 12:42:28.000000000 +0200 > @@ -1,44 +1,54 @@ > > > + rt = (ap->ip->i_d.di_flags & XFS_DIFLAG_REALTIME) && ap->userdata; > > xfs_bmap_alloc() never calls xfs_bmap_filestreams if this is > true so all code guarded by if (rt) is dead. Will kill. > > - if (unlikely(align)) { > > + if (align) { > > Align should have the same likelihood for oth > > > - if (nullfb) > > - ap->rval = XFS_INO_TO_FSB(mp, ap->ip->i_ino); > > - else > > + if (nullfb) { > > + ag = xfs_filestream_get_ag(ap->ip); > > + ag = (ag != NULLAGNUMBER) ? ag : 0; > > + ap->rval = (ap->userdata) ? XFS_AGB_TO_FSB(mp, ag, 0) : > > + XFS_INO_TO_FSB(mp, ap->ip->i_ino); > > + } else { > > ap->rval = ap->firstblock; > > + } > > Some real changes :) But this could be just a third if case > for the filestream case. Yes, it could..... > > @@ -117,18 +167,19 @@ > > */ > > else > > args.minlen = ap->alen; > > + ap->rval = args.fsbno = XFS_AGB_TO_FSB(mp, ag, 0); > > } else if (ap->low) { > > - args.type = XFS_ALLOCTYPE_START_BNO; > > + args.type = XFS_ALLOCTYPE_FIRST_AG; > > args.total = args.minlen = ap->minlen; > > Why is this different? Because when we are low on space stream associations typically fail and we associate with AG 0 in that case.
> } > > - if (unlikely(ap->userdata && ap->ip->i_d.di_extsize && > > - (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) { > > + if (ap->userdata && ap->ip->i_d.di_extsize && > > + (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) { > args.prod = ap->ip->i_d.di_extsize; > > - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod))) > > + if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) > > Gratious difference. > > * is >= the stripe unit and the allocation offset is > * at the end of file. > */ > > + atype = args.type; > > I don't quite undersatnd why we'd nee this in one, but not the other. I don't think it's needed in either. Possibly it was added to remove a used-uninitialised warning... > Based onthat my conclusion is that xfs_bmap_filestreams and xfs_bmap_btalloc > should be merged to avoid further maintaince overhead. Yes, agreed - they could be. Christoph - thanks for taking the time to review this code. I'll post a new version in a few days when I've had a chance to incorporate your suggestions... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 14 23:31:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 14 May 2007 23:31:31 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F6VPfB022437 for ; Mon, 14 May 2007 23:31:27 -0700 Received: from localhost.adilger.int (S0106000bdb95b39c.cg.shawcable.net [70.72.213.136]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 61CD44E4595; Tue, 15 May 2007 00:31:22 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 10F4E4078; Tue, 15 May 2007 00:31:21 -0600 (MDT) Date: Tue, 15 May 2007 00:31:21 -0600 From: Andreas Dilger To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5][TAKE2] fallocate system call Message-ID: <20070515063120.GI5286@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070514132926.GA30768@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11431 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On May 14, 2007 18:59 +0530, Amit K. Arora wrote: > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > fd: The descriptor of the open file. > > mode*: This specifies the behavior of the system call. Currently the > system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. > FA_ALLOCATE: Applications can use this mode to preallocate blocks to > a given file (specified by fd). This mode changes the file size if > the preallocation is done beyond the EOF. 
It also updates the > ctime/mtime in the inode of the corresponding file, marking a > successfull allocation. > FA_DEALLOCATE: This mode can be used by applications to deallocate the > previously preallocated blocks. This also may change the file size > and the ctime/mtime. > * New modes might get added in future. One such new mode which is > already under discussion is FA_PREALLOCATE, which when used will > preallocate space but will not change the filesize and [cm]time. > Since the semantics of this new mode is not clear and agreed upon yet, > this patchset does not implement it currently. > > offset: This is the offset in bytes, from where the preallocation should > start. > > len: This is the number of bytes requested for preallocation (from > offset). What is the return value? I'd hope it is the number of bytes preallocated, in case of interrupted preallocation for whatever reason (interrupt, out of space, etc) like a regular write(2) call. In this case the return type needs to also be an loff_t to match @len. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
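To make the question above concrete, this is a small sketch of how a caller might cope if the call did return the number of bytes preallocated, in the style of write(2). Everything here is an assumption for illustration: the prealloc_fn signature, the helper prealloc_all() and the fake implementation are hypothetical, and nothing in the quoted proposal confirms that the syscall behaves this way.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical preallocation call: returns the number of bytes actually
 * preallocated (possibly fewer than len), or -1 with errno set.
 * This is NOT the signature proposed in the patch set.
 */
typedef int64_t (*prealloc_fn)(int fd, int64_t offset, int64_t len);

/*
 * Keep calling fn until the whole range is preallocated, as one would loop
 * around a short write(2). Returns the number of bytes preallocated, or -1
 * if the very first call failed.
 */
int64_t prealloc_all(prealloc_fn fn, int fd, int64_t offset, int64_t len)
{
	int64_t done = 0;

	while (done < len) {
		int64_t n = fn(fd, offset + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, nothing done: retry */
			return done ? done : -1;
		}
		if (n == 0)
			break;			/* no progress, e.g. out of space */
		done += n;
	}
	return done;
}

/* Fake backend for demonstration: preallocates at most 4096 bytes per call. */
int64_t fake_prealloc_4k(int fd, int64_t offset, int64_t len)
{
	(void)fd;
	(void)offset;
	return len < 4096 ? len : 4096;
}
```

With a return type of int, as in the prototype quoted above, a partial byte count for a large loff_t length could not be reported, which is the point being made about matching the type of @len.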
From owner-xfs@oss.sgi.com Tue May 15 01:12:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 01:12:44 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F8CffB011266 for ; Tue, 15 May 2007 01:12:43 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 4240BB00B91A; Tue, 15 May 2007 04:12:41 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 1C06E50001A7 for ; Tue, 15 May 2007 04:12:41 -0400 (EDT) Date: Tue, 15 May 2007 04:12:40 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: xfs@oss.sgi.com Subject: xfs_db: segfault: error 4 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11432 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Kernel: 2.6.21.1 # xfs_db -V xfs_db version 2.8.18 May 14 22:15:54 p34 kernel: [186121.414224] xfs_db[18999]: segfault at 00000000005b6ff8 rip 00002ac92ced40ce rsp 00007fff7e0a9a68 error 4 This happened while running xfs_db -c frag -f /dev/md3 (which runs nightly); this is the first time I have seen this problem. Justin.
From owner-xfs@oss.sgi.com Tue May 15 02:49:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 02:49:40 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4F9nafB026820 for ; Tue, 15 May 2007 02:49:38 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HntFJ-0002YO-LE; Tue, 15 May 2007 10:23:05 +0100 Date: Tue, 15 May 2007 10:23:05 +0100 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: Review: Concurrent Multi-File Data Streams Message-ID: <20070515092305.GA9409@infradead.org> References: <20070511003606.GB85884050@sgi.com> <20070513205953.GA14030@infradead.org> <20070515062327.GI85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515062327.GI85884050@sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11433 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 04:23:27PM +1000, David Chinner wrote: > Well.... The mru cache is a wrap-around array of linked lists. i.e. > There's a linked list for each time quanta group, and an array that > holds all the head of each list. As each time quanta expires, we > reclaim the oldest list and move the head pointer to the just > emptied list for the new or newly referenced entries. > > I guess then you're commenting on the fact that it is also indexed by > a radix tree? Yes. > > Given that during QA I've seen the cache grow to over 30,000 > elements (one mru cache entry per cached inode), this cache can grow > very large. 
In that particular test (083 - multiple fsstress at > ENOSPC) each AG had around 2,000 stream references. That's far too > large to search based on linked lists and the cache size variation > pretty much rules out a hashing based solution. Radix tree gives > pretty good lookup performance in these cases.... > > So the issue here is not that we have hundreds of streams but we > have the possibility of having to search hundreds of thousands of > cache objects to find the association for a given inode..... Okay, convinced. > > > We don't want to feed the argument that xfs has lots of > > useless bloated code, do we? :) > > I've got two or three other things lined up that will use the > mru cache so I don't think this is an issue at all... In that case however the code should move into lib/ instead of being in XFS. That also means updating it to kernel standard style, e.g. getting rid of all the odd XFS wrappers, removing useless casts, converting the documentation to kerneldoc style, returning negative error values, etc. Probably wants splitting into a separate patch. > > > All the pip != NULL checks are superfluous in Linux. A regular > > file can never have a non-null parent inode, and a directory can only > > have a non-NULL parent inode in very odd corner cases involving NFS > > exports, but it has to be connected again once you start doing > > namespace modifying operations on it. > > Yes - I was told you'd said that about the code but I couldn't > understand how or why it was even relevant because the code has > nothing at all to do with dentries or looking up parent inodes. > Now I have the full context.... Actually here I meant a different context :) This is in reference to the xfs_inode.c changes, which are namespace operations only called from the VFS so the normal Linux guarantees should always apply here. > _xfs_filestream_set_ag() is called in two cases here - once without a > parent inode, and once with.
When we associate a directory with an AG, > we don't care what its parent association is - we want that directory > to be associated with the ag we got from _xfs_filestream_pick_ag(), not > its parent's association. > > With regular file inodes we want it to be associated with the parent inode's > AG so we need to pass in a pip. Hence all the checks for pip being/not being > NULL are required in this function. It really has nothing to do with > whether an inode has a parent connected to it in the dentry tree or > not.... > > There's some naming confusion: xfs_mount.h forward-declares struct > > xfs_filestream but everything else uses struct fstrm_mnt_data. > > The former is very non-descriptive and the latter is ugly, I'd > > suggest just putting the mru-cache replacement directly in there > > as xfs_filestream_cache instead of the wrapping. > > I'll look at changing names to something more sensible, but at this > point I don't see the mru cache going away... Well in that case s/replacement//. Just have a struct mru_cache *m_filestreams; in struct xfs_mount. > > Some comments on the actual code in xfs_filestream.c > > > > > +#ifdef DEBUG_FILESTREAMS > > > +#define dprint(fmt, args...) do { \ > > > + printk(KERN_DEBUG "%4d %s: " fmt "\n", \ > > > + current_pid(), __FUNCTION__, ##args); \ > > > +} while(0) > > > +#else > > > +#define dprint(args...) do {} while (0) > > > +#endif > > > > This should probably be killed entirely. > > I think it needs to be replaced with real tracing code rather than > printk()s - this stuff is pretty much impossible to debug in a finite > time period without some form of tracing telling us what happened. > Is converting this to ktrace infrastructure acceptable? Sounds fine to me, that way it's consistent with the rest of XFS. And now that the kernel tracing infrastructure is making progress we might actually be able to use that in mainline soon.
> > > + ASSERT(ip->i_d.di_mode & (S_IFREG | S_IFDIR)); > > > + if (!(ip->i_d.di_mode & (S_IFREG | S_IFDIR))) > > > + return NULLAGNUMBER; > > > > either the assert or the if clause checking for it, please. > > Purely defensive - on a production system we'll return NULLAGNUMBER if > we get called for the wrong type so the system will silently continue > without issues. On a debug kernel we'll get an assert failure so we can > debug why we got here incorrectly. > > This is a common way of handling should-not-happen-but-not-fatal error > conditions in XFS - look at all the places where we have "ASSERT(0)" in > error cases that a non-debug kernel will just return an error. > > What is the accepted way of coding this? In normal kernel code this would be a BUG() in the taken branch of the if, that would probably translate to an ASSERT(0) in XFS. > > Now comes the worst part, the new allocator function > > > > If we look at a diff between xfs_bmap_filestreams and xfs_bmap_btalloc > > we see that it's a pretty bad cut & paste job: > > FWIW, it was done that way originally so that it didn't perturb the > existing allocator code. That might be a good strategy for delivering an IRIX patch to a customer, but for long-term maintenance this kind of duplication should rather be avoided. > > > } else if (ap->low) { > > > - args.type = XFS_ALLOCTYPE_START_BNO; > > > + args.type = XFS_ALLOCTYPE_FIRST_AG; > > > args.total = args.minlen = ap->minlen; > > > > Why is this different? > > Because when we are low on space stream associations typically fail > and we associate with AG 0 in that case. As Andi already mentioned that might be a bad default and some kind of round robin might be better. Or just falling back to the default allocator scheme so we don't get subtle differences.
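As an illustration of the round-robin idea mentioned above, a fallback could rotate through the AGs instead of always landing on AG 0. This is only a sketch under assumptions: pick_fallback_ag() and fallback_rotor are hypothetical names, the code is plain userspace C11 rather than XFS code, and it ignores free space, locking and per-mount state entirely.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical global rotor: counts how many fallback picks have been made. */
static atomic_uint fallback_rotor;

/*
 * Pick a fallback AG in round-robin order when no stream association could
 * be made. agcount is the number of allocation groups in the filesystem.
 */
unsigned int pick_fallback_ag(unsigned int agcount)
{
	assert(agcount > 0);
	/* atomic_fetch_add returns the value before the increment. */
	return atomic_fetch_add(&fallback_rotor, 1) % agcount;
}
```

A per-filesystem rotor would be the more natural home for the counter; a single global is used here only to keep the example short.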
From owner-xfs@oss.sgi.com Tue May 15 05:15:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 05:15:17 -0700 (PDT) Received: from gk.uu.epigenomics.net (gk.uu.epigenomics.net [195.127.125.226]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FCFBfB025527 for ; Tue, 15 May 2007 05:15:13 -0700 Received: (qmail 2398 invoked from network); 15 May 2007 11:48:30 -0000 Received: from perl.epigenomics.epi (192.168.48.4) by salam.epigenomics.epi with SMTP; 15 May 2007 11:48:30 -0000 Received: (qmail 24312 invoked by uid 9); 15 May 2007 11:48:30 -0000 From: linux-xfs@ml.epigenomics.com X-Newsgroups: epi.ml.linux.xfs Subject: xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory Date: Tue, 15 May 2007 11:48:30 +0000 (UTC) Organization: Epigenomics AG Lines: 73 Message-ID: X-Complaints-To: usenet@epigenomics.net User-Agent: slrn/0.9.8.1pl1 (Debian) To: xfs@oss.sgi.com X-archive-position: 11434 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: linux-xfs@ml.epigenomics.com Precedence: bulk X-list: xfs Hi! We have a RAID0 set of 3 400GB disks. After a crash we needed to run xfs_repair, but it bails out with the error message: - ensuring existence of lost+found directory - traversing filesystem starting at / ... xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory The filesystem contains many hardlinked files as it is a dirvish repository (www.dirvish.org) with the hardlinks created by rsync. 
This is the xfs_db info:

# xfs_db -r -c "sb 0" -c "p" /dev/md0
magicnum = 0x58465342
blocksize = 4096
dblocks = 293031424
rblocks = 0
rextents = 0
uuid = e8d3a22c-716f-4f3e-9e95-e06afb3559d0
logstart = 268435472
rootino = 256
rbmino = 257
rsumino = 258
rextsize = 48
agblocks = 9157232
agcount = 32
rbmblocks = 0
logblocks = 32768
versionnum = 0x3184
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 24
rextslog = 0
inprogress = 0
imax_pct = 25
icount = 18882496
ifree = 373596
fdblocks = 27494887
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 16
width = 48
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 0
features2 = 0

Kernel is 2.6.20.6 on a dual PIII machine with 1GB RAM and 10GB swap.

Mounting the filesystem is possible, but what about its current state?

Greetings -- Robert Sander Senior Manager Information Systems Epigenomics AG Kleine Praesidentenstr.
1 10178 Berlin, Germany phone:+49-30-24345-0 fax:+49-30-24345-555 http://www.epigenomics.com robert.sander@epigenomics.com From owner-xfs@oss.sgi.com Tue May 15 05:40:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 05:40:30 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FCeNfB031295 for ; Tue, 15 May 2007 05:40:25 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FCeN0N009204 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FCeNkl492006 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FCeM5B024328 for ; Tue, 15 May 2007 08:40:23 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FCeLZX024194; Tue, 15 May 2007 08:40:21 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id F3A7D94C82; Tue, 15 May 2007 18:10:26 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FCeLsU004120; Tue, 15 May 2007 18:10:21 +0530 Date: Tue, 15 May 2007 18:10:20 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5][TAKE2] fallocate system call Message-ID: <20070515124020.GA12964@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070515063120.GI5286@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515063120.GI5286@schatzie.adilger.int> User-Agent: Mutt/1.4.1i X-archive-position: 11435 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 12:31:21AM -0600, Andreas Dilger wrote: > On May 14, 2007 18:59 +0530, Amit K. Arora wrote: > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > > > fd: The descriptor of the open file. > > > > mode*: This specifies the behavior of the system call. Currently the > > system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. > > FA_ALLOCATE: Applications can use this mode to preallocate blocks to > > a given file (specified by fd). This mode changes the file size if > > the preallocation is done beyond the EOF. It also updates the > > ctime/mtime in the inode of the corresponding file, marking a > > successfull allocation. > > FA_DEALLOCATE: This mode can be used by applications to deallocate the > > previously preallocated blocks. 
This also may change the file size
> > and the ctime/mtime.
> > * New modes might get added in future. One such new mode which is
> > already under discussion is FA_PREALLOCATE, which when used will
> > preallocate space but will not change the filesize and [cm]time.
> > Since the semantics of this new mode are not clear and agreed upon yet,
> > this patchset does not implement it currently.
> >
> > offset: This is the offset in bytes, from where the preallocation should
> > start.
> >
> > len: This is the number of bytes requested for preallocation (from
> > offset).
>
> What is the return value? I'd hope it is the number of bytes preallocated,
> in case of interrupted preallocation for whatever reason (interrupt, out of
> space, etc) like a regular write(2) call. In this case the return type needs
> to also be an loff_t to match @len.

The return value in the current implementation has been kept as "long", where zero is returned on success and an error on failure. This is done to keep it in line with posix_fallocate() behavior. This point was brought up some time back by Badari. At that time it was decided to keep it the way posix_fallocate() is designed. Here are the posts related to this:

Still, if you feel that we should be returning the number of bytes preallocated, we can again ask for opinions here.

Thanks!
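
For readers comparing the two conventions discussed above: posix_fallocate() reports its result as 0 or an error number in the return value itself (it does not return a byte count, and it does not report failure through -1 and errno the way write(2) does). A minimal caller-side sketch, where reserve_space() is a hypothetical helper name and not part of the patchset:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* for the posix_fallocate() declaration */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Preallocate len bytes starting at offset.
 * Returns 0 on success or a positive error number on failure, which is
 * the convention the proposed system call mirrors (there as 0 or a
 * negative error code).
 */
int reserve_space(int fd, off_t offset, off_t len)
{
	int err = posix_fallocate(fd, offset, len);

	if (err != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
	return err;
}
```

With this convention a caller learns only whether the whole request succeeded; a write(2)-style byte count, as suggested in the quoted question, would additionally say how far a partially completed request got.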
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 06:24:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 06:24:08 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FDO1fB007752 for ; Tue, 15 May 2007 06:24:04 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FDNw43004993 for ; Tue, 15 May 2007 09:23:58 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FDNw4o269312 for ; Tue, 15 May 2007 07:23:58 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FDNvU0029955 for ; Tue, 15 May 2007 07:23:58 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FDNuUx029126; Tue, 15 May 2007 07:23:57 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id A399C94C82; Tue, 15 May 2007 18:53:53 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FDNrws021935; Tue, 15 May 2007 18:53:53 +0530 Date: Tue, 15 May 2007 18:53:53 +0530 From: "Amit K. 
Arora" To: Stephen Rothwell Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070515132353.GB12964@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070514132926.GA30768@amitarora.in.ibm.com> <20070514142820.GA31468@amitarora.in.ibm.com> <20070514144524.GA31748@amitarora.in.ibm.com> <20070515094436.d441098f.sfr@canb.auug.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515094436.d441098f.sfr@canb.auug.org.au> User-Agent: Mutt/1.4.1i X-archive-position: 11436 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 09:44:36AM +1000, Stephen Rothwell wrote: > On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" wrote: > > > > This patch implements sys_fallocate() and adds support on i386, x86_64 > > and powerpc platforms. > > This patch no longer applies to Linus' tree - for a start there is no file > arch/x86_64/kernel/functionlist any more. > > Can you rebase it, please? I will rebase it to 2.6.22-rc1 and repost the patches soon. Thanks! 
-- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 12:24:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 12:24:28 -0700 (PDT) Received: from mail.goop.org (gw.goop.org [64.81.55.164]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FJOKfB016847 for ; Tue, 15 May 2007 12:24:21 -0700 Received: by lurch.goop.org (Postfix, from userid 525) id 680712C8047; Tue, 15 May 2007 12:23:30 -0700 (PDT) Received: from lurch.goop.org (localhost [127.0.0.1]) by lurch.goop.org (Postfix) with ESMTP id 05DDE2C8043; Tue, 15 May 2007 12:23:28 -0700 (PDT) Received: from [75.208.159.192] (192.sub-75-208-159.myvzw.com [75.208.159.192]) by lurch.goop.org (Postfix) with ESMTP; Tue, 15 May 2007 12:23:27 -0700 (PDT) Message-ID: <464A08DC.7030303@goop.org> Date: Tue, 15 May 2007 12:24:12 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Jan Engelhardt , Chuck Ebbert , Linux Kernel Mailing List , Matt Mackall , xfs@oss.sgi.com Subject: Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem? References: <4642389E.4080804@goop.org> <20070509231643.GM85884050@sgi.com> <4642598E.3000607@goop.org> <20070510000119.GO85884050@sgi.com> <46426194.3040403@goop.org> <46439185.5060207@redhat.com> <464392B4.3070009@goop.org> <464393E1.3050705@redhat.com> <46439491.9010604@goop.org> <20070512135143.GG85884050@sgi.com> In-Reply-To: <20070512135143.GG85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11437 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeremy@goop.org Precedence: bulk X-list: xfs David Chinner wrote: > A patch for you to try, Jeremy. I've just started a test run on it... > OK, it seems to work. I haven't given it an overnight run, but its run longer without failing than it did before. 
J From owner-xfs@oss.sgi.com Tue May 15 12:37:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 12:37:27 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FJbMfB019056 for ; Tue, 15 May 2007 12:37:23 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FJbL1g008123 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FJbL8M528288 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FJbKHH022655 for ; Tue, 15 May 2007 15:37:21 -0400 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FJbIoq022564; Tue, 15 May 2007 15:37:19 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 8853010CFF8; Wed, 16 May 2007 01:07:25 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FJbN4Q026374; Wed, 16 May 2007 01:07:23 +0530 Date: Wed, 16 May 2007 01:07:22 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 0/5][TAKE3] fallocate system call Message-ID: <20070515193722.GA3487@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11438 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P L E A S E N O T E : *********************** 1. Patches have been now rebased to 2.6.22-rc1 kernel. Earlier they were based on 2.6.21. 2. An unnecessary export of symbol is removed from the ext4 preallocate patch. Details in the corresponding post (PATCH 4/5). 3. Return type now described in the interface description below. 4. Besides above points, everything is exactly same as TAKE2. -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This is the new set of patches which take care of the review comments received from the community (mainly from Andrew). Description: ----------- fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. 
Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
	FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime/mtime in the inode of the corresponding file, marking a successful allocation.
	FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime.
* New modes might get added in future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time.
Since the semantics of this new mode are not clear and agreed upon yet, this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should start.

len: This is the number of bytes requested for preallocation (from offset).

RETURN VALUE: The system call returns 0 on success and an error on failure. This is done to keep the semantics the same as those of posix_fallocate().

sys_fallocate() on s390:
-----------------------
There is a problem with the s390 ABI in implementing sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel. This will require special handling of this system call on s390 in glibc as well. But this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents is a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See:

Since there is talk of ->fault() replacing ->page_mkwrite(), and with a generic block_page_mkwrite() implementation already posted, we can implement this some time later. See:

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
---------
Each post will have an individual changelog for a particular patch.
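
The fallback expected of posix_fallocate() in the Description (try the system call first, write zeroes only if the kernel or file system does not implement it) can be sketched as below. This is an illustration under stated assumptions, not the glibc change itself: finish_fallocate() and zero_fill() are hypothetical names, the caller is assumed to pass in the raw result of the attempted system call (0 or a negative error code), and the zero-writing loop is deliberately simplified so that it overwrites the whole range.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for the pwrite() declaration */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Simplified library fallback: write zeroes over [offset, offset + len).
 * Assumption of this sketch: the range holds no data yet, because this
 * loop overwrites whatever is there.
 */
static int zero_fill(int fd, off_t offset, off_t len)
{
	static const char zeroes[4096];

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeroes) ?
				(size_t)len : sizeof(zeroes);
		ssize_t n = pwrite(fd, zeroes, chunk, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return errno;
		}
		if (n == 0)
			return EIO;		/* no progress, give up */
		offset += n;
		len -= n;
	}
	return 0;
}

/*
 * kernel_ret is the result of the attempted system call: 0 or a
 * negative error code. Returns 0 or a positive error number, in the
 * style of posix_fallocate().
 */
int finish_fallocate(long kernel_ret, int fd, off_t offset, off_t len)
{
	if (kernel_ret == -ENOSYS || kernel_ret == -EOPNOTSUPP)
		return zero_fill(fd, offset, len);	/* no kernel support */
	return kernel_ret ? (int)-kernel_ret : 0;
}
```

Only -ENOSYS and -EOPNOTSUPP trigger the slow path; any other failure, such as -ENOSPC, is passed back to the caller unchanged apart from the sign.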
Following patches follow: Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc Patch 2/5 : fallocate() on s390 Patch 3/5 : ext4: Extent overlap bugfix Patch 4/5 : ext4: fallocate support in ext4 Patch 5/5 : ext4: write support for preallocated blocks -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue May 15 13:04:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:04:07 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FK40fB023066 for ; Tue, 15 May 2007 13:04:03 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FK3umn028133 for ; Tue, 15 May 2007 16:03:56 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FK3uXw270492 for ; Tue, 15 May 2007 14:03:56 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FK3tN8001616 for ; Tue, 15 May 2007 14:03:56 -0600 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FK3rtw001348; Tue, 15 May 2007 14:03:54 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 66FD010CFF8; Wed, 16 May 2007 01:34:01 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FK40Tf008752; Wed, 16 May 2007 01:34:00 +0530 Date: Wed, 16 May 2007 01:33:59 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070515200359.GA5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11439 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: --------- Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 
5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/ia32/ia32entry.S | 1 fs/open.c | 89 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 2 include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 119 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22-rc1/fs/open.c =================================================================== --- linux-2.6.22-rc1.orig/fs/open.c +++ linux-2.6.22-rc1/fs/open.c @@ -353,6 +353,95 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * 
@mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. 
+ */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. + */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + + /* + * Update [cm]time. + * Partial allocation will not result in the time stamp changes, + * since ->fallocate will return error (say, -ENOSPC) in this case. + */ + if (!ret) + file_update_time(file); +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. 
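
Editorial note on the fs/open.c hunk above: the "wrap through zero" test computes offset + len before checking it, and in ISO C the result of a signed addition that overflows is undefined. One way to express the same limits without performing that addition is sketched here. It is not part of the posted patch; check_range() is a hypothetical helper, int64_t stands in for loff_t, and max_bytes (standing in for s_maxbytes) is assumed to be non-negative.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate the range [offset, offset + len) against max_bytes without
 * ever computing offset + len. Returns 0 or a negative error code, with
 * the same codes the patch uses for these cases.
 */
int check_range(int64_t offset, int64_t len, int64_t max_bytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	/*
	 * "offset + len > max_bytes", rearranged. Once offset is known to
	 * lie in [0, max_bytes], max_bytes - offset cannot overflow.
	 */
	if (offset > max_bytes || len > max_bytes - offset)
		return -EFBIG;
	return 0;
}
```

A range that ends exactly at max_bytes is accepted, as in the patch, which rejects only offset + len greater than s_maxbytes.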
Index: linux-2.6.22-rc1/include/asm-i386/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-i386/unistd.h +++ linux-2.6.22-rc1/include/asm-i386/unistd.h @@ -329,10 +329,11 @@ #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_fallocate 324 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 325 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/systbl.h +++ linux-2.6.22-rc1/include/asm-powerpc/systbl.h @@ -308,3 +308,4 @@ COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) +COMPAT_SYS(fallocate) Index: linux-2.6.22-rc1/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/unistd.h +++ linux-2.6.22-rc1/include/asm-powerpc/unistd.h @@ -327,10 +327,11 @@ #define __NR_getcpu 302 #define __NR_epoll_pwait 303 #define __NR_utimensat 304 +#define __NR_fallocate 305 #ifdef __KERNEL__ -#define __NR_syscalls 305 +#define __NR_syscalls 306 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.22-rc1/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-x86_64/unistd.h +++ linux-2.6.22-rc1/include/asm-x86_64/unistd.h @@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd) __SYSCALL(__NR_timerfd, sys_timerfd) #define __NR_eventfd 283 __SYSCALL(__NR_eventfd, sys_eventfd) +#define __NR_fallocate 284 +__SYSCALL(__NR_fallocate, sys_fallocate) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/linux/fs.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/fs.h +++ 
linux-2.6.22-rc1/include/linux/fs.h @@ -266,6 +266,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1137,6 +1148,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; Index: linux-2.6.22-rc1/include/linux/syscalls.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/syscalls.h +++ linux-2.6.22-rc1/include/linux/syscalls.h @@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si asmlinkage long sys_timerfd(int ufd, int clockid, int flags, const struct itimerspec __user *utmr); asmlinkage long sys_eventfd(unsigned int count); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_fallocate ia32_syscall_end: From owner-xfs@oss.sgi.com Tue May 15 13:10:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:10:42 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com 
[32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKAbfB024550 for ; Tue, 15 May 2007 13:10:38 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FKAaUa023072 for ; Tue, 15 May 2007 16:10:36 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKAaP7557306 for ; Tue, 15 May 2007 16:10:36 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FKAaG5010348 for ; Tue, 15 May 2007 16:10:36 -0400 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FKAYbe010301; Tue, 15 May 2007 16:10:35 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 310A410CFF8; Wed, 16 May 2007 01:40:42 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FKAf4G012434; Wed, 16 May 2007 01:40:41 +0530 Date: Wed, 16 May 2007 01:40:40 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 2/5][TAKE3] fallocate() on s390 Message-ID: <20070515201040.GB5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11440 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs

This is the patch suggested by Martin Schwidefsky to support sys_fallocate() on the s390(x) platform. He also suggested a wrapper in glibc to handle this system call on s390. I am posting it here so that we get feedback on it too.

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)

Here are his comments and his patch to the Linux kernel.

-------------
From: Martin Schwidefsky

This patch implements support for the fallocate system call on the s390(x) platform. A wrapper is added to address the issue that the s390 ABI has with the arguments of this system call.
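To illustrate the argument issue for archive readers: on 31-bit s390 the 64-bit "len" reaches the kernel as two 32-bit halves, and the wrapper has to join them before calling sys_fallocate(). Below is a minimal sketch of that conversion in portable C. The helper names combine_len() and split_len() are hypothetical and are not part of the patch; the patch itself uses a union in s390_fallocate(), which relies on s390 being big-endian, whereas the shifts here do not depend on byte order.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): rebuild the 64-bit length
 * from the two 32-bit halves the 31-bit syscall entry path provides.
 */
static uint64_t combine_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | len_low;
}

/*
 * Hypothetical inverse: split a 64-bit length into the high and low
 * words that a user-space wrapper would place in a register pair.
 */
static void split_len(uint64_t len, uint32_t *len_high, uint32_t *len_low)
{
	*len_high = (uint32_t)(len >> 32);
	*len_low = (uint32_t)(len & 0xffffffffu);
}
```

Splitting a value and combining the halves again should return the original length, which is the property the kernel-side wrapper depends on.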
Signed-off-by: Martin Schwidefsky --- arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ arch/s390/kernel/sys_s390.c | 29 +++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + include/asm-s390/unistd.h | 3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr %r2,%r2 # int + lgfr %r3,%r3 # int + sllg %r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg %r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c @@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar return -EFAULT; return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice); } + +#ifndef CONFIG_64BIT +/* + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. 
So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.22-rc1/include/asm-s390/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h +++ linux-2.6.22-rc1/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some From owner-xfs@oss.sgi.com Tue May 15 13:13:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:13:37 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKDUfB025378 for ; Tue, 15 May 2007 13:13:31 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l4FK9pSH002775 for ; Tue, 15 May 2007 16:09:51 -0400 Received: from 
d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKDQ8U269646 for ; Tue, 15 May 2007 14:13:26 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FKDPw6006804 for ; Tue, 15 May 2007 14:13:26 -0600 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FKDOpZ006738; Tue, 15 May 2007 14:13:25 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 1FC6B10CFFA; Wed, 16 May 2007 01:43:32 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FKDVtr013986; Wed, 16 May 2007 01:43:31 +0530 Date: Wed, 16 May 2007 01:43:30 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 3/5][TAKE3] ext4: Extent overlap bugfix Message-ID: <20070515201330.GC5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11441 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: 
aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds a check for overlap of extents and cuts short the new extent to be inserted, if there is a chance of overlap. Changelog: --------- Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. As suggested by Andrew, a check for wrap through zero has been added. Here is the new patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 60 ++++++++++++++++++++++++++++++++++++++-- include/linux/ext4_fs_extents.h | 1 2 files changed, 59 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c =================================================================== --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode } /* + * check if a portion of the "newext" extent overlaps with an + * existing extent. + * + * If there is an overlap discovered, it updates the length of the newext + * such that there will be no overlap, and then returns 1. + * If there is no overlap found, it returns 0. 
+ */ +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + unsigned int ret = 0; + + b1 = le32_to_cpu(newext->ee_block); + len1 = le16_to_cpu(newext->ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext->ee_block); + + /* + * get the next allocated block if the extent in the path + * is before the requested block(s) + */ + if (b2 < b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + /* check for wrap through zero */ + if (b1 + len1 < b1) { + len1 = EXT_MAX_BLOCK - b1; + newext->ee_len = cpu_to_le16(len1); + ret = 1; + } + + /* check for overlap */ + if (b1 + len1 > b2) { + newext->ee_len = cpu_to_le16(b2 - b1); + ret = 1; + } +out: + return ret; +} + +/* * ext4_ext_insert_extent: * tries to merge requsted extent into the existing extent or * inserts requested extent as new one into the tree, @@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); - allocated = max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h 
=================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Tue May 15 13:16:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:16:55 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKGofB026629 for ; Tue, 15 May 2007 13:16:51 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FKGo9U005872 for ; Tue, 15 May 2007 16:16:50 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKGo2p533384 for ; Tue, 15 May 2007 16:16:50 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FKGnOr020748 for ; Tue, 15 May 2007 16:16:49 -0400 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FKGkUM020645; Tue, 15 May 2007 16:16:47 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id D610B10CFFA; Wed, 16 May 2007 01:46:53 +0530 
(IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FKGqN2015849; Wed, 16 May 2007 01:46:52 +0530 Date: Wed, 16 May 2007 01:46:52 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 4/5][TAKE3] ext4: fallocate support in ext4 Message-ID: <20070515201652.GD5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11442 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements the ->fallocate() inode operation in ext4. With this patch, users of ext4 file systems will be able to use the fallocate() system call for persistent preallocation. The current implementation only supports preallocation for regular files with extent maps (directories are not supported to date). This patch does not currently support block-mapped files. Only FA_ALLOCATE mode is supported as of now. Supporting FA_DEALLOCATE mode is a TODO item. Changelog: --------- Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. 
The only difference from TAKE2 is the kernel version on which this patch is based and point "8)" below. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. Here are the changes from the previous post: 1) Added more description for ext4_fallocate(). 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent). 3) Moved journal_start & journal_stop inside the while loop. 4) Replaced BUG_ON with WARN_ON & ext4_error. 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally. 6) Added variable names in the function declaration of ext4_fallocate() 7) Converted macros that handle uninitialized extents into inline functions. 8) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);". Here is the updated patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 240 +++++++++++++++++++++++++++++++++------- fs/ext4/file.c | 1 include/linux/ext4_fs.h | 8 + include/linux/ext4_fs_extents.h | 12 ++ 4 files changed, 220 insertions(+), 41 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c =================================================================== --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int 
ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* + * Make sure that either both extents are uninitialized, or + * both are _not_. + */ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. 
*/ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) return err; - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len) - + le16_to_cpu(newext->ee_len)); + + /* + * ext4_can_extents_be_merged should have checked that either + * both extents are uninitialized, or both aren't. Thus we + * need to check only one of them here. 
+ */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(newext)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); eh = path[depth].p_hdr; nearex = ex; goto merge; @@ -1263,7 +1286,7 @@ has_space: ext_debug("first extent in the leaf: %d:%llu:%d\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len)); + ext4_ext_get_actual_len(newext)); path[depth].p_ext = EXT_FIRST_EXTENT(eh); } else if (le32_to_cpu(newext->ee_block) > le32_to_cpu(nearex->ee_block)) { @@ -1276,7 +1299,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 2, nearex + 1, len); } @@ -1289,7 +1312,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 1, nearex, len); path[depth].p_ext = nearex; @@ -1308,8 +1331,13 @@ merge: if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) break; /* merge with next extent! 
*/ - nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len) - + le16_to_cpu(nearex[1].ee_len)); + if (ext4_ext_is_uninitialized(nearex)) + uninitialized = 1; + nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) + + ext4_ext_get_actual_len(nearex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(nearex); + if (nearex + 1 < EXT_LAST_EXTENT(eh)) { len = (EXT_LAST_EXTENT(eh) - nearex - 1) * sizeof(struct ext4_extent); @@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in end = le32_to_cpu(ex->ee_block); if (block + num < end) end = block + num; - } else if (block >= - le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) { + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { /* need to allocate space after found extent */ start = block; end = block + num; @@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in * by found extent */ start = block; - end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len); + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); if (block + num < end) end = block + num; exists = 1; @@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in cbex.ec_type = EXT4_EXT_CACHE_GAP; } else { cbex.ec_block = le32_to_cpu(ex->ee_block); - cbex.ec_len = le16_to_cpu(ex->ee_len); + cbex.ec_len = ext4_ext_get_actual_len(ex); cbex.ec_start = ext_pblock(ex); cbex.ec_type = EXT4_EXT_CACHE_EXTENT; } @@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode * ext_debug("cache gap(before): %lu [%lu:%lu]", (unsigned long) block, (unsigned long) le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len)); + (unsigned long) ext4_ext_get_actual_len(ex)); } else if (block >= le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len)) { + + ext4_ext_get_actual_len(ex)) { lblock = le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len); + + ext4_ext_get_actual_len(ex); len = ext4_ext_next_allocated_block(path); ext_debug("cache gap(after): [%lu:%lu] %lu", (unsigned long) 
le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len), + (unsigned long) ext4_ext_get_actual_len(ex), (unsigned long) block); BUG_ON(len == lblock); len = len - lblock; @@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t * unsigned long from, unsigned long to) { struct buffer_head *bh; + unsigned short ee_len = ext4_ext_get_actual_len(ex); int i; #ifdef EXTENTS_STATS { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - unsigned short ee_len = le16_to_cpu(ex->ee_len); spin_lock(&sbi->s_ext_stats_lock); sbi->s_ext_blocks += ee_len; sbi->s_ext_extents++; @@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t * } #endif if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { /* tail removal */ unsigned long num; ext4_fsblk_t start; - num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from; - start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num; + num = le32_to_cpu(ex->ee_block) + ee_len - from; + start = ext_pblock(ex) + ee_len - num; ext_debug("free last %lu blocks starting %llu\n", num, start); for (i = 0; i < num; i++) { bh = sb_find_get_block(inode->i_sb, start + i); @@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t * } ext4_free_blocks(handle, inode, start, num); } else if (from == le32_to_cpu(ex->ee_block) - && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } else { printk("strange request: removal(2) %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } return 0; } @@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc unsigned a, b, block, num; unsigned long ex_ee_block; 
unsigned short ex_ee_len; + unsigned uninitialized = 0; struct ext4_extent *ex; ext_debug("truncate since %lu in leaf\n", start); @@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex = EXT_LAST_EXTENT(eh); ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex_ee_len = ext4_ext_get_actual_len(ex); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); if (err) @@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc ext_pblock(ex)); ex--; ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + ex_ee_len = ext4_ext_get_actual_len(ex); } if (correct_index && eh->eh_entries) @@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle if (ex) { unsigned long ee_block = le32_to_cpu(ex->ee_block); ext4_fsblk_t ee_start = ext_pblock(ex); - unsigned short ee_len = le16_to_cpu(ex->ee_len); + unsigned short ee_len; /* * Allow future support for preallocated extents to be added @@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle * Uninitialized extents are treated as holes, except that * we avoid (fail) allocating new blocks during a write. 
*/ - if (ee_len > EXT_MAX_LEN) + if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) goto out2; + ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { newblock = iblock - ee_block + ee_start; @@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); - ext4_ext_put_in_cache(inode, ee_block, ee_len, - ee_start, EXT4_EXT_CACHE_EXTENT); + /* Do not put uninitialized extent in the cache */ + if (!ext4_ext_is_uninitialized(ex)) + ext4_ext_put_in_cache(inode, ee_block, + ee_len, ee_start, + EXT4_EXT_CACHE_EXTENT); goto out; } } @@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(&newex); err = ext4_ext_insert_extent(handle, inode, path, &newex); if (err) goto out2; @@ -2109,8 +2149,10 @@ int ext4_ext_get_blocks(handle_t *handle newblock = ext_pblock(&newex); __set_bit(BH_New, &bh_result->b_state); - ext4_ext_put_in_cache(inode, iblock, allocated, newblock, - EXT4_EXT_CACHE_EXTENT); + /* Cache only when it is _not_ an uninitialized extent */ + if (create!=EXT4_CREATE_UNINITIALIZED_EXT) + ext4_ext_put_in_cache(inode, iblock, allocated, newblock, + EXT4_EXT_CACHE_EXTENT); out: if (allocated > max_blocks) allocated = max_blocks; @@ -2214,6 +2256,122 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +/* + * preallocate space for a file. This implements ext4's fallocate inode + * operation, which gets called from sys_fallocate system call. + * Currently only FA_ALLOCATE mode is supported on extent based files. 
+ * We may have more modes supported in future - like FA_DEALLOCATE, which + * tells fallocate to unallocate previously (pre)allocated blocks. + * For block-mapped files, posix_fallocate should fall back to the method + * of writing zeroes to the required new blocks (the same behavior which is + * expected for file systems which do not support fallocate() system call). + */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + ext4_fsblk_t nblocks = 0; + int ret = 0; + int ret2 = 0; + int retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode->i_blkbits; + + /* + * currently supporting (pre)allocate mode for extent-based + * files _only_ + */ + if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return -EOPNOTSUPP; + + /* preallocation to directories is currently not supported */ + if (S_ISDIR(inode->i_mode)) + return -ENODEV; + + block = offset >> blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) + - block; + + /* + * credits to insert 1 extent into extent tree + buffers to be able to + * modify 1 super block, 1 block bitmap and 1 group descriptor. + */ + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3; +retry: + while (ret >= 0 && ret < max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + handle = ext4_journal_start(inode, credits); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + break; + } + + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, &map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + WARN_ON(!ret); + if (!ret) { + ext4_error(inode->i_sb, "ext4_fallocate", + "ext4_ext_get_blocks returned 0! 
inode#%lu" + ", block=%llu, max_blocks=%llu", + inode->i_ino, block, max_blocks); + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (ret > 0) { + /* check wrap through sign-bit/zero here */ + if ((block + ret) < 0 || (block + ret) < block) { + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (buffer_new(&map_bh) && ((block + ret) > + (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits) + >> blkbits))) + nblocks = nblocks + ret; + } + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + if (ret2) + break; + } + + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry; + + /* + * Time to update the file size. + * Update only when preallocation was requested beyond the file size. + */ + if ((offset + len) > i_size_read(inode)) { + if (ret > 0) { + /* + * if no error, we assume preallocation succeeded + * completely + */ + mutex_lock(&inode->i_mutex); + i_size_write(inode, offset + len); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } else if (ret < 0 && nblocks) { + /* Handle partial allocation scenario */ + loff_t newsize; + + mutex_lock(&inode->i_mutex); + newsize = (nblocks << blkbits) + i_size_read(inode); + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } + } + + return ret > 0 ? 
ret2 : ret; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); Index: linux-2.6.22-rc1/fs/ext4/file.c =================================================================== --- linux-2.6.22-rc1.orig/fs/ext4/file.c +++ linux-2.6.22-rc1/fs/ext4/file.c @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ .removexattr = generic_removexattr, #endif .permission = ext4_permission, + .fallocate = ext4_fallocate, }; Index: linux-2.6.22-rc1/include/linux/ext4_fs.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs.h +++ linux-2.6.22-rc1/include/linux/ext4_fs.h @@ -102,6 +102,7 @@ EXT4_GOOD_OLD_FIRST_INO : \ (s)->s_first_ino) #endif +#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size),(1 << (blkbits))) /* * Macro-instructions used to manage fragments @@ -225,6 +226,11 @@ struct ext4_new_group_data { __u32 free_blocks_count; }; +/* + * Following is used by preallocation code to tell get_blocks() that we + * want uninitialzed extents. 
+ */ +#define EXT4_CREATE_UNINITIALIZED_EXT 2 /* * ioctl commands @@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t extern void ext4_ext_truncate(struct inode *, struct page *); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); +extern int ext4_fallocate(struct inode *inode, int mode, loff_t offset, + loff_t len); static inline int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode * EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO; } +static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) { + ext->ee_len |= cpu_to_le16(0x8000); +} + +static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x8000); +} + +static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF); +} + extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Tue May 15 13:19:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 13:19:04 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FKIxfB027336 for ; Tue, 15 May 2007 13:19:00 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FKIxo4025434 for ; Tue, 15 May 
2007 16:18:59 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FKIw8G216732 for ; Tue, 15 May 2007 14:18:58 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FKIwJa006073 for ; Tue, 15 May 2007 14:18:58 -0600 Received: from amitarora.in.ibm.com ([9.126.238.191]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FKItCJ005844; Tue, 15 May 2007 14:18:56 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 0130E10CFFA; Wed, 16 May 2007 01:48:56 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4FKIt7g016988; Wed, 16 May 2007 01:48:55 +0530 Date: Wed, 16 May 2007 01:48:54 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 5/5][TAKE3] ext4: write support for preallocated blocks Message-ID: <20070515201854.GE5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515195421.GA2948@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11443 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Changelog: --------- Note: The changes below are from the initial post (dated 26th April, 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel version on which this patch is based. TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1. 1) Replaced BUG_ON with WARN_ON & ext4_error. 2) Added variable names to the function declaration of ext4_ext_try_to_merge(). 3) Updated variable declarations to use multiple-definitions-per-line. 4) "if((a=foo())).." was broken into "a=foo(); if(a).." 5) Removed extra spaces. Here is the updated patch: Signed-off-by: Amit Arora --- fs/ext4/extents.c | 234 +++++++++++++++++++++++++++++++++++----- include/linux/ext4_fs_extents.h | 3 2 files changed, 210 insertions(+), 27 deletions(-) Index: linux-2.6.22-rc1/fs/ext4/extents.c =================================================================== --- linux-2.6.22-rc1.orig/fs/ext4/extents.c +++ linux-2.6.22-rc1/fs/ext4/extents.c @@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode } /* + * This function tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. 
+ */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1375,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! 
*/ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). + * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0; + int ret = 0; + + depth = ext_depth(inode); + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex->ee_block); + ee_len = ext4_ext_get_actual_len(ex); + allocated = ee_len - (iblock - ee_block); + newblock = iblock - ee_block + ext_pblock(ex); + ex2 = ex; + + /* ex1: ee_block to iblock - 1 : uninitialized */ + if (iblock > ee_block) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - 
ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* for sanity, update the length of the ex2 extent before + * we insert ex3, if ex1 is NULL. This is to avoid temporary + * overlap of blocks. + */ + if (!ex1 && allocated > max_blocks) + ex2->ee_len = cpu_to_le16(max_blocks); + /* ex3: to ee_block + ee_len : uninitialised */ + if (allocated > max_blocks) { + unsigned int newdepth; + ex3 = &newex; + ex3->ee_block = cpu_to_le32(iblock + max_blocks); + ext4_ext_store_pblock(ex3, newblock + max_blocks); + ex3->ee_len = cpu_to_le16(allocated - max_blocks); + ext4_ext_mark_uninitialized(ex3); + err = ext4_ext_insert_extent(handle, inode, path, ex3); + if (err) + goto out; + /* The depth, and hence eh & ex might change + * as part of the insert above. + */ + newdepth = ext_depth(inode); + if (newdepth != depth) { + depth = newdepth; + path = ext4_ext_find_extent(inode, iblock, NULL); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + goto out; + } + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + if (ex2 != &newex) + ex2 = ex; + } + allocated = max_blocks; + } + /* If there was a change of depth as part of the + * insertion of ex3 above, we need to update the length + * of the ex1 extent again here + */ + if (ex1 && ex1 != ex) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* ex2: iblock to iblock + maxblocks-1 : initialised */ + ex2->ee_block = cpu_to_le32(iblock); + ex2->ee_start = cpu_to_le32(newblock); + ext4_ext_store_pblock(ex2, newblock); + ex2->ee_len = cpu_to_le16(allocated); + if (ex2 != ex) + goto insert; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* New (initialized) extent starts from the first block + * in the current extent. i.e., ex2 == ex + * We have to see if it can be merged with the extent + * on the left. 
+ */ + if (ex2 > EXT_FIRST_EXTENT(eh)) { + /* To merge left, pass "ex2 - 1" to try_to_merge(), + * since it merges towards right _only_. + */ + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + depth = ext_depth(inode); + ex2--; + } + } + /* Try to Merge towards right. This might be required + * only when the whole extent is being written to. + * i.e. ex2 == ex and ex3 == NULL. + */ + if (!ex3) { + ret = ext4_ext_try_to_merge(inode, path, ex2); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + } + } + /* Mark modified extent as dirty */ + err = ext4_ext_dirty(handle, inode, path + depth); + goto out; +insert: + err = ext4_ext_insert_extent(handle, inode, path, &newex); +out: + return err ? err : allocated; +} + int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, int create, int extend_disksize) { struct ext4_ext_path *path = NULL; + struct ext4_extent_header *eh; struct ext4_extent newex, *ex; ext4_fsblk_t goal, newblock; - int err = 0, depth; + int err = 0, depth, ret; unsigned long allocated = 0; __clear_bit(BH_New, &bh_result->b_state); @@ -2067,6 +2234,7 @@ int ext4_ext_get_blocks(handle_t *handle * this is why assert can't be put in ext4_ext_find_extent() */ BUG_ON(path[depth].p_ext == NULL && depth != 0); + eh = path[depth].p_hdr; ex = path[depth].p_ext; if (ex) { @@ -2075,13 +2243,9 @@ int ext4_ext_get_blocks(handle_t *handle unsigned short ee_len; /* - * Allow future support for preallocated extents to be added - * as an RO_COMPAT feature: * Uninitialized extents are treated as holes, except that - * we avoid (fail) allocating new blocks during a write. + * we split out initialized portions during a write. 
*/ - if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) - goto out2; ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { @@ -2090,12 +2254,27 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); + /* Do not put uninitialized extent in the cache */ - if (!ext4_ext_is_uninitialized(ex)) + if (!ext4_ext_is_uninitialized(ex)) { ext4_ext_put_in_cache(inode, ee_block, ee_len, ee_start, EXT4_EXT_CACHE_EXTENT); - goto out; + goto out; + } + if (create == EXT4_CREATE_UNINITIALIZED_EXT) + goto out; + if (!create) + goto out2; + + ret = ext4_ext_convert_to_initialized(handle, inode, + path, iblock, + max_blocks); + if (ret <= 0) + goto out2; + else + allocated = ret; + goto outnew; } } @@ -2147,6 +2326,7 @@ int ext4_ext_get_blocks(handle_t *handle /* previous routine could use block we allocated */ newblock = ext_pblock(&newex); +outnew: __set_bit(BH_New, &bh_result->b_state); /* Cache only when it is _not_ an uninitialized extent */ Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, 
ext_prepare_callback, void *); From owner-xfs@oss.sgi.com Tue May 15 16:52:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 16:52:39 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4FNqYfB009879 for ; Tue, 15 May 2007 16:52:35 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4FNqYLZ026296 for ; Tue, 15 May 2007 19:52:34 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4FNqYwW243090 for ; Tue, 15 May 2007 17:52:34 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4FNqXQE014839 for ; Tue, 15 May 2007 17:52:34 -0600 Received: from dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4FNqWDr014536; Tue, 15 May 2007 17:52:32 -0600 Subject: Re: [PATCH 0/5][TAKE3] fallocate system call From: Mingming Cao Reply-To: cmm@us.ibm.com To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070515193722.GA3487@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> Content-Type: text/plain Organization: IBM LTC Date: Tue, 15 May 2007 16:52:23 -0700 Message-Id: <1179273143.4819.15.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11444 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Wed, 2007-05-16 at 01:07 +0530, Amit K. Arora wrote: > ToDos: > ----- > 1> Implementation on other architectures (other than i386, x86_64, > ppc64 and s390(x)). David Chinner has already posted a patch for ia64. Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64 From: David Chinner Subject: [PATCH] ia64 fallocate syscall Cc: "Amit K. Arora" , akpm@linux-foundation.org, linux-ext4@vger.kernel.org, suparna@in.ibm.com, cmm@us.ibm.com ia64 fallocate syscall support. 
Signed-Off-By: Dave Chinner --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-12 18:45:56.000000000 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-15 15:36:48.000000000 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.000000000 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-15 15:37:51.000000000 -0700 @@ -296,6 +296,7 @@ #define __NR_getcpu 1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ From owner-xfs@oss.sgi.com Tue May 15 17:42:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 17:42:54 -0700 (PDT) Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.151]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4G0gnfB021519 for ; Tue, 15 May 2007 17:42:50 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e33.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4G0gnAC004917 for ; Tue, 15 May 2007 20:42:49 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4G0gnkM248582 for ; Tue, 15 May 2007 18:42:49 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4G0gmIL017394 for ; Tue, 15 May 2007 18:42:49 -0600 Received: from 
dyn9047017103.beaverton.ibm.com (dyn9047017103.beaverton.ibm.com [9.47.17.103]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4G0glMx017380; Tue, 15 May 2007 18:42:47 -0600 Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc From: Mingming Cao Reply-To: cmm@us.ibm.com To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070515200359.GA5834@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> Content-Type: text/plain Organization: IBM LTC Date: Tue, 15 May 2007 17:42:46 -0700 Message-Id: <1179276167.4819.22.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11445 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote: > This patch implements sys_fallocate() and adds support on i386, x86_64 > and powerpc platforms. 
> @@ -1137,6 +1148,8 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + long (*fallocate)(struct inode *inode, int mode, loff_t offset, > + loff_t len); > }; Does the return value from fallocate inode operation has to be *long*? It's not consistent with the ext4_fallocate() define in patch 4/5, +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) thus cause compile warnings. Mingming From owner-xfs@oss.sgi.com Tue May 15 20:03:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 20:03:19 -0700 (PDT) Received: from h1079484.serverkompetenz.net (ghostdub.de [81.169.157.63]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4G33EfB010782 for ; Tue, 15 May 2007 20:03:15 -0700 Received: from hsi-kbw-091-089-007-028.hsi2.kabelbw.de ([91.89.7.28] helo=[192.168.1.129]) by h1079484.serverkompetenz.net with esmtpsa (TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1Ho992-0007Wg-TS for xfs@oss.sgi.com; Wed, 16 May 2007 04:21:41 +0200 Message-ID: <464A6AAD.1020408@rakka.de> Date: Wed, 16 May 2007 04:21:33 +0200 From: outgoing@rakka.de Reply-To: xfs@rakka.de User-Agent: Icedove 1.5.0.10 (X11/20070329) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: OOPS on 2.6.21-rc6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11446 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: outgoing@rakka.de Precedence: bulk X-list: xfs I got an oops while copying from an ro ntfs to a fresh xfs filesystem. The cp process froze and was unkillable, some other processes also froze (even though / was ext3). After rebooting, recreating the xfs and starting the same copy process again, another oops occured about 2G earlier than in the previous copying. This time, nothing froze. 
Kernel version: 2.6.21-rc6 #2 SMP PREEMPT i686 (self-compiled with debian-lenny .config) a "zgrep 4KSTACKS /proc/config.gz" shows nothing. First oops: BUG: unable to handle kernel paging request at virtual address 0000a702 printing eip: c018559f *pde = 00000000 Oops: 0002 [#1] PREEMPT SMP Modules linked in: usb_storage fuse nls_iso8859_1 isofs udf snd_rtctimer xfs binfmt_misc radeon drm nfs nfsd exportfs lockd nfs_acl sunrpc ppdev lp button ac battery ipv6 nls_utf8 ntfs sha256 aes cbc blkcipher dm_crypt dm_snapshot dm_mirror sbp2 loop dm_mod snd_via82xx snd_mpu401_uart snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 firmware_class snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_util_mem snd_hwdep snd_seq_dummy snd_seq_midi snd_seq_oss snd_rawmidi snd_seq_midi_event via_ircc parport_pc parport analog snd_seq snd_timer snd_seq_device irda rtc pcspkr crc_ccitt i2c_viapro i2c_core snd emu10k1_gp gameport soundcore shpchp pci_hotplug via_agp agpgart eth1394 tsdev evdev ext3 jbd mbcache raid10 raid456 xor raid1 raid0 multipath linear md_mod ide_cd cdrom ide_disk ata_generic pata_via usbhid hid via82cxxx sd_mod floppy via_rhine mii generic ide_core ehci_hcd r8169 uhci_hcd usbcore ohci1394 ieee1394 sata_promise libata scsi_mod thermal processor fan CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010002 (2.6.21-rc6 #2) EIP is at end_buffer_async_write+0xbf/0x165 eax: f5722000 ebx: ca81af98 ecx: ca81af98 edx: 00000001 esi: 0000a702 edi: 00000202 ebp: c10cdbe8 esp: f5723f20 ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 Process xfsdatad/0 (pid: 4618, ti=f5722000 task=f57c0540 task.ti=f5722000) Stack: c0353fc8 c02a5d99 f5723f8c c02a5dc5 00000003 f5788be8 00000000 00000001 f5723f64 c011c55f c3645c18 d114e2b0 00000202 d114e2d8 f94ec45d f57c064c d114e2dc f5788bc0 c0130b6b 00000000 1bd31201 0000007b 00000202 f94ec4ea Call Trace: [] __sched_text_start+0x689/0x738 [] __sched_text_start+0x6b5/0x738 [] __wake_up+0x32/0x43 [] 
xfs_destroy_ioend+0x1d/0x56 [xfs] [] run_workqueue+0x85/0x125 [] xfs_end_bio_delalloc+0x0/0x8 [xfs] [] worker_thread+0xf9/0x124 [] default_wake_function+0x0/0xc [] worker_thread+0x0/0x124 [] kthread+0xb2/0xdd [] kthread+0x0/0xdd [] kernel_thread_helper+0x7/0x10 ======================= Code: eb 28 89 e0 25 00 e0 ff ff ff 48 14 8b 40 08 a8 08 74 05 e8 6b 09 12 00 f3 90 89 e0 25 00 e0 ff ff ff 40 14 8b 06 a8 10 75 d8 90 <0f> ba 2e 04 19 c0 85 c0 75 ef 90 0f ba 33 08 89 d8 e8 ea ec ff EIP: [] end_buffer_async_write+0xbf/0x165 SS:ESP 0068:f5723f20 note: xfsdatad/0[4618] exited with preempt_count 1 Second oops: BUG: unable to handle kernel paging request at virtual address 287d423e printing eip: f8f22e9f *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: binfmt_misc radeon drm nfs nfsd exportfs lockd nfs_acl sunrpc ppdev lp button ac battery ipv6 xfs nls_utf8 ntfs fuse sha256 aes cbc blkcipher dm_crypt dm_snapshot dm_mirror sbp2 loop dm_mod snd_via82xx snd_mpu401_uart snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 firmware_class snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_seq_midi_event via_ircc snd_seq irda analog crc_ccitt parport_pc parport i2c_viapro snd_timer snd_rawmidi snd_seq_device pcspkr i2c_core snd via_agp agpgart rtc emu10k1_gp gameport soundcore shpchp pci_hotplug tsdev evdev eth1394 ext3 jbd mbcache raid10 raid456 xor raid1 raid0 multipath linear md_mod ide_cd cdrom ide_disk ata_generic pata_via usbhid hid via82cxxx sd_mod floppy via_rhine mii ehci_hcd generic ide_core uhci_hcd usbcore r8169 ohci1394 ieee1394 sata_promise libata scsi_mod thermal processor fan CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010203 (2.6.21-rc6 #2) EIP is at xfs_count_page_state+0x27/0x62 [xfs] eax: 2133c724 ebx: ec1bc128 ecx: c1b5de08 edx: 287d423e esi: c1b5de04 edi: c1b5de0c ebp: eec39400 esp: c1b5ddc4 ds: 007b es: 007b fs: 00d8 gs: 0000 
ss: 0068 Process kswapd0 (pid: 131, ti=c1b5c000 task=c1af0580 task.ti=c1b5c000) Stack: c11677a0 000000d0 c1b5de04 f8f23d06 c1b5de04 ef946c50 00000000 00000001 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000 00000000 f8f23cbb 000000d0 00000001 c032cac0 c014f9cc Call Trace: [] xfs_vm_releasepage+0x4b/0xa5 [xfs] [] xfs_vm_releasepage+0x0/0xa5 [xfs] [] try_to_release_page+0x30/0x42 [] shrink_inactive_list+0x459/0x747 [] try_to_wake_up+0x3a1/0x3ab [] __sched_text_start+0x689/0x738 [] __sched_text_start+0x6b5/0x738 [] nfs_access_cache_shrinker+0x137/0x182 [nfs] [] shrink_zone+0xcd/0xf2 [] kswapd+0x2be/0x3ef [] autoremove_wake_function+0x0/0x35 [] kswapd+0x0/0x3ef [] kthread+0xb2/0xdd [] kthread+0x0/0xdd [] kernel_thread_helper+0x7/0x10 ======================= Code: 5f 89 d0 c3 57 89 d7 56 53 8b 74 24 10 89 c3 31 c0 c7 06 00 00 00 00 89 01 89 02 8b 03 f6 c4 08 75 04 0f 0b eb fe 8b 5b 0c 89 da <8b> 02 a8 01 74 0e 8b 02 a8 20 75 08 c7 01 01 00 00 00 eb 1c 8b EIP: [] xfs_count_page_state+0x27/0x62 [xfs] SS:ESP 0068:c1b5ddc4 - Paul Goeser From owner-xfs@oss.sgi.com Tue May 15 20:16:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 20:16:59 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4G3GsfB013432 for ; Tue, 15 May 2007 20:16:55 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA28361; Wed, 16 May 2007 13:16:37 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4G3GWAf94612834; Wed, 16 May 2007 13:16:33 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4G3GQh894804723; Wed, 16 May 2007 13:16:26 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set 
sender to dgc@sgi.com using -f Date: Wed, 16 May 2007 13:16:26 +1000 From: David Chinner To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070516031626.GM85884050@sgi.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070515200359.GA5834@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11447 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > This patch implements sys_fallocate() and adds support on i386, x86_64 > and powerpc platforms. Can you please pick up the ia64 support patch I posted as well? > Changelog: > --------- > Note: The changes below are from the initial post (dated 26th April, > 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel > version on which this patch is based. TAKE2 was based on 2.6.21 and this > is based on 2.6.22-rc1. > > Following changes were made to the previous version: > 1) Added description before sys_fallocate() definition. > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > posix_fallocate should return EINVAL for len <= 0. 
> 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > 4) Do not return ENODEV for dirs (let individual file systems decide if > they want to support preallocation to directories or not. > 5) Check for wrap through zero. > 6) Update c/mtime if fallocate() succeeds. Please don't make this always happen. c/mtime updates should be dependent on the mode being used and whether there is visible change to the file. If no userspace visible changes to the file occurred, then timestamps should not be changed. e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate() extending the file, otherwise no change in timestamps should occur. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 15 20:29:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 20:29:10 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4G3T4fB019177 for ; Tue, 15 May 2007 20:29:06 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA28681; Wed, 16 May 2007 13:28:57 +1000 Date: Wed, 16 May 2007 13:32:28 +1000 To: linux-xfs@ml.epigenomics.com, xfs@oss.sgi.com Subject: Re: xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory From: "Barry Naujok" Organization: SGI Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 MIME-Version: 1.0 References: Message-ID: In-Reply-To: User-Agent: Opera Mail/9.10 (Win32) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from Quoted-Printable to 8bit by oss.sgi.com id l4G3T7fB019196 X-archive-position: 11448 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs On Tue, 15 May 2007 
21:48:30 +1000, wrote: > Hi! > > We have a RAID0 set of 3 400GB disks. > > After a crash we needed to run xfs_repair, but it bails out with the > error message: > > - ensuring existence of lost+found directory > - traversing filesystem starting at / ... > xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory > > The filesystem contains many hardlinked files as it is a dirvish > repository (www.dirvish.org) with the hardlinks created by rsync. > > This is the xfs_db info: > > # xfs_db -r -c "sb 0" -c "p" /dev/md0 > magicnum = 0x58465342 > blocksize = 4096 > dblocks = 293031424 > rblocks = 0 > rextents = 0 > uuid = e8d3a22c-716f-4f3e-9e95-e06afb3559d0 > logstart = 268435472 > rootino = 256 > rbmino = 257 > rsumino = 258 > rextsize = 48 > agblocks = 9157232 > agcount = 32 > rbmblocks = 0 > logblocks = 32768 > versionnum = 0x3184 > sectsize = 512 > inodesize = 256 > inopblock = 16 > fname = "\000\000\000\000\000\000\000\000\000\000\000\000" > blocklog = 12 > sectlog = 9 > inodelog = 8 > inopblog = 4 > agblklog = 24 > rextslog = 0 > inprogress = 0 > imax_pct = 25 > icount = 18882496 > ifree = 373596 > fdblocks = 27494887 > frextents = 0 > uquotino = 0 > gquotino = 0 > qflags = 0 > flags = 0 > shared_vn = 0 > inoalignmt = 2 > unit = 16 > width = 48 > dirblklog = 0 > logsectlog = 0 > logsectsize = 0 > logsunit = 0 > features2 = 0 > > Kernel is 2.6.20.6 on a dual PIII machine with 1GB RAM and 10GB swap. > > Mounting the filesystem is possible, but what about its current state? > > Greetings With 18 million inodes, 1.2TB of space, it will use a lot of memory. Assuming you are using a recent xfsprogs (2.8.11 or later), you can reduce memory usage by adding "-o bhash=512" which will limit the number of buffers cached in xfs_repair. In this scenario, the cache will overflow RAM rather easily with the number of inodes to be scanned. Regards, Barry. 
From owner-xfs@oss.sgi.com Tue May 15 21:03:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 15 May 2007 21:03:21 -0700 (PDT) Received: from h1079484.serverkompetenz.net (ghostdub.de [81.169.157.63]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4G43EfB024631 for ; Tue, 15 May 2007 21:03:15 -0700 Received: from hsi-kbw-091-089-007-028.hsi2.kabelbw.de ([91.89.7.28] helo=[192.168.1.112]) by h1079484.serverkompetenz.net with esmtpsa (TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1HoAGr-0007ft-F2 for xfs@oss.sgi.com; Wed, 16 May 2007 05:33:49 +0200 Message-ID: <464A7B93.5000704@rakka.de> Date: Wed, 16 May 2007 05:33:39 +0200 From: =?ISO-8859-1?Q?Paul_G=F6ser?= User-Agent: Thunderbird 1.5.0.10 (X11/20070315) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: Re: OOPS on 2.6.21-rc6 [Solved] References: <464A6AAD.1020408@rakka.de> In-Reply-To: <464A6AAD.1020408@rakka.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11449 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paul@rakka.de Precedence: bulk X-list: xfs Running memtest86 showed some rare errors with specific patterns after running for quite a while. So it wasn't xfs after all. Sorry about that. 
- Paul Goeser From owner-xfs@oss.sgi.com Wed May 16 05:31:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 05:31:39 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4GCVWfB031927 for ; Wed, 16 May 2007 05:31:34 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l4GCRqMn032279 for ; Wed, 16 May 2007 08:27:52 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4GCVSD5155834 for ; Wed, 16 May 2007 06:31:28 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4GCVR9a011576 for ; Wed, 16 May 2007 06:31:28 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4GCVQIl011041; Wed, 16 May 2007 06:31:27 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 3B85618BCEE; Wed, 16 May 2007 18:01:27 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4GCVOuo026912; Wed, 16 May 2007 18:01:24 +0530 Date: Wed, 16 May 2007 18:01:24 +0530 From: "Amit K. 
Arora" To: Mingming Cao Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070516123124.GA24299@amitarora.in.ibm.com> References: <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <1179276167.4819.22.camel@dyn9047017103.beaverton.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1179276167.4819.22.camel@dyn9047017103.beaverton.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11450 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Tue, May 15, 2007 at 05:42:46PM -0700, Mingming Cao wrote: > On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote: > > This patch implements sys_fallocate() and adds support on i386, x86_64 > > and powerpc platforms. > > > @@ -1137,6 +1148,8 @@ struct inode_operations { > > ssize_t (*listxattr) (struct dentry *, char *, size_t); > > int (*removexattr) (struct dentry *, const char *); > > void (*truncate_range)(struct inode *, loff_t, loff_t); > > + long (*fallocate)(struct inode *inode, int mode, loff_t offset, > > + loff_t len); > > }; > > Does the return value from fallocate inode operation has to be *long*? 
> It's not consistent with the ext4_fallocate() define in patch 4/5, I think ->fallocate() should return a "long", since sys_fallocate() has to return what ->fallocate() returns and hence their return type should ideally match. > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t > len) I will change the ext4_fallocate() to return a "long" (in patch 4/5) in the next post. Agree ? Thanks! -- Regards, Amit Arora > > thus cause compile warnings. > > > > Mingming From owner-xfs@oss.sgi.com Wed May 16 05:37:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 05:37:04 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4GCb0fB000522 for ; Wed, 16 May 2007 05:37:01 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4GCaxat007565 for ; Wed, 16 May 2007 08:36:59 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4GCaxWJ238944 for ; Wed, 16 May 2007 06:36:59 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4GCawIi012465 for ; Wed, 16 May 2007 06:36:59 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4GCavsp012422; Wed, 16 May 2007 06:36:58 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 99DD418BCEE; Wed, 16 May 2007 18:07:05 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4GCb576029218; Wed, 16 May 2007 18:07:05 +0530 Date: Wed, 16 May 2007 18:07:05 +0530 From: "Amit K. 
Arora" To: Dave Kleikamp Cc: David Chinner , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070516123705.GB24299@amitarora.in.ibm.com> References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <20070516031626.GM85884050@sgi.com> <1179318076.10313.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1179318076.10313.6.camel@kleikamp.austin.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11451 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > > > > Following changes were made to the previous version: > > > 1) Added description before sys_fallocate() definition. > > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > > > posix_fallocate should return EINVAL for len <= 0. > > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > > > 4) Do not return ENODEV for dirs (let individual file systems decide if > > > they want to support preallocation to directories or not. > > > 5) Check for wrap through zero. > > > 6) Update c/mtime if fallocate() succeeds. 
> > > > Please don't make this always happen. c/mtime updates should be dependent > > on the mode being used and whether there is visible change to the file. If no > > userspace visible changes to the file occurred, then timestamps should not > > be changed. > > i_blocks will be updated, so it seems reasonable to update ctime. mtime > shouldn't be changed, though, since the contents of the file will be > unchanged. I agree. Thus the ctime should change for FA_PREALLOCATE mode also (which does not change the file size) - if we end up having this additional mode in near future. -- Regards, Amit Arora > > e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate() > > extending the file, otherwise no change in timestamps should occur. > > > > Cheers, > > > > Dave. > -- > David Kleikamp > IBM Linux Technology Center From owner-xfs@oss.sgi.com Wed May 16 05:51:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 05:51:58 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4GCprfB003674 for ; Wed, 16 May 2007 05:51:54 -0700 Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.151]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4GCLNIk015559 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 16 May 2007 08:21:24 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e33.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4GCLLcA031457 for ; Wed, 16 May 2007 08:21:21 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4GCLJ02269472 for ; Wed, 16 May 2007 06:21:21 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4GCLJmm004801 for ; Wed, 16 May 2007 06:21:19 -0600 Received: 
from [9.67.144.126] (wecm-9-67-144-126.wecm.ibm.com [9.67.144.126]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4GCLHWB004693; Wed, 16 May 2007 06:21:17 -0600 Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc From: Dave Kleikamp To: David Chinner Cc: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070516031626.GM85884050@sgi.com> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <20070516031626.GM85884050@sgi.com> Content-Type: text/plain Date: Wed, 16 May 2007 07:21:16 -0500 Message-Id: <1179318076.10313.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11452 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > > Following changes were made to the previous version: > > 1) Added description before sys_fallocate() definition. > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > > posix_fallocate should return EINVAL for len <= 0. 
> > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > > 4) Do not return ENODEV for dirs (let individual file systems decide if > > they want to support preallocation to directories or not. > > 5) Check for wrap through zero. > > 6) Update c/mtime if fallocate() succeeds. > > Please don't make this always happen. c/mtime updates should be dependent > on the mode being used and whether there is visible change to the file. If no > userspace visible changes to the file occurred, then timestamps should not > be changed. i_blocks will be updated, so it seems reasonable to update ctime. mtime shouldn't be changed, though, since the contents of the file will be unchanged. > e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate() > extending the file, otherwise no change in timestamps should occur. > > Cheers, > > Dave. -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Wed May 16 09:05:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 09:05:08 -0700 (PDT) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4GG52fB013061 for ; Wed, 16 May 2007 09:05:04 -0700 Received: from root by ciao.gmane.org with local (Exim 4.43) id 1HoL3m-0003Ql-FP for linux-xfs@oss.sgi.com; Wed, 16 May 2007 17:05:02 +0200 Received: from ns1.q-leap.de ([153.94.51.193]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 16 May 2007 17:05:02 +0200 Received: from bschubert by ns1.q-leap.de with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 16 May 2007 17:05:02 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: linux-xfs@oss.sgi.com From: Bernd Schubert Subject: possible recursive locking detected Date: Wed, 16 May 2007 16:50:08 +0200 Organization: q-leap networks GmbH Lines: 54 Message-ID: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7Bit X-Complaints-To: 
usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: ns1.q-leap.de User-Agent: KNode/0.10.4 X-archive-position: 11453 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bschubert@q-leap.de Precedence: bulk X-list: xfs with 2.6.20 and almost all debugging options I get this: [ 293.840172] ============================================= [ 293.847880] [ INFO: possible recursive locking detected ] [ 293.853862] 2.6.20.3-debug #11 [ 293.857288] --------------------------------------------- [ 293.863243] dd/6202 is trying to acquire lock: [ 293.868192] (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x56/0x7a [xfs] [ 293.878200] [ 293.878201] but task is already holding lock: [ 293.884788] (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x56/0x7a [xfs] [ 293.894802] [ 293.894803] other info that might help us debug this: [ 293.902116] 2 locks held by dd/6202: [ 293.906114] #0: (&inode->i_mutex){--..}, at: [] mutex_lock+0x23/0x27 [ 293.915438] #1: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x56/0x7a [xfs] [ 293.925977] [ 293.925978] stack backtrace: [ 293.930948] [ 293.930949] Call Trace: [ 293.935457] [] __lock_acquire+0x44d/0xc60 [ 293.941707] [] mark_held_locks+0x5a/0x71 [ 293.956483] [] :xfs:xfs_ilock+0x56/0x7a [ 293.962537] [] lock_acquire+0x7c/0xa0 [ 293.968447] [] :xfs:xfs_ilock+0x56/0x7a [ 293.974510] [] down_write+0x33/0x3f [ 293.980236] [] :xfs:xfs_ilock+0x56/0x7a [ 293.986315] [] :xfs:xfs_iget+0x43c/0x7a8 [ 293.992509] [] :xfs:xfs_trans_iget+0xa9/0x115 [ 293.999177] [] :xfs:xfs_ialloc+0x91/0x453 [ 294.005475] [] :xfs:xfs_dir_ialloc+0x74/0x286 [ 294.012145] [] :xfs:xfs_create+0x347/0x626 [ 294.018532] [] :xfs:xfs_vn_mknod+0x1e2/0x432 [ 294.025077] [] up_read+0x24/0x28 [ 294.030501] [] :xfs:xfs_iunlock+0x74/0x79 [ 294.036788] [] :xfs:xfs_access+0x43/0x4e [ 294.042943] [] __lock_acquire+0xc08/0xc60 [ 294.049200] [] _spin_unlock_irqrestore+0x3f/0x47 [ 294.056083] [] mark_held_locks+0x5a/0x71 [ 
294.062239] [] _spin_unlock_irqrestore+0x3f/0x47 [ 294.069138] [] trace_hardirqs_on+0x129/0x154 [ 294.075690] [] :xfs:xfs_access+0x43/0x4e [ 294.081883] [] :xfs:xfs_vn_create+0xb/0xd [ 294.088139] [] vfs_create+0xb7/0xfb [ 294.093826] [] open_namei+0x1d1/0x6d3 [ 294.099717] [] __lock_acquire+0xc08/0xc60 [ 294.105970] [] do_filp_open+0x5b/0x7e [ 294.111851] [] do_sys_open+0x4d/0xd4 [ 294.117636] [] sys_open+0x1b/0x1d [ 294.123128] [] system_call+0x7e/0x83 [ 294.128913] Thanks, Bernd From owner-xfs@oss.sgi.com Wed May 16 10:35:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 10:35:41 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4GHZafB030708 for ; Wed, 16 May 2007 10:35:37 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 10F771807A959; Wed, 16 May 2007 12:35:34 -0500 (CDT) Message-ID: <464B40E6.1060507@sandeen.net> Date: Wed, 16 May 2007 12:35:34 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Bernd Schubert CC: linux-xfs@oss.sgi.com Subject: Re: possible recursive locking detected References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11454 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Bernd Schubert wrote: > with 2.6.20 and almost all debugging options I get this: > > [ 293.840172] ============================================= > [ 293.847880] [ INFO: possible recursive locking detected ] > [ 293.853862] 2.6.20.3-debug #11 > [ 293.857288] --------------------------------------------- Many of these have been addressed with this patch: 
http://oss.sgi.com/archives/xfs/2007-04/msg00177.html -Eric From owner-xfs@oss.sgi.com Wed May 16 16:41:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 16:41:09 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4GNf3fB016330 for ; Wed, 16 May 2007 16:41:05 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA29463; Thu, 17 May 2007 09:40:47 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4GNegAf95521412; Thu, 17 May 2007 09:40:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4GNeaWK90042333; Thu, 17 May 2007 09:40:36 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 17 May 2007 09:40:36 +1000 From: David Chinner To: Dave Kleikamp Cc: David Chinner , "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070516234036.GQ85884050@sgi.com> References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <20070516031626.GM85884050@sgi.com> <1179318076.10313.6.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1179318076.10313.6.camel@kleikamp.austin.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11455 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > > > > Following changes were made to the previous version: > > > 1) Added description before sys_fallocate() definition. > > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > > > posix_fallocate should return EINVAL for len <= 0. > > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > > > 4) Do not return ENODEV for dirs (let individual file systems decide if > > > they want to support preallocation to directories or not. > > > 5) Check for wrap through zero. > > > 6) Update c/mtime if fallocate() succeeds. > > > > Please don't make this always happen. 
c/mtime updates should be dependent > > on the mode being used and whether there is visible change to the file. If no > > userspace visible changes to the file occurred, then timestamps should not > > be changed. > > i_blocks will be updated, so it seems reasonable to update ctime. mtime > shouldn't be changed, though, since the contents of the file will be > unchanged. That's assuming blocks were actually allocated - if the prealloc range already has underlying blocks there is no change and so we should not be changing mtime either. Only the filesystem will know if it has changed the file, so I think that timestamp updates need to be driven down to that level, not done blindly at the highest layer.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 16 19:45:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 16 May 2007 19:45:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4H2jIfB026826 for ; Wed, 16 May 2007 19:45:21 -0700 Received: from cxfsmac10.melbourne.sgi.com (cxfsmac10.melbourne.sgi.com [134.14.55.100]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA03426; Thu, 17 May 2007 12:24:14 +1000 Message-ID: <464BBCCE.8080500@sgi.com> Date: Thu, 17 May 2007 12:24:14 +1000 From: Donald Douwsma User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: David Chinner CC: xfs-dev , xfs-oss Subject: Re: Review: XFSQA: unwritten extent conversion vs synchronous direct I/O References: <20070508065327.GL32602149@melbourne.sgi.com> In-Reply-To: <20070508065327.GL32602149@melbourne.sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11456 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: donaldd@sgi.com Precedence:
bulk X-list: xfs David Chinner wrote: > Test to exercise synchronous direct I/O into unwritten extents. > > Cheers, > > Dave. Would we ever want to adjust the IO_SIZE used in unwritten_sync from the qa script? A couple of small changes to fix compiler warnings and provide info on dio size errors. Otherwise looks good, Don Index: xfs-cmds/xfstests/src/unwritten_sync.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 -+++ xfs-cmds/xfstests/src/unwritten_sync.c 2007-05-07 11:44:38.668980258 +1000 -@@ -0,0 +1,167 @@ ++++ xfs-cmds/xfstests/src/unwritten_sync.c 2007-05-17 12:08:21.242781793 +1000 +@@ -0,0 +1,153 @@ +#include +#include +#include @@ -139,7 +139,9 @@ + } + + if ((dio.d_miniosz > IO_SIZE) || (dio.d_maxiosz < IO_SIZE)) { -+ fprintf(stderr,"Test won't work. Sorry\n"); ++ fprintf(stderr,"Test won't work, iosize out of range \ ++ (dio.d_miniosz=%d, dio.d_maxiosz=%d)\n", ++ dio.d_miniosz, dio.d_maxiosz); + exit(1); + } + buf = (char *)memalign(dio.d_mem , IO_SIZE); @@ -174,23 +176,7 @@ + print_getbmapx(file, fd, 0, 0); + close(fd); + } -+} -+ -+ -+ -+int -+get_getbmapx( -+ const char *pathname, -+ int fd, -+ struct getbmapx *bmapx) -+{ -+ int rc; -+ -+ rc = ioctl(fd, XFS_IOC_GETBMAPX, bmapx); -+ if (rc < 0) { -+ perror("xfs_ioc_getbmapx"); -+ exit(1); -+ } ++ return 0; +} + +void @@ -223,8 +209,8 @@ + if (x != array_size) { + break; /* end of file */ + } -+ if (get_getbmapx(pathname, fd, bmapx) < 0) { -+ fprintf(stderr, "getbmapx failed\n"); ++ if (xfsctl(pathname, fd, XFS_IOC_GETBMAPX, bmapx) < 0) { ++ fprintf(stderr, "XFS_IOC_GETBMAPX failed\n"); + exit(1); + } + if (bmapx[0].bmv_entries == 0) { From owner-xfs@oss.sgi.com Thu May 17 00:40:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 00:40:14 -0700 (PDT) Received: from molten.melbourne.sgi.com (molten.melbourne.sgi.com [134.14.54.33]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4H7e8fB006124 for ; Thu, 
17 May 2007 00:40:11 -0700 Received: by molten.melbourne.sgi.com (Postfix, from userid 16365) id 4B4A228377B2; Thu, 17 May 2007 17:25:56 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: TAKE 957794 - xfs_quota: project commands handle symlinks and special files badly. Message-Id: <20070517072556.4B4A228377B2@molten.melbourne.sgi.com> Date: Thu, 17 May 2007 17:25:56 +1000 (EST) From: donaldd@molten.melbourne.sgi.com X-archive-position: 11457 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: donaldd@molten.melbourne.sgi.com Precedence: bulk X-list: xfs Stop the project quota commands in xfs_quota from operating on special files: symlinks, device nodes, sockets and fifos. Date: Thu May 17 17:12:57 AEST 2007 Workarea: molten.melbourne.sgi.com:/home/donaldd/isms/xfs-cmds Inspected by: tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28597a xfsprogs/doc/CHANGES - 1.238 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/doc/CHANGES.diff?r1=text&tr1=1.238&r2=text&tr2=1.237&f=h xfsprogs/quota/project.c - 1.5 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/quota/project.c.diff?r1=text&tr1=1.5&r2=text&tr2=1.4&f=h From owner-xfs@oss.sgi.com Thu May 17 01:01:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 01:01:52 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4H81kfB011744 for ; Thu, 17 May 2007 01:01:48 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA10133; Thu, 17 May 2007 17:36:47 +1000 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id
l4H7akAf95706510; Thu, 17 May 2007 17:36:47 +1000 (AEST) Date: Thu, 17 May 2007 17:36:46 +1000 From: David Disseldorp To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: Review: fix b0rked test 030 behaviour. Message-ID: <20070517173646.0000116c@snort.melbourne.sgi.com> In-Reply-To: <20070511011145.GN86004887@sgi.com> References: <20070511011145.GN86004887@sgi.com> X-Mailer: Sylpheed-Claws 1.0.3 (GTK+ 1.2.10; mips-sgi-irix6.5) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l4H81ofB011758 X-archive-position: 11458 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ddiss@sgi.com Precedence: bulk X-list: xfs Change looks good Dave, test 148 (xfs_prepair64 version of 030) will need the same .out file change. Cheers, David D Index: xfs-cmds/xfstests/148.out =================================================================== --- xfs-cmds.orig/xfstests/148.out 2007-05-03 17:10:29.554803451 +1000 +++ xfs-cmds/xfstests/148.out 2007-05-03 17:10:54.231584667 +1000 @@ -270,6 +270,10 @@ Phase 1 - find and verify superblock... Phase 2 - using log - zero log... - scan filesystem freespace and inode maps... +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 +bad agbno AGBNO in agfl, agno 0 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... On Fri, 11 May 2007 11:11:45 +1000 David Chinner wrote: > Test 030 is not testing things as it should. > > Specifically, corrupting the AGFL with "-1" is a no-op on > a freshly repaired filesystem, because xfs_repair rebuilds > the AGF btrees and AGFL from scratch and does not populate > the AGFL. > > The current test does: > > repair > mount > create file > remove file > umount > > And it does the filesystem twiddling to check that the filesystem > is usable after repair.
The problem is that this doesn't dirty the > filesystem - the create is followed by a remove, so nothing is > actually allocated and so the AGFL lists do not get modified. > > hence after a repair/check/corruption cycle, writing "-1" to > the AGFL is a no-op because it is already full of "-1" fields > (NULL blocks). > > With filestreams, the create/remove pair *does* modify the filesystem > and so when we write "-1" to the AGFL, we get different output > because the filesystem detects new corruptions and the test "fails". > > So, to make behaviour consistent, dirty the filesystem before > corrupting it on each cycle. Hence it doesn't matter if we are > using filestreams or not, we'll really test out corrupting the > AGFL with NULL blocks (-1) now. > > Comments? > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > > > --- > xfstests/030.out.irix | 4 ++++ > xfstests/030.out.linux | 4 ++++ > xfstests/common.repair | 10 ++++++++++ > 3 files changed, 18 insertions(+) > > Index: xfs-cmds/xfstests/030.out.irix > =================================================================== > --- xfs-cmds.orig/xfstests/030.out.irix 2007-05-03 17:10:29.554803451 +1000 > +++ xfs-cmds/xfstests/030.out.irix 2007-05-03 17:10:54.227585189 +1000 > @@ -262,6 +262,10 @@ Wrote X.XXKb (value 0xffffffff) > Phase 1 - find and verify superblock... > Phase 2 - zero log... > - scan filesystem freespace and inode maps... > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > - found root inode chunk > Phase 3 - for each AG... > - scan and clear agi unlinked lists... 
> Index: xfs-cmds/xfstests/030.out.linux > =================================================================== > --- xfs-cmds.orig/xfstests/030.out.linux 2007-05-03 17:10:29.554803451 +1000 > +++ xfs-cmds/xfstests/030.out.linux 2007-05-03 17:10:54.231584667 +1000 > @@ -270,6 +270,10 @@ Phase 1 - find and verify superblock... > Phase 2 - using log > - zero log... > - scan filesystem freespace and inode maps... > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > +bad agbno AGBNO in agfl, agno 0 > - found root inode chunk > Phase 3 - for each AG... > - scan and clear agi unlinked lists... > Index: xfs-cmds/xfstests/common.repair > =================================================================== > --- xfs-cmds.orig/xfstests/common.repair 2007-05-03 17:10:29.554803451 +1000 > +++ xfs-cmds/xfstests/common.repair 2007-05-03 17:10:54.231584667 +1000 > @@ -72,8 +72,18 @@ _check_repair() > { > value=$1 > structure="$2" > + > + #ensure the filesystem has been dirtied since last repair > + _scratch_mount > + POSIXLY_CORRECT=yes \ > + dd if=/bin/sh of=$SCRATCH_MNT/sh 2>&1 |_filter_dd > + sync > + rm -f $SCRATCH_MNT/sh > + umount $SCRATCH_MNT > + > _zero_position $value "$structure" > _scratch_xfs_repair 2>&1 | _filter_repair > + > # some basic sanity checks... 
> _check_scratch_fs > _scratch_mount #mount > From owner-xfs@oss.sgi.com Thu May 17 01:17:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 01:17:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4H8HkfB016098 for ; Thu, 17 May 2007 01:17:48 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA10922; Thu, 17 May 2007 18:17:42 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16378) id AF8B958CA531; Thu, 17 May 2007 18:17:42 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com, asgqa@sgi.com Subject: TAKE 960160 960161 960162 960165 960166 960377 - mkfs arg conflicts Message-Id: <20070517081742.AF8B958CA531@chook.melbourne.sgi.com> Date: Thu, 17 May 2007 18:17:42 +1000 (EST) From: ddiss@sgi.com (David Mark Disseldorp) X-archive-position: 11459 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ddiss@sgi.com Precedence: bulk X-list: xfs Currently there are a few ways mkfs options are specified in XFSQA: 1) suite wide MKFS_OPTIONS are specified before tests are run e.g. vimes:/home/fsgqa/kali/xfsqa/xfstests # export MKFS_OPTIONS="-l size=100m" 2) test wide MKFS_OPTIONS are specified during a particular test e.g. 119 export MKFS_OPTIONS="-l version=2,size=1200b,su=64k" 3) mkfs options are appended by a test at mkfs time e.g. 083 _scratch_mkfs_xfs -dsize=$fsz,agcount=$ags >>$seq.full (4) Another form of appending is just appending to MKFS_OPTIONS directly e.g. 114:export MKFS_OPTIONS="$MKFS_OPTIONS -i parent=1" 115:export MKFS_OPTIONS="$MKFS_OPTIONS -i paths=1" Conflicts between mkfs options specified with method 1 & 3 are common. 960377 XFSQA 041, 042 - mkfs fails with large log size MKFS_OPTIONS. 
This change means that if a mkfs fails where mkfs options have been appended by a test (method 3), the mkfs is retried using only those options defined by the test. Occurrences of method 4 (in tests 114 & 115) are also changed to use method 3. $seq.full logs the fact that a mkfs options conflict has occurred. Date: Thu May 17 18:15:46 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/ddiss/xfs-cmds Inspected by: tes The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28601a xfstests/common.rc - 1.65 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/common.rc.diff?r1=text&tr1=1.65&r2=text&tr2=1.64&f=h - if there are conflicts between global and test specified mkfs options, then use only the test specified ones. xfstests/114 - 1.7 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/114.diff?r1=text&tr1=1.7&r2=text&tr2=1.6&f=h - append -i parent mkfs option using _scratch_mkfs_xfs -i parent... rather than export MKFS_OPTIONS="$MKFS_OPTIONS -i parent=1". this avoids mkfs option conflicts xfstests/115 - 1.4 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/115.diff?r1=text&tr1=1.4&r2=text&tr2=1.3&f=h - append -i paths mkfs option using _scratch_mkfs_xfs -i paths... rather than export MKFS_OPTIONS="$MKFS_OPTIONS -i paths=1". 
this avoids mkfs option conflicts From owner-xfs@oss.sgi.com Thu May 17 01:27:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 01:27:22 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4H8RIfB022703 for ; Thu, 17 May 2007 01:27:19 -0700 Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.192]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l4H8RCBY007181 for ; Thu, 17 May 2007 17:27:12 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l4H8RCi14717 for xfs@oss.sgi.com; Thu, 17 May 2007 17:27:12 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv3.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l4H8RBL22867 for ; Thu, 17 May 2007 17:27:11 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070517.172726.40600716 for ; Thu, 17 May 2007 17:27:26 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Thu May 17 17:27:26 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id E254DAE4B5; Thu, 17 May 2007 17:27:08 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l4H8RA83015270; Thu, 17 May 2007 17:27:10 +0900 Message-Id: <200705170827.AA05393@TNESG9305.tnes.nec.co.jp> Date: Thu, 17 May 2007 17:27:05 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix free command in xfs_quota. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11460 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, I have got the following error when doing free command in xfs_quota. The free command tries to get the quota status though the project quota is off. The man page for free command says: if project quota are in use, it will also report utilisation for those projects. So this patch checks if the file system is mounted with the quota option and if the project quota is on. Example: # mount -t xfs -o prjquota /dev/sda6 mpnt/ # xfs_quota -c "free" /dev/sda6 Filesystem 1K-blocks Used Available Use% Pathname /dev/sda6 10480128 4196084 6284044 41% /home/utako/mpnt /dev/sda6 10480128 104 6284044 1% /home/utako/mpnt/pjq /dev/sda6 10480128 0 6284044 0% /home/utako/mpnt/pjq2 # umount mpnt/ # mount -t xfs -o usrquota /dev/sda6 mpnt/ # xfs_quota -c "free" /dev/sda6 Filesystem 1K-blocks Used Available Use% Pathname /dev/sda6 10480128 4196084 6284044 41% /home/utako/mpnt XFS_GETQUOTA: No such process XFS_GETQUOTA: No such process # umount mpnt/ # mount -t xfs /dev/sda6 mpnt/ # xfs_quota -c "free" /dev/sda6 Filesystem 1K-blocks Used Available Use% Pathname /dev/sda6 10480128 4196084 6284044 41% /home/utako/mpnt XFS_GETQUOTA: Function not implemented XFS_GETQUOTA: Function not implemented Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/free.orig 2007-05-09 10:36:01.000000000 +0900 +++ xfsprogs-2.8.20/quota/free.c 2007-05-10 17:18:16.000000000 +0900 @@ -116,12 +116,18 @@ projects_free_space_data( __uint64_t *rused, __uint64_t *rfree) { + fs_quota_stat_t qfs; fs_disk_quota_t d; struct fsxattr fsx; uint type = XFS_PROJ_QUOTA; char *dev = path->fs_name; int fd; + if ((xfsquotactl(XFS_GETQSTAT, dev, type, 0, &qfs) < 0) || + !(qfs.qs_flags & XFS_QUOTA_PDQ_ACCT)) { + return 0; 
+ } + if ((fd = open(path->fs_dir, O_RDONLY)) < 0) { fprintf(stderr, "%s: cannot open %s: %s\n", progname, path->fs_dir, strerror(errno)); From owner-xfs@oss.sgi.com Thu May 17 05:10:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 05:10:54 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HCAlfB009638 for ; Thu, 17 May 2007 05:10:49 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HCAkPI023807 for ; Thu, 17 May 2007 08:10:46 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HCAkf9558516 for ; Thu, 17 May 2007 08:10:46 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HCAj9B018272 for ; Thu, 17 May 2007 08:10:46 -0400 Received: from [9.53.41.190] (kleikamp.austin.ibm.com [9.53.41.190]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HCAiVf018264; Thu, 17 May 2007 08:10:44 -0400 Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc From: Dave Kleikamp To: David Chinner Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com In-Reply-To: <20070516234036.GQ85884050@sgi.com> References: <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <20070516031626.GM85884050@sgi.com> <1179318076.10313.6.camel@kleikamp.austin.ibm.com> <20070516234036.GQ85884050@sgi.com> Content-Type: text/plain Date: Thu, 17 May 2007 07:10:44 -0500 Message-Id: <1179403844.13965.2.camel@kleikamp.austin.ibm.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 Content-Transfer-Encoding: 7bit X-archive-position: 11461 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: shaggy@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote: > On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: > > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > > > Please don't make this always happen. c/mtime updates should be dependent > > > on the mode being used and whether there is visible change to the file. If no > > > userspace visible changes to the file occurred, then timestamps should not > > > be changed. > > > > i_blocks will be updated, so it seems reasonable to update ctime. mtime > > shouldn't be changed, though, since the contents of the file will be > > unchanged. > > That's assuming blocks were actually allocated - if the prealloc range already > has underlying blocks there is no change and so we should not be changing > mtime either. 
Only the filesystem will know if it has changed the file, so I > think that timestamp updates need to be driven down to that level, not done > blindly at the highest layer.... Yes, I agree. Shaggy -- David Kleikamp IBM Linux Technology Center From owner-xfs@oss.sgi.com Thu May 17 05:28:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 05:28:38 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HCSYfB013436 for ; Thu, 17 May 2007 05:28:35 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HCSXoR032548 for ; Thu, 17 May 2007 08:28:33 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HCSX8R248122 for ; Thu, 17 May 2007 06:28:33 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HCSWD9023479 for ; Thu, 17 May 2007 06:28:33 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HCSVDM023396; Thu, 17 May 2007 06:28:32 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B84B694EC3; Thu, 17 May 2007 17:58:41 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HCSeI0015154; Thu, 17 May 2007 17:58:40 +0530 Date: Thu, 17 May 2007 17:58:40 +0530 From: "Amit K. 
Arora" To: David Chinner Cc: Dave Kleikamp , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070517122840.GA12381@amitarora.in.ibm.com> References: <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070515193722.GA3487@amitarora.in.ibm.com> <20070515195421.GA2948@amitarora.in.ibm.com> <20070515200359.GA5834@amitarora.in.ibm.com> <20070516031626.GM85884050@sgi.com> <1179318076.10313.6.camel@kleikamp.austin.ibm.com> <20070516234036.GQ85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070516234036.GQ85884050@sgi.com> User-Agent: Mutt/1.4.1i X-archive-position: 11462 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote: > On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: > > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > > > > > > Following changes were made to the previous version: > > > > 1) Added description before sys_fallocate() definition. > > > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > > > > posix_fallocate should return EINVAL for len <= 0. > > > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > > > > 4) Do not return ENODEV for dirs (let individual file systems decide if > > > > they want to support preallocation to directories or not. > > > > 5) Check for wrap through zero. 
> > > > 6) Update c/mtime if fallocate() succeeds. > > > > > > Please don't make this always happen. c/mtime updates should be dependent > > > on the mode being used and whether there is visible change to the file. If no > > > userspace visible changes to the file occurred, then timestamps should not > > > be changed. > > > > i_blocks will be updated, so it seems reasonable to update ctime. mtime > > shouldn't be changed, though, since the contents of the file will be > > unchanged. > > That's assuming blocks were actually allocated - if the prealloc range already > has underlying blocks there is no change and so we should not be changing > mtime either. Only the filesystem will know if it has changed the file, so I > think that timestamp updates need to be driven down to that level, not done > blindly at the highest layer.... Ok. Will make this change in the next post. -- Regards, Amit Arora > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 17 07:11:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:11:21 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEBDfB030335 for ; Thu, 17 May 2007 07:11:16 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HECCtU030437 for ; Thu, 17 May 2007 10:12:12 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HEBCuY524960 for ; Thu, 17 May 2007 10:11:12 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HEBAS6007475 for ; Thu, 17 May 2007 10:11:11 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id 
l4HEB9sH007081; Thu, 17 May 2007 10:11:10 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 8336294EC3; Thu, 17 May 2007 19:41:15 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HEBF8c025779; Thu, 17 May 2007 19:41:15 +0530 Date: Thu, 17 May 2007 19:41:15 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 0/6][TAKE4] fallocate system call Message-ID: <20070517141115.GA24260@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11463 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Description: ----------- fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. 
With preallocation, applications also get a guarantee of space for particular file(s), even if the system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Although this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime in the inode of the corresponding file, marking a successful allocation.
FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime.
* New modes might get added in future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time. Since the semantics of this new mode are not yet clear and agreed upon, this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should start.

len: This is the number of bytes requested for preallocation (from offset).

RETURN VALUE: The system call returns 0 on success and an error on failure. This is done to keep the semantics the same as those of posix_fallocate().

sys_fallocate() on s390:
-----------------------
There is a problem with the s390 ABI in implementing sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel. This will require special handling of this system call on s390 in glibc as well. But this seems to be the best solution so far.

Known Problem:
-------------
Mmapped writes into uninitialized extents are a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See:

Since there is talk of ->fault() replacing ->page_mkwrite(), and a generic block_page_mkwrite() implementation has already been posted, we can implement this sometime later. See:

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Changelog:
---------
Changes from Take2 to Take3:
1) Return type is now described in the interface description above.
2) Patches rebased to 2.6.22-rc1 kernel.
** Each post will have an individual changelog for a particular patch.
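
For illustration only, here is a rough user-space sketch of the fallback behaviour described above for posix_fallocate(): try the kernel first, and only touch blocks by hand when the kernel or filesystem reports ENOSYS or EOPNOTSUPP. This is not the glibc code and not part of this patchset. The names try_fallocate_fn, no_kernel_support, touch_blocks_fallback and prealloc_with_fallback are all made up for the example, and the hook merely stands in for the proposed system call. Overflow of offset + len is not checked here.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread(), pwrite(), mkstemp() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical hook standing in for the proposed fallocate() system call.
 * Returns 0 on success or a positive errno value, like posix_fallocate().
 */
typedef int (*try_fallocate_fn)(int fd, off_t offset, off_t len);

/* Stub that behaves like a kernel/filesystem without fallocate support. */
static int no_kernel_support(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return ENOSYS;
}

/*
 * Rewrite one byte at pos so the filesystem has to back it with a block.
 * Inside the old file size the existing byte is read and written back,
 * so file contents are preserved; past it, a zero byte is written.
 */
static int touch_byte(int fd, off_t pos, off_t old_size)
{
    char c = 0;

    if (pos < old_size && pread(fd, &c, 1, pos) < 0)
        return errno;
    errno = 0;
    if (pwrite(fd, &c, 1, pos) != 1)
        return errno ? errno : EIO;
    return 0;
}

/* Library-style fallback: touch one byte per block in [offset, offset+len). */
static int touch_blocks_fallback(int fd, off_t offset, off_t len)
{
    struct stat st;
    off_t end = offset + len;   /* assumption: caller ruled out overflow */
    off_t blk, pos;
    int err;

    if (fstat(fd, &st) < 0)
        return errno;
    blk = st.st_blksize > 0 ? (off_t)st.st_blksize : 512;
    assert(blk > 0);

    for (pos = offset; pos < end; pos += blk) {
        err = touch_byte(fd, pos, st.st_size);
        if (err)
            return err;
    }
    /* The last block of the range may not have been hit by the loop. */
    return touch_byte(fd, end - 1, st.st_size);
}

/* Try the kernel first; fall back only if it says "not implemented". */
static int prealloc_with_fallback(int fd, off_t offset, off_t len,
                                  try_fallocate_fn try_fallocate)
{
    int err;

    if (offset < 0 || len <= 0)
        return EINVAL;

    err = try_fallocate(fd, offset, len);
    if (err != ENOSYS && err != EOPNOTSUPP)
        return err;             /* success (0) or a real error */

    return touch_blocks_fallback(fd, offset, len);
}
```

A caller would pass a hook that wraps the real system call once it exists; passing no_kernel_support exercises only the slow path. The slow path is what makes the in-kernel implementation attractive: it costs at least one write per block, and it is not safe against concurrent writers to the same range.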
Following patches follow: Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc Patch 2/6 : fallocate() on s390 Patch 3/6 : fallocate() on ia64 Patch 4/6 : ext4: Extent overlap bugfix Patch 5/6 : ext4: fallocate support in ext4 Patch 6/6 : ext4: write support for preallocated blocks -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu May 17 07:23:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:23:43 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HENcfB000617 for ; Thu, 17 May 2007 07:23:39 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HENc22021812 for ; Thu, 17 May 2007 10:23:38 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HENbQj204066 for ; Thu, 17 May 2007 08:23:37 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HENaOC025830 for ; Thu, 17 May 2007 08:23:37 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HENZV3025683; Thu, 17 May 2007 08:23:35 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id ACE0F94EC3; Thu, 17 May 2007 19:53:41 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HENfQa030745; Thu, 17 May 2007 19:53:41 +0530 Date: Thu, 17 May 2007 19:53:41 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc Message-ID: <20070517142341.GA28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11464 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Changelog: --------- Changes from Take3 to Take4: 1) Do not update c/mtime. Let each filesystem update ctime (update of mtime will not be required for allocation since we touch only metadata/inode and not blocks), if required. Changes from Take2 to Take3: 1) Patches now based on 2.6.22-rc1 kernel. Changes from Take1(initial post on 26th April, 2007) to Take2: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 
3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) Here is the new patch: Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/ia32/ia32entry.S | 1 fs/open.c | 86 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 2 include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 116 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { Index: linux-2.6.22-rc1/fs/open.c =================================================================== --- linux-2.6.22-rc1.orig/fs/open.c +++ 
linux-2.6.22-rc1/fs/open.c @@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * It is expected that the ->fallocate() inode operation implemented by the + * individual file systems will update the file size and/or ctime/mtime + * depending on the mode and also on the success of the operation. + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. 
+ */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. + */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. 
Index: linux-2.6.22-rc1/include/asm-i386/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-i386/unistd.h +++ linux-2.6.22-rc1/include/asm-i386/unistd.h @@ -329,10 +329,11 @@ #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_fallocate 324 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 325 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/systbl.h +++ linux-2.6.22-rc1/include/asm-powerpc/systbl.h @@ -308,3 +308,4 @@ COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) +COMPAT_SYS(fallocate) Index: linux-2.6.22-rc1/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-powerpc/unistd.h +++ linux-2.6.22-rc1/include/asm-powerpc/unistd.h @@ -327,10 +327,11 @@ #define __NR_getcpu 302 #define __NR_epoll_pwait 303 #define __NR_utimensat 304 +#define __NR_fallocate 305 #ifdef __KERNEL__ -#define __NR_syscalls 305 +#define __NR_syscalls 306 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.22-rc1/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-x86_64/unistd.h +++ linux-2.6.22-rc1/include/asm-x86_64/unistd.h @@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd) __SYSCALL(__NR_timerfd, sys_timerfd) #define __NR_eventfd 283 __SYSCALL(__NR_eventfd, sys_eventfd) +#define __NR_fallocate 284 +__SYSCALL(__NR_fallocate, sys_fallocate) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.22-rc1/include/linux/fs.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/fs.h +++ 
linux-2.6.22-rc1/include/linux/fs.h @@ -266,6 +266,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1137,6 +1148,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; Index: linux-2.6.22-rc1/include/linux/syscalls.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/syscalls.h +++ linux-2.6.22-rc1/include/linux/syscalls.h @@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si asmlinkage long sys_timerfd(int ufd, int clockid, int flags, const struct itimerspec __user *utmr); asmlinkage long sys_eventfd(unsigned int count); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/x86_64/ia32/ia32entry.S +++ linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_fallocate ia32_syscall_end: From owner-xfs@oss.sgi.com Thu May 17 07:24:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:25:00 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com 
[32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEOtfB000912 for ; Thu, 17 May 2007 07:24:56 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e31.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HEOskR006227 for ; Thu, 17 May 2007 10:24:54 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HEOsar199074 for ; Thu, 17 May 2007 08:24:54 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HEOrGU014146 for ; Thu, 17 May 2007 08:24:54 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HEOqJp014052; Thu, 17 May 2007 08:24:53 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id CBAA194EC3; Thu, 17 May 2007 19:55:02 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HEP2xW031286; Thu, 17 May 2007 19:55:02 +0530 Date: Thu, 17 May 2007 19:55:02 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 2/6][TAKE4] fallocate() on s390 Message-ID: <20070517142502.GB28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11465 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This is the patch suggested by Martin Schwidefsky to support sys_fallocate() on s390(x) platform. He also suggested a wrapper in glibc to handle this system call on s390. Posting it here so that we get feedback for this too. .globl __fallocate ENTRY(__fallocate) stm %r6,%r7,28(%r15) /* save %r6/%r7 on stack */ cfi_offset (%r7, -68) cfi_offset (%r6, -72) lm %r6,%r7,96(%r15) /* load loff_t len from stack */ svc SYS_ify(fallocate) lm %r6,%r7,28(%r15) /* restore %r6/%r7 from stack */ br %r14 PSEUDO_END(__fallocate) Here are the comments and the patch to linux kernel from him. ------------- From: Martin Schwidefsky This patch implements support of fallocate system call on s390(x) platform. A wrapper is added to address the issue which s390 ABI has with the arguments of this system call. 
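To make the argument issue concrete: according to the comment in the sys_s390.c hunk of this patch, on 31-bit s390 the 64-bit "len" reaches the kernel as two 32-bit halves, and s390_fallocate() glues them back together through a union. That union layout gives the right value because s390 is big-endian. A minimal userspace sketch of the same recombination, written with shifts so that it does not depend on byte order (join_len() is a hypothetical name, not something from the patch):

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: rebuild a 64-bit length
 * from the two 32-bit halves that a 31-bit caller passes separately.
 * Shifting instead of overlaying a union keeps the result independent of
 * the host byte order.
 */
static uint64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | len_low;
}
```

On a big-endian machine this matches what the union in s390_fallocate() computes; on a little-endian machine the union would swap the halves, while the shift form would not.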
Signed-off-by: Martin Schwidefsky --- arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ arch/s390/kernel/sys_s390.c | 29 +++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + include/asm-s390/unistd.h | 3 ++- 4 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl sys_fallocate_wrapper +sys_fallocate_wrapper: + lgfr %r2,%r2 # int + lgfr %r3,%r3 # int + sllg %r4,%r4,32 # get high word of 64bit loff_t + lr %r4,%r5 # get low word of 64bit loff_t + sllg %r5,%r6,32 # get high word of 64bit loff_t + l %r5,164(%r15) # get low word of 64bit loff_t + jg sys_fallocate Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c @@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar return -EFAULT; return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice); } + +#ifndef CONFIG_64BIT +/* + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last + * 64 bit argument "len" is split into the upper and lower 32 bits. The + * system call wrapper in the user space loads the value to %r6/%r7. + * The code in entry.S keeps the values in %r2 - %r6 where they are and + * stores %r7 to 96(%r15). But the standard C linkage requires that + * the whole 64 bit value for len is stored on the stack and doesn't + * use %r6 at all. 
So s390_fallocate has to convert the arguments from + * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len + * to + * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len + */ +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset, + u32 len_high, u32 len_low) +{ + union { + u64 len; + struct { + u32 high; + u32 low; + }; + } cv; + cv.high = len_high; + cv.low = len_low; + return sys_fallocate(fd, mode, offset, cv.len); +} +#endif Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper) Index: linux-2.6.22-rc1/include/asm-s390/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h +++ linux-2.6.22-rc1/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some From owner-xfs@oss.sgi.com Thu May 17 07:25:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:25:58 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEPqfB001349 for ; Thu, 17 May 2007 07:25:53 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HEPqaF024808 for ; Thu, 17 May 2007 10:25:52 -0400 Received: from d03av02.boulder.ibm.com 
(d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HEPpPf248948 for ; Thu, 17 May 2007 08:25:51 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HEPpP1017972 for ; Thu, 17 May 2007 08:25:51 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HEPobX017868; Thu, 17 May 2007 08:25:50 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5867A94EC3; Thu, 17 May 2007 19:55:55 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HEPt7W031681; Thu, 17 May 2007 19:55:55 +0530 Date: Thu, 17 May 2007 19:55:55 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 3/6][TAKE4] fallocate() on ia64 Message-ID: <20070517142555.GC28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11466 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: 
aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64 From: David Chinner Subject: [PATCH] ia64 fallocate syscall Cc: "Amit K. Arora" , akpm@linux-foundation.org, linux-ext4@vger.kernel.org, suparna@in.ibm.com, cmm@us.ibm.com ia64 fallocate syscall support. Signed-Off-By: Dave Chinner --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-12 18:45:56.000000000 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-15 15:36:48.000000000 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.000000000 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-15 15:37:51.000000000 -0700 @@ -296,6 +296,7 @@ #define __NR_getcpu 1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ From owner-xfs@oss.sgi.com Thu May 17 07:26:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:26:47 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEQhfB001772 for ; Thu, 17 May 2007 07:26:44 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HERe0Z024580 for ; Thu, 17 May 2007 10:27:40 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay02.pok.ibm.com 
(8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HEQfSq529698 for ; Thu, 17 May 2007 10:26:41 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HEQeSk011920 for ; Thu, 17 May 2007 10:26:40 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HEQd6L011855; Thu, 17 May 2007 10:26:39 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 3860394EC3; Thu, 17 May 2007 19:56:49 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HEQnLp032045; Thu, 17 May 2007 19:56:49 +0530 Date: Thu, 17 May 2007 19:56:49 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 4/6][TAKE4] ext4: Extent overlap bugfix Message-ID: <20070517142649.GD28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11467 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds a check for 
overlap of extents and cuts short the new extent to be inserted, if there
is a chance of overlap.

Changelog:
---------
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora
---
 fs/ext4/extents.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h | 1
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode
 }

 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s)
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
=================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Thu May 17 07:29:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:29:21 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HETGfB003097 for ; Thu, 17 May 2007 07:29:18 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l4HEPZaT027872 for ; Thu, 17 May 2007 10:25:35 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HETDOb268260 for ; Thu, 17 May 2007 08:29:13 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HETCEu019917 for ; Thu, 17 May 2007 08:29:13 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HETATS019680; Thu, 17 May 2007 08:29:11 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com 
(Postfix) with ESMTP id D8D1F94EC3; Thu, 17 May 2007 19:59:20 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HETFuO000544; Thu, 17 May 2007 19:59:15 +0530 Date: Thu, 17 May 2007 19:59:15 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 5/6][TAKE4] ext4: fallocate support in ext4 Message-ID: <20070517142915.GE28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11468 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FA_ALLOCATE mode is being supported as of now. Supporting FA_DEALLOCATE mode is a item. 
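The extents.c changes in this patch replace direct reads of ee_len with helpers such as ext4_ext_get_actual_len(), ext4_ext_is_uninitialized() and ext4_ext_mark_uninitialized(). Their definitions are not part of the text quoted here, so the following is only a sketch under a stated assumption: that the top bit of the 16-bit length field flags an extent as uninitialized (the existing comment in ext4_can_extents_be_merged() about "the top bit of ee_len being set" hints at this). All sketch_* names and the 0x8000 value are hypothetical, and the contiguity checks done by the real merge test are left out.

```c
#include <stdint.h>

/* Assumed flag bit; the real definition is not shown in the quoted text. */
#define SKETCH_UNINIT_BIT 0x8000u
#define SKETCH_MAX_LEN    (SKETCH_UNINIT_BIT - 1)

/* Host-endian stand-in for the on-disk ee_len field. */
struct sketch_extent {
	uint16_t len;
};

static int sketch_is_uninitialized(const struct sketch_extent *ex)
{
	return (ex->len & SKETCH_UNINIT_BIT) != 0;
}

static unsigned int sketch_actual_len(const struct sketch_extent *ex)
{
	return ex->len & SKETCH_MAX_LEN;
}

static void sketch_mark_uninitialized(struct sketch_extent *ex)
{
	ex->len |= SKETCH_UNINIT_BIT;
}

/*
 * Merge b into a only if both are in the same state and the combined
 * length still fits below the flag bit. Returns 1 if merged, else 0.
 */
static int sketch_merge(struct sketch_extent *a, const struct sketch_extent *b)
{
	int uninit = sketch_is_uninitialized(a);
	unsigned int total;

	if (uninit != sketch_is_uninitialized(b))
		return 0;

	total = sketch_actual_len(a) + sketch_actual_len(b);
	if (total > SKETCH_MAX_LEN)
		return 0;

	/* Storing the sum clears the flag, so restore it afterwards. */
	a->len = (uint16_t)total;
	if (uninit)
		sketch_mark_uninitialized(a);
	return 1;
}
```

The last step mirrors what the patch does in ext4_ext_insert_extent(): add the two actual lengths, then mark the result uninitialized again if the extents were uninitialized before the merge.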
Changelog:
---------
Changes from Take3 to Take4:
 1) Changed ext4_fallocate() declaration and definition to return a "long"
    and not an "int", to match with ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate().
 7) Converted macros that handle uninitialized extents into inline functions.

Here is the updated patch:

Signed-off-by: Amit Arora
---
 fs/ext4/extents.c | 249 +++++++++++++++++++++++++++++++++-------
 fs/ext4/file.c | 1
 include/linux/ext4_fs.h | 8 +
 include/linux/ext4_fs_extents.h | 12 +
 4 files changed, 229 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug(" %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug(" []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode,
 	ext_debug(" -> %d:%llu:%d ",
 		  le32_to_cpu(path->p_ext->ee_block),
 		  ext_pblock(path->p_ext),
-
le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1106,7 +1106,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* + * Make sure that either both extents are uninitialized, or + * both are _not_. + */ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. 
*/ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int ret = 0; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; int depth, len, err, next; + unsigned uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) return err; - ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len) - + le16_to_cpu(newext->ee_len)); + + /* + * ext4_can_extents_be_merged should have checked that either + * both extents are uninitialized, or both aren't. Thus we + * need to check only one of them here. 
+ */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(newext)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); eh = path[depth].p_hdr; nearex = ex; goto merge; @@ -1263,7 +1286,7 @@ has_space: ext_debug("first extent in the leaf: %d:%llu:%d\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len)); + ext4_ext_get_actual_len(newext)); path[depth].p_ext = EXT_FIRST_EXTENT(eh); } else if (le32_to_cpu(newext->ee_block) > le32_to_cpu(nearex->ee_block)) { @@ -1276,7 +1299,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 2, nearex + 1, len); } @@ -1289,7 +1312,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 1, nearex, len); path[depth].p_ext = nearex; @@ -1308,8 +1331,13 @@ merge: if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) break; /* merge with next extent! 
*/ - nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len) - + le16_to_cpu(nearex[1].ee_len)); + if (ext4_ext_is_uninitialized(nearex)) + uninitialized = 1; + nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) + + ext4_ext_get_actual_len(nearex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(nearex); + if (nearex + 1 < EXT_LAST_EXTENT(eh)) { len = (EXT_LAST_EXTENT(eh) - nearex - 1) * sizeof(struct ext4_extent); @@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in end = le32_to_cpu(ex->ee_block); if (block + num < end) end = block + num; - } else if (block >= - le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) { + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { /* need to allocate space after found extent */ start = block; end = block + num; @@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in * by found extent */ start = block; - end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len); + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); if (block + num < end) end = block + num; exists = 1; @@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in cbex.ec_type = EXT4_EXT_CACHE_GAP; } else { cbex.ec_block = le32_to_cpu(ex->ee_block); - cbex.ec_len = le16_to_cpu(ex->ee_len); + cbex.ec_len = ext4_ext_get_actual_len(ex); cbex.ec_start = ext_pblock(ex); cbex.ec_type = EXT4_EXT_CACHE_EXTENT; } @@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode * ext_debug("cache gap(before): %lu [%lu:%lu]", (unsigned long) block, (unsigned long) le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len)); + (unsigned long) ext4_ext_get_actual_len(ex)); } else if (block >= le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len)) { + + ext4_ext_get_actual_len(ex)) { lblock = le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len); + + ext4_ext_get_actual_len(ex); len = ext4_ext_next_allocated_block(path); ext_debug("cache gap(after): [%lu:%lu] %lu", (unsigned long) 
le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len), + (unsigned long) ext4_ext_get_actual_len(ex), (unsigned long) block); BUG_ON(len == lblock); len = len - lblock; @@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t * unsigned long from, unsigned long to) { struct buffer_head *bh; + unsigned short ee_len = ext4_ext_get_actual_len(ex); int i; #ifdef EXTENTS_STATS { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - unsigned short ee_len = le16_to_cpu(ex->ee_len); spin_lock(&sbi->s_ext_stats_lock); sbi->s_ext_blocks += ee_len; sbi->s_ext_extents++; @@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t * } #endif if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { /* tail removal */ unsigned long num; ext4_fsblk_t start; - num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from; - start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num; + num = le32_to_cpu(ex->ee_block) + ee_len - from; + start = ext_pblock(ex) + ee_len - num; ext_debug("free last %lu blocks starting %llu\n", num, start); for (i = 0; i < num; i++) { bh = sb_find_get_block(inode->i_sb, start + i); @@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t * } ext4_free_blocks(handle, inode, start, num); } else if (from == le32_to_cpu(ex->ee_block) - && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } else { printk("strange request: removal(2) %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } return 0; } @@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc unsigned a, b, block, num; unsigned long ex_ee_block; 
unsigned short ex_ee_len; + unsigned uninitialized = 0; struct ext4_extent *ex; ext_debug("truncate since %lu in leaf\n", start); @@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex = EXT_LAST_EXTENT(eh); ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex_ee_len = ext4_ext_get_actual_len(ex); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); if (err) @@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc ext_pblock(ex)); ex--; ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + ex_ee_len = ext4_ext_get_actual_len(ex); } if (correct_index && eh->eh_entries) @@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle if (ex) { unsigned long ee_block = le32_to_cpu(ex->ee_block); ext4_fsblk_t ee_start = ext_pblock(ex); - unsigned short ee_len = le16_to_cpu(ex->ee_len); + unsigned short ee_len; /* * Allow future support for preallocated extents to be added @@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle * Uninitialized extents are treated as holes, except that * we avoid (fail) allocating new blocks during a write. 
*/ - if (ee_len > EXT_MAX_LEN) + if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) goto out2; + ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { newblock = iblock - ee_block + ee_start; @@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); - ext4_ext_put_in_cache(inode, ee_block, ee_len, - ee_start, EXT4_EXT_CACHE_EXTENT); + /* Do not put uninitialized extent in the cache */ + if (!ext4_ext_is_uninitialized(ex)) + ext4_ext_put_in_cache(inode, ee_block, + ee_len, ee_start, + EXT4_EXT_CACHE_EXTENT); goto out; } } @@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(&newex); err = ext4_ext_insert_extent(handle, inode, path, &newex); if (err) goto out2; @@ -2109,8 +2149,10 @@ int ext4_ext_get_blocks(handle_t *handle newblock = ext_pblock(&newex); __set_bit(BH_New, &bh_result->b_state); - ext4_ext_put_in_cache(inode, iblock, allocated, newblock, - EXT4_EXT_CACHE_EXTENT); + /* Cache only when it is _not_ an uninitialized extent */ + if (create!=EXT4_CREATE_UNINITIALIZED_EXT) + ext4_ext_put_in_cache(inode, iblock, allocated, newblock, + EXT4_EXT_CACHE_EXTENT); out: if (allocated > max_blocks) allocated = max_blocks; @@ -2214,6 +2256,131 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +/* + * preallocate space for a file. This implements ext4's fallocate inode + * operation, which gets called from sys_fallocate system call. + * Currently only FA_ALLOCATE mode is supported on extent based files. 
+ * We may have more modes supported in future - like FA_DEALLOCATE, which + * tells fallocate to unallocate previously (pre)allocated blocks. + * For block-mapped files, posix_fallocate should fall back to the method + * of writing zeroes to the required new blocks (the same behavior which is + * expected for file systems which do not support fallocate() system call). + */ +long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + ext4_fsblk_t nblocks = 0; + int ret = 0; + int ret2 = 0; + int retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode->i_blkbits; + + /* + * currently supporting (pre)allocate mode for extent-based + * files _only_ + */ + if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return -EOPNOTSUPP; + + /* preallocation to directories is currently not supported */ + if (S_ISDIR(inode->i_mode)) + return -ENODEV; + + block = offset >> blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) + - block; + + /* + * credits to insert 1 extent into extent tree + buffers to be able to + * modify 1 super block, 1 block bitmap and 1 group descriptor. + */ + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3; +retry: + while (ret >= 0 && ret < max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + handle = ext4_journal_start(inode, credits); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + break; + } + + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, &map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + WARN_ON(!ret); + if (!ret) { + ext4_error(inode->i_sb, "ext4_fallocate", + "ext4_ext_get_blocks returned 0! 
inode#%lu" + ", block=%llu, max_blocks=%llu", + inode->i_ino, block, max_blocks); + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (ret > 0) { + /* check wrap through sign-bit/zero here */ + if ((block + ret) < 0 || (block + ret) < block) { + ret = -EIO; + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + break; + } + if (buffer_new(&map_bh) && ((block + ret) > + (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits) + >> blkbits))) + nblocks = nblocks + ret; + } + + /* Update ctime if new blocks get allocated */ + if (nblocks) { + struct timespec now; + now = current_fs_time(inode->i_sb); + if (!timespec_equal(&inode->i_ctime, &now)) + inode->i_ctime = now; + } + + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + if (ret2) + break; + } + + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry; + + /* + * Time to update the file size. + * Update only when preallocation was requested beyond the file size. + */ + if ((offset + len) > i_size_read(inode)) { + if (ret > 0) { + /* + * if no error, we assume preallocation succeeded + * completely + */ + mutex_lock(&inode->i_mutex); + i_size_write(inode, offset + len); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } else if (ret < 0 && nblocks) { + /* Handle partial allocation scenario */ + loff_t newsize; + + mutex_lock(&inode->i_mutex); + newsize = (nblocks << blkbits) + i_size_read(inode); + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } + } + + return ret > 0 ? 
ret2 : ret; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); Index: linux-2.6.22-rc1/fs/ext4/file.c =================================================================== --- linux-2.6.22-rc1.orig/fs/ext4/file.c +++ linux-2.6.22-rc1/fs/ext4/file.c @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ .removexattr = generic_removexattr, #endif .permission = ext4_permission, + .fallocate = ext4_fallocate, }; Index: linux-2.6.22-rc1/include/linux/ext4_fs.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs.h +++ linux-2.6.22-rc1/include/linux/ext4_fs.h @@ -102,6 +102,7 @@ EXT4_GOOD_OLD_FIRST_INO : \ (s)->s_first_ino) #endif +#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size),(1 << (blkbits))) /* * Macro-instructions used to manage fragments @@ -225,6 +226,11 @@ struct ext4_new_group_data { __u32 free_blocks_count; }; +/* + * Following is used by preallocation code to tell get_blocks() that we + * want uninitialzed extents. 
+ */ +#define EXT4_CREATE_UNINITIALIZED_EXT 2 /* * ioctl commands @@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t extern void ext4_ext_truncate(struct inode *, struct page *); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); +extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset, + loff_t len); static inline int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode * EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO; } +static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) { + ext->ee_len |= cpu_to_le16(0x8000); +} + +static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x8000); +} + +static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) { + return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF); +} + extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Thu May 17 07:30:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:31:02 -0700 (PDT) Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.151]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEUvfB003542 for ; Thu, 17 May 2007 07:30:58 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e33.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l4HEUvQq013895 for ; Thu, 17 May 
2007 10:30:57 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4HEUv1D237672 for ; Thu, 17 May 2007 08:30:57 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4HEUuO7011896 for ; Thu, 17 May 2007 08:30:56 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4HEUssi011749; Thu, 17 May 2007 08:30:55 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id ED76394EC3; Thu, 17 May 2007 20:01:04 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l4HEUx8b001224; Thu, 17 May 2007 20:00:59 +0530 Date: Thu, 17 May 2007 20:00:59 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 6/6][TAKE4] ext4: write support for preallocated blocks Message-ID: <20070517143059.GF28726@amitarora.in.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070517141458.GA26641@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070517141458.GA26641@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11469 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: aarora@linux.vnet.ibm.com
Precedence: bulk
X-list: xfs

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

Changelog:
---------
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
    ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora
---
 fs/ext4/extents.c               |  234 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    3
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done = 0; + int uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) + { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1); + merge_done = 1; + WARN_ON(eh->eh_entries == 0); + if (!eh->eh_entries) + ext4_error(inode->i_sb, "ext4_ext_try_to_merge", + "inode#%lu, eh->eh_entries = 0!", inode->i_ino); + } + + return merge_done; +} + +/* * check if a portion of the "newext" extent overlaps with an * existing extent. * @@ -1327,25 +1375,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! 
*/ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block #endif } +/* + * This function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three - one initialized and two + * uninitialized). + * There are three possibilities: + * a> There is no split required: Entire extent should be initialized + * b> Splits in two extents: Write is happening at either end of the extent + * c> Splits in three extents: Somone is writing in middle of the extent + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, newex; + struct ext4_extent *ex1 = NULL; + struct ext4_extent *ex2 = NULL; + struct ext4_extent *ex3 = NULL; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0; + int ret = 0; + + depth = ext_depth(inode); + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex->ee_block); + ee_len = ext4_ext_get_actual_len(ex); + allocated = ee_len - (iblock - ee_block); + newblock = iblock - ee_block + ext_pblock(ex); + ex2 = ex; + + /* ex1: ee_block to iblock - 1 : uninitialized */ + if (iblock > ee_block) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - 
ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* for sanity, update the length of the ex2 extent before + * we insert ex3, if ex1 is NULL. This is to avoid temporary + * overlap of blocks. + */ + if (!ex1 && allocated > max_blocks) + ex2->ee_len = cpu_to_le16(max_blocks); + /* ex3: to ee_block + ee_len : uninitialised */ + if (allocated > max_blocks) { + unsigned int newdepth; + ex3 = &newex; + ex3->ee_block = cpu_to_le32(iblock + max_blocks); + ext4_ext_store_pblock(ex3, newblock + max_blocks); + ex3->ee_len = cpu_to_le16(allocated - max_blocks); + ext4_ext_mark_uninitialized(ex3); + err = ext4_ext_insert_extent(handle, inode, path, ex3); + if (err) + goto out; + /* The depth, and hence eh & ex might change + * as part of the insert above. + */ + newdepth = ext_depth(inode); + if (newdepth != depth) { + depth = newdepth; + path = ext4_ext_find_extent(inode, iblock, NULL); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + goto out; + } + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + if (ex2 != &newex) + ex2 = ex; + } + allocated = max_blocks; + } + /* If there was a change of depth as part of the + * insertion of ex3 above, we need to update the length + * of the ex1 extent again here + */ + if (ex1 && ex1 != ex) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* ex2: iblock to iblock + maxblocks-1 : initialised */ + ex2->ee_block = cpu_to_le32(iblock); + ex2->ee_start = cpu_to_le32(newblock); + ext4_ext_store_pblock(ex2, newblock); + ex2->ee_len = cpu_to_le16(allocated); + if (ex2 != ex) + goto insert; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* New (initialized) extent starts from the first block + * in the current extent. i.e., ex2 == ex + * We have to see if it can be merged with the extent + * on the left. 
+ */ + if (ex2 > EXT_FIRST_EXTENT(eh)) { + /* To merge left, pass "ex2 - 1" to try_to_merge(), + * since it merges towards right _only_. + */ + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + depth = ext_depth(inode); + ex2--; + } + } + /* Try to Merge towards right. This might be required + * only when the whole extent is being written to. + * i.e. ex2 == ex and ex3 == NULL. + */ + if (!ex3) { + ret = ext4_ext_try_to_merge(inode, path, ex2); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + } + } + /* Mark modified extent as dirty */ + err = ext4_ext_dirty(handle, inode, path + depth); + goto out; +insert: + err = ext4_ext_insert_extent(handle, inode, path, &newex); +out: + return err ? err : allocated; +} + int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, int create, int extend_disksize) { struct ext4_ext_path *path = NULL; + struct ext4_extent_header *eh; struct ext4_extent newex, *ex; ext4_fsblk_t goal, newblock; - int err = 0, depth; + int err = 0, depth, ret; unsigned long allocated = 0; __clear_bit(BH_New, &bh_result->b_state); @@ -2067,6 +2234,7 @@ int ext4_ext_get_blocks(handle_t *handle * this is why assert can't be put in ext4_ext_find_extent() */ BUG_ON(path[depth].p_ext == NULL && depth != 0); + eh = path[depth].p_hdr; ex = path[depth].p_ext; if (ex) { @@ -2075,13 +2243,9 @@ int ext4_ext_get_blocks(handle_t *handle unsigned short ee_len; /* - * Allow future support for preallocated extents to be added - * as an RO_COMPAT feature: * Uninitialized extents are treated as holes, except that - * we avoid (fail) allocating new blocks during a write. + * we split out initialized portions during a write. 
*/ - if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) - goto out2; ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { @@ -2090,12 +2254,27 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); + /* Do not put uninitialized extent in the cache */ - if (!ext4_ext_is_uninitialized(ex)) + if (!ext4_ext_is_uninitialized(ex)) { ext4_ext_put_in_cache(inode, ee_block, ee_len, ee_start, EXT4_EXT_CACHE_EXTENT); - goto out; + goto out; + } + if (create == EXT4_CREATE_UNINITIALIZED_EXT) + goto out; + if (!create) + goto out2; + + ret = ext4_ext_convert_to_initialized(handle, inode, + path, iblock, + max_blocks); + if (ret <= 0) + goto out2; + else + allocated = ret; + goto outnew; } } @@ -2147,6 +2326,7 @@ int ext4_ext_get_blocks(handle_t *handle /* previous routine could use block we allocated */ newblock = ext_pblock(&newex); +outnew: __set_bit(BH_New, &bh_result->b_state); /* Cache only when it is _not_ an uninitialized extent */ Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h @@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, 
 				ext_prepare_callback, void *);

From owner-xfs@oss.sgi.com Thu May 17 07:41:43 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 07:41:45 -0700 (PDT)
Received: from sandeen.net (sandeen.net [209.173.210.139])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HEfgfB006500
	for ; Thu, 17 May 2007 07:41:43 -0700
Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by sandeen.net (Postfix) with ESMTP id 1F46C1807A959
	for ; Thu, 17 May 2007 09:41:41 -0500 (CDT)
Message-ID: <464C69A4.6050605@sandeen.net>
Date: Thu, 17 May 2007 09:41:40 -0500
From: Eric Sandeen
User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326)
MIME-Version: 1.0
To: xfs@oss.sgi.com
Subject: Re: [PATCH] (and bad attr2 bug) - pack xfs_sb_t for 64-bit arches
References: <455CB54F.8080901@sandeen.net>
In-Reply-To: <455CB54F.8080901@sandeen.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 11470
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: sandeen@sandeen.net
Precedence: bulk
X-list: xfs

Eric Sandeen wrote:
> see also https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212201
>
> Bugzilla Bug 212201: Cannot build sysem with XFS file system.
>
> I turned on attr2 in FC6 at nathan's suggestion, for selinux goodness
> with more efficient xattr space usage.
>
> But, many reports that this was totally broken in fc6, on x86_64.

Although it turned out to be a different issue, not the packing issue,
is the packing/alignment (below) still something that needs to be
fixed...?

-Eric

> Install went ok, but on reboot the filesystem was found to be corrupt.
>
> The filesystem was also found to be marked w/ attr1, not attr2....
>
> If you do a fresh mkfs.xfs on x86_64, with -i attr=2, and dump out the
> superblock (or look at it with xfs_db) you will find that although the
> versionnum says that there is a morebits bit, the features2 flag is 0.
>
> If you dd/hexdump the superblock, you will find the attr2 flag, but at
> the wrong offset.
>
> This is because the xfs_sb_t struct is padded out to 64 bits on 64-bit
> arches, and the xfs_xlatesb() routine and xfs_sb_info[] array take this
> padding to mean that the last item is 4 bytes bigger than it is, and
> treat sb_features2 as 8 bytes, not four. This then gets endian-flipped out...
>
> I can't quite figure out how this winds up causing problems if you stay
> on the x86_64 arch, as I'd expect that if the offset is wrong, it should
> at least be consistently wrong. And in fact if you do mkfs,mount,xfs_info,
> it will tell you that you do have attr2. But somewhere along the line things
> go wrong, and post-install, post-reboot, the filesystem thinks it is attr1,
> and is therefore corrupt.
>
> I think that maybe some accesses are post-xfs_xlatesb, while others
> may access the un-flipped sb directly? Or maybe this is sb logging
> code that has messed things up? Not sure... needs more investigation.
>
> In any case, dd does not lie, and this patch for the kernel, and a
> corresponding one for userspace, at least make "mkfs.xfs -i attr=2"
> put the features2 flag in the right place, as shown by inspection via dd.
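
The padding effect described above can be reproduced in isolation. The
following is only a minimal sketch, assuming a GCC-compatible compiler:
struct demo_sb and its member names are hypothetical stand-ins rather than
the real xfs_sb_t layout, and the "end of struct minus start of field"
calculation is an assumed model of how a field-offset table could derive
the length of the last field, not a copy of what xfs_xlatesb() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for xfs_sb_t: the 64-bit member forces 8-byte
 * alignment on typical 64-bit ABIs, and the struct ends on a 32-bit field.
 */
struct demo_sb {
	uint64_t dblocks;
	uint32_t features2;
};

/* Same members, packed as in the proposed patch (GCC/Clang attribute). */
struct demo_sb_packed {
	uint64_t dblocks;
	uint32_t features2;
} __attribute__ ((packed));

/* Assumed model: length of last field = end of struct - start of field. */
static size_t tail_len_natural(void)
{
	return sizeof(struct demo_sb) - offsetof(struct demo_sb, features2);
}

static size_t tail_len_packed(void)
{
	return sizeof(struct demo_sb_packed)
		- offsetof(struct demo_sb_packed, features2);
}

/*
 * The field starts at the same offset either way; only the derived
 * length changes when the compiler adds tail padding.
 */
static void check_offsets(void)
{
	assert(offsetof(struct demo_sb, features2) ==
	       offsetof(struct demo_sb_packed, features2));
}
```

On x86_64, tail_len_natural() would be expected to come out as 8 and
tail_len_packed() as 4. On 32-bit x86 both would typically be 4, since
64-bit members are only 4-byte aligned inside structs there, which would
fit the report that only 64-bit arches were affected.
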
>
> Signed-off-by: Eric Sandeen
>
> Index: linux-2.6.18/fs/xfs/xfs_sb.h
> ===================================================================
> --- linux-2.6.18.orig/fs/xfs/xfs_sb.h
> +++ linux-2.6.18/fs/xfs/xfs_sb.h
> @@ -149,7 +149,7 @@ typedef struct xfs_sb
>  	__uint16_t	sb_logsectsize;	/* sector size for the log, bytes */
>  	__uint32_t	sb_logsunit;	/* stripe unit size for the log */
>  	__uint32_t	sb_features2;	/* additional feature bits */
> -} xfs_sb_t;
> +} __attribute__ ((packed)) xfs_sb_t;
>
>  /*
>   * Sequence number values for the fields.
>
>
>

From owner-xfs@oss.sgi.com Thu May 17 09:47:25 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 09:47:27 -0700 (PDT)
Received: from sandeen.net (sandeen.net [209.173.210.139])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4HGlOfB004777
	for ; Thu, 17 May 2007 09:47:25 -0700
Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by sandeen.net (Postfix) with ESMTP id 218051807A959
	for ; Thu, 17 May 2007 11:47:24 -0500 (CDT)
Message-ID: <464C871B.3090402@sandeen.net>
Date: Thu, 17 May 2007 11:47:23 -0500
From: Eric Sandeen
User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326)
MIME-Version: 1.0
To: xfs-oss
Subject: [PATCH] fix up xfstests a bit for linux+udf
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 11471
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: sandeen@sandeen.net
Precedence: bulk
X-list: xfs

udf on linux doesn't support acls or attrs, so prevent those tests from
running.

Signed-off-by: Eric Sandeen

Index: xfstests/020
===================================================================
--- xfstests.orig/020
+++ xfstests/020
@@ -65,7 +65,7 @@ _attr_list()
 
 
 # real QA test starts here
-_supported_fs xfs udf
+_supported_fs xfs
 _supported_os Linux
 
 [ -x /usr/bin/attr ] || _notrun "attr is not installed"
Index: xfstests/051
===================================================================
--- xfstests.orig/051
+++ xfstests/051
@@ -56,7 +56,7 @@ _cleanup()
 
 #
 # real QA test starts here
-_supported_fs xfs udf
+_supported_fs xfs
 _supported_os Linux
 
 [ -x /usr/bin/chacl ] || _notrun "chacl executable not found"
Index: xfstests/062
===================================================================
--- xfstests.orig/062
+++ xfstests/062
@@ -102,7 +102,7 @@ _create_test_bed()
 }
 
 # real QA test starts here
-_supported_fs xfs udf nfs
+_supported_fs xfs nfs
 _supported_os Linux
 
 _require_scratch
Index: xfstests/070
===================================================================
--- xfstests.orig/070
+++ xfstests/070
@@ -33,6 +33,7 @@ _cleanup()
 
 _supported_fs xfs udf nfs
 _supported_os IRIX Linux
+[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux UDF does not support extended attributes"
 
 _setup_testdir
 $FSSTRESS_PROG \
Index: xfstests/105
===================================================================
--- xfstests.orig/105
+++ xfstests/105
@@ -35,6 +35,8 @@ _cleanup()
 
 _supported_fs xfs udf
 _supported_os IRIX Linux
+[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux UDF does not support ACLS"
+
 
 # real QA test starts here
 rm -f $seq.full

From owner-xfs@oss.sgi.com Thu May 17 23:08:39 2007
Received: with ECARTIS (v1.0.0; list xfs); Thu, 17 May 2007 23:08:42 -0700 (PDT)
Received: from amanpulo.fs3.ph (amanpulo.fs3.ph [72.51.42.241])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4I68bfB030111
	for ; Thu, 17 May 2007 23:08:39 -0700
Received: from localhost (localhost [127.0.0.1])
	by amanpulo.fs3.ph (Postfix) with
	ESMTP id 4F6661E0D5966
	for ; Fri, 18 May 2007 14:08:36 +0800 (PHT)
Received: from amanpulo.fs3.ph ([127.0.0.1])
	by localhost (amanpulo.fs3.ph [127.0.0.1]) (amavisd-new, port 10024)
	with LMTP id zxsd0wF2tgBC
	for ; Fri, 18 May 2007 14:08:30 +0800 (PHT)
Received: from musang.fs3.ph (unknown [222.127.47.132])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by amanpulo.fs3.ph (Postfix) with ESMTP id 719C51E0D5953
	for ; Fri, 18 May 2007 14:08:28 +0800 (PHT)
Received: by musang.fs3.ph (Postfix, from userid 1000)
	id A4633205EBF3; Fri, 18 May 2007 14:08:25 +0800 (PHT)
Date: Fri, 18 May 2007 14:08:25 +0800
From: Federico Sevilla III
To: XFS Mailing List
Subject: Mount Failure: Totally Zeroed Log
Message-ID: <20070518060825.GD3340@fs3.ph>
Mail-Followup-To: XFS Mailing List
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Personal-URL: http://jijo.free.net.ph
User-Agent: Mutt/1.5.13 (2006-08-11)
X-archive-position: 11472
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jijo@fs3.ph
Precedence: bulk
X-list: xfs

Hi,

I encountered the following error after an unclean shutdown:

Filesystem "md1": Disabling barriers, not supported by the underlying device
XFS mounting filesystem md1
XFS: totally zeroed log
Starting XFS recovery on filesystem: md1 (logdev: internal)
Filesystem "md1": XFS internal error xlog_valid_rec_header(1) at line 3503 of file fs/xfs/xfs_log_recover.c.
Caller 0xf8992fbd [] xlog_valid_rec_header+0xc5/0xd5 [xfs] [] xlog_do_recovery_pass+0x550/0x90f [xfs] [] xlog_do_recovery_pass+0x550/0x90f [xfs] [] xlog_do_log_recovery+0x45/0xa6 [xfs] [] xlog_do_recover+0x1d/0x102 [xfs] [] xlog_recover+0x87/0x98 [xfs] [] xfs_log_mount+0x8d/0xce [xfs] [] xfs_mountfs+0x982/0xc63 [xfs] [] xfs_ioinit+0x21/0x26 [xfs] [] xfs_mount+0x30b/0x37e [xfs] [] vfs_mount+0x28/0x2c [xfs] [] xfs_fs_fill_super+0x6e/0x193 [xfs] [] snprintf+0x16/0x1a [] disk_name+0x25/0x66 [] get_sb_bdev+0xca/0x117 [] __link_path_walk+0xadd/0xbd2 [] xfs_fs_get_sb+0x1e/0x22 [xfs] [] xfs_fs_fill_super+0x0/0x193 [xfs] [] vfs_kern_mount+0x35/0x66 [] do_kern_mount+0x29/0x39 [] do_new_mount+0x67/0xa4 [] do_mount+0x153/0x16b [] copy_mount_options+0x4c/0x99 [] sys_mount+0x79/0xba [] syscall_call+0x7/0xb XFS: log mount/recovery failed: error 117 XFS: log mount failed The machine runs Debian GNU/Linux 4.0 (Etch), with a custom-built 2.6.18 kernel using the Debian 2.6.18 source and the OpenVZ patch. xfs_repair is able to repair the filesystem, but most of the files just end up in lost+found. I have verified that the write cache on both drives has been disabled, and hdparm is set to disable it every time during startup. Any clues as to what could be causing this? Thank you very much. Cheers! -- Federico Sevilla III F S 3 Consulting Inc.
http://www.fs3.ph From owner-xfs@oss.sgi.com Fri May 18 07:18:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 18 May 2007 07:18:11 -0700 (PDT) Received: from gk.uu.epigenomics.net (gk.uu.epigenomics.net [195.127.125.226]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4IEI6fB013278 for ; Fri, 18 May 2007 07:18:08 -0700 Received: (qmail 9438 invoked from network); 18 May 2007 14:18:04 -0000 Received: from perl.epigenomics.epi (192.168.48.4) by salam.epigenomics.epi with SMTP; 18 May 2007 14:18:04 -0000 Received: (qmail 3375 invoked by uid 9); 18 May 2007 14:18:04 -0000 From: linux-xfs@ml.epigenomics.com X-Newsgroups: epi.ml.linux.xfs Subject: Re: xfs_repair: buf calloc failed (4132 bytes): Cannot allocate memory Date: Fri, 18 May 2007 14:18:04 +0000 (UTC) Organization: Epigenomics AG Lines: 19 Message-ID: References: X-Complaints-To: usenet@epigenomics.net User-Agent: slrn/0.9.8.1pl1 (Debian) To: xfs@oss.sgi.com X-archive-position: 11473 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: linux-xfs@ml.epigenomics.com Precedence: bulk X-list: xfs On Wed, 16 May 2007 13:32:28 +1000, "Barry Naujok" wrote: > > With 18 million inodes, 1.2TB of space, it will use a lot of memory. > Assuming you are using a recent xfsprogs (2.8.11 or later), you can > reduce memory usage by adding "-o bhash=512" which will limit the > number of buffers cached in xfs_repair. In this scenario, the cache > will overflow RAM rather easily with the number of inodes to be > scanned. Thank you. That did it. xfs_repair was able to complete without memory error. Greetings -- Robert Sander Senior Manager Information Systems Epigenomics AG Kleine Praesidentenstr. 
1 10178 Berlin, Germany phone:+49-30-24345-0 fax:+49-30-24345-555 http://www.epigenomics.com robert.sander@epigenomics.com From owner-xfs@oss.sgi.com Sat May 19 00:01:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 19 May 2007 00:01:41 -0700 (PDT) Received: from smtp2.linux-foundation.org (smtp2.linux-foundation.org [207.189.120.14]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4J71YfB002584 for ; Sat, 19 May 2007 00:01:35 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp2.linux-foundation.org (8.13.5.20060308/8.13.5/Debian-3ubuntu1.1) with ESMTP id l4J6imAj004286 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 18 May 2007 23:45:22 -0700 Received: from box (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l4J6ihY5031475; Fri, 18 May 2007 23:44:45 -0700 Date: Fri, 18 May 2007 23:44:44 -0700 From: Andrew Morton To: "Amit K. Arora" Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/6][TAKE4] fallocate system call Message-Id: <20070518234444.f53a4230.akpm@linux-foundation.org> In-Reply-To: <20070517141115.GA24260@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> X-Mailer: Sylpheed 2.4.1 (GTK+ 2.8.17; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit 
X-MIMEDefang-Filter: osdl$Revision: 1.180 $ X-Scanned-By: MIMEDefang 2.53 on 207.189.120.14 X-archive-position: 11474 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: akpm@linux-foundation.org Precedence: bulk X-list: xfs On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" wrote: > fallocate() is a new system call being proposed here which will allow > applications to preallocate space to any file(s) in a file system. I merged the first three patches into -mm, thanks. All the system call numbers got changed due to recent additions. They may change in the future, too - nothing is stable until the code lands in mainline. I didn't merge any of the ext4 changes as they appear to be in Ted's devel tree. Although I didn't check that they are 100% the same in that tree. What's the plan to get some ext4 updates into mainline, btw? Things seem to be rather gradual. From owner-xfs@oss.sgi.com Sun May 20 16:34:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 16:34:28 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4KNYNfB026748 for ; Sun, 20 May 2007 16:34:25 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA20542; Mon, 21 May 2007 09:34:19 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4KNYIAf99659883; Mon, 21 May 2007 09:34:18 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4KNYHPm99702355; Mon, 21 May 2007 09:34:17 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 21 May 2007 09:34:17 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review: ensure EOF writes 
into existing extents update filesize Message-ID: <20070520233417.GA85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11475 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs As reported a couple of times on lkml since the 2.6.22 window opened, XFS is not updating the file size correctly in all cases. This is a result of the null files fix not updating the file size when the write extending the file does not need to allocate a block. In that case, we use a read mapping of the extent, and this also happens to use the read I/O completion handler instead of the write I/O completion handler. Hence the file size was not updated on I/O completion. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-11 16:03:59.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-12 23:46:54.379994052 +1000 @@ -973,8 +973,9 @@ xfs_page_state_convert( bh = head = page_buffers(page); offset = page_offset(page); - flags = -1; - type = IOMAP_READ; + iomap_valid = 0; + flags = BMAPI_READ; + type = IOMAP_NEW; /* TODO: cleanup count and page_dirty */ @@ -1004,14 +1005,14 @@ xfs_page_state_convert( * * Third case, an unmapped buffer was found, and we are * in a path where we need to write the whole page out.
- */ + */ if (buffer_unwritten(bh) || buffer_delay(bh) || ((buffer_uptodate(bh) || PageUptodate(page)) && !buffer_mapped(bh) && (unmapped || startio))) { - /* + /* * Make sure we don't use a read-only iomap */ - if (flags == BMAPI_READ) + if (flags == BMAPI_READ) iomap_valid = 0; if (buffer_unwritten(bh)) { @@ -1060,7 +1061,7 @@ xfs_page_state_convert( * That means it must already have extents allocated * underneath it. Map the extent by reading it. */ - if (!iomap_valid || type != IOMAP_READ) { + if (!iomap_valid || flags != BMAPI_READ) { flags = BMAPI_READ; size = xfs_probe_cluster(inode, page, bh, head, 1); @@ -1071,7 +1072,15 @@ xfs_page_state_convert( iomap_valid = xfs_iomap_valid(&iomap, offset); } - type = IOMAP_READ; + /* + * We set the type to IOMAP_NEW in case we are doing a + * small write at EOF that is extending the file but + * without needing an allocation. We need to update the + * file size on I/O completion in this case so it is + * the same case as having just allocated a new extent + * that we are writing into for the first time. 
+ */ + type = IOMAP_NEW; if (!test_and_set_bit(BH_Lock, &bh->b_state)) { ASSERT(buffer_mapped(bh)); if (iomap_valid) From owner-xfs@oss.sgi.com Sun May 20 16:40:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 16:40:54 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4KNepfB028856 for ; Sun, 20 May 2007 16:40:52 -0700 Received: (qmail 87717 invoked from network); 20 May 2007 23:40:50 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@70.231.251.20 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 20 May 2007 23:40:50 -0000 X-YMail-OSG: qldNWF4VM1l3IlBBrviKTVLT8pRmiGeUOUvJki6sSy7OtviVdvxTlBrN0kzpRBiQK2g_1er8pQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id C2CD41827261; Sun, 20 May 2007 16:40:48 -0700 (PDT) Date: Sun, 20 May 2007 16:40:48 -0700 From: Chris Wedgwood To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070520234048.GA7423@tuatara.stupidest.org> References: <20070520233417.GA85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070520233417.GA85884050@sgi.com> X-archive-position: 11476 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Mon, May 21, 2007 at 09:34:17AM +1000, David Chinner wrote: > In that case, we use a read mapping of the extent, and this also > happens to use the read I/O completion handler instead of the > write I/O completion handle. Hence the file size was not updated > on I/O completion. > > Comments? I've had this (well, the first version you sent to Jeremy) for a few days on a testing machine without any obvious ill effects. 
Once people are happy with this it should be checked to see if it needs to go into -stable. From owner-xfs@oss.sgi.com Sun May 20 16:44:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 16:44:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4KNiafB029998 for ; Sun, 20 May 2007 16:44:38 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA20673; Mon, 21 May 2007 09:44:34 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4KNiXAf94792431; Mon, 21 May 2007 09:44:34 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4KNiW5e99762204; Mon, 21 May 2007 09:44:32 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 21 May 2007 09:44:32 +1000 From: David Chinner To: Chris Wedgwood Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070520234432.GB85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> <20070520234048.GA7423@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070520234048.GA7423@tuatara.stupidest.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11477 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sun, May 20, 2007 at 04:40:48PM -0700, Chris Wedgwood wrote: > > Once people are happy with this it should be checked to see if it > needs to go into -stable. It's new code so -stable is irrelevant.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun May 20 19:15:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 19:15:27 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L2FIfB000774 for ; Sun, 20 May 2007 19:15:21 -0700 Received: from [134.14.55.89] (soarer.melbourne.sgi.com [134.14.55.89]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA23228; Mon, 21 May 2007 11:57:58 +1000 Message-ID: <4650FD38.4080408@sgi.com> Date: Mon, 21 May 2007 12:00:24 +1000 From: Vlad Apostolov User-Agent: Thunderbird 1.5.0.10 (X11/20070221) MIME-Version: 1.0 To: David Chinner CC: xfs-dev , xfs-oss Subject: Re: review: make xfs_dm_punch_hole() atomic when punching EOF References: <20070419071856.GR48531920@melbourne.sgi.com> In-Reply-To: <20070419071856.GR48531920@melbourne.sgi.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11478 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: vapo@sgi.com Precedence: bulk X-list: xfs It is looking good Dave. Regards, Vlad David Chinner wrote: > Currently punching a hole to EOF via xfs_dm_punch_hole() > truncates the file and then extends it. This leaves a small > window where applications can see an incorrect file size > while the punch is in progress. This can cause problems > with DMF leading to premature completion of recalls and > hence data corruption. > > Use the UNRESVSP ioctl rather than FREESP+setattr to punch the > hole at EOF. This can leave specualtive allocations past EOF, > so truncate them off so we don't leave blocks that can't be > migrated away around in the filesystem. > > Cheers, > > Dave. 
> From owner-xfs@oss.sgi.com Sun May 20 20:21:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 20:21:01 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L3KvfB021385 for ; Sun, 20 May 2007 20:20:59 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA25030; Mon, 21 May 2007 13:20:52 +1000 Date: Mon, 21 May 2007 13:25:06 +1000 From: Timothy Shimmin To: riesebie@lxtec.de cc: xfs@oss.sgi.com Subject: Re: [xfs-masters] 2.6.22-rc2 built on ppc Message-ID: <9BA71D7C90D3CBF30ABEF4AF@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070520111408.GF3253@aragorn.home.lxtec.de> References: <20070520111408.GF3253@aragorn.home.lxtec.de> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11479 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi there, --On 20 May 2007 1:14:08 PM +0200 Elimar Riesebieter wrote: > Hi, > FYI, building the kernel with > gcc (GCC) 4.1.3 20070514 (prerelease) (Debian 4.1.2-7) > on my powerbook (PPC) gives: > > ... > fs/xfs/xfs_bmap.c: In function 'xfs_bmap_rtalloc': > fs/xfs/xfs_bmap.c:2650: warning: 'rtx' is used uninitialized in this function > fs/xfs/linux-2.6/xfs_lrw.c: In function 'xfs_iozero': > fs/xfs/linux-2.6/xfs_lrw.c:162: warning: 'memclear_highpage_flush' is deprecated (declared at > include/linux/highmem.h:115) ... > The memclear deprecated warning is well known and the 1 line fix of calling zero_user_page() will be updated shortly (after I update our sgi 2.6.x-xfs tree to latest mainline and then push our xfs-linux tree to git). 
The rtx one is not really a problem. rtx is initialised by being passed to a function and only is uninitialised in the error case where it is then not referenced. However, we can init it on definition to shut the warning up. Thanks, Tim. From owner-xfs@oss.sgi.com Sun May 20 21:25:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 21:25:55 -0700 (PDT) Received: from smtp103.sbc.mail.mud.yahoo.com (smtp103.sbc.mail.mud.yahoo.com [68.142.198.202]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L4PkfB003611 for ; Sun, 20 May 2007 21:25:49 -0700 Received: (qmail 29599 invoked from network); 21 May 2007 04:25:46 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@70.231.251.20 with login) by smtp103.sbc.mail.mud.yahoo.com with SMTP; 21 May 2007 04:25:46 -0000 X-YMail-OSG: YA7taMUVM1ksJv8yqUlXYc7O9iZKee_F_us5ol.pugevuGe6.XOK3fPVm.lzINHLxzXN4x4aJQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 619C41827261; Sun, 20 May 2007 21:25:43 -0700 (PDT) Date: Sun, 20 May 2007 21:25:42 -0700 From: Chris Wedgwood To: Timothy Shimmin Cc: riesebie@lxtec.de, xfs@oss.sgi.com Subject: Re: [xfs-masters] 2.6.22-rc2 built on ppc Message-ID: <20070521042542.GA11798@tuatara.stupidest.org> References: <20070520111408.GF3253@aragorn.home.lxtec.de> <9BA71D7C90D3CBF30ABEF4AF@timothy-shimmins-power-mac-g5.local> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9BA71D7C90D3CBF30ABEF4AF@timothy-shimmins-power-mac-g5.local> X-archive-position: 11480 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Mon, May 21, 2007 at 01:25:06PM +1000, Timothy Shimmin wrote: > However, we can init it on definition to shut the warning up. Please don't. 
It's not a bug, doing an unneeded init just to silence gcc (gcc could be smarter in some cases) has the potential to hide real bugs later on. From owner-xfs@oss.sgi.com Sun May 20 22:10:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 22:10:33 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L5APfB013870 for ; Sun, 20 May 2007 22:10:27 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA27626; Mon, 21 May 2007 15:10:19 +1000 Date: Mon, 21 May 2007 15:14:34 +1000 From: Timothy Shimmin To: Chris Wedgwood cc: riesebie@lxtec.de, xfs@oss.sgi.com Subject: Re: [xfs-masters] 2.6.22-rc2 built on ppc Message-ID: In-Reply-To: <20070521042542.GA11798@tuatara.stupidest.org> References: <20070520111408.GF3253@aragorn.home.lxtec.de> <9BA71D7C90D3CBF30ABEF4AF@timothy-shimmins-power-mac-g5.local> <20070521042542.GA11798@tuatara.stupidest.org> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11481 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs --On 20 May 2007 9:25:42 PM -0700 Chris Wedgwood wrote: > On Mon, May 21, 2007 at 01:25:06PM +1000, Timothy Shimmin wrote: > >> However, we can init it on definition to shut the warning up. > > Please don't. It's not a bug, doing an unneeded init just to silence > gcc (gcc could be smarter in some cases) has the potential to hide > real bugs later on. :) Yeah, I was discussing this with someone earlier today.
He suggested that it was better to init it (to say NULLRTBLOCK) because it was more likely to find a problem (if we change/add code later on incorrectly) than if we leave it to be uninitialised with a random value. --Tim From owner-xfs@oss.sgi.com Sun May 20 22:24:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 22:24:57 -0700 (PDT) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4L5OrfB020521 for ; Sun, 20 May 2007 22:24:54 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l4L5L4Yx006226 for ; Mon, 21 May 2007 01:21:04 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l4L5OnIv160264 for ; Sun, 20 May 2007 23:24:49 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l4L5OnjC028311 for ; Sun, 20 May 2007 23:24:49 -0600 Received: from sig-9-65-12-108.mts.ibm.com (sig-9-65-12-108.mts.ibm.com [9.65.12.108]) by d03av03.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l4L5OlG7028155; Sun, 20 May 2007 23:24:47 -0600 Subject: Re: [PATCH 0/6][TAKE4] fallocate system call From: Mingming Cao Reply-To: cmm@us.ibm.com To: Andrew Morton Cc: "Amit K.
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com In-Reply-To: <20070518234444.f53a4230.akpm@linux-foundation.org> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070517141115.GA24260@amitarora.in.ibm.com> <20070518234444.f53a4230.akpm@linux-foundation.org> Content-Type: text/plain Organization: IBM LTC Date: Sun, 20 May 2007 22:24:42 -0700 Message-Id: <1179725082.3936.40.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 (2.0.4-7) Content-Transfer-Encoding: 7bit X-archive-position: 11482 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cmm@us.ibm.com Precedence: bulk X-list: xfs On Fri, 2007-05-18 at 23:44 -0700, Andrew Morton wrote: > On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" wrote: > > > fallocate() is a new system call being proposed here which will allow > > applications to preallocate space to any file(s) in a file system. > > I merged the first three patches into -mm, thanks. > > All the system call numbers got changed due to recent additions. They > may change in the future, too - nothing is stable until the code lands > in mainline. > In case you haven't realize it, the ia64 fallocate() patch comes with Amit's takes 4 fallocate patch series (3/6) missing one line change, thus fail to compile on ia64. Here is the updated one. Patch tested on ia64. (compile and fsx) fallocate() on ia64 ia64 fallocate syscall support. 
Signed-Off-By: Dave Chinner --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-18 16:30:16.000000000 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-18 16:32:45.000000000 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-18 16:30:16.000000000 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-18 17:34:58.000000000 -0700 @@ -296,11 +296,12 @@ #define __NR_getcpu 1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ -#define NR_syscalls 283 /* length of syscall table */ +#define NR_syscalls 285 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION #define __ARCH_WANT_SYS_RT_SIGSUSPEND > I didn't merge any of the ext4 changes as they appear to be in Ted's > devel tree. Although I didn't check that they are 100% the same in > that tree. > Since both Amit and Ted are traveling, I will jump in... Most likely it's not the same one. What in Ted's devel tree is "takes 2" patches. I have incorporated takes 4 patches in the backing ext4 patch git tree here: I have tested these patch series on ia64,ppc64,x86 and x86_64. I am not sure if Ted got a chance to update his ext4 git tree from this patch queue git tree yet. > What's the plan to get some ext4 updates into mainline, btw? Things > seem to be rather gradual. Last time Ted and I discussed we all agree fallocate patches should go into mainline. 
Actually most patches marked before the "unstable patches" can get into mainline, especially the following patches (contains a few bug fixes patches) # New patch to fix whitespace before applying new patches whitespace.patch #New patch to remove unnecessary exported symbols ext4_remove_exported_symbles.patch # New patch to add mount option to turn off extents ext4_noextent_mount_opt.patch # Now Turn on extents feature by default ext4_extents_on_by_default.patch #New patch to propagate inode flags ext4-propagate_flags.patch #New patch to add extent sanity checks ext4-extent-sanity-checks.patch #New patch to free blocks when failed to insert an extent ext4-free-blocks-on-insert-extent-failure.patch We already missed rc-1 window, but if possible, I would like to see ext4 fallocate patches and above patches in mainline 2.6.22. The nanosecond timestamp patch is probably good to go also. Regards, Mingming > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From owner-xfs@oss.sgi.com Sun May 20 23:30:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 20 May 2007 23:30:23 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L6UHfB004904 for ; Sun, 20 May 2007 23:30:18 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA29411; Mon, 21 May 2007 16:30:11 +1000 Date: Mon, 21 May 2007 16:34:26 +1000 From: Timothy Shimmin To: David Chinner , xfs-dev cc: xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070520233417.GA85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> 
X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11483 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi, Trying to understand how the initialisers prior to the loop are used during the loop. It looks like the initial "type" isn't used now with this change as we always assign to it near to where we access it. Previously, we did access it in the already mapped case (type != IOMAP_READ). So why do we initialise "type" prior to the loop then? The initialised "flags" var is first accessed in the unmapped/allocated codepath to set iomap_valid to zero on BMAPI_READ or now also in the other path where it is already mapped. Previously, flags was initialised to -1 (not one of the BMAPI macros) and now it is set to BMAPI_READ. So, previously if we came straight to an unwritten/delayed/unmapped buffer we would not set iomap_valid to 0, as flags==-1, whereas now we will reset it. But what is this doing anyway - is it really necessary to reset iomap_valid? Not sure if there are any ramifications of this but just trying to see differences. So, xfs_add_to_ioend() sets up the io completion handlers. Previously, we would have set up xfs_end_bio_read (for IOMAP_READ) and now we set up xfs_end_bio_written (for IOMAP_NEW). xfs_end_bio_written does an xfs_setfilesize(ioend), which xfs_end_bio_read does not. Which I guess is the crux of this change and that is apparent. I'm having trouble following the existing code to understand what it is currently doing. So you are better off with a reviewer that knows this code (but thought I'd have a look:) We seem to be continually testing for iomap_valid which I believe checks whether the offset is within the mapped range.
If the offset is not within the mapping then we try mapping the and then retest for validity. So what happens when it is not valid even after mapping and why wouldn't it be valid. I need a better understanding of the background code I guess. Sorry, Tim. --On 21 May 2007 9:34:17 AM +1000 David Chinner wrote: > > As reported a coupl eof times on lkml since the 2.6.22 window opened, > XFS is not updating the file size corectly in all cases. This is a > result of the null files fix not updating the file size when the > write extending the file does not need to allocate a block. > > In that case, we use a read mapping of the extent, and this also > happens to use the read I/O completion handler instead of the > write I/O completion handle. Hence the file size was not updated > on I/O completion. > > Comments? > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > > --- > fs/xfs/linux-2.6/xfs_aops.c | 23 ++++++++++++++++------- > 1 file changed, 16 insertions(+), 7 deletions(-) > > Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c > =================================================================== > --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-11 16:03:59.000000000 +1000 > +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-12 23:46:54.379994052 +1000 > @@ -973,8 +973,9 @@ xfs_page_state_convert( > > bh = head = page_buffers(page); > offset = page_offset(page); > - flags = -1; > - type = IOMAP_READ; > + iomap_valid = 0; > + flags = BMAPI_READ; > + type = IOMAP_NEW; > > /* TODO: cleanup count and page_dirty */ > > @@ -1004,14 +1005,14 @@ xfs_page_state_convert( > * > * Third case, an unmapped buffer was found, and we are > * in a path where we need to write the whole page out. 
> - */ > + */ > if (buffer_unwritten(bh) || buffer_delay(bh) || > ((buffer_uptodate(bh) || PageUptodate(page)) && > !buffer_mapped(bh) && (unmapped || startio))) { > - /* > + /* > * Make sure we don't use a read-only iomap > */ > - if (flags == BMAPI_READ) > + if (flags == BMAPI_READ) > iomap_valid = 0; > > if (buffer_unwritten(bh)) { > @@ -1060,7 +1061,7 @@ xfs_page_state_convert( > * That means it must already have extents allocated > * underneath it. Map the extent by reading it. > */ > - if (!iomap_valid || type != IOMAP_READ) { > + if (!iomap_valid || flags != BMAPI_READ) { > flags = BMAPI_READ; > size = xfs_probe_cluster(inode, page, bh, > head, 1); > @@ -1071,7 +1072,15 @@ xfs_page_state_convert( > iomap_valid = xfs_iomap_valid(&iomap, offset); > } > > - type = IOMAP_READ; > + /* > + * We set the type to IOMAP_NEW in case we are doing a > + * small write at EOF that is extending the file but > + * without needing an allocation. We need to update the > + * file size on I/O completion in this case so it is > + * the same case as having just allocated a new extent > + * that we are writing into for the first time. 
> + */ > + type = IOMAP_NEW; > if (!test_and_set_bit(BH_Lock, &bh->b_state)) { > ASSERT(buffer_mapped(bh)); > if (iomap_valid) From owner-xfs@oss.sgi.com Mon May 21 00:22:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 00:22:17 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L7MBfB023720 for ; Mon, 21 May 2007 00:22:12 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA00588; Mon, 21 May 2007 17:22:04 +1000 Date: Mon, 21 May 2007 17:26:19 +1000 From: Timothy Shimmin To: Federico Sevilla III , XFS Mailing List Subject: Re: Mount Failure: Totally Zeroed Log Message-ID: <055E345AE1C3F913656DFC26@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070518060825.GD3340@fs3.ph> References: <20070518060825.GD3340@fs3.ph> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11484 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi, I haven't seen that one before :) When it says "totally zeroed log" it is actually more likely that the h_cycle of the first log record header is zeroed. typedef struct xlog_rec_header { uint h_magicno; /* log record (LR) identifier : 4 */ ---> uint h_cycle; /* write cycle of log : 4 */ i.e. the 2nd 4byte word in the ondisk log is zero. This should never happen on Linux (on IRIX a new log is zeroed with no records in it). Each 512 byte sector is stamped with a cycle# which is started at 1 and never set to zero. The cycle# is at the start of each sector or at the 2nd int for sectors which contain log record headers. 
On Linux, we always start a fresh log with an unmount record (to store the uuid) which means a freshly mkfs'ed filesystem will always have 1 entry in it and thus with a cycle# of 1. The upshot is that something has corrupted at least the first sector of your ondisk log. If it was only the 1st sector corrupted and it wasn't in the active part (tail to head) then with some hacking one would be able to run recovery theoretically :). Try a: $ xfs_logprint -d device to have a look at all the cycle#'s on the log to get an idea of the corruption for interest's sake. Hmmm...it's interesting that we don't bail out if we find this corrupted log, after the call to xlog_find_head(). We don't return an error but return 0 and a warning. Okay, so when we try to validate the record header later, the h_magicno is wrong too and this time we do EFSCORRUPTED. Probably, could have done that in xlog_find_head() really. No, I don't know what is corrupting your log. Even with the write-cache on I would have thought it would cause metadata inconsistencies on replay but I would not have thought it would cause junk in the log - as the log write should still be valid (we just have ordering issues to deal with). --Tim --On 18 May 2007 2:08:25 PM +0800 Federico Sevilla III wrote: > Hi, > > I encountered the following error after an unclean shutdown: > > Filesystem "md1": Disabling barriers, not supported by the underlying device > XFS mounting filesystem md1 > XFS: totally zeroed log > Starting XFS recovery on filesystem: md1 (logdev: internal) > Filesystem "md1": XFS internal error xlog_valid_rec_header(1) at line 3503 of file > fs/xfs/xfs_log_recover.c. 
Caller 0xf8992fbd [] xlog_valid_rec_header+0xc5/0xd5 > [xfs] > [] xlog_do_recovery_pass+0x550/0x90f [xfs] > [] xlog_do_recovery_pass+0x550/0x90f [xfs] > [] xlog_do_log_recovery+0x45/0xa6 [xfs] > [] xlog_do_recover+0x1d/0x102 [xfs] > [] xlog_recover+0x87/0x98 [xfs] > [] xfs_log_mount+0x8d/0xce [xfs] > [] xfs_mountfs+0x982/0xc63 [xfs] > [] xfs_ioinit+0x21/0x26 [xfs] > [] xfs_mount+0x30b/0x37e [xfs] > [] vfs_mount+0x28/0x2c [xfs] > [] xfs_fs_fill_super+0x6e/0x193 [xfs] > [] snprintf+0x16/0x1a > [] disk_name+0x25/0x66 > [] get_sb_bdev+0xca/0x117 > [] __link_path_walk+0xadd/0xbd2 > [] xfs_fs_get_sb+0x1e/0x22 [xfs] > [] xfs_fs_fill_super+0x0/0x193 [xfs] > [] vfs_kern_mount+0x35/0x66 > [] do_kern_mount+0x29/0x39 > [] do_new_mount+0x67/0xa4 > [] do_mount+0x153/0x16b > [] copy_mount_options+0x4c/0x99 > [] sys_mount+0x79/0xba > [] syscall_call+0x7/0xb > XFS: log mount/recovery failed: error 117 > XFS: log mount failed > > The machine runs Debian GNU/Linux 4.0 (Etch), with a custom-built 2.6.18 > kernel using the Debian 2.6.18 source and the OpenVZ patch. xfs_repair > is able to repair the filesystem, but most of the files just end up in > lost+found. I have verified that write cache on both drives has been > disabled, and hdparm is set to disable them everytime during startup. > > Any clues as to what could be causing this? > > Thank you very much. > > Cheers! > > -- > Federico Sevilla III > F S 3 Consulting Inc. 
> http://www.fs3.ph > From owner-xfs@oss.sgi.com Mon May 21 00:43:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 00:43:09 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L7h4fB026948 for ; Mon, 21 May 2007 00:43:05 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA00988; Mon, 21 May 2007 17:42:59 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4L7gwAf99145874; Mon, 21 May 2007 17:42:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4L7gvaN22313500; Mon, 21 May 2007 17:42:57 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 21 May 2007 17:42:57 +1000 From: David Chinner To: Eric Sandeen Cc: xfs@oss.sgi.com Subject: Re: [PATCH] (and bad attr2 bug) - pack xfs_sb_t for 64-bit arches Message-ID: <20070521074257.GW86004887@sgi.com> References: <455CB54F.8080901@sandeen.net> <464C69A4.6050605@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <464C69A4.6050605@sandeen.net> User-Agent: Mutt/1.4.2.1i X-archive-position: 11485 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 17, 2007 at 09:41:40AM -0500, Eric Sandeen wrote: > Eric Sandeen wrote: > >see also https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212201 > > > >Bugzilla Bug 212201: Cannot build sysem with XFS file system. > > > >I turned on attr2 in FC6 at nathan's suggestion, for selinux goodness > >with more efficient xattr space usage. 
> > > >But, many reports that this was totally broken in fc6, on x86_64. > > Although it turned out to be a different issue, not the packing issue, > is the packing/alignment (below) still something that needs to be fixed...? The problem of the screwed up structure should be fixed. However, packing the structure is not the correct solution IMO. I think we discussed the correct fix back when this was first brought up - blacklist the bad bits in the superblock, do on the fly detection and correction of the problem and make sure in future that we always leave padding in the structure so that it is always correctly translated.... Patches are welcome ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 21 01:10:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 01:10:08 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4L8A1fB031413 for ; Mon, 21 May 2007 01:10:03 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA01701; Mon, 21 May 2007 18:09:56 +1000 Date: Mon, 21 May 2007 18:14:11 +1000 From: Timothy Shimmin To: Eric Sandeen , xfs-oss Subject: Re: [PATCH] fix up xfstests a bit for linux+udf Message-ID: In-Reply-To: <464C871B.3090402@sandeen.net> References: <464C871B.3090402@sandeen.net> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11486 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Thanks, Eric.
We probably should make all ACL/EA tests load up common.attr (update ones which aren't already - likely the non-acl tests as common.attr really has acl stuff at the moment) and then do the udf/linux test in common.attr. --Tim --On 17 May 2007 11:47:23 AM -0500 Eric Sandeen wrote: > udf on linux doesn't support acls or attrs, so prevent those tests from running. > > Signed-off-by: Eric Sandeen > > Index: xfstests/020 > =================================================================== > --- xfstests.orig/020 > +++ xfstests/020 > @@ -65,7 +65,7 @@ _attr_list() > # real QA test starts here > -_supported_fs xfs udf > +_supported_fs xfs > _supported_os Linux > [ -x /usr/bin/attr ] || _notrun "attr is not installed" > Index: xfstests/051 > =================================================================== > --- xfstests.orig/051 > +++ xfstests/051 > @@ -56,7 +56,7 @@ _cleanup() > # > # real QA test starts here > -_supported_fs xfs udf > +_supported_fs xfs > _supported_os Linux > [ -x /usr/bin/chacl ] || _notrun "chacl executable not found" > Index: xfstests/062 > =================================================================== > --- xfstests.orig/062 > +++ xfstests/062 > @@ -102,7 +102,7 @@ _create_test_bed() > } > # real QA test starts here > -_supported_fs xfs udf nfs > +_supported_fs xfs nfs > _supported_os Linux > _require_scratch > Index: xfstests/070 > =================================================================== > --- xfstests.orig/070 > +++ xfstests/070 > @@ -33,6 +33,7 @@ _cleanup() > _supported_fs xfs udf nfs > _supported_os IRIX Linux > +[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux UDF does not support extended > attributes" > _setup_testdir > $FSSTRESS_PROG \ > Index: xfstests/105 > =================================================================== > --- xfstests.orig/105 > +++ xfstests/105 > @@ -35,6 +35,8 @@ _cleanup() > _supported_fs xfs udf > _supported_os IRIX Linux > +[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux 
UDF does not support ACLS" > + > # real QA test starts here > rm -f $seq.full > From owner-xfs@oss.sgi.com Mon May 21 06:15:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 06:15:51 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4LDFkfB015539 for ; Mon, 21 May 2007 06:15:47 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id D3CF91807A959; Mon, 21 May 2007 08:15:45 -0500 (CDT) Message-ID: <46519B81.2000202@sandeen.net> Date: Mon, 21 May 2007 08:15:45 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Timothy Shimmin CC: xfs-oss Subject: Re: [PATCH] fix up xfstests a bit for linux+udf References: <464C871B.3090402@sandeen.net> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11487 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Timothy Shimmin wrote: > Thanks, Eric. > > We probably should make all ACL/EA tests load up common.attr > (update ones which aren't already - likely the non-acl tests > as common.attr really has acl stuff at the moment) > and then do the udf/linux test in common.attr. ok, I can do that. I hit a couple other changes, I'll send them all along when I can. -Eric > --Tim > > --On 17 May 2007 11:47:23 AM -0500 Eric Sandeen > wrote: > >> udf on linux doesn't support acls or attrs, so prevent those tests >> from running. 
>> >> Signed-off-by: Eric Sandeen >> >> Index: xfstests/020 >> =================================================================== >> --- xfstests.orig/020 >> +++ xfstests/020 >> @@ -65,7 +65,7 @@ _attr_list() >> # real QA test starts here >> -_supported_fs xfs udf >> +_supported_fs xfs >> _supported_os Linux >> [ -x /usr/bin/attr ] || _notrun "attr is not installed" >> Index: xfstests/051 >> =================================================================== >> --- xfstests.orig/051 >> +++ xfstests/051 >> @@ -56,7 +56,7 @@ _cleanup() >> # >> # real QA test starts here >> -_supported_fs xfs udf >> +_supported_fs xfs >> _supported_os Linux >> [ -x /usr/bin/chacl ] || _notrun "chacl executable not found" >> Index: xfstests/062 >> =================================================================== >> --- xfstests.orig/062 >> +++ xfstests/062 >> @@ -102,7 +102,7 @@ _create_test_bed() >> } >> # real QA test starts here >> -_supported_fs xfs udf nfs >> +_supported_fs xfs nfs >> _supported_os Linux >> _require_scratch >> Index: xfstests/070 >> =================================================================== >> --- xfstests.orig/070 >> +++ xfstests/070 >> @@ -33,6 +33,7 @@ _cleanup() >> _supported_fs xfs udf nfs >> _supported_os IRIX Linux >> +[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux UDF does >> not support extended >> attributes" >> _setup_testdir >> $FSSTRESS_PROG \ >> Index: xfstests/105 >> =================================================================== >> --- xfstests.orig/105 >> +++ xfstests/105 >> @@ -35,6 +35,8 @@ _cleanup() >> _supported_fs xfs udf >> _supported_os IRIX Linux >> +[ "$FSTYP" == udf -a "$HOSTOS" == Linux ] && _notrun "Linux UDF does >> not support ACLS" >> + >> # real QA test starts here >> rm -f $seq.full >> > > > > From owner-xfs@oss.sgi.com Mon May 21 16:32:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 16:32:44 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com 
[134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4LNWYfB013251 for ; Mon, 21 May 2007 16:32:36 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA25948; Tue, 22 May 2007 09:32:29 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id B2A2758CA531; Tue, 22 May 2007 09:32:29 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 964983 - vmalloc leak on mount/unmount Message-Id: <20070521233229.B2A2758CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 09:32:29 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11488 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix double free in xfs_buf_get_noaddr error handling path Signed-off-by: Christoph Hellwig Date: Tue May 22 09:31:46 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@lst.de The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28639a fs/xfs/linux-2.6/xfs_buf.c - 1.238 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.238&r2=text&tr2=1.237&f=h - Fix double free in xfs_buf_get_noaddr error handling path. 
From owner-xfs@oss.sgi.com Mon May 21 16:39:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 16:39:16 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4LNd5fB014946 for ; Mon, 21 May 2007 16:39:07 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA26069; Tue, 22 May 2007 09:39:02 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 4571058CA531; Tue, 22 May 2007 09:39:02 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 964983 - vmalloc leak on mount/unmount Message-Id: <20070521233902.4571058CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 09:39:02 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11489 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix vmalloc leak on mount/unmount. When setting the length of the iclogbuf to write out we should just be changing the desired byte count rather than completely reassociating the buffer memory with the buffer. Reassociating the buffer memory changes the apparent length of the buffer and hence when we free the buffer, we don't free all the vmap()d space we originally allocated. Date: Tue May 22 09:38:29 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@lst.de The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28640a fs/xfs/xfs_log.c - 1.331 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.331&r2=text&tr2=1.330&f=h - Set the iclogbuf I/O size correctly rather than reassociating the memory which can lead to vmalloc leaks.
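For readers following the archive, the leak described in the checkin message above can be sketched with a toy model. This is only an illustration under stated assumptions: `struct toy_buf` and the `toy_buf_*` helpers are hypothetical names invented here, not the XFS buffer API, and the "released" byte count merely stands in for how much mapped space a free path would tear down based on the buffer's recorded length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a log buffer; NOT the real XFS buffer type. */
struct toy_buf {
	void	*addr;		/* start of the backing memory */
	size_t	alloc_len;	/* recorded length, consulted when freeing */
	size_t	io_count;	/* bytes to transfer on the next I/O */
};

/* Allocate backing memory and record its full length. */
static int toy_buf_init(struct toy_buf *bp, size_t len)
{
	bp->addr = malloc(len);
	if (!bp->addr)
		return -1;
	bp->alloc_len = len;
	bp->io_count = len;
	return 0;
}

/*
 * Leaky pattern: "reassociating" memory with a shorter length also
 * overwrites the recorded length of the buffer.
 */
static void toy_buf_reassociate(struct toy_buf *bp, void *mem, size_t len)
{
	bp->addr = mem;
	bp->alloc_len = len;
	bp->io_count = len;
}

/* Safer pattern: only adjust the byte count for the next I/O. */
static int toy_buf_set_count(struct toy_buf *bp, size_t count)
{
	if (count > bp->alloc_len)
		return -1;
	bp->io_count = count;
	return 0;
}

/*
 * Free the buffer and report how many bytes the recorded length says
 * were released. In this model, anything short of the original
 * allocation size represents leaked mapping space.
 */
static size_t toy_buf_free(struct toy_buf *bp)
{
	size_t released = bp->alloc_len;

	free(bp->addr);
	bp->addr = NULL;
	bp->alloc_len = 0;
	bp->io_count = 0;
	return released;
}
```

In this model, a 32768-byte buffer that is reassociated at 4096 bytes reports only 4096 bytes released on free, while one whose I/O count is merely set to 4096 still reports the full 32768.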
From owner-xfs@oss.sgi.com Mon May 21 17:14:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 17:14:16 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M0E6fB023053 for ; Mon, 21 May 2007 17:14:08 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA26955; Tue, 22 May 2007 10:14:02 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 4147858CA531; Tue, 22 May 2007 10:14:02 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 962012 - invalid fullstat when recalling files Message-Id: <20070522001402.4147858CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 10:14:02 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11490 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Make hole punching at EOF atomic. If hole punching at EOF is done as two steps (i.e. truncate then extend) the file is in a transient state between the two steps where an application can see the incorrect file size. Punching a hole to EOF needs to be treated in the same way as all other hole punching cases so that the file size is never seen to change. Date: Tue May 22 10:13:27 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: vapo@sgi.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28641a fs/xfs/xfs_rw.h - 1.81 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_rw.h.diff?r1=text&tr1=1.81&r2=text&tr2=1.80&f=h - export and define flags for xfs_free_eofblocks.
fs/xfs/xfs_vnodeops.c - 1.696 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.696&r2=text&tr2=1.695&f=h - convert callers to new xfs_free_eofblocks() interface and allow it to be called with the iolock already held. fs/xfs/linux-2.6/xfs_ksyms.c - 1.58 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_ksyms.c.diff?r1=text&tr1=1.58&r2=text&tr2=1.57&f=h - export xfs_free_eofblocks. fs/xfs/dmapi/xfs_dm.c - 1.36 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/dmapi/xfs_dm.c.diff?r1=text&tr1=1.36&r2=text&tr2=1.35&f=h - Use UNRESVSP to punch a hole at EOF rather than FREESP + xfs_setattr to truncate and then extend the file. This means we also need to truncate away blocks past EOF using xfs_free_eofblocks(). From owner-xfs@oss.sgi.com Mon May 21 17:23:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 17:23:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M0N9fB025711 for ; Mon, 21 May 2007 17:23:12 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA27258; Tue, 22 May 2007 10:23:06 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 1050858CA531; Tue, 22 May 2007 10:23:06 +1000 (EST) Cc: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964986 - make shrinker use generic interfaces Message-Id: <20070522002306.1050858CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 10:23:06 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11491 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Use generic shrinker interfaces in XFS.
Signed-Off-By: Andrew Morton Date: Tue May 22 10:22:43 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: akpm@linux-foundation.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28642a fs/xfs/quota/xfs_qm.c - 1.49 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/quota/xfs_qm.c.diff?r1=text&tr1=1.49&r2=text&tr2=1.48&f=h - remove shrinker wrappers and call them directly. fs/xfs/linux-2.6/xfs_buf.c - 1.239 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.239&r2=text&tr2=1.238&f=h - remove shrinker wrappers and call them directly. fs/xfs/linux-2.6/kmem.h - 1.42 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/kmem.h.diff?r1=text&tr1=1.42&r2=text&tr2=1.41&f=h - remove shrinker wrappers and call them directly. fs/xfs/linux-2.6/xfs_ksyms.c - 1.59 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_ksyms.c.diff?r1=text&tr1=1.59&r2=text&tr2=1.58&f=h - remove shrinker wrappers and call them directly. 
From owner-xfs@oss.sgi.com Mon May 21 17:44:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 17:44:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M0iIfB030309 for ; Mon, 21 May 2007 17:44:21 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA27876; Tue, 22 May 2007 10:44:16 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4M0iFAf100555308; Tue, 22 May 2007 10:44:16 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4M0iEXs100610241; Tue, 22 May 2007 10:44:14 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 22 May 2007 10:44:14 +1000 From: David Chinner To: Timothy Shimmin Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070522004414.GL85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> User-Agent: Mutt/1.4.2.1i X-archive-position: 11492 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > Hi, > > Trying to understand how the initialisers prior to the loop > are used during the loop. > It looks like the initial "type" isn't used now with this change > as we always assign to it near to where we access it. Yes. 
> Previously, we did access it in the already mapped case (type != > IOMAP_READ). > So why do we initialise "type" prior to the loop then? Because it's good practice - it's not obvious that it gets initialised in all cases within the loop, so let's make sure that it is.... > The inited "flags" var is 1st accessed in the unmapped/allocated > codepath to set iomap_valid to zero on BMAPI_READ > or now also in the other path where it is already mapped. ..... > Not sure if there are any ramifications of this but just trying to see > differences. There is no difference in behaviour.... > So, xfs_add_to_ioend() sets up the io completion handlers. > Previously, we would have set up xfs_end_bio_read (for IOMAP_READ) > and now we set up xfs_end_bio_written (for IOMAP_NEW). > xfs_end_bio_written does an xfs_setfilesize(ioend) > which an xfs_end_bio_read does not. > Which I guess is the crux of this change and > that is apparent. Yes. > I'm having trouble following the existing code to understand what > it is currently doing. So you are better off with a reviewer that > knows this code (but thought I'd have a look:) It's not obvious at all, is it? > We seem to be continually testing for iomap_valid which I believe > checks whether the offset is within the mapped range. If the > offset is not within the mapping then we try mapping the size> and then retest for validity. So what happens when it is > not valid even after mapping and why wouldn't it be valid. I need > a better understanding of the background code I guess. The iomap that we get is a mapping of a range that is the same type as the current bufferhead we are processing. Hence the mapping may extend much further than the current buffer (e.g. a delalloc or unwritten mapping will extend to the end of the extent). Therefore, as we walk each buffer, we need to check that it falls within the current mapping, and if it doesn't we need to map a new range.
We also need to invalidate the mapping if the buffer changes from a mapped buffer to an unmapped/unwritten/delalloc buffer as we need to do extra work there.... I guess I need to ping Christoph and Nathan on this one.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 21 17:57:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 17:57:36 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4M0vVfB000939 for ; Mon, 21 May 2007 17:57:33 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 8222192C513; Tue, 22 May 2007 10:57:29 +1000 (EST) Subject: Re: Review: ensure EOF writes into existing extents update filesize From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: Timothy Shimmin , xfs-dev , xfs-oss In-Reply-To: <20070522004414.GL85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> Content-Type: text/plain Organization: Aconex Date: Tue, 22 May 2007 11:02:59 +1000 Message-Id: <1179795779.6273.510.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11493 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Tue, 2007-05-22 at 10:44 +1000, David Chinner wrote: > On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > I guess I need to ping Christoph and Nathan on this one.... Could you resend the patch to me please? I lost the previous copy while ruthlessly ploughing through my mail backlog. ;) cheers. 
-- Nathan From owner-xfs@oss.sgi.com Mon May 21 18:03:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 18:03:35 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M13TfB002595 for ; Mon, 21 May 2007 18:03:31 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA28620; Tue, 22 May 2007 11:03:28 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4M13RAf100579614; Tue, 22 May 2007 11:03:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4M13Qq191237047; Tue, 22 May 2007 11:03:26 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 22 May 2007 11:03:26 +1000 From: David Chinner To: Nathan Scott Cc: David Chinner , Timothy Shimmin , xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070522010326.GN85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> <1179795779.6273.510.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1179795779.6273.510.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11494 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 22, 2007 at 11:02:59AM +1000, Nathan Scott wrote: > On Tue, 2007-05-22 at 10:44 +1000, David Chinner wrote: > > On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > > I guess I need to ping Christoph and Nathan on this one.... 
> > Could you resend the patch to me please? I lost the previous copy > while ruthlessly ploughing through my mail backlog. ;) Below. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-11 16:03:59.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-12 23:46:54.379994052 +1000 @@ -973,8 +973,9 @@ xfs_page_state_convert( bh = head = page_buffers(page); offset = page_offset(page); - flags = -1; - type = IOMAP_READ; + iomap_valid = 0; + flags = BMAPI_READ; + type = IOMAP_NEW; /* TODO: cleanup count and page_dirty */ @@ -1004,14 +1005,14 @@ xfs_page_state_convert( * * Third case, an unmapped buffer was found, and we are * in a path where we need to write the whole page out. - */ + */ if (buffer_unwritten(bh) || buffer_delay(bh) || ((buffer_uptodate(bh) || PageUptodate(page)) && !buffer_mapped(bh) && (unmapped || startio))) { - /* + /* * Make sure we don't use a read-only iomap */ - if (flags == BMAPI_READ) + if (flags == BMAPI_READ) iomap_valid = 0; if (buffer_unwritten(bh)) { @@ -1060,7 +1061,7 @@ xfs_page_state_convert( * That means it must already have extents allocated * underneath it. Map the extent by reading it. */ - if (!iomap_valid || type != IOMAP_READ) { + if (!iomap_valid || flags != BMAPI_READ) { flags = BMAPI_READ; size = xfs_probe_cluster(inode, page, bh, head, 1); @@ -1071,7 +1072,15 @@ xfs_page_state_convert( iomap_valid = xfs_iomap_valid(&iomap, offset); } - type = IOMAP_READ; + /* + * We set the type to IOMAP_NEW in case we are doing a + * small write at EOF that is extending the file but + * without needing an allocation. 
We need to update the + * file size on I/O completion in this case so it is + * the same case as having just allocated a new extent + * that we are writing into for the first time. + */ + type = IOMAP_NEW; if (!test_and_set_bit(BH_Lock, &bh->b_state)) { ASSERT(buffer_mapped(bh)); if (iomap_valid) From owner-xfs@oss.sgi.com Mon May 21 21:04:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 21:04:49 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4M44hWt017986 for ; Mon, 21 May 2007 21:04:45 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 5E30B92C398; Tue, 22 May 2007 14:04:43 +1000 (EST) Subject: Re: Review: ensure EOF writes into existing extents update filesize From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: Timothy Shimmin , xfs-dev , xfs-oss In-Reply-To: <20070522010326.GN85884050@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> <1179795779.6273.510.camel@edge> <20070522010326.GN85884050@sgi.com> Content-Type: multipart/mixed; boundary="=-Ol4eDZKyAzPXZLgYQsmK" Organization: Aconex Date: Tue, 22 May 2007 14:10:14 +1000 Message-Id: <1179807014.6273.519.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 X-archive-position: 11495 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs --=-Ol4eDZKyAzPXZLgYQsmK Content-Type: text/plain Content-Transfer-Encoding: 7bit On Tue, 2007-05-22 at 11:03 +1000, David Chinner wrote: > On Tue, May 22, 2007 at 11:02:59AM +1000, Nathan Scott wrote: > > On Tue, 2007-05-22 at 10:44 +1000, David Chinner wrote: > > > On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > > > I guess I need to ping Christoph and 
Nathan on this one.... > > > > Could you resend the patch to me please? I lost the previous copy > > while ruthlessly ploughing through my mail backlog. ;) > > Below. Looks pretty good to me - xfs_convert_page has been overlooked, I think - attached patch fixes that. You also initialise iomap_valid twice inside xfs_page_state_convert now ... I reverted that to just the once. cheers. -- Nathan --=-Ol4eDZKyAzPXZLgYQsmK Content-Disposition: attachment; filename=nat.patch Content-Type: text/x-patch; name=nat.patch; charset=UTF-8 Content-Transfer-Encoding: 7bit Index: linux/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-22 13:47:40.296829500 +1000 +++ linux/fs/xfs/linux-2.6/xfs_aops.c 2007-05-22 13:48:25.987685000 +1000 @@ -810,7 +810,7 @@ xfs_convert_page( page_dirty--; count++; } else { - type = 0; + type = IOMAP_NEW; if (buffer_mapped(bh) && all_bh && startio) { lock_buffer(bh); xfs_add_to_ioend(inode, bh, offset, @@ -968,7 +968,6 @@ xfs_page_state_convert( bh = head = page_buffers(page); offset = page_offset(page); - iomap_valid = 0; flags = BMAPI_READ; type = IOMAP_NEW; --=-Ol4eDZKyAzPXZLgYQsmK-- From owner-xfs@oss.sgi.com Mon May 21 23:06:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 23:06:09 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M663Wt014354 for ; Mon, 21 May 2007 23:06:05 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA05308; Tue, 22 May 2007 16:06:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4M661Af100530880; Tue, 22 May 2007 16:06:01 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com 
(SGI-8.12.5/8.12.5/Submit) id l4M65xxq88891681; Tue, 22 May 2007 16:05:59 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 22 May 2007 16:05:59 +1000 From: David Chinner To: Nathan Scott Cc: David Chinner , Timothy Shimmin , xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070522060559.GI86004887@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> <1179795779.6273.510.camel@edge> <20070522010326.GN85884050@sgi.com> <1179807014.6273.519.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1179807014.6273.519.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11496 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 22, 2007 at 02:10:14PM +1000, Nathan Scott wrote: > On Tue, 2007-05-22 at 11:03 +1000, David Chinner wrote: > > On Tue, May 22, 2007 at 11:02:59AM +1000, Nathan Scott wrote: > > > On Tue, 2007-05-22 at 10:44 +1000, David Chinner wrote: > > > > On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > > > > I guess I need to ping Christoph and Nathan on this one.... > > > > > > Could you resend the patch to me please? I lost the previous copy > > > while ruthlessly ploughing through my mail backlog. ;) > > > > Below. > > Looks pretty good to me - xfs_convert_page has been overlooked, I > think - attached patch fixes that. I thought about that, then tried to trip the problem and was not successful. AFAICT, if we have multiple pages dirty and in the same state (i.e. 
pwrite 0 32769 to dirty 3 pages, then sync, then pwrite 0 33000 to dirty and extend) we should hit the case where we cluster the dirty pages and call xfs_convert_page() on the last two pages. In that case, the entire range should be mapped via the one iomap and so the last buffer (the one we extended) should be added to an ioend with a type of 0 (IOMAP_READ) and hence we should see the bug.

With the patch I posted, I can't get this to show the problem. It should, but it doesn't....

I'll make the change anyway to be safe, but I'm still perplexed as to why it doesn't seem necessary....

Ah - there it is - xfs_is_delayed_page():

		if (buffer_unwritten(bh))
			acceptable = (type == IOMAP_UNWRITTEN);
		else if (buffer_delay(bh))
			acceptable = (type == IOMAP_DELAY);
		else if (buffer_dirty(bh) && buffer_mapped(bh))
>>>>>>>>>>>		acceptable = (type == 0);
		else
			break;

The ioend we started with now has type = IOMAP_NEW = 0x40 which means xfs_convert_page() aborts clustering this case immediately. IOWs, we are never getting to this xfs_convert_page() case and we are only passing through xfs_page_state_convert for mapped pages.

> You also initialise iomap_valid
> twice inside xfs_page_state_convert now ... I reverted that to just
> the once.

I'll fold these changes in and fixup xfs_is_delayed_page() to look for type == IOMAP_NEW and send out a new patch. Thanks, Nathan.

Cheers,

Dave.
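
The type check described in the message above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the kernel code: `struct buf_state` and `buffer_acceptable()` are hypothetical stand-ins for the `buffer_*()` tests and the per-buffer branch of xfs_is_delayed_page(), and only the values 0 (read) and 0x40 (IOMAP_NEW) come from the message; the other constants are assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only IOMAP_READ == 0 and IOMAP_NEW == 0x40 come from the thread;
 * the other two values are assumptions made for this sketch. */
enum {
    IOMAP_READ      = 0x00,
    IOMAP_DELAY     = 0x04,
    IOMAP_UNWRITTEN = 0x20,
    IOMAP_NEW       = 0x40
};

/* Hypothetical stand-in for the state bits tested on a buffer head. */
struct buf_state {
    bool unwritten;
    bool delay;
    bool dirty;
    bool mapped;
};

/*
 * Returns true when a buffer in this state may be clustered into an
 * ioend of the given type. mapped_type is the value that dirty, mapped
 * buffers are compared against: 0 before the fix, IOMAP_NEW after it.
 */
static bool buffer_acceptable(const struct buf_state *b, int type,
                              int mapped_type)
{
    assert(b != NULL);
    if (b->unwritten)
        return type == IOMAP_UNWRITTEN;
    if (b->delay)
        return type == IOMAP_DELAY;
    if (b->dirty && b->mapped)
        return type == mapped_type;
    return false;
}
```

With a dirty, mapped buffer and an ioend of type IOMAP_NEW, the check against 0 rejects the page, which matches the observation that clustering aborted immediately; comparing against IOMAP_NEW accepts it.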
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 21 23:30:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 21 May 2007 23:30:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M6UJWt021102 for ; Mon, 21 May 2007 23:30:22 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA05721; Tue, 22 May 2007 16:19:20 +1000 To: "David Chinner" , xfs-dev Subject: Re: review [3 of 3]: lazy superblock counters - userspace bits From: "Barry Naujok" Organization: SGI Cc: xfs-oss Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 MIME-Version: 1.0 References: <20070419232133.GZ48531920@melbourne.sgi.com> Content-Transfer-Encoding: 7bit Date: Tue, 22 May 2007 16:19:28 +1000 Message-ID: In-Reply-To: <20070419232133.GZ48531920@melbourne.sgi.com> User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11497 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs On Fri, 20 Apr 2007 09:21:33 +1000, David Chinner wrote: > > userspace tool support for lazy superblock accounting. > Looks fine to me. 
From owner-xfs@oss.sgi.com Tue May 22 00:59:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 00:59:47 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M7xaWt014097 for ; Tue, 22 May 2007 00:59:38 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA07913; Tue, 22 May 2007 17:59:33 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id E665058CA531; Tue, 22 May 2007 17:59:32 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 964999 - lazy superblock counters for XFS Message-Id: <20070522075932.E665058CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 17:59:32 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11498 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Lazy Superblock Counters When we have a couple of hundred transactions on the fly at once, they all typically modify the on disk superblock in some way. create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify free block counts. When these counts are modified in a transaction, they must eventually lock the superblock buffer and apply the mods. The buffer then remains locked until the transaction is committed into the incore log buffer. The result of this is that with enough transactions on the fly the incore superblock buffer becomes a bottleneck. The result of contention on the incore superblock buffer is that transaction rates fall - the more pressure that is put on the superblock buffer, the slower things go. The key to removing the contention is to not require the superblock fields in question to be locked. We do that by not marking the superblock dirty in the transaction.
IOWs, we modify the incore superblock but do not modify the cached superblock buffer. In short, we do not log superblock modifications to critical fields in the superblock on every transaction. In fact we only do it just before we write the superblock to disk every sync period or just before unmount. This creates an interesting problem - if we don't log or write out the fields in every transaction, then how do the values get recovered after a crash? The answer is simple - we keep enough duplicate, logged information in other structures that we can reconstruct the correct count after log recovery has been performed. It is the AGF and AGI structures that contain the duplicate information; after recovery, we walk every AGI and AGF and sum their individual counters to get the correct value, and we do a transaction into the log to correct them. An optimisation of this is that if we have a clean unmount record, we know the value in the superblock is correct, so we can avoid the summation walk under normal conditions and so mount/recovery times do not change under normal operation. One wrinkle that was discovered during development was that the blocks used in the freespace btrees are never accounted for in the AGF counters. This was once a valid optimisation to make; when the filesystem is full, the free space btrees are empty and consume no space. Hence when it matters, the "accounting" is correct. But that means that when we do the AGF summations, we would not have a correct count and xfs_check would complain. Hence a new counter was added to track the number of blocks used by the free space btrees. This is an *on-disk format change*. As a result of this, lazy superblock counters are a mkfs option and at the moment on linux there is no way to convert an old filesystem. This is possible - xfs_db can be used to twiddle the right bits and then xfs_repair will do the format conversion for you. Similarly, you can convert backwards as well.
At some point we'll add functionality to xfs_admin to do the bit twiddling easily.... Date: Tue May 22 17:58:49 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28652a fs/xfs/xfsidbg.c - 1.314 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.314&r2=text&tr2=1.313&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_log.c - 1.332 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log.c.diff?r1=text&tr1=1.332&r2=text&tr2=1.331&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_ialloc.h - 1.47 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_ialloc.h.diff?r1=text&tr1=1.47&r2=text&tr2=1.46&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_ialloc.c - 1.194 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_ialloc.c.diff?r1=text&tr1=1.194&r2=text&tr2=1.193&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_ag.h - 1.59 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_ag.h.diff?r1=text&tr1=1.59&r2=text&tr2=1.58&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_sb.h - 1.68 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_sb.h.diff?r1=text&tr1=1.68&r2=text&tr2=1.67&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_fs.h - 1.33 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fs.h.diff?r1=text&tr1=1.33&r2=text&tr2=1.32&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_log_recover.c - 1.319 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log_recover.c.diff?r1=text&tr1=1.319&r2=text&tr2=1.318&f=h - Changes to support lazy superblock counters. 
fs/xfs/xfs_vfsops.c - 1.520 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.520&r2=text&tr2=1.519&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_mount.h - 1.236 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.h.diff?r1=text&tr1=1.236&r2=text&tr2=1.235&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_mount.c - 1.395 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=text&tr1=1.395&r2=text&tr2=1.394&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_trans.c - 1.179 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_trans.c.diff?r1=text&tr1=1.179&r2=text&tr2=1.178&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_trans.h - 1.145 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_trans.h.diff?r1=text&tr1=1.145&r2=text&tr2=1.144&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_alloc.c - 1.186 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_alloc.c.diff?r1=text&tr1=1.186&r2=text&tr2=1.185&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_alloc.h - 1.62 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_alloc.h.diff?r1=text&tr1=1.62&r2=text&tr2=1.61&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_fsops.c - 1.124 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c.diff?r1=text&tr1=1.124&r2=text&tr2=1.123&f=h - Changes to support lazy superblock counters. fs/xfs/xfs_alloc_btree.c - 1.91 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_alloc_btree.c.diff?r1=text&tr1=1.91&r2=text&tr2=1.90&f=h - Changes to support lazy superblock counters. fs/xfs/linux-2.6/xfs_vfs.h - 1.70 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_vfs.h.diff?r1=text&tr1=1.70&r2=text&tr2=1.69&f=h - Changes to support lazy superblock counters. 
fs/xfs/linux-2.6/xfs_super.c - 1.381 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_super.c.diff?r1=text&tr1=1.381&r2=text&tr2=1.380&f=h - Changes to support lazy superblock counters. fs/xfs/linux-2.4/xfs_vfs.h - 1.66 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_vfs.h.diff?r1=text&tr1=1.66&r2=text&tr2=1.65&f=h - Changes to support lazy superblock counters. fs/xfs/linux-2.4/xfs_super.c - 1.336 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_super.c.diff?r1=text&tr1=1.336&r2=text&tr2=1.335&f=h - Changes to support lazy superblock counters. From owner-xfs@oss.sgi.com Tue May 22 01:04:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 01:04:42 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M84VWt015922 for ; Tue, 22 May 2007 01:04:33 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA08095; Tue, 22 May 2007 18:04:28 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 1C4F858CA531; Tue, 22 May 2007 18:04:28 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 964999 - lazy superblock counters for XFS Message-Id: <20070522080428.1C4F858CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 18:04:28 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11499 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix the transaction flags to make lazy superblock counters work. 
Date: Tue May 22 18:03:50 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28653a fs/xfs/xfs_trans.c - 1.180 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_trans.c.diff?r1=text&tr1=1.180&r2=text&tr2=1.179&f=h - Only conditionally dirty the superblock in the transaction if lazy superblock counters are being used. From owner-xfs@oss.sgi.com Tue May 22 01:42:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 01:42:23 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4M8gCWt024271 for ; Tue, 22 May 2007 01:42:14 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA09395; Tue, 22 May 2007 18:42:07 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id AFF6658CA531; Tue, 22 May 2007 18:42:07 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 964999 - lazy superblock counters for XFS Message-Id: <20070522084207.AFF6658CA531@chook.melbourne.sgi.com> Date: Tue, 22 May 2007 18:42:07 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11500 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Userspace support for lazy superblock counters libxfs changes to match kernel support repair, db, growfs and mkfs changes needed to support this feature.
Date: Tue May 22 18:41:32 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/xfs-cmds Inspected by: bnaujok@sgi.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28654a xfsprogs/db/sb.c - 1.22 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/db/sb.c.diff?r1=text&tr1=1.22&r2=text&tr2=1.21&f=h xfsprogs/db/agf.c - 1.12 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/db/agf.c.diff?r1=text&tr1=1.12&r2=text&tr2=1.11&f=h xfsprogs/db/check.c - 1.33 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/db/check.c.diff?r1=text&tr1=1.33&r2=text&tr2=1.32&f=h xfsprogs/man/man8/mkfs.xfs.8 - 1.26 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/man/man8/mkfs.xfs.8.diff?r1=text&tr1=1.26&r2=text&tr2=1.25&f=h xfsprogs/mkfs/xfs_mkfs.c - 1.80 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/mkfs/xfs_mkfs.c.diff?r1=text&tr1=1.80&r2=text&tr2=1.79&f=h xfsprogs/growfs/xfs_growfs.c - 1.26 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/growfs/xfs_growfs.c.diff?r1=text&tr1=1.26&r2=text&tr2=1.25&f=h xfsprogs/include/xfs_ialloc.h - 1.11 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_ialloc.h.diff?r1=text&tr1=1.11&r2=text&tr2=1.10&f=h xfsprogs/include/xfs_ag.h - 1.20 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_ag.h.diff?r1=text&tr1=1.20&r2=text&tr2=1.19&f=h xfsprogs/include/xfs_sb.h - 1.19 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_sb.h.diff?r1=text&tr1=1.19&r2=text&tr2=1.18&f=h xfsprogs/include/xfs_fs.h - 1.39 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_fs.h.diff?r1=text&tr1=1.39&r2=text&tr2=1.38&f=h xfsprogs/include/libxfs.h - 1.60 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/libxfs.h.diff?r1=text&tr1=1.60&r2=text&tr2=1.59&f=h 
xfsprogs/include/xfs_mount.h - 1.47 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_mount.h.diff?r1=text&tr1=1.47&r2=text&tr2=1.46&f=h xfsprogs/include/xfs_trans.h - 1.21 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_trans.h.diff?r1=text&tr1=1.21&r2=text&tr2=1.20&f=h xfsprogs/include/xfs_alloc.h - 1.13 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/include/xfs_alloc.h.diff?r1=text&tr1=1.13&r2=text&tr2=1.12&f=h xfsprogs/repair/phase5.c - 1.14 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/repair/phase5.c.diff?r1=text&tr1=1.14&r2=text&tr2=1.13&f=h xfsprogs/libxfs/xfs_ialloc.c - 1.26 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/xfs_ialloc.c.diff?r1=text&tr1=1.26&r2=text&tr2=1.25&f=h xfsprogs/libxfs/xfs.h - 1.59 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/xfs.h.diff?r1=text&tr1=1.59&r2=text&tr2=1.58&f=h xfsprogs/libxfs/init.c - 1.53 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/init.c.diff?r1=text&tr1=1.53&r2=text&tr2=1.52&f=h xfsprogs/libxfs/xfs_mount.c - 1.25 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/xfs_mount.c.diff?r1=text&tr1=1.25&r2=text&tr2=1.24&f=h xfsprogs/libxfs/xfs_alloc.c - 1.28 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/xfs_alloc.c.diff?r1=text&tr1=1.28&r2=text&tr2=1.27&f=h xfsprogs/libxfs/xfs_alloc_btree.c - 1.18 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libxfs/xfs_alloc_btree.c.diff?r1=text&tr1=1.18&r2=text&tr2=1.17&f=h - Userspace support for lazy superblock counters From owner-xfs@oss.sgi.com Tue May 22 17:02:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 17:02:23 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4N02HWt029771 for ; Tue, 22 May 2007 17:02:19 -0700 
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA02177; Wed, 23 May 2007 10:02:16 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4N02EAf101455042; Wed, 23 May 2007 10:02:15 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4N02D4O101191714; Wed, 23 May 2007 10:02:13 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 23 May 2007 10:02:13 +1000 From: David Chinner To: David Chinner Cc: Nathan Scott , Timothy Shimmin , xfs-dev , xfs-oss Subject: Re: Review: ensure EOF writes into existing extents update filesize Message-ID: <20070523000213.GS86004887@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> <1179795779.6273.510.camel@edge> <20070522010326.GN85884050@sgi.com> <1179807014.6273.519.camel@edge> <20070522060559.GI86004887@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070522060559.GI86004887@sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11501 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, May 22, 2007 at 04:05:59PM +1000, David Chinner wrote: > On Tue, May 22, 2007 at 02:10:14PM +1000, Nathan Scott wrote: > > On Tue, 2007-05-22 at 11:03 +1000, David Chinner wrote: > > > On Tue, May 22, 2007 at 11:02:59AM +1000, Nathan Scott wrote: > > > > On Tue, 2007-05-22 at 10:44 +1000, David Chinner wrote: > > > > > On Mon, May 21, 2007 at 04:34:26PM +1000, Timothy Shimmin wrote: > > > > > I guess I need to ping Christoph and Nathan on this one.... 
> > > > > > > > Could you resend the patch to me please? I lost the previous copy > > > > while ruthlessly ploughing through my mail backlog. ;) > > > > > > Below. > > > > Looks pretty good to me - xfs_convert_page has been overlooked, I > > think - attached patch fixes that. .... > I'll make the change anyway to be safe, but I'm still perplexed > as to why it doesn't seem necessary.... > > Ah - there it is - xfs_is_delayed_page(): > > 699 if (buffer_unwritten(bh)) > 700 acceptable = (type == IOMAP_UNWRITTEN); > 701 else if (buffer_delay(bh)) > 702 acceptable = (type == IOMAP_DELAY); > 703 else if (buffer_dirty(bh) && buffer_mapped(bh)) > 704 >>>>>>>>>>> acceptable = (type == 0); > 705 else > 706 break; > > The ioend we started with now has type = IOMAP_NEW = 0x40 which > means xfs_convert_page() aborts clustering this case immediately. > IOWs, we are never getting to this xfs_convert_page() case and > we are only passing through xfs_page_state_convert for mapped > pages. > > > You also initialise iomap_valid > > twice inside xfs_page_state_convert now ... I reverted that to just > > the once. > > I'll fold these changes in and fixup xfs_is_delayed_page() to look > for type == IOMAP_NEW and send out a new patch. Thanks, Nathan. Patch below, Nathan. It's passed my point tests and XFSQA... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-21 17:49:01.603432320 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-22 16:07:36.787017461 +1000 @@ -706,7 +706,7 @@ xfs_is_delayed_page( else if (buffer_delay(bh)) acceptable = (type == IOMAP_DELAY); else if (buffer_dirty(bh) && buffer_mapped(bh)) - acceptable = (type == 0); + acceptable = (type == IOMAP_NEW); else break; } while ((bh = bh->b_this_page) != head); @@ -815,7 +815,7 @@ xfs_convert_page( page_dirty--; count++; } else { - type = 0; + type = IOMAP_NEW; if (buffer_mapped(bh) && all_bh && startio) { lock_buffer(bh); xfs_add_to_ioend(inode, bh, offset, @@ -973,8 +973,8 @@ xfs_page_state_convert( bh = head = page_buffers(page); offset = page_offset(page); - flags = -1; - type = IOMAP_READ; + flags = BMAPI_READ; + type = IOMAP_NEW; /* TODO: cleanup count and page_dirty */ @@ -1004,14 +1004,14 @@ xfs_page_state_convert( * * Third case, an unmapped buffer was found, and we are * in a path where we need to write the whole page out. - */ + */ if (buffer_unwritten(bh) || buffer_delay(bh) || ((buffer_uptodate(bh) || PageUptodate(page)) && !buffer_mapped(bh) && (unmapped || startio))) { - /* + /* * Make sure we don't use a read-only iomap */ - if (flags == BMAPI_READ) + if (flags == BMAPI_READ) iomap_valid = 0; if (buffer_unwritten(bh)) { @@ -1060,7 +1060,7 @@ xfs_page_state_convert( * That means it must already have extents allocated * underneath it. Map the extent by reading it. 
*/ - if (!iomap_valid || type != IOMAP_READ) { + if (!iomap_valid || flags != BMAPI_READ) { flags = BMAPI_READ; size = xfs_probe_cluster(inode, page, bh, head, 1); @@ -1071,7 +1071,15 @@ xfs_page_state_convert( iomap_valid = xfs_iomap_valid(&iomap, offset); } - type = IOMAP_READ; + /* + * We set the type to IOMAP_NEW in case we are doing a + * small write at EOF that is extending the file but + * without needing an allocation. We need to update the + * file size on I/O completion in this case so it is + * the same case as having just allocated a new extent + * that we are writing into for the first time. + */ + type = IOMAP_NEW; if (!test_and_set_bit(BH_Lock, &bh->b_state)) { ASSERT(buffer_mapped(bh)); if (iomap_valid) From owner-xfs@oss.sgi.com Tue May 22 17:09:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 17:09:46 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4N09gWt031044 for ; Tue, 22 May 2007 17:09:43 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 1BAB792C2D5; Wed, 23 May 2007 10:09:42 +1000 (EST) Subject: Re: Review: ensure EOF writes into existing extents update filesize From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: Timothy Shimmin , xfs-dev , xfs-oss In-Reply-To: <20070523000213.GS86004887@sgi.com> References: <20070520233417.GA85884050@sgi.com> <738172E9F9634F7FC01B5B3C@timothy-shimmins-power-mac-g5.local> <20070522004414.GL85884050@sgi.com> <1179795779.6273.510.camel@edge> <20070522010326.GN85884050@sgi.com> <1179807014.6273.519.camel@edge> <20070522060559.GI86004887@sgi.com> <20070523000213.GS86004887@sgi.com> Content-Type: text/plain Organization: Aconex Date: Wed, 23 May 2007 10:15:19 +1000 Message-Id: <1179879319.6273.534.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11502 
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Wed, 2007-05-23 at 10:02 +1000, David Chinner wrote: > ... > Patch below, Nathan. It's passed my point tests and XFSQA... *nod* - looks good Dave, nice detective work there. cheers. -- Nathan From owner-xfs@oss.sgi.com Tue May 22 17:59:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 22 May 2007 17:59:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4N0xZWt007388 for ; Tue, 22 May 2007 17:59:37 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA03602; Wed, 23 May 2007 10:59:32 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 226D858CA531; Wed, 23 May 2007 10:59:32 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: TAKE 965068 - write at EOF fails to update filesize Message-Id: <20070523005932.226D858CA531@chook.melbourne.sgi.com> Date: Wed, 23 May 2007 10:59:32 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11503 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Write at EOF may not update filesize correctly. The recent fix for preventing NULL files from being left around does not update the file size correctly in all cases. The missing case is a write extending the file that does not need to allocate a block. In that case we used a read mapping of the extent which forced the use of the read I/O completion handler instead of the write I/O completion handler. Hence the file size was not updated on I/O completion.
Date: Wed May 23 10:58:50 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs Inspected by: nscott@aconex.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28657a fs/xfs/linux-2.6/xfs_aops.c - 1.143 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_aops.c.diff?r1=text&tr1=1.143&r2=text&tr2=1.142&f=h - Ensure we use an I/O completion handler that can update the filesize for all types of write regardless of whether we need to do allocation or not. From owner-xfs@oss.sgi.com Wed May 23 02:21:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 23 May 2007 02:21:16 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4N9L9Wt030426 for ; Wed, 23 May 2007 02:21:12 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA14523; Wed, 23 May 2007 19:21:05 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4N9L4Af100528355; Wed, 23 May 2007 19:21:05 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4N9L3uF101047625; Wed, 23 May 2007 19:21:03 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 23 May 2007 19:21:03 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review - writing to multiple non-contiguous unwritten extents within a page is broken. 
Message-ID: <20070523092103.GT85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11504 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs

[Nathan - probably another one for you]

This test, run on ia64 (16k page size) on a 4k block size filesystem:

#!/bin/bash
file=$1
rm -f $file
xfs_io -f \
    -c "truncate 1048576" \
    -c "resvsp 1032192 16384" \
    -c "pwrite 1033216 2560" \
    -c "pwrite 1040384 8192" \
    -c "bmap -vvp" \
    -c "fsync" \
    -c "bmap -vvp" \
    -c "close" \
    $file

which writes 3 unwritten blocks in a page (first block and last 2), results in a corrupted write. The problem is that the second block on the page is uninitialised and so is skipped by xfs_page_state_convert.

The result is that the xfs_ioend structures are not getting created correctly. When we skip the uninitialised block, we add the second unwritten block we are writing to into the original ioend. While this results in the correct I/O being sent to disk, it results in an ioend with a start offset of 0 and a length of 3 blocks. When we do unwritten extent conversion based on this range, we convert the wrong blocks.

What we need to be doing is creating two xfs_ioend structures, one for the first block and one for the second set of blocks in the page. That way we get two separate I/O completion events and convert the ranges separately and correctly.

I've checked xfs_convert_page(), and I don't think it needs any fix here - it already appears to force multiple ioends to be used in this case...

Thoughts?

Cheers,

Dave.
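[Illustration added to the archive, not part of the original message.] The ioend grouping described above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the XFS code: struct ioend_range, build_ioends() and the split_on_gap flag are hypothetical names invented for the example, and the model tracks nothing but a byte offset and a size per I/O completion range. With split_on_gap set to 0 it reproduces the single range starting at the first block and spanning 3 blocks; with split_on_gap set to 1 a skipped block forces a new range, which is the behaviour the message asks for.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE      4096  /* 4k filesystem blocks, as in the test above */
#define BLOCKS_PER_PAGE 4     /* 16k page / 4k blocks */

/* Hypothetical stand-in for the byte range covered by one ioend. */
struct ioend_range {
    long offset;  /* file offset of the first byte in the range */
    long size;    /* number of bytes the completion will convert */
};

/*
 * Walk the blocks of one page and group the written ones into ranges.
 * written[i] is nonzero if block i is being written, zero if it is skipped.
 * If split_on_gap is zero, a written block that follows a skipped block is
 * appended to the current range (the behaviour reported as broken).
 * If split_on_gap is nonzero, a skipped block forces the next written block
 * to start a new range.
 * Returns the number of ranges stored in out[].
 */
static int build_ioends(long page_offset,
                        const int written[BLOCKS_PER_PAGE],
                        int split_on_gap,
                        struct ioend_range out[BLOCKS_PER_PAGE])
{
    int nranges = 0;
    int need_new = 1;  /* the first written block always opens a range */

    assert(written != NULL && out != NULL);

    for (int i = 0; i < BLOCKS_PER_PAGE; i++) {
        if (!written[i]) {
            if (split_on_gap)
                need_new = 1;
            continue;
        }
        if (need_new) {
            out[nranges].offset = page_offset + (long)i * BLOCK_SIZE;
            out[nranges].size = 0;
            nranges++;
            need_new = 0;
        }
        out[nranges - 1].size += BLOCK_SIZE;
    }
    return nranges;
}
```

For the page in the test script (page offset 1032192, blocks 0, 2 and 3 written), calling build_ioends() with split_on_gap = 0 yields one range that also covers the untouched second block, while split_on_gap = 1 yields two ranges that cover only the blocks actually written.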
-- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_aops.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-05-23 16:33:04.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-05-23 17:52:15.540456674 +1000 @@ -1008,6 +1008,8 @@ xfs_page_state_convert( if (buffer_unwritten(bh) || buffer_delay(bh) || ((buffer_uptodate(bh) || PageUptodate(page)) && !buffer_mapped(bh) && (unmapped || startio))) { + int new_ioend = 0; + /* * Make sure we don't use a read-only iomap */ @@ -1026,6 +1028,15 @@ xfs_page_state_convert( } if (!iomap_valid) { + /* + * if we didn't have a valid mapping then we + * need to ensure that we put the new mapping + * in a new ioend structure. This needs to be + * done to ensure that the ioends correctly + * reflect the block mappings at io completion + * for unwritten extent conversion. 
+ */ + new_ioend = 1; if (type == IOMAP_NEW) { size = xfs_probe_cluster(inode, page, bh, head, 0); @@ -1045,7 +1056,7 @@ xfs_page_state_convert( if (startio) { xfs_add_to_ioend(inode, bh, offset, type, &ioend, - !iomap_valid); + new_ioend); } else { set_buffer_dirty(bh); unlock_buffer(bh); From owner-xfs@oss.sgi.com Wed May 23 18:08:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 23 May 2007 18:08:56 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4O18nWt012711 for ; Wed, 23 May 2007 18:08:51 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA09428; Thu, 24 May 2007 11:08:45 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1161) id 62E0E58CA531; Thu, 24 May 2007 11:08:45 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com Cc: xfs@oss.sgi.com Subject: TAKE 965149 - xfs_fsr - Temp working directory has full access for everyone Message-Id: <20070524010845.62E0E58CA531@chook.melbourne.sgi.com> Date: Thu, 24 May 2007 11:08:45 +1000 (EST) From: bnaujok@sgi.com (Barry Naujok) X-archive-position: 11505 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs Restrict access to temp working directory Date: Thu May 24 11:07:57 AEST 2007 Workarea: chook.melbourne.sgi.com:/home/bnaujok/isms/xfs-cmds Inspected by: Nathan Scott [nscott@aconex.com] The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28670a xfsdump/VERSION - 1.86 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsdump/VERSION.diff?r1=text&tr1=1.86&r2=text&tr2=1.85&f=h xfsdump/doc/CHANGES - 1.100 - changed 
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsdump/doc/CHANGES.diff?r1=text&tr1=1.100&r2=text&tr2=1.99&f=h - Update to version 2.2.45 xfsdump/fsr/xfs_fsr.c - 1.28 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsdump/fsr/xfs_fsr.c.diff?r1=text&tr1=1.28&r2=text&tr2=1.27&f=h - Restrict access to temp working directory From owner-xfs@oss.sgi.com Thu May 24 04:20:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 04:20:43 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4OBKZWt020464 for ; Thu, 24 May 2007 04:20:37 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id C334BB04E56A; Thu, 24 May 2007 07:20:35 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id BD2CE5000086; Thu, 24 May 2007 07:20:35 -0400 (EDT) Date: Thu, 24 May 2007 07:20:35 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Pallai Roland cc: Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem In-Reply-To: <200705241318.30711.dap@mail.index.hu> Message-ID: References: <200705241318.30711.dap@mail.index.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11506 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Including XFS mailing list on this one. On Thu, 24 May 2007, Pallai Roland wrote: > > Hi, > > I wondering why the md raid5 does accept writes after 2 disks failed. I've an > array built from 7 drives, filesystem is XFS. Yesterday, an IDE cable failed > (my friend kicked it off from the box on the floor:) and 2 disks have been > kicked but my download (yafc) not stopped, it tried and could write the file > system for whole night! 
> Now I changed the cable, tried to reassembly the array (mdadm -f --run), > event counter increased from 4908158 up to 4929612 on the failed disks, but I > cannot mount the file system and the 'xfs_repair -n' shows lot of errors > there. This is expainable by the partially successed writes. Ext3 and JFS > has "error=" mount option to switch filesystem read-only on any error, but > XFS hasn't: why? It's a good question too, but I think the md layer could > save dumb filesystems like XFS if denies writes after 2 disks are failed, and > I cannot see a good reason why it's not behave this way. > > Do you have better idea how can I avoid such filesystem corruptions in the > future? No, I don't want to use ext3 on this box. :) > > > my mount error: > XFS: Log inconsistent (didn't find previous header) > XFS: failed to find log head > XFS: log mount/recovery failed: error 5 > XFS: log mount failed > > > -- > d > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > From owner-xfs@oss.sgi.com Thu May 24 04:25:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 04:25:06 -0700 (PDT) Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4OBOwWt022222 for ; Thu, 24 May 2007 04:24:59 -0700 Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id 4C48E5800E; Thu, 24 May 2007 13:24:57 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Thu, 24 May 2007 13:24:57 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: xfs@oss.sgi.com Subject: Re: RESVSP problems Date: Thu, 24 May 2007 13:24:52 +0200 User-Agent: KMail/1.9.7 Cc: 
dgc@sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705241324.53283.lucke@o2.pl> X-archive-position: 11507 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs >>> You've probably hit: >>> http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 >>> unwritten extents remain unwritten after mmap() modifies them >>> >>> Bug dchinner about it... ;-) >> >> Dave, consider it a bugging from my humble self :-) > Yeah, yeah ;) > I'm waiting to see what happens with Nick's patches in .22 before > going any further. If they are not merged into .22, then I think we > should push the XFS specific fix in.... I wonder, what's the status of it, Dave? Cheers, Luke From owner-xfs@oss.sgi.com Thu May 24 13:51:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 13:51:32 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de [195.135.220.15]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4OKpOWt026420 for ; Thu, 24 May 2007 13:51:25 -0700 Received: from Relay1.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx2.suse.de (Postfix) with ESMTP id 1C7EC21486; Thu, 24 May 2007 22:51:23 +0200 (CEST) To: dgc@sgi.com (David Chinner) Cc: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: Re: PARTIAL TAKE 964999 - lazy superblock counters for XFS References: <20070522075932.E665058CA531@chook.melbourne.sgi.com> From: Andi Kleen Date: 24 May 2007 23:48:16 +0200 In-Reply-To: <20070522075932.E665058CA531@chook.melbourne.sgi.com> Message-ID: Lines: 16 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 11508 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs dgc@sgi.com (David Chinner) writes: > > The key to removing the contention is to not require the superblock > fields in question to be locked. We do that by not marking the > superblock dirty in the transaction. IOWs, we modify the incore > superblock but do not modify the cached superblock buffer. In short, > we do not log superblock modifications to critical fields in the > superblock on every transaction. In fact we only do it just before > we write the superblock to disk every sync period or just before > unmount. Does this mean it will increases performance on small systems too due to less super block writes or is it purely for large system scalability? -Andi From owner-xfs@oss.sgi.com Thu May 24 16:24:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 16:24:25 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4ONOKWt027777 for ; Thu, 24 May 2007 16:24:22 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA13337; Fri, 25 May 2007 09:24:13 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4ONOCAf103266746; Fri, 25 May 2007 09:24:13 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4ONO5gx103103914; Fri, 25 May 2007 09:24:05 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 25 May 2007 09:24:05 +1000 From: David Chinner To: Andi Kleen Cc: David Chinner , xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: Re: PARTIAL TAKE 964999 - lazy superblock counters for XFS Message-ID: <20070524232405.GE85884050@sgi.com> References: 
<20070522075932.E665058CA531@chook.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-archive-position: 11509 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 24, 2007 at 11:48:16PM +0200, Andi Kleen wrote: > dgc@sgi.com (David Chinner) writes: > > > > The key to removing the contention is to not require the superblock > > fields in question to be locked. We do that by not marking the > > superblock dirty in the transaction. IOWs, we modify the incore > > superblock but do not modify the cached superblock buffer. In short, > > we do not log superblock modifications to critical fields in the > > superblock on every transaction. In fact we only do it just before > > we write the superblock to disk every sync period or just before > > unmount. > > Does this mean it will increases performance on small systems too > due to less super block writes or is it purely for large > system scalability? If you are running 100 concurrent transactions to your small filesystem, then yes, it will also help. But that sort of load is usually seen on file servers or large compute boxes doing lots of file manipulations.... Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 24 17:05:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 17:06:00 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4P05sWt003176 for ; Thu, 24 May 2007 17:05:56 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA14351; Fri, 25 May 2007 10:05:52 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4P05oAf103361574; Fri, 25 May 2007 10:05:50 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4P05lgf103112564; Fri, 25 May 2007 10:05:47 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 25 May 2007 10:05:47 +1000 From: David Chinner To: Justin Piszcz Cc: Pallai Roland , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070525000547.GH85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-archive-position: 11510 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote: > Including XFS mailing list on this one. Thanks Justin. > On Thu, 24 May 2007, Pallai Roland wrote: > > > > >Hi, > > > >I wondering why the md raid5 does accept writes after 2 disks failed. I've > >an > >array built from 7 drives, filesystem is XFS. 
Yesterday, an IDE cable > >failed > >(my friend kicked it off from the box on the floor:) and 2 disks have been > >kicked but my download (yafc) not stopped, it tried and could write the > >file > >system for whole night! > >Now I changed the cable, tried to reassembly the array (mdadm -f --run), > >event counter increased from 4908158 up to 4929612 on the failed disks, > >but I > >cannot mount the file system and the 'xfs_repair -n' shows lot of errors > >there. This is expainable by the partially successed writes. Ext3 and JFS > >has "error=" mount option to switch filesystem read-only on any error, but > >XFS hasn't: why? "-o ro,norecovery" will allow you to mount the filesystem and get any uncorrupted data off it. You still may get shutdowns if you trip across corrupted metadata in the filesystem, though. > >It's a good question too, but I think the md layer could > >save dumb filesystems like XFS if denies writes after 2 disks are failed, > >and > >I cannot see a good reason why it's not behave this way. How is *any* filesystem supposed to know that the underlying block device has gone bad if it is not returning errors? I did mention this exact scenario in the filesystems workshop back in February - we'd *really* like to know if a RAID block device has gone into degraded mode (i.e. lost a disk) so we can throttle new writes until the rebuild has been completed. Stopping writes completely on a fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6) would also be possible if only we could get the information out of the block layer. > >Do you have better idea how can I avoid such filesystem corruptions in the > >future? No, I don't want to use ext3 on this box. :) Well, the problem is a bug in MD - it should have detected drives going away and stopped access to the device until it was repaired. You would have had the same problem with ext3, or JFS, or reiser or any other filesystem, too.
> >my mount error: > >XFS: Log inconsistent (didn't find previous header) > >XFS: failed to find log head > >XFS: log mount/recovery failed: error 5 > >XFS: log mount failed Your MD device is still hosed - error 5 = EIO; the md device is reporting errors back to the filesystem now. You need to fix that before trying to recover any data... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 24 18:53:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 18:53:25 -0700 (PDT) Received: from daptopfc.localdomain (inn.nightwish.hu [217.20.130.190]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4P1rHWt028033 for ; Thu, 24 May 2007 18:53:20 -0700 Received: from daptopfc.localdomain (daptopfc.localdomain [127.0.0.1]) by daptopfc.localdomain (8.13.8/8.13.8) with ESMTP id l4P1Znos006231; Fri, 25 May 2007 03:35:50 +0200 Received: (from dap@localhost) by daptopfc.localdomain (8.13.8/8.13.8/Submit) id l4P1ZnEJ006230; Fri, 25 May 2007 03:35:49 +0200 X-Authentication-Warning: daptopfc.localdomain: dap set sender to dap@mail.index.hu using -f Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem From: Pallai Roland To: David Chinner Cc: Linux-Raid , xfs@oss.sgi.com In-Reply-To: <20070525000547.GH85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit Date: Fri, 25 May 2007 03:35:48 +0200 Message-Id: <1180056948.6183.10.camel@daptopfc.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.8.2.1 (2.8.2.1-3.fc6) X-archive-position: 11511 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote: > On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote: > > On Thu, 24 May 2007, Pallai Roland wrote: > > >I
wondering why the md raid5 does accept writes after 2 disks failed. I've > > >an > > >array built from 7 drives, filesystem is XFS. Yesterday, an IDE cable > > >failed > > >(my friend kicked it off from the box on the floor:) and 2 disks have been > > >kicked but my download (yafc) not stopped, it tried and could write the > > >file > > >system for whole night! > > >Now I changed the cable, tried to reassembly the array (mdadm -f --run), > > >event counter increased from 4908158 up to 4929612 on the failed disks, > > >but I > > >cannot mount the file system and the 'xfs_repair -n' shows lot of errors > > >there. This is expainable by the partially successed writes. Ext3 and JFS > > >has "error=" mount option to switch filesystem read-only on any error, but > > >XFS hasn't: why? > > "-o ro,norecovery" will allow you to mount the filesystem and get any > uncorrupted data off it. > > You still may get shutdowns if you trip across corrupted metadata in > the filesystem, though. Thanks, I'll try it > > >It's a good question too, but I think the md layer could > > >save dumb filesystems like XFS if denies writes after 2 disks are failed, > > >and > > >I cannot see a good reason why it's not behave this way. > > How is *any* filesystem supposed to know that the underlying block > device has gone bad if it is not returning errors? It is returning errors, I think so. If I try to write raid5 with 2 failed disks with dd, I've got errors on the missing chunks. The difference between ext3 and XFS is that ext3 will remount to read-only on the first write error but the XFS won't, XFS only fails only the current operation, IMHO. The method of ext3 isn't perfect, but in practice, it's working well. > I did mention this exact scenario in the filesystems workshop back > in february - we'd *really* like to know if a RAID block device has gone > into degraded mode (i.e. lost a disk) so we can throttle new writes > until the rebuil dhas been completed. 
Stopping writes completely on a > fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6) > would also be possible if only we could get the information out > of the block layer. It would be nice, but as I mentioned above, ext3 do it well in practice now. > > >Do you have better idea how can I avoid such filesystem corruptions in the > > >future? No, I don't want to use ext3 on this box. :) > > Well, the problem is a bug in MD - it should have detected > drives going away and stopped access to the device until it was > repaired. You would have had the same problem with ext3, or JFS, > or reiser or any other filesystem, too. > > > >my mount error: > > >XFS: Log inconsistent (didn't find previous header) > > >XFS: failed to find log head > > >XFS: log mount/recovery failed: error 5 > > >XFS: log mount failed > > You MD device is still hosed - error 5 = EIO; the md device is > reporting errors back the filesystem now. You need to fix that > before trying to recover any data... I play with it tomorrow, thanks for your help -- d From owner-xfs@oss.sgi.com Thu May 24 21:55:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 21:55:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4P4t9Wt012036 for ; Thu, 24 May 2007 21:55:11 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA20647; Fri, 25 May 2007 14:55:07 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4P4t6Af103564413; Fri, 25 May 2007 14:55:06 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4P4t0bB102646718; Fri, 25 May 2007 14:55:00 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f 
Date: Fri, 25 May 2007 14:55:00 +1000 From: David Chinner To: Pallai Roland Cc: David Chinner , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070525045500.GF86004887@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1180056948.6183.10.camel@daptopfc.localdomain> User-Agent: Mutt/1.4.2.1i X-archive-position: 11512 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote: > On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote: > > > >It's a good question too, but I think the md layer could > > > >save dumb filesystems like XFS if denies writes after 2 disks are failed, > > > >and > > > >I cannot see a good reason why it's not behave this way. > > > > How is *any* filesystem supposed to know that the underlying block > > device has gone bad if it is not returning errors? > It is returning errors, I think so. If I try to write raid5 with 2 > failed disks with dd, I've got errors on the missing chunks. Oh, did you look at your logs and find that XFS had spammed them about writes that were failing? > The difference between ext3 and XFS is that ext3 will remount to > read-only on the first write error but the XFS won't, XFS only fails > only the current operation, IMHO. The method of ext3 isn't perfect, but > in practice, it's working well. XFS will shutdown the filesystem if metadata corruption will occur due to a failed write. We don't immediately fail the filesystem on data write errors because on large systems you can get *transient* I/O errors (e.g. 
FC path failover) and so retrying failed data writes is useful for preventing unnecessary shutdowns of the filesystem. Different design criteria, different solutions... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 24 22:06:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 22:06:44 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4P56cWt015754 for ; Thu, 24 May 2007 22:06:40 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA20916; Fri, 25 May 2007 15:06:35 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4P56XAf85616914; Fri, 25 May 2007 15:06:34 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4P56WjB103836182; Fri, 25 May 2007 15:06:32 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 25 May 2007 15:06:32 +1000 From: David Chinner To: =?iso-8859-1?Q?=C5=81ukasz?= Fibinger Cc: xfs@oss.sgi.com, dgc@sgi.com Subject: Re: RESVSP problems Message-ID: <20070525050632.GG86004887@sgi.com> References: <200705241324.53283.lucke@o2.pl> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <200705241324.53283.lucke@o2.pl> User-Agent: Mutt/1.4.2.1i X-archive-position: 11513 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 24, 2007 at 01:24:52PM +0200, Łukasz Fibinger wrote: > >>> You've probably hit: > >>> http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > >>> unwritten extents
remain unwritten after mmap() modifies them > >>> > >>> Bug dchinner about it... ;-) > >> > >> Dave, consider it a bugging from my humble self :-) > > > Yeah, yeah ;) > > > I'm waiting to see what happens with Nick's patches in .22 before > > going any further. If they are not merged into .22, then I think we > > should push the XFS specific fix in.... > > I wonder, what's the status of it, Dave? A couple of days ago: http://marc.info/?l=linux-mm&m=117988295814519&w=2 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 24 23:10:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 23:10:39 -0700 (PDT) Received: from mail.ggsys.net (mail.ggsys.net [69.26.161.131]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4P6AYWt001120 for ; Thu, 24 May 2007 23:10:36 -0700 Received: (qmail 22718 invoked from network); 25 May 2007 05:43:53 -0000 Received: from cpe-70-112-65-134.austin.res.rr.com (HELO ?192.168.4.12?) (70.112.65.134) by mail.ggsys.net with SMTP; 25 May 2007 05:43:53 -0000 Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem From: Alberto Alonso To: David Chinner Cc: Pallai Roland , Linux-Raid , xfs@oss.sgi.com In-Reply-To: <20070525045500.GF86004887@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> Content-Type: text/plain Organization: Global Gate Systems LLC. 
Date: Fri, 25 May 2007 00:43:51 -0500 Message-Id: <1180071831.21028.125.camel@w100> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 Content-Transfer-Encoding: 7bit X-archive-position: 11514 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: alberto@ggsys.net Precedence: bulk X-list: xfs > > The difference between ext3 and XFS is that ext3 will remount to > > read-only on the first write error but the XFS won't, XFS only fails > > only the current operation, IMHO. The method of ext3 isn't perfect, but > > in practice, it's working well. > > XFS will shutdown the filesystem if metadata corruption will occur > due to a failed write. We don't immediately fail the filesystem on > data write errors because on large systems you can get *transient* > I/O errors (e.g. FC path failover) and so retrying failed data > writes is useful for preventing unnecessary shutdowns of the > filesystem. > > Different design criteria, different solutions... I think his point was that going into a read only mode causes a less catastrophic situation (ie. a web server can still serve pages). I think that is a valid point, rather than shutting down the file system completely, an automatic switch to where the least disruption of service can occur is always desired. Maybe the automatic failure mode could be something that is configurable via the mount options. I personally have found the XFS file system to be great for my needs (except issues with NFS interaction, where the bug report never got answered), but that doesn't mean it can not be improved. Just my 2 cents, Alberto > Cheers, > > Dave. -- Alberto Alonso Global Gate Systems LLC. 
(512) 351-7233 http://www.ggsys.net Hardware, consulting, sysadmin, monitoring and remote backups From owner-xfs@oss.sgi.com Thu May 24 23:53:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 24 May 2007 23:53:13 -0700 (PDT) Received: from one.firstfloor.org (one.firstfloor.org [213.235.205.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4P6r7Wt012051 for ; Thu, 24 May 2007 23:53:10 -0700 Received: by one.firstfloor.org (Postfix, from userid 503) id 2DD6718902A7; Fri, 25 May 2007 08:53:04 +0200 (CEST) Date: Fri, 25 May 2007 08:53:03 +0200 From: Andi Kleen To: David Chinner Cc: Andi Kleen , xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: Re: PARTIAL TAKE 964999 - lazy superblock counters for XFS Message-ID: <20070525065303.GA8094@one.firstfloor.org> References: <20070522075932.E665058CA531@chook.melbourne.sgi.com> <20070524232405.GE85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070524232405.GE85884050@sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11515 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs > If you are running 100 concurrent transactions to your small > filesystem, then yes, it will also help. But that sort of load > is usually seen on file servers or large compute boxes doing lots > of file manipulations.... But won't you do less sb writes on any workload since the data is stored elsewhere?
-Andi From owner-xfs@oss.sgi.com Fri May 25 01:37:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 01:37:07 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4P8b1Wt005561 for ; Fri, 25 May 2007 01:37:02 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA26515; Fri, 25 May 2007 18:36:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4P8atAf104646802; Fri, 25 May 2007 18:36:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4P8aobI104669484; Fri, 25 May 2007 18:36:50 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 25 May 2007 18:36:50 +1000 From: David Chinner To: Alberto Alonso Cc: David Chinner , Pallai Roland , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070525083650.GO85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <1180071831.21028.125.camel@w100> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1180071831.21028.125.camel@w100> User-Agent: Mutt/1.4.2.1i X-archive-position: 11516 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote: > > > The difference between ext3 and XFS is that ext3 will remount to > > > read-only on the first write error but the XFS won't, XFS only fails > > > only the 
current operation, IMHO. The method of ext3 isn't perfect, but > > > in practice, it's working well. > > > > XFS will shutdown the filesystem if metadata corruption will occur > > due to a failed write. We don't immediately fail the filesystem on > > data write errors because on large systems you can get *transient* > > I/O errors (e.g. FC path failover) and so retrying failed data > > writes is useful for preventing unnecessary shutdowns of the > > filesystem. > > > > Different design criteria, different solutions... > > I think his point was that going into a read only mode causes a > less catastrophic situation (ie. a web server can still serve > pages). Sure - but once you've detected one corruption or had metadata I/O errors, can you trust the rest of the filesystem? > I think that is a valid point, rather than shutting down > the file system completely, an automatic switch to where the least > disruption of service can occur is always desired. I consider the possibility of serving out bad data (i.e after a remount to readonly) to be the worst possible disruption of service that can happen ;) > Maybe the automatic failure mode could be something that is > configurable via the mount options. If only it were that simple. Have you looked to see how many hooks there are in XFS to shutdown without causing further damage? % grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l 116 Changing the way we handle shutdowns would take a lot of time, effort and testing. When can I expect a patch? ;) > I personally have found the XFS file system to be great for > my needs (except issues with NFS interaction, where the bug report > never got answered), but that doesn't mean it can not be improved. Got a pointer? Cheers, Dave. 
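For readers who want to repeat the hook count above without a shell, here is a minimal sketch in Python. It is an illustration only, not part of the original message: the function name count_marker_lines and its parameters are made up for this example, and unlike the grep command (which looks at fs/xfs/*.[ch] and one directory level below) it walks the whole tree, so the number it reports can differ.

```python
import os


def count_marker_lines(root, marker="FORCED_SHUTDOWN", suffixes=(".c", ".h")):
    """Count lines containing `marker` in C sources and headers under `root`.

    Roughly equivalent to `grep MARKER files | wc -l`: a line is counted
    once even if the marker appears on it several times.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # skip anything that is not a .c or .h file
            path = os.path.join(dirpath, name)
            # Kernel sources are mostly ASCII; replace undecodable bytes
            # rather than failing on the odd non-UTF-8 comment.
            with open(path, encoding="utf-8", errors="replace") as fh:
                total += sum(1 for line in fh if marker in line)
    return total
```

Calling count_marker_lines("fs/xfs") from the top of a kernel tree should give a figure in the same ballpark as the grep output; the exact value depends on the kernel version.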
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 25 03:24:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 03:24:31 -0700 (PDT) Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4PAOQWt028976 for ; Fri, 25 May 2007 03:24:27 -0700 Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id BDFA658044; Fri, 25 May 2007 12:24:25 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Fri, 25 May 2007 12:24:25 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: David Chinner Subject: Re: RESVSP problems Date: Fri, 25 May 2007 12:24:23 +0200 User-Agent: KMail/1.9.7 References: <200705241324.53283.lucke@o2.pl> <20070525050632.GG86004887@sgi.com> In-Reply-To: <20070525050632.GG86004887@sgi.com> Cc: xfs@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200705251224.23624.lucke@o2.pl> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l4PAOSWt028980 X-archive-position: 11517 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs On Friday 25 of May 2007, you wrote: > On Thu, May 24, 2007 at 01:24:52PM +0200, Łukasz Fibinger wrote: > > >>> You've probably hit: > > >>> http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > > >>> unwritten extents remain unwritten after mmap() modifies them > > >>> > > >>> Bug dchinner about it...
;-) > > >> > > >> Dave, consider it a bugging from my humble self :-) > > > > > > Yeah, yeah ;) > > > > > > I'm waiting to see what happens with Nick's patches in .22 before > > > going any further. If they are not merged into .22, then I think we > > > should push the XFS specific fix in.... > > > > I wonder, what's the status of it, Dave? > > A couple of days ago: > > http://marc.info/?l=linux-mm&m=117988295814519&w=2 > I see. If that's the case, would sharing your 2.6.21 XFS specific fixes (you have them, right?) with me be any problem for you? Cheers, Luke From owner-xfs@oss.sgi.com Fri May 25 04:24:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 04:25:06 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4PBOtWt017634 for ; Fri, 25 May 2007 04:24:57 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id VAA29782; Fri, 25 May 2007 21:24:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4PBOhAf104582273; Fri, 25 May 2007 21:24:44 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4PBOffF104725440; Fri, 25 May 2007 21:24:41 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Fri, 25 May 2007 21:24:41 +1000 From: David Chinner To: =?iso-8859-1?Q?=C5=81ukasz?= Fibinger Cc: David Chinner , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070525112441.GQ85884050@sgi.com> References: <200705241324.53283.lucke@o2.pl> <20070525050632.GG86004887@sgi.com> <200705251224.23624.lucke@o2.pl> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: 
<200705251224.23624.lucke@o2.pl> User-Agent: Mutt/1.4.2.1i X-archive-position: 11518 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 12:24:23PM +0200, Łukasz Fibinger wrote: > On Friday 25 of May 2007, you wrote: > > On Thu, May 24, 2007 at 01:24:52PM +0200, Łukasz Fibinger wrote: > > > >>> You've probably hit: > > > >>> http://oss.sgi.com/bugzilla/show_bug.cgi?id=418 > > > >>> unwritten extents remain unwritten after mmap() modifies them > > > >>> > > > >>> Bug dchinner about it... ;-) > > > >> > > > >> Dave, consider it a bugging from my humble self :-) > > > > > > > > Yeah, yeah ;) > > > > > > > > I'm waiting to see what happens with Nick's patches in .22 before > > > > going any further. If they are not merged into .22, then I think we > > > > should push the XFS specific fix in.... > > > > > > I wonder, what's the status of it, Dave? > > > > A couple of days ago: > > > > http://marc.info/?l=linux-mm&m=117988295814519&w=2 > > > > I see. If that's the case, would sharing your 2.6.21 XFS specific fixes (you > have them, right?) with me be any problem for you? The original XFS specific code was here: http://marc.info/?l=linux-fsdevel&m=117080251026985&w=2 But if you want to patch your own kernel, I suggest that you use the latest generic version I posted here: http://marc.info/?l=linux-fsdevel&m=117426058311029&w=2 http://marc.info/?l=linux-fsdevel&m=117426070111133&w=2 Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri May 25 06:38:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 06:38:07 -0700 (PDT) Received: from poczta.o2.pl (mx10.go2.pl [193.17.41.74]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4PDbxWt008855 for ; Fri, 25 May 2007 06:38:00 -0700 Received: from poczta.o2.pl (mx10.go2.pl [127.0.0.1]) by poczta.o2.pl (Postfix) with ESMTP id BEEC858033; Fri, 25 May 2007 15:37:58 +0200 (CEST) Received: from lucke.localnet (xdsl-7687.bielsko.dialog.net.pl [62.87.234.135]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by poczta.o2.pl (Postfix) with ESMTP; Fri, 25 May 2007 15:37:58 +0200 (CEST) From: =?utf-8?q?=C5=81ukasz_Fibinger?= Reply-To: lucke@o2.pl To: David Chinner Subject: Re: RESVSP problems Date: Fri, 25 May 2007 15:37:54 +0200 User-Agent: KMail/1.9.7 Cc: xfs@oss.sgi.com References: <200705241324.53283.lucke@o2.pl> <200705251224.23624.lucke@o2.pl> <20070525112441.GQ85884050@sgi.com> In-Reply-To: <20070525112441.GQ85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200705251537.55567.lucke@o2.pl> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l4PDc1Wt008862 X-archive-position: 11519 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lucke@o2.pl Precedence: bulk X-list: xfs On Friday 25 of May 2007, David Chinner wrote: > On Fri, May 25, 2007 at 12:24:23PM +0200, Łukasz Fibinger wrote: > > I see. If that's the case, would sharing your 2.6.21 XFS specific fixes > > (you have them, right?) with me be any problem for you?
> > The original XFS specific code was here: > > http://marc.info/?l=linux-fsdevel&m=117080251026985&w=2 > > But if you want to patch your own kernel, I suggest that > you use the latest generic version I posted here: > > http://marc.info/?l=linux-fsdevel&m=117426058311029&w=2 > http://marc.info/?l=linux-fsdevel&m=117426070111133&w=2 > Seems to be working as expected. Thanks. Cheers, Luke From owner-xfs@oss.sgi.com Fri May 25 07:01:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 07:01:40 -0700 (PDT) Received: from dh.localdomain (3e44a16d.adsl.index.hu [217.20.130.176]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4PE1XWt014547 for ; Fri, 25 May 2007 07:01:35 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HraMB-0003W0-Si; Fri, 25 May 2007 16:01:28 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Fri, 25 May 2007 16:01:27 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> In-Reply-To: <1180056948.6183.10.camel@daptopfc.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705251601.27577.dap@mail.index.hu> X-archive-position: 11520 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Friday 25 May 2007 03:35:48 Pallai Roland wrote: > On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote: > > On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote: > > > On Thu, 24 May 2007, Pallai Roland wrote: > > > >It's a good question too, but I think the md layer could > > > >save dumb filesystems like XFS if denies writes after 2 disks are > > > 
> failed, and > > > >I cannot see a good reason why it's not behave this way. > > > > How is *any* filesystem supposed to know that the underlying block > > device has gone bad if it is not returning errors? > > It is returning errors, I think so. If I try to write raid5 with 2 > failed disks with dd, I've got errors on the missing chunks. > The difference between ext3 and XFS is that ext3 will remount to > read-only on the first write error but the XFS won't, XFS only fails > only the current operation, IMHO. The method of ext3 isn't perfect, but > in practice, it's working well. Sorry, I was wrong: md really isn't returning error! It's madness, IMHO. The reason why ext3 is safer on raid5 in practice is that ext3 remounts to read-only on read errors too and when a raid5 array got 2 failed drives and there's some read, the error= behavior of ext3 will be activated and stops further writes. You're right, it's not a good solution and there should be read operations to prevent data loss in this case on ext3 too. Raid5 *must deny all writes* when 2 disks failed: I still can't see a good reason why not, and the current method is braindead! > > I did mention this exact scenario in the filesystems workshop back > > in february - we'd *really* like to know if a RAID block device has gone > > into degraded mode (i.e. lost a disk) so we can throttle new writes > > until the rebuild has been completed. Stopping writes completely on a > > fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6) > > would also be possible if only we could get the information out > > of the block layer. Yes, it sounds good, but I think we need a quick fix now, it's a real problem and easily can lead to mass data loss.
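The policy being argued over in this message can be written down in a few lines. The sketch below is only a toy model in Python of the rule quoted above (2 lost disks are fatal for RAID5, 3 for RAID6); it is not how the md driver is implemented, and the names FAILURES_TOLERATED, array_state and should_accept_writes are invented for the illustration.

```python
# Number of member-disk failures each RAID level survives without data loss.
FAILURES_TOLERATED = {"raid5": 1, "raid6": 2}


def array_state(level, failed_disks):
    """Classify an array as 'clean', 'degraded' or 'failed'."""
    tolerated = FAILURES_TOLERATED[level]
    if failed_disks == 0:
        return "clean"
    if failed_disks <= tolerated:
        # Still readable and writable; redundancy is reduced or gone.
        return "degraded"
    # More disks lost than parity can reconstruct: stripes are incomplete.
    return "failed"


def should_accept_writes(level, failed_disks):
    """Policy argued for in the thread: refuse all writes once the array has failed."""
    return array_state(level, failed_disks) != "failed"
```

Under this model a filesystem could throttle new writes while array_state() reports "degraded" and stop them entirely once it reports "failed", provided the block layer exposed that state to it.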
-- d From owner-xfs@oss.sgi.com Fri May 25 07:35:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 07:35:44 -0700 (PDT) Received: from dh.localdomain (3e44a16d.adsl.index.hu [217.20.130.176]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4PEZdWt019898 for ; Fri, 25 May 2007 07:35:41 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HratE-0003XZ-Ka; Fri, 25 May 2007 16:35:36 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Fri, 25 May 2007 16:35:36 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> In-Reply-To: <20070525045500.GF86004887@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705251635.36533.dap@mail.index.hu> X-archive-position: 11521 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Friday 25 May 2007 06:55:00 David Chinner wrote: > Oh, did you look at your logs and find that XFS had spammed them > about writes that were failing? The first message after the incident: May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. 
Caller 0xf8ac14f8 May 24 01:53:50 hq kernel: xfs_btree_check_sblock+0x4f/0xc2 [xfs] xfs_alloc_lookup+0x34e/0x47b [xfs] May 24 01:53:50 HF kernel: xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 [xfs] May 24 01:53:50 hq kernel: xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] xfs_alloc_vextent+0x3bd/0x53b [xfs] May 24 01:53:50 hq kernel: xfs_bmapi+0x1ac4/0x23cd [xfs] xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs] May 24 01:53:50 hq kernel: xlog_dealloc_log+0x49/0xea [xfs] xfs_iomap_write_allocate+0x2d9/0x58b [xfs] May 24 01:53:50 hq kernel: xfs_iomap+0x60e/0x82d [xfs] __wake_up_common+0x39/0x59 May 24 01:53:50 hq kernel: xfs_map_blocks+0x39/0x6c [xfs] xfs_page_state_convert+0x644/0xf9c [xfs] May 24 01:53:50 hq kernel: schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 [xfs] May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 [xfs] mpage_writepages+0x1fb/0x3bb May 24 01:53:50 hq kernel: mpage_writepages+0x133/0x3bb xfs_vm_writepage+0x0/0xe0 [xfs] May 24 01:53:50 hq kernel: do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387 May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 writeback_inodes+0x63/0xdc May 24 01:53:50 hq kernel: background_writeout+0x66/0x9f pdflush+0x0/0x1ad May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad background_writeout+0x0/0x9f May 24 01:53:50 hq kernel: kthread+0xc2/0xc6 kthread+0x0/0xc6 May 24 01:53:50 hq kernel: kernel_thread_helper+0x5/0xb ..and I've spammed such messages. This "internal error" isn't a good reason to shut down the file system? I think if there's a sign of corrupted file system, the first thing we should do is to stop writes (or the entire FS) and let the admin to examine the situation. I'm not talking about my case where the md raid5 was a braindead, I'm talking about general situations. 
-- d From owner-xfs@oss.sgi.com Fri May 25 07:49:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 25 May 2007 07:49:46 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4PEnfWt022607 for ; Fri, 25 May 2007 07:49:43 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hrae4-0002nt-OT; Fri, 25 May 2007 15:19:56 +0100 Date: Fri, 25 May 2007 15:19:56 +0100 From: Christoph Hellwig To: David Chinner Cc: Łukasz Fibinger , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070525141956.GA10669@infradead.org> References: <200705241324.53283.lucke@o2.pl> <20070525050632.GG86004887@sgi.com> <200705251224.23624.lucke@o2.pl> <20070525112441.GQ85884050@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070525112441.GQ85884050@sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11522 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 09:24:41PM +1000, David Chinner wrote: > The original XFS specific code was here: > > http://marc.info/?l=linux-fsdevel&m=117080251026985&w=2 Btw, I think you should just push it for now. We can convert XFS to the generic code once ->fault goes in whenever that may be.
From owner-xfs@oss.sgi.com Sun May 27 17:30:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 17:30:27 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S0ULWt021934 for ; Sun, 27 May 2007 17:30:22 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA27800; Mon, 28 May 2007 10:30:17 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4S0UFAf105233397; Mon, 28 May 2007 10:30:16 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4S0UBLF106707477; Mon, 28 May 2007 10:30:11 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 28 May 2007 10:30:11 +1000 From: David Chinner To: Pallai Roland Cc: David Chinner , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070528003010.GS85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <200705251635.36533.dap@mail.index.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705251635.36533.dap@mail.index.hu> User-Agent: Mutt/1.4.2.1i X-archive-position: 11523 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: > > On Friday 25 May 2007 06:55:00 David Chinner wrote: > > Oh, did you look at your logs and find that XFS had spammed them > > about writes that were failing? 
> > The first message after the incident: > > May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8 > May 24 01:53:50 hq kernel: xfs_btree_check_sblock+0x4f/0xc2 [xfs] xfs_alloc_lookup+0x34e/0x47b [xfs] > May 24 01:53:50 HF kernel: xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 [xfs] > May 24 01:53:50 hq kernel: xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] xfs_alloc_vextent+0x3bd/0x53b [xfs] > May 24 01:53:50 hq kernel: xfs_bmapi+0x1ac4/0x23cd [xfs] xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs] > May 24 01:53:50 hq kernel: xlog_dealloc_log+0x49/0xea [xfs] xfs_iomap_write_allocate+0x2d9/0x58b [xfs] > May 24 01:53:50 hq kernel: xfs_iomap+0x60e/0x82d [xfs] __wake_up_common+0x39/0x59 > May 24 01:53:50 hq kernel: xfs_map_blocks+0x39/0x6c [xfs] xfs_page_state_convert+0x644/0xf9c [xfs] > May 24 01:53:50 hq kernel: schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 [xfs] > May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 [xfs] mpage_writepages+0x1fb/0x3bb > May 24 01:53:50 hq kernel: mpage_writepages+0x133/0x3bb xfs_vm_writepage+0x0/0xe0 [xfs] > May 24 01:53:50 hq kernel: do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387 > May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 writeback_inodes+0x63/0xdc > May 24 01:53:50 hq kernel: background_writeout+0x66/0x9f pdflush+0x0/0x1ad > May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad background_writeout+0x0/0x9f > May 24 01:53:50 hq kernel: kthread+0xc2/0xc6 kthread+0x0/0xc6 > May 24 01:53:50 hq kernel: kernel_thread_helper+0x5/0xb > > .and I've spammed such messages. This "internal error" isn't a good reason to shut down > the file system? Actually, that error does shut the filesystem down in most cases. When you see that output, the function is returning -EFSCORRUPTED. You've got a corrupted freespace btree.
The reason why you get spammed is that this is happening during background writeback, and there is no one to return the -EFSCORRUPTED error to. The background writeback path doesn't specifically detect shut down filesystems or trigger shutdowns on errors because that happens in different layers so you just end up with failed data writes. These errors will occur on the next foreground data or metadata allocation and that will shut the filesystem down at that point. I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in this case we should be shutting down the filesystem. That would certainly cut down on the spamming and would not appear to change anything other behaviour.... > I think if there's a sign of corrupted file system, the first thing we should do > is to stop writes (or the entire FS) and let the admin to examine the situation. Yes, that's *exactly* what a shutdown does. In this case, your writes are being stopped - hence the error messages - but the filesystem has not yet been shutdown..... Cheers, Dave. 
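The behaviour explained in this message (background writeback logs the corruption but has nobody to return the error to, while the next foreground allocation shuts the filesystem down) can be illustrated with a toy model. The Python sketch below is a simplification built only from that description, not from the XFS source; ToyFs, FsCorrupted and every method name are hypothetical.

```python
class FsCorrupted(Exception):
    """Stands in for the kernel's -EFSCORRUPTED return value."""


class ToyFs:
    def __init__(self):
        self.shutdown = False            # set once the filesystem is shut down
        self.freespace_btree_ok = True   # flipped to False to simulate corruption
        self.log = []                    # stands in for the kernel log

    def _allocate(self):
        """Allocation step shared by both paths."""
        if self.shutdown:
            raise OSError("filesystem is shut down")
        if not self.freespace_btree_ok:
            raise FsCorrupted("bad freespace btree block")

    def background_writeback(self):
        """No caller to report to: log the failure, fail the write, carry on."""
        try:
            self._allocate()
        except (FsCorrupted, OSError) as err:
            self.log.append(f"XFS internal error: {err}")
            return False   # the data write failed, but nothing triggers a shutdown
        return True

    def foreground_allocate(self):
        """A foreground allocation sees the same error and shuts the filesystem down."""
        try:
            self._allocate()
        except FsCorrupted:
            self.shutdown = True
            raise
```

In this model, repeated calls to background_writeback() on a corrupted filesystem only grow the log, which matches the spamming reported earlier in the thread; the shutdown flag is set only when foreground_allocate() is called.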
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun May 27 17:50:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 17:50:27 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S0oKWt027517 for ; Sun, 27 May 2007 17:50:23 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA28475; Mon, 28 May 2007 10:50:15 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4S0oDAf106609103; Mon, 28 May 2007 10:50:13 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4S0oAAD106689876; Mon, 28 May 2007 10:50:10 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 28 May 2007 10:50:10 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , Łukasz Fibinger , xfs@oss.sgi.com Subject: Re: RESVSP problems Message-ID: <20070528005010.GV85884050@sgi.com> References: <200705241324.53283.lucke@o2.pl> <20070525050632.GG86004887@sgi.com> <200705251224.23624.lucke@o2.pl> <20070525112441.GQ85884050@sgi.com> <20070525141956.GA10669@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070525141956.GA10669@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11524 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, May 25, 2007 at 03:19:56PM +0100, Christoph Hellwig wrote: > On Fri, May 25, 2007 at 09:24:41PM +1000, David Chinner wrote: > > The original XFS specific code was here: > > > >
http://marc.info/?l=linux-fsdevel&m=117080251026985&w=2 > > Btw, I think you should just push it for now. We can convert XFS to > the generic code once ->fault goes in whenever that may be. Ok - That original code had a couple of problems, so I'll fix it up first... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun May 27 18:50:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 18:50:28 -0700 (PDT) Received: from dh.localdomain (inn.nightwish.hu [217.20.130.190]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4S1oNWt007902 for ; Sun, 27 May 2007 18:50:25 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HsUNG-0007sd-QL; Mon, 28 May 2007 03:50:19 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Mon, 28 May 2007 03:50:17 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <200705251635.36533.dap@mail.index.hu> <20070528003010.GS85884050@sgi.com> In-Reply-To: <20070528003010.GS85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705280350.18384.dap@mail.index.hu> X-archive-position: 11525 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Monday 28 May 2007 02:30:11 David Chinner wrote: > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: > > On Friday 25 May 2007 06:55:00 David Chinner wrote: > > > Oh, did you look at your logs and find that XFS had spammed them > > > about writes that were failing?
> > > > The first message after the incident: > > > > May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error > > xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller > > 0xf8ac14f8 May 24 01:53:50 hq kernel: > > xfs_btree_check_sblock+0x4f/0xc2 [xfs] > > xfs_alloc_lookup+0x34e/0x47b [xfs] May 24 01:53:50 HF kernel: > > xfs_alloc_lookup+0x34e/0x47b [xfs] kmem_zone_zalloc+0x1b/0x43 > > [xfs] May 24 01:53:50 hq kernel: > > xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] > > xfs_alloc_vextent+0x3bd/0x53b [xfs] May 24 01:53:50 hq kernel: > > xfs_bmapi+0x1ac4/0x23cd [xfs] > > xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs] May 24 01:53:50 hq kernel: > > xlog_dealloc_log+0x49/0xea [xfs] > > xfs_iomap_write_allocate+0x2d9/0x58b [xfs] May 24 01:53:50 hq kernel: > > xfs_iomap+0x60e/0x82d [xfs] > > __wake_up_common+0x39/0x59 May 24 01:53:50 hq kernel: > > xfs_map_blocks+0x39/0x6c [xfs] > > xfs_page_state_convert+0x644/0xf9c [xfs] May 24 01:53:50 hq kernel: > > schedule+0x5d1/0xf4d xfs_vm_writepage+0x0/0xe0 > > [xfs] May 24 01:53:50 hq kernel: xfs_vm_writepage+0x57/0xe0 > > [xfs] mpage_writepages+0x1fb/0x3bb May 24 01:53:50 hq kernel: > > mpage_writepages+0x133/0x3bb > > xfs_vm_writepage+0x0/0xe0 [xfs] May 24 01:53:50 hq kernel: > > do_writepages+0x35/0x3b __writeback_single_inode+0x88/0x387 > > May 24 01:53:50 hq kernel: sync_sb_inodes+0x1b4/0x2a8 > > writeback_inodes+0x63/0xdc May 24 01:53:50 hq kernel: > > background_writeout+0x66/0x9f pdflush+0x0/0x1ad > > May 24 01:53:50 hq kernel: pdflush+0xef/0x1ad > > background_writeout+0x0/0x9f May 24 01:53:50 hq kernel: > > kthread+0xc2/0xc6 kthread+0x0/0xc6 May 24 01:53:50 hq kernel: > > kernel_thread_helper+0x5/0xb > > > > .and I've spammed such messages. This "internal error" isn't a good > > reason to shut down the file system? > > Actaully, that error does shut the filesystem down in most cases. When you > see that output, the function is returning -EFSCORRUPTED. You've got a > corrupted freespace btree. 
> > The reason why you get spammed is that this is happening during background > writeback, and there is no one to return the -EFSCORRUPTED error to. The > background writeback path doesn't specifically detect shut down filesystems > or trigger shutdowns on errors because that happens in different layers so > you just end up with failed data writes. These errors will occur on the > next foreground data or metadata allocation and that will shut the > filesystem down at that point. > > I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in > this case we should be shutting down the filesystem. That would certainly > cut down on the spamming and would not appear to change anything other > behaviour.... If I remember correctly, my file system wasn't shutted down at all, it was "writeable" for whole night, the yafc slowly "written" files to it. Maybe all write operations had failed, but yafc doesn't warn. Spamming is just annoying when we need to find out what went wrong (My kernel.log is 300Mb), but for data security it's important to react to EFSCORRUPTED error in any case, I think so. Please consider this. > > I think if there's a sign of corrupted file system, the first thing we > > should do is to stop writes (or the entire FS) and let the admin to > > examine the situation. > > Yes, that's *exactly* what a shutdown does. In this case, your writes are > being stopped - hence the error messages - but the filesystem has not yet > been shutdown..... All writes being stopped that were involved in the freespace btree, but a few operations were executed (on the corrupted FS), right? Ignoring of EFSCORRUPTED isn't a good idea in this case. 
-- d From owner-xfs@oss.sgi.com Sun May 27 19:17:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 19:17:30 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S2HNWt016282 for ; Sun, 27 May 2007 19:17:26 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA01034; Mon, 28 May 2007 12:17:22 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4S2HKAf101593323; Mon, 28 May 2007 12:17:21 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4S2HISa106121051; Mon, 28 May 2007 12:17:18 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Mon, 28 May 2007 12:17:18 +1000 From: David Chinner To: Pallai Roland Cc: David Chinner , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070528021718.GZ85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <200705251635.36533.dap@mail.index.hu> <20070528003010.GS85884050@sgi.com> <200705280350.18384.dap@mail.index.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705280350.18384.dap@mail.index.hu> User-Agent: Mutt/1.4.2.1i X-archive-position: 11526 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote: > On Monday 28 May 2007 02:30:11 David Chinner wrote: > > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: > > > .and I've spammed such messages. 
This "internal error" isn't a good > > > reason to shut down the file system? > > > > Actually, that error does shut the filesystem down in most cases. When you > > see that output, the function is returning -EFSCORRUPTED. You've got a > > corrupted freespace btree. > > > > The reason why you get spammed is that this is happening during background > > writeback, and there is no one to return the -EFSCORRUPTED error to. The > > background writeback path doesn't specifically detect shut down filesystems > > or trigger shutdowns on errors because that happens in different layers so > > you just end up with failed data writes. These errors will occur on the > > next foreground data or metadata allocation and that will shut the > > filesystem down at that point. > > > > I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in > > this case we should be shutting down the filesystem. That would certainly > > cut down on the spamming and would not appear to change any other > > behaviour.... > If I remember correctly, my file system wasn't shut down at all; it > was "writeable" for the whole night, and yafc slowly "wrote" files to it. Maybe > all write operations had failed, but yafc doesn't warn. So you never created new files or directories, unlinked files or directories, did synchronous writes, etc? Just had slowly growing files? > Spamming is just annoying when we need to find out what went wrong (my > kernel.log is 300 MB), but for data security I think it's important to react to > an EFSCORRUPTED error in any case. Please consider this. The filesystem has responded correctly to the corruption in terms of data security (i.e. failed the data write and warned noisily about it), but it probably hasn't done everything it should.... Hmmmm. A quick look at the Linux code makes me think that background writeback on Linux has never been able to cause a shutdown in this case. The same error on Irix will definitely cause a shutdown, though....
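To make the difference concrete, here is a minimal userspace sketch in C of a background write path that escalates a corruption error into a shutdown instead of ignoring it. It is an illustration only, not XFS code: `struct sketch_fs`, `sketch_background_write()` and the value of `SKETCH_EFSCORRUPTED` are hypothetical names and numbers made up for this example, and returning -EIO after a shutdown is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in error code for this sketch only; not taken from the XFS headers. */
#define SKETCH_EFSCORRUPTED 990

/* Hypothetical per-filesystem state: a shutdown flag and a count of logged reports. */
struct sketch_fs {
	bool		shut_down;
	unsigned int	reports;
};

/*
 * Hypothetical background write. With no caller to hand the error to,
 * the first corruption error shuts the filesystem down; later writes
 * then fail with -EIO without logging anything further.
 */
int
sketch_background_write(struct sketch_fs *fs, bool btree_corrupt)
{
	assert(fs != NULL);

	if (fs->shut_down)
		return -EIO;		/* already shut down: fail quietly */

	if (btree_corrupt) {
		fs->reports++;		/* report the corruption once */
		fs->shut_down = true;	/* escalate instead of ignoring it */
		return -SKETCH_EFSCORRUPTED;
	}
	return 0;
}
```

With this shape, a corrupted freespace btree would produce one report and a shutdown rather than a report for every failed background write, which is the reduction in log noise discussed above.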
Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun May 27 22:19:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 22:19:40 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S5JVWt025287 for ; Sun, 27 May 2007 22:19:33 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA04330; Mon, 28 May 2007 15:19:30 +1000 Date: Mon, 28 May 2007 15:22:58 +1000 To: "xfs@oss.sgi.com" , xfs-dev Subject: [PATCH 3/3] XFS metadump utility From: "Barry Naujok" Organization: SGI Content-Type: multipart/mixed; boundary=----------kI0MHNesDwnWBMiH0h2rG0 MIME-Version: 1.0 Message-ID: User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11529 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs ------------kI0MHNesDwnWBMiH0h2rG0 Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 Content-Transfer-Encoding: 7bit xfs_metadump and xfs_mdrestore man pages. 
------------kI0MHNesDwnWBMiH0h2rG0 Content-Disposition: attachment; filename=xfs_md_man.patch Content-Type: application/octet-stream; name=xfs_md_man.patch Content-Transfer-Encoding: Base64 Cj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQp4ZnNwcm9ncy9tYW4v bWFuOC94ZnNfZGIuOAo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0K Ci0tLSBhL3hmc3Byb2dzL21hbi9tYW44L3hmc19kYi44CTIwMDctMDUtMjgg MTU6MDk6MzguMDAwMDAwMDAwICsxMDAwCisrKyBiL3hmc3Byb2dzL21hbi9t YW44L3hmc19kYi44CTIwMDctMDUtMjggMTI6NDY6MDAuODIzMTIxMzE5ICsx MDAwCkBAIC0zNjcsNiArMzY3LDEwIEBAIElmIG5vIFxmMmxhYmVsXGYxIGlz IGdpdmVuLCB0aGUgY3VycmVudCAKIFN0YXJ0IGxvZ2dpbmcgb3V0cHV0IHRv IFxmMmZpbGVuYW1lXGYxLCBzdG9wIGxvZ2dpbmcsCiBvciBwcmludCB0aGUg Y3VycmVudCBsb2dnaW5nIHN0YXR1cy4KIC5UUAorXGYzbWV0YWR1bXBcZjEg WyBcZjMtZWdvd1xmMSBdIFxmMmZpbGVuYW1lXGYxCitEdW1wcyBtZXRhZGF0 YSB0byBhIGZpbGUuIFNlZQorLkJSIHhmc19tZXRhZHVtcCAiKDgpIGZvciBt b3JlIGluZm9ybWF0aW9uLiIKKy5UUAogXGYzbmNoZWNrXGYxIFsgXGYzXC1z XGYxIF0gWyBcZjNcLWlcZjEgXGYyaW5vXGYxIF0gLi4uCiBQcmludCBuYW1l LWlub2RlIHBhaXJzLgogQSBcZjNibG9ja2dldCBcLW5cZjEgY29tbWFuZCBt dXN0IGJlIHJ1biBmaXJzdCB0byBnYXRoZXIgdGhlIGluZm9ybWF0aW9uLgpA QCAtMTIzOSw2ICsxMjQzLDcgQEAgeGZzX2FkbWluKDgpLAogeGZzX2NoZWNr KDgpLAogeGZzX2NvcHkoOCksCiB4ZnNfbG9ncHJpbnQoOCksCit4ZnNfbWV0 YWR1bXAoOCksCiB4ZnNfbmNoZWNrKDgpLAogeGZzX3JlcGFpcig4KSwKIG1v dW50KDgpLAoKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Cnhmc3By b2dzL21hbi9tYW44L3hmc19tZHJlc3RvcmUuOAo9PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT0KCi0tLSBhL3hmc3Byb2dzL21hbi9tYW44L3hmc19t ZHJlc3RvcmUuOAkyMDA2LTA2LTE3IDAwOjU4OjI0LjAwMDAwMDAwMCArMTAw MAorKysgYi94ZnNwcm9ncy9tYW4vbWFuOC94ZnNfbWRyZXN0b3JlLjgJMjAw Ny0wNS0yOCAxMzowMDowMS45MTk0MTg1MzEgKzEwMDAKQEAgLTAsMCArMSw0 OCBAQAorLlRIIHhmc19tZHJlc3RvcmUgOAorLlNIIE5BTUUKK3hmc19tZHJl 
c3RvcmUgXC0gcmVzdG9yZXMgYW4gWEZTIG1ldGFkdW1wIGltYWdlIHRvIGEg ZmlsZXN5c3RlbSBpbWFnZQorLlNIIFNZTk9QU0lTCisuQiB4ZnNfbWRyZXN0 b3JlCisuUkIgWyBcLWcgXQorLkkgc291cmNlCisuSSB0YXJnZXQKKy5TSCBE RVNDUklQVElPTgorLkIgeGZzX21kcmVzdG9yZQoraXMgYSBkZWJ1Z2dpbmcg dG9vbCB0aGF0IHJlc3RvcmVzIGEgbWV0YWRhdGEgaW1hZ2UgZ2VuZXJhdGVk IGJ5CisuQlIgeGZzX21ldGFkdW1wICg4KQordG8gYSBmaWxlc3lzdGVtLiBU aGUKKy5JIHNvdXJjZQorYXJndW1lbnQgc3BlY2lmaWVzIHRoZSBsb2NhdGlv biBvZiB0aGUgbWV0YWR1bXAgaW1hZ2UgYW5kIHRoZQorLkkgdGFyZ2V0Cith cmd1bWVudCBzcGVjaWZpZXMgdGhlIGRlc3RpbmF0aW9uIGZvciB0aGUgZmls c3lzdGVtIGltYWdlLgorSWYgdGhlCisuSSBzb3VyY2UKK2lzIFwtLCB0aGVu IHRoZSBtZXRhZGF0YSBpbWFnZSBpcyByZWFkIGZyb20gc3RkaW4uIFRoaXMg YWxsb3dzIHRoZSBvdXRwdXQgb2YKK2JlIGFub3RoZXIgcHJvZ3JhbSBzdWNo IGFzIGEgY29tcHJlc3Npb24gYXBwbGljYXRpb24gdG8gYmUgcmVkaXJlY3Rl ZCB0bworLkJSIHhmc19tZHJlc3RvcmUgLgorVGhlCisuSSB0YXJnZXQKK2Nh biBiZSBlaXRoZXIgYSBmaWxlIG9yIGEgZGV2aWNlLgorLlBQCisuQiB4ZnNf bWRyZXN0b3JlCitzaG91bGQgbm90IGJlIHVzZWQgdG8gcmVzdG9yZSBtZXRh ZGF0YSBvbnRvIGFuIGV4aXN0aW5nIGZpbGVzeXN0ZW0gdW5sZXNzCit5b3Ug YXJlIGNvbXBsZXRlbHkgY2VydGFpbiB0aGUKKy5JIHRhcmdldAorY2FuIGJl IGRlc3Ryb3llZC4KKy5QUAorLlNIIE9QVElPTlMKKy5UUAorLkIgXC1nCitT aG93cyByZXN0b3JlIHByb2dyZXNzIG9uIHN0ZG91dC4KKy5TSCBESUFHTk9T VElDUworLkIgeGZzX21kcmVzdG9yZQorcmV0dXJucyBhbiBleGl0IGNvZGUg b2YgMCBpZiBhbGwgdGhlIG1ldGFkYXRhIGlzIHN1Y2Nlc2Z1bGx5IHJlc3Rv cmVkIG9yCisxIGlmIGFuIGVycm9yIG9jY3Vycy4KKy5TSCBTRUUgQUxTTwor LkJSIHhmc19tZXRhZHVtcCAoOCksCisuQlIgeGZzX3JlcGFpciAoOCksCisu QlIgeGZzX2NoZWNrICg4KSwKKy5CUiB4ZnMgKDUpCisuU0ggQlVHUworRW1h aWwgYnVnIHJlcG9ydHMgdG8KKy5CUiB4ZnNAb3NzLnNnaS5jb20gLgpcIE5v IG5ld2xpbmUgYXQgZW5kIG9mIGZpbGUKCj09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PQp4ZnNwcm9ncy9tYW4vbWFuOC94ZnNfbWV0YWR1bXAuOAo9 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KCi0tLSBhL3hmc3Byb2dz L21hbi9tYW44L3hmc19tZXRhZHVtcC44CTIwMDYtMDYtMTcgMDA6NTg6MjQu 
MDAwMDAwMDAwICsxMDAwCisrKyBiL3hmc3Byb2dzL21hbi9tYW44L3hmc19t ZXRhZHVtcC44CTIwMDctMDUtMjggMTU6MDc6NTguMzUwMTQ3NTUxICsxMDAw CkBAIC0wLDAgKzEsMTI3IEBACisuVEggeGZzX21ldGFkdW1wIDgKKy5TSCBO QU1FCit4ZnNfbWV0YWR1bXAgXC0gY29weSBYRlMgZmlsZXN5c3RlbSBtZXRh ZGF0YSB0byBhIGZpbGUKKy5TSCBTWU5PUFNJUworLkIgeGZzX21ldGFkdW1w CisuUkIgWyBcLWVmZ293IF0KKy5SQiBbIFwtbAorLklSIGxvZ2RldiBdCisu SSBzb3VyY2UKKy5JIHRhcmdldAorLlNIIERFU0NSSVBUSU9OCisuQiB4ZnNf bWV0YWR1bXAKK2lzIGEgZGVidWdnaW5nIHRvb2wgdGhhdCBjb3BpZXMgdGhl IG1ldGFkYXRhIGZyb20gYW4gWEZTIGZpbGVzeXN0ZW0gdG8gYSBmaWxlLgor VGhlCisuSSBzb3VyY2UKK2FyZ3VtZW50IG11c3QgYmUgdGhlIHBhdGhuYW1l IG9mIHRoZSBkZXZpY2Ugb3IgZmlsZQorY29udGFpbmluZyB0aGUgWEZTIGZp bGVzeXN0ZW0gYW5kIHRoZQorLkkgdGFyZ2V0Cithcmd1bWVudCBzcGVjaWZp ZXMgdGhlIGRlc3RpbmF0aW9uIGZpbGUgbmFtZS4KK0lmCisuSSB0YXJnZXQK K2lzIFwtLCB0aGVuIHRoZSBvdXRwdXQgaXMgc2VudCB0byBzdGRvdXQuIFRo aXMgYWxsb3dzIHRoZSBvdXRwdXQgdG8gYmUKK3JlZGlyZWN0ZWQgdG8gYW5v dGhlciBwcm9ncmFtIHN1Y2ggYXMgYSBjb21wcmVzc2lvbiBhcHBsaWNhdGlv bi4KKy5QUAorLkIgeGZzX21ldGFkdW1wCitzaG91bGQgb25seSBiZSB1c2Vk IHRvIGNvcHkgdW5tb3VudGVkIGZpbGVzeXN0ZW1zLCByZWFkLW9ubHkgbW91 bnRlZAorZmlsZXN5c3RlbXMsIG9yIGZyb3plbiBmaWxlc3lzdGVtcyAoc2Vl CisuQlIgeGZzX2ZyZWV6ZSAoOCkpLgorT3RoZXJ3aXNlLCB0aGUgZ2VuZXJh dGVkIGR1bXAgY291bGQgYmUgaW5jb25zaXN0ZW50IG9yIGNvcnJ1cHQuCisu UFAKKy5CIHhmc19tZXRhZHVtcAorZG9lcyBub3QgYWx0ZXIgdGhlIHNvdXJj ZSBmaWxlc3lzdGVtIGluIGFueSB3YXkuIFRoZQorLkkgdGFyZ2V0CitpbWFn ZSBpcyBhIGNvbnRpZ3VvdXMgKG5vbi1zcGFyc2UpIGZpbGUgY29udGFpbmlu ZyBhbGwgdGhlCitmaWxlc3lzdGVtJ3MgbWV0YWRhdGEgYW5kIGluZGV4ZXMg dG8gd2hlcmUgdGhlIGJsb2NrcyB3ZXJlIGNvcGllZCBmcm9tLgorLlBQCitC eSBkZWZhdWx0LAorLkIgeGZzX21ldGFkdW1wCitvYmZ1c2NhdGVzIG1vc3Qg ZGlyZWN0b3J5IG5hbWVzIGFuZCBleHRlbmRlZCBhdHRyaWJ1dGUgbmFtZXMg dG8gYWxsb3cgdGhlIGR1bXBzCit0byBiZSBzZW50IHdpdGhvdXQgcmV2ZWFs aW5nIGNvbmZpZGVudGlhbCBpbmZvcm1hdGlvbi4gRXh0ZW5kZWQgYXR0cmli dXRlCit2YWx1ZXMgYXJlIHplcm9lZCBhbmQgbm8gZGF0YSBpcyBjb3BpZWQu IFRoZSBvbmx5IGV4Y2VwdGlvbnMgYXJlIGRpcmVjdG9yeQorb3IgYXR0cmli 
dXRlIG5hbWVzIHRoYXQgYXJlIDQgb3IgbGVzcyBjaGFyYWN0ZXJzIGluIGxl bmd0aC4gQWxzbyBkaXJlY3RvcnkKK25hbWVzIHRoYXQgc3BhbiBleHRlbnRz ICh0aGlzIGNhbiBvbmx5IG9jY3VyIHdpdGggdGhlCisuQlIgbWtmcy54ZnMg KDgpCitvcHRpb25zIHdoZXJlCisuQiBcLW4KKy5JIHNpemUKKz4KKy5CIFwt YgorLklSIHNpemUgKQorYXJlIG5vdCBvYmZ1c2NhdGVkLgorLlBQCisuQiB4 ZnNfbWV0YWR1bXAKK3Nob3VsZCBub3QgYmUgdXNlZCBmb3IgYW55IHB1cnBv c2VzIG90aGVyIHRoYW4gZm9yIGRlYnVnZ2luZyBhbmQgcmVwb3J0aW5nCitm aWxlc3lzdGVtIHByb2JsZW1zLiBUaGUgbW9zdCBjb21tb24gdXNhZ2Ugc2Nl bmFyaW8gZm9yIHRoaXMgdG9vbCBpcyB3aGVuCisuQlIgeGZzX3JlcGFpciAo OCkKK2ZhaWxzIHRvIHJlcGFpciBhIGZpbGVzeXN0ZW0gYW5kIGEgbWV0YWR1 bXAgaW1hZ2UgY2FuIGJlIHNlbnQgZm9yCithbmFseXNpcy4KKy5QUAorVGhl IGZpbGUgZ2VuZXJhdGVkIGJ5CisuQiB4ZnNfbWV0YWR1bXAKK2NhbiBiZSBy ZXN0b3JlZCB0byBmaWxlc3lzdGVtIGltYWdlIChtaW51cyB0aGUgZGF0YSkg dXNpbmcgdGhlCisuQlIgeGZzX21kcmVzdG9yZSAoOCkKK3Rvb2wuCisuUFAK Ky5TSCBPUFRJT05TCisuVFAKKy5CIFwtZQorU3RvcHMgdGhlIGR1bXAgb24g YSByZWFkIGVycm9yLiBOb21hbGx5LCBpdCB3aWxsIGlnbm9yZSByZWFkIGVy cm9ycyBhbmQgY29weQorYXMgbXVjaCBtZXRhZHVtcCBkYXRhIGFzIGlzIGFj Y2Vzc2libGUuCisuVFAKKy5CIFwtZgorU3BlY2lmaWVzIHRoYXQgdGhlIGZp bGVzeXN0ZW0gaW1hZ2UgdG8gYmUgcHJvY2Vzc2VkIGlzIHN0b3JlZCBpbiBh IHJlZ3VsYXIgZmlsZQorKHNlZSB0aGUKKy5CIG1rZnMueGZzIC1kCitmaWxl IG9wdGlvbikuIFRoaXMgY2FuIGFsc28gaGFwcGVuIGlmIGFuIGltYWdlIGNv cHkgb2YgYSBmaWxlc3lzdGVtIGhhcworYmVlbiBtYWRlIGludG8gYW4gb3Jk aW5hcnkgZmlsZSB3aXRoCisuQlIgeGZzX2NvcHkgKDgpLgorLlRQCisuQiBc LWcKK1Nob3dzIGR1bXAgcHJvZ3Jlc3MuIFRoaXMgaXMgc2VudCB0byBzdGRv dXQgaWYgdGhlCisuSSB0YXJnZXQKK2lzIGEgZmlsZSBvciB0byBzdGRlcnIg aWYgdGhlCisuSSB0YXJnZXQKK2lzIHN0ZG91dC4KKy5UUAorLkJJIFwtbCAi IGxvZ2RldiIKK0ZvciBmaWxlc3lzdGVtcyB3aGljaCB1c2UgYW4gZXh0ZXJu YWwgbG9nLCB0aGlzIHNwZWNpZmllcyB0aGUgZGV2aWNlIHdoZXJlIHRoZQor ZXh0ZXJuYWwgbG9nIHJlc2lkZXMuIFRoZSBleHRlcm5hbCBsb2cgaXMgbm90 IGNvcGllZCwgb25seSBpbnRlcm5hbCBsb2dzIGFyZQorY29waWVkLgorLlRQ CisuQiBcLW8KK0Rpc2FibGVzIG9iZnVzY2F0aW9uIG9mIGZpbGUgbmFtZXMg YW5kIGV4dGVuZGVkIGF0dHJpYnV0ZXMuCisuVFAKKy5CIFwtdworUHJpbnRz 
IHdhcm5pbmdzIG9mIGluY29uc2lzdGFudCBtZXRhZGF0YSBlbmNvdW50ZXJl ZCB0byBzdGRlcnIuIEJhZCBtZXRhZGF0YQoraXMgc3RpbGwgY29waWVkLgor LlNIIERJQUdOT1NUSUNTCisuQiB4ZnNfbWV0YWR1bXAKK3JldHVybnMgYW4g ZXhpdCBjb2RlIG9mIDAgaWYgYWxsIHJlYWRhYmxlIG1ldGFkYXRhIGlzIHN1 Y2Nlc2Z1bGx5IGNvcGllZCBvcgorMSBpZiBhIHdyaXRlIGVycm9yIG9jY3Vy cyBvciBhIHJlYWQgZXJyb3Igb2NjdXJzIGFuZCB0aGUKKy5CIFwtZQorb3B0 aW9uIHVzZWQuCisuU0ggTk9URVMKK0FzCisuQiB4ZnNfbWV0YWR1bXAKK2Nv cGllcyBtZXRhZGF0YSBvbmx5LCBpdCBkb2VzIG5vdCBtYXR0ZXIgaWYgdGhl CisuSSBzb3VyY2UKK2ZpbGVzeXN0ZW0gaGFzIGEgcmVhbHRpbWUgc2VjdGlv biBvciBub3QuIElmIHRoZSBmaWxlc3lzdGVtIGhhcyBhbiBleHRlcm5hbAor bG9nLCBpdCBpcyBub3QgY29waWVkLiBJbnRlcm5hbCBsb2dzIGFyZSBjb3Bp ZWQgYW5kIGFueSBvdXRzdGFuZGluZyBsb2cKK3RyYW5zYWN0aW9ucyBhcmUg bm90IG9iZnVzY2F0ZWQgaWYgdGhleSBjb250YWluIG5hbWVzLgorLlBQCisu QiB4ZnNfbWV0YWR1bXAKK2lzIGEgc2hlbGwgd3JhcHBlciBhcm91bmQgdGhl CisuQlIgeGZzX2RiICg4KQorLkIgbWV0YWR1bXAKK2NvbW1hbmQuCisuU0gg U0VFIEFMU08KKy5CUiB4ZnNfcmVwYWlyICg4KSwKKy5CUiB4ZnNfbWRyZXN0 b3JlICg4KSwKKy5CUiB4ZnNfZnJlZXplICg4KSwKKy5CUiB4ZnNfZGIgKDgp LAorLkJSIHhmc19jb3B5ICg4KSwKKy5CUiB4ZnMgKDUpCisuU0ggQlVHUwor RW1haWwgYnVnIHJlcG9ydHMgdG8KKy5CUiB4ZnNAb3NzLnNnaS5jb20gLgpc IE5vIG5ld2xpbmUgYXQgZW5kIG9mIGZpbGUKCj09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PQp4ZnNwcm9ncy9tYW4vbWFuOC94ZnNfcmVwYWlyLjgK PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CgotLS0gYS94ZnNwcm9n cy9tYW4vbWFuOC94ZnNfcmVwYWlyLjgJMjAwNy0wNS0yOCAxNTowOTozOC4w MDAwMDAwMDAgKzEwMDAKKysrIGIveGZzcHJvZ3MvbWFuL21hbjgveGZzX3Jl cGFpci44CTIwMDctMDUtMjggMTI6NDc6NDQuODg5ODAwNTUyICsxMDAwCkBA IC00MzQsOSArNDM0LDE3IEBAIG1hcHMsIHBhcnRpY3VsYXJseSBsb3N0IGJs b2NrcyBvciBzdWJ0bHkKIFRoZSBuby1tb2RpZnkgbW9kZSBjYW4gZ2VuZXJh dGUgcmVwZWF0ZWQgd2FybmluZ3MgYWJvdXQKIHRoZSBzYW1lIHByb2JsZW1z IGJlY2F1c2UgaXQgY2Fubm90IGZpeCB0aGUgcHJvYmxlbXMgYXMgdGhleQog YXJlIGVuY291bnRlcmVkLgorLlBQCitJZiBhIGZpbGVzeXN0ZW0gZmFpbHMg 
dG8gYmUgcmVwYWlyZWQsIGEgbWV0YWR1bXAgaW1hZ2UgY2FuIGJlIGdlbmVy YXRlZAord2l0aAorLkJSIHhmc19tZXRhZHVtcCAoOCkKK2FuZCBiZSBzZW50 IHRvIGFuIFhGUyBtYWludGFpbmVyIHRvIGJlIGFuYWx5c2VkIGFuZAorLkIg eGZzX3JlcGFpcgorZml4ZWQgYW5kL29yIGltcHJvdmVkLgogLlNIIFNFRSBB TFNPCiBkZCgxKSwKIG1rZnMueGZzKDgpLAogdW1vdW50KDgpLAogeGZzX2No ZWNrKDgpLAoreGZzX21ldGFkdW1wKDgpLAogeGZzKDUpLgo= ------------kI0MHNesDwnWBMiH0h2rG0-- From owner-xfs@oss.sgi.com Sun May 27 22:19:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 22:19:39 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S5JUWt025281 for ; Sun, 27 May 2007 22:19:33 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA04328; Mon, 28 May 2007 15:19:29 +1000 Date: Mon, 28 May 2007 15:22:55 +1000 To: "xfs@oss.sgi.com" , xfs-dev Subject: [PATCH 2/3] XFS metadump utility From: "Barry Naujok" Organization: SGI Content-Type: multipart/mixed; boundary=----------TWTWHiHWl3WLjMq9Cu9UYh MIME-Version: 1.0 Message-ID: User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11528 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs ------------TWTWHiHWl3WLjMq9Cu9UYh Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 Content-Transfer-Encoding: 7bit xfs_mdrestore ------------TWTWHiHWl3WLjMq9Cu9UYh Content-Disposition: attachment; filename=xfs_mdrestore.patch Content-Type: application/octet-stream; name=xfs_mdrestore.patch Content-Transfer-Encoding: Base64 Cj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQp4ZnNwcm9ncy9NYWtl ZmlsZQo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KCi0tLSBhL3hm c3Byb2dzL01ha2VmaWxlCTIwMDctMDUtMjggMTQ6NTI6MzQuMDAwMDAwMDAw ICsxMDAwCisrKyBiL3hmc3Byb2dzL01ha2VmaWxlCTIwMDctMDUtMTggMTI6 MDM6MjIuNzc0MzI2ODU0ICsxMDAwCkBAIC0xNiw3ICsxNiw3IEBAIExESVJU ID0gY29uZmlnLmxvZyAuZGVwIGNvbmZpZy5zdGF0dXMgY28KIAlMb2dzLyog YnVpbHQgLmNlbnN1cyBpbnN0YWxsLiogaW5zdGFsbC1kZXYuKiAqLmd6CiAK IFNVQkRJUlMgPSBpbmNsdWRlIGxpYnhmcyBsaWJ4bG9nIGxpYnhjbWQgbGli aGFuZGxlIGxpYmRpc2sgXAotCWNvcHkgZGIgZnNjayBncm93ZnMgaW8gbG9n cHJpbnQgbWtmcyBxdW90YSByZXBhaXIgcnRjcCBcCisJY29weSBkYiBmc2Nr IGdyb3dmcyBpbyBsb2dwcmludCBta2ZzIHF1b3RhIG1kcmVzdG9yZSByZXBh aXIgcnRjcCBcCiAJbTQgbWFuIGRvYyBwbyBkZWJpYW4gYnVpbGQKIAogZGVm YXVsdDogJChDT05GSUdVUkUpCgo9PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT0KeGZzcHJvZ3MvbWRyZXN0b3JlL01ha2VmaWxlCj09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PQoKLS0tIGEveGZzcHJvZ3MvbWRyZXN0b3Jl L01ha2VmaWxlCTIwMDYtMDYtMTcgMDA6NTg6MjQuMDAwMDAwMDAwICsxMDAw CisrKyBiL3hmc3Byb2dzL21kcmVzdG9yZS9NYWtlZmlsZQkyMDA3LTA1LTE4 IDEyOjU2OjE2Ljg4NTA0MDYyMSArMTAwMApAQCAtMCwwICsxLDIyIEBACisj CisjIENvcHlyaWdodCAoYykgMjAwNyBTaWxpY29uIEdyYXBoaWNzLCBJbmMu ICBBbGwgUmlnaHRzIFJlc2VydmVkLgorIworCitUT1BESVIgPSAuLgoraW5j bHVkZSAkKFRPUERJUikvaW5jbHVkZS9idWlsZGRlZnMKKworTFRDT01NQU5E ID0geGZzX21kcmVzdG9yZQorQ0ZJTEVTID0geGZzX21kcmVzdG9yZS5jCisK K0xMRExJQlMgPSAkKExJQlhGUykgJChMSUJSVCkKK0xUREVQRU5ERU5DSUVT ID0gJChMSUJYRlMpCitMTERGTEFHUyA9IC1zdGF0aWMKKworZGVmYXVsdDog JChMVENPTU1BTkQpCisKK2luY2x1ZGUgJChCVUlMRFJVTEVTKQorCitpbnN0 YWxsOgorCSQoSU5TVEFMTCkgLW0gNzU1IC1kICQoUEtHX0JJTl9ESVIpCisJ JChMVElOU1RBTEwpIC1tIDc1NSAkKExUQ09NTUFORCkgJChQS0dfQklOX0RJ UikKK2luc3RhbGwtZGV2OgoKPT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09Cnhmc3Byb2dzL21kcmVzdG9yZS94ZnNfbWRyZXN0b3JlLmMKPT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 
PT09PT09PT09PT09PT09PT09PT09PT09CgotLS0gYS94ZnNwcm9ncy9tZHJl c3RvcmUveGZzX21kcmVzdG9yZS5jCTIwMDYtMDYtMTcgMDA6NTg6MjQuMDAw MDAwMDAwICsxMDAwCisrKyBiL3hmc3Byb2dzL21kcmVzdG9yZS94ZnNfbWRy ZXN0b3JlLmMJMjAwNy0wNS0yNSAxNjozNTozOS44NjI5MDU3MDMgKzEwMDAK QEAgLTAsMCArMSwyNjMgQEAKKy8qCisgKiBDb3B5cmlnaHQgKGMpIDIwMDcg U2lsaWNvbiBHcmFwaGljcywgSW5jLgorICogQWxsIFJpZ2h0cyBSZXNlcnZl ZC4KKyAqCisgKiBUaGlzIHByb2dyYW0gaXMgZnJlZSBzb2Z0d2FyZTsgeW91 IGNhbiByZWRpc3RyaWJ1dGUgaXQgYW5kL29yCisgKiBtb2RpZnkgaXQgdW5k ZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5z ZSBhcworICogcHVibGlzaGVkIGJ5IHRoZSBGcmVlIFNvZnR3YXJlIEZvdW5k YXRpb24uCisgKgorICogVGhpcyBwcm9ncmFtIGlzIGRpc3RyaWJ1dGVkIGlu IHRoZSBob3BlIHRoYXQgaXQgd291bGQgYmUgdXNlZnVsLAorICogYnV0IFdJ VEhPVVQgQU5ZIFdBUlJBTlRZOyB3aXRob3V0IGV2ZW4gdGhlIGltcGxpZWQg d2FycmFudHkgb2YKKyAqIE1FUkNIQU5UQUJJTElUWSBvciBGSVRORVNTIEZP UiBBIFBBUlRJQ1VMQVIgUFVSUE9TRS4gIFNlZSB0aGUKKyAqIEdOVSBHZW5l cmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFpbHMuCisgKgorICog WW91IHNob3VsZCBoYXZlIHJlY2VpdmVkIGEgY29weSBvZiB0aGUgR05VIEdl bmVyYWwgUHVibGljIExpY2Vuc2UKKyAqIGFsb25nIHdpdGggdGhpcyBwcm9n cmFtOyBpZiBub3QsIHdyaXRlIHRoZSBGcmVlIFNvZnR3YXJlIEZvdW5kYXRp b24sCisgKiBJbmMuLCAgNTEgRnJhbmtsaW4gU3QsIEZpZnRoIEZsb29yLCBC b3N0b24sIE1BICAwMjExMC0xMzAxICBVU0EKKyAqLworCisjaW5jbHVkZSA8 bGlieGZzLmg+CisjaW5jbHVkZSAieGZzX21ldGFkdW1wLmgiCisKK2NoYXIg CQkqcHJvZ25hbWU7CitpbnQJCXNob3dfcHJvZ3Jlc3MgPSAwOworaW50CQlw cm9ncmVzc19zaW5jZV93YXJuaW5nID0gMDsKKworc3RhdGljIHZvaWQKK2Zh dGFsKGNvbnN0IGNoYXIgKm1zZywgLi4uKQoreworCXZhX2xpc3QJCWFyZ3M7 CisKKwl2YV9zdGFydChhcmdzLCBtc2cpOworCWZwcmludGYoc3RkZXJyLCAi JXM6ICIsIHByb2duYW1lKTsKKwl2ZnByaW50ZihzdGRlcnIsIG1zZywgYXJn cyk7CisJZXhpdCgxKTsKK30KKworc3RhdGljIHZvaWQKK3ByaW50X3Byb2dy ZXNzKGNvbnN0IGNoYXIgKmZtdCwgLi4uKQoreworCWNoYXIJCWJ1Zls2MF07 CisJdmFfbGlzdAkJYXA7CisKKwl2YV9zdGFydChhcCwgZm10KTsKKwl2c25w cmludGYoYnVmLCBzaXplb2YoYnVmKSwgZm10LCBhcCk7CisJdmFfZW5kKGFw KTsKKwlidWZbc2l6ZW9mKGJ1ZiktMV0gPSAnXDAnOworCisJcHJpbnRmKCJc 
ciUtNTlzIiwgYnVmKTsKKwlmZmx1c2goc3Rkb3V0KTsKKwlwcm9ncmVzc19z aW5jZV93YXJuaW5nID0gMTsKK30KKworc3RhdGljIHZvaWQKK3BlcmZvcm1f cmVzdG9yZSgKKwlGSUxFCQkJKnNyY19mLAorCWludAkJCWRzdF9mZCwKKwlp bnQJCQlpc190YXJnZXRfZmlsZSkKK3sKKwl4ZnNfbWV0YWJsb2NrX3QgCSpt ZXRhYmxvY2s7CS8qIGhlYWRlciArIGluZGV4ICsgYmxvY2tzICovCisJX19i ZTY0CQkJKmJsb2NrX2luZGV4OworCWNoYXIJCQkqYmxvY2tfYnVmZmVyOwor CWludAkJCWJsb2NrX3NpemU7CisJaW50CQkJbWF4X2luZGljaWVzOworCWlu dAkJCWN1cl9pbmRleDsKKwl4ZnNfbWV0YWJsb2NrX3QJCXRtYjsKKwl4ZnNf c2JfdAkJc2I7CisJX19pbnQ2NF90CQlieXRlc19yZWFkOworCisJLyoKKwkg KiByZWFkIGluIGZpcnN0IGJsb2NrcyAoc3VwZXJibG9jayAwKSwgc2V0ICJp bnByb2dyZXNzIiBmbGFnIGZvciBpdCwKKwkgKiByZWFkIGluIHRoZSByZXN0 IG9mIHRoZSBmaWxlLCBhbmQgaWYgY29tcGxldGUsIGNsZWFyIFNCIDAncwor CSAqICJpbnByb2dyZXNzIGZsYWciCisJICovCisKKwlpZiAoZnJlYWQoJnRt Yiwgc2l6ZW9mKHRtYiksIDEsIHNyY19mKSAhPSAxKQorCQlmYXRhbCgiZXJy b3IgcmVhZGluZyBmcm9tIGZpbGU6ICVzXG4iLCBzdHJlcnJvcihlcnJubykp OworCisJaWYgKGJlMzJfdG9fY3B1KHRtYi5tYl9tYWdpYykgIT0gWEZTX01E X01BR0lDKQorCQlmYXRhbCgic3BlY2lmaWVkIGZpbGUgaXMgbm90IGEgbWV0 YWRhdGEgZHVtcFxuIik7CisKKwlibG9ja19zaXplID0gMSA8PCB0bWIubWJf YmxvY2tsb2c7CisJbWF4X2luZGljaWVzID0gKGJsb2NrX3NpemUgLSBzaXpl b2YoeGZzX21ldGFibG9ja190KSkgLyBzaXplb2YoX19iZTY0KTsKKworCW1l dGFibG9jayA9ICh4ZnNfbWV0YWJsb2NrX3QgKiljYWxsb2MobWF4X2luZGlj aWVzICsgMSwgYmxvY2tfc2l6ZSk7CisJaWYgKG1ldGFibG9jayA9PSBOVUxM KQorCQlmYXRhbCgibWVtb3J5IGFsbG9jYXRpb24gZmFpbHVyZVxuIik7CisK KwltZXRhYmxvY2stPm1iX2NvdW50ID0gYmUxNl90b19jcHUodG1iLm1iX2Nv dW50KTsKKwltZXRhYmxvY2stPm1iX2Jsb2NrbG9nID0gdG1iLm1iX2Jsb2Nr bG9nOworCisJaWYgKG1ldGFibG9jay0+bWJfY291bnQgPT0gMCB8fCBtZXRh YmxvY2stPm1iX2NvdW50ID4gbWF4X2luZGljaWVzKQorCQlmYXRhbCgiYmFk IGJsb2NrIGNvdW50OiAldVxuIiwgbWV0YWJsb2NrLT5tYl9jb3VudCk7CisK KwlibG9ja19pbmRleCA9IChfX2JlNjQgKikoKGNoYXIgKiltZXRhYmxvY2sg KyBzaXplb2YoeGZzX21ldGFibG9ja190KSk7CisJYmxvY2tfYnVmZmVyID0g KGNoYXIgKiltZXRhYmxvY2sgKyBibG9ja19zaXplOworCisJaWYgKGZyZWFk KGJsb2NrX2luZGV4LCBibG9ja19zaXplIC0gc2l6ZW9mKHRtYiksIDEsIHNy 
Y19mKSAhPSAxKQorCQlmYXRhbCgiZXJyb3IgcmVhZGluZyBmcm9tIGZpbGU6 ICVzXG4iLCBzdHJlcnJvcihlcnJubykpOworCisJaWYgKGJsb2NrX2luZGV4 WzBdICE9IDApCisJCWZhdGFsKCJmaXJzdCBibG9jayBpcyBub3QgdGhlIHBy aW1hcnkgc3VwZXJibG9ja1xuIik7CisKKworCWlmIChmcmVhZChibG9ja19i dWZmZXIsIG1ldGFibG9jay0+bWJfY291bnQgPDwgbWV0YWJsb2NrLT5tYl9i bG9ja2xvZywKKwkJCTEsIHNyY19mKSAhPSAxKQorCQlmYXRhbCgiZXJyb3Ig cmVhZGluZyBmcm9tIGZpbGU6ICVzXG4iLCBzdHJlcnJvcihlcnJubykpOwor CisJbGlieGZzX3hsYXRlX3NiKGJsb2NrX2J1ZmZlciwgJnNiLCAxLCBYRlNf U0JfQUxMX0JJVFMpOworCisJaWYgKHNiLnNiX21hZ2ljbnVtICE9IFhGU19T Ql9NQUdJQykKKwkJZmF0YWwoImJhZCBtYWdpYyBudW1iZXIgZm9yIHByaW1h cnkgc3VwZXJibG9ja1xuIik7CisKKwkoKHhmc19zYl90KilibG9ja19idWZm ZXIpLT5zYl9pbnByb2dyZXNzID0gMTsKKworCWlmIChpc190YXJnZXRfZmls ZSkgIHsKKwkJLyogZW5zdXJlIHJlZ3VsYXIgZmlsZXMgYXJlIGNvcnJlY3Rs eSBzaXplZCAqLworCisJCWlmIChmdHJ1bmNhdGU2NChkc3RfZmQsIHNiLnNi X2RibG9ja3MgKiBzYi5zYl9ibG9ja3NpemUpKQorCQkJZmF0YWwoImNhbm5v dCBzZXQgZmlsZXN5c3RlbSBpbWFnZSBzaXplOiAlc1xuIiwKKwkJCQlzdHJl cnJvcihlcnJubykpOworCX0gZWxzZSAgeworCQkvKiBlbnN1cmUgZGV2aWNl IGlzIHN1ZmZpY2llbnRseSBsYXJnZSBlbm91Z2ggKi8KKworCQljaGFyCQkq bGJbWEZTX01BWF9TRUNUT1JTSVpFXSA9IHsgMCB9OworCQlvZmY2NF90CQlv ZmY7CisKKwkJb2ZmID0gc2Iuc2JfZGJsb2NrcyAqIHNiLnNiX2Jsb2Nrc2l6 ZSAtIHNpemVvZihsYik7CisJCWlmIChwd3JpdGU2NChkc3RfZmQsIGxiLCBz aXplb2YobGIpLCBvZmYpIDwgMCkKKwkJCWZhdGFsKCJmYWlsZWQgdG8gd3Jp dGUgbGFzdCBibG9jaywgaXMgdGFyZ2V0IHRvbyAiCisJCQkJInNtYWxsPyAo ZXJyb3I6ICVzKVxuIiwgc3RyZXJyb3IoZXJybm8pKTsKKwl9CisKKwlieXRl c19yZWFkID0gMDsKKworCWZvciAoOzspIHsKKwkJaWYgKHNob3dfcHJvZ3Jl c3MgJiYgKGJ5dGVzX3JlYWQgJiAoKDEgPDwgMjApIC0gMSkpID09IDApCisJ CQlwcmludF9wcm9ncmVzcygiJWxsZCBNQiByZWFkXG4iLCBieXRlc19yZWFk ID4+IDIwKTsKKworCQlmb3IgKGN1cl9pbmRleCA9IDA7IGN1cl9pbmRleCA8 IG1ldGFibG9jay0+bWJfY291bnQ7IGN1cl9pbmRleCsrKSB7CisJCQlpZiAo cHdyaXRlNjQoZHN0X2ZkLCAmYmxvY2tfYnVmZmVyW2N1cl9pbmRleCA8PAor CQkJCQkJbWV0YWJsb2NrLT5tYl9ibG9ja2xvZ10sCisJCQkJCWJsb2NrX3Np emUsCisJCQkJCWJlNjRfdG9fY3B1KGJsb2NrX2luZGV4W2N1cl9pbmRleF0p 
IDw8CisJCQkJCQlCQlNISUZUKSA8IDApCisJCQkJZmF0YWwoImVycm9yIHdy aXRpbmcgYmxvY2sgJWxsdTogJXNcbiIsCisJCQkJCWJlNjRfdG9fY3B1KGJs b2NrX2luZGV4W2N1cl9pbmRleF0pIDw8IEJCU0hJRlQsCisJCQkJCXN0cmVy cm9yKGVycm5vKSk7CisJCX0KKwkJaWYgKG1ldGFibG9jay0+bWJfY291bnQg PCBtYXhfaW5kaWNpZXMpCisJCQlicmVhazsKKworCQlpZiAoZnJlYWQobWV0 YWJsb2NrLCBibG9ja19zaXplLCAxLCBzcmNfZikgIT0gMSkKKwkJCWZhdGFs KCJlcnJvciByZWFkaW5nIGZyb20gZmlsZTogJXNcbiIsIHN0cmVycm9yKGVy cm5vKSk7CisKKwkJaWYgKG1ldGFibG9jay0+bWJfY291bnQgPT0gMCkKKwkJ CWJyZWFrOworCisJCW1ldGFibG9jay0+bWJfY291bnQgPSBiZTE2X3RvX2Nw dShtZXRhYmxvY2stPm1iX2NvdW50KTsKKwkJaWYgKG1ldGFibG9jay0+bWJf Y291bnQgPiBtYXhfaW5kaWNpZXMpCisJCQlmYXRhbCgiYmFkIGJsb2NrIGNv dW50OiAldVxuIiwgbWV0YWJsb2NrLT5tYl9jb3VudCk7CisKKwkJaWYgKGZy ZWFkKGJsb2NrX2J1ZmZlciwgbWV0YWJsb2NrLT5tYl9jb3VudCA8PAorCQkJ CSBtZXRhYmxvY2stPm1iX2Jsb2NrbG9nLCAxLCBzcmNfZikgIT0gMSkKKwkJ CWZhdGFsKCJlcnJvciByZWFkaW5nIGZyb20gZmlsZTogJXNcbiIsIHN0cmVy cm9yKGVycm5vKSk7CisKKwkJYnl0ZXNfcmVhZCArPSBibG9ja19zaXplOwor CX0KKworCWlmIChwcm9ncmVzc19zaW5jZV93YXJuaW5nKQorCQlwdXRjaGFy KCdcbicpOworCisJbWVtc2V0KGJsb2NrX2J1ZmZlciwgMCwgc2Iuc2Jfc2Vj dHNpemUpOworCXNiLnNiX2lucHJvZ3Jlc3MgPSAwOworCWxpYnhmc194bGF0 ZV9zYihibG9ja19idWZmZXIsICZzYiwgMCwgWEZTX1NCX0FMTF9CSVRTKTsK KwlpZiAocHdyaXRlKGRzdF9mZCwgYmxvY2tfYnVmZmVyLCBzYi5zYl9zZWN0 c2l6ZSwgMCkgPCAwKQorCQlmYXRhbCgiZXJyb3Igd3JpdGluZyBwcmltYXJ5 IHN1cGVyYmxvY2s6ICVzXG4iLCBzdHJlcnJvcihlcnJubykpOworCisJZnJl ZShtZXRhYmxvY2spOworfQorCitzdGF0aWMgdm9pZAordXNhZ2Uodm9pZCkK K3sKKwlmcHJpbnRmKHN0ZGVyciwgIlVzYWdlOiAlcyBbLWJnXSBzb3VyY2Ug dGFyZ2V0XG4iLCBwcm9nbmFtZSk7CisJZXhpdCgxKTsKK30KKworZXh0ZXJu IGludAlwbGF0Zm9ybV9jaGVja19pc21vdW50ZWQoY2hhciAqLCBjaGFyICos IHN0cnVjdCBzdGF0NjQgKiwgaW50KTsKKworaW50CittYWluKAorCWludCAJ CWFyZ2MsCisJY2hhciAJCSoqYXJndikKK3sKKwlGSUxFCQkqc3JjX2Y7CisJ aW50CQlkc3RfZmQ7CisJaW50CQljOworCWludAkJb3Blbl9mbGFnczsKKwlz dHJ1Y3Qgc3RhdDY0CXN0YXRidWY7CisJaW50CQlpc190YXJnZXRfZmlsZTsK KworCXByb2duYW1lID0gYmFzZW5hbWUoYXJndlswXSk7CisKKwl3aGlsZSAo 
KGMgPSBnZXRvcHQoYXJnYywgYXJndiwgImdWIikpICE9IEVPRikgeworCQlz d2l0Y2ggKGMpIHsKKwkJCWNhc2UgJ2cnOgorCQkJCXNob3dfcHJvZ3Jlc3Mg PSAxOworCQkJCWJyZWFrOworCQkJY2FzZSAnVic6CisJCQkJcHJpbnRmKCIl cyB2ZXJzaW9uICVzXG4iLCBwcm9nbmFtZSwgVkVSU0lPTik7CisJCQkJZXhp dCgwKTsKKwkJCWRlZmF1bHQ6CisJCQkJdXNhZ2UoKTsKKwkJfQorCX0KKwor CWlmIChhcmdjIC0gb3B0aW5kICE9IDIpCisJCXVzYWdlKCk7CisKKwkvKiBv cGVuIHNvdXJjZSAqLworCWlmIChzdHJjbXAoYXJndltvcHRpbmRdLCAiLSIp ID09IDApIHsKKwkJc3JjX2YgPSBzdGRpbjsKKwkJaWYgKGlzYXR0eShmaWxl bm8oc3RkaW4pKSkKKwkJCWZhdGFsKCJjYW5ub3QgcmVhZCBmcm9tIGEgdGVy bWluYWxcbiIpOworCX0gZWxzZSB7CisJCXNyY19mID0gZm9wZW4oYXJndltv cHRpbmRdLCAicmIiKTsKKwkJaWYgKHNyY19mID09IE5VTEwpCisJCQlmYXRh bCgiY2Fubm90IG9wZW4gc291cmNlIGR1bXAgZmlsZVxuIik7CisJfQorCW9w dGluZCsrOworCisJLyogY2hlY2sgYW5kIG9wZW4gdGFyZ2V0ICovCisJb3Bl bl9mbGFncyA9IE9fUkRXUjsKKwlpc190YXJnZXRfZmlsZSA9IDA7CisJaWYg KHN0YXQ2NChhcmd2W29wdGluZF0sICZzdGF0YnVmKSA8IDApICB7CisJCS8q IG9rLCBhc3N1bWUgaXQncyBhIGZpbGUgYW5kIGNyZWF0ZSBpdCAqLworCQlv cGVuX2ZsYWdzIHw9IE9fQ1JFQVQ7CisJCWlzX3RhcmdldF9maWxlID0gMTsK Kwl9IGVsc2UgaWYgKFNfSVNSRUcoc3RhdGJ1Zi5zdF9tb2RlKSkgIHsKKwkJ b3Blbl9mbGFncyB8PSBPX1RSVU5DOworCQlpc190YXJnZXRfZmlsZSA9IDE7 CisJfSBlbHNlICB7CisJCS8qCisJCSAqIGNoZWNrIHRvIG1ha2Ugc3VyZSBh IGZpbGVzeXN0ZW0gaXNuJ3QgbW91bnRlZCBvbiB0aGUgZGV2aWNlCisJCSAq LworCQlpZiAocGxhdGZvcm1fY2hlY2tfaXNtb3VudGVkKGFyZ3Zbb3B0aW5k XSwgTlVMTCwgJnN0YXRidWYsIDApKQorCQkJZmF0YWwoImEgZmlsZXN5c3Rl bSBpcyBtb3VudGVkIG9uIHRhcmdldCBkZXZpY2UgXCIlc1wiLCIKKwkJCQki IGNhbm5vdCByZXN0b3JlIHRvIGEgbW91bnRlZCBmaWxlc3lzdGVtLlxuIiwK KwkJCQlhcmd2W29wdGluZF0pOworCX0KKworCWRzdF9mZCA9IG9wZW4oYXJn dltvcHRpbmRdLCBvcGVuX2ZsYWdzLCAwNjQ0KTsKKwlpZiAoZHN0X2ZkIDwg MCkKKwkJZmF0YWwoImNvdWxkbid0IG9wZW4gdGFyZ2V0IFwiJXNcIlxuIiwg YXJndltvcHRpbmRdKTsKKworCXBlcmZvcm1fcmVzdG9yZShzcmNfZiwgZHN0 X2ZkLCBpc190YXJnZXRfZmlsZSk7CisKKwljbG9zZShkc3RfZmQpOworCWlm IChzcmNfZiAhPSBzdGRpbikKKwkJZmNsb3NlKHNyY19mKTsKKworCXJldHVy biAwOworfQo= ------------TWTWHiHWl3WLjMq9Cu9UYh-- From owner-xfs@oss.sgi.com Sun May 27 22:19:32 2007 
Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 22:19:37 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S5JQWt025246 for ; Sun, 27 May 2007 22:19:28 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA04321; Mon, 28 May 2007 15:19:24 +1000 Date: Mon, 28 May 2007 15:22:52 +1000 To: "xfs@oss.sgi.com" , xfs-dev Subject: [PATCH 1/3] XFS metadump utility From: "Barry Naujok" Organization: SGI Content-Type: multipart/mixed; boundary=----------k9x39jGsayoNr3EUsJAZe6 MIME-Version: 1.0 Message-ID: User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11527 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs ------------k9x39jGsayoNr3EUsJAZe6 Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Back in February, I posted a patch to xfs_db to capture metadata from a filesystem into a file http://oss.sgi.com/archives/xfs/2007-02/msg00072.html . I have now updated it with the following changes: - obfuscates directory names and attribute names - zeros attribute values - better support for stdin/stdout redirection. - separated the restore tool into its own binary. This was required because, as part of xfs_db, the restore operation needed an already valid filesystem to overwrite. It's also a very small, compact piece of code. - now has man pages. Part 1 contains the changes for metadump. Part 2 contains the changes for the restore. Part 3 contains the man pages.
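As a rough illustration of why the restore side can stay small: going by the restore code in part 2, a dump is read as a sequence of index blocks, each holding a short header followed by big-endian 64-bit disk addresses, and each data block is written to the target at its address shifted by the basic block size. The sketch below shows only that arithmetic. The `sketch_` names are hypothetical, and the 8-byte header size is an assumption, since the real structure is defined in xfs_metadump.h.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BBSHIFT		9	/* XFS basic blocks are 512 bytes */
#define SKETCH_HEADER_BYTES	8	/* assumed size of the metablock header */

/* Number of 64-bit index slots left in one index block after the header. */
unsigned int
sketch_max_indices(unsigned int blocklog)
{
	uint32_t	block_size;

	assert(blocklog >= 4 && blocklog < 31);
	block_size = UINT32_C(1) << blocklog;
	return (block_size - SKETCH_HEADER_BYTES) / sizeof(uint64_t);
}

/* Decode a big-endian 64-bit value from raw bytes, independent of host order. */
uint64_t
sketch_be64(const unsigned char *p)
{
	uint64_t	v = 0;
	int		i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Byte offset in the target image for one index entry. */
uint64_t
sketch_target_offset(const unsigned char *index_entry)
{
	return sketch_be64(index_entry) << SKETCH_BBSHIFT;
}
```

Under these assumptions, a dump with 512-byte blocks carries 63 addresses per index block, and an index entry of 2 lands at byte offset 1024 in the restored image.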
------------k9x39jGsayoNr3EUsJAZe6 Content-Disposition: attachment; filename=xfs_metadump.patch Content-Type: application/octet-stream; name=xfs_metadump.patch Content-Transfer-Encoding: Base64 Cj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQp4ZnNwcm9ncy9kYi9N YWtlZmlsZQo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KCi0tLSBh L3hmc3Byb2dzL2RiL01ha2VmaWxlCTIwMDctMDUtMjggMTU6MDk6NTkuMDAw MDAwMDAwICsxMDAwCisrKyBiL3hmc3Byb2dzL2RiL01ha2VmaWxlCTIwMDct MDUtMTcgMTQ6NDE6MTguNzgwNzU2NzI5ICsxMDAwCkBAIC0xMSwxMSArMTEs MTEgQEAgSEZJTEVTID0gYWRkci5oIGFnZi5oIGFnZmwuaCBhZ2kuaCBhdHRy LgogCWJtYXBidC5oIGJtcm9vdC5oIGJub2J0LmggY2hlY2suaCBjbnRidC5o IGNvbW1hbmQuaCBjb252ZXJ0LmggXAogCWRicmVhZC5oIGRlYnVnLmggZGly LmggZGlyMi5oIGRpcjJzZi5oIGRpcnNob3J0LmggZHF1b3QuaCBlY2hvLmgg XAogCWZhZGRyLmggZmllbGQuaCBmbGlzdC5oIGZwcmludC5oIGZyYWcuaCBm cmVlc3AuaCBoYXNoLmggaGVscC5oIFwKLQlpbml0LmggaW5vYnQuaCBpbm9k ZS5oIGlucHV0LmggaW8uaCBtYWxsb2MuaCBvdXRwdXQuaCBcCisJaW5pdC5o IGlub2J0LmggaW5vZGUuaCBpbnB1dC5oIGlvLmggbWFsbG9jLmggbWV0YWR1 bXAuaCBvdXRwdXQuaCBcCiAJcHJpbnQuaCBxdWl0Lmggc2IuaCBzaWcuaCBz dHJ2ZWMuaCB0ZXh0LmggdHlwZS5oIHdyaXRlLmggXAogCWF0dHJzZXQuaAog Q0ZJTEVTID0gJChIRklMRVM6Lmg9LmMpCi1MU1JDRklMRVMgPSB4ZnNfYWRt aW4uc2ggeGZzX2NoZWNrLnNoIHhmc19uY2hlY2suc2gKK0xTUkNGSUxFUyA9 IHhmc19hZG1pbi5zaCB4ZnNfY2hlY2suc2ggeGZzX25jaGVjay5zaCB4ZnNf bWV0YWR1bXAuc2gKIExMRExJQlMJPSAkKExJQlhGUykgJChMSUJYTE9HKSAk KExJQlVVSUQpICQoTElCUlQpCiBMVERFUEVOREVOQ0lFUyA9ICQoTElCWEZT KSAkKExJQlhMT0cpCiBMTERGTEFHUyArPSAtc3RhdGljCkBAIC00MCw0ICs0 MCw1IEBAIGluc3RhbGw6IGRlZmF1bHQKIAkkKElOU1RBTEwpIC1tIDc1NSB4 ZnNfYWRtaW4uc2ggJChQS0dfQklOX0RJUikveGZzX2FkbWluCiAJJChJTlNU QUxMKSAtbSA3NTUgeGZzX2NoZWNrLnNoICQoUEtHX0JJTl9ESVIpL3hmc19j aGVjawogCSQoSU5TVEFMTCkgLW0gNzU1IHhmc19uY2hlY2suc2ggJChQS0df QklOX0RJUikveGZzX25jaGVjaworCSQoSU5TVEFMTCkgLW0gNzU1IHhmc19t ZXRhZHVtcC5zaCAkKFBLR19CSU5fRElSKS94ZnNfbWV0YWR1bXAKIGluc3Rh 
bGwtZGV2OgoKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Cnhmc3By b2dzL2RiL2NvbW1hbmQuYwo9PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT0KCi0tLSBhL3hmc3Byb2dzL2RiL2NvbW1hbmQuYwkyMDA3LTA1LTI4IDE1 OjA5OjU5LjAwMDAwMDAwMCArMTAwMAorKysgYi94ZnNwcm9ncy9kYi9jb21t YW5kLmMJMjAwNy0wNS0xOCAxMjozMDozMS44NzY0NTc1OTkgKzEwMDAKQEAg LTQwLDYgKzQwLDcgQEAKICNpbmNsdWRlICJpbm9kZS5oIgogI2luY2x1ZGUg ImlucHV0LmgiCiAjaW5jbHVkZSAiaW8uaCIKKyNpbmNsdWRlICJtZXRhZHVt cC5oIgogI2luY2x1ZGUgIm91dHB1dC5oIgogI2luY2x1ZGUgInByaW50Lmgi CiAjaW5jbHVkZSAicXVpdC5oIgpAQCAtMTMxLDYgKzEzMiw3IEBAIGluaXRf Y29tbWFuZHModm9pZCkKIAlpbm9kZV9pbml0KCk7CiAJaW5wdXRfaW5pdCgp OwogCWlvX2luaXQoKTsKKwltZXRhZHVtcF9pbml0KCk7CiAJb3V0cHV0X2lu aXQoKTsKIAlwcmludF9pbml0KCk7CiAJcXVpdF9pbml0KCk7Cgo9PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT0KeGZzcHJvZ3MvZGIvaW5pdC5jCj09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PQoKLS0tIGEveGZzcHJvZ3Mv ZGIvaW5pdC5jCTIwMDctMDUtMjggMTU6MDk6NTkuMDAwMDAwMDAwICsxMDAw CisrKyBiL3hmc3Byb2dzL2RiL2luaXQuYwkyMDA3LTA1LTE3IDE0OjQxOjE5 LjA2NDcxOTMxNyArMTAwMApAQCAtMTA3LDggKzEwNyw4IEBAIGluaXQoCiAJ fQogCiAJaWYgKHJlYWRfYmJzKFhGU19TQl9EQUREUiwgMSwgJmJ1ZnAsIE5V TEwpKSB7Ci0JCWRicHJpbnRmKF8oIiVzOiAlcyBpcyBpbnZhbGlkIChjYW5u b3QgcmVhZCBmaXJzdCA1MTIgYnl0ZXMpXG4iKSwKLQkJCXByb2duYW1lLCBm c2RldmljZSk7CisJCWZwcmludGYoc3RkZXJyLCBfKCIlczogJXMgaXMgaW52 YWxpZCAoY2Fubm90IHJlYWQgZmlyc3QgNTEyICIKKwkJCSJieXRlcylcbiIp LCBwcm9nbmFtZSwgZnNkZXZpY2UpOwogCQlleGl0KDEpOwogCX0KIApAQCAt MTE4LDcgKzExOCw3IEBAIGluaXQoCiAKIAlzYnAgPSAmeG1vdW50Lm1fc2I7 CiAJaWYgKHNicC0+c2JfbWFnaWNudW0gIT0gWEZTX1NCX01BR0lDKSB7Ci0J CWRicHJpbnRmKF8oIiVzOiB1bmV4cGVjdGVkIFhGUyBTQiBtYWdpYyBudW1i ZXIgMHglMDh4XG4iKSwKKwkJZnByaW50ZihzdGRlcnIsIF8oIiVzOiB1bmV4 cGVjdGVkIFhGUyBTQiBtYWdpYyBudW1iZXIgMHglMDh4XG4iKSwKIAkJCXBy 
b2duYW1lLCBzYnAtPnNiX21hZ2ljbnVtKTsKIAl9CiAKQEAgLTEyOCw4ICsx MjgsOCBAQCBpbml0KAogCQltcCA9IGxpYnhmc19tb3VudCgmeG1vdW50LCBz YnAsIHguZGRldiwgeC5sb2dkZXYsIHgucnRkZXYsCiAJCQkJTElCWEZTX01P VU5UX0RFQlVHR0VSKTsKIAkJaWYgKCFtcCkgewotCQkJZGJwcmludGYoXygi JXM6IGRldmljZSAlcyB1bnVzYWJsZSAobm90IGFuIFhGUyBmaWxlc3lzdGVt PylcbiIpLAotCQkJcHJvZ25hbWUsIGZzZGV2aWNlKTsKKwkJCWZwcmludGYo c3RkZXJyLCBfKCIlczogZGV2aWNlICVzIHVudXNhYmxlIChub3QgYW4gWEZT ICIKKwkJCQkiZmlsZXN5c3RlbT8pXG4iKSwgcHJvZ25hbWUsIGZzZGV2aWNl KTsKIAkJCWV4aXQoMSk7CiAJCX0KIAl9Cgo9PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT0KeGZzcHJvZ3MvZGIvbWV0YWR1bXAuYwo9PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT0KCi0tLSBhL3hmc3Byb2dzL2RiL21ldGFk dW1wLmMJMjAwNi0wNi0xNyAwMDo1ODoyNC4wMDAwMDAwMDAgKzEwMDAKKysr IGIveGZzcHJvZ3MvZGIvbWV0YWR1bXAuYwkyMDA3LTA1LTI1IDE2OjE2OjA0 Ljc4MzIxOTgyOSArMTAwMApAQCAtMCwwICsxLDE1NTQgQEAKKy8qCisgKiBD b3B5cmlnaHQgKGMpIDIwMDcgU2lsaWNvbiBHcmFwaGljcywgSW5jLgorICog QWxsIFJpZ2h0cyBSZXNlcnZlZC4KKyAqCisgKiBUaGlzIHByb2dyYW0gaXMg ZnJlZSBzb2Z0d2FyZTsgeW91IGNhbiByZWRpc3RyaWJ1dGUgaXQgYW5kL29y CisgKiBtb2RpZnkgaXQgdW5kZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2Vu ZXJhbCBQdWJsaWMgTGljZW5zZSBhcworICogcHVibGlzaGVkIGJ5IHRoZSBG cmVlIFNvZnR3YXJlIEZvdW5kYXRpb24uCisgKgorICogVGhpcyBwcm9ncmFt IGlzIGRpc3RyaWJ1dGVkIGluIHRoZSBob3BlIHRoYXQgaXQgd291bGQgYmUg dXNlZnVsLAorICogYnV0IFdJVEhPVVQgQU5ZIFdBUlJBTlRZOyB3aXRob3V0 IGV2ZW4gdGhlIGltcGxpZWQgd2FycmFudHkgb2YKKyAqIE1FUkNIQU5UQUJJ TElUWSBvciBGSVRORVNTIEZPUiBBIFBBUlRJQ1VMQVIgUFVSUE9TRS4gIFNl ZSB0aGUKKyAqIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3Jl IGRldGFpbHMuCisgKgorICogWW91IHNob3VsZCBoYXZlIHJlY2VpdmVkIGEg Y29weSBvZiB0aGUgR05VIEdlbmVyYWwgUHVibGljIExpY2Vuc2UKKyAqIGFs b25nIHdpdGggdGhpcyBwcm9ncmFtOyBpZiBub3QsIHdyaXRlIHRoZSBGcmVl IFNvZnR3YXJlIEZvdW5kYXRpb24sCisgKiBJbmMuLCAgNTEgRnJhbmtsaW4g U3QsIEZpZnRoIEZsb29yLCBCb3N0b24sIE1BICAwMjExMC0xMzAxICBVU0EK 
KyAqLworCisjaW5jbHVkZSA8bGlieGZzLmg+CisjaW5jbHVkZSAiYm1hcC5o IgorI2luY2x1ZGUgImNvbW1hbmQuaCIKKyNpbmNsdWRlICJtZXRhZHVtcC5o IgorI2luY2x1ZGUgImlvLmgiCisjaW5jbHVkZSAib3V0cHV0LmgiCisjaW5j bHVkZSAidHlwZS5oIgorI2luY2x1ZGUgImluaXQuaCIKKyNpbmNsdWRlICJz aWcuaCIKKyNpbmNsdWRlICJ4ZnNfbWV0YWR1bXAuaCIKKworLyogY29weSBh bGwgbWV0YWRhdGEgc3RydWN0dXJlcyB0by9mcm9tIGEgZmlsZSAqLworCitz dGF0aWMgaW50CW1ldGFkdW1wX2YoaW50IGFyZ2MsIGNoYXIgKiphcmd2KTsK K3N0YXRpYyB2b2lkCW1ldGFkdW1wX2hlbHAodm9pZCk7CisKKy8qCisgKiBt ZXRhZHVtcCBjb21tYW5kcyBpc3N1ZSBpbmZvL3dvcm5pbmdzL2Vycm9ycyB0 byBzdGFuZGFyZCBlcnJvciBhcworICogbWV0YWR1bXAgc3VwcG9ydHMgc3Rk b3V0IGFzIGEgZGVzdGluYXRpb24uCisgKgorICogQWxsIHN0YXRpYyBmdW5j dGlvbnMgcmV0dXJuIHplcm8gb24gZmFpbHVyZSwgd2hpbGUgdGhlIHB1Ymxp YyBmdW5jdGlvbnMKKyAqIHJldHVybiB6ZXJvIG9uIHN1Y2Nlc3MuCisgKi8K Kworc3RhdGljIGNvbnN0IGNtZGluZm9fdAltZXRhZHVtcF9jbWQgPQorCXsg Im1ldGFkdW1wIiwgTlVMTCwgbWV0YWR1bXBfZiwgMCwgLTEsIDAsCisJCSJb LWVdIFstZ10gWy13XSBbLW9dIGZpbGVuYW1lIiwKKwkJImR1bXAgbWV0YWRh dGEgdG8gYSBmaWxlIiwgbWV0YWR1bXBfaGVscCB9OworCitzdGF0aWMgRklM RQkJKm91dGY7CQkvKiBtZXRhZHVtcCBmaWxlICovCisKK3N0YXRpYyB4ZnNf bWV0YWJsb2NrX3QgCSptZXRhYmxvY2s7CS8qIGhlYWRlciArIGluZGV4ICsg YnVmZmVycyAqLworc3RhdGljIF9fYmU2NAkJKmJsb2NrX2luZGV4Oworc3Rh dGljIGNoYXIJCSpibG9ja19idWZmZXI7CisKK3N0YXRpYyBpbnQJCW51bV9p bmRpY2llczsKK3N0YXRpYyBpbnQJCWN1cl9pbmRleDsKKworc3RhdGljIHhm c19pbm9fdAljdXJfaW5vOworCitzdGF0aWMgaW50CQlzaG93X3Byb2dyZXNz ID0gMDsKK3N0YXRpYyBpbnQJCXN0b3Bfb25fcmVhZF9lcnJvciA9IDA7Citz dGF0aWMgaW50CQlkb250X29iZnVzY2F0ZSA9IDA7CitzdGF0aWMgaW50CQlz aG93X3dhcm5pbmdzID0gMDsKK3N0YXRpYyBpbnQJCXByb2dyZXNzX3NpbmNl X3dhcm5pbmcgPSAwOworCit2b2lkCittZXRhZHVtcF9pbml0KHZvaWQpCit7 CisJYWRkX2NvbW1hbmQoJm1ldGFkdW1wX2NtZCk7Cit9CisKK3N0YXRpYyB2 b2lkCittZXRhZHVtcF9oZWxwKHZvaWQpCit7CisJZGJwcmludGYoCisiXG4i CisiIFRoZSAnbWV0YWR1bXAnIGNvbW1hbmQgZHVtcHMgdGhlIGtub3duIG1l dGFkYXRhIHRvIGEgY29tcGFjdCBmaWxlIHN1aXRhYmxlXG4iCisiIGZvciBj b21wcmVzc2luZyBhbmQgc2VuZGluZyB0byBhbiBYRlMgbWFpbnRhaW5lciBm 
b3IgY29ycnVwdGlvbiBhbmFseXNpcyBcbiIKKyIgb3IgeGZzX3JlcGFpciBm YWlsdXJlcy5cblxuIgorIiBUaGVyZSBhcmUgMyBvcHRpb25zOlxuIgorIiAg IC1lIC0tIElnbm9yZSByZWFkIGVycm9ycyBhbmQga2VlcCBnb2luZ1xuIgor IiAgIC1nIC0tIERpc3BsYXkgZHVtcCBwcm9ncmVzc1xuIgorIiAgIC1vIC0t IERvbid0IG9iZnVzY2F0ZSBuYW1lcyBhbmQgZXh0ZW5kZWQgYXR0cmlidXRl c1xuIgorIiAgIC13IC0tIFNob3cgd2FybmluZ3Mgb2YgYmFkIG1ldGFkYXRh IGluZm9ybWF0aW9uXG4iCisiXG4iKTsKK30KKworc3RhdGljIHZvaWQKK3By aW50X3dhcm5pbmcoY29uc3QgY2hhciAqZm10LCAuLi4pCit7CisJY2hhcgkJ YnVmWzIwMF07CisJdmFfbGlzdAkJYXA7CisKKwlpZiAoc2VlbmludCgpKQor CQlyZXR1cm47CisKKwl2YV9zdGFydChhcCwgZm10KTsKKwl2c25wcmludGYo YnVmLCBzaXplb2YoYnVmKSwgZm10LCBhcCk7CisJdmFfZW5kKGFwKTsKKwli dWZbc2l6ZW9mKGJ1ZiktMV0gPSAnXDAnOworCisJZnByaW50ZihzdGRlcnIs ICIlcyVzOiAlc1xuIiwgcHJvZ3Jlc3Nfc2luY2Vfd2FybmluZyA/ICJcbiIg OiAiIiwKKwkJCXByb2duYW1lLCBidWYpOworCXByb2dyZXNzX3NpbmNlX3dh cm5pbmcgPSAwOworfQorCitzdGF0aWMgdm9pZAorcHJpbnRfcHJvZ3Jlc3Mo Y29uc3QgY2hhciAqZm10LCAuLi4pCit7CisJY2hhcgkJYnVmWzYwXTsKKwl2 YV9saXN0CQlhcDsKKwlGSUxFCQkqZjsKKworCWlmIChzZWVuaW50KCkpCisJ CXJldHVybjsKKworCXZhX3N0YXJ0KGFwLCBmbXQpOworCXZzbnByaW50Zihi dWYsIHNpemVvZihidWYpLCBmbXQsIGFwKTsKKwl2YV9lbmQoYXApOworCWJ1 ZltzaXplb2YoYnVmKS0xXSA9ICdcMCc7CisKKwlmID0gKG91dGYgPT0gc3Rk b3V0KSA/IHN0ZGVyciA6IHN0ZG91dDsKKwlmcHJpbnRmKGYsICJcciUtNTlz IiwgYnVmKTsKKwlmZmx1c2goZik7CisJcHJvZ3Jlc3Nfc2luY2Vfd2Fybmlu ZyA9IDE7Cit9CisKKy8qCisgKiBBIGNvbXBsZXRlIGR1bXAgZmlsZSB3aWxs IGhhdmUgYSAiemVybyIgZW50cnkgaW4gdGhlIGxhc3QgaW5kZXggYmxvY2ss CisgKiBldmVuIGlmIHRoZSBkdW1wIGlzIGV4YWN0bHkgYWxpZ25lZCwgdGhl IGxhc3QgaW5kZXggd2lsbCBiZSBmdWxsIG9mCisgKiB6ZXJvcy4gSWYgdGhl IGxhc3QgaW5kZXggZW50cnkgaXMgbm9uLXplcm8sIHRoZSBkdW1wIGlzIGlu Y29tcGxldGUuCisgKiBDb3JyZXNwb25kaW5nbHksIHRoZSBsYXN0IGNodW5r IHdpbGwgaGF2ZSBhIGNvdW50IDwgbnVtX2luZGljaWVzLgorICovCisKK3N0 YXRpYyBpbnQKK3dyaXRlX2luZGV4KHZvaWQpCit7CisJLyoKKwkgKiB3cml0 ZSBpbmRleCBibG9jayBhbmQgZm9sbG93aW5nIGRhdGEgYmxvY2tzIChzdHJl YW1pbmcpCisJICovCisJbWV0YWJsb2NrLT5tYl9jb3VudCA9IGNwdV90b19i 
ZTE2KGN1cl9pbmRleCk7CisJaWYgKGZ3cml0ZShtZXRhYmxvY2ssIChjdXJf aW5kZXggKyAxKSA8PCBCQlNISUZULCAxLCBvdXRmKSAhPSAxKSB7CisJCXBy aW50X3dhcm5pbmcoImVycm9yIHdyaXRpbmcgdG8gZmlsZTogJXMiLCBzdHJl cnJvcihlcnJubykpOworCQlyZXR1cm4gMDsKKwl9CisKKwltZW1zZXQoYmxv Y2tfaW5kZXgsIDAsIG51bV9pbmRpY2llcyAqIHNpemVvZihfX2JlNjQpKTsK KwljdXJfaW5kZXggPSAwOworCXJldHVybiAxOworfQorCitzdGF0aWMgaW50 Cit3cml0ZV9idWYoCisJaW9jdXJfdAkJKmJ1ZikKK3sKKwljaGFyCQkqZGF0 YTsKKwlfX2ludDY0X3QJb2ZmOworCWludAkJaTsKKworCWZvciAoaSA9IDAs IG9mZiA9IGJ1Zi0+YmIsIGRhdGEgPSBidWYtPmRhdGE7CisJCQlpIDwgYnVm LT5ibGVuOworCQkJaSsrLCBvZmYrKywgZGF0YSArPSBCQlNJWkUpIHsKKwkJ YmxvY2tfaW5kZXhbY3VyX2luZGV4XSA9IGNwdV90b19iZTY0KG9mZik7CisJ CW1lbWNweSgmYmxvY2tfYnVmZmVyW2N1cl9pbmRleCA8PCBCQlNISUZUXSwg ZGF0YSwgQkJTSVpFKTsKKwkJaWYgKCsrY3VyX2luZGV4ID09IG51bV9pbmRp Y2llcykgeworCQkJaWYgKCF3cml0ZV9pbmRleCgpKQorCQkJCXJldHVybiAw OworCQl9CisJfQorCXJldHVybiAhc2VlbmludCgpOworfQorCisKK3N0YXRp YyBpbnQKK3NjYW5fYnRyZWUoCisJeGZzX2FnbnVtYmVyX3QJYWdubywKKwl4 ZnNfYWdibG9ja190CWFnYm5vLAorCWludAkJbGV2ZWwsCisJdHlwbm1fdAkJ YnR5cGUsCisJdm9pZAkJKmFyZywKKwlpbnQJCSgqZnVuYykoeGZzX2J0cmVl X2hkcl90CQkqYnRoZHIsCisJCQkJeGZzX2FnbnVtYmVyX3QJCWFnbm8sCisJ CQkJeGZzX2FnYmxvY2tfdAkJYWdibm8sCisJCQkJaW50CQkJbGV2ZWwsCisJ CQkJdHlwbm1fdAkJCWJ0eXBlLAorCQkJCXZvaWQJCQkqYXJnKSkKK3sKKwlw dXNoX2N1cigpOworCXNldF9jdXIoJnR5cHRhYltidHlwZV0sIFhGU19BR0Jf VE9fREFERFIobXAsIGFnbm8sIGFnYm5vKSwgYmxrYmIsCisJCQlEQl9SSU5H X0lHTiwgTlVMTCk7CisJaWYgKGlvY3VyX3RvcC0+ZGF0YSA9PSBOVUxMKSB7 CisJCXByaW50X3dhcm5pbmcoImNhbm5vdCByZWFkICVzIGJsb2NrICV1LyV1 IiwgdHlwdGFiW2J0eXBlXS5uYW1lLAorCQkJCWFnbm8sIGFnYm5vKTsKKwkJ cmV0dXJuICFzdG9wX29uX3JlYWRfZXJyb3I7CisJfQorCWlmICghd3JpdGVf YnVmKGlvY3VyX3RvcCkpCisJCXJldHVybiAwOworCisJaWYgKCEoKmZ1bmMp KGlvY3VyX3RvcC0+ZGF0YSwgYWdubywgYWdibm8sIGxldmVsIC0gMSwgYnR5 cGUsIGFyZykpCisJCXJldHVybiAwOworCisJcG9wX2N1cigpOworCXJldHVy biAxOworfQorCisvKiBmcmVlIHNwYWNlIHRyZWUgY29weSByb3V0aW5lcyAq LworCitzdGF0aWMgaW50Cit2YWxpZF9ibm8oCisJeGZzX2FnYmxvY2tfdAkJ 
Ym5vLAorCXhmc19hZ251bWJlcl90CQlhZ25vLAorCXhmc19hZ2Jsb2NrX3QJ CWFnYm5vLAorCXR5cG5tX3QJCQlidHlwZSkKK3sKKwlpZiAoYm5vID4gMCAm JiBibm8gPD0gbXAtPm1fc2Iuc2JfYWdibG9ja3MpCisJCXJldHVybiAxOwor CisJaWYgKHNob3dfd2FybmluZ3MpCisJCXByaW50X3dhcm5pbmcoImludmFs aWQgYmxvY2sgbnVtYmVyICgldSkgaW4gJXMgYmxvY2sgJXUvJXUiLAorCQkJ CWJubywgdHlwdGFiW2J0eXBlXS5uYW1lLCBhZ25vLCBhZ2Jubyk7CisJcmV0 dXJuIDA7Cit9CisKK3N0YXRpYyBpbnQKK3NjYW5mdW5jX2ZyZWVzcCgKKwl4 ZnNfYnRyZWVfaGRyX3QJCSpidGhkciwKKwl4ZnNfYWdudW1iZXJfdAkJYWdu bywKKwl4ZnNfYWdibG9ja190CQlhZ2JubywKKwlpbnQJCQlsZXZlbCwKKwl0 eXBubV90CQkJYnR5cGUsCisJdm9pZAkJCSphcmcpCit7CisJeGZzX2FsbG9j X3B0cl90CQkqcHA7CisJaW50CQkJaTsKKwlpbnQJCQlucmVjczsKKworCWlm IChsZXZlbCA9PSAwKQorCQlyZXR1cm4gMTsKKworCW5yZWNzID0gYmUxNl90 b19jcHUoYnRoZHItPmJiX251bXJlY3MpOworCWlmIChucmVjcyA+IG1wLT5t X2FsbG9jX214clsxXSkgeworCQlpZiAoc2hvd193YXJuaW5ncykKKwkJCXBy aW50X3dhcm5pbmcoImludmFsaWQgbnJlY3MgKCV1KSBpbiAlcyBibG9jayAl dS8ldSIsCisJCQkJCW5yZWNzLCB0eXB0YWJbYnR5cGVdLm5hbWUsIGFnbm8s IGFnYm5vKTsKKwkJcmV0dXJuIDE7CisJfQorCisJcHAgPSBYRlNfQlRSRUVf UFRSX0FERFIobXAtPm1fc2Iuc2JfYmxvY2tzaXplLCB4ZnNfYWxsb2MsIGJ0 aGRyLCAxLAorCQkJbXAtPm1fYWxsb2NfbXhyWzFdKTsKKwlmb3IgKGkgPSAw OyBpIDwgbnJlY3M7IGkrKykgeworCQlpZiAoIXZhbGlkX2JubyhiZTMyX3Rv X2NwdShwcFtpXSksIGFnbm8sIGFnYm5vLCBidHlwZSkpCisJCQljb250aW51 ZTsKKwkJaWYgKCFzY2FuX2J0cmVlKGFnbm8sIGJlMzJfdG9fY3B1KHBwW2ld KSwgbGV2ZWwsIGJ0eXBlLCBhcmcsCisJCQkJc2NhbmZ1bmNfZnJlZXNwKSkK KwkJCXJldHVybiAwOworCX0KKwlyZXR1cm4gMTsKK30KKworc3RhdGljIGlu dAorY29weV9mcmVlX2Jub19idHJlZSgKKwl4ZnNfYWdudW1iZXJfdAlhZ25v LAorCXhmc19hZ2ZfdAkqYWdmKQoreworCXhmc19hZ2Jsb2NrX3QJcm9vdDsK KwlpbnQJCWxldmVsczsKKworCXJvb3QgPSBiZTMyX3RvX2NwdShhZ2YtPmFn Zl9yb290c1tYRlNfQlROVU1fQk5PXSk7CisJbGV2ZWxzID0gYmUzMl90b19j cHUoYWdmLT5hZ2ZfbGV2ZWxzW1hGU19CVE5VTV9CTk9dKTsKKworCS8qIHZh bGlkYXRlIHJvb3QgYW5kIGxldmVscyBiZWZvcmUgcHJvY2Vzc2luZyB0aGUg dHJlZSAqLworCWlmIChyb290ID09IDAgfHwgcm9vdCA+IG1wLT5tX3NiLnNi X2FnYmxvY2tzKSB7CisJCWlmIChzaG93X3dhcm5pbmdzKQorCQkJcHJpbnRf 
d2FybmluZygiaW52YWxpZCBibG9jayBudW1iZXIgKCV1KSBpbiBibm9idCAi CisJCQkJCSJyb290IGluIGFnZiAldSIsIHJvb3QsIGFnbm8pOworCQlyZXR1 cm4gMTsKKwl9CisJaWYgKGxldmVscyA+PSBYRlNfQlRSRUVfTUFYTEVWRUxT KSB7CisJCWlmIChzaG93X3dhcm5pbmdzKQorCQkJcHJpbnRfd2FybmluZygi aW52YWxpZCBsZXZlbCAoJXUpIGluIGJub2J0IHJvb3QgIgorCQkJCQkiaW4g YWdmICV1IiwgbGV2ZWxzLCBhZ25vKTsKKwkJcmV0dXJuIDE7CisJfQorCisJ cmV0dXJuIHNjYW5fYnRyZWUoYWdubywgcm9vdCwgbGV2ZWxzLCBUWVBfQk5P QlQsIGFnZiwgc2NhbmZ1bmNfZnJlZXNwKTsKK30KKworc3RhdGljIGludAor Y29weV9mcmVlX2NudF9idHJlZSgKKwl4ZnNfYWdudW1iZXJfdAlhZ25vLAor CXhmc19hZ2ZfdAkqYWdmKQoreworCXhmc19hZ2Jsb2NrX3QJcm9vdDsKKwlp bnQJCWxldmVsczsKKworCXJvb3QgPSBiZTMyX3RvX2NwdShhZ2YtPmFnZl9y b290c1tYRlNfQlROVU1fQ05UXSk7CisJbGV2ZWxzID0gYmUzMl90b19jcHUo YWdmLT5hZ2ZfbGV2ZWxzW1hGU19CVE5VTV9DTlRdKTsKKworCS8qIHZhbGlk YXRlIHJvb3QgYW5kIGxldmVscyBiZWZvcmUgcHJvY2Vzc2luZyB0aGUgdHJl ZSAqLworCWlmIChyb290ID09IDAgfHwgcm9vdCA+IG1wLT5tX3NiLnNiX2Fn YmxvY2tzKSB7CisJCWlmIChzaG93X3dhcm5pbmdzKQorCQkJcHJpbnRfd2Fy bmluZygiaW52YWxpZCBibG9jayBudW1iZXIgKCV1KSBpbiBjbnRidCAiCisJ CQkJCSJyb290IGluIGFnZiAldSIsIHJvb3QsIGFnbm8pOworCQlyZXR1cm4g MTsKKwl9CisJaWYgKGxldmVscyA+PSBYRlNfQlRSRUVfTUFYTEVWRUxTKSB7 CisJCWlmIChzaG93X3dhcm5pbmdzKQorCQkJcHJpbnRfd2FybmluZygiaW52 YWxpZCBsZXZlbCAoJXUpIGluIGNudGJ0IHJvb3QgIgorCQkJCQkiaW4gYWdm ICV1IiwgbGV2ZWxzLCBhZ25vKTsKKwkJcmV0dXJuIDE7CisJfQorCisJcmV0 dXJuIHNjYW5fYnRyZWUoYWdubywgcm9vdCwgbGV2ZWxzLCBUWVBfQ05UQlQs IGFnZiwgc2NhbmZ1bmNfZnJlZXNwKTsKK30KKworLyogZmlsZW5hbWUgYW5k IGV4dGVuZGVkIGF0dHJpYnV0ZSBvYmZ1c2NhdGlvbiByb3V0aW5lcyAqLwor Cit0eXBlZGVmIHN0cnVjdCBuYW1lX2VudCB7CisJc3RydWN0IG5hbWVfZW50 CQkqbmV4dDsKKwl4ZnNfZGFoYXNoX3QJCWhhc2g7CisJaW50CSAgCSAgICAJ bmFtZWxlbjsKKwl1Y2hhcl90ICAgIAkgICAgCW5hbWVbMV07Cit9IG5hbWVf ZW50X3Q7CisKKyNkZWZpbmUgTkFNRV9UQUJMRV9TSVpFCQk0MDk2CisKK3N0 YXRpYyBuYW1lX2VudF90IAkJKipuYW1ldGFibGU7CisKK3N0YXRpYyBpbnQK K2NyZWF0ZV9uYW1ldGFibGUodm9pZCkKK3sKKwluYW1ldGFibGUgPSBjYWxs b2MoTkFNRV9UQUJMRV9TSVpFLCBzaXplb2YobmFtZV9lbnRfdCkpOworCXJl 
dHVybiBuYW1ldGFibGUgIT0gTlVMTDsKK30KKworc3RhdGljIHZvaWQKK2Ns ZWFyX25hbWV0YWJsZSh2b2lkKQoreworCWludAkJCWk7CisJbmFtZV9lbnRf dAkJKnA7CisKKwlmb3IgKGkgPSAwOyBpIDwgTkFNRV9UQUJMRV9TSVpFOyBp KyspIHsKKwkJd2hpbGUgKG5hbWV0YWJsZVtpXSkgeworCQkJcCA9IG5hbWV0 YWJsZVtpXTsKKwkJCW5hbWV0YWJsZVtpXSA9IHAtPm5leHQ7CisJCQlmcmVl KHApOworCQl9CisJfQorfQorCisKKyNkZWZpbmUgaXNfaW52YWxpZF9jaGFy KGMpCSgoYykgPT0gJy8nIHx8IChjKSA9PSAnXDAnKQorI2RlZmluZSByb2wz Mih4LHkpCQkoKCh4KSA8PCAoeSkpIHwgKCh4KSA+PiAoMzIgLSAoeSkpKSkK Kworc3RhdGljIGlubGluZSB1Y2hhcl90CityYW5kb21fZmlsZW5hbWVfY2hh cih2b2lkKQoreworCXVjaGFyX3QJCQljOworCisJZG8geworCQljID0gcmFu ZG9tKCkgJSAxMjcgKyAxOworCX0gd2hpbGUgKGMgPT0gJy8nKTsKKwlyZXR1 cm4gYzsKK30KKworc3RhdGljIGludAoraXNfc3BlY2lhbF9kaXJlbnQoCisJ eGZzX2lub190CQlpbm8sCisJaW50CQkJbmFtZWxlbiwKKwl1Y2hhcl90CQkJ Km5hbWUpCit7CisJc3RhdGljIHhmc19pbm9fdAlvcnBoYW5hZ2VfaW5vID0g MDsKKwljaGFyCQkJc1szMl07CisJaW50CQkJc2xlbjsKKworCS8qCisJICog ZHVlIHRvIHRoZSBYRlMgbmFtZSBoYXNoaW5nIGFsZ29yaXRobSwgd2UgY2Fu bm90IG9iZnVzY2F0ZQorCSAqIG5hbWVzIHdpdGggNCBjaGFycyBvciBsZXNz LgorCSAqLworCWlmIChuYW1lbGVuIDw9IDQpCisJCXJldHVybiAxOworCisJ aWYgKGlubyA9PSAwKQorCQlyZXR1cm4gMDsKKworCS8qCisJICogZG9uJ3Qg b2JmdXNjYXRlIGxvc3QrZm91bmQgbm9yIGFueSBpbm9kZXMgd2l0aGluIGxv c3QrZm91bmQgd2l0aAorCSAqIHRoZSBpbm9kZSBudW1iZXIKKwkgKi8KKwlp ZiAoY3VyX2lubyA9PSBtcC0+bV9zYi5zYl9yb290aW5vICYmIG5hbWVsZW4g PT0gMTAgJiYKKwkJCW1lbWNtcChuYW1lLCAibG9zdCtmb3VuZCIsIDEwKSA9 PSAwKSB7CisJCW9ycGhhbmFnZV9pbm8gPSBpbm87CisJCXJldHVybiAxOwor CX0KKwlpZiAoY3VyX2lubyAhPSBvcnBoYW5hZ2VfaW5vKQorCQlyZXR1cm4g MDsKKworCXNsZW4gPSBzcHJpbnRmKHMsICIlbGxkIiwgKGxvbmcgbG9uZylp bm8pOworCXJldHVybiAoc2xlbiA9PSBuYW1lbGVuICYmIG1lbWNtcChuYW1l LCBzLCBuYW1lbGVuKSA9PSAwKTsKK30KKworc3RhdGljIHZvaWQKK2dlbmVy YXRlX29iZnVzY2F0ZWRfbmFtZSgKKwl4ZnNfaW5vX3QJCWlubywKKwlpbnQJ CQluYW1lbGVuLAorCXVjaGFyX3QJCQkqbmFtZSkKK3sKKwl4ZnNfZGFoYXNo X3QJCWhhc2g7CisJbmFtZV9lbnRfdAkJKnA7CisJaW50CQkJaTsKKwlpbnQJ CQlkdXA7CisJeGZzX2RhaGFzaF90CQluZXdoYXNoOworCXVjaGFyX3QJCQlu 
ZXduYW1lW25hbWVsZW5dOworCisJaWYgKGlzX3NwZWNpYWxfZGlyZW50KGlu bywgbmFtZWxlbiwgbmFtZSkpCisJCXJldHVybjsKKworCWhhc2ggPSBsaWJ4 ZnNfZGFfaGFzaG5hbWUobmFtZSwgbmFtZWxlbik7CisKKwkvKiBjcmVhdGUg YSByYW5kb20gbmFtZSB3aXRoIHRoZSBzYW1lIGhhc2ggdmFsdWUgKi8KKwor CWRvIHsKKwkJZHVwID0gMDsKKwkJbmV3bmFtZVswXSA9ICcvJzsKKworCQlm b3IgKDs7KSB7CisJCQkvKiBpZiB0aGUgZmlyc3QgY2hhciBpcyBhICIvIiwg cHJlc2VydmUgaXQgKi8KKwkJCWkgPSAobmFtZVswXSA9PSAnLycpOworCisJ CQlmb3IgKG5ld2hhc2ggPSAwOyBpIDwgbmFtZWxlbiAtIDU7IGkrKykgewor CQkJCW5ld25hbWVbaV0gPSByYW5kb21fZmlsZW5hbWVfY2hhcigpOworCQkJ CW5ld2hhc2ggPSBuZXduYW1lW2ldIF4gcm9sMzIobmV3aGFzaCwgNyk7CisJ CQl9CisJCQluZXdoYXNoID0gcm9sMzIobmV3aGFzaCwgMykgXiBoYXNoOwor CQkJaWYgKG5hbWVbMF0gIT0gJy8nIHx8IG5hbWVsZW4gPiA1KSB7CisJCQkJ bmV3bmFtZVtuYW1lbGVuIC0gNV0gPSAobmV3aGFzaCA+PiAyOCkgfAorCQkJ CQkJKHJhbmRvbV9maWxlbmFtZV9jaGFyKCkgJiAweGYwKTsKKwkJCQlpZiAo aXNfaW52YWxpZF9jaGFyKG5ld25hbWVbbmFtZWxlbiAtIDVdKSkKKwkJCQkJ Y29udGludWU7CisJCQl9CisJCQluZXduYW1lW25hbWVsZW4gLSA0XSA9IChu ZXdoYXNoID4+IDIxKSAmIDB4N2Y7CisJCQlpZiAoaXNfaW52YWxpZF9jaGFy KG5ld25hbWVbbmFtZWxlbiAtIDRdKSkKKwkJCQljb250aW51ZTsKKwkJCW5l d25hbWVbbmFtZWxlbiAtIDNdID0gKG5ld2hhc2ggPj4gMTQpICYgMHg3ZjsK KwkJCWlmIChpc19pbnZhbGlkX2NoYXIobmV3bmFtZVtuYW1lbGVuIC0gM10p KQorCQkJCWNvbnRpbnVlOworCQkJbmV3bmFtZVtuYW1lbGVuIC0gMl0gPSAo bmV3aGFzaCA+PiA3KSAmIDB4N2Y7CisJCQlpZiAoaXNfaW52YWxpZF9jaGFy KG5ld25hbWVbbmFtZWxlbiAtIDJdKSkKKwkJCQljb250aW51ZTsKKwkJCW5l d25hbWVbbmFtZWxlbiAtIDFdID0gKChuZXdoYXNoID4+IDApIF4KKwkJCQkJ KG5ld25hbWVbbmFtZWxlbiAtIDVdID4+IDQpKSAmIDB4N2Y7CisJCQlpZiAo aXNfaW52YWxpZF9jaGFyKG5ld25hbWVbbmFtZWxlbiAtIDFdKSkKKwkJCQlj b250aW51ZTsKKwkJCWJyZWFrOworCQl9CisKKwkJQVNTRVJUKGxpYnhmc19k YV9oYXNobmFtZShuZXduYW1lLCBuYW1lbGVuKSA9PSBoYXNoKTsKKworCQlm b3IgKHAgPSBuYW1ldGFibGVbaGFzaCAlIE5BTUVfVEFCTEVfU0laRV07IHA7 IHAgPSBwLT5uZXh0KSB7CisJCQlpZiAocC0+aGFzaCA9PSBoYXNoICYmIHAt Pm5hbWVsZW4gPT0gbmFtZWxlbiAmJgorCQkJCQltZW1jbXAocC0+bmFtZSwg bmV3bmFtZSwgbmFtZWxlbikgPT0gMCl7CisJCQkJZHVwID0gMTsKKwkJCQli 
cmVhazsKKwkJCX0KKwkJfQorCX0gd2hpbGUgKGR1cCk7CisKKwltZW1jcHko bmFtZSwgbmV3bmFtZSwgbmFtZWxlbik7CisKKwlwID0gbWFsbG9jKHNpemVv ZihuYW1lX2VudF90KSArIG5hbWVsZW4pOworCWlmIChwID09IE5VTEwpCisJ CXJldHVybjsKKworCXAtPm5leHQgPSBuYW1ldGFibGVbaGFzaCAlIE5BTUVf VEFCTEVfU0laRV07CisJcC0+aGFzaCA9IGhhc2g7CisJcC0+bmFtZWxlbiA9 IG5hbWVsZW47CisJbWVtY3B5KHAtPm5hbWUsIG5hbWUsIG5hbWVsZW4pOwor CisJbmFtZXRhYmxlW2hhc2ggJSBOQU1FX1RBQkxFX1NJWkVdID0gcDsKK30K Kworc3RhdGljIHZvaWQKK29iZnVzY2F0ZV9zZl9kaXIoCisJeGZzX2Rpbm9k ZV90CQkqZGlwKQoreworCXhmc19kaXIyX3NmX3QJCSpzZnA7CisJeGZzX2Rp cjJfc2ZfZW50cnlfdAkqc2ZlcDsKKwlpbnQJCQlpbm9fZGlyX3NpemU7CisJ aW50CQkJaTsKKworCXNmcCA9ICZkaXAtPmRpX3UuZGlfZGlyMnNmOworCWlu b19kaXJfc2l6ZSA9IGRpcC0+ZGlfY29yZS5kaV9zaXplOworCWlmIChpbm9f ZGlyX3NpemUgPiBYRlNfREZPUktfRFNJWkUoZGlwLCBtcCkpIHsKKwkJaW5v X2Rpcl9zaXplID0gWEZTX0RGT1JLX0RTSVpFKGRpcCwgbXApOworCQlpZiAo c2hvd193YXJuaW5ncykKKwkJCXByaW50X3dhcm5pbmcoImludmFsaWQgc2l6 ZSBmb3IgZGlyIGlub2RlICVsbHUiLAorCQkJCQkobG9uZyBsb25nKWN1cl9p bm8pOworCX0KKworCXNmZXAgPSBYRlNfRElSMl9TRl9GSVJTVEVOVFJZKHNm cCk7CisJZm9yIChpID0gMDsgKGkgPCBzZnAtPmhkci5jb3VudCkgJiYKKwkJ CSgoY2hhciAqKXNmZXAgLSAoY2hhciAqKXNmcCA8IGlub19kaXJfc2l6ZSk7 IGkrKykgeworCisJCS8qCisJCSAqIGZpcnN0IGNoZWNrIGZvciBiYWQgbmFt ZSBsZW5ndGhzLiBJZiB0aGV5IGFyZSBiYWQsIHdlCisJCSAqIGhhdmUgbGlt aXRhdGlvbnMgdG8gaG93IG11Y2ggY2FuIGJlIG9iZnVzY2F0ZWQuCisJCSAq LworCQlpbnQJbmFtZWxlbiA9IHNmZXAtPm5hbWVsZW47CisKKwkJaWYgKG5h bWVsZW4gPT0gMCkgeworCQkJaWYgKHNob3dfd2FybmluZ3MpCisJCQkJcHJp bnRfd2FybmluZygiemVybyBsZW5ndGggZW50cnkgaW4gZGlyIGlub2RlICIK KwkJCQkJCSIlbGx1IiwgKGxvbmcgbG9uZyljdXJfaW5vKTsKKwkJCWlmIChp ICE9IHNmcC0+aGRyLmNvdW50IC0gMSkKKwkJCQlicmVhazsKKwkJCW5hbWVs ZW4gPSBpbm9fZGlyX3NpemUgLSAoKGNoYXIgKikmc2ZlcC0+bmFtZVswXSAt CisJCQkJCSAoY2hhciAqKXNmcCk7CisJCX0gZWxzZSBpZiAoKGNoYXIgKilz ZmVwIC0gKGNoYXIgKilzZnAgKworCQkJCVhGU19ESVIyX1NGX0VOVFNJWkVf QllFTlRSWShzZnAsIHNmZXApID4KKwkJCQlpbm9fZGlyX3NpemUpIHsKKwkJ CWlmIChzaG93X3dhcm5pbmdzKQorCQkJCXByaW50X3dhcm5pbmcoImVudHJ5 
IGxlbmd0aCBpbiBkaXIgaW5vZGUgJWxsdSAiCisJCQkJCSJvdmVyZmxvd3Mg c3BhY2UiLCAobG9uZyBsb25nKWN1cl9pbm8pOworCQkJaWYgKGkgIT0gc2Zw LT5oZHIuY291bnQgLSAxKQorCQkJCWJyZWFrOworCQkJbmFtZWxlbiA9IGlu b19kaXJfc2l6ZSAtICgoY2hhciAqKSZzZmVwLT5uYW1lWzBdIC0KKwkJCQkJ IChjaGFyICopc2ZwKTsKKwkJfQorCisJCWdlbmVyYXRlX29iZnVzY2F0ZWRf bmFtZShYRlNfRElSMl9TRl9HRVRfSU5VTUJFUihzZnAsCisJCQkJWEZTX0RJ UjJfU0ZfSU5VTUJFUlAoc2ZlcCkpLCBuYW1lbGVuLAorCQkJCSZzZmVwLT5u YW1lWzBdKTsKKworCQlzZmVwID0gKHhmc19kaXIyX3NmX2VudHJ5X3QgKiko KGNoYXIgKilzZmVwICsKKwkJCQlYRlNfRElSMl9TRl9FTlRTSVpFX0JZTkFN RShzZnAsIG5hbWVsZW4pKTsKKwl9Cit9CisKK3N0YXRpYyB2b2lkCitvYmZ1 c2NhdGVfc2Zfc3ltbGluaygKKwl4ZnNfZGlub2RlX3QJCSpkaXApCit7CisJ aW50CQkJaTsKKworCWZvciAoaSA9IDA7IGkgPCBkaXAtPmRpX2NvcmUuZGlf c2l6ZTsgaSsrKQorCQlkaXAtPmRpX3UuZGlfc3ltbGlua1tpXSA9IHJhbmRv bSgpICUgMTI3ICsgMTsKK30KKworc3RhdGljIHZvaWQKK29iZnVzY2F0ZV9z Zl9hdHRyKAorCXhmc19kaW5vZGVfdAkJKmRpcCkKK3sKKwkvKgorCSAqIHdp dGggZXh0ZW5kZWQgYXR0cmlidXRlcywgb2JmdXNjYXRlIHRoZSBuYW1lcyBh bmQgemVybyB0aGUgYWN0dWFsCisJICogdmFsdWVzLgorCSAqLworCisJeGZz X2F0dHJfc2hvcnRmb3JtX3QJKmFzZnA7CisJeGZzX2F0dHJfc2ZfZW50cnlf dAkqYXNmZXA7CisJaW50CQkJaW5vX2F0dHJfc2l6ZTsKKwlpbnQJCQlpOwor CisJYXNmcCA9ICh4ZnNfYXR0cl9zaG9ydGZvcm1fdCAqKVhGU19ERk9SS19B UFRSKGRpcCk7CisJaWYgKGFzZnAtPmhkci5jb3VudCA9PSAwKQorCQlyZXR1 cm47CisKKwlpbm9fYXR0cl9zaXplID0gYmUxNl90b19jcHUoYXNmcC0+aGRy LnRvdHNpemUpOworCWlmIChpbm9fYXR0cl9zaXplID4gWEZTX0RGT1JLX0FT SVpFKGRpcCwgbXApKSB7CisJCWlub19hdHRyX3NpemUgPSBYRlNfREZPUktf QVNJWkUoZGlwLCBtcCk7CisJCWlmIChzaG93X3dhcm5pbmdzKQorCQkJcHJp bnRfd2FybmluZygiaW52YWxpZCBhdHRyIHNpemUgaW4gaW5vZGUgJWxsdSIs CisJCQkJCShsb25nIGxvbmcpY3VyX2lubyk7CisJfQorCisJYXNmZXAgPSAm YXNmcC0+bGlzdFswXTsKKwlmb3IgKGkgPSAwOyAoaSA8IGFzZnAtPmhkci5j b3VudCkgJiYKKwkJCSgoY2hhciAqKWFzZmVwIC0gKGNoYXIgKilhc2ZwIDwg aW5vX2F0dHJfc2l6ZSk7IGkrKykgeworCisJCWludAluYW1lbGVuID0gYXNm ZXAtPm5hbWVsZW47CisKKwkJaWYgKG5hbWVsZW4gPT0gMCkgeworCQkJaWYg KHNob3dfd2FybmluZ3MpCisJCQkJcHJpbnRfd2FybmluZygiemVybyBsZW5n 
dGggYXR0ciBlbnRyeSBpbiBpbm9kZSAiCisJCQkJCQkiJWxsdSIsIChsb25n IGxvbmcpY3VyX2lubyk7CisJCQlicmVhazsKKwkJfSBlbHNlIGlmICgoY2hh ciAqKWFzZmVwIC0gKGNoYXIgKilhc2ZwICsKKwkJCQlYRlNfQVRUUl9TRl9F TlRTSVpFKGFzZmVwKSA+IGlub19hdHRyX3NpemUpIHsKKwkJCWlmIChzaG93 X3dhcm5pbmdzKQorCQkJCXByaW50X3dhcm5pbmcoImF0dHIgZW50cnkgbGVu Z3RoIGluIGlub2RlICVsbHUgIgorCQkJCQkib3ZlcmZsb3dzIHNwYWNlIiwg KGxvbmcgbG9uZyljdXJfaW5vKTsKKwkJCWJyZWFrOworCQl9CisKKwkJZ2Vu ZXJhdGVfb2JmdXNjYXRlZF9uYW1lKDAsIGFzZmVwLT5uYW1lbGVuLCAmYXNm ZXAtPm5hbWV2YWxbMF0pOworCQltZW1zZXQoJmFzZmVwLT5uYW1ldmFsW2Fz ZmVwLT5uYW1lbGVuXSwgMCwgYXNmZXAtPnZhbHVlbGVuKTsKKworCQlhc2Zl cCA9ICh4ZnNfYXR0cl9zZl9lbnRyeV90ICopKChjaGFyICopYXNmZXAgKwor CQkJCVhGU19BVFRSX1NGX0VOVFNJWkUoYXNmZXApKTsKKwl9Cit9CisKKy8q CisgKiBkaXJfZGF0YSBzdHJ1Y3R1cmUgaXMgdXNlZCB0byB0cmFjayBtdWx0 aS1mc2Jsb2NrIGRpcjIgYmxvY2tzIGJldHdlZW4gZXh0ZW50CisgKiBwcm9j ZXNzaW5nIGNhbGxzLgorICovCisKK3N0YXRpYyBzdHJ1Y3QgZGlyX2RhdGFf cyB7CisJaW50CQkJZW5kX29mX2RhdGE7CisJaW50CQkJYmxvY2tfaW5kZXg7 CisJaW50CQkJb2Zmc2V0X3RvX2VudHJ5OworCWludAkJCWJhZF9ibG9jazsK K30gZGlyX2RhdGE7CisKK3N0YXRpYyB2b2lkCitvYmZ1c2NhdGVfZGlyX2Rh dGFfYmxvY2tzKAorCWNoYXIJCQkqYmxvY2ssCisJeGZzX2RmaWxvZmZfdAkJ b2Zmc2V0LAorCXhmc19kZmlsYmxrc190CQljb3VudCwKKwlpbnQJCQlpc19i bG9ja19mb3JtYXQpCit7CisJLyoKKwkgKiB3ZSBoYXZlIHRvIHJlbHkgb24g dGhlIGZpbGVvZmZzZXQgYW5kIHNpZ25hdHVyZSBvZiB0aGUgYmxvY2sgdG8K KwkgKiBoYW5kbGUgaXQncyBjb250ZW50cy4gSWYgaXQncyBpbnZhbGlkLCBs ZWF2ZSBpdCBhbG9uZS4KKwkgKiBmb3IgbXVsdGktZnNibG9jayBkaXIgYmxv Y2tzLCBpZiBhIG5hbWUgY3Jvc3NlcyBhbiBleHRlbnQgYm91bmRhcnksCisJ ICogaWdub3JlIGl0IGFuZCBjb250aW51ZS4KKwkgKi8KKwlpbnQJCQljOwor CWludAkJCWRpcl9vZmZzZXQ7CisJY2hhcgkJCSpwdHI7CisJY2hhcgkJCSpl bmRwdHI7CisKKwlpZiAoaXNfYmxvY2tfZm9ybWF0ICYmIGNvdW50ICE9IG1w LT5tX2RpcmJsa2ZzYnMpCisJCXJldHVybjsgLyogdG9vIGNvbXBsZXggdG8g aGFuZGxlIHRoaXMgcmFyZSBjYXNlICovCisKKwlmb3IgKGMgPSAwLCBlbmRw dHIgPSBibG9jazsgYyA8IGNvdW50OyBjKyspIHsKKworCQlpZiAoZGlyX2Rh dGEuYmxvY2tfaW5kZXggPT0gMCkgeworCQkJaW50CQl3YW50bWFnaWM7CisK 
KwkJCWlmIChvZmZzZXQgJSBtcC0+bV9kaXJibGtmc2JzICE9IDApCisJCQkJ cmV0dXJuOwkvKiBjb3JydXB0ZWQsIGxlYXZlIGl0IGFsb25lICovCisKKwkJ CWRpcl9kYXRhLmJhZF9ibG9jayA9IDA7CisKKwkJCWlmIChpc19ibG9ja19m b3JtYXQpIHsKKwkJCQl4ZnNfZGlyMl9sZWFmX2VudHJ5X3QJKmJscDsKKwkJ CQl4ZnNfZGlyMl9ibG9ja190YWlsX3QJKmJ0cDsKKworCQkJCWJ0cCA9IFhG U19ESVIyX0JMT0NLX1RBSUxfUChtcCwKKwkJCQkJCSh4ZnNfZGlyMl9ibG9j a190ICopYmxvY2spOworCQkJCWJscCA9IFhGU19ESVIyX0JMT0NLX0xFQUZf UChidHApOworCQkJCWlmICgoY2hhciAqKWJscCA+IChjaGFyICopYnRwKQor CQkJCQlibHAgPSAoeGZzX2RpcjJfbGVhZl9lbnRyeV90ICopYnRwOworCisJ CQkJZGlyX2RhdGEuZW5kX29mX2RhdGEgPSAoY2hhciAqKWJscCAtIGJsb2Nr OworCQkJCXdhbnRtYWdpYyA9IFhGU19ESVIyX0JMT0NLX01BR0lDOworCQkJ fSBlbHNlIHsgLyogbGVhZi9ub2RlIGZvcm1hdCAqLworCQkJCWRpcl9kYXRh LmVuZF9vZl9kYXRhID0gbXAtPm1fZGlyYmxrZnNicyA8PAorCQkJCQkJbXAt Pm1fc2Iuc2JfYmxvY2tsb2c7CisJCQkJd2FudG1hZ2ljID0gWEZTX0RJUjJf REFUQV9NQUdJQzsKKwkJCX0KKwkJCWRpcl9kYXRhLm9mZnNldF90b19lbnRy eSA9IG9mZnNldG9mKHhmc19kaXIyX2RhdGFfdCwgdSk7CisKKwkJCWlmIChi ZTMyX3RvX2NwdSgoKHhmc19kaXIyX2RhdGFfaGRyX3QqKWJsb2NrKS0+bWFn aWMpICE9CisJCQkJCXdhbnRtYWdpYykgeworCQkJCWlmIChzaG93X3dhcm5p bmdzKQorCQkJCQlwcmludF93YXJuaW5nKCJpbnZhbGlkIG1hZ2ljIGluIGRp ciAiCisJCQkJCQkiaW5vZGUgJWxsdSBibG9jayAlbGQiLAorCQkJCQkJKGxv bmcgbG9uZyljdXJfaW5vLAorCQkJCQkJKGxvbmcpb2Zmc2V0KTsKKwkJCQlk aXJfZGF0YS5iYWRfYmxvY2sgPSAxOworCQkJfQorCQl9CisJCWRpcl9kYXRh LmJsb2NrX2luZGV4Kys7CisJCWlmIChkaXJfZGF0YS5ibG9ja19pbmRleCA9 PSBtcC0+bV9kaXJibGtmc2JzKQorCQkJZGlyX2RhdGEuYmxvY2tfaW5kZXgg PSAwOworCisJCWlmIChkaXJfZGF0YS5iYWRfYmxvY2spCisJCQljb250aW51 ZTsKKworCQlkaXJfb2Zmc2V0ID0gKGRpcl9kYXRhLmJsb2NrX2luZGV4IDw8 IG1wLT5tX3NiLnNiX2Jsb2NrbG9nKSArCisJCQkJZGlyX2RhdGEub2Zmc2V0 X3RvX2VudHJ5OworCisJCXB0ciA9IGVuZHB0ciArIGRpcl9kYXRhLm9mZnNl dF90b19lbnRyeTsKKwkJZW5kcHRyICs9IG1wLT5tX3NiLnNiX2Jsb2Nrc2l6 ZTsKKworCQl3aGlsZSAocHRyIDwgZW5kcHRyICYmIGRpcl9vZmZzZXQgPCBk aXJfZGF0YS5lbmRfb2ZfZGF0YSkgeworCQkJeGZzX2RpcjJfZGF0YV9lbnRy eV90CSpkZXA7CisJCQl4ZnNfZGlyMl9kYXRhX3VudXNlZF90CSpkdXA7CisJ 
CQlpbnQJCQlsZW5ndGg7CisKKwkJCWR1cCA9ICh4ZnNfZGlyMl9kYXRhX3Vu dXNlZF90ICopcHRyOworCisJCQlpZiAoYmUxNl90b19jcHUoZHVwLT5mcmVl dGFnKSA9PSBYRlNfRElSMl9EQVRBX0ZSRUVfVEFHKSB7CisJCQkJaW50CWxl bmd0aCA9IGJlMTZfdG9fY3B1KGR1cC0+bGVuZ3RoKTsKKwkJCQlpZiAoZGly X29mZnNldCArIGxlbmd0aCA+IGRpcl9kYXRhLmVuZF9vZl9kYXRhIHx8CisJ CQkJCQlsZW5ndGggPT0gMCB8fCAobGVuZ3RoICYKKwkJCQkJCSAoWEZTX0RJ UjJfREFUQV9BTElHTiAtIDEpKSkgeworCQkJCQlpZiAoc2hvd193YXJuaW5n cykKKwkJCQkJCXByaW50X3dhcm5pbmcoImludmFsaWQgbGVuZ3RoICIKKwkJ CQkJCQkiZm9yIGRpciBmcmVlIHNwYWNlIGluICIKKwkJCQkJCQkiaW5vZGUg JWxsdSIsCisJCQkJCQkJKGxvbmcgbG9uZyljdXJfaW5vKTsKKwkJCQkJZGly X2RhdGEuYmFkX2Jsb2NrID0gMTsKKwkJCQkJYnJlYWs7CisJCQkJfQorCQkJ CWlmIChiZTE2X3RvX2NwdSgqWEZTX0RJUjJfREFUQV9VTlVTRURfVEFHX1Ao ZHVwKSkgIT0KKwkJCQkJCWRpcl9vZmZzZXQpIHsKKwkJCQkJZGlyX2RhdGEu YmFkX2Jsb2NrID0gMTsKKwkJCQkJYnJlYWs7CisJCQkJfQorCQkJCWRpcl9v ZmZzZXQgKz0gbGVuZ3RoOworCQkJCXB0ciArPSBsZW5ndGg7CisJCQkJaWYg KGRpcl9vZmZzZXQgPj0gZGlyX2RhdGEuZW5kX29mX2RhdGEgfHwKKwkJCQkJ CXB0ciA+PSBlbmRwdHIpCisJCQkJCWJyZWFrOworCQkJfQorCisJCQlkZXAg PSAoeGZzX2RpcjJfZGF0YV9lbnRyeV90ICopcHRyOworCQkJbGVuZ3RoID0g WEZTX0RJUjJfREFUQV9FTlRTSVpFKGRlcC0+bmFtZWxlbik7CisKKwkJCWlm IChkaXJfb2Zmc2V0ICsgbGVuZ3RoID4gZGlyX2RhdGEuZW5kX29mX2RhdGEg fHwKKwkJCQkJcHRyICsgbGVuZ3RoID4gZW5kcHRyKSB7CisJCQkJaWYgKHNo b3dfd2FybmluZ3MpCisJCQkJCXByaW50X3dhcm5pbmcoImludmFsaWQgbGVu Z3RoIGZvciAiCisJCQkJCQkiZGlyIGVudHJ5IG5hbWUgaW4gaW5vZGUgJWxs dSIsCisJCQkJCQkobG9uZyBsb25nKWN1cl9pbm8pOworCQkJCWJyZWFrOwor CQkJfQorCQkJaWYgKGJlMTZfdG9fY3B1KCpYRlNfRElSMl9EQVRBX0VOVFJZ X1RBR19QKGRlcCkpICE9CisJCQkJCWRpcl9vZmZzZXQpIHsKKwkJCQlkaXJf ZGF0YS5iYWRfYmxvY2sgPSAxOworCQkJCWJyZWFrOworCQkJfQorCQkJZ2Vu ZXJhdGVfb2JmdXNjYXRlZF9uYW1lKGJlNjRfdG9fY3B1KGRlcC0+aW51bWJl ciksCisJCQkJCWRlcC0+bmFtZWxlbiwgJmRlcC0+bmFtZVswXSk7CisJCQlk aXJfb2Zmc2V0ICs9IGxlbmd0aDsKKwkJCXB0ciArPSBsZW5ndGg7CisJCX0K KwkJZGlyX2RhdGEub2Zmc2V0X3RvX2VudHJ5ID0gZGlyX29mZnNldCAmCisJ CQkJCQkobXAtPm1fc2Iuc2JfYmxvY2tzaXplIC0gMSk7CisJfQorfQorCitz 
dGF0aWMgdm9pZAorb2JmdXNjYXRlX3N5bWxpbmtfYmxvY2tzKAorCWNoYXIJ CQkqYmxvY2ssCisJeGZzX2RmaWxibGtzX3QJCWNvdW50KQoreworCWludCAJ CQlpOworCisJY291bnQgPDw9IG1wLT5tX3NiLnNiX2Jsb2NrbG9nOworCWZv ciAoaSA9IDA7IGkgPCBjb3VudDsgaSsrKQorCQlibG9ja1tpXSA9IHJhbmRv bSgpICUgMTI3ICsgMTsKK30KKworI2RlZmluZSBNQVhfUkVNT1RFX1ZBTFMJ CTQwOTUKKworc3RhdGljIHN0cnVjdCBhdHRyX2RhdGFfcyB7CisJaW50CQkJ cmVtb3RlX3ZhbF9jb3VudDsKKwl4ZnNfZGFibGtfdAkJcmVtb3RlX3ZhbHNb TUFYX1JFTU9URV9WQUxTXTsKK30gYXR0cl9kYXRhOworCitzdGF0aWMgaW5s aW5lIHZvaWQKK2FkZF9yZW1vdGVfdmFscygKKwl4ZnNfZGFibGtfdCAJCWJs b2NraWR4LAorCWludAkJCWxlbmd0aCkKK3sKKwl3aGlsZSAobGVuZ3RoID4g MCAmJiBhdHRyX2RhdGEucmVtb3RlX3ZhbF9jb3VudCA8IE1BWF9SRU1PVEVf VkFMUykgeworCQlhdHRyX2RhdGEucmVtb3RlX3ZhbHNbYXR0cl9kYXRhLnJl bW90ZV92YWxfY291bnRdID0gYmxvY2tpZHg7CisJCWF0dHJfZGF0YS5yZW1v dGVfdmFsX2NvdW50Kys7CisJCWJsb2NraWR4Kys7CisJCWxlbmd0aCAtPSBY RlNfTEJTSVpFKG1wKTsKKwl9Cit9CisKK3N0YXRpYyB2b2lkCitvYmZ1c2Nh dGVfYXR0cl9ibG9ja3MoCisJY2hhcgkJCSpibG9jaywKKwl4ZnNfZGZpbG9m Zl90CQlvZmZzZXQsCisJeGZzX2RmaWxibGtzX3QJCWNvdW50KQoreworCXhm c19hdHRyX2xlYWZibG9ja190CSpsZWFmOworCWludAkJCWM7CisJaW50CQkJ aTsKKwlpbnQJCQluZW50cmllczsKKwl4ZnNfYXR0cl9sZWFmX2VudHJ5X3Qg CSplbnRyeTsKKwl4ZnNfYXR0cl9sZWFmX25hbWVfbG9jYWxfdCAqbG9jYWw7 CisJeGZzX2F0dHJfbGVhZl9uYW1lX3JlbW90ZV90ICpyZW1vdGU7CisKKwlm b3IgKGMgPSAwOyBjIDwgY291bnQ7IGMrKywgb2Zmc2V0KyssIGJsb2NrICs9 IFhGU19MQlNJWkUobXApKSB7CisKKwkJbGVhZiA9ICh4ZnNfYXR0cl9sZWFm YmxvY2tfdCAqKWJsb2NrOworCisJCWlmIChiZTE2X3RvX2NwdShsZWFmLT5o ZHIuaW5mby5tYWdpYykgIT0gWEZTX0FUVFJfTEVBRl9NQUdJQykgeworCQkJ Zm9yIChpID0gMDsgaSA8IGF0dHJfZGF0YS5yZW1vdGVfdmFsX2NvdW50OyBp KyspIHsKKwkJCQlpZiAoYXR0cl9kYXRhLnJlbW90ZV92YWxzW2ldID09IG9m ZnNldCkKKwkJCQkJbWVtc2V0KGJsb2NrLCAwLCBYRlNfTEJTSVpFKG1wKSk7 CisJCQl9CisJCQljb250aW51ZTsKKwkJfQorCisJCW5lbnRyaWVzID0gYmUx Nl90b19jcHUobGVhZi0+aGRyLmNvdW50KTsKKwkJaWYgKG5lbnRyaWVzICog c2l6ZW9mKHhmc19hdHRyX2xlYWZfZW50cnlfdCkgKworCQkJCXNpemVvZih4 ZnNfYXR0cl9sZWFmX2hkcl90KSA+IFhGU19MQlNJWkUobXApKSB7CisJCQlp 
ZiAoc2hvd193YXJuaW5ncykKKwkJCQlwcmludF93YXJuaW5nKCJpbnZhbGlk IGF0dHIgY291bnQgaW4gaW5vZGUgJWxsdSIsCisJCQkJCQkobG9uZyBsb25n KWN1cl9pbm8pOworCQkJY29udGludWU7CisJCX0KKworCQlmb3IgKGkgPSAw LCBlbnRyeSA9ICZsZWFmLT5lbnRyaWVzWzBdOyBpIDwgbmVudHJpZXM7CisJ CQkJaSsrLCBlbnRyeSsrKSB7CisJCQlpZiAoYmUxNl90b19jcHUoZW50cnkt Pm5hbWVpZHgpID4gWEZTX0xCU0laRShtcCkpIHsKKwkJCQlpZiAoc2hvd193 YXJuaW5ncykKKwkJCQkJcHJpbnRfd2FybmluZygiaW52YWxpZCBhdHRyIG5h bWVpZHggIgorCQkJCQkJCSJpbiBpbm9kZSAlbGx1IiwKKwkJCQkJCQkobG9u ZyBsb25nKWN1cl9pbm8pOworCQkJCWJyZWFrOworCQkJfQorCQkJaWYgKGVu dHJ5LT5mbGFncyAmIFhGU19BVFRSX0xPQ0FMKSB7CisJCQkJbG9jYWwgPSBY RlNfQVRUUl9MRUFGX05BTUVfTE9DQUwobGVhZiwgaSk7CisJCQkJaWYgKGxv Y2FsLT5uYW1lbGVuID09IDApIHsKKwkJCQkJaWYgKHNob3dfd2FybmluZ3Mp CisJCQkJCQlwcmludF93YXJuaW5nKCJ6ZXJvIGxlbmd0aCBmb3IgIgorCQkJ CQkJCSJhdHRyIG5hbWUgaW4gaW5vZGUgJWxsdSIsCisJCQkJCQkJKGxvbmcg bG9uZyljdXJfaW5vKTsKKwkJCQkJYnJlYWs7CisJCQkJfQorCQkJCWdlbmVy YXRlX29iZnVzY2F0ZWRfbmFtZSgwLCBsb2NhbC0+bmFtZWxlbiwKKwkJCQkJ JmxvY2FsLT5uYW1ldmFsWzBdKTsKKwkJCQltZW1zZXQoJmxvY2FsLT5uYW1l dmFsW2xvY2FsLT5uYW1lbGVuXSwgMCwKKwkJCQkJYmUxNl90b19jcHUobG9j YWwtPnZhbHVlbGVuKSk7CisJCQl9IGVsc2UgeworCQkJCXJlbW90ZSA9IFhG U19BVFRSX0xFQUZfTkFNRV9SRU1PVEUobGVhZiwgaSk7CisJCQkJaWYgKHJl bW90ZS0+bmFtZWxlbiA9PSAwIHx8CisJCQkJCQlyZW1vdGUtPnZhbHVlYmxr ID09IDApIHsKKwkJCQkJaWYgKHNob3dfd2FybmluZ3MpCisJCQkJCQlwcmlu dF93YXJuaW5nKCJpbnZhbGlkIGF0dHIgIgorCQkJCQkJCSJlbnRyeSBpbiBp bm9kZSAlbGx1IiwKKwkJCQkJCQkobG9uZyBsb25nKWN1cl9pbm8pOworCQkJ CQlicmVhazsKKwkJCQl9CisJCQkJZ2VuZXJhdGVfb2JmdXNjYXRlZF9uYW1l KDAsIHJlbW90ZS0+bmFtZWxlbiwKKwkJCQkJJnJlbW90ZS0+bmFtZVswXSk7 CisJCQkJYWRkX3JlbW90ZV92YWxzKGJlMzJfdG9fY3B1KHJlbW90ZS0+dmFs dWVibGspLAorCQkJCQliZTMyX3RvX2NwdShyZW1vdGUtPnZhbHVlbGVuKSk7 CisJCQl9CisJCX0KKwl9Cit9CisKKy8qIGlub2RlIGNvcHkgcm91dGluZXMg Ki8KKworc3RhdGljIGludAorcHJvY2Vzc19ibWJ0X3JlY2xpc3QoCisJeGZz X2JtYnRfcmVjX3QgCQkqcnAsCisJaW50IAkJCW51bXJlY3MsCisJdHlwbm1f dAkJCWJ0eXBlKQoreworCWludAkJCWk7CisJeGZzX2RmaWxvZmZfdAkJbzsK 
Kwl4ZnNfZGZzYm5vX3QJCXM7CisJeGZzX2RmaWxibGtzX3QJCWM7CisJaW50 CQkJZjsKKwl4ZnNfZGZpbG9mZl90CQlsYXN0OworCisJaWYgKGJ0eXBlID09 IFRZUF9EQVRBKQorCQlyZXR1cm4gMTsKKworCWNvbnZlcnRfZXh0ZW50KCZy cFtudW1yZWNzIC0gMV0sICZvLCAmcywgJmMsICZmKTsKKwlsYXN0ID0gbyAr IGM7CisKKwlmb3IgKGkgPSAwOyBpIDwgbnVtcmVjczsgaSsrLCBycCsrKSB7 CisJCWNvbnZlcnRfZXh0ZW50KHJwLCAmbywgJnMsICZjLCAmZik7CisKKwkJ cHVzaF9jdXIoKTsKKwkJc2V0X2N1cigmdHlwdGFiW2J0eXBlXSwgWEZTX0ZT Ql9UT19EQUREUihtcCwgcyksIGMgKiBibGtiYiwKKwkJCQlEQl9SSU5HX0lH TiwgTlVMTCk7CisJCWlmIChpb2N1cl90b3AtPmRhdGEgPT0gTlVMTCkgewor CQkJcHJpbnRfd2FybmluZygiY2Fubm90IHJlYWQgJXMgYmxvY2sgJXUvJXUi LAorCQkJCQl0eXB0YWJbYnR5cGVdLm5hbWUsCisJCQkJCVhGU19GU0JfVE9f QUdOTyhtcCwgcyksCisJCQkJCVhGU19GU0JfVE9fQUdCTk8obXAsIHMpKTsK KwkJCWlmIChzdG9wX29uX3JlYWRfZXJyb3IpCisJCQkJcmV0dXJuIDA7CisJ CX0gZWxzZSB7CisJCQlpZiAoIWRvbnRfb2JmdXNjYXRlKQorCQkJICAgIHN3 aXRjaCAoYnR5cGUpIHsKKwkJCQljYXNlIFRZUF9ESVIyOgorCQkJCQlpZiAo byA8IG1wLT5tX2RpcmxlYWZibGspCisJCQkJCQlvYmZ1c2NhdGVfZGlyX2Rh dGFfYmxvY2tzKAorCQkJCQkJCWlvY3VyX3RvcC0+ZGF0YSwgbywgYywKKwkJ CQkJCQlsYXN0ID09IG1wLT5tX2RpcmJsa2ZzYnMpOworCQkJCQlicmVhazsK KworCQkJCWNhc2UgVFlQX1NZTUxJTks6CisJCQkJCW9iZnVzY2F0ZV9zeW1s aW5rX2Jsb2NrcygKKwkJCQkJCWlvY3VyX3RvcC0+ZGF0YSwgYyk7CisJCQkJ CWJyZWFrOworCisJCQkJY2FzZSBUWVBfQVRUUjoKKwkJCQkJb2JmdXNjYXRl X2F0dHJfYmxvY2tzKGlvY3VyX3RvcC0+ZGF0YSwKKwkJCQkJCW8sIGMpOwor CQkJCQlicmVhazsKKworCQkJCWRlZmF1bHQ6IDsKKwkJCSAgICB9CisJCQlp ZiAoIXdyaXRlX2J1Zihpb2N1cl90b3ApKQorCQkJCXJldHVybiAwOworCQl9 CisJCXBvcF9jdXIoKTsKKwl9CisKKwlyZXR1cm4gMTsKK30KKworc3RhdGlj IGludAorc2NhbmZ1bmNfYm1hcCgKKwl4ZnNfYnRyZWVfaGRyX3QJCSpidGhk ciwKKwl4ZnNfYWdudW1iZXJfdAkJYWdubywKKwl4ZnNfYWdibG9ja190CQlh Z2JubywKKwlpbnQJCQlsZXZlbCwKKwl0eXBubV90CQkJYnR5cGUsCisJdm9p ZAkJCSphcmcpCS8qIHB0ciB0byBpdHlwZSAqLworeworCWludAkJCWk7CisJ eGZzX2JtYnRfcHRyX3QJCSpwcDsKKwl4ZnNfYm1idF9yZWNfdAkJKnJwOwor CWludAkJCW5yZWNzOworCisJbnJlY3MgPSBiZTE2X3RvX2NwdShidGhkci0+ YmJfbnVtcmVjcyk7CisKKwlpZiAobGV2ZWwgPT0gMCkgeworCQlpZiAobnJl 
Y3MgPiBtcC0+bV9ibWFwX2RteHJbMF0pIHsKKwkJCWlmIChzaG93X3dhcm5p bmdzKQorCQkJCXByaW50X3dhcm5pbmcoImludmFsaWQgbnVtcmVjcyAoJXUp IGluICVzICIKKwkJCQkJImJsb2NrICV1LyV1IiwgbnJlY3MsCisJCQkJCXR5 cHRhYltidHlwZV0ubmFtZSwgYWdubywgYWdibm8pOworCQkJcmV0dXJuIDE7 CisJCX0KKwkJcnAgPSBYRlNfQlRSRUVfUkVDX0FERFIobXAtPm1fc2Iuc3Fi X2Jsb2Nrc2l6ZSwgeGZzX2JtYnQsIGJ0aGRyLAorCQkJCQkxLCBtcC0+bV9i bWFwX2RteHJbMF0pOworCisJCXJldHVybiBwcm9jZXNzX2JtYnRfcmVjbGlz dChycCwgbnJlY3MsICoodHlwbm1fdCopYXJnKTsKKwl9CisKKwlpZiAobnJl Y3MgPiBtcC0+bV9ibWFwX2RteHJbMV0pIHsKKwkJaWYgKHNob3dfd2Fybmlu Z3MpCisJCQlwcmludF93YXJuaW5nKCJpbnZhbGlkIG51bXJlY3MgKCV1KSBp biAlcyBibG9jayAldS8ldSIsCisJCQkJCW5yZWNzLCB0eXB0YWJbYnR5cGVd Lm5hbWUsIGFnbm8sIGFnYm5vKTsKKwkJcmV0dXJuIDE7CisJfQorCXBwID0g WEZTX0JUUkVFX1BUUl9BRERSKG1wLT5tX3NiLnNiX2Jsb2Nrc2l6ZSwgeGZz X2JtYnQsIGJ0aGRyLCAxLAorCQkJCW1wLT5tX2JtYXBfZG14clsxXSk7CisJ Zm9yIChpID0gMDsgaSA8IG5yZWNzOyBpKyspIHsKKwkJeGZzX2FnbnVtYmVy X3QJYWc7CisJCXhmc19hZ2Jsb2NrX3QJYm5vOworCisJCWFnID0gWEZTX0ZT Ql9UT19BR05PKG1wLCBiZTY0X3RvX2NwdShwcFtpXSkpOworCQlibm8gPSBY RlNfRlNCX1RPX0FHQk5PKG1wLCBiZTY0X3RvX2NwdShwcFtpXSkpOworCisJ CWlmIChibm8gPT0gMCB8fCBibm8gPiBtcC0+bV9zYi5zYl9hZ2Jsb2NrcyB8 fAorCQkJCWFnID4gbXAtPm1fc2Iuc2JfYWdjb3VudCkgeworCQkJaWYgKHNo b3dfd2FybmluZ3MpCisJCQkJcHJpbnRfd2FybmluZygiaW52YWxpZCBibG9j ayBudW1iZXIgKCV1LyV1KSAiCisJCQkJCSJpbiAlcyBibG9jayAldS8ldSIs IGFnLCBibm8sCisJCQkJCXR5cHRhYltidHlwZV0ubmFtZSwgYWdubywgYWdi bm8pOworCQkJY29udGludWU7CisJCX0KKworCQlpZiAoIXNjYW5fYnRyZWUo YWcsIGJubywgbGV2ZWwsIGJ0eXBlLCBhcmcsIHNjYW5mdW5jX2JtYXApKQor CQkJcmV0dXJuIDA7CisJfQorCXJldHVybiAxOworfQorCitzdGF0aWMgaW50 Citwcm9jZXNzX2J0aW5vZGUoCisJeGZzX2Rpbm9kZV90IAkJKmRpcCwKKwl0 eXBubV90CQkJaXR5cGUpCit7CisJeGZzX2JtZHJfYmxvY2tfdAkqZGliOwor CWludAkJCWk7CisJeGZzX2JtYnRfcHRyX3QJCSpwcDsKKwl4ZnNfYm1idF9y ZWNfdAkJKnJwOworCWludAkJCWxldmVsOworCWludAkJCW5yZWNzOworCWlu dAkJCW1heHJlY3M7CisJaW50CQkJd2hpY2hmb3JrOworCXR5cG5tX3QJCQli dHlwZTsKKworCXdoaWNoZm9yayA9IChpdHlwZSA9PSBUWVBfQVRUUikgPyBY 
RlNfQVRUUl9GT1JLIDogWEZTX0RBVEFfRk9SSzsKKwlidHlwZSA9IChpdHlw ZSA9PSBUWVBfQVRUUikgPyBUWVBfQk1BUEJUQSA6IFRZUF9CTUFQQlREOwor CisJZGliID0gKHhmc19ibWRyX2Jsb2NrX3QgKilYRlNfREZPUktfUFRSKGRp cCwgd2hpY2hmb3JrKTsKKwlsZXZlbCA9IGJlMTZfdG9fY3B1KGRpYi0+YmJf bGV2ZWwpOworCW5yZWNzID0gYmUxNl90b19jcHUoZGliLT5iYl9udW1yZWNz KTsKKworCWlmIChsZXZlbCA+IFhGU19CTV9NQVhMRVZFTFMobXAsIHdoaWNo Zm9yaykpIHsKKwkJaWYgKHNob3dfd2FybmluZ3MpCisJCQlwcmludF93YXJu aW5nKCJpbnZhbGlkIGxldmVsICgldSkgaW4gaW5vZGUgJWxsZCAlcyAiCisJ CQkJCSJyb290IiwgbGV2ZWwsIChsb25nIGxvbmcpY3VyX2lubywKKwkJCQkJ dHlwdGFiW2J0eXBlXS5uYW1lKTsKKwkJcmV0dXJuIDE7CisJfQorCisJaWYg KGxldmVsID09IDApIHsKKwkJcnAgPSBYRlNfQlRSRUVfUkVDX0FERFIoWEZT X0RGT1JLX1NJWkUoZGlwLCBtcCwgd2hpY2hmb3JrKSwKKwkJCQl4ZnNfYm1k ciwgZGliLCAxLCBYRlNfQlRSRUVfQkxPQ0tfTUFYUkVDUygKKwkJCQkJWEZT X0RGT1JLX1NJWkUoZGlwLCBtcCwgd2hpY2hmb3JrKSwKKwkJCQkJeGZzX2Jt ZHIsIDEpKTsKKworCQlyZXR1cm4gcHJvY2Vzc19ibWJ0X3JlY2xpc3QocnAs IG5yZWNzLCBpdHlwZSk7CisJfQorCisJbWF4cmVjcyA9IFhGU19CVFJFRV9C TE9DS19NQVhSRUNTKFhGU19ERk9SS19TSVpFKGRpcCwgbXAsIHdoaWNoZm9y ayksCisJCQl4ZnNfYm1kciwgMCk7CisJaWYgKG5yZWNzID4gbWF4cmVjcykg eworCQlpZiAoc2hvd193YXJuaW5ncykKKwkJCXByaW50X3dhcm5pbmcoImlu dmFsaWQgbnVtcmVjcyAoJXUpIGluIGlub2RlICVsbGQgJXMgIgorCQkJCQki cm9vdCIsIG5yZWNzLCAobG9uZyBsb25nKWN1cl9pbm8sCisJCQkJCXR5cHRh YltidHlwZV0ubmFtZSk7CisJCXJldHVybiAxOworCX0KKworCXBwID0gWEZT X0JUUkVFX1BUUl9BRERSKFhGU19ERk9SS19TSVpFKGRpcCwgbXAsIHdoaWNo Zm9yayksIHhmc19ibWRyLAorCQkJZGliLCAxLCBtYXhyZWNzKTsKKwlmb3Ig KGkgPSAwOyBpIDwgbnJlY3M7IGkrKykgeworCQl4ZnNfYWdudW1iZXJfdAlh ZzsKKwkJeGZzX2FnYmxvY2tfdAlibm87CisKKwkJYWcgPSBYRlNfRlNCX1RP X0FHTk8obXAsIGJlNjRfdG9fY3B1KHBwW2ldKSk7CisJCWJubyA9IFhGU19G U0JfVE9fQUdCTk8obXAsIGJlNjRfdG9fY3B1KHBwW2ldKSk7CisKKwkJaWYg KGJubyA9PSAwIHx8IGJubyA+IG1wLT5tX3NiLnNiX2FnYmxvY2tzIHx8CisJ CQkJYWcgPiBtcC0+bV9zYi5zYl9hZ2NvdW50KSB7CisJCQlpZiAoc2hvd193 YXJuaW5ncykKKwkJCQlwcmludF93YXJuaW5nKCJpbnZhbGlkIGJsb2NrIG51 bWJlciAoJXUvJXUpICIKKwkJCQkJCSJpbiBpbm9kZSAlbGx1ICVzIHJvb3Qi 
LCBhZywKKwkJCQkJCWJubywgKGxvbmcgbG9uZyljdXJfaW5vLAorCQkJCQkJ dHlwdGFiW2J0eXBlXS5uYW1lKTsKKwkJCWNvbnRpbnVlOworCQl9CisKKwkJ aWYgKCFzY2FuX2J0cmVlKGFnLCBibm8sIGxldmVsLCBidHlwZSwgJml0eXBl LCBzY2FuZnVuY19ibWFwKSkKKwkJCXJldHVybiAwOworCX0KKwlyZXR1cm4g MTsKK30KKworc3RhdGljIGludAorcHJvY2Vzc19leGlub2RlKAorCXhmc19k aW5vZGVfdCAJCSpkaXAsCisJdHlwbm1fdAkJCWl0eXBlKQoreworCWludAkJ CXdoaWNoZm9yazsKKworCXdoaWNoZm9yayA9IChpdHlwZSA9PSBUWVBfQVRU UikgPyBYRlNfQVRUUl9GT1JLIDogWEZTX0RBVEFfRk9SSzsKKworCXJldHVy biBwcm9jZXNzX2JtYnRfcmVjbGlzdCgKKwkJCSh4ZnNfYm1idF9yZWNfdCAq KVhGU19ERk9SS19QVFIoZGlwLCB3aGljaGZvcmspLAorCQkJWEZTX0RGT1JL X05FWFRFTlRTX0hPU1QoZGlwLCB3aGljaGZvcmspLCBpdHlwZSk7Cit9CisK K3N0YXRpYyBpbnQKK3Byb2Nlc3NfaW5vZGVfZGF0YSgKKwl4ZnNfZGlub2Rl X3QJCSpkaXAsCisJdHlwbm1fdAkJCWl0eXBlKQoreworCXN3aXRjaCAoZGlw LT5kaV9jb3JlLmRpX2Zvcm1hdCkgeworCQljYXNlIFhGU19ESU5PREVfRk1U X0xPQ0FMOgorCQkJaWYgKCFkb250X29iZnVzY2F0ZSkKKwkJCQlzd2l0Y2gg KGl0eXBlKSB7CisJCQkJCWNhc2UgVFlQX0RJUjI6CisJCQkJCQlvYmZ1c2Nh dGVfc2ZfZGlyKGRpcCk7CisJCQkJCQlicmVhazsKKworCQkJCQljYXNlIFRZ UF9TWU1MSU5LOgorCQkJCQkJb2JmdXNjYXRlX3NmX3N5bWxpbmsoZGlwKTsK KwkJCQkJCWJyZWFrOworCisJCQkJCWRlZmF1bHQ6IDsKKwkJCQl9CisJCQli cmVhazsKKworCQljYXNlIFhGU19ESU5PREVfRk1UX0VYVEVOVFM6CisJCQly ZXR1cm4gcHJvY2Vzc19leGlub2RlKGRpcCwgaXR5cGUpOworCisJCWNhc2Ug WEZTX0RJTk9ERV9GTVRfQlRSRUU6CisJCQlyZXR1cm4gcHJvY2Vzc19idGlu b2RlKGRpcCwgaXR5cGUpOworCX0KKwlyZXR1cm4gMTsKK30KKworc3RhdGlj IGludAorcHJvY2Vzc19pbm9kZSgKKwl4ZnNfYWdudW1iZXJfdAkJYWdubywK Kwl4ZnNfYWdpbm9fdCAJCWFnaW5vLAorCXhmc19kaW5vZGVfdCAJCSpkaXAp Cit7CisJeGZzX2Rpbm9kZV9jb3JlX3QgICAgICAgb2RpYzsKKwlpbnQJCQlz dWNjZXNzOworCisJLyogY29udmVydCB0aGUgY29yZSAqLworCW1lbWNweSgm b2RpYywgJmRpcC0+ZGlfY29yZSwgc2l6ZW9mKHhmc19kaW5vZGVfY29yZV90 KSk7CisJbGlieGZzX3hsYXRlX2Rpbm9kZV9jb3JlKCh4ZnNfY2FkZHJfdCkm b2RpYywgJmRpcC0+ZGlfY29yZSwgMSk7CisKKwlzdWNjZXNzID0gMTsKKwlj dXJfaW5vID0gWEZTX0FHSU5PX1RPX0lOTyhtcCwgYWdubywgYWdpbm8pOwor CisKKwkvKiBjb3B5IGFwcHJvcHJpYXRlIGRhdGEgZm9yayBtZXRhZGF0YSAq 
LworCXN3aXRjaCAoZGlwLT5kaV9jb3JlLmRpX21vZGUgJiBTX0lGTVQpIHsK KwkJY2FzZSBTX0lGRElSOgorCQkJbWVtc2V0KCZkaXJfZGF0YSwgMCwgc2l6 ZW9mKGRpcl9kYXRhKSk7CisJCQlzdWNjZXNzID0gcHJvY2Vzc19pbm9kZV9k YXRhKGRpcCwgVFlQX0RJUjIpOworCQkJYnJlYWs7CisJCWNhc2UgU19JRkxO SzoKKwkJCXN1Y2Nlc3MgPSBwcm9jZXNzX2lub2RlX2RhdGEoZGlwLCBUWVBf U1lNTElOSyk7CisJCQlicmVhazsKKwkJZGVmYXVsdDoKKwkJCXN1Y2Nlc3Mg PSBwcm9jZXNzX2lub2RlX2RhdGEoZGlwLCBUWVBfREFUQSk7CisJfQorCWNs ZWFyX25hbWV0YWJsZSgpOworCisJLyogY29weSBleHRlbmRlZCBhdHRyaWJ1 dGVzIGlmIHRoZXkgZXhpc3QgKi8KKwlpZiAoc3VjY2VzcyAmJiBkaXAtPmRp X2NvcmUuZGlfZm9ya29mZikgeworCQlhdHRyX2RhdGEucmVtb3RlX3ZhbF9j b3VudCA9IDA7CisJCXN3aXRjaCAoZGlwLT5kaV9jb3JlLmRpX2Fmb3JtYXQp IHsKKwkJCWNhc2UgWEZTX0RJTk9ERV9GTVRfTE9DQUw6CisJCQkJaWYgKCFk b250X29iZnVzY2F0ZSkKKwkJCQkJb2JmdXNjYXRlX3NmX2F0dHIoZGlwKTsK KwkJCQlicmVhazsKKworCQkJY2FzZSBYRlNfRElOT0RFX0ZNVF9FWFRFTlRT OgorCQkJCXN1Y2Nlc3MgPSBwcm9jZXNzX2V4aW5vZGUoZGlwLCBUWVBfQVRU Uik7CisJCQkJYnJlYWs7CisKKwkJCWNhc2UgWEZTX0RJTk9ERV9GTVRfQlRS RUU6CisJCQkJc3VjY2VzcyA9IHByb2Nlc3NfYnRpbm9kZShkaXAsIFRZUF9B VFRSKTsKKwkJCQlicmVhazsKKwkJfQorCQljbGVhcl9uYW1ldGFibGUoKTsK Kwl9CisKKwkvKiByZXN0b3JlIHRoZSBjb3JlIGJhY2sgdG8gaXQncyBvcmln aW5hbCBlbmRpYW5lc3MgKi8KKwltZW1jcHkoJmRpcC0+ZGlfY29yZSwgJm9k aWMsIHNpemVvZih4ZnNfZGlub2RlX2NvcmVfdCkpOworCisJcmV0dXJuIHN1 Y2Nlc3M7Cit9CisKK3N0YXRpYyBfX3VpbnQzMl90CWlub2Rlc19jb3BpZWQg PSAwOworCitzdGF0aWMgaW50Citjb3B5X2lub2RlX2NodW5rKAorCXhmc19h Z251bWJlcl90IAkJYWdubywKKwl4ZnNfaW5vYnRfcmVjX3QgCSpycCkKK3sK Kwl4ZnNfYWdpbm9fdCAJCWFnaW5vOworCWludAkJCW9mZjsKKwl4ZnNfYWdi bG9ja190CQlhZ2JubzsKKwlpbnQJCQlpOworCisJYWdpbm8gPSBiZTMyX3Rv X2NwdShycC0+aXJfc3RhcnRpbm8pOworCWFnYm5vID0gWEZTX0FHSU5PX1RP X0FHQk5PKG1wLCBhZ2lubyk7CisJb2ZmID0gWEZTX0lOT19UT19PRkZTRVQo bXAsIGFnaW5vKTsKKworCXB1c2hfY3VyKCk7CisJc2V0X2N1cigmdHlwdGFi W1RZUF9JTk9ERV0sIFhGU19BR0JfVE9fREFERFIobXAsIGFnbm8sIGFnYm5v KSwKKwkJCVhGU19GU0JfVE9fQkIobXAsIFhGU19JQUxMT0NfQkxPQ0tTKG1w KSksCisJCQlEQl9SSU5HX0lHTiwgTlVMTCk7CisJaWYgKGlvY3VyX3RvcC0+ 
ZGF0YSA9PSBOVUxMKSB7CisJCXByaW50X3dhcm5pbmcoImNhbm5vdCByZWFk IGlub2RlIGJsb2NrICV1LyV1IiwgYWdubywgYWdibm8pOworCQlyZXR1cm4g IXN0b3Bfb25fcmVhZF9lcnJvcjsKKwl9CisKKwkvKgorCSAqIHNjYW4gdGhy b3VnaCBpbm9kZXMgYW5kIGNvcHkgYW55IGJ0cmVlIGV4dGVudCBsaXN0cywg ZGlyZWN0b3J5CisJICogY29udGVudHMgYW5kIGV4dGVuZGVkIGF0dHJpYnV0 ZXMuCisJICovCisKKwlmb3IgKGkgPSAwOyBpIDwgWEZTX0lOT0RFU19QRVJf Q0hVTks7IGkrKykgeworCQl4ZnNfZGlub2RlX3QgICAgICAgICAgICAqZGlw OworCisJCWlmIChYRlNfSU5PQlRfSVNfRlJFRV9ESVNLKHJwLCBpKSkKKwkJ CWNvbnRpbnVlOworCisJCWRpcCA9ICh4ZnNfZGlub2RlX3QgKikoKGNoYXIg Kilpb2N1cl90b3AtPmRhdGEgKworCQkJCSgob2ZmICsgaSkgPDwgbXAtPm1f c2Iuc2JfaW5vZGVsb2cpKTsKKworCQlpZiAoIXByb2Nlc3NfaW5vZGUoYWdu bywgYWdpbm8gKyBpLCBkaXApKQorCQkJcmV0dXJuIDA7CisJfQorCisJaWYg KCF3cml0ZV9idWYoaW9jdXJfdG9wKSkKKwkJcmV0dXJuIDA7CisKKwlpbm9k ZXNfY29waWVkICs9IFhGU19JTk9ERVNfUEVSX0NIVU5LOworCisJaWYgKHNo b3dfcHJvZ3Jlc3MpCisJCXByaW50X3Byb2dyZXNzKCJDb3BpZWQgJXUgb2Yg JXUgaW5vZGVzICgldSBvZiAldSBBR3MpIiwKKwkJCQlpbm9kZXNfY29waWVk LCBtcC0+bV9zYi5zYl9pY291bnQsIGFnbm8sCisJCQkJbXAtPm1fc2Iuc2Jf YWdjb3VudCk7CisKKwlwb3BfY3VyKCk7CisKKwlyZXR1cm4gMTsKK30KKwor c3RhdGljIGludAorc2NhbmZ1bmNfaW5vKAorCXhmc19idHJlZV9oZHJfdAkJ KmJ0aGRyLAorCXhmc19hZ251bWJlcl90CQlhZ25vLAorCXhmc19hZ2Jsb2Nr X3QJCWFnYm5vLAorCWludAkJCWxldmVsLAorCXR5cG5tX3QJCQlidHlwZSwK Kwl2b2lkCQkJKmFyZykKK3sKKwl4ZnNfaW5vYnRfcmVjX3QJCSpycDsKKwl4 ZnNfaW5vYnRfcHRyX3QJCSpwcDsKKwlpbnQJCQlpOworCisJaWYgKGxldmVs ID09IDApIHsKKwkJcnAgPSBYRlNfQlRSRUVfUkVDX0FERFIobXAtPm1fc2Iu c2JfYmxvY2tzaXplLCB4ZnNfaW5vYnQsCisJCQkJYnRoZHIsIDEsIG1wLT5t X2lub2J0X214clswXSk7CisJCWZvciAoaSA9IDA7IGkgPCBiZTE2X3RvX2Nw dShidGhkci0+YmJfbnVtcmVjcyk7IGkrKywgcnArKykgeworCQkJaWYgKCFj b3B5X2lub2RlX2NodW5rKGFnbm8sIHJwKSkKKwkJCQlyZXR1cm4gMDsKKwkJ fQorCX0gZWxzZSB7CisJCXBwID0gWEZTX0JUUkVFX1BUUl9BRERSKG1wLT5t X3NiLnNiX2Jsb2Nrc2l6ZSwgeGZzX2lub2J0LAorCQkJCWJ0aGRyLCAxLCBt cC0+bV9pbm9idF9teHJbMV0pOworCQlmb3IgKGkgPSAwOyBpIDwgYmUxNl90 b19jcHUoYnRoZHItPmJiX251bXJlY3MpOyBpKyspIHsKKwkJCWlmICghdmFs 
aWRfYm5vKGJlMzJfdG9fY3B1KHBwW2ldKSwgYWdubywgYWdibm8sIGJ0eXBl KSkKKwkJCQljb250aW51ZTsKKwkJCWlmICghc2Nhbl9idHJlZShhZ25vLCBi ZTMyX3RvX2NwdShwcFtpXSksIGxldmVsLAorCQkJCQlidHlwZSwgYXJnLCBz Y2FuZnVuY19pbm8pKQorCQkJCXJldHVybiAwOworCQl9CisJfQorCXJldHVy biAxOworfQorCitzdGF0aWMgaW50Citjb3B5X2lub2RlcygKKwl4ZnNfYWdu dW1iZXJfdAkJYWdubywKKwl4ZnNfYWdpX3QJCSphZ2kpCit7CisJeGZzX2Fn YmxvY2tfdAkJcm9vdDsKKwlpbnQJCQlsZXZlbHM7CisKKwlyb290ID0gYmUz Ml90b19jcHUoYWdpLT5hZ2lfcm9vdCk7CisJbGV2ZWxzID0gYmUzMl90b19j cHUoYWdpLT5hZ2lfbGV2ZWwpOworCisJLyogdmFsaWRhdGUgcm9vdCBhbmQg bGV2ZWxzIGJlZm9yZSBwcm9jZXNzaW5nIHRoZSB0cmVlICovCisJaWYgKHJv b3QgPT0gMCB8fCByb290ID4gbXAtPm1fc2Iuc2JfYWdibG9ja3MpIHsKKwkJ aWYgKHNob3dfd2FybmluZ3MpCisJCQlwcmludF93YXJuaW5nKCJpbnZhbGlk IGJsb2NrIG51bWJlciAoJXUpIGluIGlub2J0ICIKKwkJCQkJInJvb3QgaW4g YWdpICV1Iiwgcm9vdCwgYWdubyk7CisJCXJldHVybiAxOworCX0KKwlpZiAo bGV2ZWxzID49IFhGU19CVFJFRV9NQVhMRVZFTFMpIHsKKwkJaWYgKHNob3df d2FybmluZ3MpCisJCQlwcmludF93YXJuaW5nKCJpbnZhbGlkIGxldmVsICgl dSkgaW4gaW5vYnQgcm9vdCAiCisJCQkJCSJpbiBhZ2kgJXUiLCBsZXZlbHMs IGFnbm8pOworCQlyZXR1cm4gMTsKKwl9CisKKwlyZXR1cm4gc2Nhbl9idHJl ZShhZ25vLCByb290LCBsZXZlbHMsIFRZUF9JTk9CVCwgYWdpLCBzY2FuZnVu Y19pbm8pOworfQorCitzdGF0aWMgaW50CitzY2FuX2FnKAorCXhmc19hZ251 bWJlcl90CWFnbm8pCit7CisJeGZzX2FnZl90CSphZ2Y7CisJeGZzX2FnaV90 CSphZ2k7CisKKwkvKiBjb3B5IHRoZSBzdXBlcmJsb2NrIG9mIHRoZSBBRyAq LworCXB1c2hfY3VyKCk7CisJc2V0X2N1cigmdHlwdGFiW1RZUF9TQl0sIFhG U19BR19EQUREUihtcCwgYWdubywgWEZTX1NCX0RBRERSKSwKKwkJCVhGU19G U1NfVE9fQkIobXAsIDEpLCBEQl9SSU5HX0lHTiwgTlVMTCk7CisJaWYgKCFp b2N1cl90b3AtPmRhdGEpIHsKKwkJcHJpbnRfd2FybmluZygiY2Fubm90IHJl YWQgc3VwZXJibG9jayBmb3IgYWcgJXUiLCBhZ25vKTsKKwkJaWYgKHN0b3Bf b25fcmVhZF9lcnJvcikKKwkJCXJldHVybiAwOworCX0gZWxzZSB7CisJCWlm ICghd3JpdGVfYnVmKGlvY3VyX3RvcCkpCisJCQlyZXR1cm4gMDsKKwl9CisK KwkvKiBjb3B5IHRoZSBBRyBmcmVlIHNwYWNlIGJ0cmVlIHJvb3QgKi8KKwlw dXNoX2N1cigpOworCXNldF9jdXIoJnR5cHRhYltUWVBfQUdGXSwgWEZTX0FH X0RBRERSKG1wLCBhZ25vLCBYRlNfQUdGX0RBRERSKG1wKSksCisJCQlYRlNf 
RlNTX1RPX0JCKG1wLCAxKSwgREJfUklOR19JR04sIE5VTEwpOworCWFnZiA9 IGlvY3VyX3RvcC0+ZGF0YTsKKwlpZiAoaW9jdXJfdG9wLT5kYXRhID09IE5V TEwpIHsKKwkJcHJpbnRfd2FybmluZygiY2Fubm90IHJlYWQgYWdmIGJsb2Nr IGZvciBhZyAldSIsIGFnbm8pOworCQlpZiAoc3RvcF9vbl9yZWFkX2Vycm9y KQorCQkJcmV0dXJuIDA7CisJfSBlbHNlIHsKKwkJaWYgKCF3cml0ZV9idWYo aW9jdXJfdG9wKSkKKwkJCXJldHVybiAwOworCX0KKworCS8qIGNvcHkgdGhl IEFHIGlub2RlIGJ0cmVlIHJvb3QgKi8KKwlwdXNoX2N1cigpOworCXNldF9j dXIoJnR5cHRhYltUWVBfQUdJXSwgWEZTX0FHX0RBRERSKG1wLCBhZ25vLCBY RlNfQUdJX0RBRERSKG1wKSksCisJCQlYRlNfRlNTX1RPX0JCKG1wLCAxKSwg REJfUklOR19JR04sIE5VTEwpOworCWFnaSA9IGlvY3VyX3RvcC0+ZGF0YTsK KwlpZiAoaW9jdXJfdG9wLT5kYXRhID09IE5VTEwpIHsKKwkJcHJpbnRfd2Fy bmluZygiY2Fubm90IHJlYWQgYWdpIGJsb2NrIGZvciBhZyAldSIsIGFnbm8p OworCQlpZiAoc3RvcF9vbl9yZWFkX2Vycm9yKQorCQkJcmV0dXJuIDA7CisJ fSBlbHNlIHsKKwkJaWYgKCF3cml0ZV9idWYoaW9jdXJfdG9wKSkKKwkJCXJl dHVybiAwOworCX0KKworCS8qIGNvcHkgdGhlIEFHIGZyZWUgbGlzdCBoZWFk ZXIgKi8KKwlwdXNoX2N1cigpOworCXNldF9jdXIoJnR5cHRhYltUWVBfQUdG TF0sIFhGU19BR19EQUREUihtcCwgYWdubywgWEZTX0FHRkxfREFERFIobXAp KSwKKwkJCVhGU19GU1NfVE9fQkIobXAsIDEpLCBEQl9SSU5HX0lHTiwgTlVM TCk7CisJaWYgKGlvY3VyX3RvcC0+ZGF0YSA9PSBOVUxMKSB7CisJCXByaW50 X3dhcm5pbmcoImNhbm5vdCByZWFkIGFnZmwgYmxvY2sgZm9yIGFnICV1Iiwg YWdubyk7CisJCWlmIChzdG9wX29uX3JlYWRfZXJyb3IpCisJCQlyZXR1cm4g MDsKKwl9IGVsc2UgeworCQlpZiAoIXdyaXRlX2J1Zihpb2N1cl90b3ApKQor CQkJcmV0dXJuIDA7CisJfQorCXBvcF9jdXIoKTsKKworCS8qIGNvcHkgQUcg ZnJlZSBzcGFjZSBidHJlZXMgKi8KKwlpZiAoYWdmKSB7CisJCWlmIChzaG93 X3Byb2dyZXNzKQorCQkJcHJpbnRfcHJvZ3Jlc3MoIkNvcHlpbmcgZnJlZSBz cGFjZSB0cmVlcyBvZiBBRyAldSIsCisJCQkJCWFnbm8pOworCQlpZiAoIWNv cHlfZnJlZV9ibm9fYnRyZWUoYWdubywgYWdmKSkKKwkJCXJldHVybiAwOwor CQlpZiAoIWNvcHlfZnJlZV9jbnRfYnRyZWUoYWdubywgYWdmKSkKKwkJCXJl dHVybiAwOworCX0KKworCS8qIGNvcHkgaW5vZGUgYnRyZWVzIGFuZCB0aGUg aW5vZGVzIGFuZCB0aGVpciBhc3NvY2lhdGVkIG1ldGFkYXRhICovCisJaWYg KGFnaSkgeworCQlpZiAoIWNvcHlfaW5vZGVzKGFnbm8sIGFnaSkpCisJCQly ZXR1cm4gMDsKKwl9CisKKwlwb3BfY3VyKCk7CisJcG9wX2N1cigpOworCXBv 
cF9jdXIoKTsKKworCXJldHVybiAxOworfQorCitzdGF0aWMgaW50Citjb3B5 X2lubygKKwl4ZnNfaW5vX3QJCWlubywKKwl0eXBubV90CQkJaXR5cGUpCit7 CisJeGZzX2FnbnVtYmVyX3QJCWFnbm87CisJeGZzX2FnYmxvY2tfdAkJYWdi bm87CisJeGZzX2FnaW5vX3QJCWFnaW5vOworCXhmc19kaW5vZGVfdAkJKmRp cDsKKwl4ZnNfZGlub2RlX2NvcmVfdAl0ZGljOworCWludAkJCW9mZnNldDsK KworCWlmIChpbm8gPT0gMCkKKwkJcmV0dXJuIDE7CisKKwlhZ25vID0gWEZT X0lOT19UT19BR05PKG1wLCBpbm8pOworCWFnaW5vID0gWEZTX0lOT19UT19B R0lOTyhtcCwgaW5vKTsKKwlhZ2JubyA9IFhGU19BR0lOT19UT19BR0JOTyht cCwgYWdpbm8pOworCW9mZnNldCA9IFhGU19BR0lOT19UT19PRkZTRVQobXAs IGFnaW5vKTsKKworCWlmIChhZ25vID49IG1wLT5tX3NiLnNiX2FnY291bnQg fHwgYWdibm8gPj0gbXAtPm1fc2Iuc2JfYWdibG9ja3MgfHwKKwkJCW9mZnNl dCA+PSBtcC0+bV9zYi5zYl9pbm9wYmxvY2spIHsKKwkJaWYgKHNob3dfd2Fy bmluZ3MpCisJCQlwcmludF93YXJuaW5nKCJpbnZhbGlkICVzIGlub2RlIG51 bWJlciAoJWxsZCkiLAorCQkJCQl0eXB0YWJbaXR5cGVdLm5hbWUsIChsb25n IGxvbmcpaW5vKTsKKwkJcmV0dXJuIDE7CisJfQorCisJcHVzaF9jdXIoKTsK KwlzZXRfY3VyKCZ0eXB0YWJbVFlQX0lOT0RFXSwgWEZTX0FHQl9UT19EQURE UihtcCwgYWdubywgYWdibm8pLAorCQkJYmxrYmIsIERCX1JJTkdfSUdOLCBO VUxMKTsKKwlpZiAoaW9jdXJfdG9wLT5kYXRhID09IE5VTEwpIHsKKwkJcHJp bnRfd2FybmluZygiY2Fubm90IHJlYWQgJXMgaW5vZGUgJWxsZCIsCisJCQkJ dHlwdGFiW2l0eXBlXS5uYW1lLCAobG9uZyBsb25nKWlubyk7CisJCXJldHVy biAhc3RvcF9vbl9yZWFkX2Vycm9yOworCX0KKwlvZmZfY3VyKG9mZnNldCA8 PCBtcC0+bV9zYi5zYl9pbm9kZWxvZywgbXAtPm1fc2Iuc2JfaW5vZGVzaXpl KTsKKworCWRpcCA9IGlvY3VyX3RvcC0+ZGF0YTsKKwlsaWJ4ZnNfeGxhdGVf ZGlub2RlX2NvcmUoKHhmc19jYWRkcl90KSZkaXAtPmRpX2NvcmUsICZ0ZGlj LCAxKTsKKwltZW1jcHkoJmRpcC0+ZGlfY29yZSwgJnRkaWMsIHNpemVvZih4 ZnNfZGlub2RlX2NvcmVfdCkpOworCisJY3VyX2lubyA9IGlubzsKKwlyZXR1 cm4gcHJvY2Vzc19pbm9kZV9kYXRhKGRpcCwgaXR5cGUpOworfQorCisKK3N0 YXRpYyBpbnQKK2NvcHlfc2JfaW5vZGVzKHZvaWQpCit7CisJaWYgKCFjb3B5 X2lubyhtcC0+bV9zYi5zYl9yYm1pbm8sIFRZUF9SVEJJVE1BUCkpCisJCXJl dHVybiAwOworCisJaWYgKCFjb3B5X2lubyhtcC0+bV9zYi5zYl9yc3VtaW5v LCBUWVBfUlRTVU1NQVJZKSkKKwkJcmV0dXJuIDA7CisKKwlpZiAoIWNvcHlf aW5vKG1wLT5tX3NiLnNiX3VxdW90aW5vLCBUWVBfRFFCTEspKQorCQlyZXR1 
cm4gMDsKKworCXJldHVybiBjb3B5X2lubyhtcC0+bV9zYi5zYl9ncXVvdGlu bywgVFlQX0RRQkxLKTsKK30KKworc3RhdGljIGludAorY29weV9sb2codm9p ZCkKK3sKKwlpZiAoc2hvd19wcm9ncmVzcykKKwkJcHJpbnRfcHJvZ3Jlc3Mo IkNvcHlpbmcgbG9nIik7CisKKwlwdXNoX2N1cigpOworCXNldF9jdXIoJnR5 cHRhYltUWVBfTE9HXSwgWEZTX0ZTQl9UT19EQUREUihtcCwgbXAtPm1fc2Iu c2JfbG9nc3RhcnQpLAorCQkJbXAtPm1fc2Iuc2JfbG9nYmxvY2tzICogYmxr YmIsIERCX1JJTkdfSUdOLCBOVUxMKTsKKwlpZiAoaW9jdXJfdG9wLT5kYXRh ID09IE5VTEwpIHsKKwkJcHJpbnRfd2FybmluZygiY2Fubm90IHJlYWQgbG9n Iik7CisJCXJldHVybiAhc3RvcF9vbl9yZWFkX2Vycm9yOworCX0KKwlyZXR1 cm4gd3JpdGVfYnVmKGlvY3VyX3RvcCk7Cit9CisKK3N0YXRpYyBpbnQKK21l dGFkdW1wX2YoCisJaW50IAkJYXJnYywKKwljaGFyIAkJKiphcmd2KQorewor CXhmc19hZ251bWJlcl90CWFnbm87CisJaW50CQljOworCWludAkJc3RhcnRf aW9jdXJfc3A7CisKKwlleGl0Y29kZSA9IDE7CisJc2hvd19wcm9ncmVzcyA9 IDA7CisJc2hvd193YXJuaW5ncyA9IDA7CisJc3RvcF9vbl9yZWFkX2Vycm9y ID0gMDsKKworCWlmIChtcC0+bV9zYi5zYl9tYWdpY251bSAhPSBYRlNfU0Jf TUFHSUMpIHsKKwkJcHJpbnRfd2FybmluZygiYmFkIHN1cGVyYmxvY2sgbWFn aWMgbnVtYmVyICV4LCBnaXZpbmcgdXAiLAorCQkJCW1wLT5tX3NiLnNiX21h Z2ljbnVtKTsKKwkJcmV0dXJuIDA7CisJfQorCisJd2hpbGUgKChjID0gZ2V0 b3B0KGFyZ2MsIGFyZ3YsICJlZ293IikpICE9IEVPRikgeworCQlzd2l0Y2gg KGMpIHsKKwkJCWNhc2UgJ2UnOgorCQkJCXN0b3Bfb25fcmVhZF9lcnJvciA9 IDE7CisJCQkJYnJlYWs7CisJCQljYXNlICdnJzoKKwkJCQlzaG93X3Byb2dy ZXNzID0gMTsKKwkJCQlicmVhazsKKwkJCWNhc2UgJ28nOgorCQkJCWRvbnRf b2JmdXNjYXRlID0gMTsKKwkJCQlicmVhazsKKwkJCWNhc2UgJ3cnOgorCQkJ CXNob3dfd2FybmluZ3MgPSAxOworCQkJCWJyZWFrOworCQkJZGVmYXVsdDoK KwkJCQlwcmludF93YXJuaW5nKCJiYWQgb3B0aW9uIGZvciBtZXRhZHVtcCBj b21tYW5kIik7CisJCQkJcmV0dXJuIDA7CisJCX0KKwl9CisKKwlpZiAob3B0 aW5kICE9IGFyZ2MgLSAxKSB7CisJCXByaW50X3dhcm5pbmcoInRvbyBmZXcg b3B0aW9ucyBmb3IgbWV0YWR1bXAgKG5vIGZpbGVuYW1lIGdpdmVuKSIpOwor CQlyZXR1cm4gMDsKKwl9CisKKwltZXRhYmxvY2sgPSAoeGZzX21ldGFibG9j a190ICopY2FsbG9jKEJCU0laRSArIDEsIEJCU0laRSk7CisJaWYgKG1ldGFi bG9jayA9PSBOVUxMKSB7CisJCXByaW50X3dhcm5pbmcoIm1lbW9yeSBhbGxv Y2F0aW9uIGZhaWx1cmUiKTsKKwkJcmV0dXJuIDA7CisJfQorCW1ldGFibG9j 
ay0+bWJfYmxvY2tsb2cgPSBCQlNISUZUOworCW1ldGFibG9jay0+bWJfbWFn aWMgPSBjcHVfdG9fYmUzMihYRlNfTURfTUFHSUMpOworCisJaWYgKCFjcmVh dGVfbmFtZXRhYmxlKCkpIHsKKwkJcHJpbnRfd2FybmluZygibWVtb3J5IGFs bG9jYXRpb24gZmFpbHVyZSIpOworCQlmcmVlKG1ldGFibG9jayk7CisJCXJl dHVybiAwOworCX0KKworCWJsb2NrX2luZGV4ID0gKF9fYmU2NCAqKSgoY2hh ciAqKW1ldGFibG9jayArIHNpemVvZih4ZnNfbWV0YWJsb2NrX3QpKTsKKwli bG9ja19idWZmZXIgPSAoY2hhciAqKW1ldGFibG9jayArIEJCU0laRTsKKwlu dW1faW5kaWNpZXMgPSAoQkJTSVpFIC0gc2l6ZW9mKHhmc19tZXRhYmxvY2tf dCkpIC8gc2l6ZW9mKF9fYmU2NCk7CisJY3VyX2luZGV4ID0gMDsKKwlzdGFy dF9pb2N1cl9zcCA9IGlvY3VyX3NwOworCisJaWYgKHN0cmNtcChhcmd2W29w dGluZF0sICItIikgPT0gMCkgeworCQlpZiAoaXNhdHR5KGZpbGVubyhzdGRv dXQpKSkgeworCQkJcHJpbnRfd2FybmluZygiY2Fubm90IHdyaXRlIHRvIGEg dGVybWluYWwiKTsKKwkJCWZyZWUobmFtZXRhYmxlKTsKKwkJCWZyZWUobWV0 YWJsb2NrKTsKKwkJCXJldHVybiAwOworCQl9CisJCW91dGYgPSBzdGRvdXQ7 CisJfSBlbHNlIHsKKwkJb3V0ZiA9IGZvcGVuKGFyZ3Zbb3B0aW5kXSwgIndi Iik7CisJCWlmIChvdXRmID09IE5VTEwpIHsKKwkJCXByaW50X3dhcm5pbmco ImNhbm5vdCBjcmVhdGUgZHVtcCBmaWxlIik7CisJCQlmcmVlKG5hbWV0YWJs ZSk7CisJCQlmcmVlKG1ldGFibG9jayk7CisJCQlyZXR1cm4gMDsKKwkJfQor CX0KKworCWV4aXRjb2RlID0gMDsKKworCWZvciAoYWdubyA9IDA7IGFnbm8g PCBtcC0+bV9zYi5zYl9hZ2NvdW50OyBhZ25vKyspIHsKKwkJaWYgKCFzY2Fu X2FnKGFnbm8pKSB7CisJCQlleGl0Y29kZSA9IDE7CisJCQlicmVhazsKKwkJ fQorCX0KKworCS8qIGNvcHkgcmVhbHRpbWUgYW5kIHF1b3RhIGlub2RlIGNv bnRlbnRzICovCisJaWYgKCFleGl0Y29kZSkKKwkJZXhpdGNvZGUgPSAhY29w eV9zYl9pbm9kZXMoKTsKKworCS8qIGNvcHkgbG9nIGlmIGl0J3MgaW50ZXJu YWwgKi8KKwlpZiAoKG1wLT5tX3NiLnNiX2xvZ3N0YXJ0ICE9IDApICYmICFl eGl0Y29kZSkKKwkJZXhpdGNvZGUgPSAhY29weV9sb2coKTsKKworCS8qIHdy aXRlIHRoZSByZW1haW5pbmcgaW5kZXggKi8KKwlpZiAoIWV4aXRjb2RlKQor CQlleGl0Y29kZSA9ICF3cml0ZV9pbmRleCgpOworCisJaWYgKHByb2dyZXNz X3NpbmNlX3dhcm5pbmcpCisJCWZwdXRjKCdcbicsIChvdXRmID09IHN0ZG91 dCkgPyBzdGRlcnIgOiBzdGRvdXQpOworCisJaWYgKG91dGYgIT0gc3Rkb3V0 KQorCQlmY2xvc2Uob3V0Zik7CisKKwkvKiBjbGVhbnVwIGlvY3VyIHN0YWNr ICovCisJd2hpbGUgKGlvY3VyX3NwID4gc3RhcnRfaW9jdXJfc3ApCisJCXBv 
cF9jdXIoKTsKKworCWZyZWUobmFtZXRhYmxlKTsKKwlmcmVlKG1ldGFibG9j ayk7CisKKwlyZXR1cm4gMDsKK30KCj09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PQp4ZnNwcm9ncy9kYi9tZXRhZHVtcC5oCj09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PQoKLS0tIGEveGZzcHJvZ3MvZGIvbWV0YWR1bXAu aAkyMDA2LTA2LTE3IDAwOjU4OjI0LjAwMDAwMDAwMCArMTAwMAorKysgYi94 ZnNwcm9ncy9kYi9tZXRhZHVtcC5oCTIwMDctMDUtMTggMTI6MDc6NTcuMzE0 MDIzNzc4ICsxMDAwCkBAIC0wLDAgKzEsMTkgQEAKKy8qCisgKiBDb3B5cmln aHQgKGMpIDIwMDcgU2lsaWNvbiBHcmFwaGljcywgSW5jLgorICogQWxsIFJp Z2h0cyBSZXNlcnZlZC4KKyAqCisgKiBUaGlzIHByb2dyYW0gaXMgZnJlZSBz b2Z0d2FyZTsgeW91IGNhbiByZWRpc3RyaWJ1dGUgaXQgYW5kL29yCisgKiBt b2RpZnkgaXQgdW5kZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQ dWJsaWMgTGljZW5zZSBhcworICogcHVibGlzaGVkIGJ5IHRoZSBGcmVlIFNv ZnR3YXJlIEZvdW5kYXRpb24uCisgKgorICogVGhpcyBwcm9ncmFtIGlzIGRp c3RyaWJ1dGVkIGluIHRoZSBob3BlIHRoYXQgaXQgd291bGQgYmUgdXNlZnVs LAorICogYnV0IFdJVEhPVVQgQU5ZIFdBUlJBTlRZOyB3aXRob3V0IGV2ZW4g dGhlIGltcGxpZWQgd2FycmFudHkgb2YKKyAqIE1FUkNIQU5UQUJJTElUWSBv ciBGSVRORVNTIEZPUiBBIFBBUlRJQ1VMQVIgUFVSUE9TRS4gIFNlZSB0aGUK KyAqIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFp bHMuCisgKgorICogWW91IHNob3VsZCBoYXZlIHJlY2VpdmVkIGEgY29weSBv ZiB0aGUgR05VIEdlbmVyYWwgUHVibGljIExpY2Vuc2UKKyAqIGFsb25nIHdp dGggdGhpcyBwcm9ncmFtOyBpZiBub3QsIHdyaXRlIHRoZSBGcmVlIFNvZnR3 YXJlIEZvdW5kYXRpb24sCisgKiBJbmMuLCAgNTEgRnJhbmtsaW4gU3QsIEZp ZnRoIEZsb29yLCBCb3N0b24sIE1BICAwMjExMC0xMzAxICBVU0EKKyAqLwor CitleHRlcm4gdm9pZAltZXRhZHVtcF9pbml0KHZvaWQpOwoKPT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09Cnhmc3Byb2dzL2RiL3hmc19tZXRhZHVt cC5zaAo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KCi0tLSBhL3hm c3Byb2dzL2RiL3hmc19tZXRhZHVtcC5zaAkyMDA2LTA2LTE3IDAwOjU4OjI0 LjAwMDAwMDAwMCArMTAwMAorKysgYi94ZnNwcm9ncy9kYi94ZnNfbWV0YWR1 
bXAuc2gJMjAwNy0wNS0yOCAxNDo1OToxMi44MTc4NTA5MjEgKzEwMDAKQEAg LTAsMCArMSwzOCBAQAorIyEvYmluL3NoIC1mCisjCisjIENvcHlyaWdodCAo YykgMjAwNyBTaWxpY29uIEdyYXBoaWNzLCBJbmMuICBBbGwgUmlnaHRzIFJl c2VydmVkLgorIworCitPUFRTPSIgIgorREJPUFRTPSIgIgorVVNBR0U9IlVz YWdlOiB4ZnNfbWV0YWR1bXAgWy1lZm9nd1ZdIFstbCBsb2dkZXZdIHNvdXJj ZSB0YXJnZXQiCisKK3doaWxlIGdldG9wdHMgImVmZ2w6b3dWIiBjCitkbwor CWNhc2UgJGMgaW4KKwllKQlPUFRTPSRPUFRTIi1lICI7OworCWcpCU9QVFM9 JE9QVFMiLWcgIjs7CisJbykJT1BUUz0kT1BUUyItbyAiOzsKKwl3KQlPUFRT PSRPUFRTIi13ICI7OworCWYpCURCT1BUUz0kREJPUFRTIiAtZiI7OworCWwp CURCT1BUUz0kREJPUFRTIiAtbCAiJE9QVEFSRyIgIjs7CisJVikJeGZzX2Ri IC1wIHhmc19tZXRhZHVtcCAtVgorCQlzdGF0dXM9JD8KKwkJZXhpdCAkc3Rh dHVzCisJCTs7CisJXD8pCWVjaG8gJFVTQUdFIDE+JjIKKwkJZXhpdCAyCisJ CTs7CisJZXNhYworZG9uZQorc2V0IC0tIGV4dHJhICRACitzaGlmdCAkT1BU SU5ECitjYXNlICQjIGluCisJMikJeGZzX2RiJERCT1BUUyAtaSAtcCB4ZnNf bWV0YWR1bXAgLWMgIm1ldGFkdW1wJE9QVFMgJDIiICQxCisJCXN0YXR1cz0k PworCQk7OworCSopCWVjaG8gJFVTQUdFIDE+JjIKKwkJZXhpdCAyCisJCTs7 Citlc2FjCitleGl0ICRzdGF0dXMKCj09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PQp4ZnNwcm9ncy9pbmNsdWRlL3hmc19tZXRhZHVtcC5oCj09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PQoKLS0tIGEveGZzcHJvZ3MvaW5j bHVkZS94ZnNfbWV0YWR1bXAuaAkyMDA2LTA2LTE3IDAwOjU4OjI0LjAwMDAw MDAwMCArMTAwMAorKysgYi94ZnNwcm9ncy9pbmNsdWRlL3hmc19tZXRhZHVt cC5oCTIwMDctMDUtMTggMTI6MzE6MDAuODQwNjM1NzgzICsxMDAwCkBAIC0w LDAgKzEsMzIgQEAKKy8qCisgKiBDb3B5cmlnaHQgKGMpIDIwMDcgU2lsaWNv biBHcmFwaGljcywgSW5jLgorICogQWxsIFJpZ2h0cyBSZXNlcnZlZC4KKyAq CisgKiBUaGlzIHByb2dyYW0gaXMgZnJlZSBzb2Z0d2FyZTsgeW91IGNhbiBy ZWRpc3RyaWJ1dGUgaXQgYW5kL29yCisgKiBtb2RpZnkgaXQgdW5kZXIgdGhl IHRlcm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSBhcwor ICogcHVibGlzaGVkIGJ5IHRoZSBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24u CisgKgorICogVGhpcyBwcm9ncmFtIGlzIGRpc3RyaWJ1dGVkIGluIHRoZSBo b3BlIHRoYXQgaXQgd291bGQgYmUgdXNlZnVsLAorICogYnV0IFdJVEhPVVQg 
QU5ZIFdBUlJBTlRZOyB3aXRob3V0IGV2ZW4gdGhlIGltcGxpZWQgd2FycmFu dHkgb2YKKyAqIE1FUkNIQU5UQUJJTElUWSBvciBGSVRORVNTIEZPUiBBIFBB UlRJQ1VMQVIgUFVSUE9TRS4gIFNlZSB0aGUKKyAqIEdOVSBHZW5lcmFsIFB1 YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFpbHMuCisgKgorICogWW91IHNo b3VsZCBoYXZlIHJlY2VpdmVkIGEgY29weSBvZiB0aGUgR05VIEdlbmVyYWwg UHVibGljIExpY2Vuc2UKKyAqIGFsb25nIHdpdGggdGhpcyBwcm9ncmFtOyBp ZiBub3QsIHdyaXRlIHRoZSBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24sCisg KiBJbmMuLCAgNTEgRnJhbmtsaW4gU3QsIEZpZnRoIEZsb29yLCBCb3N0b24s IE1BICAwMjExMC0xMzAxICBVU0EKKyAqLworCisjaWZuZGVmIF9YRlNfTUVU QURVTVBfSF8KKyNkZWZpbmUgX1hGU19NRVRBRFVNUF9IXworCisjZGVmaW5l CVhGU19NRF9NQUdJQwkJMHg1ODQ2NTM0ZAkvKiAnWEZTTScgKi8KKwordHlw ZWRlZiBzdHJ1Y3QgeGZzX21ldGFibG9jayB7CisJX19iZTMyCQltYl9tYWdp YzsKKwlfX2JlMTYJCW1iX2NvdW50OworCV9fdWludDhfdAltYl9ibG9ja2xv ZzsKKwlfX3VpbnQ4X3QJbWJfcmVzZXJ2ZWQ7CisJLyogZm9sbG93ZWQgYnkg YW4gYXJyYXkgb2YgeGZzX2RhZGRyX3QgKi8KK30geGZzX21ldGFibG9ja190 OworCisjZW5kaWYgLyogX1hGU19NRVRBRFVNUF9IXyAqLwo= ------------k9x39jGsayoNr3EUsJAZe6-- From owner-xfs@oss.sgi.com Sun May 27 23:31:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 27 May 2007 23:31:36 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S6VTWt011804 for ; Sun, 27 May 2007 23:31:31 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA05736; Mon, 28 May 2007 16:31:23 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1161) id 771CE58CA5A4; Mon, 28 May 2007 16:31:23 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com Cc: xfs@oss.sgi.com Subject: TAKE 964999 - lazy superblock counters for XFS Message-Id: <20070528063123.771CE58CA5A4@chook.melbourne.sgi.com> Date: Mon, 28 May 2007 16:31:23 +1000 (EST) From: bnaujok@sgi.com (Barry Naujok) X-archive-position: 11530 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs Update version for lazy superblock counter support Date: Mon May 28 16:30:24 AEST 2007 Workarea: chook.melbourne.sgi.com:/home/bnaujok/isms/xfs-cmds Inspected by: dgc@sgi.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28712a xfsprogs/VERSION - 1.171 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/VERSION.diff?r1=text&tr1=1.171&r2=text&tr2=1.170&f=h xfsprogs/doc/CHANGES - 1.239 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/doc/CHANGES.diff?r1=text&tr1=1.239&r2=text&tr2=1.238&f=h - Update version for lazy superblock counter support From owner-xfs@oss.sgi.com Mon May 28 00:09:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 00:09:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4S79EWt020854 for ; Mon, 28 May 2007 00:09:16 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA06550; Mon, 28 May 2007 17:09:10 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1161) id 1394858CA5A4; Mon, 28 May 2007 17:09:09 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com Cc: xfs@oss.sgi.com Subject: TAKE 956281 - Debian patches Message-Id: <20070528070910.1394858CA5A4@chook.melbourne.sgi.com> Date: Mon, 28 May 2007 17:09:09 +1000 (EST) From: bnaujok@sgi.com (Barry Naujok) X-archive-position: 11531 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@sgi.com Precedence: bulk X-list: xfs Update debian changelogs Date: Mon May 28 17:08:40 AEST 2007 Workarea: chook.melbourne.sgi.com:/home/bnaujok/isms/xfs-cmds Inspected by: nathans@aconex.com The following 
file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28713a xfsdump/debian/changelog - 1.65 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsdump/debian/changelog.diff?r1=text&tr1=1.65&r2=text&tr2=1.64&f=h - Update debian changelogs dmapi/include/builddefs.in - 1.27 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/dmapi/include/builddefs.in.diff?r1=text&tr1=1.27&r2=text&tr2=1.26&f=h - Fix autoconf build issue dmapi/debian/changelog - 1.27 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/dmapi/debian/changelog.diff?r1=text&tr1=1.27&r2=text&tr2=1.26&f=h - Update debian changelogs From owner-xfs@oss.sgi.com Mon May 28 04:18:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 04:18:27 -0700 (PDT) Received: from dh.localdomain (inn.nightwish.hu [217.20.130.190]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4SBHbWt017389 for ; Mon, 28 May 2007 04:18:21 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HsdEC-0000KW-K8; Mon, 28 May 2007 13:17:32 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Mon, 28 May 2007 13:17:31 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <200705280350.18384.dap@mail.index.hu> <20070528021718.GZ85884050@sgi.com> In-Reply-To: <20070528021718.GZ85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705281317.32384.dap@mail.index.hu> X-archive-position: 11532 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Monday 28 May 2007 04:17:18 David Chinner wrote: > On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland 
wrote: > > On Monday 28 May 2007 02:30:11 David Chinner wrote: > > > On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: > > > > .and I've spammed such messages. This "internal error" isn't a good > > > > reason to shut down the file system? > > > > > > Actually, that error does shut the filesystem down in most cases. When > > > you see that output, the function is returning -EFSCORRUPTED. You've > > > got a corrupted freespace btree. > > > > > > The reason why you get spammed is that this is happening during > > > background writeback, and there is no one to return the -EFSCORRUPTED > > > error to. The background writeback path doesn't specifically detect > > > shut down filesystems or trigger shutdowns on errors because that > > > happens in different layers so you just end up with failed data writes. > > > These errors will occur on the next foreground data or metadata > > > allocation and that will shut the filesystem down at that point. > > > > > > I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe > > > in this case we should be shutting down the filesystem. That would > > > certainly cut down on the spamming and would not appear to change > > > any other behaviour.... > > > > If I remember correctly, my file system wasn't shut down at all, it > > was "writeable" for the whole night, and yafc slowly "wrote" files to it. > > Maybe all write operations had failed, but yafc doesn't warn. > > So you never created new files or directories, unlinked files or > directories, did synchronous writes, etc? Just had slowly growing files? I just overwrote badly downloaded files. > > Spamming is just annoying when we need to find out what went wrong (My > > kernel.log is 300Mb), but for data security it's important to react to > > an EFSCORRUPTED error in any case, I think. Please consider this. > > The filesystem has responded correctly to the corruption in terms of > data security (i.e.
failed the data write and warned noisily about > it), but it probably hasn't done everything it should.... > > Hmmmm. A quick look at the linux code makes me think that background > writeback on linux has never been able to cause a shutdown in this > case. However, the same error on Irix will definitely cause a > shutdown, though.... I hope Linux will follow Irix, that's a consistent standpoint. David, do you have a plan to implement your "reporting raid5 block layer" idea? As far as I can see, no one else cares about this silent data loss on temporarily (cable, power) failed raid5 arrays; I really hope that at least you do! -- d From owner-xfs@oss.sgi.com Mon May 28 05:54:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 05:54:04 -0700 (PDT) Received: from dh.localdomain (inn.nightwish.hu [217.20.130.190]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4SCrxWt011034 for ; Mon, 28 May 2007 05:54:01 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HsejT-0000aj-V1; Mon, 28 May 2007 14:53:56 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Mon, 28 May 2007 14:53:55 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> In-Reply-To: <20070525000547.GH85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705281453.55618.dap@mail.index.hu> X-archive-position: 11533 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Friday 25 May 2007 02:05:47 David Chinner wrote: > "-o ro,norecovery" will allow you to mount the filesystem and get any > uncorrupted data off it.
> > You still may get shutdowns if you trip across corrupted metadata in > the filesystem, though. This filesystem is completely dead. hq:~# mount -o ro,norecovery /dev/loop1 /mnt/r5 May 28 13:41:50 hq kernel: Mounting filesystem "loop1" in no-recovery mode. Filesystem will be inconsistent. May 28 13:41:50 hq kernel: XFS: failed to read root inode hq:~# xfs_db /dev/loop1 xfs_db: cannot read root inode (22) xfs_db: cannot read realtime bitmap inode (22) Segmentation fault hq:~# strace xfs_db /dev/loop1 _llseek(4, 0, [0], SEEK_SET) = 0 read(4, "XFSB\0\0\20\0\0\0\0\0\6\374\253\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 pread(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 480141901312) = 512 pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192 write(2, "xfs_db: cannot read root inode ("..., 36xfs_db: cannot read root inode (22) ) = 36 pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192 write(2, "xfs_db: cannot read realtime bit"..., 47xfs_db: cannot read realtime bitmap inode (22) ) = 47 --- SIGSEGV (Segmentation fault) @ 0 (0) --- +++ killed by SIGSEGV +++ Browsing with hexdump -C, seems like a part of a PDF file is at 128Kb, on the place of the root inode. 
:( -- d From owner-xfs@oss.sgi.com Mon May 28 08:30:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 08:31:04 -0700 (PDT) Received: from dh.localdomain (inn.nightwish.hu [217.20.130.190]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4SFUwWt024944 for ; Mon, 28 May 2007 08:30:59 -0700 Received: from dap by dh.localdomain with local (Exim 4.63) (envelope-from ) id 1HshBN-0001Au-Jl; Mon, 28 May 2007 17:30:53 +0200 From: Pallai Roland Organization: magex To: David Chinner Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Date: Mon, 28 May 2007 17:30:52 +0200 User-Agent: KMail/1.9.6 Cc: Linux-Raid , xfs@oss.sgi.com References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <200705281453.55618.dap@mail.index.hu> In-Reply-To: <200705281453.55618.dap@mail.index.hu> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705281730.53343.dap@mail.index.hu> X-archive-position: 11534 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dap@mail.index.hu Precedence: bulk X-list: xfs On Monday 28 May 2007 14:53:55 Pallai Roland wrote: > On Friday 25 May 2007 02:05:47 David Chinner wrote: > > "-o ro,norecovery" will allow you to mount the filesystem and get any > > uncorrupted data off it. > > > > You still may get shutdowns if you trip across corrupted metadata in > > the filesystem, though. > > This filesystem is completely dead. > [...] I tried to make a md patch to stop writes if a raid5 array got 2+ failed drives, but I found it's already done, oops. :) handle_stripe5() ignores writes in this case quietly, I tried and works. So how I lost my file system? 
My first guess about partially successed writes wasn't right: there wasn't real write to the disks after the second disk has been kicked, so the scenario is same to a simple power loss from this point of view. Am I thinking right? There's an another layer I used on this box between md and xfs: loop-aes. I used it since years and rock stable, but now it's my first suspect, cause I found a bug in it today: I assembled my array from n-1 disks, and I failed a second disk for a test and I found /dev/loop1 still provides *random* data where /dev/md1 serves nothing, it's definitely a loop-aes bug: /dev/loop1: [0700]:180907 (/dev/md1) encryption=AES128 multi-key-v3 hq:~# dd if=/dev/md1 bs=1k count=128 skip=128 >/dev/null dd: reading `/dev/md1': Input/output error 0+0 records in 0+0 records out hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum 128+0 records in 128+0 records out 131072 bytes (131 kB) copied, 0.027775 seconds, 4.7 MB/s e2548a924a0e835bb45fb50058acba98 - (!!!) hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum 128+0 records in 128+0 records out 131072 bytes (131 kB) copied, 0.030311 seconds, 4.3 MB/s c6a23412fb75eb5a7eb1d6a7813eb86b - (!!!) It's not an explanation to my screwed up file system, but for me it's enough to drop loop-aes. Eh. -- d From owner-xfs@oss.sgi.com Mon May 28 15:45:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 15:45:37 -0700 (PDT) Received: from mail.ggsys.net (mail.ggsys.net [69.26.161.131]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4SMjVWt024369 for ; Mon, 28 May 2007 15:45:33 -0700 Received: (qmail 26429 invoked from network); 28 May 2007 22:45:30 -0000 Received: from cpe-70-112-65-134.austin.res.rr.com (HELO ?192.168.4.12?) 
(70.112.65.134) by mail.ggsys.net with SMTP; 28 May 2007 22:45:30 -0000 Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem From: Alberto Alonso To: David Chinner Cc: Pallai Roland , Linux-Raid , xfs@oss.sgi.com In-Reply-To: <20070525083650.GO85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <1180071831.21028.125.camel@w100> <20070525083650.GO85884050@sgi.com> Content-Type: text/plain Organization: Global Gate Systems LLC. Date: Mon, 28 May 2007 17:45:27 -0500 Message-Id: <1180392327.21028.140.camel@w100> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 Content-Transfer-Encoding: 7bit X-archive-position: 11535 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: alberto@ggsys.net Precedence: bulk X-list: xfs On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote: > On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote: > > I think his point was that going into a read only mode causes a > > less catastrophic situation (ie. a web server can still serve > > pages). > > Sure - but once you've detected one corruption or had metadata > I/O errors, can you trust the rest of the filesystem? > > > I think that is a valid point, rather than shutting down > > the file system completely, an automatic switch to where the least > > disruption of service can occur is always desired. > > I consider the possibility of serving out bad data (i.e after > a remount to readonly) to be the worst possible disruption of > service that can happen ;) I guess it does depend on the nature of the failure. A write failure on block 2000 does not imply corruption of the other 2TB of data. I wish I knew more on the internals of file systems, unfortunately since I don't, I was just commenting on feature that would be nice, but maybe there is no way to implement them. 
I figured that a dynamic table with bad blocks could be kept, if an attempt to access those blocks is generated (read or write) an I/O error is returned, if the block is not on the list, the access is processed. This would help a server with large file systems continue operations for most users. > > I personally have found the XFS file system to be great for > > my needs (except issues with NFS interaction, where the bug report > > never got answered), but that doesn't mean it can not be improved. > > Got a pointer? I can't seem to find it. I'm pretty sure I used bugzilla to report it. I did find the kernel dump file though, so here it is: Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns: vp/0xd1e69c80, invp/0xc989e380 Oct 3 15:34:07 localhost kernel: ------------[ cut here ]------------ Oct 3 15:34:07 localhost kernel: kernel BUG at fs/xfs/support/debug.c:106! Oct 3 15:34:07 localhost kernel: invalid operand: 0000 [#1] Oct 3 15:34:07 localhost kernel: PREEMPT SMP Oct 3 15:34:07 localhost kernel: Modules linked in: af_packet iptable_filter ip_tables nfsd exportfs lockd sunrpc ipv6xfs capability commoncap ext3 jbd mbc ache aic7xxx i2c_dev tsdev floppy mousedev parport_pc parport psmouse evdev pcspkrhw_random shpchp pciehp pci_hotplug intel_agp intel_mch_agp agpgart uhci_h cd usbcore piix ide_core e1000 cfi_cmdset_0001 cfi_util mtdpart mtdcore jedec_probe gen_probe chipreg dm_mod w83781d i2c_sensor i2c_i801 i2c_core raid5 xor genrtc sd_mod aic79xx scsi_mod raid1 md unix font vesafb cfbcopyarea cfbimgblt cfbfillrect Oct 3 15:34:07 localhost kernel: CPU: 0 Oct 3 15:34:07 localhost kernel: EIP: 0060:[__crc_pm_idle +3334982/5290900] Not tainted Oct 3 15:34:07 localhost kernel: EFLAGS: 00010246 (2.6.8-2-686-smp) Oct 3 15:34:07 localhost kernel: EIP is at cmn_err+0xc5/0xe0 [xfs] Oct 3 15:34:07 localhost kernel: eax: 00000000 ebx: f602c000 ecx: c02dcfbc edx: c02dcfbc Oct 3 15:34:07 localhost kernel: esi: f8c40e28 edi: f8c56a3e ebp: 00000293 esp: f602da08 Oct 3 
15:34:07 localhost kernel: ds: 007b es: 007b ss: 0068 Oct 3 15:34:07 localhost kernel: Process nfsd (pid: 2740, threadinfo=f602c000 task=f71a7210) Oct 3 15:34:07 localhost kernel: Stack: f8c40e28 f8c40def f8c56a00 00000000 f602c000 074aa1aa f8c41700 ea2f0a40 Oct 3 15:34:07 localhost kernel: f8c0a745 00000000 f8c41700 d1e69c80 c989e380 f7d4cc00 c2934754 074aa1aa Oct 3 15:34:07 localhost kernel: 00000000 f6555624 074aa1aa f7d4cc00 c017d6bd f6555620 00000000 00000000 Oct 3 15:34:07 localhost kernel: Call Trace: Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3123398/5290900] xfs_iget_core+0x565/0x6b0 [xfs] Oct 3 15:34:07 localhost kernel: [iget_locked+189/256] iget_locked +0xbd/0x100 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3124083/5290900] xfs_iget+0x162/0x1a0 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3252484/5290900] xfs_vget+0x63/0x100 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3331204/5290900] vfs_vget+0x43/0x50 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3329570/5290900] linvfs_get_dentry+0x51/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1536451/5290900] find_exported_dentry+0x42/0x830 [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3170617/5290900] xlog_assign_tail_lsn+0x18/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [alloc_skb+71/240] alloc_skb +0x47/0xf0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_pskb+197/464] sock_alloc_send_pskb+0xc5/0x1d0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_skb+45/64] 
sock_alloc_send_skb+0x2d/0x40 Oct 3 15:34:07 localhost kernel: [ip_append_data+1810/2016] ip_append_data+0x712/0x7e0 Oct 3 15:34:07 localhost kernel: [recalc_task_prio+168/416] recalc_task_prio+0xa8/0x1a0 Oct 3 15:34:07 localhost kernel: [__ip_route_output_key+47/288] __ip_route_output_key+0x2f/0x120 Oct 3 15:34:07 localhost kernel: [udp_sendmsg+831/1888] udp_sendmsg +0x33f/0x760 Oct 3 15:34:07 localhost kernel: [ip_generic_getfrag+0/192] ip_generic_getfrag+0x0/0xc0 Oct 3 15:34:07 localhost kernel: [qdisc_restart+23/560] qdisc_restart +0x17/0x230 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1539451/5290900] export_decode_fh+0x5a/0x7a [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4696349/5290900] fh_verify+0x20c/0x5a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4702954/5290900] nfsd_open+0x39/0x1a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4704974/5290900] nfsd_write+0x5d/0x360 [nfsd] Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+102/784] skb_copy_and_csum_bits+0x66/0x310 Oct 3 15:34:07 localhost kernel: [resched_task+83/144] resched_task +0x53/0x90 Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+556/784] skb_copy_and_csum_bits+0x22c/0x310 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2136279/5290900] skb_read_and_csum_bits+0x46/0x90 [sunrpc] Oct 3 15:34:07 localhost kernel: [kfree_skbmem+36/48] kfree_skbmem +0x24/0x30 Oct 3 15:34:07 localhost kernel: [__kfree_skb+173/336] __kfree_skb +0xad/0x150 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2184090/5290900] xdr_partial_copy_from_skb+0x169/0x180 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2180355/5290900] svcauth_unix_accept+0x272/0x2c0 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4735417/5290900] nfsd3_proc_write+0xb8/0x120 [nfsd] Oct 
3 15:34:07 localhost kernel: [__crc_pm_idle+4688328/5290900] nfsd_dispatch+0xd7/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4688113/5290900] nfsd_dispatch+0x0/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2162754/5290900] svc_process+0x4b1/0x619 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4687545/5290900] nfsd +0x248/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4686961/5290900] nfsd +0x0/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Oct 3 15:34:07 localhost kernel: Code: 0f 0b 6a 00 0f 0e c4 f8 83 c4 10 5b 5e 5f 5d c3 e8 c6 03 66 Oct 3 15:34:07 localhost kernel: <6>note: nfsd[2740] exited with preempt_count 1 Oct 3 15:51:23 localhost kernel: klogd 1.4.1#17, log source = /proc/kmsg started. Oct 3 15:51:23 localhost kernel: Inspecting /boot/System.map-2.6.8-2-686-smp Oct 3 15:51:24 localhost kernel: Loaded 27755 symbols from /boot/System.map-2.6.8-2-686-smp. Oct 3 15:51:24 localhost kernel: Symbols match kernel version 2.6.8. Oct 3 15:51:24 localhost kernel: No module symbols loaded - kernel modules not enabled. Oct 3 15:51:24 localhost kernel: fef0000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfef0000 - 00000000bfefc000 (ACPI data) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfefc000 - 00000000bff00000 (ACPI NVS) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff00000 - 00000000bff80000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff80000 - 00000000c0000000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved) Oct 3 15:51:24 localhost kernel: 2175MB HIGHMEM available. 
Oct 3 15:51:24 localhost kernel: 896MB LOWMEM available. Oct 3 15:51:24 localhost kernel: found SMP MP-table at 000f6810 Oct 3 15:51:24 localhost kernel: On node 0 totalpages: 786304 Oct 3 15:51:24 localhost kernel: DMA zone: 4096 pages, LIFO batch:1 Oct 3 15:51:24 localhost kernel: Normal zone: 225280 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: HighMem zone: 556928 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: DMI present. Thanks, Alberto From owner-xfs@oss.sgi.com Mon May 28 16:06:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 16:06:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4SN6DWt000883 for ; Mon, 28 May 2007 16:06:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA24579; Tue, 29 May 2007 09:06:12 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4SN6AAf102758236; Tue, 29 May 2007 09:06:11 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4SN68TF107812473; Tue, 29 May 2007 09:06:08 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 29 May 2007 09:06:08 +1000 From: David Chinner To: Pallai Roland Cc: David Chinner , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070528230608.GH85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <200705280350.18384.dap@mail.index.hu> <20070528021718.GZ85884050@sgi.com> <200705281317.32384.dap@mail.index.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705281317.32384.dap@mail.index.hu> User-Agent: Mutt/1.4.2.1i 
X-archive-position: 11536
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 04:17:18 David Chinner wrote:
> > Hmmmm. A quick look at the linux code makes me think that background
> > writeback on linux has never been able to cause a shutdown in this case.
> > However, the same error on Irix will definitely cause a shutdown,
> > though....
> I hope Linux will follow Irix, that's a consistent standpoint.

I raised a bug for this yesterday when writing that reply. It won't
get forgotten now....

> David, have you a plan to implement your "reporting raid5 block layer"
> idea? No one else has caring about this silent data loss on temporary
> (cable, power) failed raid5 arrays as I see, I really hope you do at least!

Yeah, I'd love to get something like this happening, but given it's
about half way down my list of "stuff to do when I have some spare
time" I'd say it will be about 2015 before I get to it.....

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 28 16:36:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 16:36:29 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4SNaOWt017600 for ; Mon, 28 May 2007 16:36:26 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA25060; Tue, 29 May 2007 09:36:22 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4SNaKAf108013671; Tue, 29 May 2007 09:36:21 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4SNaH1r108050695; Tue, 29 May 2007 09:36:17 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 29 May 2007 09:36:17 +1000 From: David Chinner To: Pallai Roland Cc: David Chinner , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070528233617.GI85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <200705281453.55618.dap@mail.index.hu> <200705281730.53343.dap@mail.index.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705281730.53343.dap@mail.index.hu> User-Agent: Mutt/1.4.2.1i X-archive-position: 11537 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote: > > On Monday 28 May 2007 14:53:55 Pallai Roland wrote: > > On Friday 25 May 2007 02:05:47 David Chinner wrote: > > > "-o ro,norecovery" will allow you to 
mount the filesystem and get any
> > > uncorrupted data off it.
> > >
> > > You still may get shutdowns if you trip across corrupted metadata in
> > > the filesystem, though.
> >
> > This filesystem is completely dead.
> > [...]
>
> I tried to make a md patch to stop writes if a raid5 array got 2+ failed
> drives, but I found it's already done, oops. :) handle_stripe5() ignores
> writes in this case quietly, I tried and works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to
make the bio return EIO. That looks to be doing the right thing...

> There's an another layer I used on this box between md and xfs: loop-aes. I

Oh, that's a kind of important thing to forget to mention....

> used it since years and rock stable, but now it's my first suspect, cause I
> found a bug in it today:
> I assembled my array from n-1 disks, and I failed a second disk for a test
> and I found /dev/loop1 still provides *random* data where /dev/md1 serves
> nothing, it's definitely a loop-aes bug:

.....

> It's not an explanation to my screwed up file system, but for me it's enough
> to drop loop-aes. Eh.

If you can get random data back instead of an error from the block
device, then I'm not surprised your filesystem is toast. If it's one
sector in a larger block that is corrupted, then the only thing that
will protect you from this sort of corruption causing problems is
metadata checksums (yet another thing on my list of stuff to do).

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 28 20:28:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 20:28:20 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4T3SEWt026582 for ; Mon, 28 May 2007 20:28:16 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA29585; Tue, 29 May 2007 13:28:10 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4T3S7Af108039312; Tue, 29 May 2007 13:28:08 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4T3S3Xi107879094; Tue, 29 May 2007 13:28:03 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Tue, 29 May 2007 13:28:03 +1000 From: David Chinner To: Alberto Alonso Cc: David Chinner , Pallai Roland , Linux-Raid , xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem Message-ID: <20070529032803.GM85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <1180071831.21028.125.camel@w100> <20070525083650.GO85884050@sgi.com> <1180392327.21028.140.camel@w100> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1180392327.21028.140.camel@w100> User-Agent: Mutt/1.4.2.1i X-archive-position: 11538 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote: > On Fri, 2007-05-25 at 18:36 
+1000, David Chinner wrote:
> > On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> > > I think his point was that going into a read only mode causes a
> > > less catastrophic situation (ie. a web server can still serve
> > > pages).
> >
> > Sure - but once you've detected one corruption or had metadata
> > I/O errors, can you trust the rest of the filesystem?
> >
> > > I think that is a valid point, rather than shutting down
> > > the file system completely, an automatic switch to where the least
> > > disruption of service can occur is always desired.
> >
> > I consider the possibility of serving out bad data (i.e after
> > a remount to readonly) to be the worst possible disruption of
> > service that can happen ;)
>
> I guess it does depend on the nature of the failure. A write failure
> on block 2000 does not imply corruption of the other 2TB of data.

The rest might not be corrupted, but if block 2000 is an index of
some sort (i.e. metadata), you could reference any of that 2TB
incorrectly and get the wrong data, write to the wrong spot on disk,
etc.

> > > I personally have found the XFS file system to be great for
> > > my needs (except issues with NFS interaction, where the bug report
> > > never got answered), but that doesn't mean it can not be improved.
> >
> > Got a pointer?
>
> I can't seem to find it. I'm pretty sure I used bugzilla to report
> it. I did find the kernel dump file though, so here it is:
>
> Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> vp/0xd1e69c80, invp/0xc989e380

Oh, I haven't seen any of those problems for quite some time.

> = /proc/kmsg started.
> Oct 3 15:51:23 localhost kernel:
> Inspecting /boot/System.map-2.6.8-2-686-smp

Oh, well, yes, kernels that old did have that problem. It got fixed
some time around 2.6.12 or 2.6.13 IIRC....

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon May 28 20:37:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 28 May 2007 20:37:45 -0700 (PDT) Received: from mail.ggsys.net (mail.ggsys.net [69.26.161.131]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4T3bfWt028113 for ; Mon, 28 May 2007 20:37:42 -0700 Received: (qmail 13974 invoked from network); 29 May 2007 03:37:40 -0000 Received: from cpe-70-112-65-134.austin.res.rr.com (HELO ?192.168.4.12?) (70.112.65.134) by mail.ggsys.net with SMTP; 29 May 2007 03:37:40 -0000 Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem From: Alberto Alonso To: David Chinner Cc: Pallai Roland , Linux-Raid , xfs@oss.sgi.com In-Reply-To: <20070529032803.GM85884050@sgi.com> References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <1180071831.21028.125.camel@w100> <20070525083650.GO85884050@sgi.com> <1180392327.21028.140.camel@w100> <20070529032803.GM85884050@sgi.com> Content-Type: text/plain Organization: Global Gate Systems LLC. Date: Mon, 28 May 2007 22:37:37 -0500 Message-Id: <1180409857.21028.150.camel@w100> Mime-Version: 1.0 X-Mailer: Evolution 2.0.4 Content-Transfer-Encoding: 7bit X-archive-position: 11539 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: alberto@ggsys.net Precedence: bulk X-list: xfs On Tue, 2007-05-29 at 13:28 +1000, David Chinner wrote: > On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote: > > On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote: > > > I consider the possibility of serving out bad data (i.e after > > > a remount to readonly) to be the worst possible disruption of > > > service that can happen ;) > > > > I guess it does depend on the nature of the failure. 
A write failure
> > on block 2000 does not imply corruption of the other 2TB of data.
>
> The rest might not be corrupted, but if block 2000 is an index of
> some sort (i.e. metadata), you could reference any of that 2TB
> incorrectly and get the wrong data, write to the wrong spot on disk,
> etc.

Forgive my ignorance, but if block 2000 is an index, to access the
data that it references you would go through block 2000, which would
return an error without continuing to access any data pointed to by it.
Isn't that how things work?

>
> > > > I personally have found the XFS file system to be great for
> > > > my needs (except issues with NFS interaction, where the bug report
> > > > never got answered), but that doesn't mean it can not be improved.
> > >
> > > Got a pointer?
> >
> > I can't seem to find it. I'm pretty sure I used bugzilla to report
> > it. I did find the kernel dump file though, so here it is:
> >
> > Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> > vp/0xd1e69c80, invp/0xc989e380
>
> Oh, I haven't seen any of those problems for quite some time.
>
> > = /proc/kmsg started.
> > Oct 3 15:51:23 localhost kernel:
> > Inspecting /boot/System.map-2.6.8-2-686-smp
>
> Oh, well, yes, kernels that old did have that problem. It got fixed
> some time around 2.6.12 or 2.6.13 IIRC....

Time for a kernel upgrade then :-)

Thanks for all your enlightenment, I think I am learning quite a few
things.
Alberto

From owner-xfs@oss.sgi.com Tue May 29 11:05:48 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 29 May 2007 11:05:52 -0700 (PDT)
Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4TI5jWt021267 for ; Tue, 29 May 2007 11:05:48 -0700
Received: by lucidpixels.com (Postfix, from userid 1001) id EAA7AB0000F1; Tue, 29 May 2007 14:05:44 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id CF75350001A7; Tue, 29 May 2007 14:05:44 -0400 (EDT)
Date: Tue, 29 May 2007 14:05:44 -0400 (EDT)
From: Justin Piszcz
X-X-Sender: jpiszcz@p34.internal.lan
To: xfs@oss.sgi.com
cc: apiszcz@solarrain.com
Subject: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-archive-position: 11540
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jpiszcz@lucidpixels.com
Precedence: bulk
X-list: xfs

# uname -a
Linux boxname 2.6.21.3 #2 SMP Sun May 27 11:34:21 EDT 2007 i686 GNU/Linux
# xfs_fsr -V
xfs_fsr version 2.2.38
# xfs_db -V
xfs_db version 2.8.18
# xfs_db -c frag -f /dev/md1
actual 449, ideal 403, fragmentation factor 10.24%
# xfs_fsr /dev/md1
/d2 start inode=0
# xfs_db -c frag -f /dev/md1
actual 449, ideal 403, fragmentation factor 10.24%
#

From owner-xfs@oss.sgi.com Tue May 29 17:41:52 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 29 May 2007 17:41:55 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4U0fnWt028343 for ; Tue, 29 May 2007 17:41:51 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA00904; Wed, 30 May 2007 10:41:44 +1000
Received: from snort.melbourne.sgi.com
        (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4U0fhAf108753869; Wed, 30 May 2007 10:41:43 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4U0feNH108959436; Wed, 30 May 2007 10:41:40 +1000 (AEST)
X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f
Date: Wed, 30 May 2007 10:41:40 +1000
From: David Chinner
To: Justin Piszcz
Cc: xfs@oss.sgi.com, apiszcz@solarrain.com
Subject: Re: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume
Message-ID: <20070530004140.GX85884050@sgi.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11541
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Tue, May 29, 2007 at 02:05:44PM -0400, Justin Piszcz wrote:
> # uname -a
> Linux boxname 2.6.21.3 #2 SMP Sun May 27 11:34:21 EDT 2007 i686 GNU/Linux
> # xfs_fsr -V
> xfs_fsr version 2.2.38
> # xfs_db -V
> xfs_db version 2.8.18
> # xfs_db -c frag -f /dev/md1
> actual 449, ideal 403, fragmentation factor 10.24%
> # xfs_fsr /dev/md1
> /d2 start inode=0
> # xfs_db -c frag -f /dev/md1
> actual 449, ideal 403, fragmentation factor 10.24%
> #

Try xfs_db -c "frag -v" -f /dev/md1 to see which inodes are
fragmented, then run xfs_db -c "inode " -c bmap -f /dev/md1 to see
whether it is a sparse file or not....

Remember - xfs_fsr does best effort defrag - if it can't make
progress, it does nothing, and it can't defrag directories...

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue May 29 19:25:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 29 May 2007 19:25:33 -0700 (PDT) Received: from smtp102.sbc.mail.mud.yahoo.com (smtp102.sbc.mail.mud.yahoo.com [68.142.198.201]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4U2PQWt021126 for ; Tue, 29 May 2007 19:25:29 -0700 Received: (qmail 98937 invoked from network); 30 May 2007 02:25:26 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp102.sbc.mail.mud.yahoo.com with SMTP; 30 May 2007 02:25:25 -0000 X-YMail-OSG: xc9Qi3wVM1lNvbzlTqpV8fTrujPhIEQhjD10ibxh3Pf2Ixtw9J3h1.IgmxHZ8nWghvvG0wnZVQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 48B38182612A; Tue, 29 May 2007 19:25:24 -0700 (PDT) Date: Tue, 29 May 2007 19:25:24 -0700 From: Chris Wedgwood To: David Chinner Cc: Justin Piszcz , xfs@oss.sgi.com, apiszcz@solarrain.com Subject: Re: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume Message-ID: <20070530022524.GA20275@tuatara.stupidest.org> References: <20070530004140.GX85884050@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070530004140.GX85884050@sgi.com> X-archive-position: 11542 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 10:41:40AM +1000, David Chinner wrote: > Remember - xfs_fsr does best effort defrag - if it can't make > progress, it does nothing, and it can't defrag directories... Also, it's not going to do much in many cases unless you have free space... if you don't have enough *suitable* free space then it won't be able to do much. 
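[Editorial note] The "frag" figures quoted in this thread (actual 449, ideal 403, fragmentation factor 10.24%) can be checked by hand. The sketch below assumes the factor is the share of extents beyond the ideal count, (actual - ideal) / actual; that formula is an assumption chosen because it reproduces the number shown, and the function names are hypothetical, not part of xfsprogs.

```c
/*
 * Fragmentation factor as a percentage.
 * Assumed formula: (actual - ideal) * 100 / actual.
 */
static double frag_factor_pct(unsigned long actual, unsigned long ideal)
{
	/* Nothing to report when there are no extents or no excess extents. */
	if (actual == 0 || actual <= ideal)
		return 0.0;
	return (double)(actual - ideal) * 100.0 / (double)actual;
}

/*
 * Average number of extents per file, assuming "ideal" means
 * one extent per file that has any extents at all.
 */
static double extents_per_file(unsigned long actual, unsigned long ideal)
{
	if (ideal == 0)
		return 0.0;
	return (double)actual / (double)ideal;
}
```

With the values from the report, frag_factor_pct(449, 403) comes out at about 10.24 and extents_per_file(449, 403) at about 1.114, so a 10% factor here still means that nearly every file is a single extent.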
From owner-xfs@oss.sgi.com Tue May 29 19:41:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 29 May 2007 19:41:17 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4U2fCWt026397 for ; Tue, 29 May 2007 19:41:14 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 2F51718CF884C; Tue, 29 May 2007 21:41:12 -0500 (CDT) Message-ID: <465CE447.1080608@sandeen.net> Date: Tue, 29 May 2007 21:41:11 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Justin Piszcz CC: xfs@oss.sgi.com, apiszcz@solarrain.com Subject: Re: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11543 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Justin Piszcz wrote: > # uname -a > Linux boxname 2.6.21.3 #2 SMP Sun May 27 11:34:21 EDT 2007 i686 GNU/Linux > # xfs_fsr -V > xfs_fsr version 2.2.38 > # xfs_db -V > xfs_db version 2.8.18 > # xfs_db -c frag -f /dev/md1 > actual 449, ideal 403, fragmentation factor 10.24% > # xfs_fsr /dev/md1 > /d2 start inode=0 > # xfs_db -c frag -f /dev/md1 > actual 449, ideal 403, fragmentation factor 10.24% > # So you have on average 1.114 extents per file. Relax. Have a homebrew. 
:) -Eric From owner-xfs@oss.sgi.com Tue May 29 22:24:23 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 29 May 2007 22:24:32 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4U5OJWt003248 for ; Tue, 29 May 2007 22:24:22 -0700 Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.161]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l4U5OHVv016951 for ; Wed, 30 May 2007 14:24:17 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l4U5OH304147 for xfs@oss.sgi.com; Wed, 30 May 2007 14:24:17 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv4.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l4U5OHg04216 for ; Wed, 30 May 2007 14:24:17 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070530.142417.26603304 for ; Wed, 30 May 2007 14:24:17 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Wed May 30 14:24:17 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 891A1AE4B3; Wed, 30 May 2007 14:24:12 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l4U5OGv7003040; Wed, 30 May 2007 14:24:16 +0900 Message-Id: <200705300524.AA05467@TNESG9305.tnes.nec.co.jp> From: Utako Kusaka Date: Wed, 30 May 2007 14:24:17 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_quota man page. 
MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11544 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, This patch includes the following fixes to the xfs_quota man page. 1. Add a description of the print command to USER COMMANDS. 2. Move the description of the path command to ADMINISTRATOR COMMANDS. 3. Fix errors in the options of the project command and in the example use of the xfs_quota project command. 4. Remove 's' (the abbreviation of seconds), which is not supported by the timer command. 5. Precede the `'' (quotation mark) with the escape sequence (\&) in the description of the timer command, because otherwise part of the text is not displayed in the output. -- Utako Kusaka Signed-off-by: Utako Kusaka Signed-off-by: Kouta Ooizumi --- --- xfsprogs-2.8.20/man/man8/xfs_quota.8.orig 2007-04-18 11:43:12.000000000 +0900 +++ xfsprogs-2.8.20/man/man8/xfs_quota.8 2007-05-25 16:09:35.000000000 +0900 @@ -119,11 +119,8 @@ Then after rectifying the quota situatio filesystem it belongs on. .SH USER COMMANDS .TP -\f3path\f1 [ \f2N\f1 ] -Lists all paths with devices/project identifiers or set the current -path to the \f2N\f1th list entry (the current path is used by many -of the commands described here, it identifies the filesystem toward -which a command is directed). +\f3print\f1 +Lists all paths with devices/project identifiers. The path list can come from several places \- the command line, the mount table, and the \f2/etc/projects\f1 file. .TP @@ -227,6 +224,14 @@ filesystems available space; or simply t amount of space used, or number of inodes, within the tree). .SH ADMINISTRATOR COMMANDS .TP +\f3path\f1 [ \f2N\f1 ] +Lists all paths with devices/project identifiers or set the current +path to the \f2N\f1th list entry (the current path is used by many +of the commands described here, it identifies the filesystem toward +which a command is directed). 
+The path list can come from several places \- the command line, +the mount table, and the \f2/etc/projects\f1 file. +.TP \f3report\f1 [ \f2\-gpu\f1 ] [ \f2\-bir\f1 ] [ \f2\-ahnNt\f1 ] Report filesystem quota information. This reports all quota usage for a filesystem, for the specified @@ -257,9 +262,9 @@ Allows the quota enforcement timeout (i. to pass before the soft limits are enforced as the hard limits) to be modified. The current timeout setting can be displayed using the \f3state\f1 command. -The value argument is a number of seconds, but units of 'seconds', -'minutes', 'hours', 'days', and 'weeks' are also understood -(as are their abbreviations, 's', 'm', 'h', 'd', and 'w'). +The value argument is a number of seconds, but units of 'minutes', 'hours', +\&'days', and 'weeks' are also understood +(as are their abbreviations, 'm', 'h', 'd', and 'w'). .TP \f3warn\f1 [ \f2\-gpu\f1 ] [ \f2\-bir\f1 ] value \-d|id|name Allows the quota warnings limit (i.e. the number of times a warning @@ -310,7 +315,7 @@ an entire filesystem and report usage in This command can be used even when filesystem quota are not enabled, as it is a full-filesystem scan (it may also take a long time...). 
.TP -\f3project\f1 [ \f2\-cds\f1 id|name ] +\f3project\f1 [ \f2\-cCs\f1 id|name ] Without arguments, this command lists known project names and identifiers (based on entries in the .I /etc/projects @@ -408,7 +413,7 @@ log file directories to only using 1 gig .nf .sp .8v .in +5 -# mount \-o prjquota /dev/xvm/var /var +# mount \-o prjquota /dev/xvm/var /home # echo 42:/var/log >> /etc/projects # echo logfiles:42 >> /etc/projid # xfs_quota \-x \-c 'project \-s logfiles' /home From owner-xfs@oss.sgi.com Wed May 30 01:45:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 01:45:22 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4U8jGWt031303 for ; Wed, 30 May 2007 01:45:18 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 1AC4EB0000F1; Wed, 30 May 2007 04:45:16 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 1681150001A7; Wed, 30 May 2007 04:45:16 -0400 (EDT) Date: Wed, 30 May 2007 04:45:16 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: David Chinner cc: xfs@oss.sgi.com, apiszcz@solarrain.com Subject: Re: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume In-Reply-To: <20070530004140.GX85884050@sgi.com> Message-ID: References: <20070530004140.GX85884050@sgi.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11545 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs On Wed, 30 May 2007, David Chinner wrote: > On Tue, May 29, 2007 at 02:05:44PM -0400, Justin Piszcz wrote: >> # uname -a >> Linux boxname 2.6.21.3 #2 SMP Sun May 27 11:34:21 EDT 2007 i686 GNU/Linux >> # xfs_fsr -V >> xfs_fsr version 2.2.38 >> # xfs_db -V >> xfs_db version 2.8.18 >> # xfs_db -c frag -f /dev/md1 >> actual 449, ideal 403, 
fragmentation factor 10.24% >> # xfs_fsr /dev/md1 >> /d2 start inode=0 >> # xfs_db -c frag -f /dev/md1 >> actual 449, ideal 403, fragmentation factor 10.24% >> # > > Try xfs_db -c "frag -v" -f /dev/md1 to see which inodes are > fragmented, then run xfs_db -c "inode " -c bmap -f /dev/md1 > to see whether it is a sparse file or not.... > > Remember - xfs_fsr does best effort defrag - if it can't make > progress, it does nothing, and it can't defrag directories... > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > # df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 746G 101G 645G 14% /d2 # xfs_db -c "frag -v" -f /dev/md1 inode 512 actual 0 ideal 0 inode 513 actual 0 ideal 0 inode 514 actual 0 ideal 0 inode 515 actual 3 ideal 1 inode 518 actual 0 ideal 0 inode 146353073 actual 0 ideal 0 inode 146353075 actual 3 ideal 1 inode 146353076 actual 1 ideal 1 .. inode 4160750217 actual 1 ideal 1 inode 4160750218 actual 1 ideal 1 inode 4160750219 actual 1 ideal 1 # xfs_db -c "frag -v" -f /dev/md1 | wc 422 Thoughts? # xfs_db -c "inode 146353073" -c bmap -f /dev/md1 # (no output) # xfs_db -c "inode 4160750219" -c bmap -f /dev/md1 data offset 0 startblock 260410208 (31/363360) count 740 flag 0 Justin. 
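[Editorial note] Dave's suggestion above is to look at the bmap output of a fragmented inode and decide whether the file is sparse. A minimal sketch of that check follows, assuming every extent is printed in the single-line form shown in this message ("data offset 0 startblock 260410208 (31/363360) count 740 flag 0") and that extents are listed in ascending file offset. The struct and function names are hypothetical; other line forms xfs_db may print are not handled, and a hole after the last extent cannot be detected this way.

```c
#include <stddef.h>
#include <stdio.h>

/* One extent as printed by the bmap command; units are filesystem blocks. */
struct bmap_extent {
	unsigned long long offset;     /* file offset of the extent */
	unsigned long long startblock; /* first block on disk */
	unsigned long long count;      /* length of the extent */
	int flag;
};

/* Parse one "data offset ..." line. Returns 0 on success, -1 otherwise. */
static int parse_bmap_line(const char *line, struct bmap_extent *ext)
{
	/* The "(agno/agbno)" pair is matched but not stored. */
	int n = sscanf(line,
		       "data offset %llu startblock %llu (%*u/%*u) count %llu flag %d",
		       &ext->offset, &ext->startblock, &ext->count, &ext->flag);

	return n == 4 ? 0 : -1;
}

/*
 * Returns 1 if there is a gap in file offsets before or between the
 * extents (the file is sparse), 0 if the extents cover the file
 * contiguously from offset 0.
 */
static int has_hole(const struct bmap_extent *exts, size_t n)
{
	unsigned long long expected = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (exts[i].offset != expected)
			return 1;
		expected = exts[i].offset + exts[i].count;
	}
	return 0;
}
```

Applied to the one extent printed for inode 4160750219, has_hole() reports no gap: the file is a single contiguous run of 740 blocks starting at offset 0.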
From owner-xfs@oss.sgi.com Wed May 30 01:45:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 01:45:43 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4U8jcWt031422 for ; Wed, 30 May 2007 01:45:39 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id D0C9AB0000F1; Wed, 30 May 2007 04:45:38 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id CEBD450001A7; Wed, 30 May 2007 04:45:38 -0400 (EDT) Date: Wed, 30 May 2007 04:45:38 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Chris Wedgwood cc: David Chinner , xfs@oss.sgi.com, apiszcz@solarrain.com Subject: Re: xfs_fsr 2.2.38 bug/kernel 2.6.21.3 cannot defrag volume In-Reply-To: <20070530022524.GA20275@tuatara.stupidest.org> Message-ID: References: <20070530004140.GX85884050@sgi.com> <20070530022524.GA20275@tuatara.stupidest.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11546 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs On Tue, 29 May 2007, Chris Wedgwood wrote: > On Wed, May 30, 2007 at 10:41:40AM +1000, David Chinner wrote: > >> Remember - xfs_fsr does best effort defrag - if it can't make >> progress, it does nothing, and it can't defrag directories... > > Also, it's not going to do much in many cases unless you have free > space... if you don't have enough *suitable* free space then it won't > be able to do much. > Currently have 101GB used with 645GB free. Justin. 
From owner-xfs@oss.sgi.com Wed May 30 02:26:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 02:27:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4U9QuWt009244 for ; Wed, 30 May 2007 02:26:58 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id TAA12923; Wed, 30 May 2007 19:26:49 +1000 Date: Wed, 30 May 2007 19:27:14 +1000 From: Timothy Shimmin To: torvalds@linux-foundation.org cc: akpm@osdl.org, xfs@oss.sgi.com Subject: [GIT PULL] xfs bug fix for 2.6.22 Message-ID: <9F557877728BD541CEF0E542@timothy-shimmins-power-mac-g5.local> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11547 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Linus, A bug fix for a regression which went in rc-1 for the NULL files problem. Please pull from the for-linus branch: git pull git://oss.sgi.com:8090/xfs/xfs-2.6 for-linus This will update the following files: fs/xfs/linux-2.6/xfs_aops.c | 26 +++++++++++++++++--------- 1 files changed, 17 insertions(+), 9 deletions(-) through these commits: commit df3c7244264f1d12562413aa32d56be802486516 Author: David Chinner Date: Thu May 24 15:27:03 2007 +1000 [XFS] Write at EOF may not update filesize correctly. The recent fix for preventing NULL files from being left around does not update the file size correctly in all cases. The missing case is a write extending the file that does not need to allocate a block. In that case we used a read mapping of the extent which forced the use of the read I/O completion handler instead of the write I/O completion handler. 
Hence the file size was not updated on I/O completion. SGI-PV: 965068 SGI-Modid: xfs-linux-melb:xfs-kern:28657a Signed-off-by: David Chinner Signed-off-by: Nathan Scott Signed-off-by: Tim Shimmin From owner-xfs@oss.sgi.com Wed May 30 07:55:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 07:55:38 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UEtSWt018588 for ; Wed, 30 May 2007 07:55:30 -0700 Received: from discovery.suse.cz (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id 2C3A96280D3; Wed, 30 May 2007 16:30:44 +0200 (CEST) Received: by discovery.suse.cz (Postfix, from userid 10020) id E6B8C29425; Wed, 30 May 2007 16:30:43 +0200 (CEST) Message-Id: <20070530143043.611931865@suse.cz> References: <20070530125954.706423971@suse.cz> User-Agent: quilt/0.46-36 Date: Wed, 30 May 2007 14:59:56 +0200 From: Michal Marek To: xfs@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [patch 2/3] Fix XFS_IOC_*_TO_HANDLE and XFS_IOC_{OPEN,READLINK}_BY_HANDLE in compat mode Content-Disposition: inline; filename=xfs-compat-ioctl-fshandle.patch X-archive-position: 11550 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs 32bit struct xfs_fsop_handlereq has different size and offsets (due to pointers). TODO: case XFS_IOC_{FSSETDM,ATTRLIST,ATTRMULTI}_BY_HANDLE still not handled. 
Signed-off-by: Michal Marek --- fs/xfs/linux-2.6/xfs_ioctl32.c | 63 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 58 insertions(+), 5 deletions(-) --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c @@ -139,6 +139,44 @@ xfs_ioctl32_bulkstat( } #endif +typedef struct xfs_fsop_handlereq32 { + __u32 fd; /* fd for FD_TO_HANDLE */ + compat_uptr_t path; /* user pathname */ + __u32 oflags; /* open flags */ + compat_uptr_t ihandle; /* user supplied handle */ + __u32 ihandlen; /* user supplied length */ + compat_uptr_t ohandle; /* user buffer for handle */ + compat_uptr_t ohandlen; /* user buffer length */ +} xfs_fsop_handlereq32_t; +#define XFS_IOC_PATH_TO_FSHANDLE_32 _IOWR('X', 104, struct xfs_fsop_handlereq32) +#define XFS_IOC_PATH_TO_HANDLE_32 _IOWR('X', 105, struct xfs_fsop_handlereq32) +#define XFS_IOC_FD_TO_HANDLE_32 _IOWR('X', 106, struct xfs_fsop_handlereq32) +#define XFS_IOC_OPEN_BY_HANDLE_32 _IOWR('X', 107, struct xfs_fsop_handlereq32) +#define XFS_IOC_READLINK_BY_HANDLE_32 _IOWR('X', 108, struct xfs_fsop_handlereq32) + +STATIC unsigned long xfs_ioctl32_fshandle(unsigned long arg) +{ + xfs_fsop_handlereq32_t __user *p32 = (void __user *)arg; + xfs_fsop_handlereq_t __user *p = compat_alloc_user_space(sizeof(*p)); + u32 addr; + + if (copy_in_user(&p->fd, &p32->fd, sizeof(__u32)) || + get_user(addr, &p32->path) || + put_user(compat_ptr(addr), &p->path) || + copy_in_user(&p->oflags, &p32->oflags, sizeof(__u32)) || + get_user(addr, &p32->ihandle) || + put_user(compat_ptr(addr), &p->ihandle) || + copy_in_user(&p->ihandlen, &p32->ihandlen, sizeof(__u32)) || + get_user(addr, &p32->ohandle) || + put_user(compat_ptr(addr), &p->ohandle) || + get_user(addr, &p32->ohandlen) || + put_user(compat_ptr(addr), &p->ohandlen)) + return -EFAULT; + + return (unsigned long)p; +} + + STATIC long xfs_compat_ioctl( int mode, @@ -164,12 +202,7 @@ xfs_compat_ioctl( case XFS_IOC_GETBMAPA: case XFS_IOC_GETBMAPX: /* not handled - case 
XFS_IOC_FD_TO_HANDLE: - case XFS_IOC_PATH_TO_HANDLE: - case XFS_IOC_PATH_TO_FSHANDLE: - case XFS_IOC_OPEN_BY_HANDLE: case XFS_IOC_FSSETDM_BY_HANDLE: - case XFS_IOC_READLINK_BY_HANDLE: case XFS_IOC_ATTRLIST_BY_HANDLE: case XFS_IOC_ATTRMULTI_BY_HANDLE: */ @@ -226,6 +259,26 @@ xfs_compat_ioctl( arg = xfs_ioctl32_bulkstat(arg); break; #endif + case XFS_IOC_FD_TO_HANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = XFS_IOC_FD_TO_HANDLE; + break; + case XFS_IOC_PATH_TO_HANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = XFS_IOC_PATH_TO_HANDLE; + break; + case XFS_IOC_PATH_TO_FSHANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = XFS_IOC_PATH_TO_FSHANDLE; + break; + case XFS_IOC_OPEN_BY_HANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = XFS_IOC_OPEN_BY_HANDLE; + break; + case XFS_IOC_READLINK_BY_HANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = XFS_IOC_READLINK_BY_HANDLE; + break; default: return -ENOIOCTLCMD; } -- From owner-xfs@oss.sgi.com Wed May 30 07:55:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 07:55:37 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UEtTWt018599 for ; Wed, 30 May 2007 07:55:30 -0700 Received: from discovery.suse.cz (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id 54ACA6280C3; Wed, 30 May 2007 16:30:43 +0200 (CEST) Received: by discovery.suse.cz (Postfix, from userid 10020) id 01F6229425; Wed, 30 May 2007 16:30:42 +0200 (CEST) Message-Id: <20070530125954.706423971@suse.cz> User-Agent: quilt/0.46-36 Date: Wed, 30 May 2007 14:59:54 +0200 From: Michal Marek To: xfs@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [patch 0/3] Fix for XFS compat ioctls X-archive-position: 11548 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs Hi, it looks like the XFS compat ioctl interface 
(fs/xfs/linux-2.6/xfs_ioctl32.c) is quite incomplete. The attached patches fix some ioctls to make at least xfsdump work. Tested on x86_64 with an i386 xfsdump binary, I'll test on ppc64 later. -- have a nice day, Michal Marek From owner-xfs@oss.sgi.com Wed May 30 07:55:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 07:55:38 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UEtTWt018616 for ; Wed, 30 May 2007 07:55:30 -0700 Received: from discovery.suse.cz (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id BF2F36280DC; Wed, 30 May 2007 16:30:44 +0200 (CEST) Received: by discovery.suse.cz (Postfix, from userid 10020) id 4C95029425; Wed, 30 May 2007 16:30:44 +0200 (CEST) Message-Id: <20070530143044.060544510@suse.cz> References: <20070530125954.706423971@suse.cz> User-Agent: quilt/0.46-36 Date: Wed, 30 May 2007 14:59:57 +0200 From: Michal Marek To: xfs@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [patch 3/3] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} and XFS_IOC_FSINUMBERS in compat mode Content-Disposition: inline; filename=xfs-compat-ioctl-bulkstat.patch X-archive-position: 11549 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs * 32bit struct xfs_fsop_bulkreq has different size and layout of members, no matter the alignment. Move the code out of the #else branch (why was it there in the first place?). Define _32 variants of the ioctl constants. * struct xfs_bstat is different on 32bit (because of time_t, and on i386 also because of different padding). Create a new function xfs_ioctl32_bulkstat_wrap(), which allocates extra ->ubuffer and converts the elements to the 32bit format after the original ioctl returns. Same for i386 struct xfs_inogrp. 
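[Editorial note] The padding difference described above can be illustrated in user space. The sketch below is not the kernel code: it uses hypothetical struct names with the same three member types as the xfs_inogrp32 definition in the patch, and relies on the GCC/Clang packed attribute. On common 64-bit ABIs the natural layout pads after the 32-bit member so the following 64-bit member is 8-byte aligned, while the packed variant has the 20-byte layout an i386 caller would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler is free to pad after alloccount. */
struct inogrp_native {
	uint64_t startino;   /* starting inode number */
	int32_t  alloccount; /* number of bits set in allocmask */
	uint64_t allocmask;  /* mask of allocated inodes */
};

/* Packed layout: no padding, so allocmask sits at byte offset 12. */
struct inogrp_packed {
	uint64_t startino;
	int32_t  alloccount;
	uint64_t allocmask;
} __attribute__((packed));

/*
 * Convert one record member by member. A plain memcpy of the whole
 * struct would be wrong whenever the two layouts differ.
 */
static void inogrp_to_packed(struct inogrp_packed *dst,
			     const struct inogrp_native *src)
{
	dst->startino = src->startino;
	dst->alloccount = src->alloccount;
	dst->allocmask = src->allocmask;
}
```

This is why the wrapper converts each element after the native ioctl returns rather than handing the 32-bit buffer straight to the 64-bit code: an array of the 20-byte records and an array of the padded records place every element after the first at a different offset.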
Signed-off-by: Michal Marek --- fs/xfs/linux-2.6/xfs_ioctl32.c | 262 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 238 insertions(+), 24 deletions(-) --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c @@ -109,35 +109,249 @@ STATIC unsigned long xfs_ioctl32_geom_v1 return (unsigned long)p; } -#else +typedef struct xfs_inogrp32 { + __u64 xi_startino; /* starting inode number */ + __s32 xi_alloccount; /* # bits set in allocmask */ + __u64 xi_allocmask; /* mask of allocated inodes */ +} __attribute__((packed)) xfs_inogrp32_t; + +STATIC int xfs_inogrp_store_compat( + xfs_inogrp32_t __user *p32, + xfs_inogrp_t __user *p) +{ +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) + if (copy(xi_startino) || + copy(xi_alloccount) || + copy(xi_allocmask)) + return -EFAULT; + return 0; +#undef copy +} + +#endif + +/* XFS_IOC_FSBULKSTAT and friends */ + +typedef struct xfs_bstime32 { + __s32 tv_sec; /* seconds */ + __s32 tv_nsec; /* and nanoseconds */ +} xfs_bstime32_t; + +static int xfs_bstime_store_compat( + xfs_bstime32_t __user *p32, + xfs_bstime_t __user *p) +{ + time_t sec; + __s32 sec32; + + if (get_user(sec, &p->tv_sec)) + return -EFAULT; + sec32 = sec; + if (put_user(sec32, &p32->tv_sec) || + copy_in_user(&p32->tv_nsec, &p->tv_nsec, sizeof(__s32))) + return -EFAULT; + return 0; +} + +typedef struct xfs_bstat32 { + __u64 bs_ino; /* inode number */ + __u16 bs_mode; /* type and mode */ + __u16 bs_nlink; /* number of links */ + __u32 bs_uid; /* user id */ + __u32 bs_gid; /* group id */ + __u32 bs_rdev; /* device value */ + __s32 bs_blksize; /* block size */ + __s64 bs_size; /* file size */ + xfs_bstime32_t bs_atime; /* access time */ + xfs_bstime32_t bs_mtime; /* modify time */ + xfs_bstime32_t bs_ctime; /* inode change time */ + int64_t bs_blocks; /* number of blocks */ + __u32 bs_xflags; /* extended flags */ + __s32 bs_extsize; /* extent size */ + __s32 bs_extents; /* number of extents */ + __u32 
bs_gen; /* generation count */ + __u16 bs_projid; /* project id */ + unsigned char bs_pad[14]; /* pad space, unused */ + __u32 bs_dmevmask; /* DMIG event mask */ + __u16 bs_dmstate; /* DMIG state info */ + __u16 bs_aextents; /* attribute number of extents */ +} +#ifdef BROKEN_X86_ALIGNMENT + __attribute__((packed)) +#endif + xfs_bstat32_t; + +static int xfs_bstat_store_compat( + xfs_bstat32_t __user *p32, + xfs_bstat_t __user *p) +{ +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) + if (copy(bs_ino) || + copy(bs_mode) || + copy(bs_nlink) || + copy(bs_uid) || + copy(bs_gid) || + copy(bs_rdev) || + copy(bs_blksize) || + copy(bs_size) || + xfs_bstime_store_compat(&p32->bs_atime, &p->bs_atime) || + xfs_bstime_store_compat(&p32->bs_mtime, &p->bs_mtime) || + xfs_bstime_store_compat(&p32->bs_ctime, &p->bs_ctime) || + copy(bs_blocks) || + copy(bs_xflags) || + copy(bs_extsize) || + copy(bs_extents) || + copy(bs_gen) || + copy(bs_projid) || + copy(bs_pad[14]) || + copy(bs_dmevmask) || + copy(bs_dmstate) || + copy(bs_aextents)) + return -EFAULT; + return 0; +#undef copy +} typedef struct xfs_fsop_bulkreq32 { compat_uptr_t lastip; /* last inode # pointer */ __s32 icount; /* count of entries in buffer */ compat_uptr_t ubuffer; /* user buffer for inode desc. 
*/ - __s32 ocount; /* output count pointer */ + compat_uptr_t ocount; /* output count pointer */ } xfs_fsop_bulkreq32_t; - -STATIC unsigned long -xfs_ioctl32_bulkstat( - unsigned long arg) +#define XFS_IOC_FSBULKSTAT_32 _IOWR('X', 101, struct xfs_fsop_bulkreq32) +#define XFS_IOC_FSBULKSTAT_SINGLE_32 _IOWR('X', 102, struct xfs_fsop_bulkreq32) +#define XFS_IOC_FSINUMBERS_32 _IOWR('X', 103, struct xfs_fsop_bulkreq32) + +#define MAX_BSTAT_LEN \ + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_bstat_t))) +#define MAX_INOGRP_LEN \ + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_inogrp_t))) + +STATIC int +xfs_ioctl32_bulkstat_wrap( + bhv_vnode_t *vp, + struct inode *inode, + struct file *file, + int mode, + unsigned cmd, + unsigned long arg) { - xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; - xfs_fsop_bulkreq_t __user *p = compat_alloc_user_space(sizeof(*p)); - u32 addr; - - if (get_user(addr, &p32->lastip) || - put_user(compat_ptr(addr), &p->lastip) || - copy_in_user(&p->icount, &p32->icount, sizeof(s32)) || - get_user(addr, &p32->ubuffer) || - put_user(compat_ptr(addr), &p->ubuffer) || - get_user(addr, &p32->ocount) || - put_user(compat_ptr(addr), &p->ocount)) + xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; + xfs_fsop_bulkreq_t tmp; + u32 addr; + void *buf32; + int err; + + if (get_user(addr, &p32->lastip)) + return 0; + tmp.lastip = compat_ptr(addr); + if (get_user(tmp.icount, &p32->icount) || + get_user(addr, &p32->ubuffer)) return -EFAULT; + buf32 = compat_ptr(addr); + if (get_user(addr, &p32->ocount)) + return -EFAULT; + tmp.ocount = compat_ptr(addr); - return (unsigned long)p; -} + if (tmp.icount <= 0) + return -EINVAL; + + if (cmd == XFS_IOC_FSBULKSTAT_32) + cmd = XFS_IOC_FSBULKSTAT; + if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) + cmd = XFS_IOC_FSBULKSTAT_SINGLE; + if (cmd == XFS_IOC_FSINUMBERS_32) + cmd = XFS_IOC_FSINUMBERS; + + if (cmd == XFS_IOC_FSBULKSTAT || cmd == XFS_IOC_FSBULKSTAT_SINGLE) { + xfs_fsop_bulkreq_t 
__user *p; + xfs_bstat_t __user *bs; + xfs_bstat32_t __user *bs32 = buf32; + __s32 total_ocount = 0; + p = compat_alloc_user_space(sizeof(*p) + + sizeof(xfs_bstat_t) * min(MAX_BSTAT_LEN, tmp.icount)); + bs = (xfs_bstat_t __user *)(p + 1); + tmp.ubuffer = bs; + if (copy_to_user(p, &tmp, sizeof(*p))) + return -EFAULT; + while (tmp.icount) { + __s32 icount = min(MAX_BSTAT_LEN, tmp.icount); + __s32 ocount; + int i; + + if (put_user(icount, &p->icount)) + return -EFAULT; + err = bhv_vop_ioctl(vp, inode, file, mode, cmd, + (void __user *)p); + if (err) + return err; + if (get_user(ocount, p->ocount)) + return -EFAULT; + for (i = 0; i < ocount; i++) { + if (xfs_bstat_store_compat(bs32 + + total_ocount + i, bs + i)) + return -EFAULT; + } + total_ocount += ocount; + tmp.icount -= icount; + } + if (put_user(total_ocount, p->ocount)) + return -EFAULT; + return 0; + } + if (cmd == XFS_IOC_FSINUMBERS) { +#ifdef BROKEN_X86_ALIGNMENT + xfs_fsop_bulkreq_t __user *p; + xfs_inogrp_t __user *ig; + xfs_inogrp32_t __user *ig32 = buf32; + __s32 total_ocount = 0; + p = compat_alloc_user_space(sizeof(*p) + + sizeof(xfs_inogrp_t) * min(MAX_INOGRP_LEN, tmp.icount)); + ig = (xfs_inogrp_t __user *)(p + 1); + tmp.ubuffer = ig; + if (copy_to_user(p, &tmp, sizeof(*p))) + return -EFAULT; + + while (tmp.icount) { + __s32 icount = min(MAX_INOGRP_LEN, tmp.icount); + __s32 ocount; + int i; + + if (put_user(icount, &p->icount)) + return -EFAULT; + err = bhv_vop_ioctl(vp, inode, file, mode, cmd, + (void __user *)p); + if (err) + return err; + if (get_user(ocount, p->ocount)) + return -EFAULT; + for (i = 0; i < ocount; i++) { + if (xfs_inogrp_store_compat(ig32 + + total_ocount + i, ig + i)) + return -EFAULT; + } + tmp.icount -= icount; + total_ocount += ocount; + } + if (put_user(total_ocount, p->ocount)) + return -EFAULT; +#else + xfs_fsop_bulkreq_t __user *p; + p = compat_alloc_user_space(sizeof(*p)); + tmp.ubuffer = buf32; + if (copy_to_user(p, &tmp, sizeof(*p))) + return -EFAULT; + + err = 
bhv_vop_ioctl(vp, inode, file, mode, cmd, + (void __user *)p); + if (err) + return err; #endif + return 0; + } + return -ENOSYS; +} + typedef struct xfs_fsop_handlereq32 { __u32 fd; /* fd for FD_TO_HANDLE */ @@ -253,12 +467,12 @@ xfs_compat_ioctl( case XFS_IOC_SWAPEXT: break; - case XFS_IOC_FSBULKSTAT_SINGLE: - case XFS_IOC_FSBULKSTAT: - case XFS_IOC_FSINUMBERS: - arg = xfs_ioctl32_bulkstat(arg); - break; #endif + case XFS_IOC_FSBULKSTAT_32: + case XFS_IOC_FSBULKSTAT_SINGLE_32: + case XFS_IOC_FSINUMBERS_32: + return xfs_ioctl32_bulkstat_wrap(vp, inode, file, mode, cmd, + arg); case XFS_IOC_FD_TO_HANDLE_32: arg = xfs_ioctl32_fshandle(arg); cmd = XFS_IOC_FD_TO_HANDLE; -- From owner-xfs@oss.sgi.com Wed May 30 07:55:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 07:55:39 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UEtTWt018619 for ; Wed, 30 May 2007 07:55:31 -0700 Received: from discovery.suse.cz (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id CEF286280C7; Wed, 30 May 2007 16:30:43 +0200 (CEST) Received: by discovery.suse.cz (Postfix, from userid 10020) id 7929529425; Wed, 30 May 2007 16:30:43 +0200 (CEST) Message-Id: <20070530143043.216024061@suse.cz> References: <20070530125954.706423971@suse.cz> User-Agent: quilt/0.46-36 Date: Wed, 30 May 2007 14:59:55 +0200 From: Michal Marek To: xfs@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Content-Disposition: inline; filename=xfs-compat-ioctl-fsgeometry.patch X-archive-position: 11551 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs i386 struct xfs_fsop_geom_v1 has no padding after the last member, so the size is different. 
Signed-off-by: Michal Marek --- fs/xfs/linux-2.6/xfs_ioctl32.c | 40 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 39 insertions(+), 1 deletion(-) --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c @@ -75,6 +75,40 @@ xfs_ioctl32_flock( return (unsigned long)p; } +typedef struct xfs_fsop_geom_v132 { + __u32 blocksize; /* filesystem (data) block size */ + __u32 rtextsize; /* realtime extent size */ + __u32 agblocks; /* fsblocks in an AG */ + __u32 agcount; /* number of allocation groups */ + __u32 logblocks; /* fsblocks in the log */ + __u32 sectsize; /* (data) sector size, bytes */ + __u32 inodesize; /* inode size in bytes */ + __u32 imaxpct; /* max allowed inode space(%) */ + __u64 datablocks; /* fsblocks in data subvolume */ + __u64 rtblocks; /* fsblocks in realtime subvol */ + __u64 rtextents; /* rt extents in realtime subvol*/ + __u64 logstart; /* starting fsblock of the log */ + unsigned char uuid[16]; /* unique id of the filesystem */ + __u32 sunit; /* stripe unit, fsblocks */ + __u32 swidth; /* stripe width, fsblocks */ + __s32 version; /* structure version */ + __u32 flags; /* superblock version flags */ + __u32 logsectsize; /* log sector size, bytes */ + __u32 rtsectsize; /* realtime sector size, bytes */ + __u32 dirblocksize; /* directory block size, bytes */ +} __attribute__((packed)) xfs_fsop_geom_v132_t; +#define XFS_IOC_FSGEOMETRY_V1_32 _IOR ('X', 100, struct xfs_fsop_geom_v132) + +STATIC unsigned long xfs_ioctl32_geom_v1(unsigned long arg) +{ + xfs_fsop_geom_v132_t __user *p32 = (void __user *)arg; + xfs_fsop_geom_v1_t __user *p = compat_alloc_user_space(sizeof(*p)); + + if (copy_in_user(p, p32, sizeof(*p32))) + return -EFAULT; + return (unsigned long)p; +} + #else typedef struct xfs_fsop_bulkreq32 { @@ -118,7 +152,6 @@ xfs_compat_ioctl( switch (cmd) { case XFS_IOC_DIOINFO: - case XFS_IOC_FSGEOMETRY_V1: case XFS_IOC_FSGEOMETRY: case XFS_IOC_GETVERSION: case XFS_IOC_GETXFLAGS: @@ -166,6 +199,10 @@ 
xfs_compat_ioctl( arg = xfs_ioctl32_flock(arg); cmd = _NATIVE_IOC(cmd, struct xfs_flock64); break; + case XFS_IOC_FSGEOMETRY_V1_32: + arg = xfs_ioctl32_geom_v1(arg); + cmd = XFS_IOC_FSGEOMETRY_V1; + break; #else /* These are handled fine if no alignment issues */ case XFS_IOC_ALLOCSP: @@ -176,6 +213,7 @@ xfs_compat_ioctl( case XFS_IOC_FREESP64: case XFS_IOC_RESVSP64: case XFS_IOC_UNRESVSP64: + case XFS_IOC_FSGEOMETRY_V1: break; /* xfs_bstat_t still has wrong u32 vs u64 alignment */ -- From owner-xfs@oss.sgi.com Wed May 30 09:11:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 09:11:54 -0700 (PDT) Received: from mail.g-house.de (ns2.g-housing.de [81.169.133.75]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UGBmWt011524 for ; Wed, 30 May 2007 09:11:49 -0700 Received: from [82.41.246.210] (helo=[10.0.0.30]) by mail.g-house.de with esmtpsa (TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1HtQm1-0004z1-Ul; Wed, 30 May 2007 18:11:46 +0200 Date: Wed, 30 May 2007 17:11:42 +0100 (BST) From: Christian Kujau X-X-Sender: evil@sheep.housecafe.de To: Pallai Roland cc: xfs@oss.sgi.com Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem In-Reply-To: <200705281730.53343.dap@mail.index.hu> Message-ID: References: <200705241318.30711.dap@mail.index.hu> <20070525000547.GH85884050@sgi.com> <200705281453.55618.dap@mail.index.hu> <200705281730.53343.dap@mail.index.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=us-ascii X-archive-position: 11552 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lists@nerdbynature.de Precedence: bulk X-list: xfs On Mon, 28 May 2007, Pallai Roland wrote: > /dev/loop1: [0700]:180907 (/dev/md1) encryption=AES128 multi-key-v3 > hq:~# dd if=/dev/md1 bs=1k count=128 skip=128 >/dev/null > dd: reading `/dev/md1': Input/output error > 0+0 records in > 0+0 records out > hq:~# dd if=/dev/loop1 bs=1k count=128 
skip=128 | md5sum > 128+0 records in > 128+0 records out > 131072 bytes (131 kB) copied, 0.027775 seconds, 4.7 MB/s > e2548a924a0e835bb45fb50058acba98 - (!!!) > hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum > 128+0 records in > 128+0 records out > 131072 bytes (131 kB) copied, 0.030311 seconds, 4.3 MB/s > c6a23412fb75eb5a7eb1d6a7813eb86b - (!!!) Hm, looks nasty. Maybe you should send this to linux-crypto@nl.linux.org, where many loop-aes folks are hanging around? C. -- BOFH excuse #197: I'm sorry a pentium won't do, you need an SGI to connect with us. From owner-xfs@oss.sgi.com Wed May 30 09:49:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 09:49:40 -0700 (PDT) Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UGncWt027407 for ; Wed, 30 May 2007 09:49:38 -0700 Received: (from root@localhost) by oss.sgi.com (8.12.10/8.12.10/Submit) id l4UGnckA027406; Wed, 30 May 2007 09:49:38 -0700 Date: Wed, 30 May 2007 09:49:38 -0700 Message-Id: <200705301649.l4UGnckA027406@oss.sgi.com> From: "Michael Nishimoto" To: Subject: Reducing memory requirements for high extent xfs files X-archive-position: 11553 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: miken@stanfordalumni.org Precedence: bulk X-list: xfs Hello, Has anyone done any work or had thoughts on changes required to reduce the total memory footprint of high extent xfs files? Obviously, it is important to reduce fragmentation as files are generated and to regularly defrag files, but both of these alternatives are not complete solutions. To reduce memory consumption, xfs could bring in extents from disk as needed (or just before needed) and could free up mappings when certain extent ranges have not been recently accessed. A solution should become more aggressive about reclaiming extent mapping memory as free memory becomes limited. 
Michael ____________________________________________________________________ From owner-xfs@oss.sgi.com Wed May 30 10:32:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 10:32:17 -0700 (PDT) Received: from smtp104.sbc.mail.re2.yahoo.com (smtp104.sbc.mail.re2.yahoo.com [68.142.229.101]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4UHWBWt010414 for ; Wed, 30 May 2007 10:32:13 -0700 Received: (qmail 69932 invoked from network); 30 May 2007 17:05:32 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp104.sbc.mail.re2.yahoo.com with SMTP; 30 May 2007 17:05:32 -0000 X-YMail-OSG: jkGYUiQVM1lMn5GuDgSvymRfwhOO_t3PbokgcLZNVXTcBoMgIThJ.KXeatLFP8vz3flUIGfczw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 4981A182612A; Wed, 30 May 2007 10:05:30 -0700 (PDT) Date: Wed, 30 May 2007 10:05:30 -0700 From: Chris Wedgwood To: Michal Marek Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Message-ID: <20070530170530.GA4197@tuatara.stupidest.org> References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070530143043.216024061@suse.cz> X-archive-position: 11554 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 02:59:55PM +0200, Michal Marek wrote: > +typedef struct xfs_fsop_geom_v132 { wouldn't xfs_fsop_geom_v1_32 ^ > + __u32 blocksize; /* filesystem (data) block size */ [...] > + __u32 dirblocksize; /* directory block size, bytes */ > +} __attribute__((packed)) xfs_fsop_geom_v132_t; and xfs_fsop_geom_v1_32_t ^ read better there? 
From owner-xfs@oss.sgi.com Wed May 30 15:02:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 15:02:41 -0700 (PDT) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.186]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4UM2aWt008237 for ; Wed, 30 May 2007 15:02:37 -0700 Received: from [87.180.187.44] (helo=noname) by mrelayeu.kundenserver.de (node=mrelayeu8) with ESMTP (Nemesis), id 0ML31I-1HtW2L1qqe-00045G; Wed, 30 May 2007 23:48:58 +0200 From: Arnd Bergmann To: Chris Wedgwood Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Date: Wed, 30 May 2007 23:48:53 +0200 User-Agent: KMail/1.9.6 Cc: Michal Marek , xfs@oss.sgi.com, linux-kernel@vger.kernel.org References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> <20070530170530.GA4197@tuatara.stupidest.org> In-Reply-To: <20070530170530.GA4197@tuatara.stupidest.org> X-Face: >j"dOR3XO=^3iw?0`(E1wZ/&le9!.ok[JrI=S~VlsF~}"P\+jx.GT@=?utf-8?q?=0A=09-oaEG?=,9Ba>v;3>:kcw#yO5?B:l{(Ln.2)=?utf-8?q?=27=7Dfw07+4-=26=5E=7CScOpE=3F=5D=5EXdv=5B/zWkA7=60=25M!DxZ=0A=09?= =?utf-8?q?8MJ=2EU5?="hi+2yT(k`PF~Zt;tfT,i,JXf=x@eLP{7B:"GyA\=UnN) =?utf-8?q?=26=26qdaA=3A=7D-Y*=7D=3A3YvzV9=0A=09=7E=273a=7E7I=7CWQ=5D?=<50*%U-6Ewmxfzdn/CK_E/ouMU(r?FAQG/ev^JyuX.%(By`" =?utf-8?q?L=5F=0A=09H=3Dbj?=)"y7*XOqz|SS"mrZ$`Q_syCd MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200705302348.54259.arnd@arndb.de> X-Provags-ID: V01U2FsdGVkX18LiNsCkpbVE1/CP2zcJ0EUExEA8iG02Fm1wzs de6uNuFZOz21y391AD1MXyinJTz8tvflD2FRzkFa5oP8yru6OZ ZOaqp/3cyeHll2P6uUjoQ== Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l4UM2cWt008241 X-archive-position: 11555 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: arnd@arndb.de Precedence: bulk X-list: xfs On Wednesday 30 May 2007, Chris Wedgwood wrote: > > On Wed, May 
30, 2007 at 02:59:55PM +0200, Michal Marek wrote: > > > +typedef struct xfs_fsop_geom_v132 { > > wouldn't xfs_fsop_geom_v1_32 >                          ^ > > > +     __u32           blocksize;      /* filesystem (data) block size */ > > [...] > > > +     __u32           dirblocksize;   /* directory block size, bytes  */ > > +} __attribute__((packed)) xfs_fsop_geom_v132_t; > > and xfs_fsop_geom_v1_32_t >                     ^ > > read better there? Actually, the current convention would be struct compat_xfs_fsop_geom_v1 and compat_xfs_fsop_geom_v1_t. Arnd <>< From owner-xfs@oss.sgi.com Wed May 30 15:55:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 15:55:30 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4UMtOWt020004 for ; Wed, 30 May 2007 15:55:26 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA03526; Thu, 31 May 2007 08:55:19 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4UMtIAf109895803; Thu, 31 May 2007 08:55:19 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4UMtG6n107911823; Thu, 31 May 2007 08:55:16 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 08:55:16 +1000 From: David Chinner To: Michael Nishimoto Cc: xfs@oss.sgi.com Subject: Re: Reducing memory requirements for high extent xfs files Message-ID: <20070530225516.GB85884050@sgi.com> References: <200705301649.l4UGnckA027406@oss.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200705301649.l4UGnckA027406@oss.sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11556 X-ecartis-version: 
Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 09:49:38AM -0700, Michael Nishimoto wrote: > Hello, > > Has anyone done any work or had thoughts on changes required > to reduce the total memory footprint of high extent xfs files? We changed the way we do memory allocation to avoid needing large contiguous chunks of memory a bit over a year ago; that solved the main OOM problem we were getting reported with highly fragmented files. > Obviously, it is important to reduce fragmentation as files > are generated and to regularly defrag files, but both of these > alternatives are not complete solutions. > > To reduce memory consumption, xfs could bring in extents > from disk as needed (or just before needed) and could free > up mappings when certain extent ranges have not been recently > accessed. A solution should become more aggressive about > reclaiming extent mapping memory as free memory becomes limited. Yes, it could, but that's a pretty major overhaul of the extent interface which currently assumes everywhere that the entire extent tree is in core. Can you describe the problem you are seeing that leads you to ask this question? What's the problem you need to solve? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 30 19:30:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 19:30:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4V2UhWt007569 for ; Wed, 30 May 2007 19:30:46 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA08845; Thu, 31 May 2007 12:30:37 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4V2UXAf102687516; Thu, 31 May 2007 12:30:35 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4V2UVb1106115525; Thu, 31 May 2007 12:30:31 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 12:30:31 +1000 From: David Chinner To: Michal Marek Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Message-ID: <20070531023031.GH85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070530143043.216024061@suse.cz> User-Agent: Mutt/1.4.2.1i X-archive-position: 11557 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 02:59:55PM +0200, Michal Marek wrote: > i386 struct xfs_fsop_geom_v1 has no padding after the last member, so > the size is different. That's a pain - it's kind of clunky having to redefine the entire structure just to pack it differently. Oh well, not much that we can do about it...
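For anyone following along, the size difference can be reproduced in userspace. The following is only a rough sketch, not part of the patch: the demo_* names are made up for illustration, the member list is copied from the structure quoted in the patch, and it assumes a GCC-compatible compiler for the packed attribute. On an ABI that aligns 64-bit integers to 8 bytes (x86-64, for example) the unpacked structure gains tail padding, while i386 aligns them to 4 bytes and does not pad.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>	/* _IOR() */

/* Member sequence copied from xfs_fsop_geom_v1 as quoted in the patch. */
#define DEMO_GEOM_V1_MEMBERS \
	uint32_t blocksize, rtextsize, agblocks, agcount; \
	uint32_t logblocks, sectsize, inodesize, imaxpct; \
	uint64_t datablocks, rtblocks, rtextents, logstart; \
	unsigned char uuid[16]; \
	uint32_t sunit, swidth; \
	int32_t version; \
	uint32_t flags, logsectsize, rtsectsize, dirblocksize;

/* Hypothetical stand-in for the native layout. */
struct demo_geom_v1 { DEMO_GEOM_V1_MEMBERS };

/* Hypothetical stand-in for the i386 layout: no padding after the last member. */
struct demo_geom_v1_32 { DEMO_GEOM_V1_MEMBERS } __attribute__((packed));

/* The structure size is encoded into the ioctl command number. */
#define DEMO_IOC_GEOM_V1	_IOR('X', 100, struct demo_geom_v1)
#define DEMO_IOC_GEOM_V1_32	_IOR('X', 100, struct demo_geom_v1_32)

void demo_print_layouts(void)
{
	/* 8*4 + 4*8 + 16 + 7*4 = 108 bytes of actual data */
	assert(sizeof(struct demo_geom_v1_32) == 108);

	printf("native: size %zu, cmd %#lx\n",
	       sizeof(struct demo_geom_v1), (unsigned long)DEMO_IOC_GEOM_V1);
	printf("packed: size %zu, cmd %#lx\n",
	       sizeof(struct demo_geom_v1_32), (unsigned long)DEMO_IOC_GEOM_V1_32);
}
```

Calling demo_print_layouts() from main() on a 64-bit build should show two different sizes and therefore two different command values, which is why the compat path sees a command number the native switch does not recognise.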
> > Signed-off-by: Michal Marek > --- > fs/xfs/linux-2.6/xfs_ioctl32.c | 40 +++++++++++++++++++++++++++++++++++++++- > 1 file changed, 39 insertions(+), 1 deletion(-) > > --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c > +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c > @@ -75,6 +75,40 @@ xfs_ioctl32_flock( > return (unsigned long)p; > } > > +typedef struct xfs_fsop_geom_v132 { xfs_fsop_geom_v1_32 > + __u32 blocksize; /* filesystem (data) block size */ > + __u32 rtextsize; /* realtime extent size */ > + __u32 agblocks; /* fsblocks in an AG */ > + __u32 agcount; /* number of allocation groups */ > + __u32 logblocks; /* fsblocks in the log */ > + __u32 sectsize; /* (data) sector size, bytes */ > + __u32 inodesize; /* inode size in bytes */ > + __u32 imaxpct; /* max allowed inode space(%) */ > + __u64 datablocks; /* fsblocks in data subvolume */ > + __u64 rtblocks; /* fsblocks in realtime subvol */ > + __u64 rtextents; /* rt extents in realtime subvol*/ > + __u64 logstart; /* starting fsblock of the log */ > + unsigned char uuid[16]; /* unique id of the filesystem */ > + __u32 sunit; /* stripe unit, fsblocks */ > + __u32 swidth; /* stripe width, fsblocks */ > + __s32 version; /* structure version */ > + __u32 flags; /* superblock version flags */ > + __u32 logsectsize; /* log sector size, bytes */ > + __u32 rtsectsize; /* realtime sector size, bytes */ > + __u32 dirblocksize; /* directory block size, bytes */ > +} __attribute__((packed)) xfs_fsop_geom_v132_t; xfs_fsop_geom_v1_32_t > +#define XFS_IOC_FSGEOMETRY_V1_32 _IOR ('X', 100, struct xfs_fsop_geom_v132) > + > +STATIC unsigned long xfs_ioctl32_geom_v1(unsigned long arg) > +{ > + xfs_fsop_geom_v132_t __user *p32 = (void __user *)arg; > + xfs_fsop_geom_v1_t __user *p = compat_alloc_user_space(sizeof(*p)); > + > + if (copy_in_user(p, p32, sizeof(*p32))) > + return -EFAULT; > + return (unsigned long)p; > +} > + > #else > > typedef struct xfs_fsop_bulkreq32 { > @@ -118,7 +152,6 @@ xfs_compat_ioctl( > > switch (cmd) { > 
case XFS_IOC_DIOINFO: > - case XFS_IOC_FSGEOMETRY_V1: > case XFS_IOC_FSGEOMETRY: > case XFS_IOC_GETVERSION: > case XFS_IOC_GETXFLAGS: > @@ -166,6 +199,10 @@ xfs_compat_ioctl( > arg = xfs_ioctl32_flock(arg); > cmd = _NATIVE_IOC(cmd, struct xfs_flock64); > break; /* xfs_fsop_geom_v1 changes size */ > + case XFS_IOC_FSGEOMETRY_V1_32: > + arg = xfs_ioctl32_geom_v1(arg); > + cmd = XFS_IOC_FSGEOMETRY_V1; > + break; cmd = _NATIVE_IOC(cmd, struct xfs_fsop_geom_v1); Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 30 19:36:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 19:36:53 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4V2amWt011319 for ; Wed, 30 May 2007 19:36:50 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA08952; Thu, 31 May 2007 12:36:43 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4V2agAf109359294; Thu, 31 May 2007 12:36:42 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4V2aeuJ109658032; Thu, 31 May 2007 12:36:40 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 12:36:40 +1000 From: David Chinner To: Michal Marek Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 2/3] Fix XFS_IOC_*_TO_HANDLE and XFS_IOC_{OPEN,READLINK}_BY_HANDLE in compat mode Message-ID: <20070531023640.GI85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143043.611931865@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070530143043.611931865@suse.cz> User-Agent: 
Mutt/1.4.2.1i X-archive-position: 11558 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 02:59:56PM +0200, Michal Marek wrote: > 32bit struct xfs_fsop_handlereq has different size and offsets (due to > pointers). TODO: case XFS_IOC_{FSSETDM,ATTRLIST,ATTRMULTI}_BY_HANDLE > still not handled. > > Signed-off-by: Michal Marek > --- > fs/xfs/linux-2.6/xfs_ioctl32.c | 63 +++++++++++++++++++++++++++++++++++++---- > 1 file changed, 58 insertions(+), 5 deletions(-) > > --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c > +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c > @@ -139,6 +139,44 @@ xfs_ioctl32_bulkstat( > } > #endif > > +typedef struct xfs_fsop_handlereq32 { xfs_fsop_handlereq_32 > + __u32 fd; /* fd for FD_TO_HANDLE */ > + compat_uptr_t path; /* user pathname */ > + __u32 oflags; /* open flags */ > + compat_uptr_t ihandle; /* user supplied handle */ > + __u32 ihandlen; /* user supplied length */ > + compat_uptr_t ohandle; /* user buffer for handle */ > + compat_uptr_t ohandlen; /* user buffer length */ > +} xfs_fsop_handlereq32_t; xfs_fsop_handlereq_32_t Add an empty line here... > +#define XFS_IOC_PATH_TO_FSHANDLE_32 _IOWR('X', 104, struct xfs_fsop_handlereq32) > +#define XFS_IOC_PATH_TO_HANDLE_32 _IOWR('X', 105, struct xfs_fsop_handlereq32) > +#define XFS_IOC_FD_TO_HANDLE_32 _IOWR('X', 106, struct xfs_fsop_handlereq32) > +#define XFS_IOC_OPEN_BY_HANDLE_32 _IOWR('X', 107, struct xfs_fsop_handlereq32) > +#define XFS_IOC_READLINK_BY_HANDLE_32 _IOWR('X', 108, struct xfs_fsop_handlereq32) Looks kinda whitespacey here - it's mixing spaces and tabs....
> +STATIC unsigned long xfs_ioctl32_fshandle(unsigned long arg) > +{ > + xfs_fsop_handlereq32_t __user *p32 = (void __user *)arg; > + xfs_fsop_handlereq_t __user *p = compat_alloc_user_space(sizeof(*p)); > + u32 addr; > + > + if (copy_in_user(&p->fd, &p32->fd, sizeof(__u32)) || > + get_user(addr, &p32->path) || > + put_user(compat_ptr(addr), &p->path) || > + copy_in_user(&p->oflags, &p32->oflags, sizeof(__u32)) || > + get_user(addr, &p32->ihandle) || > + put_user(compat_ptr(addr), &p->ihandle) || > + copy_in_user(&p->ihandlen, &p32->ihandlen, sizeof(__u32)) || > + get_user(addr, &p32->ohandle) || > + put_user(compat_ptr(addr), &p->ohandle) || > + get_user(addr, &p32->ohandlen) || > + put_user(compat_ptr(addr), &p->ohandlen)) > + return -EFAULT; > + > + return (unsigned long)p; > +} > + > + > STATIC long > xfs_compat_ioctl( > int mode, > @@ -164,12 +202,7 @@ xfs_compat_ioctl( > case XFS_IOC_GETBMAPA: > case XFS_IOC_GETBMAPX: > /* not handled > - case XFS_IOC_FD_TO_HANDLE: > - case XFS_IOC_PATH_TO_HANDLE: > - case XFS_IOC_PATH_TO_FSHANDLE: > - case XFS_IOC_OPEN_BY_HANDLE: > case XFS_IOC_FSSETDM_BY_HANDLE: > - case XFS_IOC_READLINK_BY_HANDLE: > case XFS_IOC_ATTRLIST_BY_HANDLE: > case XFS_IOC_ATTRMULTI_BY_HANDLE: > */ > @@ -226,6 +259,26 @@ xfs_compat_ioctl( > arg = xfs_ioctl32_bulkstat(arg); > break; > #endif > + case XFS_IOC_FD_TO_HANDLE_32: > + arg = xfs_ioctl32_fshandle(arg); > + cmd = XFS_IOC_FD_TO_HANDLE; > + break; > + case XFS_IOC_PATH_TO_HANDLE_32: > + arg = xfs_ioctl32_fshandle(arg); > + cmd = XFS_IOC_PATH_TO_HANDLE; > + break; > + case XFS_IOC_PATH_TO_FSHANDLE_32: > + arg = xfs_ioctl32_fshandle(arg); > + cmd = XFS_IOC_PATH_TO_FSHANDLE; > + break; > + case XFS_IOC_OPEN_BY_HANDLE_32: > + arg = xfs_ioctl32_fshandle(arg); > + cmd = XFS_IOC_OPEN_BY_HANDLE; > + break; > + case XFS_IOC_READLINK_BY_HANDLE_32: > + arg = xfs_ioctl32_fshandle(arg); > + cmd = XFS_IOC_READLINK_BY_HANDLE; > + break; + case XFS_IOC_FD_TO_HANDLE_32: + case XFS_IOC_PATH_TO_HANDLE_32: + case 
XFS_IOC_PATH_TO_FSHANDLE_32: + case XFS_IOC_OPEN_BY_HANDLE_32: + case XFS_IOC_READLINK_BY_HANDLE_32: + arg = xfs_ioctl32_fshandle(arg); + cmd = _NATIVE_IOC(cmd, struct xfs_fsop_handlereq); + break; Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed May 30 21:22:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 21:22:50 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4V4MjWt012312 for ; Wed, 30 May 2007 21:22:47 -0700 Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.161]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l4V4MjCS022757 for ; Thu, 31 May 2007 13:22:45 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l4V4MjP24299 for xfs@oss.sgi.com; Thu, 31 May 2007 13:22:45 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv5.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l4V4Mi206567 for ; Thu, 31 May 2007 13:22:44 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070531.132243.92102468 for ; Thu, 31 May 2007 13:22:44 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Thu May 31 13:22:43 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 96503AE4B3; Thu, 31 May 2007 13:22:40 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l4V4Mh97006891; Thu, 31 May 2007 13:22:43 +0900 Message-Id: <200705310422.AA05481@TNESG9305.tnes.nec.co.jp> Date: Thu, 31 May 2007 13:22:43 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_quota command handling. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11559 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, This is probably my last patch for xfs_quota. When no path argument is specified, xfs_quota executes each command once per mounted XFS file system. As a result, I get the same command report many times. This patch implements a command loop similar to the one in xfs_db. # ./xfs_quota -c "print" -c "help" Filesystem Pathname /home/utako/mpnt /dev/sda6 (uquota) /home/utako/mpnt/mpnt2 /dev/loop0 (uquota) /home/utako/mpnt/pjq /dev/sda6 (project 42, logfiles) /home/utako/mpnt/pjq2 /dev/sda6 (project 43, logfiles2) Filesystem Pathname /home/utako/mpnt /dev/sda6 (uquota) /home/utako/mpnt/mpnt2 /dev/loop0 (uquota) /home/utako/mpnt/pjq /dev/sda6 (project 42, logfiles) /home/utako/mpnt/pjq2 /dev/sda6 (project 43, logfiles2) df [-bir] [-hn] [-f file] -- show free and used counts for blocks and inodes help [command] -- help for one or all commands print -- list known mount points and projects quit -- exit the program quota [-bir] [-gpu] [-hnv] [-f file] [id|name]... -- show usage and limits Use 'help commandname' for extended help. df [-bir] [-hn] [-f file] -- show free and used counts for blocks and inodes help [command] -- help for one or all commands print -- list known mount points and projects quit -- exit the program quota [-bir] [-gpu] [-hnv] [-f file] [id|name]... -- show usage and limits Use 'help commandname' for extended help.
-- Utako Kusaka Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/init.orig 2007-05-16 11:38:39.000000000 +0900 +++ xfsprogs-2.8.20/quota/init.c 2007-05-30 11:16:56.000000000 +0900 @@ -25,6 +25,8 @@ char *progname; int exitcode; int expert; +static char **cmdline; /* table of user commands */ +static int ncmdline; /* number of entries in command table */ static char **projopts; /* table of project names (cmdline) */ static int nprojopts; /* number of entries in name table. */ @@ -74,24 +76,6 @@ init_commands(void) state_init(); } -static int -init_args_command( - int index) -{ - if (index >= fs_count) - return 0; - - do { - fs_path = &fs_table[index++]; - } while ((fs_path->fs_flags & FS_PROJECT_PATH) && index < fs_count); - - if (fs_path->fs_flags & FS_PROJECT_PATH) - return 0; - if (index > fs_count) - return 0; - return index; -} - static void init( int argc, @@ -107,7 +91,12 @@ init( while ((c = getopt(argc, argv, "c:d:D:P:p:t:xV")) != EOF) { switch (c) { case 'c': /* commands */ - add_user_command(optarg); + cmdline = realloc(cmdline, (ncmdline+1)*sizeof(char*)); + if (!cmdline) { + perror("realloc"); + exit(1); + } + cmdline[ncmdline++] = optarg; break; case 'd': add_project_opt(optarg); @@ -150,7 +139,6 @@ init( free(projopts); init_commands(); - add_args_command(init_args_command); } int @@ -158,7 +146,32 @@ main( int argc, char **argv) { + int c, i, done = 0; + char *input; + char **v; + init(argc, argv); - command_loop(); + + for (i = 0; !done && i < ncmdline; i++) { + v = breakline(cmdline[i], &c); + if (c) + done = command(c, v); + free(v); + } + if (cmdline) { + free(cmdline); + return exitcode; + } + + while (!done) { + if ((input = fetchline()) == NULL) + break; + v = breakline(input, &c); + if (c) + done = command(c, v); + doneline(input, v); + } + + return exitcode; } From owner-xfs@oss.sgi.com Wed May 30 23:37:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 30 May 2007 23:37:48 -0700 (PDT) Received: from larry.melbourne.sgi.com 
(larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4V6bgWt002729 for ; Wed, 30 May 2007 23:37:43 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA15381; Thu, 31 May 2007 16:37:37 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4V6baAf109557916; Thu, 31 May 2007 16:37:36 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4V6bYkg109962094; Thu, 31 May 2007 16:37:34 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 16:37:34 +1000 From: David Chinner To: Michal Marek Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 3/3] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} and XFS_IOC_FSINUMBERS in compat mode Message-ID: <20070531063734.GJ85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143044.060544510@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070530143044.060544510@suse.cz> User-Agent: Mutt/1.4.2.1i X-archive-position: 11560 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, May 30, 2007 at 02:59:57PM +0200, Michal Marek wrote: > * 32bit struct xfs_fsop_bulkreq has different size and layout of > members, no matter the alignment. Move the code out of the #else > branch (why was it there in the first place?). Define _32 variants of > the ioctl constants. > * 32bit struct xfs_bstat is different on 32bit (because of time_t and on > i386 becaus of different padding). 
Create a new function > xfs_ioctl32_bulkstat_wrap(), which allocates extra ->ubuffer and > converts the elements to the 32bit format after the original ioctl > returns. Same for i386 struct xfs_inogrp. > > Signed-off-by: Michal Marek > --- > fs/xfs/linux-2.6/xfs_ioctl32.c | 262 +++++++++++++++++++++++++++++++++++++---- > 1 file changed, 238 insertions(+), 24 deletions(-) > > --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c > +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c > @@ -109,35 +109,249 @@ STATIC unsigned long xfs_ioctl32_geom_v1 > return (unsigned long)p; > } > > -#else > +typedef struct xfs_inogrp32 { > + __u64 xi_startino; /* starting inode number */ > + __s32 xi_alloccount; /* # bits set in allocmask */ > + __u64 xi_allocmask; /* mask of allocated inodes */ > +} __attribute__((packed)) xfs_inogrp32_t; xfs_inogrp_32 xfs_inogrp_32_t > +STATIC int xfs_inogrp_store_compat( > + xfs_inogrp32_t __user *p32, > + xfs_inogrp_t __user *p) > +{ > +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) > + if (copy(xi_startino) || > + copy(xi_alloccount) || > + copy(xi_allocmask)) No need for the #define here.... 
> + return -EFAULT; > + return 0; > +#undef copy > +} > + > +#endif > + > +/* XFS_IOC_FSBULKSTAT and friends */ > + > +typedef struct xfs_bstime32 { > + __s32 tv_sec; /* seconds */ > + __s32 tv_nsec; /* and nanoseconds */ > +} xfs_bstime32_t; *_32 > +static int xfs_bstime_store_compat( > + xfs_bstime32_t __user *p32, > + xfs_bstime_t __user *p) > +{ > + time_t sec; > + __s32 sec32; > + > + if (get_user(sec, &p->tv_sec)) > + return -EFAULT; > + sec32 = sec; > + if (put_user(sec32, &p32->tv_sec) || > + copy_in_user(&p32->tv_nsec, &p->tv_nsec, sizeof(__s32))) > + return -EFAULT; > + return 0; > +} > + > +typedef struct xfs_bstat32 { > + __u64 bs_ino; /* inode number */ > + __u16 bs_mode; /* type and mode */ > + __u16 bs_nlink; /* number of links */ > + __u32 bs_uid; /* user id */ > + __u32 bs_gid; /* group id */ > + __u32 bs_rdev; /* device value */ > + __s32 bs_blksize; /* block size */ > + __s64 bs_size; /* file size */ > + xfs_bstime32_t bs_atime; /* access time */ > + xfs_bstime32_t bs_mtime; /* modify time */ > + xfs_bstime32_t bs_ctime; /* inode change time */ > + int64_t bs_blocks; /* number of blocks */ > + __u32 bs_xflags; /* extended flags */ > + __s32 bs_extsize; /* extent size */ > + __s32 bs_extents; /* number of extents */ > + __u32 bs_gen; /* generation count */ > + __u16 bs_projid; /* project id */ > + unsigned char bs_pad[14]; /* pad space, unused */ > + __u32 bs_dmevmask; /* DMIG event mask */ > + __u16 bs_dmstate; /* DMIG state info */ > + __u16 bs_aextents; /* attribute number of extents */ > +} > +#ifdef BROKEN_X86_ALIGNMENT > + __attribute__((packed)) > +#endif > + xfs_bstat32_t; #ifdef BROKEN_X86_ALIGNMENT #define _PACKED __attribute__((packed)) #else #define _PACKED #endif typedef struct xfs_bstat_32 { ...... 
} _PACKED xfs_bstat32_t > + > +static int xfs_bstat_store_compat( > + xfs_bstat32_t __user *p32, > + xfs_bstat_t __user *p) > +{ > +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) Hmmm - now I see why you used this. These copies are used everywhere in this file; maybe it would be best to define copy_from_32() and copy_to_32() macros and use them everywhere in the file? > + if (copy(bs_ino) || > + copy(bs_mode) || > + copy(bs_nlink) || > + copy(bs_uid) || > + copy(bs_gid) || > + copy(bs_rdev) || > + copy(bs_blksize) || > + copy(bs_size) || > + xfs_bstime_store_compat(&p32->bs_atime, &p->bs_atime) || > + xfs_bstime_store_compat(&p32->bs_mtime, &p->bs_mtime) || > + xfs_bstime_store_compat(&p32->bs_ctime, &p->bs_ctime) || > + copy(bs_blocks) || > + copy(bs_xflags) || > + copy(bs_extsize) || > + copy(bs_extents) || > + copy(bs_gen) || > + copy(bs_projid) || > + copy(bs_pad[14]) || > + copy(bs_dmevmask) || > + copy(bs_dmstate) || > + copy(bs_aextents)) > + return -EFAULT; > + return 0; > +#undef copy > +} > > typedef struct xfs_fsop_bulkreq32 { > compat_uptr_t lastip; /* last inode # pointer */ > __s32 icount; /* count of entries in buffer */ > compat_uptr_t ubuffer; /* user buffer for inode desc. */ > - __s32 ocount; /* output count pointer */ > + compat_uptr_t ocount; /* output count pointer */ > } xfs_fsop_bulkreq32_t; > - > -STATIC unsigned long > -xfs_ioctl32_bulkstat( > - unsigned long arg) > +#define XFS_IOC_FSBULKSTAT_32 _IOWR('X', 101, struct xfs_fsop_bulkreq32) > +#define XFS_IOC_FSBULKSTAT_SINGLE_32 _IOWR('X', 102, struct xfs_fsop_bulkreq32) > +#define XFS_IOC_FSINUMBERS_32 _IOWR('X', 103, struct xfs_fsop_bulkreq32) > + > +#define MAX_BSTAT_LEN \ > + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_bstat_t))) > +#define MAX_INOGRP_LEN \ > + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_inogrp_t))) Oooo magic numbers. Why were these chosen?
> + > +STATIC int > +xfs_ioctl32_bulkstat_wrap( > + bhv_vnode_t *vp, > + struct inode *inode, > + struct file *file, > + int mode, > + unsigned cmd, > + unsigned long arg) > { > - xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; > - xfs_fsop_bulkreq_t __user *p = compat_alloc_user_space(sizeof(*p)); > - u32 addr; > - > - if (get_user(addr, &p32->lastip) || > - put_user(compat_ptr(addr), &p->lastip) || > - copy_in_user(&p->icount, &p32->icount, sizeof(s32)) || > - get_user(addr, &p32->ubuffer) || > - put_user(compat_ptr(addr), &p->ubuffer) || > - get_user(addr, &p32->ocount) || > - put_user(compat_ptr(addr), &p->ocount)) > + xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; > + xfs_fsop_bulkreq_t tmp; > + u32 addr; > + void *buf32; > + int err; > + > + if (get_user(addr, &p32->lastip)) > + return 0; return -EFAULT? > + tmp.lastip = compat_ptr(addr); > + if (get_user(tmp.icount, &p32->icount) || > + get_user(addr, &p32->ubuffer)) > return -EFAULT; > + buf32 = compat_ptr(addr); > + if (get_user(addr, &p32->ocount)) > + return -EFAULT; > + tmp.ocount = compat_ptr(addr); > > - return (unsigned long)p; > -} > + if (tmp.icount <= 0) > + return -EINVAL; > + > + if (cmd == XFS_IOC_FSBULKSTAT_32) > + cmd = XFS_IOC_FSBULKSTAT; > + if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) > + cmd = XFS_IOC_FSBULKSTAT_SINGLE; > + if (cmd == XFS_IOC_FSINUMBERS_32) > + cmd = XFS_IOC_FSINUMBERS; cmd = _NATIVE_IOC(cmd, struct xfs_fsop_bulkreq); switch (cmd) { case XFS_IOC_FSBULKSTAT: case XFS_IOC_FSBULKSTAT_SINGLE: > + > + if (cmd == XFS_IOC_FSBULKSTAT || cmd == XFS_IOC_FSBULKSTAT_SINGLE) { Oh, now it gets messy :( So, we do a whole lot of repacking of the bulkstat structures once we've got the data out of the bulkstat call. I think this is really the wrong way of doing this - the bulkstat functions themselves take a "formatter" argument that is used to pack the buffer in a given format. 
I think that we need to be supplying the bulkstat code with different formatters in this case, not repacking the buffer into a different format at a later time. The formatter used by default is xfs_bulkstat_one() which falls down to xfs_bulkstat_one_dinode() or xfs_bulkstat_one_iget() depending on whether we are doing icache coherent or blockdev cache coherent lookups. It is these functions that need to be told what format they are packing, I think, and xfs_bulkstat_single() needs to be taught about them.... > + if (cmd == XFS_IOC_FSINUMBERS) { And I'm wondering if we should be doing the same thing here (i.e. customer formatters), because this is equally ugly... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 31 00:07:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 00:07:03 -0700 (PDT) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.186]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4V76xWt012221 for ; Thu, 31 May 2007 00:07:00 -0700 Received: from [87.180.149.97] (helo=noname) by mrelayeu.kundenserver.de (node=mrelayeu7) with ESMTP (Nemesis), id 0ML2xA-1HtekH2mkS-0000hc; Thu, 31 May 2007 09:06:57 +0200 From: Arnd Bergmann To: Michal Marek Subject: Re: [patch 3/3] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} and XFS_IOC_FSINUMBERS in compat mode Date: Thu, 31 May 2007 09:06:49 +0200 User-Agent: KMail/1.9.6 Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org References: <20070530125954.706423971@suse.cz> <20070530143044.060544510@suse.cz> In-Reply-To: <20070530143044.060544510@suse.cz> X-Face: >j"dOR3XO=^3iw?0`(E1wZ/&le9!.ok[JrI=S~VlsF~}"P\+jx.GT@=?utf-8?q?=0A=09-oaEG?=,9Ba>v;3>:kcw#yO5?B:l{(Ln.2)=?utf-8?q?=27=7Dfw07+4-=26=5E=7CScOpE=3F=5D=5EXdv=5B/zWkA7=60=25M!DxZ=0A=09?= =?utf-8?q?8MJ=2EU5?="hi+2yT(k`PF~Zt;tfT,i,JXf=x@eLP{7B:"GyA\=UnN) 
=?utf-8?q?=26=26qdaA=3A=7D-Y*=7D=3A3YvzV9=0A=09=7E=273a=7E7I=7CWQ=5D?=<50*%U-6Ewmxfzdn/CK_E/ouMU(r?FAQG/ev^JyuX.%(By`" =?utf-8?q?L=5F=0A=09H=3Dbj?=)"y7*XOqz|SS"mrZ$`Q_syCd MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705310906.50434.arnd@arndb.de> X-Provags-ID: V01U2FsdGVkX1+9P4TcH7DVdMb+R8mU/rjkJkPWTSZloTTFace X2Vi89dI2fHJeJHQUNHybyc72UzwEYtNL+mDdSw1u2FX0G/vyj jKvuKyuX6Y3AclwrpAFAw== X-archive-position: 11561 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: arnd@arndb.de Precedence: bulk X-list: xfs On Wednesday 30 May 2007, Michal Marek wrote: > --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl32.c > +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl32.c > @@ -109,35 +109,249 @@ STATIC unsigned long xfs_ioctl32_geom_v1 > return (unsigned long)p; > } > > -#else > +typedef struct xfs_inogrp32 { > + __u64 xi_startino; /* starting inode number */ > + __s32 xi_alloccount; /* # bits set in allocmask */ > + __u64 xi_allocmask; /* mask of allocated inodes */ > +} __attribute__((packed)) xfs_inogrp32_t; __attribute__((packed)) isn't entirely correct here. You don't really want the whole structure to have byte alignment, you only want to reduce the alignment of the 64 bit members to 32 bit. It would be more appropriate to define a separate type #if defined(__x86_64__) || defined(__ia64__) typedef unsigned long long __compat_u64 __attribute__((aligned(4))); #else typedef unsigned long long __compat_u64; #endif and use that in the data structures.
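[Editor's sketch, not part of the original message.] The difference between packing the whole structure and lowering only the alignment of the 64-bit members can be seen in a small user-space example. All demo_* names are made up for illustration, and the layout noted in the comments assumes GCC or Clang, where an aligned attribute on a typedef may reduce alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the suggested __compat_u64: a 64-bit integer
 * whose alignment is lowered to 4 bytes (GCC/Clang extension). */
typedef uint64_t demo_compat_u64 __attribute__((aligned(4)));

/* Variant 1: packing the whole structure drops its alignment to 1 byte. */
struct demo_inogrp_packed {
	uint64_t startino;
	int32_t  alloccount;
	uint64_t allocmask;
} __attribute__((packed));

/* Variant 2: only the 64-bit members get the reduced alignment, so the
 * structure as a whole stays 4-byte aligned. */
struct demo_inogrp_compat {
	demo_compat_u64 startino;
	int32_t         alloccount;
	demo_compat_u64 allocmask;
};

/* Print size, alignment and the offset of the last member for both variants. */
void demo_print_layouts(void)
{
	printf("packed: size=%zu align=%zu allocmask@%zu\n",
	       sizeof(struct demo_inogrp_packed),
	       (size_t)_Alignof(struct demo_inogrp_packed),
	       offsetof(struct demo_inogrp_packed, allocmask));
	printf("compat: size=%zu align=%zu allocmask@%zu\n",
	       sizeof(struct demo_inogrp_compat),
	       (size_t)_Alignof(struct demo_inogrp_compat),
	       offsetof(struct demo_inogrp_compat, allocmask));
}
```

Both variants place the members at the same offsets, so either would match the i386 layout. The second one keeps natural 4-byte alignment for the structure, which lets the compiler use ordinary loads and stores instead of assuming every access may be unaligned.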
> +STATIC int xfs_inogrp_store_compat( > + xfs_inogrp32_t __user *p32, > + xfs_inogrp_t __user *p) > +{ > +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) > + if (copy(xi_startino) || > + copy(xi_alloccount) || > + copy(xi_allocmask)) > + return -EFAULT; > + return 0; > +#undef copy > +} Your copy() operation looks really dangerous: it will break as soon as someone tries to use it on a member that is actually variable length, like a pointer. A better way would be #define move_user(p32, p64, memb) ({ \ typeof(p32->memb) data; \ get_user(data, &p64->memb) || \ put_user(data, &p32->memb); \ }) Actually, even better would be not to use the compat_alloc_userspace trick at all, but to just interpret the 32 bit data structure directly in the implementation instead of converting it to the 64 bit structure, wherever that's possible. Arnd <>< From owner-xfs@oss.sgi.com Thu May 31 00:22:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 00:22:43 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4V7MbWt021217 for ; Thu, 31 May 2007 00:22:38 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA16661; Thu, 31 May 2007 17:22:29 +1000 Date: Thu, 31 May 2007 17:22:58 +1000 From: Timothy Shimmin To: David Chinner , Michal Marek cc: xfs@oss.sgi.com Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Message-ID: <649C7FF68B1450E03D544BD9@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070531023031.GH85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> <20070531023031.GH85884050@sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline
X-archive-position: 11562 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs --On 31 May 2007 12:30:31 PM +1000 David Chinner wrote: > On Wed, May 30, 2007 at 02:59:55PM +0200, Michal Marek wrote: >> i386 struct xfs_fsop_geom_v1 has no padding after the last member, so >> the size is different. > > That's a pain - it's kind of clunky having to redefine the entire > structure just pack it differently. Oh well, not much that > we can do about it... > Could we get rid of it? Do we really need to support XFS_IOC_FSGEOMETRY_V1 anymore? IRIX has 4 versions, and Linux has the latter 2 versions (though xfs_fs_geometry() still has a bit of code for 'em all). And the current version would have come in June 2002 with v2 logs on Linux. Who would want to use XFS_IOC_FSGEOMETRY_V1? Okay it turns out a whole bunch of our xfs-cmds :-) (Such as xfsdump as Michal mentioned) On Sep/2002, Nathan changed a bunch of them to use v1. xfsprogs-2.3.0 (03 September 2002) - Several changes to geometry ioctl callers which will make the tools useable on older kernel versions too. So he did this so that new tools would work on the older kernels which didn't support the new geom version. So I guess we are stuck with v1 now. 
Oh well, just a thought :) --Tim From owner-xfs@oss.sgi.com Thu May 31 01:11:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 01:11:04 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4V8B0Wt016814 for ; Thu, 31 May 2007 01:11:01 -0700 Received: from [10.20.1.116] (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id 1366762813B; Thu, 31 May 2007 10:11:00 +0200 (CEST) Message-ID: <465E8313.20309@suse.cz> Date: Thu, 31 May 2007 10:10:59 +0200 From: Michal Marek User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.2) Gecko/20061107 SUSE/1.1.1-0.2 SeaMonkey/1.1.1 MIME-Version: 1.0 To: Arnd Bergmann Cc: Chris Wedgwood , xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> <20070530170530.GA4197@tuatara.stupidest.org> <200705302348.54259.arnd@arndb.de> In-Reply-To: <200705302348.54259.arnd@arndb.de> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11563 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs Arnd Bergmann wrote: > On Wednesday 30 May 2007, Chris Wedgwood wrote: >> On Wed, May 30, 2007 at 02:59:55PM +0200, Michal Marek wrote: >> >>> +typedef struct xfs_fsop_geom_v132 { >> wouldn't xfs_fsop_geom_v1_32 ... >> and xfs_fsop_geom_v1_32_t >> ^ >> >> read better there? > > Actually, the current convention would be struct compat_xfs_fsop_geom_v1 > and compat_xfs_fsop_geom_v1_t. I see. I'll change it to struct compat_xfs_*. 
Michal From owner-xfs@oss.sgi.com Thu May 31 01:52:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 01:52:19 -0700 (PDT) Received: from mail.suse.cz (styx.suse.cz [82.119.242.94]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4V8qEWt011729 for ; Thu, 31 May 2007 01:52:17 -0700 Received: from [10.20.1.116] (discovery.suse.cz [10.20.1.116]) by mail.suse.cz (Postfix) with ESMTP id C7BA0628143; Thu, 31 May 2007 10:52:14 +0200 (CEST) Message-ID: <465E8CBE.8020709@suse.cz> Date: Thu, 31 May 2007 10:52:14 +0200 From: Michal Marek User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.2) Gecko/20061107 SUSE/1.1.1-0.2 SeaMonkey/1.1.1 MIME-Version: 1.0 To: David Chinner Cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 3/3] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} and XFS_IOC_FSINUMBERS in compat mode References: <20070530125954.706423971@suse.cz> <20070530143044.060544510@suse.cz> <20070531063734.GJ85884050@sgi.com> In-Reply-To: <20070531063734.GJ85884050@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11564 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mmarek@suse.cz Precedence: bulk X-list: xfs David Chinner wrote: > On Wed, May 30, 2007 at 02:59:57PM +0200, Michal Marek wrote: >> +typedef struct xfs_bstat32 { >> + __u64 bs_ino; /* inode number */ >> + __u16 bs_mode; /* type and mode */ >> + __u16 bs_nlink; /* number of links */ >> + __u32 bs_uid; /* user id */ >> + __u32 bs_gid; /* group id */ >> + __u32 bs_rdev; /* device value */ >> + __s32 bs_blksize; /* block size */ >> + __s64 bs_size; /* file size */ >> + xfs_bstime32_t bs_atime; /* access time */ >> + xfs_bstime32_t bs_mtime; /* modify time */ >> + xfs_bstime32_t bs_ctime; /* inode change time */ >> + int64_t bs_blocks; /* number of blocks */ >> + __u32 bs_xflags; /* extended flags */ >> + __s32 bs_extsize; /* extent size */ >> + 
__s32 bs_extents; /* number of extents */ >> + __u32 bs_gen; /* generation count */ >> + __u16 bs_projid; /* project id */ >> + unsigned char bs_pad[14]; /* pad space, unused */ >> + __u32 bs_dmevmask; /* DMIG event mask */ >> + __u16 bs_dmstate; /* DMIG state info */ >> + __u16 bs_aextents; /* attribute number of extents */ >> +} >> +#ifdef BROKEN_X86_ALIGNMENT >> + __attribute__((packed)) >> +#endif >> + xfs_bstat32_t; > > #ifdef BROKEN_X86_ALIGNMENT > #define _PACKED __attribute__((packed)) > #else > #define _PACKED > #endif > > typedef struct xfs_bstat_32 { > ...... > } _PACKED xfs_bstat32_t Yes, that would look cleaner. Perhaps something like PACKED_IF_NEEDED so that it's clear that it's not always defined to __attribute__((packed))? >> +static int xfs_bstat_store_compat( >> + xfs_bstat32_t __user *p32, >> + xfs_bstat_t __user *p) >> +{ >> +#define copy(memb) copy_in_user(&p32->memb, &p->memb, sizeof(p32->memb)) > > Hmmm - now I see why you used this. > > These copies are used everywhere in this file, maybe it would be best > to define a copy_from_32() and a copy_to_32() macros and use them > everywhere in the file? OK, I'll use a "proper" (not playing with some local variables) global macro like Arnd suggested in this thread. >> +#define MAX_BSTAT_LEN \ >> + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_bstat_t))) >> +#define MAX_INOGRP_LEN \ >> + ((__s32)((64*1024 - sizeof(xfs_fsop_bulkreq_t)) / sizeof(xfs_inogrp_t))) > > Oooo magic numbers. Why were these chosen? I wanted to limit the argument passed to compat_alloc_user_space somehow; 64K is probably not the best idea, but some upper bound is needed, isn't it? OK, if we can get by without any conversions at all (see below), then these constants can go away.
>> + >> +STATIC int >> +xfs_ioctl32_bulkstat_wrap( >> + bhv_vnode_t *vp, >> + struct inode *inode, >> + struct file *file, >> + int mode, >> + unsigned cmd, >> + unsigned long arg) >> { >> - xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; >> - xfs_fsop_bulkreq_t __user *p = compat_alloc_user_space(sizeof(*p)); >> - u32 addr; >> - >> - if (get_user(addr, &p32->lastip) || >> - put_user(compat_ptr(addr), &p->lastip) || >> - copy_in_user(&p->icount, &p32->icount, sizeof(s32)) || >> - get_user(addr, &p32->ubuffer) || >> - put_user(compat_ptr(addr), &p->ubuffer) || >> - get_user(addr, &p32->ocount) || >> - put_user(compat_ptr(addr), &p->ocount)) >> + xfs_fsop_bulkreq32_t __user *p32 = (void __user *)arg; >> + xfs_fsop_bulkreq_t tmp; >> + u32 addr; >> + void *buf32; >> + int err; >> + >> + if (get_user(addr, &p32->lastip)) >> + return 0; > > return -EFAULT? Oops, silly mistake. Thanks! >> + if (cmd == XFS_IOC_FSBULKSTAT_32) >> + cmd = XFS_IOC_FSBULKSTAT; >> + if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) >> + cmd = XFS_IOC_FSBULKSTAT_SINGLE; >> + if (cmd == XFS_IOC_FSINUMBERS_32) >> + cmd = XFS_IOC_FSINUMBERS; > > cmd = _NATIVE_IOC(cmd, struct xfs_fsop_bulkreq); > switch (cmd) { > case XFS_IOC_FSBULKSTAT: > case XFS_IOC_FSBULKSTAT_SINGLE: Yes, I'll use _NATIVE_IOC here (also in other places as you pointed out). >> + >> + if (cmd == XFS_IOC_FSBULKSTAT || cmd == XFS_IOC_FSBULKSTAT_SINGLE) { > > Oh, now it gets messy :( True :-/ > So, we do a whole lot of repacking of the bulkstat structures > once we've got the data out of the bulkstat call. > > I think this is really the wrong way of doing this - the bulkstat > functions themselves take a "formatter" argument that is used to pack > the buffer in a given format. I think that we need to be supplying > the bulkstat code with different formatters in this case, not > repacking the buffer into a different format at a later time. 
> > The formatter used by default is xfs_bulkstat_one() which > falls down to xfs_bulkstat_one_dinode() or xfs_bulkstat_one_iget() > depending on whether we are doing icache coherent or blockdev > cache coherent lookups. It is these functions that need to be > told what format they are packing, I think, and xfs_bulkstat_single() > needs to be taught about them.... Well, I didn't want to touch other files but xfs_ioctl32.c so that the patch has a higher chance of being accepted ;-) But of course if you think that patching the implementation to be aware of the compat ioctls is acceptable, then I can do it. >> + if (cmd == XFS_IOC_FSINUMBERS) { > > And I'm wondering if we should be doing the same thing here > (i.e. customer formatters), because this is equally ugly... I'll try to clean up the XFS_IOC_FSBULKSTAT part first... Thanks to all for their comments and suggestions! I'll send updated patches later. Michal From owner-xfs@oss.sgi.com Thu May 31 04:02:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 04:02:56 -0700 (PDT) Received: from hpsmtp-eml20.kpnxchange.com (hpsmtp-eml20.kpnxchange.com [213.75.38.85]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4VB2qWt002281 for ; Thu, 31 May 2007 04:02:53 -0700 Received: from hpsmtp-eml05.kpnxchange.com ([213.75.38.105]) by hpsmtp-eml20.kpnxchange.com with Microsoft SMTPSVC(6.0.3790.1830); Thu, 31 May 2007 12:50:47 +0200 Received: from mail.deserver.nl ([86.94.146.172]) by hpsmtp-eml05.kpnxchange.com with Microsoft SMTPSVC(6.0.3790.3959); Thu, 31 May 2007 12:50:47 +0200 Received: from localhost (localhost [127.0.0.1]) by mail.deserver.nl (Postfix) with ESMTP id 60760236CB for ; Thu, 31 May 2007 12:50:47 +0200 (CEST) Received: from [192.168.0.14] (unknown [192.168.0.14]) by mail.deserver.nl (Postfix) with ESMTP id 01E03236BC for ; Thu, 31 May 2007 12:50:42 +0200 (CEST) Message-ID: <465EA882.3030403@deserver.nl> Date: Thu, 31 May 2007 12:50:42 +0200 From: Jaap Struyk User-Agent: 
Thunderbird 2.0.0.0 (X11/20070326) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: ways to restore data from crashed disk X-Enigmail-Version: 0.95.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 31 May 2007 10:50:47.0457 (UTC) FILETIME=[8BBF9910:01C7A371] X-archive-position: 11565 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: japie@deserver.nl Precedence: bulk X-list: xfs Hello, Recently my server drive crashed and I am trying to get some data off it. The only problematic part is my xfs partition: the drive has some "holes" in it, unfortunately at the place where the superblock would be. I did: dd_rhelp /dev/hdb4 hdb4.img xfs_repair -f hdb4.img Phase 1 - find and verify superblock... superblock read failed, offset 9397895168, size 2048, ag 1, rval 0 Is there an alternative available on xfs or a way to recreate it? -- Regards, Japie From owner-xfs@oss.sgi.com Thu May 31 06:04:00 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 06:04:03 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4VD3tWt017134 for ; Thu, 31 May 2007 06:03:57 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA23976; Thu, 31 May 2007 23:03:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4VD3gAf106843201; Thu, 31 May 2007 23:03:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4VD3bPh110045308; Thu, 31 May 2007 23:03:37 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 23:03:37 +1000 From: David Chinner To: Michal Marek Cc: 
David Chinner , xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [patch 3/3] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} and XFS_IOC_FSINUMBERS in compat mode Message-ID: <20070531130337.GM85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143044.060544510@suse.cz> <20070531063734.GJ85884050@sgi.com> <465E8CBE.8020709@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <465E8CBE.8020709@suse.cz> User-Agent: Mutt/1.4.2.1i X-archive-position: 11566 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 31, 2007 at 10:52:14AM +0200, Michal Marek wrote: > David Chinner wrote: > > On Wed, May 30, 2007 at 02:59:57PM +0200, Michal Marek wrote: > >> +typedef struct xfs_bstat32 { ....... > >> +} > >> +#ifdef BROKEN_X86_ALIGNMENT > >> + __attribute__((packed)) > >> +#endif > >> + xfs_bstat32_t; > > > > #ifdef BROKEN_X86_ALIGNMENT > > #define _PACKED __attribute__((packed)) > > #else > > #define _PACKED > > #endif > > > > typedef struct xfs_bstat_32 { > > ...... > > } _PACKED xfs_bstat32_t > > Yes, that would look cleaner. Perhaps something like PACKED_IF_NEEDED so > that it's clear that it's not allways defined to __attribute__((packed))? Not really that fussed ;) > >> + > >> + if (cmd == XFS_IOC_FSBULKSTAT || cmd == XFS_IOC_FSBULKSTAT_SINGLE) { > > > > Oh, now it gets messy :( > > True :-/ > > > > So, we do a whole lot of repacking of the bulkstat structures > > once we've got the data out of the bulkstat call. > > > > I think this is really the wrong way of doing this - the bulkstat > > functions themselves take a "formatter" argument that is used to pack > > the buffer in a given format. I think that we need to be supplying > > the bulkstat code with different formatters in this case, not > > repacking the buffer into a different format at a later time. 
> > > > The formatter used by default is xfs_bulkstat_one() which > > falls down to xfs_bulkstat_one_dinode() or xfs_bulkstat_one_iget() > > depending on whether we are doing icache coherent or blockdev > > cache coherent lookups. It is these functions that need to be > > told what format they are packing, I think, and xfs_bulkstat_single() > > needs to be taught about them.... > > Well, I didn't want to touch other files but xfs_ioctl32.c so that the > patch has a higher chance of being accepted ;-) But of course if you > think that patching the implementation to be aware of the compat ioctls > is acceptable, then I can do it. I think that given we already have multiple bulkstat formatters to support different buffer formats, we'd be silly not to use that interface directly for the new buffer formats needed. You can probably dup the code from xfs_ioctl.c to issue the bulkstat calls and then modify both to take specified formatters. You could even define the compat formatter(s) in xfs_ioctl32.c so the compat code doesn't need to be put elsewhere.... > >> + if (cmd == XFS_IOC_FSINUMBERS) { > > > > And I'm wondering if we should be doing the same thing here > > (i.e. customer formatters), because this is equally ugly... > > I'll try to clean up the XFS_IOC_FSBULKSTAT part first... Yup, fair enough. Thanks! Cheers, Dave. 
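[Editor's sketch, not part of the original message.] The idea of handing the walker a format-specific callback, rather than repacking the buffer afterwards, can be modelled in user space roughly as follows. Every demo_* name and structure here is hypothetical; this is not the XFS bulkstat interface, only an illustration of the pattern under the assumption that each formatter writes one record and reports how many bytes it used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-core record; stands in for what the walker gathers. */
struct demo_stat { uint64_t ino; int64_t atime_sec; int32_t atime_nsec; };

/* Hypothetical user-visible layouts for native and 32-bit callers. */
struct demo_bstat   { uint64_t ino; int64_t atime_sec; int32_t atime_nsec; int32_t pad; };
struct demo_bstat32 { uint64_t ino; int32_t atime_sec; int32_t atime_nsec; };

/* A formatter packs one record into dst; it returns the bytes written,
 * or 0 if the remaining space is too small. */
typedef size_t (*demo_formatter_t)(const struct demo_stat *src, void *dst, size_t space);

size_t demo_format_native(const struct demo_stat *src, void *dst, size_t space)
{
	struct demo_bstat rec;

	if (space < sizeof(rec))
		return 0;
	memset(&rec, 0, sizeof(rec));	/* do not leak padding bytes */
	rec.ino = src->ino;
	rec.atime_sec = src->atime_sec;
	rec.atime_nsec = src->atime_nsec;
	memcpy(dst, &rec, sizeof(rec));
	return sizeof(rec);
}

size_t demo_format_compat(const struct demo_stat *src, void *dst, size_t space)
{
	struct demo_bstat32 rec;

	if (space < sizeof(rec))
		return 0;
	memset(&rec, 0, sizeof(rec));
	rec.ino = src->ino;
	rec.atime_sec = (int32_t)src->atime_sec;	/* narrowed for the 32-bit layout */
	rec.atime_nsec = src->atime_nsec;
	memcpy(dst, &rec, sizeof(rec));
	return sizeof(rec);
}

/* The walker is format-agnostic: it packs records until the formatter
 * reports that the buffer is full, then returns the bytes used. */
size_t demo_bulkstat(const struct demo_stat *recs, size_t nrecs,
		     demo_formatter_t fmt, void *buf, size_t buflen, size_t *count)
{
	unsigned char *out = buf;
	size_t used = 0, n = 0;

	assert(fmt != NULL);
	for (size_t i = 0; i < nrecs; i++) {
		size_t w = fmt(&recs[i], out + used, buflen - used);
		if (w == 0)
			break;
		used += w;
		n++;
	}
	*count = n;
	return used;
}
```

With this shape the compat ioctl path would only need to pass demo_format_compat instead of demo_format_native; no second pass over the output buffer is required.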
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 31 06:26:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 06:26:30 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l4VDQKWt028920 for ; Thu, 31 May 2007 06:26:22 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA24301; Thu, 31 May 2007 23:26:20 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l4VDQGAf109439850; Thu, 31 May 2007 23:26:17 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l4VDQFRl110075623; Thu, 31 May 2007 23:26:15 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Thu, 31 May 2007 23:26:15 +1000 From: David Chinner To: Timothy Shimmin Cc: David Chinner , Michal Marek , xfs@oss.sgi.com Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Message-ID: <20070531132615.GO85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> <20070531023031.GH85884050@sgi.com> <649C7FF68B1450E03D544BD9@timothy-shimmins-power-mac-g5.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <649C7FF68B1450E03D544BD9@timothy-shimmins-power-mac-g5.local> User-Agent: Mutt/1.4.2.1i X-archive-position: 11567 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, May 31, 2007 at 05:22:58PM +1000, Timothy Shimmin wrote: > > > --On 31 May 2007 12:30:31 PM +1000 David Chinner wrote: > > >On Wed, May 30, 2007 at 02:59:55PM +0200, Michal Marek 
wrote: > >>i386 struct xfs_fsop_geom_v1 has no padding after the last member, so > >>the size is different. > > > >That's a pain - it's kind of clunky having to redefine the entire > >structure just pack it differently. Oh well, not much that > >we can do about it... > > > > Could we get rid of it? > Do we really need to support XFS_IOC_FSGEOMETRY_V1 anymore? > IRIX has 4 versions, and Linux has the latter 2 versions > (though xfs_fs_geometry() still has a bit of code for 'em all). > And the current version would have come in June 2002 with v2 logs > on Linux. > Who would want to use XFS_IOC_FSGEOMETRY_V1? > > Okay it turns out a whole bunch of our xfs-cmds :-) > (Such as xfsdump as Michal mentioned) > On Sep/2002, Nathan changed a bunch of them to use v1. > xfsprogs-2.3.0 (03 September 2002) > - Several changes to geometry ioctl callers which will make > the tools useable on older kernel versions too. > So he did this so that new tools would work on the older kernels which > didn't support the new geom version. > So I guess we are stuck with v1 now. Not necessarily - we could change the tools to use v4, and if that didn't exist, then try v1. That way we don't need to support v1 in linux, and the tools still run on old kernels..... Cheers, Dave. 
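[Editor's sketch, not part of the original message.] The "try v4, then fall back to v1" approach for the tools could look roughly like this. The query callback, the demo_* names and the stub kernels are all hypothetical; real tools would issue the geometry ioctls directly. The sketch also assumes that a kernel lacking the newer request fails it with EINVAL or ENOTTY, which is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_geom_version { DEMO_GEOM_V1 = 1, DEMO_GEOM_V4 = 4 };

/* Hypothetical wrapper around one geometry request: returns 0 on success,
 * or -1 with errno set, like ioctl(2). */
typedef int (*demo_geom_query_t)(int fd, enum demo_geom_version ver, void *out);

/* Try the newest version first and fall back to v1 only when the kernel
 * does not know the newer request. Returns the version used, or -1. */
int demo_get_geometry(demo_geom_query_t query, int fd, void *out)
{
	assert(query != NULL);

	if (query(fd, DEMO_GEOM_V4, out) == 0)
		return DEMO_GEOM_V4;
	if (errno != EINVAL && errno != ENOTTY)
		return -1;	/* a real failure, not a missing ioctl */
	if (query(fd, DEMO_GEOM_V1, out) == 0)
		return DEMO_GEOM_V1;
	return -1;
}

/* Stub modelling an old kernel that only knows v1. */
int demo_old_kernel_query(int fd, enum demo_geom_version ver, void *out)
{
	(void)fd;
	(void)out;
	if (ver == DEMO_GEOM_V1)
		return 0;
	errno = ENOTTY;
	return -1;
}

/* Stub modelling a current kernel that answers every version. */
int demo_new_kernel_query(int fd, enum demo_geom_version ver, void *out)
{
	(void)fd;
	(void)ver;
	(void)out;
	return 0;
}
```

A caller would check the returned version to know which structure layout was filled in, since the v1 structure is shorter than the later ones.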
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu May 31 06:37:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 06:37:38 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l4VDbWWt032643 for ; Thu, 31 May 2007 06:37:33 -0700 Received: from liberator.sandeen.net (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 8B77818032D18; Thu, 31 May 2007 08:37:31 -0500 (CDT) Message-ID: <465ECF9B.2000500@sandeen.net> Date: Thu, 31 May 2007 08:37:31 -0500 From: Eric Sandeen User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Jaap Struyk CC: xfs@oss.sgi.com Subject: Re: ways to restore data from crashed disk References: <465EA882.3030403@deserver.nl> In-Reply-To: <465EA882.3030403@deserver.nl> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11568 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Jaap Struyk wrote: > Hello, > > Recently my server drive has crashed and I'am trying to get some data of it. > The only problematic part is my xfs partition, the drive has some > "holes" in it, unfortunatly on the place where the superblock would be. > I did: > dd_rhelp /dev/hdb4 hdb4.img > xfs_repair -f hdb4.img > Phase 1 - find and verify superblock... > superblock read failed, offset 9397895168, size 2048, ag 1, rval 0 For starters, is the file really that big? (9397895168 bytes?) -eric > Is there an alternative available on xfs or a way to recreate it? 
From owner-xfs@oss.sgi.com Thu May 31 20:32:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 20:32:46 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l513WdWt014920 for ; Thu, 31 May 2007 20:32:41 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA15440; Fri, 1 Jun 2007 13:32:33 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1116) id 85DEB591CE8E; Fri, 1 Jun 2007 13:32:33 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: PARTIAL TAKE 965537 - xfsprogs AC_TYPE_U32 broken by defining HAVE___U32_T instead of HAVE___U32 Message-Id: <20070601033233.85DEB591CE8E@chook.melbourne.sgi.com> Date: Fri, 1 Jun 2007 13:32:33 +1000 (EST) From: tes@sgi.com (Tim Shimmin) X-archive-position: 11570 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs s/HAVE___U32_T/HAVE___U32/g Thanks to Eric Sandeen for finding this. 
Date: Fri Jun 1 13:31:17 AEST 2007
Workarea: chook.melbourne.sgi.com:/build/tes/xfs-cmds
Inspected by: sandeen@sandeen.net

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb

Modid: master-melb:xfs-cmds:28754a
xfsprogs/doc/CHANGES - 1.240 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/doc/CHANGES.diff?r1=text&tr1=1.240&r2=text&tr2=1.239&f=h
xfsprogs/aclocal.m4 - 1.26 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/aclocal.m4.diff?r1=text&tr1=1.26&r2=text&tr2=1.25&f=h
xfsprogs/m4/package_types.m4 - 1.4 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/m4/package_types.m4.diff?r1=text&tr1=1.4&r2=text&tr2=1.3&f=h
	- s/HAVE___U32_T/HAVE___U32/g
	  Thanks to Eric Sandeen for finding this.

From owner-xfs@oss.sgi.com Thu May 31 21:39:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 31 May 2007 21:39:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l514dVWt004088 for ; Thu, 31 May 2007 21:39:33 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA17134; Fri, 1 Jun 2007 14:39:17 +1000 Date: Fri, 01 Jun 2007 14:39:48 +1000 From: Timothy Shimmin To: David Chinner cc: Michal Marek , xfs@oss.sgi.com Subject: Re: [patch 1/3] Fix XFS_IOC_FSGEOMETRY_V1 in compat mode Message-ID: <0C1BF59AD81186689933280E@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070531132615.GO85884050@sgi.com> References: <20070530125954.706423971@suse.cz> <20070530143043.216024061@suse.cz> <20070531023031.GH85884050@sgi.com> <649C7FF68B1450E03D544BD9@timothy-shimmins-power-mac-g5.local> <20070531132615.GO85884050@sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition:
inline X-archive-position: 11571 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs

--On 31 May 2007 11:26:15 PM +1000 David Chinner wrote:

>> Who would want to use XFS_IOC_FSGEOMETRY_V1?
>>
>> Okay it turns out a whole bunch of our xfs-cmds :-)
>> (Such as xfsdump as Michal mentioned)
>> On Sep/2002, Nathan changed a bunch of them to use v1.
>>   xfsprogs-2.3.0 (03 September 2002)
>>         - Several changes to geometry ioctl callers which will make
>>           the tools useable on older kernel versions too.
>> So he did this so that new tools would work on the older kernels which
>> didn't support the new geom version.
>> So I guess we are stuck with v1 now.
>
> Not necessarily - we could change the tools to use v4, and if that
> didn't exist, then try v1. That way we don't need to support v1 in
> linux, and the tools still run on old kernels.....
>

The problem with that is the old tools won't run on new kernels.
If you get a new kernel and use an old xfsdump then you are out of luck.
Not sure if we want to require people to bump up to new userspace for this.

--Tim
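
The trade-off in this exchange can be made concrete with a small model. The sketch below is illustrative only: it does not call the real XFS ioctls, and `fake_kernel`, `fake_ioctl`, `GEOM_V1`, `GEOM_V4` and both `query_geometry_*` functions are hypothetical names invented for the example. It assumes a kernel rejects an unknown geometry request with ENOTTY, the conventional errno for an unsupported ioctl. A tool that tries v4 first and falls back to v1 works on both old and new kernels, while a tool that only knows v1 fails once the kernel drops v1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request codes standing in for the v1 and v4 geometry ioctls. */
enum geom_req { GEOM_V1 = 1, GEOM_V4 = 4 };

/* Hypothetical kernel model: one bit per geometry version it supports. */
struct fake_kernel {
    unsigned int supported;
};

/* Stand-in for ioctl(2): 0 on success, -1 with errno set if unsupported. */
int fake_ioctl(const struct fake_kernel *k, enum geom_req req)
{
    assert(k != NULL);
    if (k->supported & (1u << req))
        return 0;
    errno = ENOTTY; /* assumed error for an unknown request */
    return -1;
}

/* New-style tool: try v4 first, then fall back to v1.
 * Returns the version that worked, or -1 if neither is supported. */
int query_geometry_new_tool(const struct fake_kernel *k)
{
    if (fake_ioctl(k, GEOM_V4) == 0)
        return GEOM_V4;
    if (fake_ioctl(k, GEOM_V1) == 0)
        return GEOM_V1;
    return -1;
}

/* Old-style tool: only knows v1, so it has no fallback. */
int query_geometry_old_tool(const struct fake_kernel *k)
{
    return fake_ioctl(k, GEOM_V1) == 0 ? GEOM_V1 : -1;
}
```

With a kernel that supports only v1, both tools succeed using v1. With a kernel that supports only v4, the new-style tool succeeds and the old-style tool returns -1, which is the "old xfsdump on a new kernel" case Tim describes.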