From owner-xfs@oss.sgi.com Sun Apr 1 15:45:13 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 01 Apr 2007 15:45:16 -0700 (PDT)
X-Spam-oss-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00
	autolearn=ham version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l31MjA6p010169
	for ; Sun, 1 Apr 2007 15:45:11 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149])
	by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF)
	via ESMTP id IAA12073; Mon, 2 Apr 2007 08:45:01 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1])
	by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l31Mj0Af40434916;
	Mon, 2 Apr 2007 08:45:00 +1000 (AEST)
Received: (from dgc@localhost)
	by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l31MiwTX46334670;
	Mon, 2 Apr 2007 08:44:58 +1000 (AEST)
Date: Mon, 2 Apr 2007 08:44:58 +1000
From: David Chinner
To: Martin Steigerwald
Cc: linux-xfs@oss.sgi.com
Subject: Re: write barrier and USB devices
Message-ID: <20070401224458.GR32597093@melbourne.sgi.com>
References: <200703301523.58027.Martin@lichtvoll.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200703301523.58027.Martin@lichtvoll.de>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11015
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Fri, Mar 30, 2007 at 03:23:57PM +0200, Martin Steigerwald wrote:
>
> Hello!
>
> Does the usb mass storage driver support write barriers?

You should ask the usb folks that question....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

From owner-xfs@oss.sgi.com Mon Apr 2 11:24:06 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 11:24:12 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50
	autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l32IO2kH008050
	for ; Mon, 2 Apr 2007 11:24:04 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141])
	by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF)
	via ESMTP id QAA23891; Mon, 2 Apr 2007 16:30:51 +1000
Date: Mon, 02 Apr 2007 16:31:52 +1100
From: Timothy Shimmin
To: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: review: remove unused ilen var from xfs_vnodeops.c
Message-ID:
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11016
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

simple cleanup patch

===========================================================================
Index: fs/xfs/xfs_vnodeops.c
===========================================================================
--- a/fs/xfs/xfs_vnodeops.c	2007-04-02 15:56:30.000000000 +1000
+++ b/fs/xfs/xfs_vnodeops.c	2007-04-02 15:19:42.926081759 +1000
@@ -4289,7 +4289,6 @@ xfs_free_file_space(
 	int			error;
 	xfs_fsblock_t		firstfsb;
 	xfs_bmap_free_t		free_list;
-	xfs_off_t		ilen;
 	xfs_bmbt_irec_t		imap;
 	xfs_off_t		ioffset;
 	xfs_extlen_t		mod=0;
@@ -4338,10 +4337,7 @@ xfs_free_file_space(
 	}
 
 	rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, NBPP);
-	ilen = len + (offset & (rounding - 1));
 	ioffset = offset & ~(rounding - 1);
-	if (ilen & (rounding - 1))
-		ilen = (ilen + rounding) & ~(rounding - 1);
 
 	if (VN_CACHED(vp) != 0) {
 		xfs_inval_cached_trace(&ip->i_iocore, ioffset, -1,
=============================================================================

I think ilen was removed a while back when we changed a call interface...

revision 1.534
date: 2002/07/08 22:09:30;  author: lord;  state: Exp;  lines: +1 -2
modid: 2.4.x-xfs:slinx:122666a
changes xfs_inval_cached_pages interface

===========================================================================
Index: fs/xfs/xfs_vnodeops.c
===========================================================================
--- a/fs/xfs/xfs_vnodeops.c	2007-04-02 15:38:34.000000000 +1000
+++ b/fs/xfs/xfs_vnodeops.c	2007-04-02 15:38:34.000000000 +1000
@@ -5459,8 +5459,7 @@ xfs_free_file_space(
 	ioffset = offset & ~(rounding - 1);
 	if (ilen & (rounding - 1))
 		ilen = (ilen + rounding) & ~(rounding - 1);
-	xfs_inval_cached_pages(XFS_ITOV(ip), &(ip->i_iocore),
-				ioffset, ilen, NULL, 0);
+	xfs_inval_cached_pages(XFS_ITOV(ip), &(ip->i_iocore), ioffset, 0, 0);
 	/*
 	 * Need to zero the stuff we're not freeing, on disk.
 	 * If its specrt (realtime & can't use unwritten extents) then

--Tim

From owner-xfs@oss.sgi.com Mon Apr 2 11:35:11 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 11:35:14 -0700 (PDT)
X-Spam-oss-Status: No, score=0.0 required=5.0 tests=BAYES_50 autolearn=no
	version=3.2.0-pre1-r499012
Received: from mail.g-house.de (ns2.g-housing.de [81.169.133.75])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32IXskH011758
	for ; Mon, 2 Apr 2007 11:35:11 -0700
Received: from [77.99.119.196] (helo=77-99-119-196.cable.ubr04.linl.blueyonder.co.uk)
	by mail.g-house.de with esmtpsa (TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32)
	(Exim 4.50) id 1HYR6a-0002wK-26
	for xfs@oss.sgi.com; Mon, 02 Apr 2007 20:18:12 +0200
Date: Mon, 2 Apr 2007 19:18:01 +0100 (BST)
From: Christian Kujau
X-X-Sender: evil@sheep.housecafe.de
To: xfs@oss.sgi.com
Subject: possible recursive locking detected
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-archive-position: 11017
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: lists@nerdbynature.de
Precedence: bulk
X-list: xfs

Hi,

when I enabled a few more debug-options in the kernel (vanilla 2.6.21-rc5),
I came across:

[ INFO: possible recursive locking detected ]
2.6.21-rc5 #2
---------------------------------------------
rm/32198 is trying to acquire lock:
 xfs_ilock+0x71/0xa0

but task is already holding lock:
 xfs_ilock+0x71/0xa0

other info that might help us debug this:
3 locks held by rm/32198:
 do_unlinkat+0x96/0x160
 vfs_unlink+0x75/0xe0
 xfs_ilock+0x71/0xa0

stack backtrace:
 __lock_acquire+0xa99/0x1010
 lock_acquire+0x57/0x70
 xfs_ilock+0x71/0xa0
 down_write+0x38/0x50
 xfs_ilock+0x71/0xa0
 xfs_ilock+0x71/0xa0
 xfs_lock_dir_and_entry+0xf6/0x100
 xfs_remove+0x197/0x4e0
 d_instantiate+0x19/0x40
 d_rehash+0x20/0x50
 vfs_unlink+0x75/0xe0
 xfs_vn_unlink+0x23/0x60
 __mutex_lock_slowpath+0x13f/0x280
 mark_held_locks+0x6b/0x90
 __mutex_lock_slowpath+0x13f/0x280
 __mutex_lock_slowpath+0x13f/0x280
 trace_hardirqs_on+0xb9/0x160
 vfs_unlink+0x75/0xe0
 __mutex_lock_slowpath+0x132/0x280
 vfs_unlink+0x75/0xe0
 permission+0x91/0xf0
 vfs_unlink+0x89/0xe0
 do_unlinkat+0xd2/0x160
 sysenter_past_esp+0x8d/0x99
 trace_hardirqs_on+0xb9/0x160
 sysenter_past_esp+0x5d/0x99
 =======================

Is this something I have to worry about?
Please see http://nerdbynature.de/bits/2.6.21-rc5/ for a few more details.

Thanks,
Christian.
-- 
BOFH excuse #372:

Forced to support NT servers; sysadmins quit.

From owner-xfs@oss.sgi.com Mon Apr 2 13:19:25 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 13:19:27 -0700 (PDT)
X-Spam-oss-Status: No, score=-1.3 required=5.0 tests=AWL,BAYES_20
	autolearn=no version=3.2.0-pre1-r499012
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32KJOfB015559
	for ; Mon, 2 Apr 2007 13:19:25 -0700
Received: from hch by pentafluge.infradead.org with local
	(Exim 4.63 #1 (Red Hat Linux)) id 1HYSjB-0006XR-RE;
	Mon, 02 Apr 2007 21:02:09 +0100
Date: Mon, 2 Apr 2007 21:02:09 +0100
From: Christoph Hellwig
To: Timothy Shimmin
Cc: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: Re: review: remove unused ilen var from xfs_vnodeops.c
Message-ID: <20070402200209.GA25101@infradead.org>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org
	See http://www.infradead.org/rpr.html
X-archive-position: 11018
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

On Mon, Apr 02, 2007 at 04:31:52PM +1100, Timothy Shimmin wrote:
> simple cleanup patch

looks good.
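
[Editorial note on the ilen patch reviewed above.] The patch keeps the
round-down of the offset and drops the round-up of the length, which was no
longer consumed. A minimal sketch of the two idioms, assuming `rounding` is a
power of two (it is the larger of the block size and the page size in the
patch). The helper names are hypothetical and are not XFS functions.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; neither exists in XFS.
 * Both assume `rounding` is a power of two, so (rounding - 1) is a
 * mask of the low bits.
 */

/* Mirrors the retained line: ioffset = offset & ~(rounding - 1); */
static uint64_t round_down_pow2(uint64_t offset, uint64_t rounding)
{
	return offset & ~(rounding - 1);
}

/* Mirrors the removed ilen logic: round up only when unaligned. */
static uint64_t round_up_pow2(uint64_t len, uint64_t rounding)
{
	if (len & (rounding - 1))
		len = (len + rounding) & ~(rounding - 1);
	return len;
}
```

With a rounding of 4096, an offset of 5000 rounds down to 4096 and a length
of 5000 rounds up to 8192; already-aligned values are returned unchanged.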
From owner-xfs@oss.sgi.com Mon Apr 2 14:44:47 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 14:44:50 -0700 (PDT)
X-Spam-oss-Status: No, score=3.0 required=5.0 tests=BAYES_95,HTML_MESSAGE
	autolearn=no version=3.2.0-pre1-r499012
Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.243])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32LijfB006582
	for ; Mon, 2 Apr 2007 14:44:46 -0700
Received: by an-out-0708.google.com with SMTP id c5so1347883anc
	for ; Mon, 02 Apr 2007 14:44:44 -0700 (PDT)
DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
	b=B0dKOXK/Z19QmQgxQu99WBEx++uR/fqbhpRomvr1bdl0Go0keJ5m4MEuIy4+8IbZbuj/tKmtmSIhGlVq68O7zFhNJj/YYQTVX3Ww3e5egV2T9HRYdwH4k3r0Stdep02hmpeRp05IoOK5iLST4BmJFwXeo8jk8ESupxftDNe7WiU=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
	h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
	b=Pj9/4vi7bihLMXtvW2pCj7KVnv8jPvl2JSdnm0yBUnbZ1A0Se9Jqx7YsGqs6OIy0rbDBhPLrQ6AqHeOYCivaz7VLdgbi1g4kl+Jh0oRS0bBWJURBb74AgSkaLQs6teRmu/MAbrndaOH0MWOPIZHjLgkSq6n/YX0IjgltU5V8zO8=
Received: by 10.100.144.11 with SMTP id r11mr3858808and.1175548737613;
	Mon, 02 Apr 2007 14:18:57 -0700 (PDT)
Received: by 10.100.200.11 with HTTP; Mon, 2 Apr 2007 14:18:57 -0700 (PDT)
Message-ID: <817da7960704021418g6e1d4662y3250be14bc01ab69@mail.gmail.com>
Date: Mon, 2 Apr 2007 17:18:57 -0400
From: "Charles Weber"
To: "Roger Heflin"
Subject: Re: xfs partial dismount issue
Cc: linux-xfs@oss.sgi.com, sandeen@sandeen.net
In-Reply-To: <45EC868A.4060607@atipa.com>
MIME-Version: 1.0
References: <45EC3DEA.3000105@sandeen.net> <45EC868A.4060607@atipa.com>
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Content-length: 2822
X-archive-position: 11019
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: chaweber@gmail.com
Precedence: bulk
X-list: xfs

Well actually I did a test with ext3 and got the same result. It partially
dismounted after a day or so of use and seemed identical to my previous xfs
filesystem failures. My guess now is that this has always occurred when all
6 raid controllers (2 per card) were in use. I could go quite some time with
5 of the 6 controllers used. I consolidated everything to 2 cards, removed
one card and put in a fiber channel card for my new storage array. So far no
problems. If so then it seems something is funny about the cciss driver.

thanks,
Chuck

On 3/5/07, Roger Heflin wrote:
>
> Charles Weber wrote:
> > Eric Sandeen sandeen.net> writes:
> >
> >> Chuck Weber wrote:
> >>> Hi everyone, I have a long running problem perhaps you can help with. I
> >>> will include as much detail as I can. I can set up a spare server-disk
> >>> set for testing if you have any bright ideas.
> >>>
> >>> We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
> >>> servers. Our busiest server has disk partitions go away.
> >> What do you mean by this, exactly? The partitions themselves go away,
> >> or are you talking about the problem described below where processes
> >> start hanging?
> >>
> > Here is an example partition (1 of 6 or more xfs storage only).
> > /share/store3 with samba shares on /share/store3/lls, lds, lxs and so on.
> > I will get a call saying my groups share (lxs) is no longer accessible.
> > I ssh into server and can ls /share/store3 but ls will hang when I ls
> > /share/store3/lxs. Shortly thereafter ls will hang for the root or any
> > directory on the partition. Other partitions will be fine and other
> > samba shares will be fine until the queued up process load bogs the
> > server down.
> >
>
> Charles,
>
> I have seen what may be a similar issue on SLES9SP2, we had 1 xfs
> partition, and under certain conditions it would stop responding, all
> non-xfs partitions were ok, and everything was fine after a reboot.
>
> Under sysrq-t it appeared to me that 2 separate processes were calling
> fsync and were causing each other to deadlock (and locking all others
> out of changing the xfs partition). I was not able to determine exactly
> what the underlying bug was, but all of the hung processes
> were waiting on locks in at least several widely different parts of the
> xfs and kernel code, and adjusting the application to not fsync has
> apparently resulted in the deadlock not occurring. In this case
> there were multiple (2-4) different instances of the application calling
> fsync apparently sometimes at close to the same time. With the
> given application the failure was almost a certainty on one machine
> (of 100) running the application overnight.
>
> Roger
>

[[HTML alternate version deleted]]

From owner-xfs@oss.sgi.com Mon Apr 2 16:46:21 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 16:46:22 -0700 (PDT)
X-Spam-oss-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00
	autolearn=no version=3.2.0-pre1-r499012
Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l32NkIfB010080
	for ; Mon, 2 Apr 2007 16:46:20 -0700
Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254])
	by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkHUX007234;
	Mon, 2 Apr 2007 19:46:17 -0400
Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15])
	by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkHod021971;
	Mon, 2 Apr 2007 19:46:17 -0400
Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10])
	by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l32NkGlR000765;
	Mon, 2 Apr 2007 19:46:16 -0400
Message-ID: <46119596.1020900@sandeen.net>
Date: Mon, 02 Apr 2007 18:45:26 -0500
From: Eric Sandeen
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: Timothy Shimmin
CC: xfs-dev@sgi.com, xfs@oss.sgi.com
Subject: Re: review: remove unused ilen var from xfs_vnodeops.c
References:
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-archive-position: 11020
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: sandeen@sandeen.net
Precedence: bulk
X-list: xfs

Timothy Shimmin wrote:
> simple cleanup patch
>

Hey that's my job! ;-)

Looks good

-Eric

From owner-xfs@oss.sgi.com Mon Apr 2 17:54:12 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 17:54:15 -0700 (PDT)
X-Spam-oss-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00
	autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l330s9fB000643
	for ; Mon, 2 Apr 2007 17:54:11 -0700
Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237])
	by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF)
	via ESMTP id KAA23394 for ; Tue, 3 Apr 2007 10:54:07 +1000
Received: by chook.melbourne.sgi.com (Postfix, from userid 1116)
	id C15D458FF80A; Tue, 3 Apr 2007 10:54:07 +1000 (EST)
To: xfs@oss.sgi.com
Subject: TAKE cleanup - remove ilen refs from vnodeops.c
Message-Id: <20070403005407.C15D458FF80A@chook.melbourne.sgi.com>
Date: Tue, 3 Apr 2007 10:54:07 +1000 (EST)
From: tes@sgi.com (Tim Shimmin)
X-archive-position: 11021
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs

Remove unused ilen variable and references.

Date: Tue Apr 3 10:53:20 AEST 2007
Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs
Inspected by: lachlan@sgi.com,sandeen@sandeen.net

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb

Modid: xfs-linux-melb:xfs-kern:28344a

fs/xfs/xfs_vnodeops.c - 1.694 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.694&r2=text&tr2=1.693&f=h
	- Remove unused ilen variable and references.

From owner-xfs@oss.sgi.com Mon Apr 2 23:09:38 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 02 Apr 2007 23:09:40 -0700 (PDT)
X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50
	autolearn=no version=3.2.0-pre1-r499012
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3369YfB022611
	for ; Mon, 2 Apr 2007 23:09:36 -0700
Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141])
	by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF)
	via ESMTP id QAA01072; Tue, 3 Apr 2007 16:09:33 +1000
Date: Tue, 03 Apr 2007 16:10:38 +1100
From: Timothy Shimmin
To: xfs@oss.sgi.com, xfs-dev@sgi.com
Subject: review: export xfs_buftarg_list for use by xfsidbg
Message-ID:
X-Mailer: Mulberry/4.0.8 (Mac OS X)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11023
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tes@sgi.com
Precedence: bulk
X-list: xfs
Content-Length: 6296
Lines: 185

Hi,

This patch addresses the problem of having xfs_buftarg_list
global for use by kdb and xfsidbg, where otherwise it would be static.
If we are using xfsidbg, then we export the xfs_get_buftarg_list
function to it.
Previous to Dave's (dgc) static changes, we were only globalizing it
if we were in DEBUG - when really it is a question of xfsidbg.
Using CONFIG_KDB_MODULES as this is what we use in the Makefiles for
determining if xfsidbg is used.

--Tim

 linux-2.4/xfs_buf.c   |    8 ++++++++
 linux-2.4/xfs_buf.h   |    3 +++
 linux-2.4/xfs_ksyms.c |    5 ++---
 linux-2.6/xfs_buf.c   |   10 +++++++++-
 linux-2.6/xfs_buf.h   |    3 +++
 linux-2.6/xfs_ksyms.c |    5 ++---
 xfsidbg.c             |    9 +++------
 7 files changed, 30 insertions(+), 13 deletions(-)

===========================================================================
Index: fs/xfs/linux-2.4/xfs_buf.c
===========================================================================
--- a/fs/xfs/linux-2.4/xfs_buf.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_buf.c	2007-04-03 15:45:23.930213823 +1000
@@ -2335,3 +2335,11 @@ xfs_buf_terminate(void)
 	kmem_zone_destroy(xfs_buf_zone);
 	kmem_shake_deregister(xfs_buf_shake);
 }
+
+#ifdef CONFIG_KDB_MODULES
+struct list_head *
+xfs_get_buftarg_list(void)
+{
+	return &xfs_buftarg_list;
+}
+#endif
===========================================================================
Index: fs/xfs/linux-2.4/xfs_buf.h
===========================================================================
--- a/fs/xfs/linux-2.4/xfs_buf.h	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_buf.h	2007-04-03 15:46:59.997634360 +1000
@@ -500,6 +500,9 @@ extern void xfs_free_buftarg(xfs_buftarg
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+#ifdef CONFIG_KDB_MODULES
+extern struct list_head *xfs_get_buftarg_list(void);
+#endif
 
 #define xfs_getsize_buftarg(buftarg)	block_size((buftarg)->bt_kdev)
 #define xfs_readonly_buftarg(buftarg)	is_read_only((buftarg)->bt_kdev)
===========================================================================
Index: fs/xfs/linux-2.4/xfs_ksyms.c
===========================================================================
--- a/fs/xfs/linux-2.4/xfs_ksyms.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.4/xfs_ksyms.c	2007-04-03 15:46:29.489629322 +1000
@@ -124,9 +124,8 @@ EXPORT_SYMBOL(xfs_params);
 EXPORT_SYMBOL(xfs_bmbt_disk_get_all);
 #endif
 
-#if defined(CONFIG_XFS_DEBUG)
-extern struct list_head xfs_buftarg_list;
-EXPORT_SYMBOL(xfs_buftarg_list);
+#if defined(CONFIG_KDB_MODULES)
+EXPORT_SYMBOL(xfs_get_buftarg_list);
 #endif
 
 /*
===========================================================================
Index: fs/xfs/linux-2.6/xfs_buf.c
===========================================================================
--- a/fs/xfs/linux-2.6/xfs_buf.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_buf.c	2007-04-03 15:11:15.778300965 +1000
@@ -1426,7 +1426,7 @@ xfs_free_bufhash(
 /*
  * buftarg list for delwrite queue processing
  */
-LIST_HEAD(xfs_buftarg_list);
+static LIST_HEAD(xfs_buftarg_list);
 static DEFINE_SPINLOCK(xfs_buftarg_lock);
 
 STATIC void
@@ -1867,3 +1867,11 @@ xfs_buf_terminate(void)
 	ktrace_free(xfs_buf_trace_buf);
 #endif
 }
+
+#ifdef CONFIG_KDB_MODULES
+struct list_head *
+xfs_get_buftarg_list(void)
+{
+	return &xfs_buftarg_list;
+}
+#endif
===========================================================================
Index: fs/xfs/linux-2.6/xfs_buf.h
===========================================================================
--- a/fs/xfs/linux-2.6/xfs_buf.h	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_buf.h	2007-04-03 15:22:51.547106965 +1000
@@ -411,6 +411,9 @@ extern void xfs_free_buftarg(xfs_buftarg
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+#ifdef CONFIG_KDB_MODULES
+extern struct list_head *xfs_get_buftarg_list(void);
+#endif
 
 #define xfs_getsize_buftarg(buftarg)	block_size((buftarg)->bt_bdev)
 #define xfs_readonly_buftarg(buftarg)	bdev_read_only((buftarg)->bt_bdev)
===========================================================================
Index: fs/xfs/linux-2.6/xfs_ksyms.c
===========================================================================
--- a/fs/xfs/linux-2.6/xfs_ksyms.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/linux-2.6/xfs_ksyms.c	2007-04-03 14:52:27.011000730 +1000
@@ -125,9 +125,8 @@ EXPORT_SYMBOL(xfs_params);
 EXPORT_SYMBOL(xfs_bmbt_disk_get_all);
 #endif
 
-#if defined(CONFIG_XFS_DEBUG)
-extern struct list_head xfs_buftarg_list;
-EXPORT_SYMBOL(xfs_buftarg_list);
+#if defined(CONFIG_KDB_MODULES)
+EXPORT_SYMBOL(xfs_get_buftarg_list);
 #endif
 
 /*
===========================================================================
Index: fs/xfs/xfsidbg.c
===========================================================================
--- a/fs/xfs/xfsidbg.c	2007-04-03 15:47:15.000000000 +1000
+++ b/fs/xfs/xfsidbg.c	2007-04-03 15:24:02.201877199 +1000
@@ -62,6 +62,7 @@
 #include "xfs_quota.h"
 #include "quota/xfs_qm.h"
 #include "xfs_iomap.h"
+#include "xfs_buf.h"
 
 MODULE_AUTHOR("Silicon Graphics, Inc.");
 MODULE_DESCRIPTION("Additional kdb commands for debugging XFS");
@@ -2350,8 +2351,7 @@ kdbm_bp(int argc, const char **argv)
 static int
 kdbm_bpdelay(int argc, const char **argv)
 {
-#ifdef DEBUG
-	extern struct list_head xfs_buftarg_list;
+	struct list_head *xfs_buftarg_list = xfs_get_buftarg_list();
 	struct list_head *curr, *next;
 	xfs_buftarg_t *tp, *n;
 	xfs_buf_t bp;
@@ -2372,7 +2372,7 @@ kdbm_bpdelay(int argc, const char **argv
 	}
 
 
-	list_for_each_entry_safe(tp, n, &xfs_buftarg_list, bt_list) {
+	list_for_each_entry_safe(tp, n, xfs_buftarg_list, bt_list) {
 		list_for_each_safe(curr, next, &tp->bt_delwrite_queue) {
 			addr = (unsigned long)list_entry(curr, xfs_buf_t, b_list);
 			if ((diag = kdb_getarea(bp, addr)))
@@ -2388,9 +2388,6 @@ kdbm_bpdelay(int argc, const char **argv
 			}
 		}
 	}
-#else
-	kdb_printf("bt_delwrite_queue inaccessible (non-debug)\n");
-#endif
 	return 0;
 }
 

From owner-xfs@oss.sgi.com Tue Apr 3 05:01:49 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 05:01:56 -0700 (PDT)
X-Spam-oss-Status: No, score=3.5 required=5.0 tests=BAYES_99 autolearn=no
	version=3.2.0-pre1-r499012
Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.245])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l33C1lfB004228
	for ; Tue, 3 Apr 2007 05:01:48 -0700
Received: by an-out-0708.google.com with SMTP id c5so1529305anc
	for ; Tue, 03 Apr 2007 05:01:47 -0700 (PDT)
DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta;
	h=domainkey-signature:received:received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition;
	b=txHZ6dlfmDTAXzC1JJfxIC4zr23HysF+vBV63kPFJQojyrwYBxC+zarceB7LmuBqtg+GU2SQX9tFi5Miwz6XGDX79fdFvkAzmfEmatq5Ti3j0iqD9L3BDryfSKlBN2uqPNXX6oq/B6rX0trdWSr3ZltvhGKbXgbqyy2rdj5FEPg=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta;
	h=received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition;
	b=kzkQhrK0HCVnZ5TB83FgwDJEDJKg4Znm2oneaxe2FYT0dFJreyO4NtExZAfNS9pfZYvukd4EUpRjlmra1+0X7VaWc4xuFKNhB16tPPw3gefMiRbv7k7Rksrt3A70pAPDnCkyPdoHXcDXBe+Rls0h+3ic02qiNEV+oj2afLR68+I=
Received: by 10.100.91.6 with SMTP id o6mr4212834anb.1175598219092;
	Tue, 03 Apr 2007 04:03:39 -0700 (PDT)
Received: by 10.100.138.16 with HTTP; Tue, 3 Apr 2007 04:03:39 -0700 (PDT)
Message-ID: <12fac1030704030403t3ffc3599w5a0191476eb8b865@mail.gmail.com>
Date: Tue, 3 Apr 2007 13:03:39 +0200
From: Sencer
To: xfs@oss.sgi.com
Subject: md/dm devices, barrier support, commodity hardware and data integrity
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-archive-position: 11024
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: alisencer@gmail.com
Precedence: bulk
X-list: xfs

Hello,

After reading up
on XFS, there are a couple of issues that still seem kind of cloudy to me.
I am merely a user of filesystems, so forgive me if some issues seem
obvious. If you could confirm/clarify/answer the following issues, it
would be very helpful to me.

Situation 1)
We are currently using XFS on a commodity x86 server with SATA drives
(with NCQ) on Debian Etch (Kernel: 2.6.18-3-k7). We are also using
Software-Raid1 (mdadm). All partitions except /boot are XFS. If I
understand the FAQ and recent ml-discussions right, then

1a) without software-raid, we would enjoy write barrier support, however
given that we are using md-devices this is not the case (kern.log
confirms this by explicitly stating barrier support is disabled for mdX
...). Did this (barrier support with XFS on md) change in later kernels
or is it likely to change in the near (or far) future? (I think I read
mentions of md, and some kind of barrier-awareness on the ml, but didn't
quite understand what effectively follows from it from a user's POV).

1b) Given the current circumstances above, we should disable write cache
as suggested in the faq (there are actually UPS's but they've failed
before) to reduce the possibility of losing data. Correct? We did need
to do some hard-resets, and had power failure, though as of yet we never
had problems with lost data on any xfs partitions, and I'd like to make
sure it stays that way.

1c) We have backup strategies in place, so I can live with having a few
partly damaged files and restoring them from backups. However I am not
sure how we would make sure that we can find out about all such damaged
files or if any such files exist ( referring to
http://oss.sgi.com/projects/xfs/faq.html#nulls ). Are there tools for
finding potential candidates for corruption? I am assuming that there
would be a way to find out which files were the most recently touched
with the help of the journal.
Or do I just use shell-magic and find files by mtime and check if there
are Nulls at the end of those files modified within the last minute or
two before the crash?

Situation 2)
I hear many people saying that using XFS on machines that have no UPS (as
in Notebooks [battery removed], Desktops etc.) is something that is not
recommended. But after reading up on the issues, the recommendation
should really go for every FS that only does meta-data journaling, as
alluded to in the FAQ.

2a) And with the recent changes (barrier support and sync on
truncated+modified+closed files) I assume there is really no reason to
choose another meta-data journaling FS over XFS for such machines in
terms of likelihood of damaged files after hard-resets and power
failures - would you agree?

2b) When dm-crypt+luks is being used, there is no barrier support
available (for XFS) even if the underlying hdd supports it, correct?
Should this be expected to change, or is it more likely to stay that
way? (due to limited dev. resources and priorities? or due to principal
issues with it?)

Thanks in advance
Sencer

From owner-xfs@oss.sgi.com Tue Apr 3 12:40:59 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 12:41:00 -0700 (PDT)
X-Spam-oss-Status: No, score=0.7 required=5.0 tests=AWL,BAYES_60,
	J_CHICKENPOX_45 autolearn=no version=3.2.0-pre1-r499012
Received: from barcelona.int.jammed.com (jammed.com [216.99.218.161])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l33JevfB027379
	for ; Tue, 3 Apr 2007 12:40:58 -0700
Received: from barcelona.int.jammed.com (barcelona.int.jammed.com [172.16.64.15])
	by barcelona.int.jammed.com (Postfix) with ESMTP id D5AA2BD6E
	for ; Tue, 3 Apr 2007 12:11:46 -0700 (PDT)
Date: Tue, 3 Apr 2007 12:11:46 -0700 (PDT)
From: "James W. Abendschan"
X-X-Sender: jwa@barcelona.int.jammed.com
To: xfs@oss.sgi.com
Subject: xfs_repair segfault
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 11025
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: jwa@jammed.com
Precedence: bulk
X-list: xfs

Hi there -- I have a 6.9TB XFS volume that is acting up
after a power failure (I understand XFS + no UPS + PC
hardware == badness. Not my decision.)

The machine is a dual proc x86 (intel xeon 5130) w/ 8GB RAM
running a custom 2.6.18 kernel on top of Ubuntu 6.06.

Since xfs_check can't repair volumes of this size without
scads of memory, I've been using xfs_repair to correct
power-related problems before.

Unfortunately, for some reason xfs_repair is segfaulting:

# ulimit -c unlimited
# xfs_repair -v /dev/md1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
zero_log: head block 8 tail block 8
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - clear lost+found (if it exists) ...
        - clearing existing "lost+found" inode
Segmentation fault (core dumped)


gdb doesn't show anything useful (I don't know how to interpret
the I/O error) :


# gdb /sbin/xfs_repair core
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i486-linux-gnu"...(no debugging symbols found) Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1". (no debugging symbols found) Core was generated by `xfs_repair -v /dev/md1'. Program terminated with signal 11, Segmentation fault. warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/libuuid.so.1 Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no debugging symbols found) Loaded symbols for /lib/tls/i686/cmov/libc.so.6 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/ld-linux.so.2 #0 0x08052f42 in ?? () (gdb) bt #0 0x08052f42 in ?? () #1 0x000088e9 in ?? () #2 0x00000800 in ?? () #3 0x00000080 in ?? () #4 0x00000000 in ?? () What's the next step? Thanks, James From owner-xfs@oss.sgi.com Tue Apr 3 17:42:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 03 Apr 2007 17:42:42 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_45 autolearn=no version=3.2.0-pre1-r499012 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l340gbfB023972 for ; Tue, 3 Apr 2007 17:42:39 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA01820; Wed, 4 Apr 2007 10:42:30 +1000 Message-Id: <200704040042.KAA01820@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'James W. 
Abendschan'" , Subject: RE: xfs_repair segfault Date: Wed, 4 Apr 2007 10:45:47 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: Thread-Index: Acd2KC2mjxULtKq3T3C2GrfXBg0FrAAKdumQ X-archive-position: 11026 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi James, Would it be possible for you to apply the patch I posted to xfs@oss in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html to the latest xfsprogs source, make and install it, and run: # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 And make the image available for me to download and analyse? Regards, Barry. > -----Original Message----- > From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] > On Behalf Of James W. Abendschan > Sent: Wednesday, 4 April 2007 5:12 AM > To: xfs@oss.sgi.com > Subject: xfs_repair segfault > > Hi there -- I have a 6.9TB XFS volume that is acting up > after a power failure (I understand XFS + no UPS + PC > hardware == badness. Not my decision.) > > The machine is a dual proc x86 (intel xeon 5130) w/ 8GB RAM > running a custom 2.6.18 kernel on top of Ubuntu 6.06. > > Since xfs_check can't repair volumes of this size without > scads of memory, I've been using xfs_repair to correct > power-related problems before. > > Unfortunately, for some reason xfs_repair is segfaulting: > > # ulimit -c unlimited > # xfs_repair -v /dev/md1 > Phase 1 - find and verify superblock... > Phase 2 - using internal log > - zero log... > zero_log: head block 8 tail block 8 > - scan filesystem freespace and inode maps... > - found root inode chunk > Phase 3 - for each AG... > - scan and clear agi unlinked lists... > - process known inodes and perform inode discovery...
> - agno = 0 > - agno = 1 > - agno = 2 > - agno = 3 > - agno = 4 > - agno = 5 > - agno = 6 > - agno = 7 > - agno = 8 > - agno = 9 > - agno = 10 > - agno = 11 > - agno = 12 > - agno = 13 > - agno = 14 > - agno = 15 > - agno = 16 > - agno = 17 > - agno = 18 > - agno = 19 > - agno = 20 > - agno = 21 > - agno = 22 > - agno = 23 > - agno = 24 > - agno = 25 > - agno = 26 > - agno = 27 > - agno = 28 > - agno = 29 > - agno = 30 > - agno = 31 > - process newly discovered inodes... > Phase 4 - check for duplicate blocks... > - setting up duplicate extent list... > - clear lost+found (if it exists) ... > - clearing existing "lost+found" inode > Segmentation fault (core dumped) > > > gdb doesn't show anything useful (I don't know how to interpret > the I/O error) : > > > # gdb /sbin/xfs_repair core > GNU gdb 6.4-debian > Copyright 2005 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public > License, and you are > welcome to change it and/or distribute copies of it under > certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show > warranty" for details. > This GDB was configured as "i486-linux-gnu"...(no debugging > symbols found) > Using host libthread_db library > "/lib/tls/i686/cmov/libthread_db.so.1". > > (no debugging symbols found) > Core was generated by `xfs_repair -v /dev/md1'. > Program terminated with signal 11, Segmentation fault. > > warning: Can't read pathname for load map: Input/output error. > Reading symbols from /lib/libuuid.so.1...(no debugging > symbols found)...done. > Loaded symbols for /lib/libuuid.so.1 > Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no > debugging symbols found) > Loaded symbols for /lib/tls/i686/cmov/libc.so.6 > Reading symbols from /lib/ld-linux.so.2...(no debugging > symbols found)...done. > Loaded symbols for /lib/ld-linux.so.2 > > #0 0x08052f42 in ?? () > (gdb) bt > #0 0x08052f42 in ?? () > #1 0x000088e9 in ?? 
() > #2 0x00000800 in ?? () > #3 0x00000080 in ?? () > #4 0x00000000 in ?? () > > > What's the next step? > > Thanks, > James > > > From owner-xfs@oss.sgi.com Wed Apr 4 06:24:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:24:27 -0700 (PDT) X-Spam-oss-Status: No, score=0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DOKfB022642 for ; Wed, 4 Apr 2007 06:24:21 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 373B8BF30 for ; Wed, 4 Apr 2007 15:05:39 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 11168-04 for ; Wed, 4 Apr 2007 15:05:35 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 3AA7CBF84; Wed, 4 Apr 2007 15:05:35 +0200 (CEST) Date: Wed, 4 Apr 2007 15:05:35 +0200 From: Thomas Kaehn To: xfs@oss.sgi.com Subject: Strange delete performance using XFS Message-ID: <20070404130535.GE18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.5.9i X-archive-position: 11033 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 2042 Lines: 62 Hi, I've got a strange problem on one machine using XFS. Deleting large directories (containing about 100000 files, 20k each) using "rm -rf" takes nearly as long as creating the files using a bash loop. The machine is running Debian Sarge with a vanilla 2.6.20.3 kernel.
CPU: Dual Xeon(TM) CPU 3.20GHz RAM: 4 GB RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP (Firmware Version = FE9X 3.08.00.004) The XFS filesystem was first created with default options and later with "-d su=64k,sw=2 -l su=64k", which improved overall performance but not delete performance. Has anyone noticed similar effects? On a different server (Dell 6850) the directory can be deleted within seconds. What could be the reason for the huge difference in delete performance? Please see below for "time" output. | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 6m6.814s | user 0m30.290s | sys 2m42.562s | # time rm -rf y | | real 5m18.034s | user 0m0.036s | sys 0m8.169s In contrast, the result on the Dell machine looks more reasonable: | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 9m26.658s | user 0m24.134s | sys 3m3.623s | # time rm -rf x | | real 0m10.254s | user 0m0.124s | sys 0m10.105s Ciao, Thomas PS: Using JFS and ext3 it is also possible to delete the above directory in a couple of seconds. Only XFS seems problematic in this regard on this system.
-- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 06:29:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:29:55 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DTlfB024034 for ; Wed, 4 Apr 2007 06:29:49 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id B8834F063B21; Wed, 4 Apr 2007 09:29:46 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id B314A7176724; Wed, 4 Apr 2007 09:29:46 -0400 (EDT) Date: Wed, 4 Apr 2007 09:29:46 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404130535.GE18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-2087494152-1175693386=:7309" X-archive-position: 11034 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2784 Lines: 86 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. 
---1463747160-2087494152-1175693386=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi, > > I've got a strange problem on one machine using XFS. Deleting large > directories (containing about 100000 files, 20k each) using "rm -rf" > lasts nearly as long as creating the the files using a bash loop. > > The machine is running Debian Sarge with a vanilla 2.6.20.3 kernel. > CPU: Dual Xeon(TM) CPU 3.20GHz > RAM: 4 GB > RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP > (Firmware Version = FE9X 3.08.00.004) > > The XFS was first created using default options and later on with > "-d su=64k,sw=2 -l su=64k" which improved overall performance > but not delete performance. > > Has anyone realized similar effects? On a different server (Dell 6850) > the directory can be deleted within seconds. What could be the reason > for the huge difference in delete performance? > > Please see below for "time" output. > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 6m6.814s > | user 0m30.290s > | sys 2m42.562s > | # time rm -rf y > | > | real 5m18.034s > | user 0m0.036s > | sys 0m8.169s > > In contrast to this the result on the Dell machine looks more > reasonable: > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 9m26.658s > | user 0m24.134s > | sys 3m3.623s > | # time rm -rf x > | > | real 0m10.254s > | user 0m0.124s > | sys 0m10.105s > > Ciao, > Thomas > > PS: Using JFS and ext3 it is also possible to delete the above directory > in a couple of seconds. Only XFS seems problematic in this regard on > this system.
> -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > Deletes are one area where XFS is a little slower than other filesystems. You can increase the log size during the creation of the filesystem and also increase logbufs to 8, and that might help. ---1463747160-2087494152-1175693386=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:47:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:47:38 -0700 (PDT) X-Spam-oss-Status: No, score=0.6 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DlTfB027541 for ; Wed, 4 Apr 2007 06:47:31 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 74F7CC159; Wed, 4 Apr 2007 15:47:29 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 26351-05; Wed, 4 Apr 2007 15:47:24 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id E983AC126; Wed, 4 Apr 2007 15:47:24 +0200 (CEST) Date: Wed, 4 Apr 2007 15:47:24 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070404134724.GF18320@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11035 X-ecartis-version:
Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1850 Lines: 49 Hi Justin, On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: > >Please see below for "time" output. > > > >| # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 > >>/dev/null 2>&1; done > >| > >| real 6m6.814s > >| user 0m30.290s > >| sys 2m42.562s > >| # time rm -rf y > >| > >| real 5m18.034s > >| user 0m0.036s > >| sys 0m8.169s > Deletes on XFS is one area that is a little slower than other filesystems. > You can increase the log size during the creation of the filesystem and > also increase logbufs to 8 and that might help. Thanks for your suggestions. I also tried to increase the log size and logbufs mount option. This optimizes create and delete times to the above values (with default options both are around 9-10 minutes). The strange thing is that on a similar Dell machines using XFS, too, deletes take only ten seconds which would match user and system time. More than five minutes for deleting 100000 files where ext3 needs 3 seconds on the same machine is actually more than a little bit slower - to my mind there must be something wrong. JFS needs around 18 seconds. However I am not sure if the problem is hardware or software related. I've also tried to use the newest 3ware firmware - but this did not lead to an improvement. 
Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 06:51:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:51:20 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DpAfB028519 for ; Wed, 4 Apr 2007 06:51:12 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 5A323F063B21; Wed, 4 Apr 2007 09:51:10 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 553C57176724; Wed, 4 Apr 2007 09:51:10 -0400 (EDT) Date: Wed, 4 Apr 2007 09:51:10 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-732358276-1175694670=:7309" X-archive-position: 11036 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2419 Lines: 72 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. 
---1463747160-732358276-1175694670=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. >>> >>> | # time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k = count=3D20 >>>> /dev/null 2>&1; done >>> | >>> | real 6m6.814s >>> | user 0m30.290s >>> | sys 2m42.562s >>> | # time rm -rf y >>> | >>> | real 5m18.034s >>> | user 0m0.036s >>> | sys 0m8.169s > >> Deletes on XFS is one area that is a little slower than other filesystem= s. >> You can increase the log size during the creation of the filesystem and >> also increase logbufs to 8 and that might help. > > Thanks for your suggestions. > > I also tried to increase the log size and logbufs mount option. This > optimizes create and delete times to the above values (with default optio= ns > both are around 9-10 minutes). > > The strange thing is that on a similar Dell machines using XFS, too, > deletes take only ten seconds which would match user and system time. > > More than five minutes for deleting 100000 files where ext3 needs > 3 seconds on the same machine is actually more than a little bit slower > - to my mind there must be something wrong. JFS needs around 18 seconds. > > However I am not sure if the problem is hardware or software related. > I've also tried to use the newest 3ware firmware - but this did not lead > to an improvement. 
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > I am running some benchmarks with SW RAID and will present my findings shortly. ---1463747160-732358276-1175694670=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:57:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:57:26 -0700 (PDT) X-Spam-oss-Status: No, score=-0.9 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DvKfB030429 for ; Wed, 4 Apr 2007 06:57:21 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id CFFF2F063B21; Wed, 4 Apr 2007 09:57:16 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id CAB0E7176724; Wed, 4 Apr 2007 09:57:16 -0400 (EDT) Date: Wed, 4 Apr 2007 09:57:16 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-295697546-1175695036=:7309" X-archive-position: 11037 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2757 Lines: 90 This message is in MIME format.
The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-295697546-1175695036=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. >>> >>> | # time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k = count=3D20 >>>> /dev/null 2>&1; done >>> | >>> | real 6m6.814s >>> | user 0m30.290s >>> | sys 2m42.562s >>> | # time rm -rf y >>> | >>> | real 5m18.034s >>> | user 0m0.036s >>> | sys 0m8.169s > >> Deletes on XFS is one area that is a little slower than other filesystem= s. >> You can increase the log size during the creation of the filesystem and >> also increase logbufs to 8 and that might help. > > Thanks for your suggestions. > > I also tried to increase the log size and logbufs mount option. This > optimizes create and delete times to the above values (with default optio= ns > both are around 9-10 minutes). > > The strange thing is that on a similar Dell machines using XFS, too, > deletes take only ten seconds which would match user and system time. > > More than five minutes for deleting 100000 files where ext3 needs > 3 seconds on the same machine is actually more than a little bit slower > - to my mind there must be something wrong. JFS needs around 18 seconds. > > However I am not sure if the problem is hardware or software related. > I've also tried to use the newest 3ware firmware - but this did not lead > to an improvement. 
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > The benchmark: $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done 1. Six 400GB SATA drives using SW RAID5: real 6m24.411s user 0m43.097s sys 2m17.350s 2. Four Raptor 150 ADFD drives using SW RAID5: real 3m16.962s user 0m42.899s sys 2m15.420s 3. Two Raptor 74GB *GD drives using SW RAID1: real 3m19.241s user 0m41.731s sys 2m15.873s ---1463747160-295697546-1175695036=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 06:57:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 06:58:03 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34DvufB030663 for ; Wed, 4 Apr 2007 06:57:57 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 4A17FF063B21; Wed, 4 Apr 2007 09:57:56 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 269017176724; Wed, 4 Apr 2007 09:57:56 -0400 (EDT) Date: Wed, 4 Apr 2007 09:57:56 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1613578982-1175695076=:7309" X-archive-position: 11038
X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2627 Lines: 79 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1613578982-1175695076=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >>=20 >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. >>>>=20 >>>> | # time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k= count=3D20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >>=20 >>> Deletes on XFS is one area that is a little slower than other filesyste= ms. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >>=20 >> Thanks for your suggestions. >>=20 >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default opti= ons >> both are around 9-10 minutes). >>=20 >> The strange thing is that on a similar Dell machines using XFS, too, >> deletes take only ten seconds which would match user and system time. >>=20 >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >>=20 >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. 
>> >> Ciao, >> Thomas >> -- >> Thomas Kähn WESTEND GmbH | Internet-Business-Provider >> Technik CISCO Systems Partner - Authorized Reseller >> Im Süsterfeld 6 Tel 0241/701333-18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb >> >> > > I am running some benchmarks with SW RAID and will present my findings > shortly. Removal tests are coming shortly; benchmarking is always interesting. ---1463747160-1613578982-1175695076=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 07:12:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:12:57 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34ECmfB002491 for ; Wed, 4 Apr 2007 07:12:50 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 6A9EEF063B21; Wed, 4 Apr 2007 10:12:48 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 655AF7176724; Wed, 4 Apr 2007 10:12:48 -0400 (EDT) Date: Wed, 4 Apr 2007 10:12:48 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1228164866-1175695968=:7309" X-archive-position: 11039 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 3274 Lines: 117 This
message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1228164866-1175695968=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >>=20 >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. >>>>=20 >>>> | # time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k= count=3D20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >>=20 >>> Deletes on XFS is one area that is a little slower than other filesyste= ms. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >>=20 >> Thanks for your suggestions. >>=20 >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default opti= ons >> both are around 9-10 minutes). >>=20 >> The strange thing is that on a similar Dell machines using XFS, too, >> deletes take only ten seconds which would match user and system time. >>=20 >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >>=20 >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. 
>>=20 >> Ciao, >> Thomas >> --=20 >> Thomas K=E4hn WESTEND GmbH | Internet-Business-Provi= der >> Technik CISCO Systems Partner - Authorized Reseller >> Im S=FCsterfeld 6 Tel 0241/701333-= 18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Gesch=E4ftsf=FChrer: Thomas Neugebauer, Thomas Heller, Michael= Kolb >>=20 >>=20 > > The benchmark: > $ time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k coun= t=3D20=20 >> /dev/null 2>&1; done > > 1. Six 400GB SATA drives using SW RAID5: > real 6m24.411s > user 0m43.097s > sys 2m17.350s > > 2. Four Raptor 150 ADFD drives using SW RAID5: > real 3m16.962s > user 0m42.899s > sys 2m15.420s > > 3. Two Raptor 74GB *GD drives using SW RAID1: > real 3m19.241s > user 0m41.731s > sys 2m15.873s > > The removals: The benchmark: $ time rm -rf test 1. Six 400GB SATA drives using SW RAID5: real 0m33.996s user 0m0.057s sys 0m8.101s 2. Four Raptor 150 ADFD drives using SW RAID5: real 0m43.967s user 0m0.071s sys 0m8.340s 3. 
Two Raptor 74GB *GD drives using SW RAID1: real 0m32.965s user 0m0.049s sys 0m6.307s ---1463747160-1228164866-1175695968=:7309-- From owner-xfs@oss.sgi.com Wed Apr 4 07:13:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:13:43 -0700 (PDT) X-Spam-oss-Status: No, score=-0.8 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EDafB002895 for ; Wed, 4 Apr 2007 07:13:39 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 55C01F080A24; Wed, 4 Apr 2007 10:13:36 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 50D137176724; Wed, 4 Apr 2007 10:13:36 -0400 (EDT) Date: Wed, 4 Apr 2007 10:13:36 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-2094664383-1175696016=:7309" X-archive-position: 11040 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 3303 Lines: 109 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-2094664383-1175696016=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Justin Piszcz wrote: > > > On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >> Hi Justin, >>=20 >> On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>>> Please see below for "time" output. 
>>>>=20 >>>> | # time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k= count=3D20 >>>>> /dev/null 2>&1; done >>>> | >>>> | real 6m6.814s >>>> | user 0m30.290s >>>> | sys 2m42.562s >>>> | # time rm -rf y >>>> | >>>> | real 5m18.034s >>>> | user 0m0.036s >>>> | sys 0m8.169s >>=20 >>> Deletes on XFS is one area that is a little slower than other filesyste= ms. >>> You can increase the log size during the creation of the filesystem and >>> also increase logbufs to 8 and that might help. >>=20 >> Thanks for your suggestions. >>=20 >> I also tried to increase the log size and logbufs mount option. This >> optimizes create and delete times to the above values (with default opti= ons >> both are around 9-10 minutes). >>=20 >> The strange thing is that on a similar Dell machines using XFS, too, >> deletes take only ten seconds which would match user and system time. >>=20 >> More than five minutes for deleting 100000 files where ext3 needs >> 3 seconds on the same machine is actually more than a little bit slower >> - to my mind there must be something wrong. JFS needs around 18 seconds. >>=20 >> However I am not sure if the problem is hardware or software related. >> I've also tried to use the newest 3ware firmware - but this did not lead >> to an improvement. >>=20 >> Ciao, >> Thomas >> --=20 >> Thomas K=E4hn WESTEND GmbH | Internet-Business-Provi= der >> Technik CISCO Systems Partner - Authorized Reseller >> Im S=FCsterfeld 6 Tel 0241/701333-= 18 >> tk@westend.com D-52072 Aachen Fax 0241/911879 >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 >> Gesch=E4ftsf=FChrer: Thomas Neugebauer, Thomas Heller, Michael= Kolb >>=20 >>=20 > > The benchmark: > $ time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k coun= t=3D20=20 >> /dev/null 2>&1; done > > 1. Six 400GB SATA drives using SW RAID5: > real 6m24.411s > user 0m43.097s > sys 2m17.350s > > 2. 
Four Raptor 150 ADFD drives using SW RAID5:
> real 3m16.962s
> user 0m42.899s
> sys 2m15.420s
>
> 3. Two Raptor 74GB *GD drives using SW RAID1:
> real 3m19.241s
> user 0m41.731s
> sys 2m15.873s
>
>

I used the DEFAULT create options for XFS as I find it highly optimizes
itself (at least with SW raid) with the exception of the ROOT FS, I had
that optimized awhile ago and I kept it:

/dev/md2 / xfs logbufs=8,logbsize=262144,biosize=16,noatime,nodiratime,nobarrier 0 1

For my regular RAID5s though I use defaults,noatime.

Justin.
---1463747160-2094664383-1175696016=:7309--

From owner-xfs@oss.sgi.com Wed Apr 4 07:22:12 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:22:18 -0700 (PDT)
X-Spam-oss-Status: No, score=0.5 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EMBfB005858 for ; Wed, 4 Apr 2007 07:22:12 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id E39C7C16B; Wed, 4 Apr 2007 16:22:10 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 05404-04-13; Wed, 4 Apr 2007 16:22:07 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id 6593AC12D; Wed, 4 Apr 2007 16:21:42 +0200 (CEST)
Date: Wed, 4 Apr 2007 16:21:42 +0200
From: Thomas Kaehn
To: Justin Piszcz
Cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
Message-ID: <20070404142142.GG18320@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.9i
X-archive-position: 11041
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1210 Lines: 37 Hi Justin, On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote: > On Wed, 4 Apr 2007, Justin Piszcz wrote: > >On Wed, 4 Apr 2007, Thomas Kaehn wrote: > >$ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 > >>/dev/null 2>&1; done > > > >1. Six 400GB SATA drives using SW RAID5: > >real 6m24.411s > >user 0m43.097s > >sys 2m17.350s > > > > The removals: > The benchmark: > $ time rm -rf test > > 1. Six 400GB SATA drives using SW RAID5: > real 0m33.996s > user 0m0.057s > sys 0m8.101s thanks for your bechmark. To my mind this clearly shows that my setup is wrong at some point. I'll try again with your mount options. Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 07:24:46 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:24:50 -0700 (PDT) X-Spam-oss-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EOhfB006635 for ; Wed, 4 Apr 2007 07:24:45 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id A39DAF063B21; Wed, 4 Apr 2007 10:24:42 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 9FC3C717672A; Wed, 4 Apr 2007 10:24:42 -0400 (EDT) Date: Wed, 4 Apr 2007 10:24:42 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: 
xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404142142.GG18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-702127020-1175696682=:7309" X-archive-position: 11042 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 1917 Lines: 63 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-702127020-1175696682=:7309 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote: >> On Wed, 4 Apr 2007, Justin Piszcz wrote: >>> On Wed, 4 Apr 2007, Thomas Kaehn wrote: >>> $ time for i in `seq 1 100000`; do dd if=3D/dev/zero of=3D$i bs=3D1k co= unt=3D20 >>>> /dev/null 2>&1; done >>> >>> 1. Six 400GB SATA drives using SW RAID5: >>> real 6m24.411s >>> user 0m43.097s >>> sys 2m17.350s >>> >> >> The removals: >> The benchmark: >> $ time rm -rf test >> >> 1. Six 400GB SATA drives using SW RAID5: >> real 0m33.996s >> user 0m0.057s >> sys 0m8.101s > > thanks for your bechmark. To my mind this clearly shows that my > setup is wrong at some point. I'll try again with your mount options. 
>
> Ciao,
> Thomas
> --
> Thomas Kähn WESTEND GmbH | Internet-Business-Provider
> Technik CISCO Systems Partner - Authorized Reseller
> Im Süsterfeld 6 Tel 0241/701333-18
> tk@westend.com D-52072 Aachen Fax 0241/911879
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608
> Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb
>
>

My guess is mkfs.xfs cannot optimize for your array like it can with a SW
RAID device because it cannot see what is underneath it. Have you tried
making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
well FYI.

Justin.
---1463747160-702127020-1175696682=:7309--

From owner-xfs@oss.sgi.com Wed Apr 4 07:35:47 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 07:35:52 -0700 (PDT)
X-Spam-oss-Status: No, score=0.5 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34EZkfB008940 for ; Wed, 4 Apr 2007 07:35:47 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 387EDC177; Wed, 4 Apr 2007 16:35:45 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 10353-03-2; Wed, 4 Apr 2007 16:35:37 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id B287BC117; Wed, 4 Apr 2007 16:35:33 +0200 (CEST)
Date: Wed, 4 Apr 2007 16:35:33 +0200
From: Thomas Kaehn
To: Justin Piszcz
Cc: xfs@oss.sgi.com
Subject: Re: Strange delete performance using XFS
Message-ID: <20070404143533.GF12481@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com>
Mime-Version: 1.0
Content-Type:
text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.9i
X-archive-position: 11043
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs
Content-Length: 1933
Lines: 53

Hi Justin,

On Wed, Apr 04, 2007 at 10:24:42AM -0400, Justin Piszcz wrote:
> >On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
> >>On Wed, 4 Apr 2007, Justin Piszcz wrote:
> >>>On Wed, 4 Apr 2007, Thomas Kaehn wrote:
> >>>$ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
> >>>
> My guess is mkfs.xfs cannot optimize for your array like it can with a SW
> RAID device because it cannot see what is underneath it. Have you tried
> making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
> well FYI.

I guess this might be the problem. I've already tried to alter
the stripe unit to match the RAID stripe size: "-d su=64k,sw=2 -l su=64k".

Maybe the 3ware controller can't deal with the kind of read and write
patterns needed by XFS. But in this case other people should have
noticed similar problems.

On a different system with a 3ware 9500S-4LP using 4 disks as RAID5
setup I get a better (but not really good) result for delete
performance (I've taken only 50000 files in this case as the system's
CPU is much slower):

| # time for i in `seq 1 50000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
|
| real 18m21.643s
| user 0m55.727s
| sys 3m12.140s
| backup:/srv/x# cd ..
| backup:/srv# rm -rf x | | # time rm -rf x | | real 5m7.845s | user 0m0.160s | sys 0m11.369s Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Wed Apr 4 08:45:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 08:45:33 -0700 (PDT) X-Spam-oss-Status: No, score=-0.4 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from smtp108.sbc.mail.mud.yahoo.com (smtp108.sbc.mail.mud.yahoo.com [68.142.198.207]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l34FjRfB026359 for ; Wed, 4 Apr 2007 08:45:28 -0700 Received: (qmail 95791 invoked from network); 4 Apr 2007 15:45:26 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp108.sbc.mail.mud.yahoo.com with SMTP; 4 Apr 2007 15:45:25 -0000 X-YMail-OSG: VVG0PLEVM1mC181ic3eeqjwe.pA93jXFNrc6sdqfX4N8N.o06NwfUPyuHOiCpMxcsJnKi4Utiw-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 74C8D1826127; Wed, 4 Apr 2007 08:45:23 -0700 (PDT) Date: Wed, 4 Apr 2007 08:45:23 -0700 From: Chris Wedgwood To: Thomas Kaehn Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070404154523.GA20096@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070404130535.GE18320@mail3b.westend.com> X-archive-position: 11044 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs 
Content-Length: 1390
Lines: 46

On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote:

> I've got a strange problem on one machine using XFS. Deleting large
> directories (containing about 100000 files, 20k each) using "rm -rf"
> takes nearly as long as creating the files using a bash loop.

quite possible

> RAM: 4 GB
> RAID10: 4x 320 GB disks connected to 3ware 9550SXU-8LP
> (Firmware Version = FE9X 3.08.00.004)

> The XFS was first created using default options and later on with
> "-d su=64k,sw=2 -l su=64k" which improved overall performance
> but not delete performance.

have you tried w/o using the hw raid?

> Has anyone noticed similar effects? On a different server (Dell
> 6850) the directory can be deleted within seconds. What could be the
> reason for the huge difference in delete performance?

a lot of log updates; does the other server have a battery-backed
write-cache like many cards do these days?

> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
> |
> | real 6m6.814s
> | user 0m30.290s
> | sys 2m42.562s

that's about the same as my quick single-spindle cheap-desktop test
here

> | # time rm -rf y
> |
> | real 5m18.034s
> | user 0m0.036s
> | sys 0m8.169s

v2 logs? what logbufs & logbsize are used?
testing with my cheap crappy desktop workstation thing with a single disk I get "1m25.004s" for the delete From owner-xfs@oss.sgi.com Wed Apr 4 11:36:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 11:36:54 -0700 (PDT) X-Spam-oss-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34IadfB001401 for ; Wed, 4 Apr 2007 11:36:40 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 8B561F063B21; Wed, 4 Apr 2007 14:36:37 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 8A4A4717672A; Wed, 4 Apr 2007 14:36:37 -0400 (EDT) Date: Wed, 4 Apr 2007 14:36:37 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404134724.GF18320@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1108556179-1175711797=:16731" X-archive-position: 11045 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2388 Lines: 70 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1108556179-1175711797=:16731 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 4 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Wed, Apr 04, 2007 at 09:29:46AM -0400, Justin Piszcz wrote: >>> Please see below for "time" output. 
>>>
>>> | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
>>> |
>>> | real 6m6.814s
>>> | user 0m30.290s
>>> | sys 2m42.562s
>>> | # time rm -rf y
>>> |
>>> | real 5m18.034s
>>> | user 0m0.036s
>>> | sys 0m8.169s
>
>> Deletes on XFS are one area that is a little slower than on other filesystems.
>> You can increase the log size during the creation of the filesystem and
>> also increase logbufs to 8 and that might help.
>
> Thanks for your suggestions.
>
> I also tried to increase the log size and logbufs mount option. This
> optimizes create and delete times to the above values (with default options
> both are around 9-10 minutes).
>
> The strange thing is that on a similar Dell machine using XFS, too,
> deletes take only ten seconds which would match user and system time.
>
> More than five minutes for deleting 100000 files where ext3 needs
> 3 seconds on the same machine is actually more than a little bit slower
> - to my mind there must be something wrong. JFS needs around 18 seconds.
>
> However I am not sure if the problem is hardware or software related.
> I've also tried to use the newest 3ware firmware - but this did not lead
> to an improvement.
> > Ciao, > Thomas > --=20 > Thomas K=E4hn WESTEND GmbH | Internet-Business-Provid= er > Technik CISCO Systems Partner - Authorized Reseller > Im S=FCsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Gesch=E4ftsf=FChrer: Thomas Neugebauer, Thomas Heller, Michael = Kolb > For the ext3, try time bash -c 'rm -rf test; sync' ---1463747160-1108556179-1175711797=:16731-- From owner-xfs@oss.sgi.com Wed Apr 4 13:45:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 13:45:12 -0700 (PDT) X-Spam-oss-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012 Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34Kj4fB005320 for ; Wed, 4 Apr 2007 13:45:08 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id D56A6F063B21; Wed, 4 Apr 2007 16:45:03 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id D21F9717672A; Wed, 4 Apr 2007 16:45:03 -0400 (EDT) Date: Wed, 4 Apr 2007 16:45:03 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070404143533.GF12481@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> <20070404142142.GG18320@mail3b.westend.com> <20070404143533.GF12481@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1667215911-1175719503=:20373" X-archive-position: 11046 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Content-Length: 2514 
Lines: 73

This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools.

---1463747160-1667215911-1175719503=:20373
Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 4 Apr 2007, Thomas Kaehn wrote:
> Hi Justin,
>
> On Wed, Apr 04, 2007 at 10:24:42AM -0400, Justin Piszcz wrote:
>>> On Wed, Apr 04, 2007 at 10:12:48AM -0400, Justin Piszcz wrote:
>>>> On Wed, 4 Apr 2007, Justin Piszcz wrote:
>>>>> On Wed, 4 Apr 2007, Thomas Kaehn wrote:
>>>>> $ time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
>>>>>
>> My guess is mkfs.xfs cannot optimize for your array like it can with a SW
>> RAID device because it cannot see what is underneath it. Have you tried
>> making a SW RAID? I also use optimized parameters for my SW RAID1/5 as
>> well FYI.
>
> I guess this might be the problem. I've already tried to alter
> the stripe unit to match the RAID stripe size: "-d su=64k,sw=2 -l su=64k".
>
> Maybe the 3ware controller can't deal with the kind of read and write
> patterns needed by XFS. But in this case other people should have
> noticed similar problems.
>
> On a different system with a 3ware 9500S-4LP using 4 disks as RAID5
> setup I get a better (but not really good) result for delete
> performance (I've taken only 50000 files in this case as the system's
> CPU is much slower):
>
> | # time for i in `seq 1 50000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done
> |
> | real 18m21.643s
> | user 0m55.727s
> | sys 3m12.140s
> | backup:/srv/x# cd ..
> | backup:/srv# rm -rf x > | > | # time rm -rf x > | > | real 5m7.845s > | user 0m0.160s > | sys 0m11.369s > > > Ciao, > Thomas > --=20 > Thomas K=E4hn WESTEND GmbH | Internet-Business-Provid= er > Technik CISCO Systems Partner - Authorized Reseller > Im S=FCsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Gesch=E4ftsf=FChrer: Thomas Neugebauer, Thomas Heller, Michael = Kolb > What do you get with ext3 when using time bash -c 'rm -f file; sync'= ---1463747160-1667215911-1175719503=:20373-- From owner-xfs@oss.sgi.com Wed Apr 4 13:58:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 04 Apr 2007 13:58:30 -0700 (PDT) X-Spam-oss-Status: No, score=0.6 required=5.0 tests=BAYES_50,J_CHICKENPOX_43, J_CHICKENPOX_44,J_CHICKENPOX_45,J_CHICKENPOX_46,J_CHICKENPOX_47, J_CHICKENPOX_48 autolearn=no version=3.2.0-pre1-r499012 Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l34KwPfB008103 for ; Wed, 4 Apr 2007 13:58:27 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 7AE342E10D59; Wed, 4 Apr 2007 22:36:06 +0200 (CEST) Date: Wed, 4 Apr 2007 22:36:01 +0200 X-OfflineIMAP-1301118847-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1175718968-0570832815641-v4.0.11 From: Lars Ellenberg To: xfs@oss.sgi.com Subject: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070404203601.GA11771@barkeeper1.linbit> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.11 X-archive-position: 11047 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs Content-Length: 9501 Lines: 343 this is a plain debian 
sarge system, i386 (actually k7 athlon), nothing fancy.
kernel is a self-built kernel.org 2.6.16 kernel, repackaged for sarge,
so it is no longer obvious which exact sublevel.
I can dig that up however, if it seems relevant.

this is a backup volume with typically many (40 to 70, maybe?) hardlinks.

Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg00-backup 1.2G 19M 1.2G 2% /mnt/backup

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg00-backup 1.2T 809G 365G 69% /mnt/backup

somehow file system corruption crept in.
may be related to some power loss,
may be related to some strange
3w-xxxx: scsi0: AEN: WARNING: ATA UDMA upgrade: Port #3
messages (also for port 5), which are actually DOWNgrades...
(unfortunately 2.6.16 has that double definition bug still in some
header file, I'll submit the oneline patch to Adrian together with some
other stuff I have pending).
maybe cosmic rays. memtest86+ is fine, btw.

anyhow.
after several runs of xfs_repair version 2.6.20
(and then cleaning up lost+found, where possible),
some "empty" but not so empty directories stay behind.

[root:/mnt/backup/lost+found]# rmdir 4059295137
rmdir: `4059295137': Directory not empty
[root:/mnt/backup/lost+found]# find 4059295137 -ls
4059295137 8 drwxrwxr-x 2 1049 1049 4096 Apr 3 12:39 4059295137

that is it!
[right now I still have 118 of these]

additional runs of xfs_repair don't change the situation.
one run takes about 8+ hours, though.

NOTE that I used the default sarge xfsprogs version 2.6.20,
not the upstream 2.8.20. yet.
I'll start an xfs_repair run with 2.8.20 right after this post, though...

from a post on xfs mailing list
    Message-Id: <200608150145.LAA07105@larry.melbourne.sgi.com>
    From: Barry Naujok
    To: 'Paul Slootman' , xfs@oss.sgi.com
    Subject: RE: cache_purge: shake on cache 0x5880a0 left 8 nodes!?
    Date: Tue, 15 Aug 2006 11:49:13 +1000
I found a suggestion to investigate like this:

xfs_db version 2.6.20
xfs_db /dev/vg00/backup

[sorry, I skipped the blockget -n and therefore the ncheck,
if absolutely necessary, I can do that still,
but I suspect it will take hours, too?
I tried it, but after a few minutes with 95% CPU and about 2GB RAM used,
I killed it...]

xfs_db> inode 4059295137
xfs_db> p
core.magic = 0x494e
core.mode = 040775
core.version = 1
core.format = 2 (extents)
core.nlinkv1 = 2
core.uid = 1049
core.gid = 1049
core.flushiter = 24
core.atime.sec = Thu Feb 1 20:30:54 2007
core.atime.nsec = 533550800
core.mtime.sec = Tue Apr 3 12:39:00 2007
core.mtime.nsec = 472123752
core.ctime.sec = Tue Apr 3 12:39:00 2007
core.ctime.nsec = 472123752
core.size = 4096
core.nblocks = 2
core.extsize = 0
core.nextents = 2
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.gen = 11
next_unlinked = null
u.bmx[0-1] = [startoff,startblock,blockcount,extentflag] 0:[0,253808988,1,0] 1:[8388608,253808989,1,0]
xfs_db> ncheck lost+found
must run blockget -n first
xfs_db> dblock 0
xfs_db> p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0x20
dhdr.bestfree[0].length = 0xfd0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 4059295137
du[0].namelen = 1
du[0].name = "."
du[0].tag = 0x10
du[1].freetag = 0xffff
du[1].length = 0xfd0
du[1].tag = 0x20
du[2].inumber = 656
du[2].namelen = 2
du[2].name = ".."
du[2].tag = 0xff0 xfs_db> dblock 8388608 xfs_db> p lhdr.info.forw = 0 lhdr.info.back = 0 lhdr.info.magic = 0xd2f1 lhdr.count = 99 lhdr.stale = 97 lbests[0] = 0:0xfd0 lents[0].hashval = 0x2e lents[0].address = 0x2 lents[1].hashval = 0x172e lents[1].address = 0x1fe lents[2].hashval = 0x859d16c lents[2].address = 0 lents[3].hashval = 0xdc6133e lents[3].address = 0 lents[4].hashval = 0xeffc248 lents[4].address = 0 lents[5].hashval = 0xfed728e lents[5].address = 0 lents[6].hashval = 0x124f4f36 lents[6].address = 0 lents[7].hashval = 0x13625491 lents[7].address = 0 lents[8].hashval = 0x1372549d lents[8].address = 0 lents[9].hashval = 0x19ef9ac0 lents[9].address = 0 lents[10].hashval = 0x1b1d6dce lents[10].address = 0 lents[11].hashval = 0x1db2dd93 lents[11].address = 0 lents[12].hashval = 0x262a70f3 lents[12].address = 0 lents[13].hashval = 0x29460811 lents[13].address = 0 lents[14].hashval = 0x2956081d lents[14].address = 0 lents[15].hashval = 0x2d8b48ab lents[15].address = 0 lents[16].hashval = 0x31e3c314 lents[16].address = 0 lents[17].hashval = 0x3669d1ab lents[17].address = 0 lents[18].hashval = 0x3679d1a7 lents[18].address = 0 lents[19].hashval = 0x36b8264b lents[19].address = 0 lents[20].hashval = 0x3996aced lents[20].address = 0 lents[21].hashval = 0x3a35ab00 lents[21].address = 0 lents[22].hashval = 0x3c670bbd lents[22].address = 0 lents[23].hashval = 0x3f150fd1 lents[23].address = 0 lents[24].hashval = 0x4308c7cd lents[24].address = 0 lents[25].hashval = 0x463489b9 lents[25].address = 0 lents[26].hashval = 0x475261d5 lents[26].address = 0 lents[27].hashval = 0x53469be7 lents[27].address = 0 lents[28].hashval = 0x53569beb lents[28].address = 0 lents[29].hashval = 0x589604ba lents[29].address = 0 lents[30].hashval = 0x594bf53d lents[30].address = 0 lents[31].hashval = 0x5981bb01 lents[31].address = 0 lents[32].hashval = 0x5a46bb9c lents[32].address = 0 lents[33].hashval = 0x5c0b668e lents[33].address = 0 lents[34].hashval = 0x5deca0a1 lents[34].address = 0 
lents[35].hashval = 0x5f872eba lents[35].address = 0 lents[36].hashval = 0x5f8a3faa lents[36].address = 0 lents[37].hashval = 0x5f9a3fa6 lents[37].address = 0 lents[38].hashval = 0x691b5828 lents[38].address = 0 lents[39].hashval = 0x69d27a70 lents[39].address = 0 lents[40].hashval = 0x73c859c2 lents[40].address = 0 lents[41].hashval = 0x73d859ce lents[41].address = 0 lents[42].hashval = 0x75eb5177 lents[42].address = 0 lents[43].hashval = 0x75fb517b lents[43].address = 0 lents[44].hashval = 0x891e2e5b lents[44].address = 0 lents[45].hashval = 0x8965893f lents[45].address = 0 lents[46].hashval = 0x8cb8b142 lents[46].address = 0 lents[47].hashval = 0x8cce44a0 lents[47].address = 0 lents[48].hashval = 0x8de83f12 lents[48].address = 0 lents[49].hashval = 0x8f22b70e lents[49].address = 0 lents[50].hashval = 0x8f4ab64c lents[50].address = 0 lents[51].hashval = 0x90acb7a0 lents[51].address = 0 lents[52].hashval = 0x9302cac0 lents[52].address = 0 lents[53].hashval = 0x9312cacc lents[53].address = 0 lents[54].hashval = 0x950c4daa lents[54].address = 0 lents[55].hashval = 0x960164df lents[55].address = 0 lents[56].hashval = 0x97df8a76 lents[56].address = 0 lents[57].hashval = 0x9b1ca984 lents[57].address = 0 lents[58].hashval = 0x9be7549e lents[58].address = 0 lents[59].hashval = 0x9bf75492 lents[59].address = 0 lents[60].hashval = 0x9db319d9 lents[60].address = 0 lents[61].hashval = 0xa4926d71 lents[61].address = 0 lents[62].hashval = 0xa5677d7e lents[62].address = 0 lents[63].hashval = 0xa5777d72 lents[63].address = 0 lents[64].hashval = 0xaa01d033 lents[64].address = 0 lents[65].hashval = 0xaa11d03f lents[65].address = 0 lents[66].hashval = 0xaa8a84f0 lents[66].address = 0 lents[67].hashval = 0xaa9a84fc lents[67].address = 0 lents[68].hashval = 0xae810ff0 lents[68].address = 0 lents[69].hashval = 0xaedc9085 lents[69].address = 0 lents[70].hashval = 0xb55cb496 lents[70].address = 0 lents[71].hashval = 0xb8866b93 lents[71].address = 0 lents[72].hashval = 0xb98b5af9 
lents[72].address = 0 lents[73].hashval = 0xb9d6e5c3 lents[73].address = 0 lents[74].hashval = 0xbb1c6ddf lents[74].address = 0 lents[75].hashval = 0xbc34bda9 lents[75].address = 0 lents[76].hashval = 0xbd20f48a lents[76].address = 0 lents[77].hashval = 0xbe8bae2a lents[77].address = 0 lents[78].hashval = 0xc86644bc lents[78].address = 0 lents[79].hashval = 0xcdb62e47 lents[79].address = 0 lents[80].hashval = 0xcdcd8923 lents[80].address = 0 lents[81].hashval = 0xd43269bc lents[81].address = 0 lents[82].hashval = 0xde074636 lents[82].address = 0 lents[83].hashval = 0xde17463a lents[83].address = 0 lents[84].hashval = 0xe1321219 lents[84].address = 0 lents[85].hashval = 0xe1465aa0 lents[85].address = 0 lents[86].hashval = 0xe65770a0 lents[86].address = 0 lents[87].hashval = 0xe7e6d19e lents[87].address = 0 lents[88].hashval = 0xeb8d1d7a lents[88].address = 0 lents[89].hashval = 0xf02bb092 lents[89].address = 0 lents[90].hashval = 0xf03bb09e lents[90].address = 0 lents[91].hashval = 0xf494b02c lents[91].address = 0 lents[92].hashval = 0xf58811b2 lents[92].address = 0 lents[93].hashval = 0xf59811be lents[93].address = 0 lents[94].hashval = 0xf988f496 lents[94].address = 0 lents[95].hashval = 0xfb4d59cd lents[95].address = 0 lents[96].hashval = 0xfb5d59c1 lents[96].address = 0 lents[97].hashval = 0xfca78996 lents[97].address = 0 lents[98].hashval = 0xfe80e968 lents[98].address = 0 ltail.bestcount = 1 xfs_db> hope that helps someone figure out what is wrong. if I can provide further info, anything, just tell me. cheers, -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. 
From owner-xfs@oss.sgi.com Thu Apr 5 00:28:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 00:28:11 -0700 (PDT) X-Spam-oss-Status: No, score=0.4 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l357S6fB028589 for ; Thu, 5 Apr 2007 00:28:07 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 994E3C308; Thu, 5 Apr 2007 09:28:05 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 25356-05; Thu, 5 Apr 2007 09:28:03 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 93425C2BC; Thu, 5 Apr 2007 09:28:03 +0200 (CEST) Date: Thu, 5 Apr 2007 09:28:03 +0200 From: Thomas Kaehn To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405072803.GB2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070404154523.GA20096@tuatara.stupidest.org> User-Agent: Mutt/1.5.9i X-archive-position: 11049 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 2790 Lines: 71 Hi Chris, On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote: > On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote: > > The XFS was first created using default options and later on with > > "-d su=64k,sw=2 -l su=64k" which improved overall performance > > but not delete performance. > > have you tried w/o using the hw raid? I'm going to test with a single disk in the same machine. 
> > Has anyone seen similar effects? On a different server (Dell > > 6850) the directory can be deleted within seconds. What could be the > > reason for the huge difference in delete performance? > > a lot of log updates; does the other server have a battery-backed > write-cache like many cards do these days? The Dell system has got a battery-backed write-cache. The 3ware system has no battery unit. However it's supposed to provide write cache, too. At least I've enabled it in the RAID's configuration. The controller has got more than 100MB of memory. > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > > | > > | real 6m6.814s > > | user 0m30.290s > > | sys 2m42.562s > > that's about the same as my quick single-spindle cheap-desktop test > here > > > | # time rm -rf y > > | > > | real 5m18.034s > > | user 0m0.036s > > | sys 0m8.169s > > v2 logs? what logbufs & logbsize is used? > > testing with my cheap crappy desktop workstation thing with a > single disk I get "1m25.004s" for the delete Your delete time sounds more sensible to me. 
The file system was created first with default options - later on I tried to match the RAID's stripe size, increased the log size and mounted with logbufs=8: log stripe unit specified, using v2 logs meta-data=/dev/sda1 isize=256 agcount=8, agsize=125008 blks = sectsz=512 data = bsize=4096 blocks=1000032, imaxpct=25 = sunit=16 swidth=32 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=16384, version=2 = sectsz=512 sunit=16 blks realtime =none extsz=65536 blocks=0, rtextents=0 Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 00:37:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 00:37:27 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l357bOfB030945 for ; Thu, 5 Apr 2007 00:37:25 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id A4D40C2EE; Thu, 5 Apr 2007 09:37:23 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 28182-02; Thu, 5 Apr 2007 09:37:21 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id EEA00C2E0; Thu, 5 Apr 2007 09:37:21 +0200 (CEST) Date: Thu, 5 Apr 2007 09:37:21 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405073721.GC2759@mail3b.westend.com> References: 
<20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11050 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 751 Lines: 23 Hi Justin, On Wed, Apr 04, 2007 at 02:36:37PM -0400, Justin Piszcz wrote: > For the ext3, try time bash -c 'rm -rf test; sync' # time bash -c 'rm -rf y; sync' real 0m1.592s user 0m0.032s sys 0m1.408s Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 01:17:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 01:17:57 -0700 (PDT) X-Spam-oss-Status: No, score=0.3 required=5.0 tests=AWL,BAYES_50 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l358HrfB010719 for ; Thu, 5 Apr 2007 01:17:55 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 66206C17E; Thu, 5 Apr 2007 10:17:53 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 07235-04-3; Thu, 5 Apr 2007 10:17:51 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 795E4C117; Thu, 5 Apr 2007 10:17:51 +0200 (CEST) Date: Thu, 5 Apr 2007 10:17:51 +0200 From: Thomas Kaehn To: 
Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405081751.GD2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404134724.GF18320@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11051 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1654 Lines: 46 Hi Justin, On Wed, Apr 04, 2007 at 10:13:36AM -0400, Justin Piszcz wrote: > I used the DEFAULT create options for XFS as I find it highly optimizes > itself (at least with SW raid) with the exception of the ROOT FS, I had > that optimized awhile ago and I kept it: > > /dev/md2 / xfs > logbufs=8,logbsize=262144,biosize=16,noatime,nodiratime,nobarrier 0 1 > > > For my regular RAID5s though I use defaults,noatime. I've disabled barriers, too, and performance increased dramatically. However I am not aware of the consequences of disabling write barriers. The FAQ generally recommends using write barriers except when having a battery-backed cache (this 3ware has not). | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 3m52.182s | user 0m30.482s | sys 3m16.152s | | # time \rm -rf y | | real 0m16.327s | user 0m0.052s | sys 0m8.305s So I am unsure if disabling is an option for me. I could imagine that write barriers are not properly supported by 3ware or have to be fine tuned at the kernel or SCSI level. 
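A minimal sketch of the two timings above as one repeatable script, which may make it easier to compare mounts with and without barriers. The function name bench_delete, its arguments, and the use of `date +%s` (not strictly POSIX, but widely available) are assumptions for illustration, not something used in the runs above:

```shell
#!/bin/sh
# Hypothetical helper: bench_delete DIR [COUNT]
# Creates COUNT files of 20 KB each in DIR (same dd parameters as the
# runs quoted above), then times a recursive delete of DIR.
bench_delete() {
    dir=$1
    count=${2:-100000}

    mkdir -p "$dir" || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        dd if=/dev/zero of="$dir/$i" bs=1k count=20 >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
    sync    # flush the creates so they do not bleed into the delete timing

    start=$(date +%s)
    rm -rf "$dir"
    sync    # include the time needed to push the deletes to disk
    end=$(date +%s)

    echo "delete of $count files took $((end - start))s"
}
```

It could be called as `bench_delete /mnt/test/y 100000` (the path is only a placeholder) once on each mount; the resolution is whole seconds, so `time` remains the better tool for short runs.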
Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 01:30:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 01:30:35 -0700 (PDT) X-Spam-oss-Status: No, score=0.6 required=5.0 tests=AWL,BAYES_60,HTML_MESSAGE, RDNS_NONE autolearn=no version=3.2.0-pre1-r499012 Received: from ilsmtp.nds.com ([192.118.32.12]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l358UQfD013932 for ; Thu, 5 Apr 2007 01:30:31 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5 MIME-Version: 1.0 Subject: XFS Resiliency to the disk errors. Date: Thu, 5 Apr 2007 11:08:07 +0300 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS Resiliency to the disk errors. Thread-Index: Acd3WYs9KYsrse46SLGHxEpphOjGKg== From: "Zak, Semion" To: Content-Type: text/plain Content-Disposition: inline Content-Transfer-Encoding: 7bit X-archive-position: 11052 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: SZak@nds.com Precedence: bulk X-list: xfs Content-Length: 481 Lines: 21 Hi, We are studying possibility to use XFS with cheap (not too reliable) discs, so we have some questions: What in XFS is done to survive the disk errors (bad sectors)? I know about superblock duplication in every AG. What else? What is XFS behavior in case of the disk errors (panic/no mount/partial data access)? What could be done to restore? If zero bad sector/dump to other device/format/restore will help? Thanks, Semion. 
[[HTML alternate version deleted]] From owner-xfs@oss.sgi.com Thu Apr 5 02:03:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 02:03:40 -0700 (PDT) X-Spam-oss-Status: No, score=0.4 required=5.0 tests=AWL,BAYES_50, J_CHICKENPOX_12,J_CHICKENPOX_43 autolearn=no version=3.2.0-pre1-r499012 Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3593afB021908 for ; Thu, 5 Apr 2007 02:03:38 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 0DF9FC14C; Thu, 5 Apr 2007 11:03:36 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 20480-03; Thu, 5 Apr 2007 11:03:33 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id C3122BED3; Thu, 5 Apr 2007 11:03:33 +0200 (CEST) Date: Thu, 5 Apr 2007 11:03:33 +0200 From: Thomas Kaehn To: Chris Wedgwood Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405090333.GE2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070405072803.GB2759@mail3b.westend.com> User-Agent: Mutt/1.5.9i X-archive-position: 11053 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Content-Length: 1421 Lines: 40 Hi Chris, On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: > On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote: > > On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote: > > > The XFS was first created using default options and later on with > > > "-d su=64k,sw=2 -l su=64k" 
which improved overall performance > > > but not delete performance. > > > > have you tried w/o using the hw raid? > > I'am going to test with a single disk in the same machine. this is what I got with a single disk (defaults for mkfs.xfs, logbufs=8 for mount) in the same machine: | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done | | real 11m22.487s | user 0m30.278s | sys 2m33.762s | # time \rm -rf y | | real 8m20.963s | user 0m0.056s | sys 0m7.968s So there is no improvement for a single disk. Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 03:21:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 03:21:49 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35ALifB006633 for ; Thu, 5 Apr 2007 03:21:45 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id BB6C8F063B21; Thu, 5 Apr 2007 06:21:43 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id B3139717672A; Thu, 5 Apr 2007 06:21:43 -0400 (EDT) Date: Thu, 5 Apr 2007 06:21:43 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405090333.GE2759@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> MIME-Version: 
1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1411685764-1175768503=:17700" X-archive-position: 11054 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1411685764-1175768503=:17700 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Thu, 5 Apr 2007, Thomas Kaehn wrote: > Hi Chris, > > On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: >> On Wed, Apr 04, 2007 at 08:45:23AM -0700, Chris Wedgwood wrote: >>> On Wed, Apr 04, 2007 at 03:05:35PM +0200, Thomas Kaehn wrote: >>>> The XFS was first created using default options and later on with >>>> "-d su=64k,sw=2 -l su=64k" which improved overall performance >>>> but not delete performance. >>> >>> have you tried w/o using the hw raid? >> >> I'am going to test with a single disk in the same machine. > > this is what I got with a single disk (defaults for mkfs.xfs, logbufs=8 for > mount) in the same machine: > > | # time for i in `seq 1 100000`; do dd if=/dev/zero of=$i bs=1k count=20 >/dev/null 2>&1; done > | > | real 11m22.487s > | user 0m30.278s > | sys 2m33.762s > | # time \rm -rf y > | > | real 8m20.963s > | user 0m0.056s > | sys 0m7.968s > > So there is no improvement for a single disk. 
> > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > What kind of disks are you using? Maybe just slow disks?? ---1463747160-1411685764-1175768503=:17700-- From owner-xfs@oss.sgi.com Thu Apr 5 03:50:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 03:50:17 -0700 (PDT) Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35AoCfB013090 for ; Thu, 5 Apr 2007 03:50:13 -0700 Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 92453C03B; Thu, 5 Apr 2007 12:50:08 +0200 (CEST) Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 17632-05; Thu, 5 Apr 2007 12:50:06 +0200 (CEST) Received: by mail3b1.westend.com (Postfix, from userid 1001) id 14D6FBFBF; Thu, 5 Apr 2007 12:50:06 +0200 (CEST) Date: Thu, 5 Apr 2007 12:50:06 +0200 From: Thomas Kaehn To: Justin Piszcz Cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405105005.GF2759@mail3b.westend.com> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.9i X-archive-position: 11055 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: tk@westend.com Precedence: bulk X-list: xfs Hi Justin, On Thu, Apr 05, 2007 at 06:21:43AM -0400, Justin Piszcz wrote: > >So there is no improvement for a single disk. > > > What kind of disks are you using? Maybe just slow disks?? I am using the following disks: http://www.westerndigital.com/en/products/products.asp?DriveID=233 With write barriers disabled, delete times are OK. I think that the 3ware RAID controller could have a problem with them. I'll try to contact 3ware to find out whether this feature is supported or not. Additionally I am going to try out some of the advice presented in the 3ware knowledge base. Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 5 04:11:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:11:46 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BBgfB018902 for ; Thu, 5 Apr 2007 04:11:43 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id DFC69F063B21; Thu, 5 Apr 2007 07:11:41 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id D8A5B717672A; Thu, 5 Apr 2007 07:11:41 -0400 (EDT) Date: Thu, 5 Apr 2007 07:11:41 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Thomas Kaehn cc: Chris Wedgwood , xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405105005.GF2759@mail3b.westend.com> Message-ID: References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> 
<20070405072803.GB2759@mail3b.westend.com> <20070405090333.GE2759@mail3b.westend.com> <20070405105005.GF2759@mail3b.westend.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463747160-1637025469-1175771501=:17700" X-archive-position: 11056 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463747160-1637025469-1175771501=:17700 Content-Type: TEXT/PLAIN; charset=iso-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Thu, 5 Apr 2007, Thomas Kaehn wrote: > Hi Justin, > > On Thu, Apr 05, 2007 at 06:21:43AM -0400, Justin Piszcz wrote: >>> So there is no improvement for a single disk. >>> >> What kind of disks are you using? Maybe just slow disks?? > > I am using the following disks: > > http://www.westerndigital.com/en/products/products.asp?DriveID=233 > > With write barriers disabled, delete times are OK. I think that > the 3ware RAID controller could have a problem with them. I'll try to > contact 3ware to find out whether this feature is supported or > not. > > Additionally I am going to try out some of the advice presented in the > 3ware knowledge base. > > Ciao, > Thomas > -- > Thomas Kähn WESTEND GmbH | Internet-Business-Provider > Technik CISCO Systems Partner - Authorized Reseller > Im Süsterfeld 6 Tel 0241/701333-18 > tk@westend.com D-52072 Aachen Fax 0241/911879 > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 > Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb > > Ah, ok-- Keep us updated/let us know if you get any new findings/etc. 
Something else you can try as well is turning off NCQ; that gave me a 10-35% performance boost depending on the benchmark. Justin. ---1463747160-1637025469-1175771501=:17700-- From owner-xfs@oss.sgi.com Thu Apr 5 04:44:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:44:19 -0700 (PDT) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.146]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BiDfB031345 for ; Thu, 5 Apr 2007 04:44:15 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l35Biv3S002799 for ; Thu, 5 Apr 2007 07:44:57 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l35BiC7l280220 for ; Thu, 5 Apr 2007 07:44:12 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l35BiCDL001445 for ; Thu, 5 Apr 2007 07:44:12 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av02.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BiBr1001354; Thu, 5 Apr 2007 07:44:11 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 5A61929ECD4; Thu, 5 Apr 2007 17:14:17 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l35BiG93015298; Thu, 5 Apr 2007 17:14:16 +0530 Date: Thu, 5 Apr 2007 17:14:16 +0530 From: "Amit K. 
Arora" To: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070405114416.GB19982@amitarora.in.ibm.com> References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11057 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Thu, Apr 05, 2007 at 04:56:19PM +0530, Amit K. 
Arora wrote: Correction below: > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); return sys_fallocate(fd, mode, offset, len); > } -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu Apr 5 04:45:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 04:45:23 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35BjHfB031746 for ; Thu, 5 Apr 2007 04:45:19 -0700 Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BQIr0010240 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Thu, 5 Apr 2007 07:26:19 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e35.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l35BQG2R000962 for ; Thu, 5 Apr 2007 07:26:16 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l35BQGB7176036 for ; Thu, 5 Apr 2007 05:26:16 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l35BQFCw023992 for ; Thu, 5 Apr 2007 05:26:16 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av02.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l35BQFnW023961; Thu, 5 Apr 2007 05:26:15 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id AB73B29ECD4; Thu, 5 Apr 2007 16:56:20 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l35BQJem007983; Thu, 5 Apr 2007 16:56:19 +0530 Date: Thu, 5 Apr 2007 16:56:19 +0530 From: "Amit K. 
Arora" To: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070405112619.GA19982@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070330071417.GI355@devserv.devel.redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 11058 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > Wouldn't > int fallocate(loff_t offset, loff_t len, int fd, int mode) > work on both s390 and ppc/arm? glibc will certainly wrap it and > reorder the arguments as needed, so there is no need to keep fd first. This should work on all the platforms. The only concern I can think of here is the convention being followed till now, where all the entities on which the action has to be performed by the kernel (say fd, file/device name, pid etc.) is the first argument of the system call. If we can live with the small exception here, fine. Or else, we may have to implement the int fd, int mode, loff_t offset, loff_t len as the layout of arguments here. 
I think only s390 will have a problem with this, and we can think of a workaround for it (may be similar to what ARM did to implement sync_file_range() system call) : asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) { return sys_fallocate(fd, offset, len, mode); } To me both the approaches look slightly unconventional. But, we need to compromise somewhere to make things work on all the platforms. Any thoughts on which one of the above should we finalize on ? Thanks! -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu Apr 5 08:29:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 08:29:30 -0700 (PDT) Received: from smtp101.sbc.mail.mud.yahoo.com (smtp101.sbc.mail.mud.yahoo.com [68.142.198.200]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l35FTKfB002141 for ; Thu, 5 Apr 2007 08:29:21 -0700 Received: (qmail 75615 invoked from network); 5 Apr 2007 15:29:19 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp101.sbc.mail.mud.yahoo.com with SMTP; 5 Apr 2007 15:29:18 -0000 X-YMail-OSG: 8JPF9LEVM1lVqLlXkX11bjo82.dG9Z1QiU9IWNNuDIvd.SD3BTlvPb5emYgszcu26.YWreJrHQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 43A481826127; Thu, 5 Apr 2007 08:29:17 -0700 (PDT) Date: Thu, 5 Apr 2007 08:29:17 -0700 From: Chris Wedgwood To: Thomas Kaehn Cc: xfs@oss.sgi.com Subject: Re: Strange delete performance using XFS Message-ID: <20070405152917.GB23893@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405072803.GB2759@mail3b.westend.com> X-archive-position: 11060 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs 
Content-Length: 770 Lines: 18 On Thu, Apr 05, 2007 at 09:28:03AM +0200, Thomas Kaehn wrote: > The Dell system has got a battery-backed write-cache. The 3ware > system has no battery unit. However it's supposed to provide write > cache, too. That sounds like the main reason for the difference. The Dell's raid system can safely buffer outstanding writes and flush them; the 3ware can't, so it stalls waiting for the disks to catch up. You could run blktrace and watch what's going on in both cases to verify this. The numbers do seem a little low for a raid array all the same. I'd be tempted to just use the 3ware as a JBOD and use sw raid, but I'm arguably biased: I've had so many reliability and performance problems with hw raid over the years that I will almost always use sw raid given the choice. From owner-xfs@oss.sgi.com Thu Apr 5 09:06:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 09:06:40 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35G6WfB013207 for ; Thu, 5 Apr 2007 09:06:33 -0700 Received: from [192.168.1.103] (c-76-17-197-128.hsd1.mn.comcast.net [76.17.197.128]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 478131807E361; Thu, 5 Apr 2007 11:06:31 -0500 (CDT) Message-ID: <46151E86.2080704@sandeen.net> Date: Thu, 05 Apr 2007 11:06:30 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (Macintosh/20070221) MIME-Version: 1.0 To: "Zak, Semion" CC: xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors. 
References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11061 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Content-Length: 1285 Lines: 39 Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: > > What in XFS is done to survive the disk errors (bad sectors)? > I know about superblock duplication in every AG. What else? > > What is XFS behavior in case of the disk errors (panic/no mount/partial > data access)? generally metadata IO errors or bad magic found in metadata will shut down the filesystem gracefully if it can. IO errors on data will just be IO errors. > What could be done to restore? xfsdump/xfsrestore I suppose > If zero bad sector/dump to other device/format/restore will help? Well, you can't make data out of nothing. you could dd off the junk drive, zeroing out unreadable sectors, point xfs_repair at it and hope for the best. Which, depending on the problem, could wind up not being very good. If you want to know how to recover from disaster, it sounds like perhaps your data is important enough that you should not plan for failure, but rather find a way to avoid it? Seems to me the only way I'd want to put drives which are expected to fail regularly into a product is if the recovery method of "replace the disk and re-image the appliance" was acceptable, but that's just me. 
:) -Eric From owner-xfs@oss.sgi.com Thu Apr 5 09:23:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 09:23:28 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35GNJfB017294 for ; Thu, 5 Apr 2007 09:23:21 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 8E70C2E114F9; Thu, 5 Apr 2007 18:23:17 +0200 (CEST) Date: Thu, 5 Apr 2007 18:22:35 +0200 X-OfflineIMAP-x597882765-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1175790198-0130267256784-v4.0.11 From: Lars Ellenberg To: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070405162235.GA816@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , xfs@oss.sgi.com References: <20070404203601.GA11771@barkeeper1.linbit> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070404203601.GA11771@barkeeper1.linbit> User-Agent: Mutt/1.5.11 X-archive-position: 11062 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs Content-Length: 1344 Lines: 39 On Wed, Apr 04, 2007 at 10:36:01PM +0200, Lars Ellenberg wrote: > NOTE that I used the default sarge xfsprogs version 2.6.20, not > the upstream 2.8.20. yet. I'll start an xfs_repair run with > 2.8.20 right after this post, though... done. now, this used seriously more memory, and cpu, and the box went thrashing. after some experimenting, xfs_repair -o bhash=512 got it going without using excessive amounts of swap, so it finally finished after about 12 hours (2.6.20 needed 8:30, repeatable). it did not change the situation, however. I know I could clean these using xfs_db and an additional run of xfs_repair, but I'm going to keep these around for some more time, in case you want me to have a look at some internals still. 
file system itself has gone live again, I hope it does not hurt having those strange directories around. maybe it is even "just" a problem on the kernel side, not being able to convert to the expected "form" of directory? sorry, I'm not too deep in the xfs internals, so I need some input from the developers here... Thanks, -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Thu Apr 5 10:27:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 05 Apr 2007 10:27:29 -0700 (PDT) Received: from rgminet02.oracle.com (rgminet02.oracle.com [148.87.113.119]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l35HRKfB000743 for ; Thu, 5 Apr 2007 10:27:22 -0700 Received: from rgminet01.oracle.com (rgminet01.oracle.com [148.87.113.118]) by rgminet02.oracle.com (Switch-3.2.4/Switch-3.1.7) with ESMTP id l35FoFgj014786 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 5 Apr 2007 09:50:16 -0600 Received: from rgmgw3.us.oracle.com (rgmgw3.us.oracle.com [138.1.186.112]) by rgminet01.oracle.com (Switch-3.2.4/Switch-3.1.6) with ESMTP id l35FnVjP026456; Thu, 5 Apr 2007 09:49:31 -0600 Received: from acsmt350.oracle.com (acsmt350.oracle.com [141.146.40.150]) by rgmgw3.us.oracle.com (Switch-3.2.4/Switch-3.1.7) with ESMTP id l35FQpgT010519; Thu, 5 Apr 2007 09:49:29 -0600 Received: from pool-71-245-96-31.nycmny.fios.verizon.net by rcsmt252.oracle.com with ESMTP id 2592044011175788085; Thu, 05 Apr 2007 09:48:05 -0600 Date: Thu, 5 Apr 2007 08:50:16 -0700 From: Randy Dunlap To: "Amit K. 
Arora" Cc: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-Id: <20070405085016.a513526b.randy.dunlap@oracle.com> In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Organization: Oracle Linux Eng. X-Mailer: Sylpheed 2.3.1 (GTK+ 2.8.10; x86_64-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Whitelist: TRUE X-Whitelist: TRUE X-Whitelist: TRUE X-Brightmail-Tracker: AAAAAQAAAAI= X-Brightmail-Tracker: AAAAAQAAAAI= X-archive-position: 11063 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: randy.dunlap@oracle.com Precedence: bulk X-list: xfs Content-Length: 1611 Lines: 44 On Thu, 5 Apr 2007 16:56:19 +0530 Amit K. Arora wrote: > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > > Wouldn't > > int fallocate(loff_t offset, loff_t len, int fd, int mode) > > work on both s390 and ppc/arm? glibc will certainly wrap it and > > reorder the arguments as needed, so there is no need to keep fd first. > > This should work on all the platforms. 
The only concern I can think of > here is the convention being followed till now, where all the entities on > which the action has to be performed by the kernel (say fd, file/device > name, pid etc.) is the first argument of the system call. If we can live > with the small exception here, fine. > > Or else, we may have to implement the > > int fd, int mode, loff_t offset, loff_t len > > as the layout of arguments here. I think only s390 will have a problem > with this, and we can think of a workaround for it (may be similar to > what ARM did to implement sync_file_range() system call) : > > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); > } > > > To me both the approaches look slightly unconventional. But, we need to > compromise somewhere to make things work on all the platforms. > > Any thoughts on which one of the above should we finalize on ? > > Thanks! If s390 can work around the calling order that easily, I certainly prefer the more conventional ordering of: > int fd, int mode, loff_t offset, loff_t len --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** From owner-xfs@oss.sgi.com Fri Apr 6 03:19:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 06 Apr 2007 03:20:00 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l36AJsfB023236 for ; Fri, 6 Apr 2007 03:19:55 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 54FE07BA30E; Fri, 6 Apr 2007 03:58:22 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id D2ECB407F; Fri, 6 Apr 2007 03:58:20 -0600 (MDT) Date: Fri, 6 Apr 2007 03:58:20 -0600 From: Andreas Dilger To: "Amit K. 
Arora" Cc: Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070406095820.GF5967@schatzie.adilger.int> Mail-Followup-To: "Amit K. Arora" , Jakub Jelinek , Andrew Morton , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070405112619.GA19982@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070405112619.GA19982@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11066 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs Content-Length: 1005 Lines: 28 On Apr 05, 2007 16:56 +0530, Amit K. Arora wrote: > This should work on all the platforms. The only concern I can think of > here is the convention being followed till now, where all the entities on > which the action has to be performed by the kernel (say fd, file/device > name, pid etc.) is the first argument of the system call. If we can live > with the small exception here, fine. 
Yes, it is much cleaner to have fd first, like every other such syscall. > Or else, we may have to implement the > > int fd, int mode, loff_t offset, loff_t len > > as the layout of arguments here. I think only s390 will have a problem > with this, and we can think of a workaround for it (may be similar to > what ARM did to implement sync_file_range() system call) : > > asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode) > { > return sys_fallocate(fd, offset, len, mode); > } Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Fri Apr 6 11:57:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 06 Apr 2007 11:57:38 -0700 (PDT) Received: from ty.sabi.co.UK (82-69-39-138.dsl.in-addr.zen.co.uk [82.69.39.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l36IvRfB017453 for ; Fri, 6 Apr 2007 11:57:29 -0700 Received: from from [127.0.0.1] (helo=base.ty.sabi.co.UK) by ty.sabi.co.UK with esmtp(Exim 4.62 #1) id 1HZtVR-0004j5-QX for ; Fri, 06 Apr 2007 19:49:54 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17942.38470.367699.402354@base.ty.sabi.co.UK> Date: Fri, 6 Apr 2007 19:49:42 +0100 X-Face: SMJE]JPYVBO-9UR%/8d'mG.F!@.,l@c[f'[%S8'BZIcbQc3/">GrXDwb#;fTRGNmHr^JFb SAptvwWc,0+z+~p~"Gdr4H$(|N(yF(wwCM2bW0~U?HPEE^fkPGx^u[*[yV.gyB!hDOli}EF[\cW*S H&spRGFL}{`bj1TaD^l/"[ msn( /TH#THs{Hpj>)]f> Subject: Re: XFS Resiliency to the disk errors. 
In-Reply-To: References: X-Mailer: VM 7.17 under 21.4 (patch 20) XEmacs Lucid From: pg_xfs@xfs.for.sabi.co.UK (Peter Grandi) X-Disclaimer: This message contains only personal opinions X-archive-position: 11067 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: pg_xfs@xfs.for.sabi.co.UK Precedence: bulk X-list: xfs Content-Length: 1251 Lines: 33 >>> On Thu, 5 Apr 2007 11:08:07 +0300, "Zak, Semion" >>> said: SZak> Hi, We are studying possibility to use XFS with cheap (not SZak> too reliable) discs, so we have some questions: Astute move :-). I hope that you are also thinking of using 16-wide RAID5 too :-). SZak> What in XFS is done to survive the disk errors (bad SZak> sectors)? [ ... ] My impression is that the XFS design is really meant for highly scalable performance on enterprise level hardware, where the block device layer abstracts away all drive error issues, including having UPSes. Sure you can use it otherwise, but it has a very different optimal usage envelope from 'ext3' or ReiserFS/Reiser4 (which have been designed with stronger resiliency and recoverability features, as they are more oriented to desktop and cheap server usage). Anyhow, a highly reliable block device layer can surely be built out of cheap disks, if one does it right, and people like EMC2 do it regularly with their midrange products. 
It may be interesting for you to have a look at the disk reliability statistics in some recent papers by some Google and CMU researchers, discussed here: http://swik.net/User:dolander/All+Things+Distributed/On+the+Reliability+of+Hard+Disks/ From owner-xfs@oss.sgi.com Sat Apr 7 06:28:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 07 Apr 2007 06:28:12 -0700 (PDT) Received: from ty.sabi.co.UK (82-69-39-138.dsl.in-addr.zen.co.uk [82.69.39.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l37DS3fB021195 for ; Sat, 7 Apr 2007 06:28:06 -0700 Received: from from [127.0.0.1] (helo=base.ty.sabi.co.UK) by ty.sabi.co.UK with esmtp(Exim 4.62 #1) id 1HZthd-0005LS-2Q for ; Fri, 06 Apr 2007 20:02:29 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17942.39236.623189.817503@base.ty.sabi.co.UK> Date: Fri, 6 Apr 2007 20:02:28 +0100 X-Face: SMJE]JPYVBO-9UR%/8d'mG.F!@.,l@c[f'[%S8'BZIcbQc3/">GrXDwb#;fTRGNmHr^JFb SAptvwWc,0+z+~p~"Gdr4H$(|N(yF(wwCM2bW0~U?HPEE^fkPGx^u[*[yV.gyB!hDOli}EF[\cW*S H&spRGFL}{`bj1TaD^l/"[ msn( /TH#THs{Hpj>)]f> Subject: Re: Strange delete performance using XFS In-Reply-To: <20070405152917.GB23893@tuatara.stupidest.org> References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405152917.GB23893@tuatara.stupidest.org> X-Mailer: VM 7.17 under 21.4 (patch 20) XEmacs Lucid From: pg_xfs@xfs.for.sabi.co.UK (Peter Grandi) X-Disclaimer: This message contains only personal opinions X-archive-position: 11069 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: pg_xfs@xfs.for.sabi.co.UK Precedence: bulk X-list: xfs Content-Length: 1071 Lines: 28 [ ... slowness deleting a lot of inodes ... ] >> The Dell system has got a battery-backed write-cache. The >> 3ware system has no battery unit. 
However it's supposed to >> provide write cache, too. Whatever, but you cannot have both metadata consistency and high speed without fully reliable hw... > [ ... ] The Dell's raid system can safely buffer outstanding > writes and flush them, the 3ware can't so it stalls waiting > for the disks to catch up. [ ... ] I'd be tempted to just use > the 3ware as a JBOD and use sw, but I'm arguably biased, I've > had so many reliability and performance problems with hw raid > over the years Uhm, I had a friend that worked for a middling storage system vendor and he was telling me horror stories about bugs and misdesigns in their quite popular RAID products. 3ware seem to me one of the more reliable RAID brands, their cautious approach may be why they are slower above. > I will almost always use sw raid given the choice. Does not buy you a lot over a well designed RAID host adapter. It is also a lot less convenient. From owner-xfs@oss.sgi.com Sat Apr 7 13:47:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 07 Apr 2007 13:47:45 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l37KlefB023317 for ; Sat, 7 Apr 2007 13:47:42 -0700 Received: from localhost (dslb-084-056-094-087.pools.arcor-ip.net [84.56.94.87]) by mail.lichtvoll.de (Postfix) with ESMTP id 58DE55ADBB for ; Sat, 7 Apr 2007 22:47:39 +0200 (CEST) From: Martin Steigerwald To: linux-xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors. 
Date: Sat, 7 Apr 2007 22:47:37 +0200 User-Agent: KMail/1.9.6 References: (sfid-20070405_112347_743716_6E82B98E) (sfid-20070405_112347_743716_6E82B98E) (sfid-20070405_112347_743716_6E82B98E) In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704072247.38143.Martin@lichtvoll.de> X-archive-position: 11070 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs On Thursday 05 April 2007, Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: Hi Semion! I recommend at least monitoring the health status of the drives using smartmontools - with regular short and long self-tests - or a similar mechanism. So you *may* at least be warned *before* a disk fails. Beyond that, I would go for a redundant RAID array, so that at least one drive in a bunch of drives can fail without data loss. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Mon Apr 9 08:36:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 08:36:35 -0700 (PDT) Received: from ozlabs.org (ozlabs.org [203.10.76.45]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l39FaQfB032204 for ; Mon, 9 Apr 2007 08:36:28 -0700 Received: by ozlabs.org (Postfix, from userid 1003) id D13CBDDFDB; Tue, 10 Apr 2007 01:36:11 +1000 (EST) MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Message-ID: <17946.14646.808334.441833@cargo.ozlabs.ibm.com> Date: Mon, 9 Apr 2007 23:01:42 +1000 From: Paul Mackerras To: =?utf-8?B?SsO2cm4=?= Engel Cc: Heiko Carstens , Andrew Morton , "Amit K. 
Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call In-Reply-To: <20070330104449.GA9371@lazybastard.org> References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071929.GC8365@osiris.boeblingen.de.ibm.com> <17932.54606.323431.491736@cargo.ozlabs.ibm.com> <20070330104449.GA9371@lazybastard.org> X-Mailer: VM 7.19 under Emacs 21.4.1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l39FaSfB032208 X-archive-position: 11071 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: paulus@samba.org Precedence: bulk X-list: xfs Jörn Engel writes: > Wouldn't that work be confined to fallocate()? If I understand Heiko > correctly, the alternative would slow s390 down for every syscall, > including more performance-critical ones. The alternative that Jakub suggested wouldn't slow s390 down. Paul. 
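A minimal sketch of the reordering idea discussed in this thread, assuming a kernel entry point that takes the two 64-bit arguments first (the ordering Jakub proposed) and a library wrapper that still presents the conventional fd-first ordering to callers. The names raw_fallocate_stub, wrapped_fallocate, and last_call are hypothetical and exist only for illustration; the stub records its arguments and allocates nothing.

```c
#include <stdint.h>

/* Hypothetical record of what the "kernel side" received. */
struct fallocate_args {
    int64_t offset;
    int64_t len;
    int fd;
    int mode;
};

struct fallocate_args last_call;

/* Hypothetical stand-in for a kernel entry point that takes the
 * 64-bit arguments first. It only stores its arguments so that the
 * reordering done by the wrapper can be inspected. */
long raw_fallocate_stub(int64_t offset, int64_t len, int fd, int mode)
{
    last_call.offset = offset;
    last_call.len = len;
    last_call.fd = fd;
    last_call.mode = mode;
    return 0;
}

/* Hypothetical library wrapper: callers use the conventional
 * (fd, mode, offset, len) ordering, and the wrapper reorders the
 * arguments before handing them to the entry point above. */
int wrapped_fallocate(int fd, int mode, int64_t offset, int64_t len)
{
    return (int)raw_fallocate_stub(offset, len, fd, mode);
}
```

A caller would write wrapped_fallocate(fd, 0, 4096, 8192) and never see the underlying ordering, which is the point made earlier in the thread about glibc wrapping the call.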
From owner-xfs@oss.sgi.com Mon Apr 9 09:39:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 09:39:08 -0700 (PDT) Received: from longford.lazybastard.org (lazybastard.de [212.112.238.170]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l39Gd0fB011359 for ; Mon, 9 Apr 2007 09:39:00 -0700 Received: from joern by longford.lazybastard.org with local (Exim 4.50) id 1HawpE-0006QX-5Y; Mon, 09 Apr 2007 18:34:40 +0200 Date: Mon, 9 Apr 2007 18:34:37 +0200 From: =?utf-8?B?SsO2cm4=?= Engel To: Paul Mackerras Cc: Heiko Carstens , Andrew Morton , "Amit K. Arora" , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070409163436.GA24012@lazybastard.org> References: <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071929.GC8365@osiris.boeblingen.de.ibm.com> <17932.54606.323431.491736@cargo.ozlabs.ibm.com> <20070330104449.GA9371@lazybastard.org> <17946.14646.808334.441833@cargo.ozlabs.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <17946.14646.808334.441833@cargo.ozlabs.ibm.com> User-Agent: Mutt/1.5.9i X-archive-position: 11072 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: joern@lazybastard.org Precedence: bulk X-list: xfs On Mon, 9 April 2007 23:01:42 +1000, Paul Mackerras wrote: > Jörn Engel writes: > > > Wouldn't that work be confined to fallocate()? 
If I understand Heiko > > correctly, the alternative would slow s390 down for every syscall, > > including more performance-critical ones. > > The alternative that Jakub suggested wouldn't slow s390 down. True. And it appears to be one of the least offensive options we have. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra From owner-xfs@oss.sgi.com Mon Apr 9 18:51:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 18:51:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3A1pBfB026573 for ; Mon, 9 Apr 2007 18:51:13 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA27963; Tue, 10 Apr 2007 11:51:05 +1000 Message-Id: <200704100151.LAA27963@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Lars Ellenberg'" , Subject: RE: xfs_repair leaves empty but undeletable dirs in lost+found Date: Tue, 10 Apr 2007 11:55:14 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: <20070405162235.GA816@barkeeper1.linbit> Thread-Index: Acd3nt9nMNud96faT2y1vCJJ8ylFNADdDMXw X-archive-position: 11073 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi Lars, Would it be possible for you apply the patch I posted to xfs@oss in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html to the latest xfsprogs source, make and install it and run: # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 And make the 
image available for me to download and analyse? Regards, Barry. > -----Original Message----- > From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] > On Behalf Of Lars Ellenberg > Sent: Friday, 6 April 2007 2:23 AM > To: xfs@oss.sgi.com > Subject: Re: xfs_repair leaves empty but undeletable dirs in > lost+found > > On Wed, Apr 04, 2007 at 10:36:01PM +0200, Lars Ellenberg wrote: > > NOTE that I used the default sarge xfsprogs version 2.6.20, not > > the upstream 2.8.20. yet. I'll start an xfs_repair run with > > 2.8.20 right after this post, though... > > done. > now, this used seriously more memory, and cpu, > and the box went thrashing. > after some experimenting, > > xfs_repair -o bhash=512 > > got it going without using excessive amounts of swap, > so it finally finished after about 12 hours > (2.6.20 needed 8:30, repeatable). > > it did not change the situation, however. > > I know I could clean these using xfs_db and an additional run of > xfs_repair, but I'm going to keep these around for some more time, in > case you want me to have a look at some internals still. > > file system itself has gone life again, I hope it does not hurt having > those strange directories around. > > maybe it is even "just" a problem on the kernel side, > not being able to convert so the expected "form" of directory? > sorry, I'm not too deep in the xfs internals, so I need some > input from > the developers here... > > Thanks, > > -- > : Lars Ellenberg Tel +43-1-8178292-0 : > : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : > : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : > __ > please use the "List-Reply" function of your email client. 
> > From owner-xfs@oss.sgi.com Mon Apr 9 23:49:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 09 Apr 2007 23:49:17 -0700 (PDT) Received: from ilsmtp.nds.com ([192.118.32.12]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3A6n8fB016419 for ; Mon, 9 Apr 2007 23:49:14 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: RE: XFS Resiliency to the disk errors. Date: Tue, 10 Apr 2007 09:49:06 +0300 Message-ID: In-Reply-To: <46151E86.2080704@sandeen.net> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS Resiliency to the disk errors. Thread-Index: Acd3nHBtIHJQ+P54QPiwtt50fU4c0ADnmBnA From: "Zak, Semion" To: "Eric Sandeen" Cc: Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l3A6nEfB016437 X-archive-position: 11074 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: SZak@nds.com Precedence: bulk X-list: xfs Thank you very much. I have other question, about data lose on crash/power cut. Is it possible to make it not more then in other file systems, if open the important file with O_SYNC flag, or use fsync and sync functions? Thanks, Semion. -----Original Message----- From: Eric Sandeen [mailto:sandeen@sandeen.net] Sent: Thursday, April 05, 2007 7:07 PM To: Zak, Semion Cc: xfs@oss.sgi.com Subject: Re: XFS Resiliency to the disk errors. Zak, Semion wrote: > Hi, > > We are studying possibility to use XFS with cheap (not too reliable) > discs, so we have some questions: > > What in XFS is done to survive the disk errors (bad sectors)? > I know about superblock duplication in every AG. What else? > > What is XFS behavior in case of the disk errors (panic/no > mount/partial data access)? generally metadata IO errors or bad magic found in metadata will shut down the filesystem gracefully if it can. 
IO errors on data will just be IO errors. > What could be done to restore? xfsdump/xfsrestore I suppose > If zero bad sector/dump to other device/format/restore will help? Well, you can't make data out of nothing. you could dd off the junk drive, zeroing out unreadable sectors, point xfs_repair at it and hope for the best. Which, depending on the problem, could wind up not being very good. If you want to know how to recover from disaster, it sounds like perhaps your data is important enough that you should not plan for failure, but rather find a way to avoid it? Seems to me the only way I'd want to put drives which are expected to fail regularly into a product is if the recovery method of "replace the disk and re-image the appliance" was acceptable, but that's just me. :) -Eric *********************************************************************************** This email message and any attachments thereto are intended only for use by the addressee(s) named above, and may contain legally privileged and/or confidential information. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the postmaster@nds.com and destroy the original message. 
*********************************************************************************** From owner-xfs@oss.sgi.com Tue Apr 10 02:26:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 02:26:49 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3A9QifB014917 for ; Tue, 10 Apr 2007 02:26:45 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id A9D432DF653D; Tue, 10 Apr 2007 11:26:42 +0200 (CEST) Date: Tue, 10 Apr 2007 11:24:43 +0200 X-OfflineIMAP-382579781-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1176197203-0163058626383-v4.0.11 From: Lars Ellenberg To: Barry Naujok Cc: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070410092443.GA8496@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , Barry Naujok , xfs@oss.sgi.com References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704100151.LAA27963@larry.melbourne.sgi.com> User-Agent: Mutt/1.5.11 X-archive-position: 11075 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs On Tue, Apr 10, 2007 at 11:55:14AM +1000, Barry Naujok wrote: > Hi Lars, > > Would it be possible for you apply the patch I posted to xfs@oss > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > to the latest xfsprogs source, make and install it and run: > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > And make the image available for me to download and analyse? uhm. probably. I'll talk with the guy who owns the data :) out of curiosity: what exactly would you do with it? 
I mean, would that be sufficient to restore the "badness", with the files all filled with zero, and you'd be able to reproduce locally? -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Tue Apr 10 07:40:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 07:41:05 -0700 (PDT) Received: from mexforward.lss.emc.com (mexforward.lss.emc.com [128.222.32.20]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3AEetfB020875 for ; Tue, 10 Apr 2007 07:40:56 -0700 Received: from mailhub.lss.emc.com (nagas.lss.emc.com [10.254.144.11]) by mexforward.lss.emc.com (Switch-3.2.5/Switch-3.1.7) with ESMTP id l3ADNodH011613; Tue, 10 Apr 2007 09:23:50 -0400 (EDT) Received: from [168.159.36.217] ([168.159.36.217]) by mailhub.lss.emc.com (Switch-3.2.5/Switch-3.1.7) with ESMTP id l3ADNm4A028359; Tue, 10 Apr 2007 09:23:48 -0400 (EDT) Message-ID: <461B8FE3.3010600@emc.com> Date: Tue, 10 Apr 2007 09:23:47 -0400 From: Ric Wheeler Reply-To: ric@emc.com User-Agent: Thunderbird 1.5.0.8 (X11/20061025) MIME-Version: 1.0 To: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, reiserfs-list@namesys.com, ext2-devel@lists.sourceforge.net, linux-ide@vger.kernel.org, ocfs2-devel@oss.oracle.com, linux-scsi@vger.kernel.org Subject: Linux 2007 File System & IO Workshop notes & talks Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-PMX-Version: 4.7.1.128075, Antispam-Engine: 2.5.1.298604, Antispam-Data: 2007.4.10.54233 X-PerlMx-Spam: Gauge=, SPAM=1%, Reasons='EMC_FROM_0+ -3, RDNS_NXDOMAIN 0, RDNS_SUSP 0, RDNS_SUSP_GENERIC 0, __CP_URI_IN_BODY 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __USER_AGENT 0' X-archive-position: 11077 X-ecartis-version: 
Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: ric@emc.com Precedence: bulk X-list: xfs Content-Length: 471 Lines: 20 We have some of the material reviewed and posted now from the IO & FS workshop. USENIX has posted the talks at: http://www.usenix.org/events/lsf07/tech/tech.html A write up of the workshop went out at LWN and invoked a healthy discussion: http://lwn.net/Articles/226351/ At that LWN article, there is a link to the Linux FS wiki with good notes: http://linuxfs.pbwiki.com/LSF07-Workshop-Notes Another summary will go out in the next USENIX ;login edition. ric From owner-xfs@oss.sgi.com Tue Apr 10 14:18:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 14:18:09 -0700 (PDT) Received: from mail.lichtvoll.de (mondschein.lichtvoll.de [194.150.191.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3ALI4fB031461 for ; Tue, 10 Apr 2007 14:18:06 -0700 Received: from localhost (dslb-084-056-073-232.pools.arcor-ip.net [84.56.73.232]) by mail.lichtvoll.de (Postfix) with ESMTP id 478185AD2D; Tue, 10 Apr 2007 22:45:02 +0200 (CEST) From: Martin Steigerwald To: Lars Ellenberg , Barry Naujok , xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Date: Tue, 10 Apr 2007 22:45:00 +0200 User-Agent: KMail/1.9.6 References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> <20070410092443.GA8496@barkeeper1.linbit> (sfid-20070410_134231_652510_049D094B) In-Reply-To: <20070410092443.GA8496@barkeeper1.linbit> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704102245.00734.Martin@lichtvoll.de> X-archive-position: 11078 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: Martin@lichtvoll.de Precedence: bulk X-list: xfs Am Dienstag 10 April 2007 schrieb Lars 
Ellenberg:
> > Would it be possible for you to apply the patch I posted to xfs@oss
> > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html
> > to the latest xfsprogs source, make and install it and run:
> >
> > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2
> >
> > And make the image available for me to download and analyse?
>
> uhm. probably. I'll talk with the guy who owns the data :)
>
> out of curiosity: what exactly would you do with it?
> I mean, would that be sufficient to restore the "badness",
> with the files all filled with zero,
> and you'd be able to reproduce locally?

Hi Lars!

As far as I understand, a metadata dump does not contain the actual data in the files. That would be sufficient, as xfs_repair is for repairing metadata corruption. For analysing the reason why a file is undeletable, its actual contents should be quite irrelevant. The only thing that could possibly matter is the number and location of the blocks a file occupies, not their contents. But that doesn't seem to matter here either.

It would contain metadata on the directory and file names as well as timestamps, owner and rights. If you are concerned about the privacy of your customer, you may want to try to reproduce the problem with different metadata.
Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 From owner-xfs@oss.sgi.com Tue Apr 10 17:11:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 10 Apr 2007 17:11:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3B0BMfB001815 for ; Tue, 10 Apr 2007 17:11:23 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA28739; Wed, 11 Apr 2007 10:11:16 +1000 Message-Id: <200704110011.KAA28739@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Lars Ellenberg'" Cc: Subject: RE: xfs_repair leaves empty but undeletable dirs in lost+found Date: Wed, 11 Apr 2007 10:16:57 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: <20070410092443.GA8496@barkeeper1.linbit> Thread-Index: Acd7Ul9QlkRwFlL5SyqfAUwRC1EM9QAe85tQ X-archive-position: 11079 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Hi Lars, It copies the inodes and directory contents and other metadata. I restore it here and run xfs_repair over it to see how it failed. xfs_repair only operates on metadata and does not check data. Currently, it does not obfuscate file names if there is a privacy/confidentiality concern but that is a feature I intend on adding later. Regards, Barry. 
> -----Original Message----- > From: Lars Ellenberg [mailto:lars.ellenberg@linbit.com] > Sent: Tuesday, 10 April 2007 7:25 PM > To: Barry Naujok > Cc: xfs@oss.sgi.com > Subject: Re: xfs_repair leaves empty but undeletable dirs in > lost+found > > On Tue, Apr 10, 2007 at 11:55:14AM +1000, Barry Naujok wrote: > > Hi Lars, > > > > Would it be possible for you apply the patch I posted to xfs@oss > > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > > to the latest xfsprogs source, make and install it and run: > > > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > > > And make the image available for me to download and analyse? > > uhm. probably. I'll talk with the guy who owns the data :) > > out of curiosity: what exactly would you do with it? > I mean, would that be sufficient to restore the "badness", > with the files all filled with zero, > and you'd be able to reproduce locally? > > -- > : Lars Ellenberg Tel +43-1-8178292-0 : > : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : > : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : > __ > please use the "List-Reply" function of your email client. 
From owner-xfs@oss.sgi.com Wed Apr 11 00:21:22 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 00:21:25 -0700 (PDT) Received: from mail.linbit.com (nudl.linbit.com [212.69.162.21]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3B7LKfB021100 for ; Wed, 11 Apr 2007 00:21:22 -0700 Received: by mail.linbit.com (LINBIT Mail Daemon, from userid 1030) id 197612E2716C; Wed, 11 Apr 2007 09:21:19 +0200 (CEST) Date: Tue, 10 Apr 2007 23:05:02 +0200 X-OfflineIMAP-x1476802397-6c61727340696d61702e6c696e626974-494e424f582e4f7574626f78: 1176276079-0187333350209-v4.0.11 From: Lars Ellenberg To: xfs@oss.sgi.com Subject: Re: xfs_repair leaves empty but undeletable dirs in lost+found Message-ID: <20070410210502.GA19842@barkeeper1.linbit> Mail-Followup-To: Lars Ellenberg , xfs@oss.sgi.com References: <20070405162235.GA816@barkeeper1.linbit> <200704100151.LAA27963@larry.melbourne.sgi.com> <20070410092443.GA8496@barkeeper1.linbit> <200704102245.00734.Martin@lichtvoll.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704102245.00734.Martin@lichtvoll.de> User-Agent: Mutt/1.5.11 X-archive-position: 11080 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lars.ellenberg@linbit.com Precedence: bulk X-list: xfs On Tue, Apr 10, 2007 at 10:45:00PM +0200, Martin Steigerwald wrote: > Am Dienstag 10 April 2007 schrieb Lars Ellenberg: > > > > Would it be possible for you apply the patch I posted to xfs@oss > > > in Feb http://oss.sgi.com/archives/xfs/2007-02/msg00072.html > > > to the latest xfsprogs source, make and install it and run: > > > > > > # xfs_metadump /dev/md1 - | bzip2 > /tmp/bad_xfs.bz2 > > > > > > And make the image available for me to download and analyse? > > > > uhm. probably. I'll talk with the guy who owns the data :) > > > > out of curiosity: what exactly would you do with it? 
> > I mean, would that be sufficient to restore the "badness",
> > with the files all filled with zero,
> > and you'd be able to reproduce locally?
>
> Hi Lars!
>
> As far as I understand a meta data dump does not contain the actual data
> in the files. That would be sufficient as xfs_repair is for repairing
> metadata corruption. For analysing the reason why a file is undeletable
> its actual contents should be quite irrelevant. Only thing that could
> possibly matter is the amount and location, not the contents of blocks a
> file occupies. But that doesn't seem to matter here either.
>
> It would contain meta data information on the directory and file names as
> well as timestamps, owner and rights - if you are concerned about the
> privacy of your customer you may want to try to reproduce the problem
> with different meta data information.

I'm very well aware of these things.

what I meant to ask was: (how) could I do some "partial" metadump?

because,
 * yes, I am concerned about the privacy
   (not too much, though, but still, as this is not my data, I have to ask)
 * it's probably going to be huge anyway
 * it's going to take some time to produce, and this is a live system
 * it would probably really help for investigating further
   if I were able to reduce the amount of meta data involved
   (remember, one xfs_repair run took me 12 hours)

and, most importantly:
 * reducing the amount of meta data is probably the first step Barry
   would do once he has my full dump, because that's the only way to go
   about debugging this. I'd like to help to reduce the work needed to
   debug this.

so yes, I really would like to try to reproduce with different meta data information, but I'd need a hint what to look for in my existing bad data to be able to reproduce similar bad data.

@Barry: I'd probably be able to get a full dump tomorrow. or even better, a partial dump, if you tell me what you'd need.
-- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. From owner-xfs@oss.sgi.com Wed Apr 11 00:37:02 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 00:37:04 -0700 (PDT) Received: from mail.gatrixx.com (mail.gatrixx.com [217.111.11.44]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3B7axfB024064 for ; Wed, 11 Apr 2007 00:37:01 -0700 Received: (qmail 8091 invoked by uid 1008); 11 Apr 2007 09:36:53 +0200 Received: from unknown (HELO majestix.gallier.de) (ojoa@gatrixx.com@89.54.93.154) by 0 with AES256-SHA encrypted SMTP; 11 Apr 2007 09:36:53 +0200 Received: from [192.168.10.3] (olli@gutemine.gallier.de [192.168.10.3]) by majestix.gallier.de (8.13.8/8.13.8/Debian-2) with ESMTP id l3B7aqqk007017; Wed, 11 Apr 2007 09:36:52 +0200 Message-ID: <461C9014.6040109@j-o-a.de> Date: Wed, 11 Apr 2007 09:36:52 +0200 From: Oliver Joa User-Agent: Icedove 1.5.0.10 (X11/20070329) MIME-Version: 1.0 To: xfs-oss CC: linux-kernel@vger.kernel.org Subject: Re: Corrupt XFS -Filesystems on new Hardware and Kernel References: <46094344.4090007@j-o-a.de> <20070328113141.GQ32597093@melbourne.sgi.com> In-Reply-To: <20070328113141.GQ32597093@melbourne.sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11081 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: oliver@j-o-a.de Precedence: bulk X-list: xfs Hi, David Chinner wrote: > On Tue, Mar 27, 2007 at 06:16:04PM +0200, Oliver Joa wrote: >> Hi, >> >> since some weeks i try to get my new hardware running: >> >> Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz >> Intel DP965LT Mainboard >> Seagate SATA-Harddisk in AHCI-Mode >> >> After some hours of running or after some heavy file-i/o >> (find / | cpio -padm /test) I 
always get a corrupted
>> XFS-filesystem.

I solved the problem: I ran memtest and found a lot of memory errors. I then bought a different brand of memory, and everything is working fine.

The first memory I used was brand new; I bought it together with the board and processor. It was from Kingston. Now I have memory from Crucial, which seems to work fine.

Thanks to everyone for the help

Olli

From owner-xfs@oss.sgi.com Wed Apr 11 03:02:32 2007
Received: with ECARTIS (v1.0.0; list xfs); Wed, 11 Apr 2007 03:02:38 -0700 (PDT)
Received: from mail3b1.westend.com (mail3b.westend.com [212.117.79.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3BA2VfB011436 for ; Wed, 11 Apr 2007 03:02:32 -0700
Received: from localhost (localhost [127.0.0.1]) by mail3b1.westend.com (Postfix) with ESMTP id 67D6AC16A; Wed, 11 Apr 2007 11:36:25 +0200 (CEST)
Received: from mail3b1.westend.com ([127.0.0.1]) by localhost (mail3b [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 12154-01; Wed, 11 Apr 2007 11:36:22 +0200 (CEST)
Received: by mail3b1.westend.com (Postfix, from userid 1001) id EEC94C007; Wed, 11 Apr 2007 11:36:22 +0200 (CEST)
Date: Wed, 11 Apr 2007 11:36:22 +0200
From: Thomas Kaehn
To: Peter Grandi
Cc: Linux XFS
Subject: Re: Strange delete performance using XFS
Message-ID: <20070411093622.GB28503@mail3b.westend.com>
References: <20070404130535.GE18320@mail3b.westend.com> <20070404154523.GA20096@tuatara.stupidest.org> <20070405072803.GB2759@mail3b.westend.com> <20070405152917.GB23893@tuatara.stupidest.org> <17942.39236.623189.817503@base.ty.sabi.co.UK>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <17942.39236.623189.817503@base.ty.sabi.co.UK>
User-Agent: Mutt/1.5.9i
X-archive-position: 11082
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: tk@westend.com
Precedence: bulk
X-list: xfs

Hi Peter,

On Fri, Apr 06, 2007 at
08:02:28PM +0100, Peter Grandi wrote:
> > [ ... ] The Dell's raid system can safely buffer outstanding
> > writes and flush them, the 3ware can't so it stalls waiting
> > for the disks to catch up. [ ... ] I'd be tempted to just use
> > the 3ware as a JBOD and use sw, but I'm arguably biased, I've
> > had so many reliability and performance problems with hw raid
> > over the years
>
> Uhm, I had a friend that worked for a middling storage system
> vendor and he was telling me horror stories about bugs and
> misdesigns in their quite popular RAID products.
>
> 3ware seem to me one of the more reliable RAID brands, their
> cautious approach may be why they are slower above.

You are probably right. 3ware hasn't answered yet. However, I've found an option in the controller to set the "storsave" policy. In the default profile, FUA (force unit access) commands are acknowledged immediately only if a BBU is present; otherwise the controller waits until the data has been written to disk. With the "performance" profile, FUA commands are ignored and the delete time drops to a couple of seconds.

So the behaviour of the controller should be considered a feature. But I am still astonished at how slow deletes were in the first place. This might be a bug or incompatibility after all.

Thanks to everyone else for your suggestions. I'll let you know if 3ware has news for me.
Ciao, Thomas -- Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk@westend.com D-52072 Aachen Fax 0241/911879 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Die Gesellschaft ist eingetragen im Handelsregister Aachen unter HRB 7608 Geschäftsführer: Thomas Neugebauer, Thomas Heller, Michael Kolb From owner-xfs@oss.sgi.com Thu Apr 12 04:05:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 04:06:02 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3CB5ufB007456 for ; Thu, 12 Apr 2007 04:05:58 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id D15AA7BA307; Thu, 12 Apr 2007 05:05:51 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 1E3F0407F; Thu, 12 Apr 2007 05:05:50 -0600 (MDT) Date: Thu, 12 Apr 2007 05:05:50 -0600 From: Andreas Dilger To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com Cc: hch@infradead.org Subject: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070412110550.GM5967@schatzie.adilger.int> Mail-Followup-To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11087 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs Content-Length: 4714 Lines: 117 I'm interested in getting input for 
implementing an ioctl to efficiently map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem.

I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as implemented by Sun, but even if we could skip the holes we still might need to do millions of FIBMAPs to see how large files are allocated on disk. Conversely, having filesystems implement an efficient FIEMAP ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE and SEEK_DATA instead of looping over ->bmap() inside the kernel, as I saw one patch do.


struct fibmap_extent {
        __u64 fe_start;                 /* starting offset in bytes */
        __u64 fe_len;                   /* length in bytes */
};

struct fibmap {
        struct fibmap_extent fm_start;  /* offset, length of desired mapping */
        __u32 fm_extent_count;          /* number of extents in array */
        __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
        __u64 unused;
        struct fibmap_extent fm_extents[0];
};

#define FIEMAP_LEN_MASK         0xff000000000000
#define FIEMAP_LEN_HOLE         0x01000000000000
#define FIEMAP_LEN_UNWRITTEN    0x02000000000000

All offsets are in bytes to allow cases where filesystems are not doing block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder.
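
As a rough illustration of the flag-in-length idea behind the FIEMAP_LEN_* defines above, here is a minimal user-space sketch in C. It assumes the flag byte sits in the top 8 bits of the 64-bit length (so a single extent is limited to 2^56 bytes), which is one reading of the proposal and not necessarily what the posted constants encode. All SK_* and sk_* names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the proposed interface. */
#define SK_LEN_FLAG_SHIFT 56
#define SK_LEN_FLAG_MASK  (UINT64_C(0xff) << SK_LEN_FLAG_SHIFT)
#define SK_LEN_HOLE       (UINT64_C(0x01) << SK_LEN_FLAG_SHIFT)
#define SK_LEN_UNWRITTEN  (UINT64_C(0x02) << SK_LEN_FLAG_SHIFT)

/* Pack a byte length (below 2^56) and flag bits into one 64-bit field. */
static inline uint64_t sk_pack_len(uint64_t len, uint64_t flags)
{
        assert((len & SK_LEN_FLAG_MASK) == 0);    /* length must fit in 56 bits */
        assert((flags & ~SK_LEN_FLAG_MASK) == 0); /* flags live only in the top byte */
        return len | flags;
}

/* Strip the flag byte to recover the extent length in bytes. */
static inline uint64_t sk_len_bytes(uint64_t fe_len)
{
        return fe_len & ~SK_LEN_FLAG_MASK;
}

/* Compare against zero so the result survives narrowing to int. */
static inline int sk_is_hole(uint64_t fe_len)
{
        return (fe_len & SK_LEN_HOLE) != 0;
}

static inline int sk_is_unwritten(uint64_t fe_len)
{
        return (fe_len & SK_LEN_UNWRITTEN) != 0;
}
```

The point of the `!= 0` comparisons is that a flag bit above bit 31 is lost if the masked 64-bit value is assigned directly to an int.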
One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information. In order to do this XFS required expanding the per-extent struct from 32 to 48 bytes per extent, but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bytes or so for input/output flags per extent (would need to be masked before use).


Caller works something like:

        char buf[4096];
        struct fibmap *fm = (struct fibmap *)buf;
        int count = (sizeof(buf) - sizeof(*fm)) / sizeof(struct fibmap_extent);

        fm->fm_start.fe_start = 0;      /* start of file */
        fm->fm_start.fe_len = -1;       /* end of file */
        fm->fm_extent_count = count;    /* max extents in fm_extents[] array */
        fm->fm_flags = 0;               /* maybe "no DMAPI", etc like XFS */

        fd = open(path, O_RDONLY);
        printf("logical\t\tphysical\t\tbytes\n");

        /* The last call will return fewer extents than the maximum */
        while (fm->fm_extent_count == count) {
                rc = ioctl(fd, FIEMAP, fm);
                if (rc)
                        break;

                /* kernel filled in fm_extents[] array, set fm_extent_count
                 * to be actual number of extents returned, leaves fm_start
                 * alone (unlike XFS_IOC_GETBMAP). */

                for (i = 0; i < fm->fm_extent_count; i++) {
                        /* strip the flag bits to get the length in bytes */
                        __u64 len = fm->fm_extents[i].fe_len & ~FIEMAP_LEN_MASK;
                        __u64 fm_next = fm->fm_start.fe_start + len;
                        /* compare with 0: the flag bits do not fit in an int */
                        int hole = (fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE) != 0;
                        int unwr = (fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN) != 0;

                        printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                               fm->fm_start.fe_start, fm_next - 1,
                               hole ? 0 : fm->fm_extents[i].fe_start,
                               hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
                               len, hole ? "(hole) " : "",
                               unwr ? "(unwritten) " : "");

                        /* get ready for printing next extent, or next ioctl */
                        fm->fm_start.fe_start = fm_next;
                }
        }

I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP. I'm quite open to suggestions at this point, both in terms of how usable the fibmap data structures are by the caller, and if we need to add anything to make them more flexible for the future.
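
Because the returned array includes holes, the logical offset of each extent never has to be stored; it can be recovered by summing lengths from the start of the requested range. Below is a minimal sketch of that bookkeeping in C. It works on an in-memory array, not on a real ioctl result, since FIEMAP is only a proposal at this point. struct sk_extent, the SK_* flags and sk_logical_to_physical() are hypothetical names, and placing the flags in the top byte of the length is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout: flags in the top byte of the length field. */
#define SK_FLAG_MASK (UINT64_C(0xff) << 56)
#define SK_HOLE      (UINT64_C(0x01) << 56)

struct sk_extent {
        uint64_t start; /* physical byte offset; not meaningful for a hole */
        uint64_t len;   /* length in bytes, with flags in the top byte */
};

/*
 * Translate logical file offset `pos` into a physical byte address by
 * walking a packed extent list that begins at logical offset `map_start`.
 * Returns 0 and stores the address in *phys on success, or -1 if `pos`
 * falls in a hole or outside the mapped range.
 */
static inline int sk_logical_to_physical(const struct sk_extent *ext, size_t n,
                                         uint64_t map_start, uint64_t pos,
                                         uint64_t *phys)
{
        uint64_t logical = map_start; /* logical offset of ext[i] */

        assert(phys != NULL);
        assert(ext != NULL || n == 0);

        if (pos < map_start)
                return -1;

        for (size_t i = 0; i < n; i++) {
                uint64_t len = ext[i].len & ~SK_FLAG_MASK;

                if (pos - logical < len) {
                        if (ext[i].len & SK_HOLE)
                                return -1; /* nothing allocated here */
                        *phys = ext[i].start + (pos - logical);
                        return 0;
                }
                logical += len; /* next extent starts where this one ends */
        }
        return -1; /* past the last extent returned */
}
```

If holes were left out of the array, each entry would need its own logical offset as well, which is the extra per-extent field the packed layout avoids.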
In terms of implementing this in the kernel, there was originally code for this during the development of the ext3 extent patches and it was done via a callback in the extent tree iterator so it is very efficient. I believe it implements all that is needed to allow this interface to be mapped onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped filesystems, they can at least improve over the ->bmap() case by skipping holes in files that cover [dt]indirect blocks (saving thousands of calls). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Thu Apr 12 05:08:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 05:08:17 -0700 (PDT) Received: from ppsw-9.csi.cam.ac.uk (ppsw-9.csi.cam.ac.uk [131.111.8.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3CC83fB024925 for ; Thu, 12 Apr 2007 05:08:04 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:52806) by ppsw-9.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HbxOC-00043y-U8 (Exim 4.63) (return-path ); Thu, 12 Apr 2007 12:22:56 +0100 In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Thu, 12 Apr 2007 12:22:55 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11088 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs Content-Length: 7053 Lines: 191 Hi Andreas, On 12 Apr 2007, at 12:05, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to > efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a > billion > times. We already have customers with single files in the 10TB > range and > we additionally need to get the mapping over the network so it > needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. > > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. 
> > > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired > mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 Sound good but I would add: #define FIEMAP_LEN_NO_DIRECT_ACCESS This would say that the offset on disk can move at any time or that the data is compressed or encrypted on disk thus the data is not useful for direct disk access. On NTFS small files can be inside the inode and there direct access is not possible because the metadata on disk is protected with fixups which need to be removed when the inode is read into memory. If you access the data directly on disk, you would see corrupt data on reads and cause corruption on writes... Similarly both for compressed and encrypted files doing direct access to the on-disk data is totally nonsensical as you would see random junk on read and cause fatal data corruption on writes. Also why are you not using 0xff00000000000000, i.e. two more zeroes at the end? Seems unnecessary to drop an extra 8 bits of significance from the byte size... May not matter today but it almost certainly will do in the future (just remember what people said about the 640k limit in MSDOS when it first came out!)... Finally please make sure that the file system can return in one way or another errors for example when it fails to determine the extents because the system ran out of memory, there was an i/o error, whatever... 
It may even be useful to be able to say "here is an extent of size X bytes but we do not know where it is on disk because there was an error determining this particular extent's on-disk location for some reason or other"... > All offsets are in bytes to allow cases where filesystems are not > going Excellent! > block-aligned/sized allocations (e.g. tail packing). The > fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). Why the fe_start == 0? Surely just the flag is sufficient... On NTFS it is perfectly valid to have fe_start == 0 and to have that not be sparse (normally the $Boot system file is stored in the first 8 sectors of the volume)... Best regards, Anton > The ->fm_extents[] array includes all of the holes in addition to > allocated extents because this avoids the need to return both the > logical > and physical address for every extent and does not make processing any > harder. > > One feature that XFS_IOC_GETBMAPX has that may be desirable is the > ability to return unwritten extent information. In order to do > this XFS > required expanding the per-extent struct from 32 to 48 bytes per > extent, > but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what > hardship) > and keep 8 bytes or so for input/output flags per extent (would > need to > be masked before use). 
> > > Caller works something like: > > char buf[4096]; > struct fibmap *fm = (struct fibmap *)buf; > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > fm->fm_extent.fe_start = 0; /* start of file */ > fm->fm_extent.fe_len = -1; /* end of file */ > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > fd = open(path, O_RDONLY); > printf("logical\t\tphysical\t\tbytes\n"); > > /* The last entry will have less extents than the maximum */ > while (fm->fm_extent_count == count) { > rc = ioctl(fd, FIEMAP, fm); > if (rc) > break; > > /* kernel filled in fm_extents[] array, set fm_extent_count > * to be actual number of extents returned, leaves fm_start > * alone (unlike XFS_IOC_GETBMAP). */ > > for (i = 0; i < fm->fm_extent_count; i++) { > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > __u64 fm_next = fm->fm_start + len; > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > fm->fm_start, fm_next - 1, > hole ? 0 : fm->fm_extents[i].fe_start, > hole ? 0 : fm->fm_extents[i].fe_start + > fm->fm_extents[i].fe_len - 1, > len, hole ? "(hole) " : "", > unwr ? "(unwritten) " : ""); > > /* get ready for printing next extent, or next ioctl */ > fm->fm_start = fm_next; > } > } > > I'm not wedded to an ioctl interface, but it seems consistent with > FIBMAP. > I'm quite open to suggestions at this point, both in terms of how > usable > the fibmap data structures are by the caller, and if we need to add > anything > to make them more flexible for the future. > > In terms of implementing this in the kernel, there was originally > code for > this during the development of the ext3 extent patches and it was > done via > a callback in the extent tree iterator so it is very efficient. 
I > believe > it implements all that is needed to allow this interface to be mapped > onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped > filesystems, they can at least improve over the ->bmap() case by > skipping > holes in files that cover [dt]indirect blocks (saving thousands of > calls). > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Thu Apr 12 18:43:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 18:43:35 -0700 (PDT) Received: from alnrmhc14.comcast.net (alnrmhc14.comcast.net [206.18.177.54]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D1hDfB028328 for ; Thu, 12 Apr 2007 18:43:14 -0700 Received: from [192.168.1.10] (c-67-171-1-120.hsd1.wa.comcast.net[67.171.1.120]) by comcast.net (alnrmhc14) with SMTP id <20070413013301b1400npnf0e>; Fri, 13 Apr 2007 01:33:01 +0000 Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation From: Nicholas Miell To: Andreas Dilger Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> Content-Type: text/plain Date: Thu, 12 Apr 2007 18:33:00 -0700 Message-Id: <1176427980.3125.9.camel@entropy> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.0.njm.1) Content-Transfer-Encoding: 7bit X-archive-position: 11089 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nmiell@comcast.net Precedence: bulk X-list: xfs On Thu, 2007-04-12 at 05:05 -0600, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP 
a billion > times. We already have customers with single files in the 10TB range and > we additionally need to get the mapping over the network so it needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. > > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. > I certainly hope not. SEEK_HOLE/SEEK_DATA is a poor interface and doesn't deserve to spread. OTOH, this is nicely done. 
> > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > -- Nicholas Miell From owner-xfs@oss.sgi.com Thu Apr 12 19:26:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 19:26:49 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D2QZfB006786 for ; Thu, 12 Apr 2007 19:26:36 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3D0c044010281 for ; Fri, 13 Apr 2007 09:38:01 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.161]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3D0bgSc024600 for ; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3D0bgH18718 for xfs@oss.sgi.com; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv4.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3D0bgg19679 for ; Fri, 13 Apr 2007 09:37:42 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070413.093821.29702412 for ; Fri, 13 Apr 2007 09:38:21 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Fri Apr 13 09:38:20 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 2E96FAE4B3; Fri, 13 Apr 2007 
09:37:34 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3D0bfRL020548; Fri, 13 Apr 2007 09:37:41 +0900 Message-Id: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> Date: Fri, 13 Apr 2007 09:37:35 +0900 To: xfs@oss.sgi.com Subject: [PATCH] remove the unnecessary word in the log message. From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11090 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, This is the trivial fix to remove the unnecessary word in the log message. "required" is set in both xlog_recover() and xfs_dev_is_read_only(). Example: fsfile is the filesystem which needs log recovery. # losetup -r /dev/loop1 fsfile # mount -t xfs /dev/loop1 mpnt mount: block device /dev/loop1 is write-protected, mounting read-only mount: cannot mount block device /dev/loop1 read-only /var/log/messages: Jan 23 15:05:22 g9517 kernel: XFS: recovery required required on read-only device. Signed-off-by: Utako Kusaka --- --- linux-2.6.20-orig/fs/xfs/xfs_log_recover.c 2007-02-05 03:44:54.000000000 +0900 +++ linux-2.6.20/fs/xfs/xfs_log_recover.c 2007-04-11 13:23:04.000000000 +0900 @@ -3937,8 +3937,7 @@ xlog_recover( * under the vfs layer, so we can get away with it unless * the device itself is read-only, in which case we fail. 
*/ - if ((error = xfs_dev_is_read_only(log->l_mp, - "recovery required"))) { + if ((error = xfs_dev_is_read_only(log->l_mp, "recovery"))) { return error; } From owner-xfs@oss.sgi.com Thu Apr 12 21:02:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 21:02:07 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D420fB028758 for ; Thu, 12 Apr 2007 21:02:01 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 0A4317BA305; Thu, 12 Apr 2007 22:01:59 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id E0C57407F; Thu, 12 Apr 2007 22:01:56 -0600 (MDT) Date: Thu, 12 Apr 2007 22:01:56 -0600 From: Andreas Dilger To: Anton Altaparmakov Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org, linux-ext4@vger.kernel.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070413040156.GU5967@schatzie.adilger.int> Mail-Followup-To: Anton Altaparmakov , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11091 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: > On 12 Apr 2007, at 12:05, 
Andreas Dilger wrote:
> >I'm interested in getting input for implementing an ioctl to
> >efficiently map file extents & holes (FIEMAP) instead of looping
> >over FIBMAP a billion times. We already have customers with single
> >files in the 10TB range and we additionally need to get the mapping
> >over the network so it needs to be efficient in terms of how data
> >is passed, and how easily it can be extracted from the filesystem.
> >
> >struct fibmap_extent {
> >        __u64 fe_start; /* starting offset in bytes */
> >        __u64 fe_len; /* length in bytes */
> >}
> >
> >struct fibmap {
> >        struct fibmap_extent fm_start; /* offset, length of desired mapping */
> >        __u32 fm_extent_count; /* number of extents in array */
> >        __u32 fm_flags; /* flags for input request */
> >        XFS_IOC_GETBMAP) */
> >        __u64 unused;
> >        struct fibmap_extent fm_extents[0];
> >}
> >
> >#define FIEMAP_LEN_MASK 0xff000000000000
> >#define FIEMAP_LEN_HOLE 0x01000000000000
> >#define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>
> Sound good but I would add:
>
> #define FIEMAP_LEN_NO_DIRECT_ACCESS
>
> This would say that the offset on disk can move at any time or that
> the data is compressed or encrypted on disk thus the data is not
> useful for direct disk access.

This makes sense. Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

> Also why are you not using 0xff00000000000000, i.e. two more zeroes
> at the end? Seems unnecessary to drop an extra 8 bits of
> significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage). I'd meant
for it to be 2^56 bytes for the file size as I wrote later in the email.

That said, I think that 2^48 bytes is probably sufficient for most uses,
so that we get 16 bits for flags. As it is this email already discusses
5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practically
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with > 2^48 bytes of contiguous
space). Since the API is that it will return the extent that contains the
requested "start" byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent. In reality, I expect any file to have some discontinuities
and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.

> Finally please make sure that the file system can return in one way
> or another errors for example when it fails to determine the extents
> because the system ran out of memory, there was an i/o error,
> whatever... It may even be useful to be able to say "here is an
> extent of size X bytes but we do not know where it is on disk because
> there was an error determining this particular extent's on-disk
> location for some reason or other"...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem.
We want to return some indication that there is actual file data and not just a hole, but at the same time we don't want this to actually return the file from tape just to generate block mappings for it. This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ, but this needs to be specified on input to prevent the file being mapped and I'd rather the opposite (not getting file from tape) be the default, by principle of least surprise. > >block-aligned/sized allocations (e.g. tail packing). The > >fm_extents array > >returned contains the packed list of allocation extents for the file, > >including entries for holes (which have fe_start == 0, and a flag). > > Why the fe_start == 0? Surely just the flag is sufficient... On > NTFS it is perfectly valid to have fe_start == 0 and to have that not > be sparse (normally the $Boot system file is stored in the first 8 > sectors of the volume)... I thought fe_start = 0 was pretty standard for a hole. It should be something and I'd rather 0 than anything else. The _HOLE flag is enough as you say though. PS - I'd thought about adding you to the CC list for this, because I know you've had opinions on FIBMAP in the past, but I didn't have your email handy and it was late, and I know you saw the NTFS kmap patch on fsdevel so I figured you would see this too... Thanks for your input. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
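
For illustration only, here is a minimal sketch of the packed encoding discussed in the message above, assuming the 2^48-byte length limit with 16 flag bits stored above it. The names fe_sketch, fe_length(), fe_is_hole() and the FE_* flag values are hypothetical; nothing in the thread had settled on them. The sketch shows why a consumer has to mask the length before use and test the hole flag instead of comparing fe_start against 0.

```c
#include <stdint.h>

/*
 * Hypothetical layout (an assumption, not an agreed interface):
 * the low 48 bits of fe_len hold the byte length, the high 16 bits
 * hold per-extent flags.
 */
#define FE_LEN_BITS        48
#define FE_LEN_MASK        ((UINT64_C(1) << FE_LEN_BITS) - 1)
#define FE_FLAG_HOLE       (UINT64_C(0x0001) << FE_LEN_BITS)
#define FE_FLAG_UNWRITTEN  (UINT64_C(0x0002) << FE_LEN_BITS)

struct fe_sketch {
	uint64_t fe_start;	/* on-disk offset in bytes */
	uint64_t fe_len;	/* flags in the top 16 bits, length below */
};

/* Length in bytes with the flag bits masked off. */
static inline uint64_t fe_length(const struct fe_sketch *fe)
{
	return fe->fe_len & FE_LEN_MASK;
}

/*
 * A hole is identified by the flag alone. fe_start == 0 is not
 * checked, because block 0 can hold real data (e.g. $Boot on NTFS).
 */
static inline int fe_is_hole(const struct fe_sketch *fe)
{
	return (fe->fe_len & FE_FLAG_HOLE) != 0;
}
```

With this layout an extent { fe_start = 0, fe_len = 4096 } is ordinary data at the start of the device, while { fe_start = 0, fe_len = FE_FLAG_HOLE | 8192 } is an 8192-byte hole.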
From owner-xfs@oss.sgi.com Thu Apr 12 21:16:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 12 Apr 2007 21:16:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3D4FufB031559 for ; Thu, 12 Apr 2007 21:16:00 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA16550; Fri, 13 Apr 2007 14:15:54 +1000 Message-Id: <200704130415.OAA16550@larry.melbourne.sgi.com> From: "Barry Naujok" To: , "'xfs-dev'" Subject: [PATCH] xfs_repair - move realtime extent processing to a separate function Date: Fri, 13 Apr 2007 14:22:10 +1000 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_025A_01C77DD7.1FB6E8F0" X-Mailer: Microsoft Office Outlook, Build 11.0.6353 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 Thread-Index: Acd9g03WEX2xICCNSPiJxMhRydgOEw== X-archive-position: 11092 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs This is a multi-part message in MIME format. ------=_NextPart_000_025A_01C77DD7.1FB6E8F0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit

While changing the process_bmbt_reclist_int() function, I observed a
realtime check inside the block map get/set state loop which is quite
CPU intensive. Upon further investigation, this loop is not used at
all for realtime extents, and the two types of extents are pretty
much processed exclusively.

So, I simplified the functionality by moving the realtime extent
processing into its own function, and at the same time fixed a bug
affecting realtime inodes with attributes (it was comparing
attr extents to the realtime volume bmap instead of the normal bmap).
------=_NextPart_000_025A_01C77DD7.1FB6E8F0 Content-Type: application/octet-stream; name="separate_rt_extent_processing.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="separate_rt_extent_processing.patch" Index: repair/xfsprogs/repair/dinode.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- repair.orig/xfsprogs/repair/dinode.c 2007-04-13 13:07:16.000000000 +1000 +++ repair/xfsprogs/repair/dinode.c 2007-04-13 14:15:33.920960345 +1000 @@ -537,6 +537,121 @@ return (val); \ } while (0) =20 +static int +process_rt_rec( + xfs_mount_t *mp, + xfs_bmbt_rec_32_t *rp, + xfs_ino_t ino, + xfs_drfsbno_t *tot, + int check_dups) +{ + xfs_dfsbno_t b; + xfs_drtbno_t ext; + xfs_dfilblks_t c; /* count */ + xfs_dfsbno_t s; /* start */ + xfs_dfiloff_t o; /* offset */ + int state; + int flag; /* extent flag */ + int pwe; /* partially-written extent */ + + convert_extent(rp, &o, &s, &c, &flag); + + /* + * check numeric validity of the extent + */ + if (s >=3D mp->m_sb.sb_rblocks) { + do_warn(_("inode %llu - bad rt extent start block number " + "%llu, offset %llu\n"), ino, s, o); + return 1; + } + if (s + c - 1 >=3D mp->m_sb.sb_rblocks) { + do_warn(_("inode %llu - bad rt extent last block number %llu, " + "offset %llu\n"), ino, s + c - 1, o); + return 1; + } + if (s + c - 1 < s) { + do_warn(_("inode %llu - bad rt extent overflows - start %llu, " + "end %llu, offset %llu\n"), + ino, s, s + c - 1, o); + return 1; + } + + /* + * verify that the blocks listed in the record + * are multiples of an extent + */ + if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) =3D=3D 0 && + (s % mp->m_sb.sb_rextsize !=3D 0 || + c % mp->m_sb.sb_rextsize !=3D 0)) { + do_warn(_("malformed rt inode extent [%llu %llu] (fs rtext " + "size =3D %u)\n"), s, c, mp->m_sb.sb_rextsize); + return 1; + } + + /* + * set the 
appropriate number of extents + */ + for (b =3D s; b < s + c; b +=3D mp->m_sb.sb_rextsize) { + ext =3D (xfs_drtbno_t) b / mp->m_sb.sb_rextsize; + pwe =3D XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) && flag && + (b % mp->m_sb.sb_rextsize !=3D 0); + + if (check_dups =3D=3D 1) { + if (search_rt_dup_extent(mp, ext) && !pwe) { + do_warn(_("data fork in rt ino %llu claims " + "dup rt extent, off - %llu, " + "start - %llu, count %llu\n"), + ino, o, s, c); + return 1; + } + continue; + } + + state =3D get_rtbno_state(mp, ext); + + switch (state) { + case XR_E_FREE: + case XR_E_UNKNOWN: + set_rtbno_state(mp, ext, XR_E_INUSE); + break; + + case XR_E_BAD_STATE: + do_error(_("bad state in rt block map %llu\n"), + ext); + + case XR_E_FS_MAP: + case XR_E_INO: + case XR_E_INUSE_FS: + do_error(_("data fork in rt inode %llu found " + "metadata block %llu in rt bmap\n"), + ino, ext); + + case XR_E_INUSE: + if (pwe) + break; + + case XR_E_MULT: + set_rtbno_state(mp, ext, XR_E_MULT); + do_warn(_("data fork in rt inode %llu claims " + "used rt block %llu\n"), + ino, ext); + return 1; + + case XR_E_FREE1: + default: + do_error(_("illegal state %d in rt block map " + "%llu\n"), state, b); + } + } + + /* + * bump up the block counter + */ + *tot +=3D c; + + return 0; +} + /* * return 1 if inode should be cleared, 0 otherwise * if check_dups should be set to 1, that implies that @@ -560,7 +675,6 @@ int whichfork) { xfs_dfsbno_t b; - xfs_drtbno_t ext; xfs_dfilblks_t c; /* count */ xfs_dfilblks_t cp =3D 0; /* prev count */ xfs_dfsbno_t s; /* start */ @@ -572,7 +686,6 @@ int i; int state; int flag; /* extent flag */ - int pwe; /* partially-written extent */ xfs_dfsbno_t e; xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -615,28 +728,22 @@ o, s, ino); PROCESS_BMBT_UNLOCK_RETURN(1); } - if (type =3D=3D XR_INO_RTDATA) { - if (s >=3D mp->m_sb.sb_rblocks) { - do_warn( - _("inode %llu - bad rt extent start block number %llu, offset %llu\n"), - ino, s, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (s + c - 
1 >=3D mp->m_sb.sb_rblocks) { - do_warn( - _("inode %llu - bad rt extent last block number %llu, offset %llu\n"), - ino, s + c - 1, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (s + c - 1 < s) { - do_warn( - _("inode %llu - bad rt extent overflows - start %llu, end %llu, " - "offset %llu\n"), - ino, s, s + c - 1, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - } else { - switch (verify_dfsbno_range(mp, s, c)) { + + if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_DATA_FORK) { + /* + * realtime bitmaps don't use AG locks, so returning + * immediately is fine for this code path. + */ + if (process_rt_rec(mp, rp, ino, tot, check_dups)) + return 1; + /* + * skip rest of loop processing since that's + * all for regular file forks and attr forks + */ + continue; + } + + switch (verify_dfsbno_range(mp, s, c)) { case XR_DFSBNORANGE_VALID: break; case XR_DFSBNORANGE_BADSTART: @@ -656,109 +763,13 @@ "offset %llu\n"), ino, s, s + c - 1, o); PROCESS_BMBT_UNLOCK_RETURN(1); - } - if (o >=3D fs_max_file_offset) { - do_warn( + } + if (o >=3D fs_max_file_offset) { + do_warn( _("inode %llu - extent offset too large - start %llu, count %llu, " "offset %llu\n"), - ino, s, c, o); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - } - - /* - * realtime file data fork - */ - if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_DATA_FORK) { - /* - * XXX - verify that the blocks listed in the record - * are multiples of an extent - */ - if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) =3D=3D 0 - && (s % mp->m_sb.sb_rextsize !=3D 0 || - c % mp->m_sb.sb_rextsize !=3D 0)) { - do_warn( - _("malformed rt inode extent [%llu %llu] (fs rtext size =3D %u)\n"), - s, c, mp->m_sb.sb_rextsize); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - - /* - * XXX - set the appropriate number of extents - */ - for (b =3D s; b < s + c; b +=3D mp->m_sb.sb_rextsize) { - ext =3D (xfs_drtbno_t) b / mp->m_sb.sb_rextsize; - if (XFS_SB_VERSION_HASEXTFLGBIT(&mp->m_sb) && - flag && (b % mp->m_sb.sb_rextsize !=3D 0)) { - pwe =3D 1; - } else { - 
pwe =3D 0; - } - - if (check_dups =3D=3D 1) { - if (search_rt_dup_extent(mp, ext) && - !pwe) { - do_warn( - _("data fork in rt ino %llu claims dup rt extent, off - %llu, " - "start - %llu, count %llu\n"), - ino, o, s, c); - PROCESS_BMBT_UNLOCK_RETURN(1); - } - continue; - } - - state =3D get_rtbno_state(mp, ext); - - switch (state) { - case XR_E_FREE: -/* XXX - turn this back on after we - run process_rtbitmap() in phase2 - do_warn( - _("%s fork in rt ino %llu claims free rt block %llu\n"), - forkname, ino, ext); -*/ - /* fall through ... */ - case XR_E_UNKNOWN: - set_rtbno_state(mp, ext, XR_E_INUSE); - break; - case XR_E_BAD_STATE: - do_error( - _("bad state in rt block map %llu\n"), ext); - abort(); - break; - case XR_E_FS_MAP: - case XR_E_INO: - case XR_E_INUSE_FS: - do_error( - _("%s fork in rt inode %llu found metadata block %llu in %s bmap\n"), - forkname, ino, ext, ftype); - case XR_E_INUSE: - if (pwe) - break; - case XR_E_MULT: - set_rtbno_state(mp, ext, XR_E_MULT); - do_warn( - _("%s fork in rt inode %llu claims used rt block %llu\n"), - forkname, ino, ext); - PROCESS_BMBT_UNLOCK_RETURN(1); - case XR_E_FREE1: - default: - do_error( - _("illegal state %d in %s block map %llu\n"), - state, ftype, b); - } - } - - /* - * bump up the block counter - */ - *tot +=3D c; - - /* - * skip rest of loop processing since that's - * all for regular file forks and attr forks - */ - continue; + ino, s, c, o); + PROCESS_BMBT_UNLOCK_RETURN(1); } =20 =20 @@ -793,15 +804,6 @@ continue; } =20 - /* FIX FOR BUG 653709 -- EKN - * realtime attribute fork, should be valid block number - * in regular data space, not realtime partion. 
- */ - if (type =3D=3D XR_INO_RTDATA && whichfork =3D=3D XFS_ATTR_FORK) { - if (mp->m_sb.sb_agcount < agno) - PROCESS_BMBT_UNLOCK_RETURN(1); - } - /* Process in chunks of 16 (XR_BB_UNIT/XR_BB) * for common XR_E_UNKNOWN to XR_E_INUSE transition */ ------=_NextPart_000_025A_01C77DD7.1FB6E8F0-- From owner-xfs@oss.sgi.com Fri Apr 13 00:46:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 00:46:45 -0700 (PDT) Received: from ppsw-3.csi.cam.ac.uk (ppsw-3.csi.cam.ac.uk [131.111.8.133]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D7kYfB005684 for ; Fri, 13 Apr 2007 00:46:35 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:49243) by ppsw-3.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.153]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HcGU8-0000Xj-BR (Exim 4.63) (return-path ); Fri, 13 Apr 2007 08:46:20 +0100 In-Reply-To: <20070413040156.GU5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> <20070413040156.GU5967@schatzie.adilger.int> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Fri, 13 Apr 2007 08:46:18 +0100 To: Andreas Dilger X-Mailer: Apple Mail (2.752.3) X-archive-position: 11093 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs Hi Andreas, On 13 Apr 2007, at 05:01, Andreas Dilger wrote: > On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: >> On 12 Apr 2007, at 12:05, Andreas Dilger 
wrote: >>> I'm interested in getting input for implementing an ioctl to >>> efficiently map file extents & holes (FIEMAP) instead of looping >>> over FIBMAP a billion times. We already have customers with single >>> files in the 10TB range and we additionally need to get the mapping >>> over the network so it needs to be efficient in terms of how data >>> is passed, and how easily it can be extracted from the filesystem. >>> >>> struct fibmap_extent { >>> __u64 fe_start; /* starting offset in bytes */ >>> __u64 fe_len; /* length in bytes */ >>> } >>> >>> struct fibmap { >>> struct fibmap_extent fm_start; /* offset, length of desired >>> mapping */ >>> __u32 fm_extent_count; /* number of extents in array */ >>> __u32 fm_flags; /* flags for input request */ >>> XFS_IOC_GETBMAP) */ >>> __u64 unused; >>> struct fibmap_extent fm_extents[0]; >>> } >>> >>> #define FIEMAP_LEN_MASK 0xff000000000000 >>> #define FIEMAP_LEN_HOLE 0x01000000000000 >>> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 >> >> Sound good but I would add: >> >> #define FIEMAP_LEN_NO_DIRECT_ACCESS >> >> This would say that the offset on disk can move at any time or that >> the data is compressed or encrypted on disk thus the data is not >> useful for direct disk access. > > This makes sense. Even for Reiserfs the same is true with packed > tails, > and I believe if FIBMAP is called on a tail it will migrate the > tail into > a block because this is might be a sign that the file is a kernel that > LILO wants to boot. > > I'd rather not have any such feature in FIEMAP, and just return the > on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me. > My main reason for FIEMAP is being able to investigate allocation > patterns > of files. > > By no means is my flag list exhaustive, just the ones that I > thought would > be needed to implement this for ext4 and Lustre. Sure, hence why I made my comment for NTFS. (-: And yes, ReiserFS and even ext* could use such flag. 
I believe there is a compression patch for ext somewhere isn't there? (Or at least there was one at some point I think...) >> Also why are you not using 0xff00000000000000, i.e. two more zeroes >> at the end? Seems unnecessary to drop an extra 8 bits of >> significance from the byte size... > > It was actually just a typo (this was the first time I'd written the > structs and flags down, it is just at the discussion stage). I'd > meant > for it to be 2^56 bytes for the file size as I wrote later in the > email. Ok. (-: > That said, I think that 2^48 bytes is probably sufficient for most > uses, > so that we get 16 bits for flags. As it is this email already > discusses > 5 flags, and that would give little room for expansion in the future. > > Remember, this is the mapping for a single file (which can't > practially > be beyond 2^64 bytes as yet) so it wouldn't be hard for the > filesystem to > return a few separate extents which are actually contiguous > (assuming that > there will actually be files in filesystems with > 2^48 bytes of > contiguous > space). Since the API is that it will return the extent that > contains the > requested "start" byte, the kernel will be able to detect this case > also, > since it won't be able to specify a length for the extent that > contains the > start byte. Valid point. As long as the "on-disk location" is maintained as full 64 bits then you are right we could just return multiple extents if the space does not fit. A bit of a kludge but it would certainly work. An alternative would be to have the flags in a separate field but that would add 8-bytes to the structure size if you want to maintain 8-byte alignment so that would not be great... > At most we'd have to call the ioctl() 65536 times for a completely > contiguous 2^64 byte file if the buffer was only large enough for a > single extent. 
In reality, I expect any file to have some > discontinuities > and the buffer to be large enough for a thousand or more entries so > the > corner case is not very bad. > >> Finally please make sure that the file system can return in one way >> or another errors for example when it fails to determine the extents >> because the system ran out of memory, there was an i/o error, >> whatever... It may even be useful to be able to say "here is an >> extent of size X bytes but we do not know where it is on disk because >> there was an error determining this particular extent's on-disk >> location for some reason or other"... > > Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and > FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated > to tape and currently has no blocks allocated in the filesystem. We > want to return some indication that there is actual file data and not > just a hole, but at the same time we don't want this to actually > return > the file from tape just to generate block mappings for it. Yes, NTFS also has off line storage (DFS - the Distributed File System I think it is called) but we don't support any of that. Perhaps one day... > This concept is also present in XFS_IOC_GETBMAPX - > BMV_IF_NO_DMAPI_READ, > but this needs to be specified on input to prevent the file being > mapped > and I'd rather the opposite (not getting file from tape) be the > default, > by principle of least surprise. > >>> block-aligned/sized allocations (e.g. tail packing). The >>> fm_extents array >>> returned contains the packed list of allocation extents for the >>> file, >>> including entries for holes (which have fe_start == 0, and a flag). >> >> Why the fe_start == 0? Surely just the flag is sufficient... On >> NTFS it is perfectly valid to have fe_start == 0 and to have that not >> be sparse (normally the $Boot system file is stored in the first 8 >> sectors of the volume)... > > I thought fe_start = 0 was pretty standard for a hole. 
It should be > something and I'd rather 0 than anything else. The _HOLE flag is > enough > as you say though. It is standard on Unix. I am trying to fight this standard because of NTFS... On NTFS a hole is -1 not 0 and zero is a valid block. But on NTFS device locations are "s64" not "u64" so the -1 is logical to use... As long as it is made clear that people MUST check the flag when fe_start == 0 rather than assume that fe_start == 0 means a hole I am happy with that. Hopefully not too many programmers will be lazy gits who will ignore this and just check fe_start == 0 or they will fail on NTFS and assume $Boot is sparse when it is not... > PS - I'd thought about adding you to the CC list for this, because > I know > you've had opinions on FIBMAP in the past, but I didn't have > your email handy and it was late, and I know you saw the NTFS > kmap > patch on fsdevel so I figured you would see this too... Thanks. Yes, I try to follow fsdevel closely and LKML not so closely (I often read it with "select all new, delete")... > Thanks for your input. You are welcome. Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Fri Apr 13 01:04:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 01:04:29 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D84PfB010564 for ; Fri, 13 Apr 2007 01:04:26 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcGAh-0005dQ-77; Fri, 13 Apr 2007 08:26:15 +0100 Date: Fri, 13 Apr 2007 08:26:15 +0100 From: Christoph Hellwig To: Utako Kusaka Cc: xfs@oss.sgi.com Subject: Re: [PATCH] remove the unnecessary word in the log message. 
Message-ID: <20070413072615.GB20326@infradead.org> References: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704130037.AA05196@TNESG9305.tnes.nec.co.jp> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11094 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 13, 2007 at 09:37:35AM +0900, Utako Kusaka wrote: > --- linux-2.6.20-orig/fs/xfs/xfs_log_recover.c 2007-02-05 03:44:54.000000000 +0900 > +++ linux-2.6.20/fs/xfs/xfs_log_recover.c 2007-04-11 13:23:04.000000000 +0900 > @@ -3937,8 +3937,7 @@ xlog_recover( > * under the vfs layer, so we can get away with it unless > * the device itself is read-only, in which case we fail. > */ > - if ((error = xfs_dev_is_read_only(log->l_mp, > - "recovery required"))) { > + if ((error = xfs_dev_is_read_only(log->l_mp, "recovery"))) { > return error; > } Looks good. 
(And gets rid of an ugly line-break, nice) From owner-xfs@oss.sgi.com Fri Apr 13 01:04:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 01:04:32 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3D84RfB010574 for ; Fri, 13 Apr 2007 01:04:29 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcGBD-0005dk-S2; Fri, 13 Apr 2007 08:26:47 +0100 Date: Fri, 13 Apr 2007 08:26:47 +0100 From: Christoph Hellwig To: Barry Naujok Cc: xfs@oss.sgi.com, "'xfs-dev'" Subject: Re: [PATCH] xfs_repair - move realtime extent processing to a separate function Message-ID: <20070413072647.GC20326@infradead.org> References: <200704130415.OAA16550@larry.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200704130415.OAA16550@larry.melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11095 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 13, 2007 at 02:22:10PM +1000, Barry Naujok wrote: > While changing the process_bmbt_reclist_int() function, I observed a > realtime check inside the block map get/set state loop which is quite > CPU intensive. Upon further investigation, this loop is not used at > all for realtime extents and that the two types of extents are pretty > much processed exclusively. > > So, I simplified the functionality by moving the realtime extent > processing into it's own function and fixing a bug at the same time > when it comes to realtime inodes with attributes (it was comparing > attr extents to the realtime volume bmap instead of the normal bmap). Nice cleanup, looks good. 
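
As a rough illustration of the numeric validity checks that the new process_rt_rec() performs on a realtime extent, here is a standalone sketch. It is not the xfs_repair code: rt_extent_check(), the rt_check enum and the plain integer parameters are hypothetical stand-ins for mp->m_sb.sb_rblocks, mp->m_sb.sb_rextsize and XFS_SB_VERSION_HASEXTFLGBIT(), and block numbers are assumed to be unsigned 64-bit. Unlike the patch, the sketch tests for wraparound before the end-of-volume check so that an overflow is always reported as such; both orderings reject the same extents.

```c
#include <stdint.h>

/* Hypothetical result codes, one per check made in the patch. */
enum rt_check {
	RT_OK = 0,
	RT_BAD_START,	/* start block is beyond the realtime volume */
	RT_BAD_END,	/* last block is beyond the realtime volume */
	RT_OVERFLOW,	/* start + count - 1 wrapped around */
	RT_UNALIGNED	/* not a multiple of the realtime extent size */
};

/*
 * Validate the extent [start, start + count) against a realtime
 * volume of rblocks blocks with rextsize blocks per realtime extent.
 * rextsize is assumed to be non-zero. has_extflg mirrors the
 * superblock feature test: alignment is only enforced without it.
 */
static enum rt_check
rt_extent_check(uint64_t start, uint64_t count, uint64_t rblocks,
		uint32_t rextsize, int has_extflg)
{
	uint64_t last = start + count - 1;	/* may wrap; checked below */

	if (start >= rblocks)
		return RT_BAD_START;
	if (last < start)
		return RT_OVERFLOW;
	if (last >= rblocks)
		return RT_BAD_END;
	if (!has_extflg &&
	    (start % rextsize != 0 || count % rextsize != 0))
		return RT_UNALIGNED;
	return RT_OK;
}
```

For a 1024-block volume with 4-block realtime extents, rt_extent_check(0, 16, 1024, 4, 0) passes, while an extent starting at block 2 is only accepted when has_extflg is set.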
From owner-xfs@oss.sgi.com Fri Apr 13 03:15:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 03:15:12 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DAF7fB007394 for ; Fri, 13 Apr 2007 03:15:09 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HcIo7-0003PJ-50; Fri, 13 Apr 2007 11:15:07 +0100 Date: Fri, 13 Apr 2007 11:15:07 +0100 From: Christoph Hellwig To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070413101507.GA11406@infradead.org> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11096 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff000000000000 > #define FIEMAP_LEN_HOLE 0x01000000000000 > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > All offsets are in bytes to allow cases where filesystems are not going > block-aligned/sized allocations (e.g. 
tail packing). The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information. In order to do this XFS
> required expanding the per-extent struct from 32 to 48 bytes per extent,
> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
> and keep 8 bytes or so for input/output flags per extent (would need to
> be masked before use).

I'd be much happier to have a separate per-extent flags value.
For one thing, this allows much nicer representations of unwritten
extents or holes without taking bits away from the len value. It also
allows interesting uses in the future, e.g. reporting an offline
extent for use in HSM applications. Also, for this kernel<->user
interface the wasted space shouldn't matter too much - if you want to
pass the above condensed structure over the wire in Lustre that
shouldn't be a problem; you'd have to convert to an endian-neutral
on-the-wire format anyway. Not doing the masking also makes the
interface quite a bit simpler to use.

One additional feature from the XFS getbmapx interface we should
provide is the ability to query the layout of xattrs. While other
filesystems might not have the exact xattr fork XFS has, it fits
nicely into the interface, especially when we have Anton's suggested
flag for inline data.
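A minimal sketch of the two layouts being weighed here, not taken from any posted patch. All struct, macro, and function names are hypothetical, and the 56/8 bit split is an assumption based on the "2^56 bytes" remark in the quoted proposal. It shows that the packed form needs a mask and shift on every access, while the explicit form reads the fields directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here is from a posted patch. */

/* Packed form: flags share a 64-bit word with the length (assumed split). */
struct ex_packed {
    uint64_t start;         /* starting offset in bytes */
    uint64_t len_and_flags; /* low 56 bits: length, high 8 bits: flags */
};

/* Explicit form: flags get their own field, at the cost of 8 more bytes. */
struct ex_flagged {
    uint64_t start;    /* starting offset in bytes */
    uint64_t len;      /* length in bytes, all 64 bits usable */
    uint32_t flags;    /* EX_* bits */
    uint32_t reserved; /* padding, kept zero */
};

#define EX_FLAG_SHIFT 56
#define EX_LEN_MASK   ((UINT64_C(1) << EX_FLAG_SHIFT) - 1)
#define EX_HOLE       UINT32_C(0x01)
#define EX_UNWRITTEN  UINT32_C(0x02)

/* Fold an explicit extent into the packed form. */
static struct ex_packed ex_pack(struct ex_flagged e)
{
    struct ex_packed p;

    assert(e.len <= EX_LEN_MASK); /* length must fit in 56 bits */
    assert(e.flags <= 0xff);      /* only 8 flag bits are available */
    p.start = e.start;
    p.len_and_flags = ((uint64_t)e.flags << EX_FLAG_SHIFT) | e.len;
    return p;
}

/* Recover length and flags; every consumer of the packed form must do this. */
static struct ex_flagged ex_unpack(struct ex_packed p)
{
    struct ex_flagged e;

    e.start = p.start;
    e.len = p.len_and_flags & EX_LEN_MASK;
    e.flags = (uint32_t)(p.len_and_flags >> EX_FLAG_SHIFT);
    e.reserved = 0;
    return e;
}
```

With the explicit form a caller tests `e.flags & EX_HOLE` and uses `e.len` as is; with the packed form, forgetting the mask silently yields an enormous length.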
From owner-xfs@oss.sgi.com Fri Apr 13 04:39:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 04:39:27 -0700 (PDT) Received: from ppsw-3.csi.cam.ac.uk (ppsw-3.csi.cam.ac.uk [131.111.8.133]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DBdIfB026056 for ; Fri, 13 Apr 2007 04:39:20 -0700 X-Cam-SpamDetails: Not scanned X-Cam-AntiVirus: No virus found X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from imp.csi.cam.ac.uk ([131.111.10.57]:50371) by ppsw-3.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.153]:587) with esmtpsa (PLAIN:aia21) (TLSv1:AES128-SHA:128) id 1HcK7H-0004AB-CS (Exim 4.63) (return-path ); Fri, 13 Apr 2007 12:38:59 +0100 In-Reply-To: <20070413101507.GA11406@infradead.org> References: <20070412110550.GM5967@schatzie.adilger.int> <20070413101507.GA11406@infradead.org> Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com Content-Transfer-Encoding: 7bit From: Anton Altaparmakov Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Date: Fri, 13 Apr 2007 12:38:58 +0100 To: Christoph Hellwig X-Mailer: Apple Mail (2.752.3) X-archive-position: 11097 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aia21@cam.ac.uk Precedence: bulk X-list: xfs On 13 Apr 2007, at 11:15, Christoph Hellwig wrote: > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: >> struct fibmap_extent { >> __u64 fe_start; /* starting offset in bytes */ >> __u64 fe_len; /* length in bytes */ >> } >> >> struct fibmap { >> struct fibmap_extent fm_start; /* offset, length of desired >> mapping */ >> __u32 fm_extent_count; /* number of extents in array */ >> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ >> __u64 unused; >> struct 
fibmap_extent fm_extents[0]; >> } >> >> #define FIEMAP_LEN_MASK 0xff000000000000 >> #define FIEMAP_LEN_HOLE 0x01000000000000 >> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 >> >> All offsets are in bytes to allow cases where filesystems are not >> going >> block-aligned/sized allocations (e.g. tail packing). The >> fm_extents array >> returned contains the packed list of allocation extents for the file, >> including entries for holes (which have fe_start == 0, and a flag). > >> One feature that XFS_IOC_GETBMAPX has that may be desirable is the >> ability to return unwritten extent information. In order to do >> this XFS >> required expanding the per-extent struct from 32 to 48 bytes per >> extent, >> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what >> hardship) >> and keep 8 bytes or so for input/output flags per extent (would >> need to >> be masked before use). > > I'd be much happier to have the separate per-extent flags value. > For one thing this allows much nicer representations of unwritten > extents or holes without taking away bits from the len value. It also > allows to make interesting use of this in the future, e.g. telling > about an offline exttent for use in HSM applications. Also for > this kernel<->user interface the wasted space shouldn't matter too > much - if you want to pass the above condensed structure over the > wire in lustre that shouldn't a problem, you'd have to convert > to an endian-neutral on the wire format anyway. Not doing the > masking also make the interface quite a bit simpler to use. > > One addition freature from the XFS getbmapx interface we should > provide is the ability to query layout of xattrs. While other > filesystems might not have the exact xattr fork XFS has it fits > nicely into the interface. Especially when we have Anton's suggested > flag for inline data. 
Would it not be better to allow people to get a file descriptor on
the xattr fork and then just run the normal FIEMAP ioctl on that file
descriptor?

I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR
or whatever... An alternative API would be to provide a
"getxattrfd()/fgetxattrfd()" call or similar that would, instead of
returning the value of an xattr, return an fd to it. Then you do not
need to modify openat() at all... Interface doesn't bother me, just
some ideas...

And for XFS you would define a magic streamname or xattrname (or
whatever you want to call it) of say
"com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and
then XFS would intercept that and know what to do with it...

Such an interface could then be used by NTFS named streams and other
file systems providing such things...

(Yes I know I will now totally get flamed about named streams not
being wanted in Linux and crap like that but that is exactly what you
are asking for, except you want to special-case a particular stream
using a flag instead of calling it what it really is, and once you
start doing that you might as well allow full named streams...)

You can just see named streams as an alternative, non-atomic API to
xattrs if you like, i.e. you can either use the atomic xattr API
provided in Linux already or you can get a file descriptor to an
xattr and then use the normal system calls to access it
non-atomically, thus you can use the FIEMAP ioctl also. (-:

FWIW this two-API approach to xattrs/named streams is the direction
OSX is heading towards also so it is not without precedent, and
Windows has had both APIs for many years. And Solaris has the
"openat(O_XATTR)" interface so that is not without precedent either.

Best regards,

Anton

PS. to all flamers: I am going to delete any non-technical flames
without replying so please do us all a favour and don't bother...
Thanks.
-- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ From owner-xfs@oss.sgi.com Fri Apr 13 08:12:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 08:12:12 -0700 (PDT) Received: from mx1.suse.de (mail.suse.de [195.135.220.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DFC5fB002711 for ; Fri, 13 Apr 2007 08:12:06 -0700 Received: from Relay2.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.suse.de (Postfix) with ESMTP id BA42B122ED; Fri, 13 Apr 2007 16:54:50 +0200 (CEST) Message-ID: <461F997E.30002@suse.com> Date: Fri, 13 Apr 2007 10:53:50 -0400 From: Jeff Mahoney Organization: SUSE Labs, Novell, Inc User-Agent: Thunderbird 1.5.0.10 (X11/20060911) MIME-Version: 1.0 To: Anton Altaparmakov , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation References: <20070412110550.GM5967@schatzie.adilger.int> <97211C89-1810-4B22-B2F4-9D206D43C1F6@cam.ac.uk> <20070413040156.GU5967@schatzie.adilger.int> In-Reply-To: <20070413040156.GU5967@schatzie.adilger.int> X-Enigmail-Version: 0.94.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11098 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jeffm@suse.com Precedence: bulk X-list: xfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Andreas Dilger wrote: > On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: >> This would say that the offset on disk can move at any time or that >> the data is compressed or encrypted on disk thus the data is not >> useful for direct disk access. > > This makes sense. 
Even for Reiserfs the same is true with packed tails, > and I believe if FIBMAP is called on a tail it will migrate the tail into > a block because this is might be a sign that the file is a kernel that > LILO wants to boot. Actually, reiserfs_aop_bmap() returns 0 when the requested block is in a tail. There's a separate ioctl for unpacking them. - -Jeff - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org iD8DBQFGH5l+LPWxlyuTD7IRAn5/AJ9VcocIcDGr9wtAlgGZuOAQWqVASwCfVdWM uLZQq1mkf8hsGXOpZtKQH5w= =AxnN -----END PGP SIGNATURE----- From owner-xfs@oss.sgi.com Fri Apr 13 12:06:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 13 Apr 2007 12:07:02 -0700 (PDT) Received: from rwcrmhc15.comcast.net (rwcrmhc15.comcast.net [216.148.227.155]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3DJ6wfB025581 for ; Fri, 13 Apr 2007 12:06:59 -0700 Received: from [192.168.1.10] (c-67-171-1-120.hsd1.wa.comcast.net[67.171.1.120]) by comcast.net (rwcrmhc15) with SMTP id <20070413185549m15007ivg8e>; Fri, 13 Apr 2007 18:55:50 +0000 Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation From: Nicholas Miell To: Anton Altaparmakov Cc: Christoph Hellwig , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com In-Reply-To: <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> References: <20070412110550.GM5967@schatzie.adilger.int> <20070413101507.GA11406@infradead.org> <37B026AB-60FA-4595-B2B1-F57BB023D91C@cam.ac.uk> Content-Type: text/plain Date: Fri, 13 Apr 2007 11:55:49 -0700 Message-Id: <1176490549.3122.16.camel@entropy> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.0.njm.1) Content-Transfer-Encoding: 7bit X-archive-position: 11099 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nmiell@comcast.net Precedence: bulk X-list: xfs On Fri, 2007-04-13 at 12:38 +0100, Anton 
Altaparmakov wrote: > > One addition freature from the XFS getbmapx interface we should > > provide is the ability to query layout of xattrs. While other > > filesystems might not have the exact xattr fork XFS has it fits > > nicely into the interface. Especially when we have Anton's suggested > > flag for inline data. > > Would it not be better to allow people to get a file descriptor on > the xattr fork and then just run the normal FIEMAP ioctl on that file > descriptor? > > I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR > or whatever... An alternative API would be to provide a "getxattrfd > ()/fgetxattrfd()" call or similar that would instead of returning the > value of an xattr return an fd to it. Then you do not need to modify > openat() at all... Interface doesn't bother me, just some ideas... > > And for XFS you would define a magic streamname or xattrname (or > whatever you want to call it) of say > "com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and > then XFS would intercept that and know what to do with it... > > Such an interface could then be used by NTFS named streams and other > file systems providing such things... > > (Yes I know I will now totally get flamed about named streams not > being wanted in Linux and crap like that but that is exactly what you > are asking for except you want to special case a particular stream > using a flag instead of calling it for what it really is and once you > start doing that you might as well allow full named streams...) > > You can just see named streams as an alternative, non-atomic API to > xattrs if you like, i.e. you can either use the atomic xattr API > provided in Linux already or you can get a file descriptor to an > xattr and then use the normal system calls to access it non- > atomically thus you can use the FIEMAP ioctl also. 
(-:
>
> FWIW this two-API approach to xattrs/named streams is the direction
> OSX is heading towards also so it is not without precedent and
> Windows has had both APIs for many years. And Solaris has the "openat
> (O_XATTR)" interface so that is not without precedent either.

Except that xattrs in Linux aren't streams, and providing a stream-like
interface to them would be a weird abuse of the xattr concept.

In essence, Linux xattrs are named extensions to struct stat, with
getxattr() being in the same category as stat() and setxattr() being in
the same category as chmod()/chown()/utime()/etc.

The system namespace exists to provide a better interface than ioctl()
to weird FS-specific features (DOS attribute bits, HFS+ creator/type,
ext2/3/reiserfs/etc. immutable/append-only/secure-delete/etc. attributes
and so on). The uptake of this feature isn't as high as I'd like, but
that's what it's there for.

The security namespace is there for all the neat LSM modules that need
to attach metadata to files in order to function.

Finally, the user namespace exists to allow users to attach small bits
of information to their own files, since the API was already there and
hey!, metadata is useful.

Now, Solaris came along and totally confused the issue by using the same
name for a completely different feature, but that isn't any real reason
to mess up the existing Linux xattr concept just to graft named streams
support into the kernel.

(Not that I'm opposed to named streams in Linux, you just have to
realize that xattrs aren't named streams, can't live in the same
namespace as named streams, and certainly don't serve the same purpose
as named streams.)
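As a small illustration of the namespaces described above: Linux xattr names carry their namespace as a dotted prefix ("user.", "system.", "security."), so an attribute can be classified by name alone. The enum and function names below are hypothetical, and only the three namespaces mentioned in this message are recognised; anything else falls into a catch-all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the namespaces named in the message. */
enum xattr_ns {
    XATTR_NS_USER,     /* small user-attached metadata */
    XATTR_NS_SYSTEM,   /* FS-specific features, instead of ioctl() */
    XATTR_NS_SECURITY, /* metadata for LSM modules */
    XATTR_NS_OTHER     /* any prefix not covered above */
};

/* Return nonzero if 'name' begins with 'prefix'. */
static int has_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Classify an xattr name such as "user.comment" by its namespace prefix. */
static enum xattr_ns xattr_namespace(const char *name)
{
    assert(name != NULL);
    if (has_prefix(name, "user."))
        return XATTR_NS_USER;
    if (has_prefix(name, "system."))
        return XATTR_NS_SYSTEM;
    if (has_prefix(name, "security."))
        return XATTR_NS_SECURITY;
    return XATTR_NS_OTHER;
}
```

The trailing dot in each prefix matters: a name like "username" is not in the user namespace.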
-- Nicholas Miell From owner-xfs@oss.sgi.com Sun Apr 15 21:00:17 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 15 Apr 2007 21:00:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G40DfB007691 for ; Sun, 15 Apr 2007 21:00:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA05077; Mon, 16 Apr 2007 14:00:07 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1116) id 0A0415903EDB; Mon, 16 Apr 2007 14:00:06 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: TAKE 963465 - export xfs_buftarg_list for xfsidbg (using func) Message-Id: <20070416040007.0A0415903EDB@chook.melbourne.sgi.com> Date: Mon, 16 Apr 2007 14:00:06 +1000 (EST) From: tes@sgi.com (Tim Shimmin) X-archive-position: 11100 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Export via a function xfs_buftarg_list for use by kdb/xfsidbg. Date: Mon Apr 16 13:58:41 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs Inspected by: lachlan@sgi.com The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28414a fs/xfs/xfsidbg.c - 1.313 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfsidbg.c.diff?r1=text&tr1=1.313&r2=text&tr2=1.312&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.6/xfs_buf.h - 1.119 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.h.diff?r1=text&tr1=1.119&r2=text&tr2=1.118&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. 
fs/xfs/linux-2.6/xfs_buf.c - 1.235 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_buf.c.diff?r1=text&tr1=1.235&r2=text&tr2=1.234&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_buf.h - 1.118 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_buf.h.diff?r1=text&tr1=1.118&r2=text&tr2=1.117&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_buf.c - 1.219 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_buf.c.diff?r1=text&tr1=1.219&r2=text&tr2=1.218&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.6/xfs_ksyms.c - 1.57 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_ksyms.c.diff?r1=text&tr1=1.57&r2=text&tr2=1.56&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. fs/xfs/linux-2.4/xfs_ksyms.c - 1.49 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/xfs_ksyms.c.diff?r1=text&tr1=1.49&r2=text&tr2=1.48&f=h - Export via a function xfs_buftarg_list for use by kdb/xfsidbg. 
From owner-xfs@oss.sgi.com Sun Apr 15 22:21:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 15 Apr 2007 22:21:19 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G5LDfB027136 for ; Sun, 15 Apr 2007 22:21:15 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA06555; Mon, 16 Apr 2007 15:21:09 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 1116) id 0483F5903EDB; Mon, 16 Apr 2007 15:21:08 +1000 (EST) To: sgi.bugs.xfs@engr.sgi.com, xfs@oss.sgi.com Subject: TAKE 963466 - remove the unnecessary word in the log message Message-Id: <20070416052109.0483F5903EDB@chook.melbourne.sgi.com> Date: Mon, 16 Apr 2007 15:21:08 +1000 (EST) From: tes@sgi.com (Tim Shimmin) X-archive-position: 11101 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Thanks to Utako Kusaka. Signed-off-by: Utako Kusaka Get rid of redundant "required" in msg. Date: Mon Apr 16 15:19:51 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/tes/2.6.x-xfs Inspected by: utako@tnes.nec.co.jp,hch@infradead.org The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28416a fs/xfs/xfs_log_recover.c - 1.318 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_log_recover.c.diff?r1=text&tr1=1.318&r2=text&tr2=1.317&f=h - Signed-off-by: Utako Kusaka Get rid of redundant "required" in msg. 
From owner-xfs@oss.sgi.com Mon Apr 16 00:59:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 16 Apr 2007 00:59:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3G7xhfB010633 for ; Mon, 16 Apr 2007 00:59:46 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA10446; Mon, 16 Apr 2007 17:59:20 +1000 Date: Mon, 16 Apr 2007 18:01:17 +1000 From: Timothy Shimmin To: Andreas Dilger , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com cc: hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11102 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs Hi Andreas, --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion > times. ... > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. They certainly seem to be (combining entries and header). 
> struct fibmap_extent {
>         __u64 fe_start;                 /* starting offset in bytes */
>         __u64 fe_len;                   /* length in bytes */
> }
>
> struct fibmap {
>         struct fibmap_extent fm_start;  /* offset, length of desired mapping */
>         __u32 fm_extent_count;          /* number of extents in array */
>         __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
>         __u64 unused;
>         struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK         0xff000000000000
> #define FIEMAP_LEN_HOLE         0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN    0x02000000000000
>
> All offsets are in bytes to allow cases where filesystems are not going
> block-aligned/sized allocations (e.g. tail packing). The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).
>
> The ->fm_extents[] array includes all of the holes in addition to
> allocated extents because this avoids the need to return both the logical
> and physical address for every extent and does not make processing any
> harder.

Well, that's what stood out for me. I was wondering where the "fe_block"
field had gone - the "physical address".
So is your "fe_start; /* starting offset */" actually the disk location
(not a logical file offset) _except_ in the header (fibmap), where it is
the desired logical offset?
Okay, looking at your example use below, that's what it looks like.
And when you refer to fm_start below, you mean fm_start.fe_start?
Sorry, I realise this is just an approximation but this part confused me.

So you get rid of all the logical file offsets in the extents because we
report holes explicitly (and we know everything is contiguous if you
include the holes).
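A minimal sketch of the reading arrived at above, under the assumption that each returned entry carries only a physical address and a length, and that holes are listed explicitly. The logical position of an entry is then the running sum of the lengths before it. The struct and function names are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one returned entry. */
struct ext {
    uint64_t phys; /* on-disk byte address; ignored for holes */
    uint64_t len;  /* length in bytes */
    int hole;      /* nonzero if this entry describes a hole */
};

/*
 * Map logical byte 'off' to a physical byte address by walking the
 * contiguous list and summing lengths. 'base' is the logical offset
 * at which v[0] starts.
 *
 * Returns 0 and sets *phys for allocated data, 1 if 'off' falls in a
 * hole, and -1 if 'off' lies outside the range covered by the list.
 */
static int ext_lookup(const struct ext *v, size_t n, uint64_t base,
                      uint64_t off, uint64_t *phys)
{
    uint64_t logical = base; /* logical start of the current entry */
    size_t i;

    assert(v != NULL && phys != NULL);
    if (off < base)
        return -1;
    for (i = 0; i < n; i++) {
        if (off - logical < v[i].len) {
            if (v[i].hole)
                return 1;
            *phys = v[i].phys + (off - logical);
            return 0;
        }
        logical += v[i].len; /* next entry starts where this one ends */
    }
    return -1;
}
```

The lookup is only as good as the contiguity assumption: if an entry were missing or the file changed between calls, every later logical offset would be shifted.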
--Tim

>
> Caller works something like:
>
>         char buf[4096];
>         struct fibmap *fm = (struct fibmap *)buf;
>         int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
>
>         fm->fm_extent.fe_start = 0; /* start of file */
>         fm->fm_extent.fe_len = -1; /* end of file */
>         fm->fm_extent_count = count; /* max extents in fm_extents[] array */
>         fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
>
>         fd = open(path, O_RDONLY);
>         printf("logical\t\tphysical\t\tbytes\n");
>
>         /* The last entry will have less extents than the maximum */
>         while (fm->fm_extent_count == count) {
>                 rc = ioctl(fd, FIEMAP, fm);
>                 if (rc)
>                         break;
>
>                 /* kernel filled in fm_extents[] array, set fm_extent_count
>                  * to be actual number of extents returned, leaves fm_start
>                  * alone (unlike XFS_IOC_GETBMAP). */
>
>                 for (i = 0; i < fm->fm_extent_count; i++) {
>                         __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
>                         __u64 fm_next = fm->fm_start + len;
>                         int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
>                         int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
>
>                         printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
>                                 fm->fm_start, fm_next - 1,
>                                 hole ? 0 : fm->fm_extents[i].fe_start,
>                                 hole ? 0 : fm->fm_extents[i].fe_start +
>                                            fm->fm_extents[i].fe_len - 1,
>                                 len, hole ? "(hole) " : "",
>                                 unwr ? "(unwritten) " : "");
>
>                         /* get ready for printing next extent, or next ioctl */
>                         fm->fm_start = fm_next;
>                 }
>         }
>

From owner-xfs@oss.sgi.com Mon Apr 16 04:23:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 16 Apr 2007 04:23:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3GBN6fB021516 for ; Mon, 16 Apr 2007 04:23:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id VAA14873; Mon, 16 Apr 2007 21:22:56 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3GBMsAf66125042; Mon, 16 Apr 2007 21:22:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3GBMrmv66162864; Mon, 16 Apr 2007 21:22:53 +1000 (AEST) Date: Mon, 16 Apr 2007 21:22:53 +1000 From: David Chinner To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070416112252.GJ48531920@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070412110550.GM5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11103 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs

On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> I'm interested in getting input for implementing an ioctl to efficiently
> map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
> times.
We already have customers with single files in the 10TB range and
> we additionally need to get the mapping over the network so it needs to
> be efficient in terms of how data is passed, and how easily it can be
> extracted from the filesystem.
>
> I had come up with a plan independently and was also steered toward
> XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
> plan, though I think the XFS structs used there are a bit bloated.

Yeah, they were designed with having a long term stable ABI that
limited expandability. Hence the "future" fields that never got
used ;)

> There was also recent discussion about SEEK_HOLE and SEEK_DATA as
> implemented by Sun, but even if we could skip the holes we still might
> need to do millions of FIBMAPs to see how large files are allocated
> on disk. Conversely, having filesystems implement an efficient FIBMAP
> ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
> and SEEK_DATA instead of doing looping over ->bmap() inside the kernel
> as I saw one patch.

Yup.

> struct fibmap_extent {
>         __u64 fe_start;                 /* starting offset in bytes */
>         __u64 fe_len;                   /* length in bytes */
> }
>
> struct fibmap {
>         struct fibmap_extent fm_start;  /* offset, length of desired mapping */
>         __u32 fm_extent_count;          /* number of extents in array */
>         __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
>         __u64 unused;
>         struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK         0xff000000000000
> #define FIEMAP_LEN_HOLE         0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN    0x02000000000000

I'm not sure I like stealing bits from the length to use as flags -
I'd prefer an explicit field per fibmap_extent for this.

Given that xfs_bmap uses extra information from the filesystem
(geometry) to display extra (and frequently used) information
about the alignment of extents.
ie:

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
   0: [0..151]:        288444888..288445039    8 (1696536..1696687)     152 00010
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width

This information could be easily passed up in the flags fields if the
filesystem has geometry information (there go 4 more flags ;).

Also - what are the explicit sync semantics of this ioctl? The
XFS ioctl causes a fsync of the file first to convert delalloc
extents to real extents before returning the bmap. Is this functionality
going to be the same? If not, then we need a DELALLOC flag to indicate
extents that haven't been allocated yet. This might be handy to
have, anyway....

> All offsets are in bytes to allow cases where filesystems are not going
> block-aligned/sized allocations (e.g. tail packing).

So it'll be ok for a few years yet ;)

> The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

Internally in XFS, we pass these around as:

#define DELAYSTARTBLOCK         ((xfs_fsblock_t)-1LL)
#define HOLESTARTBLOCK          ((xfs_fsblock_t)-2LL)

And the offset passed out through XFS_IOC_GETBMAP[X] is a block
number of -1 for the start of a hole. Hence we don't need a flag for
this. We can expose delalloc extents like this as well without
needing flags...

> The ->fm_extents[] array includes all of the holes in addition to
> allocated extents because this avoids the need to return both the logical
> and physical address for every extent and does not make processing any
> harder.

Doesn't really make it any easier to map to disk, either.

> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information.

You got that with the unwritten flag above.....
> required expanding the per-extent struct from 32 to 48 bytes per extent,

not sure I follow your maths here?

> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
> and keep 8 bytes or so for input/output flags per extent (would need to
             ^^^^^ bits?
> be masked before use).
>
>
> Caller works something like:
>
>         char buf[4096];
>         struct fibmap *fm = (struct fibmap *)buf;
>         int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
>
>         fm->fm_extent.fe_start = 0; /* start of file */
>         fm->fm_extent.fe_len = -1; /* end of file */
>         fm->fm_extent_count = count; /* max extents in fm_extents[] array */
>         fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
>
>         fd = open(path, O_RDONLY);
>         printf("logical\t\tphysical\t\tbytes\n");
>
>         /* The last entry will have less extents than the maximum */
>         while (fm->fm_extent_count == count) {

fm_extent_count is an in/out parameter?

>                 rc = ioctl(fd, FIEMAP, fm);
>                 if (rc)
>                         break;
>
>                 /* kernel filled in fm_extents[] array, set fm_extent_count
>                  * to be actual number of extents returned, leaves fm_start
>                  * alone (unlike XFS_IOC_GETBMAP). */

Ok, it is.

>                 for (i = 0; i < fm->fm_extent_count; i++) {
>                         __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
>                         __u64 fm_next = fm->fm_start + len;
>                         int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
>                         int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
>
>                         printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
>                                 fm->fm_start, fm_next - 1,
>                                 hole ? 0 : fm->fm_extents[i].fe_start,
>                                 hole ? 0 : fm->fm_extents[i].fe_start +
>                                            fm->fm_extents[i].fe_len - 1,
>                                 len, hole ? "(hole) " : "",
>                                 unwr ? "(unwritten) " : "");
>
>                         /* get ready for printing next extent, or next ioctl */
>                         fm->fm_start = fm_next;

Ok, so the only way you can determine where you are in the file
is by adding up the length of each extent. What happens if the file
is changing underneath you e.g. someone punches out a hole
in the file, or truncates and extends it again between ioctl()
calls?
Also, what happens if you ask for an offset/len that doesn't map to
any extent boundaries - are you truncating the extents returned to
the off/len passed in?

xfs_bmap gets around this by finding out how many extents there are in the
file and allocating a buffer that big to hold all the extents so they
are gathered in a single atomic call (think sparse matrix files)....

> I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
> I'm quite open to suggestions at this point, both in terms of how usable
> the fibmap data structures are by the caller, and if we need to add anything
> to make them more flexible for the future.

ioctl is fine by me. Perhaps a version number in the structure header
would be handy so we can modify the interface easily in the future without
having to worry about breaking userspace....

> In terms of implementing this in the kernel, there was originally code for
> this during the development of the ext3 extent patches and it was done via
> a callback in the extent tree iterator so it is very efficient. I believe
> it implements all that is needed to allow this interface to be mapped
> onto XFS_IOC_BMAP internally (or vice versa).

I wouldn't map the ioctls - I'd just write another interface to
xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP
interface. Is there any code yet?

Cheers,

Dave.
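A minimal sketch of the sentinel approach mentioned earlier in this reply, where reserved start-block values stand in for "hole" and "delayed allocation" so that no per-extent flag is needed for them. The two constants use the values quoted above; the `fsblock_t` typedef is a stand-in that assumes a 64-bit unsigned block type, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for xfs_fsblock_t; a 64-bit unsigned type is assumed here. */
typedef uint64_t fsblock_t;

/* Sentinel values as quoted in the message above. */
#define DELAYSTARTBLOCK ((fsblock_t)-1LL)
#define HOLESTARTBLOCK  ((fsblock_t)-2LL)

/* Hypothetical classification of an extent by its start block. */
enum ext_kind {
    EXT_REAL,     /* start block is a real disk address */
    EXT_DELALLOC, /* space reserved but not yet allocated */
    EXT_HOLE      /* no space behind this range */
};

/* Decide what an extent is from its start block alone. */
static enum ext_kind ext_classify(fsblock_t startblock)
{
    if (startblock == DELAYSTARTBLOCK)
        return EXT_DELALLOC;
    if (startblock == HOLESTARTBLOCK)
        return EXT_HOLE;
    return EXT_REAL;
}

/* Human-readable label for printing an extent list. */
static const char *ext_kind_name(enum ext_kind k)
{
    switch (k) {
    case EXT_REAL:
        return "real";
    case EXT_DELALLOC:
        return "delalloc";
    case EXT_HOLE:
        return "hole";
    }
    assert(!"unknown extent kind");
    return "?";
}
```

The cost of this encoding is that the two highest block numbers can never be real addresses, which is harmless in practice; states that can coexist with a real address, such as unwritten, still need a flag.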
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 17 05:55:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 05:55:50 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3HCthfB002992 for ; Tue, 17 Apr 2007 05:55:45 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3HCtgDX010326 for ; Tue, 17 Apr 2007 08:55:42 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3HCtg8k444038 for ; Tue, 17 Apr 2007 08:55:42 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3HCtPR3011241 for ; Tue, 17 Apr 2007 08:55:26 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3HCtOU8010339; Tue, 17 Apr 2007 08:55:25 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id DDD1129ED6E; Tue, 17 Apr 2007 18:25:15 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3HCtEZA015242; Tue, 17 Apr 2007 18:25:14 +0530 Date: Tue, 17 Apr 2007 18:25:14 +0530 From: "Amit K. 
Arora" To: Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070417125514.GA7574@amitarora.in.ibm.com> References: <20070117094658.GA17390@amitarora.in.ibm.com> <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070330071417.GI355@devserv.devel.redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 11104 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > Wouldn't > int fallocate(loff_t offset, loff_t len, int fd, int mode) > work on both s390 and ppc/arm? glibc will certainly wrap it and > reorder the arguments as needed, so there is no need to keep fd first. > I think more people are comfortable with this approach. Since glibc will wrap the system call and export the "conventional" interface (with fd first) to applications, we may not worry about keeping fd first in kernel code. I am personally fine with this approach. Still, if people have major concerns, we can think of getting rid of the "mode" argument itself. Anyhow we may, in future, need to have a policy based system call (say, for providing the goal block by applications for performance reasons). 
"mode" can then be made part of it. Comments ? -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue Apr 17 13:57:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 13:57:43 -0700 (PDT) Received: from mx2.redhat.com (mx2.redhat.com [66.187.237.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3HKvdfB029040 for ; Tue, 17 Apr 2007 13:57:40 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx2.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUfeT009781 for ; Tue, 17 Apr 2007 16:30:42 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUesm004074 for ; Tue, 17 Apr 2007 16:30:40 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3HKUcAx017901 for ; Tue, 17 Apr 2007 16:30:40 -0400 Message-ID: <46252D94.2050106@sandeen.net> Date: Tue, 17 Apr 2007 15:27:00 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: xfs mailing list Subject: when is a dmapi tarball not a dmapi tarball? Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11105 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs when you find it on oss, it seems :) [esandeen@neon tmp]$ wget ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz --15:11:31-- ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz => `dmapi_2.2.8-1.tar.gz' Resolving oss.sgi.com... 192.48.170.157 Connecting to oss.sgi.com|192.48.170.157|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD /projects/xfs/cmd_tars ... done. ==> SIZE dmapi_2.2.8-1.tar.gz ... 84649 ==> PASV ... done. ==> RETR dmapi_2.2.8-1.tar.gz ... done. 
Length: 84649 (83K) 100%[=======================================>] 84,649 133K/s in 0.6s 15:11:33 (133 KB/s) - `dmapi_2.2.8-1.tar.gz' saved [84649] [esandeen@neon tmp]$ tar xvzf dmapi_2.2.8-1.tar.gz gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error exit delayed from previous errors [esandeen@neon tmp]$ file dmapi_2.2.8-1.tar.gz dmapi_2.2.8-1.tar.gz: RPM v3 src IA64 dmapi-2.2.8-1 [esandeen@neon tmp]$ rpm -qpl dmapi_2.2.8-1.tar.gz dmapi-2.2.8.src.tar.gz dmapi.spec Might want to fix that... -Eric From owner-xfs@oss.sgi.com Tue Apr 17 18:03:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 17 Apr 2007 18:03:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3I138fB018969 for ; Tue, 17 Apr 2007 18:03:10 -0700 Received: from pcbnaujok (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA09246; Wed, 18 Apr 2007 11:03:00 +1000 Message-Id: <200704180103.LAA09246@larry.melbourne.sgi.com> From: "Barry Naujok" To: "'Eric Sandeen'" , "'xfs mailing list'" Subject: RE: when is a dmapi tarball not a dmapi tarball? Date: Wed, 18 Apr 2007 11:08:45 +1000 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.6353 In-Reply-To: <46252D94.2050106@sandeen.net> Thread-Index: AceBM0jqao+UQMxMRwqcjMZZKELIMQAIs3SA X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 X-archive-position: 11106 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs Fixed! 
> -----Original Message----- > From: xfs-bounce@oss.sgi.com [mailto:xfs-bounce@oss.sgi.com] > On Behalf Of Eric Sandeen > Sent: Wednesday, 18 April 2007 6:27 AM > To: xfs mailing list > Subject: when is a dmapi tarball not a dmapi tarball? > > when you find it on oss, it seems :) > > [esandeen@neon tmp]$ wget > ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz > --15:11:31-- > ftp://oss.sgi.com/projects/xfs/cmd_tars/dmapi_2.2.8-1.tar.gz > => `dmapi_2.2.8-1.tar.gz' > Resolving oss.sgi.com... 192.48.170.157 > Connecting to oss.sgi.com|192.48.170.157|:21... connected. > Logging in as anonymous ... Logged in! > ==> SYST ... done. ==> PWD ... done. > ==> TYPE I ... done. ==> CWD /projects/xfs/cmd_tars ... done. > ==> SIZE dmapi_2.2.8-1.tar.gz ... 84649 > ==> PASV ... done. ==> RETR dmapi_2.2.8-1.tar.gz ... done. > Length: 84649 (83K) > > 100%[=======================================>] 84,649 > 133K/s in > 0.6s > > 15:11:33 (133 KB/s) - `dmapi_2.2.8-1.tar.gz' saved [84649] > > [esandeen@neon tmp]$ tar xvzf dmapi_2.2.8-1.tar.gz > > gzip: stdin: not in gzip format > tar: Child returned status 1 > tar: Error exit delayed from previous errors > > [esandeen@neon tmp]$ file dmapi_2.2.8-1.tar.gz > dmapi_2.2.8-1.tar.gz: RPM v3 src IA64 dmapi-2.2.8-1 > > [esandeen@neon tmp]$ rpm -qpl dmapi_2.2.8-1.tar.gz > dmapi-2.2.8.src.tar.gz > dmapi.spec > > Might want to fix that... 
> > -Eric > > From owner-xfs@oss.sgi.com Wed Apr 18 06:06:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 06:06:12 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3ID63fB007264 for ; Wed, 18 Apr 2007 06:06:05 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 68EF64E4594; Wed, 18 Apr 2007 07:06:02 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 911374141; Wed, 18 Apr 2007 07:06:00 -0600 (MDT) Date: Wed, 18 Apr 2007 07:06:00 -0600 From: Andreas Dilger To: "Amit K. Arora" Cc: Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com Subject: Re: Interface for the new fallocate() system call Message-ID: <20070418130600.GW5967@schatzie.adilger.int> Mail-Followup-To: "Amit K. 
Arora" , Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com References: <20070225022326.137b4875.akpm@linux-foundation.org> <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070417125514.GA7574@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11107 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 17, 2007 18:25 +0530, Amit K. Arora wrote: > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > > Wouldn't > > int fallocate(loff_t offset, loff_t len, int fd, int mode) > > work on both s390 and ppc/arm? glibc will certainly wrap it and > > reorder the arguments as needed, so there is no need to keep fd first. > > I think more people are comfortable with this approach. Really? I thought from the last postings that "fd first, wrap on s390" was better. > Since glibc > will wrap the system call and export the "conventional" interface > (with fd first) to applications, we may not worry about keeping fd first > in kernel code. I am personally fine with this approach. It would seem to make more sense to wrap the syscall on those architectures that can't handle the "conventional" interface (fd first). 
> Still, if people have major concerns, we can think of getting rid of the > "mode" argument itself. Anyhow we may, in future, need to have a policy > based system call (say, for providing the goal block by applications for > performance reasons). "mode" can then be made part of it. We need at least mode="unallocate" or a separate funallocate() call to allow allocated-but-unwritten blocks to be unallocated without actually punching out written data. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From owner-xfs@oss.sgi.com Wed Apr 18 10:57:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 10:57:41 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IHvXfB017246 for ; Wed, 18 Apr 2007 10:57:36 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3IHvVLD018372 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 18 Apr 2007 19:57:31 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3IHvUal018370 for xfs@oss.sgi.com; Wed, 18 Apr 2007 19:57:30 +0200 Date: Wed, 18 Apr 2007 19:57:30 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [PATCH] remove various useless min/max macros Message-ID: <20070418175730.GA18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11108 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs xfs_btree.h has various macros to calculate a min/max after casting its arguments to a specific type. This can be done much simpler by using min_t/max_t with the type as first argument. 
Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_alloc.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_alloc.c 2007-04-13 13:40:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_alloc.c 2007-04-13 13:44:07.000000000 +0200 @@ -151,11 +151,11 @@ xfs_alloc_compute_diff( if (newbno1 >= freeend) newbno1 = NULLAGBLOCK; else - newlen1 = XFS_EXTLEN_MIN(wantlen, freeend - newbno1); + newlen1 = min_t(xfs_extlen_t, wantlen, freeend - newbno1); if (newbno2 < freebno) newbno2 = NULLAGBLOCK; else - newlen2 = XFS_EXTLEN_MIN(wantlen, freeend - newbno2); + newlen2 = min_t(xfs_extlen_t, wantlen, freeend - newbno2); if (newbno1 != NULLAGBLOCK && newbno2 != NULLAGBLOCK) { if (newlen1 < newlen2 || (newlen1 == newlen2 && @@ -686,7 +686,7 @@ xfs_alloc_ag_vextent_exact( * End of extent will be smaller of the freespace end and the * maximal requested end. */ - end = XFS_AGBLOCK_MIN(fend, maxend); + end = min_t(xfs_agblock_t, fend, maxend); /* * Fix the length according to mod and prod if given. */ @@ -850,7 +850,7 @@ xfs_alloc_ag_vextent_near( args->alignment, args->minlen, &ltbnoa, &ltlena)) continue; - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); ASSERT(args->len >= args->minlen); if (args->len < blen) @@ -1007,7 +1007,7 @@ xfs_alloc_ag_vextent_near( /* * Fix up the length. */ - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; ltdiff = xfs_alloc_compute_diff(args->agbno, rlen, @@ -1045,7 +1045,7 @@ xfs_alloc_ag_vextent_near( */ if (gtlena >= args->minlen) { args->len = - XFS_EXTLEN_MIN(gtlena, + min_t(xfs_extlen_t, gtlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; @@ -1104,7 +1104,7 @@ xfs_alloc_ag_vextent_near( /* * Fix up the length. 
*/ - args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen); + args->len = min_t(xfs_extlen_t, gtlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; gtdiff = xfs_alloc_compute_diff(args->agbno, rlen, @@ -1141,7 +1141,7 @@ xfs_alloc_ag_vextent_near( * compare the two and pick the best. */ if (ltlena >= args->minlen) { - args->len = XFS_EXTLEN_MIN( + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); rlen = args->len; @@ -1221,7 +1221,7 @@ xfs_alloc_ag_vextent_near( * Fix up the length and compute the useful address. */ ltend = ltbno + ltlen; - args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen); + args->len = min_t(xfs_extlen_t, ltlena, args->maxlen); xfs_alloc_fix_len(args); if (!xfs_alloc_fix_minleft(args)) { TRACE_ALLOC("nominleft", args); @@ -1320,7 +1320,7 @@ xfs_alloc_ag_vextent_size( */ xfs_alloc_compute_aligned(fbno, flen, args->alignment, args->minlen, &rbno, &rlen); - rlen = XFS_EXTLEN_MIN(args->maxlen, rlen); + rlen = min_t(xfs_extlen_t, args->maxlen, rlen); XFS_WANT_CORRUPTED_GOTO(rlen == 0 || (rlen <= flen && rbno + rlen <= fbno + flen), error0); if (rlen < args->maxlen) { @@ -1346,7 +1346,7 @@ xfs_alloc_ag_vextent_size( break; xfs_alloc_compute_aligned(fbno, flen, args->alignment, args->minlen, &rbno, &rlen); - rlen = XFS_EXTLEN_MIN(args->maxlen, rlen); + rlen = min_t(xfs_extlen_t, args->maxlen, rlen); XFS_WANT_CORRUPTED_GOTO(rlen == 0 || (rlen <= flen && rbno + rlen <= fbno + flen), error0); Index: linux-2.6/fs/xfs/xfs_bmap.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_bmap.c 2007-04-13 13:41:43.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_bmap.c 2007-04-13 13:45:14.000000000 +0200 @@ -994,7 +994,7 @@ xfs_bmap_add_extent_delay_real( LEFT.br_state))) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock)); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); 
xfs_bmap_trace_post_update(fname, "LF|LC", ip, idx, @@ -1043,7 +1043,7 @@ xfs_bmap_add_extent_delay_real( if (error) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock) - (cur ? cur->bc_private.b.allocated : 0)); ep = xfs_iext_get_ext(ifp, idx + 1); @@ -1090,7 +1090,7 @@ xfs_bmap_add_extent_delay_real( RIGHT.br_state))) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock)); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "RF|RC", ip, idx, @@ -1138,7 +1138,7 @@ xfs_bmap_add_extent_delay_real( if (error) goto done; } - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), STARTBLOCKVAL(PREV.br_startblock) - (cur ? cur->bc_private.b.allocated : 0)); ep = xfs_iext_get_ext(ifp, idx); @@ -3177,7 +3177,7 @@ xfs_bmap_del_extent( xfs_bmbt_set_blockcount(ep, temp); ifp->if_lastex = idx; if (delay) { - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), da_old); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "2", ip, idx, @@ -3206,7 +3206,7 @@ xfs_bmap_del_extent( xfs_bmbt_set_blockcount(ep, temp); ifp->if_lastex = idx; if (delay) { - temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), + temp = min_t(xfs_filblks_t, xfs_bmap_worst_indlen(ip, temp), da_old); xfs_bmbt_set_startblock(ep, NULLSTARTBLOCK((int)temp)); xfs_bmap_trace_post_update(fname, "1", ip, idx, @@ -4337,7 +4337,7 @@ xfs_bmap_first_unused( return 0; } lastaddr = off + xfs_bmbt_get_blockcount(ep); - max = XFS_FILEOFF_MAX(lastaddr, lowest); + max = max_t(xfs_fileoff_t, lastaddr, lowest); } *first_unused = max; return 0; @@ -4850,16 +4850,16 @@ xfs_bmapi( } } else if 
(wasdelay) { alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(len, + min_t(xfs_filblks_t, len, (got.br_startoff + got.br_blockcount) - bno); aoff = bno; } else { alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(len, MAXEXTLEN); + min_t(xfs_filblks_t, len, MAXEXTLEN); if (!eof) alen = (xfs_extlen_t) - XFS_FILBLKS_MIN(alen, + min_t(xfs_filblks_t, alen, got.br_startoff - bno); aoff = bno; } @@ -5087,7 +5087,7 @@ xfs_bmapi( mval->br_startoff = bno; mval->br_startblock = HOLESTARTBLOCK; mval->br_blockcount = - XFS_FILBLKS_MIN(len, got.br_startoff - bno); + min_t(xfs_filblks_t, len, got.br_startoff - bno); mval->br_state = XFS_EXT_NORM; bno += mval->br_blockcount; len -= mval->br_blockcount; @@ -5122,7 +5122,7 @@ xfs_bmapi( * didn't overlap what was asked for. */ mval->br_blockcount = - XFS_FILBLKS_MIN(end - bno, got.br_blockcount - + min_t(xfs_filblks_t, end - bno, got.br_blockcount - (bno - got.br_startoff)); mval->br_state = got.br_state; ASSERT(mval->br_blockcount <= len); @@ -5462,7 +5462,7 @@ xfs_bunmapi( * Is the last block of this extent before the range * we're supposed to delete? If so, we're done. */ - bno = XFS_FILEOFF_MIN(bno, + bno = min_t(xfs_fileoff_t, bno, got.br_startoff + got.br_blockcount - 1); if (bno < start) break; Index: linux-2.6/fs/xfs/xfs_btree.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_btree.h 2007-04-13 13:43:19.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_btree.h 2007-04-13 13:43:56.000000000 +0200 @@ -440,35 +440,6 @@ xfs_btree_setbuf( #endif /* __KERNEL__ */ - -/* - * Min and max functions for extlen, agblock, fileoff, and filblks types. - */ -#define XFS_EXTLEN_MIN(a,b) \ - ((xfs_extlen_t)(a) < (xfs_extlen_t)(b) ? \ - (xfs_extlen_t)(a) : (xfs_extlen_t)(b)) -#define XFS_EXTLEN_MAX(a,b) \ - ((xfs_extlen_t)(a) > (xfs_extlen_t)(b) ? \ - (xfs_extlen_t)(a) : (xfs_extlen_t)(b)) -#define XFS_AGBLOCK_MIN(a,b) \ - ((xfs_agblock_t)(a) < (xfs_agblock_t)(b) ? 
\ - (xfs_agblock_t)(a) : (xfs_agblock_t)(b)) -#define XFS_AGBLOCK_MAX(a,b) \ - ((xfs_agblock_t)(a) > (xfs_agblock_t)(b) ? \ - (xfs_agblock_t)(a) : (xfs_agblock_t)(b)) -#define XFS_FILEOFF_MIN(a,b) \ - ((xfs_fileoff_t)(a) < (xfs_fileoff_t)(b) ? \ - (xfs_fileoff_t)(a) : (xfs_fileoff_t)(b)) -#define XFS_FILEOFF_MAX(a,b) \ - ((xfs_fileoff_t)(a) > (xfs_fileoff_t)(b) ? \ - (xfs_fileoff_t)(a) : (xfs_fileoff_t)(b)) -#define XFS_FILBLKS_MIN(a,b) \ - ((xfs_filblks_t)(a) < (xfs_filblks_t)(b) ? \ - (xfs_filblks_t)(a) : (xfs_filblks_t)(b)) -#define XFS_FILBLKS_MAX(a,b) \ - ((xfs_filblks_t)(a) > (xfs_filblks_t)(b) ? \ - (xfs_filblks_t)(a) : (xfs_filblks_t)(b)) - #define XFS_FSB_SANITY_CHECK(mp,fsb) \ (XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \ XFS_FSB_TO_AGBNO(mp, fsb) < mp->m_sb.sb_agblocks) Index: linux-2.6/fs/xfs/xfs_inode.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_inode.c 2007-04-13 13:42:08.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_inode.c 2007-04-13 13:42:19.000000000 +0200 @@ -1341,7 +1341,7 @@ xfs_file_last_byte( last_block = 0; } size_last_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)ip->i_d.di_size); - last_block = XFS_FILEOFF_MAX(last_block, size_last_block); + last_block = max_t(xfs_fileoff_t, last_block, size_last_block); last_byte = XFS_FSB_TO_B(mp, last_block); if (last_byte < 0) { Index: linux-2.6/fs/xfs/xfs_iomap.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_iomap.c 2007-04-13 13:42:08.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_iomap.c 2007-04-13 13:42:23.000000000 +0200 @@ -820,7 +820,7 @@ xfs_iomap_write_allocate( end_fsb = XFS_B_TO_FSB(mp, ip->i_d.di_size); xfs_bmap_last_offset(NULL, ip, &last_block, XFS_DATA_FORK); - last_block = XFS_FILEOFF_MAX(last_block, end_fsb); + last_block = max_t(xfs_fileoff_t, last_block, end_fsb); if ((map_start_fsb + count_fsb) > last_block) { count_fsb = last_block - map_start_fsb; if (count_fsb == 0) { From 
owner-xfs@oss.sgi.com Wed Apr 18 10:59:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 10:59:14 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IHx3fB017854 for ; Wed, 18 Apr 2007 10:59:05 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3IHx0LD018425 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 18 Apr 2007 19:59:00 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3IHx0WA018423 for xfs@oss.sgi.com; Wed, 18 Apr 2007 19:59:00 +0200 Date: Wed, 18 Apr 2007 19:59:00 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11109 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs Remove all the macros that just give inline functions uppercase names. 
Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_dir2.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2.c 2007-04-13 14:02:24.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2.c 2007-04-13 14:07:18.000000000 +0200 @@ -55,9 +55,9 @@ xfs_dir_mount( XFS_MAX_BLOCKSIZE); mp->m_dirblksize = 1 << (mp->m_sb.sb_blocklog + mp->m_sb.sb_dirblklog); mp->m_dirblkfsbs = 1 << mp->m_sb.sb_dirblklog; - mp->m_dirdatablk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_DATA_FIRSTDB(mp)); - mp->m_dirleafblk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_LEAF_FIRSTDB(mp)); - mp->m_dirfreeblk = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_FREE_FIRSTDB(mp)); + mp->m_dirdatablk = xfs_dir2_db_to_da(mp, XFS_DIR2_DATA_FIRSTDB(mp)); + mp->m_dirleafblk = xfs_dir2_db_to_da(mp, XFS_DIR2_LEAF_FIRSTDB(mp)); + mp->m_dirfreeblk = xfs_dir2_db_to_da(mp, XFS_DIR2_FREE_FIRSTDB(mp)); mp->m_attr_node_ents = (mp->m_sb.sb_blocksize - (uint)sizeof(xfs_da_node_hdr_t)) / (uint)sizeof(xfs_da_node_entry_t); @@ -554,7 +554,7 @@ xfs_dir2_grow_inode( */ if (mapp != &map) kmem_free(mapp, sizeof(*mapp) * count); - *dbp = XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)bno); + *dbp = xfs_dir2_da_to_db(mp, (xfs_dablk_t)bno); /* * Update file's size if this is the data space and it grew. */ @@ -706,7 +706,7 @@ xfs_dir2_shrink_inode( dp = args->dp; mp = dp->i_mount; tp = args->trans; - da = XFS_DIR2_DB_TO_DA(mp, db); + da = xfs_dir2_db_to_da(mp, db); /* * Unmap the fsblock(s). */ @@ -742,7 +742,7 @@ xfs_dir2_shrink_inode( /* * If the block isn't the last one in the directory, we're done. 
*/ - if (dp->i_d.di_size > XFS_DIR2_DB_OFF_TO_BYTE(mp, db + 1, 0)) + if (dp->i_d.di_size > xfs_dir2_db_off_to_byte(mp, db + 1, 0)) return 0; bno = da; if ((error = xfs_bmap_last_before(tp, dp, &bno, XFS_DATA_FORK))) { Index: linux-2.6/fs/xfs/xfs_dir2_block.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_block.c 2007-04-13 13:47:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_block.c 2007-04-13 14:08:20.000000000 +0200 @@ -115,13 +115,13 @@ xfs_dir2_block_addname( xfs_da_brelse(tp, bp); return XFS_ERROR(EFSCORRUPTED); } - len = XFS_DIR2_DATA_ENTSIZE(args->namelen); + len = xfs_dir2_data_entsize(args->namelen); /* * Set up pointers to parts of the block. */ bf = block->hdr.bestfree; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * No stale entries? Need space for entry and new leaf. */ @@ -397,7 +397,7 @@ xfs_dir2_block_addname( * Fill in the leaf entry. */ blp[mid].hashval = cpu_to_be32(args->hashval); - blp[mid].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[mid].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); xfs_dir2_block_log_leaf(tp, bp, lfloglow, lfloghigh); /* @@ -412,7 +412,7 @@ xfs_dir2_block_addname( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, args->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); /* * Clean up the bestfree array and log the header, tail, and entry. @@ -457,7 +457,7 @@ xfs_dir2_block_getdents( /* * If the block number in the offset is out of range, we're done. 
*/ - if (XFS_DIR2_DATAPTR_TO_DB(mp, uio->uio_offset) > mp->m_dirdatablk) { + if (xfs_dir2_dataptr_to_db(mp, uio->uio_offset) > mp->m_dirdatablk) { *eofp = 1; return 0; } @@ -473,15 +473,15 @@ xfs_dir2_block_getdents( * Extract the byte offset we start at from the seek pointer. * We'll skip entries before this. */ - wantoff = XFS_DIR2_DATAPTR_TO_OFF(mp, uio->uio_offset); + wantoff = xfs_dir2_dataptr_to_off(mp, uio->uio_offset); block = bp->data; xfs_dir2_data_check(dp, bp); /* * Set up values for the loop. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); ptr = (char *)block->u; - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + endptr = (char *)xfs_dir2_block_leaf_p(btp); p.dbp = dbp; p.put = put; p.uio = uio; @@ -504,7 +504,7 @@ xfs_dir2_block_getdents( /* * Bump pointer for the next iteration. */ - ptr += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + ptr += xfs_dir2_data_entsize(dep->namelen); /* * The entry is before the desired starting point, skip it. */ @@ -515,7 +515,7 @@ xfs_dir2_block_getdents( */ p.namelen = dep->namelen; - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, ptr - (char *)block); p.ino = be64_to_cpu(dep->inumber); #if XFS_BIG_INUMS @@ -533,7 +533,7 @@ xfs_dir2_block_getdents( */ if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (char *)dep - (char *)block); xfs_da_brelse(tp, bp); return error; @@ -547,7 +547,7 @@ xfs_dir2_block_getdents( *eofp = 1; uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk + 1, 0); + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk + 1, 0); xfs_da_brelse(tp, bp); @@ -571,8 +571,8 @@ xfs_dir2_block_log_leaf( mp = tp->t_mountp; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); xfs_da_log_buf(tp, bp, 
(uint)((char *)&blp[first] - (char *)block), (uint)((char *)&blp[last + 1] - (char *)block - 1)); } @@ -591,7 +591,7 @@ xfs_dir2_block_log_tail( mp = tp->t_mountp; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); xfs_da_log_buf(tp, bp, (uint)((char *)btp - (char *)block), (uint)((char *)(btp + 1) - (char *)block - 1)); } @@ -625,13 +625,13 @@ xfs_dir2_block_lookup( mp = dp->i_mount; block = bp->data; xfs_dir2_data_check(dp, bp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Get the offset from the leaf entry, to point to the data. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); /* * Fill in inode number, release the block. */ @@ -677,8 +677,8 @@ xfs_dir2_block_lookup_int( ASSERT(bp != NULL); block = bp->data; xfs_dir2_data_check(dp, bp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Loop doing a binary search for our hash value. * Find our entry, ENOENT if it's not there. @@ -715,7 +715,7 @@ xfs_dir2_block_lookup_int( * Get pointer to the entry from the leaf. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, addr)); + ((char *)block + xfs_dir2_dataptr_to_off(mp, addr)); /* * Compare, if it's right give back buffer & entry number. */ @@ -770,20 +770,20 @@ xfs_dir2_block_removename( tp = args->trans; mp = dp->i_mount; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Point to the data entry using the leaf entry. 
*/ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); /* * Mark the data entry's space free. */ needlog = needscan = 0; xfs_dir2_data_make_free(tp, bp, (xfs_dir2_data_aoff_t)((char *)dep - (char *)block), - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * Fix up the block tail. */ @@ -846,13 +846,13 @@ xfs_dir2_block_replace( dp = args->dp; mp = dp->i_mount; block = bp->data; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Point to the data entry we need to change. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(blp[ent].address))); + ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address))); ASSERT(be64_to_cpu(dep->inumber) != args->inumber); /* * Change the inode number to the new value. @@ -915,7 +915,7 @@ xfs_dir2_leaf_to_block( mp = dp->i_mount; leaf = lbp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * If there are data blocks other than the first one, take this * opportunity to remove trailing empty data blocks that may have @@ -923,7 +923,7 @@ xfs_dir2_leaf_to_block( * These will show up in the leaf bests table. */ while (dp->i_d.di_size > mp->m_dirblksize) { - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); if (be16_to_cpu(bestsp[be32_to_cpu(ltp->bestcount) - 1]) == mp->m_dirblksize - (uint)sizeof(block->hdr)) { if ((error = @@ -977,14 +977,14 @@ xfs_dir2_leaf_to_block( /* * Initialize the block tail. 
*/ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); btp->count = cpu_to_be32(be16_to_cpu(leaf->hdr.count) - be16_to_cpu(leaf->hdr.stale)); btp->stale = 0; xfs_dir2_block_log_tail(tp, dbp); /* * Initialize the block leaf area. We compact out stale entries. */ - lep = XFS_DIR2_BLOCK_LEAF_P(btp); + lep = xfs_dir2_block_leaf_p(btp); for (from = to = 0; from < be16_to_cpu(leaf->hdr.count); from++) { if (be32_to_cpu(leaf->ents[from].address) == XFS_DIR2_NULL_DATAPTR) continue; @@ -1071,7 +1071,7 @@ xfs_dir2_sf_to_block( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Copy the directory into the stack buffer. * Then pitch the incore inode data so we can make extents. @@ -1123,10 +1123,10 @@ xfs_dir2_sf_to_block( /* * Fill in the tail. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); btp->count = cpu_to_be32(sfp->hdr.count + 2); /* ., .. */ btp->stale = 0; - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + blp = xfs_dir2_block_leaf_p(btp); endoffset = (uint)((char *)blp - (char *)block); /* * Remove the freespace, we'll manage it. @@ -1142,25 +1142,25 @@ xfs_dir2_sf_to_block( dep->inumber = cpu_to_be64(dp->i_ino); dep->namelen = 1; dep->name[0] = '.'; - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[0].hashval = cpu_to_be32(xfs_dir_hash_dot); - blp[0].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[0].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); /* * Create entry for .. 
*/ dep = (xfs_dir2_data_entry_t *) ((char *)block + XFS_DIR2_DATA_DOTDOT_OFFSET); - dep->inumber = cpu_to_be64(XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent)); + dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent)); dep->namelen = 2; dep->name[0] = dep->name[1] = '.'; - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[1].hashval = cpu_to_be32(xfs_dir_hash_dotdot); - blp[1].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[1].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); offset = XFS_DIR2_DATA_FIRST_OFFSET; /* @@ -1169,7 +1169,7 @@ xfs_dir2_sf_to_block( if ((i = 0) == sfp->hdr.count) sfep = NULL; else - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + sfep = xfs_dir2_sf_firstentry(sfp); /* * Need to preserve the existing offset values in the sf directory. * Insert holes (unused entries) where necessary. @@ -1181,7 +1181,7 @@ xfs_dir2_sf_to_block( if (sfep == NULL) newoffset = endoffset; else - newoffset = XFS_DIR2_SF_GET_OFFSET(sfep); + newoffset = xfs_dir2_sf_get_offset(sfep); /* * There should be a hole here, make one. */ @@ -1190,7 +1190,7 @@ xfs_dir2_sf_to_block( ((char *)block + offset); dup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); dup->length = cpu_to_be16(newoffset - offset); - *XFS_DIR2_DATA_UNUSED_TAG_P(dup) = cpu_to_be16( + *xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16( ((char *)dup - (char *)block)); xfs_dir2_data_log_unused(tp, bp, dup); (void)xfs_dir2_data_freeinsert((xfs_dir2_data_t *)block, @@ -1202,22 +1202,22 @@ xfs_dir2_sf_to_block( * Copy a real entry. 
*/ dep = (xfs_dir2_data_entry_t *)((char *)block + newoffset); - dep->inumber = cpu_to_be64(XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep))); + dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep))); dep->namelen = sfep->namelen; memcpy(dep->name, sfep->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)block); xfs_dir2_data_log_entry(tp, bp, dep); blp[2 + i].hashval = cpu_to_be32(xfs_da_hashname( (char *)sfep->name, sfep->namelen)); - blp[2 + i].address = cpu_to_be32(XFS_DIR2_BYTE_TO_DATAPTR(mp, + blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)block)); offset = (int)((char *)(tagp + 1) - (char *)block); if (++i == sfp->hdr.count) sfep = NULL; else - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } /* Done with the temporary buffer */ kmem_free(buf, buf_len); Index: linux-2.6/fs/xfs/xfs_dir2_block.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_block.h 2007-04-13 13:48:21.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_block.h 2007-04-13 13:48:29.000000000 +0200 @@ -60,7 +60,6 @@ typedef struct xfs_dir2_block { /* * Pointer to the leaf header embedded in a data block (1-block format) */ -#define XFS_DIR2_BLOCK_TAIL_P(mp,block) xfs_dir2_block_tail_p(mp,block) static inline xfs_dir2_block_tail_t * xfs_dir2_block_tail_p(struct xfs_mount *mp, xfs_dir2_block_t *block) { @@ -71,7 +70,6 @@ xfs_dir2_block_tail_p(struct xfs_mount * /* * Pointer to the leaf entries embedded in a data block (1-block format) */ -#define XFS_DIR2_BLOCK_LEAF_P(btp) xfs_dir2_block_leaf_p(btp) static inline struct xfs_dir2_leaf_entry * xfs_dir2_block_leaf_p(xfs_dir2_block_tail_t *btp) { Index: linux-2.6/fs/xfs/xfs_dir2_data.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_data.c 
2007-04-13 13:47:12.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_data.c 2007-04-13 14:08:11.000000000 +0200 @@ -72,8 +72,8 @@ xfs_dir2_data_check( bf = d->hdr.bestfree; p = (char *)d->u; if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - lep = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + lep = xfs_dir2_block_leaf_p(btp); endp = (char *)lep; } else endp = (char *)d + mp->m_dirblksize; @@ -107,7 +107,7 @@ xfs_dir2_data_check( */ if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { ASSERT(lastfree == 0); - ASSERT(be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup)) == + ASSERT(be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)) == (char *)dup - (char *)d); dfp = xfs_dir2_data_freefind(d, dup); if (dfp) { @@ -131,12 +131,12 @@ xfs_dir2_data_check( dep = (xfs_dir2_data_entry_t *)p; ASSERT(dep->namelen != 0); ASSERT(xfs_dir_ino_validate(mp, be64_to_cpu(dep->inumber)) == 0); - ASSERT(be16_to_cpu(*XFS_DIR2_DATA_ENTRY_TAG_P(dep)) == + ASSERT(be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep)) == (char *)dep - (char *)d); count++; lastfree = 0; if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - addr = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)d)); hash = xfs_da_hashname((char *)dep->name, dep->namelen); @@ -147,7 +147,7 @@ xfs_dir2_data_check( } ASSERT(i < be32_to_cpu(btp->count)); } - p += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + p += xfs_dir2_data_entsize(dep->namelen); } /* * Need to have seen all the entries and all the bestfree slots. 
@@ -349,8 +349,8 @@ xfs_dir2_data_freescan( if (aendp) endp = aendp; else if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) { - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - endp = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + endp = (char *)xfs_dir2_block_leaf_p(btp); } else endp = (char *)d + mp->m_dirblksize; /* @@ -363,7 +363,7 @@ xfs_dir2_data_freescan( */ if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { ASSERT((char *)dup - (char *)d == - be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup))); + be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup))); xfs_dir2_data_freeinsert(d, dup, loghead); p += be16_to_cpu(dup->length); } @@ -373,8 +373,8 @@ xfs_dir2_data_freescan( else { dep = (xfs_dir2_data_entry_t *)p; ASSERT((char *)dep - (char *)d == - be16_to_cpu(*XFS_DIR2_DATA_ENTRY_TAG_P(dep))); - p += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep))); + p += xfs_dir2_data_entsize(dep->namelen); } } } @@ -405,7 +405,7 @@ xfs_dir2_data_init( /* * Get the buffer set up for the block. */ - error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, blkno), -1, &bp, + error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, blkno), -1, &bp, XFS_DATA_FORK); if (error) { return error; @@ -430,7 +430,7 @@ xfs_dir2_data_init( t=mp->m_dirblksize - (uint)sizeof(d->hdr); d->hdr.bestfree[0].length = cpu_to_be16(t); dup->length = cpu_to_be16(t); - *XFS_DIR2_DATA_UNUSED_TAG_P(dup) = cpu_to_be16((char *)dup - (char *)d); + *xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16((char *)dup - (char *)d); /* * Log it and return it. 
*/ @@ -455,7 +455,7 @@ xfs_dir2_data_log_entry( ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC || be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC); xfs_da_log_buf(tp, bp, (uint)((char *)dep - (char *)d), - (uint)((char *)(XFS_DIR2_DATA_ENTRY_TAG_P(dep) + 1) - + (uint)((char *)(xfs_dir2_data_entry_tag_p(dep) + 1) - (char *)d - 1)); } @@ -500,8 +500,8 @@ xfs_dir2_data_log_unused( * Log the end (tag) of the unused entry. */ xfs_da_log_buf(tp, bp, - (uint)((char *)XFS_DIR2_DATA_UNUSED_TAG_P(dup) - (char *)d), - (uint)((char *)XFS_DIR2_DATA_UNUSED_TAG_P(dup) - (char *)d + + (uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d), + (uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d + sizeof(xfs_dir2_data_off_t) - 1)); } @@ -538,8 +538,8 @@ xfs_dir2_data_make_free( xfs_dir2_block_tail_t *btp; /* block tail */ ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, (xfs_dir2_block_t *)d); - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d); + endptr = (char *)xfs_dir2_block_leaf_p(btp); } /* * If this isn't the start of the block, then back up to @@ -590,7 +590,7 @@ xfs_dir2_data_make_free( * Fix up the new big freespace. 
*/ be16_add(&prevdup->length, len + be16_to_cpu(postdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(prevdup) = + *xfs_dir2_data_unused_tag_p(prevdup) = cpu_to_be16((char *)prevdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, prevdup); if (!needscan) { @@ -624,7 +624,7 @@ xfs_dir2_data_make_free( else if (prevdup) { dfp = xfs_dir2_data_freefind(d, prevdup); be16_add(&prevdup->length, len); - *XFS_DIR2_DATA_UNUSED_TAG_P(prevdup) = + *xfs_dir2_data_unused_tag_p(prevdup) = cpu_to_be16((char *)prevdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, prevdup); /* @@ -652,7 +652,7 @@ xfs_dir2_data_make_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(len + be16_to_cpu(postdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -679,7 +679,7 @@ xfs_dir2_data_make_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(len); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); (void)xfs_dir2_data_freeinsert(d, newdup, needlogp); @@ -715,7 +715,7 @@ xfs_dir2_data_use_free( ASSERT(be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG); ASSERT(offset >= (char *)dup - (char *)d); ASSERT(offset + len <= (char *)dup + be16_to_cpu(dup->length) - (char *)d); - ASSERT((char *)dup - (char *)d == be16_to_cpu(*XFS_DIR2_DATA_UNUSED_TAG_P(dup))); + ASSERT((char *)dup - (char *)d == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup))); /* * Look up the entry in the bestfree table. 
*/ @@ -748,7 +748,7 @@ xfs_dir2_data_use_free( newdup = (xfs_dir2_data_unused_t *)((char *)d + offset + len); newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup->length = cpu_to_be16(oldlen - len); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -775,7 +775,7 @@ xfs_dir2_data_use_free( else if (matchback) { newdup = dup; newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); /* @@ -802,13 +802,13 @@ xfs_dir2_data_use_free( else { newdup = dup; newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup) = + *xfs_dir2_data_unused_tag_p(newdup) = cpu_to_be16((char *)newdup - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup); newdup2 = (xfs_dir2_data_unused_t *)((char *)d + offset + len); newdup2->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG); newdup2->length = cpu_to_be16(oldlen - len - be16_to_cpu(newdup->length)); - *XFS_DIR2_DATA_UNUSED_TAG_P(newdup2) = + *xfs_dir2_data_unused_tag_p(newdup2) = cpu_to_be16((char *)newdup2 - (char *)d); xfs_dir2_data_log_unused(tp, bp, newdup2); /* Index: linux-2.6/fs/xfs/xfs_dir2_data.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_data.h 2007-04-13 13:50:00.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_data.h 2007-04-13 14:04:36.000000000 +0200 @@ -44,7 +44,7 @@ struct xfs_trans; #define XFS_DIR2_DATA_SPACE 0 #define XFS_DIR2_DATA_OFFSET (XFS_DIR2_DATA_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_DATA_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_DATA_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_DATA_OFFSET) /* * Offsets of . and .. 
in data space (always block 0) @@ -52,9 +52,9 @@ struct xfs_trans; #define XFS_DIR2_DATA_DOT_OFFSET \ ((xfs_dir2_data_aoff_t)sizeof(xfs_dir2_data_hdr_t)) #define XFS_DIR2_DATA_DOTDOT_OFFSET \ - (XFS_DIR2_DATA_DOT_OFFSET + XFS_DIR2_DATA_ENTSIZE(1)) + (XFS_DIR2_DATA_DOT_OFFSET + xfs_dir2_data_entsize(1)) #define XFS_DIR2_DATA_FIRST_OFFSET \ - (XFS_DIR2_DATA_DOTDOT_OFFSET + XFS_DIR2_DATA_ENTSIZE(2)) + (XFS_DIR2_DATA_DOTDOT_OFFSET + xfs_dir2_data_entsize(2)) /* * Structures. @@ -123,7 +123,6 @@ typedef struct xfs_dir2_data { /* * Size of a data entry. */ -#define XFS_DIR2_DATA_ENTSIZE(n) xfs_dir2_data_entsize(n) static inline int xfs_dir2_data_entsize(int n) { return (int)roundup(offsetof(xfs_dir2_data_entry_t, name[0]) + (n) + \ @@ -133,19 +132,16 @@ static inline int xfs_dir2_data_entsize( /* * Pointer to an entry's tag word. */ -#define XFS_DIR2_DATA_ENTRY_TAG_P(dep) xfs_dir2_data_entry_tag_p(dep) static inline __be16 * xfs_dir2_data_entry_tag_p(xfs_dir2_data_entry_t *dep) { return (__be16 *)((char *)dep + - XFS_DIR2_DATA_ENTSIZE(dep->namelen) - sizeof(__be16)); + xfs_dir2_data_entsize(dep->namelen) - sizeof(__be16)); } /* * Pointer to a freespace's tag word. */ -#define XFS_DIR2_DATA_UNUSED_TAG_P(dup) \ - xfs_dir2_data_unused_tag_p(dup) static inline __be16 * xfs_dir2_data_unused_tag_p(xfs_dir2_data_unused_t *dup) { Index: linux-2.6/fs/xfs/xfs_dir2_leaf.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_leaf.c 2007-04-13 13:47:18.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_leaf.c 2007-04-13 14:08:13.000000000 +0200 @@ -92,7 +92,7 @@ xfs_dir2_block_to_leaf( if ((error = xfs_da_grow_inode(args, &blkno))) { return error; } - ldb = XFS_DIR2_DA_TO_DB(mp, blkno); + ldb = xfs_dir2_da_to_db(mp, blkno); ASSERT(ldb == XFS_DIR2_LEAF_FIRSTDB(mp)); /* * Initialize the leaf block, get a buffer for it. 
@@ -104,8 +104,8 @@ xfs_dir2_block_to_leaf( leaf = lbp->data; block = dbp->data; xfs_dir2_data_check(dp, dbp); - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Set the counts in the leaf header. */ @@ -138,9 +138,9 @@ xfs_dir2_block_to_leaf( /* * Set up leaf tail and bests table. */ - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = cpu_to_be32(1); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); bestsp[0] = block->hdr.bestfree[0].length; /* * Log the data header and leaf bests table. @@ -210,9 +210,9 @@ xfs_dir2_leaf_addname( */ index = xfs_dir2_leaf_search_hash(args, lbp); leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); + bestsp = xfs_dir2_leaf_bests_p(ltp); + length = xfs_dir2_data_entsize(args->namelen); /* * See if there are any entries with the same hash value * and space in their block for the new entry. 
@@ -224,7 +224,7 @@ xfs_dir2_leaf_addname( index++, lep++) { if (be32_to_cpu(lep->address) == XFS_DIR2_NULL_DATAPTR) continue; - i = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + i = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); ASSERT(i < be32_to_cpu(ltp->bestcount)); ASSERT(be16_to_cpu(bestsp[i]) != NULLDATAOFF); if (be16_to_cpu(bestsp[i]) >= length) { @@ -379,7 +379,7 @@ xfs_dir2_leaf_addname( */ else { if ((error = - xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, use_block), + xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, use_block), -1, &dbp, XFS_DATA_FORK))) { xfs_da_brelse(tp, lbp); return error; @@ -408,7 +408,7 @@ xfs_dir2_leaf_addname( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)data); /* * Need to scan fix up the bestfree table. @@ -530,7 +530,7 @@ xfs_dir2_leaf_addname( * Fill in the new leaf entry. */ lep->hashval = cpu_to_be32(args->hashval); - lep->address = cpu_to_be32(XFS_DIR2_DB_OFF_TO_DATAPTR(mp, use_block, + lep->address = cpu_to_be32(xfs_dir2_db_off_to_dataptr(mp, use_block, be16_to_cpu(*tagp))); /* * Log the leaf fields and give up the buffers. @@ -568,13 +568,13 @@ xfs_dir2_leaf_check( * Should factor in the size of the bests table as well. * We can deduce a value for that from di_size. */ - ASSERT(be16_to_cpu(leaf->hdr.count) <= XFS_DIR2_MAX_LEAF_ENTS(mp)); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ASSERT(be16_to_cpu(leaf->hdr.count) <= xfs_dir2_max_leaf_ents(mp)); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * Leaves and bests don't overlap. */ ASSERT((char *)&leaf->ents[be16_to_cpu(leaf->hdr.count)] <= - (char *)XFS_DIR2_LEAF_BESTS_P(ltp)); + (char *)xfs_dir2_leaf_bests_p(ltp)); /* * Check hash value order, count stale entries. 
*/ @@ -816,12 +816,12 @@ xfs_dir2_leaf_getdents( * Inside the loop we keep the main offset value as a byte offset * in the directory file. */ - curoff = XFS_DIR2_DATAPTR_TO_BYTE(mp, uio->uio_offset); + curoff = xfs_dir2_dataptr_to_byte(mp, uio->uio_offset); /* * Force this conversion through db so we truncate the offset * down to get the start of the data block. */ - map_off = XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_BYTE_TO_DB(mp, curoff)); + map_off = xfs_dir2_db_to_da(mp, xfs_dir2_byte_to_db(mp, curoff)); /* * Loop over directory entries until we reach the end offset. * Get more blocks and readahead as necessary. @@ -871,7 +871,7 @@ xfs_dir2_leaf_getdents( */ if (1 + ra_want > map_blocks && map_off < - XFS_DIR2_BYTE_TO_DA(mp, XFS_DIR2_LEAF_OFFSET)) { + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET)) { /* * Get more bmaps, fill in after the ones * we already have in the table. @@ -879,7 +879,7 @@ xfs_dir2_leaf_getdents( nmap = map_size - map_valid; error = xfs_bmapi(tp, dp, map_off, - XFS_DIR2_BYTE_TO_DA(mp, + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET) - map_off, XFS_BMAPI_METADATA, NULL, 0, &map[map_valid], &nmap, NULL, NULL); @@ -904,7 +904,7 @@ xfs_dir2_leaf_getdents( map[map_valid + nmap - 1].br_blockcount; else map_off = - XFS_DIR2_BYTE_TO_DA(mp, + xfs_dir2_byte_to_da(mp, XFS_DIR2_LEAF_OFFSET); /* * Look for holes in the mapping, and @@ -932,14 +932,14 @@ xfs_dir2_leaf_getdents( * No valid mappings, so no more data blocks. */ if (!map_valid) { - curoff = XFS_DIR2_DA_TO_BYTE(mp, map_off); + curoff = xfs_dir2_da_to_byte(mp, map_off); break; } /* * Read the directory block starting at the first * mapping. */ - curdb = XFS_DIR2_DA_TO_DB(mp, map->br_startoff); + curdb = xfs_dir2_da_to_db(mp, map->br_startoff); error = xfs_da_read_buf(tp, dp, map->br_startoff, map->br_blockcount >= mp->m_dirblkfsbs ? XFS_FSB_TO_DADDR(mp, map->br_startblock) : @@ -1015,7 +1015,7 @@ xfs_dir2_leaf_getdents( /* * Having done a read, we need to set a new offset. 
*/ - newoff = XFS_DIR2_DB_OFF_TO_BYTE(mp, curdb, 0); + newoff = xfs_dir2_db_off_to_byte(mp, curdb, 0); /* * Start of the current block. */ @@ -1025,7 +1025,7 @@ xfs_dir2_leaf_getdents( * Make sure we're in the right block. */ else if (curoff > newoff) - ASSERT(XFS_DIR2_BYTE_TO_DB(mp, curoff) == + ASSERT(xfs_dir2_byte_to_db(mp, curoff) == curdb); data = bp->data; xfs_dir2_data_check(dp, bp); @@ -1033,7 +1033,7 @@ xfs_dir2_leaf_getdents( * Find our position in the block. */ ptr = (char *)&data->u; - byteoff = XFS_DIR2_BYTE_TO_OFF(mp, curoff); + byteoff = xfs_dir2_byte_to_off(mp, curoff); /* * Skip past the header. */ @@ -1055,15 +1055,15 @@ xfs_dir2_leaf_getdents( } dep = (xfs_dir2_data_entry_t *)ptr; length = - XFS_DIR2_DATA_ENTSIZE(dep->namelen); + xfs_dir2_data_entsize(dep->namelen); ptr += length; } /* * Now set our real offset. */ curoff = - XFS_DIR2_DB_OFF_TO_BYTE(mp, - XFS_DIR2_BYTE_TO_DB(mp, curoff), + xfs_dir2_db_off_to_byte(mp, + xfs_dir2_byte_to_db(mp, curoff), (char *)ptr - (char *)data); if (ptr >= (char *)data + mp->m_dirblksize) { continue; @@ -1092,9 +1092,9 @@ xfs_dir2_leaf_getdents( p->namelen = dep->namelen; - length = XFS_DIR2_DATA_ENTSIZE(p->namelen); + length = xfs_dir2_data_entsize(p->namelen); - p->cook = XFS_DIR2_BYTE_TO_DATAPTR(mp, curoff + length); + p->cook = xfs_dir2_byte_to_dataptr(mp, curoff + length); p->ino = be64_to_cpu(dep->inumber); #if XFS_BIG_INUMS @@ -1122,10 +1122,10 @@ xfs_dir2_leaf_getdents( * All done. Set output offset value to current offset. */ *eofp = eof; - if (curoff > XFS_DIR2_DATAPTR_TO_BYTE(mp, XFS_DIR2_MAX_DATAPTR)) + if (curoff > xfs_dir2_dataptr_to_byte(mp, XFS_DIR2_MAX_DATAPTR)) uio->uio_offset = XFS_DIR2_MAX_DATAPTR; else - uio->uio_offset = XFS_DIR2_BYTE_TO_DATAPTR(mp, curoff); + uio->uio_offset = xfs_dir2_byte_to_dataptr(mp, curoff); kmem_free(map, map_size * sizeof(*map)); kmem_free(p, sizeof(*p)); if (bp) @@ -1160,7 +1160,7 @@ xfs_dir2_leaf_init( /* * Get the buffer for the block. 
*/ - error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, bno), -1, &bp, + error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, bno), -1, &bp, XFS_DATA_FORK); if (error) { return error; @@ -1182,7 +1182,7 @@ xfs_dir2_leaf_init( * the block. */ if (magic == XFS_DIR2_LEAF1_MAGIC) { - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = 0; xfs_dir2_leaf_log_tail(tp, bp); } @@ -1207,9 +1207,9 @@ xfs_dir2_leaf_log_bests( leaf = bp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(tp->t_mountp, leaf); - firstb = XFS_DIR2_LEAF_BESTS_P(ltp) + first; - lastb = XFS_DIR2_LEAF_BESTS_P(ltp) + last; + ltp = xfs_dir2_leaf_tail_p(tp->t_mountp, leaf); + firstb = xfs_dir2_leaf_bests_p(ltp) + first; + lastb = xfs_dir2_leaf_bests_p(ltp) + last; xfs_da_log_buf(tp, bp, (uint)((char *)firstb - (char *)leaf), (uint)((char *)lastb - (char *)leaf + sizeof(*lastb) - 1)); } @@ -1269,7 +1269,7 @@ xfs_dir2_leaf_log_tail( mp = tp->t_mountp; leaf = bp->data; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAF1_MAGIC); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); xfs_da_log_buf(tp, bp, (uint)((char *)ltp - (char *)leaf), (uint)(mp->m_dirblksize - 1)); } @@ -1313,7 +1313,7 @@ xfs_dir2_leaf_lookup( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(dp->i_mount, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(dp->i_mount, be32_to_cpu(lep->address))); /* * Return the found inode number. */ @@ -1382,7 +1382,7 @@ xfs_dir2_leaf_lookup_int( /* * Get the new data block number. */ - newdb = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + newdb = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); /* * If it's not the same as the old data block number, * need to pitch the old one and read the new one. 
@@ -1392,7 +1392,7 @@ xfs_dir2_leaf_lookup_int( xfs_da_brelse(tp, dbp); if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, newdb), -1, &dbp, + xfs_dir2_db_to_da(mp, newdb), -1, &dbp, XFS_DATA_FORK))) { xfs_da_brelse(tp, lbp); return error; @@ -1405,7 +1405,7 @@ xfs_dir2_leaf_lookup_int( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* * If it matches then return it. */ @@ -1470,20 +1470,20 @@ xfs_dir2_leaf_removename( * Point to the leaf entry, use that to point to the data entry. */ lep = &leaf->ents[index]; - db = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + db = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); dep = (xfs_dir2_data_entry_t *) - ((char *)data + XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + ((char *)data + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); needscan = needlog = 0; oldbest = be16_to_cpu(data->hdr.bestfree[0].length); - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); + bestsp = xfs_dir2_leaf_bests_p(ltp); ASSERT(be16_to_cpu(bestsp[db]) == oldbest); /* * Mark the former data entry unused. */ xfs_dir2_data_make_free(tp, dbp, (xfs_dir2_data_aoff_t)((char *)dep - (char *)data), - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * We just mark the leaf entry stale by putting a null in it. */ @@ -1603,7 +1603,7 @@ xfs_dir2_leaf_replace( */ dep = (xfs_dir2_data_entry_t *) ((char *)dbp->data + - XFS_DIR2_DATAPTR_TO_OFF(dp->i_mount, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(dp->i_mount, be32_to_cpu(lep->address))); ASSERT(args->inumber != be64_to_cpu(dep->inumber)); /* * Put the new inode number in, log it. @@ -1699,7 +1699,7 @@ xfs_dir2_leaf_trim_data( /* * Read the offending data block. We need its buffer. 
*/ - if ((error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, db), -1, &dbp, + if ((error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, db), -1, &dbp, XFS_DATA_FORK))) { return error; } @@ -1713,7 +1713,7 @@ xfs_dir2_leaf_trim_data( */ leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ASSERT(be16_to_cpu(data->hdr.bestfree[0].length) == mp->m_dirblksize - (uint)sizeof(data->hdr)); ASSERT(db == be32_to_cpu(ltp->bestcount) - 1); @@ -1728,7 +1728,7 @@ xfs_dir2_leaf_trim_data( /* * Eliminate the last bests entry from the table. */ - bestsp = XFS_DIR2_LEAF_BESTS_P(ltp); + bestsp = xfs_dir2_leaf_bests_p(ltp); be32_add(&ltp->bestcount, -1); memmove(&bestsp[1], &bestsp[0], be32_to_cpu(ltp->bestcount) * sizeof(*bestsp)); xfs_dir2_leaf_log_tail(tp, lbp); @@ -1839,12 +1839,12 @@ xfs_dir2_node_to_leaf( /* * Set up the leaf tail from the freespace block. */ - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); ltp->bestcount = free->hdr.nvalid; /* * Set up the leaf bests table. */ - memcpy(XFS_DIR2_LEAF_BESTS_P(ltp), free->bests, + memcpy(xfs_dir2_leaf_bests_p(ltp), free->bests, be32_to_cpu(ltp->bestcount) * sizeof(leaf->bests[0])); xfs_dir2_leaf_log_bests(tp, lbp, 0, be32_to_cpu(ltp->bestcount) - 1); xfs_dir2_leaf_log_tail(tp, lbp); Index: linux-2.6/fs/xfs/xfs_dir2_leaf.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_leaf.h 2007-04-13 13:54:13.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_leaf.h 2007-04-13 14:10:43.000000000 +0200 @@ -32,7 +32,7 @@ struct xfs_trans; #define XFS_DIR2_LEAF_SPACE 1 #define XFS_DIR2_LEAF_OFFSET (XFS_DIR2_LEAF_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_LEAF_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_LEAF_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_LEAF_OFFSET) /* * Offset in data space of a data entry.
@@ -82,7 +82,6 @@ typedef struct xfs_dir2_leaf { * DB blocks here are logical directory block numbers, not filesystem blocks. */ -#define XFS_DIR2_MAX_LEAF_ENTS(mp) xfs_dir2_max_leaf_ents(mp) static inline int xfs_dir2_max_leaf_ents(struct xfs_mount *mp) { return (int)(((mp)->m_dirblksize - (uint)sizeof(xfs_dir2_leaf_hdr_t)) / @@ -92,7 +91,6 @@ static inline int xfs_dir2_max_leaf_ents /* * Get address of the bestcount field in the single-leaf block. */ -#define XFS_DIR2_LEAF_TAIL_P(mp,lp) xfs_dir2_leaf_tail_p(mp, lp) static inline xfs_dir2_leaf_tail_t * xfs_dir2_leaf_tail_p(struct xfs_mount *mp, xfs_dir2_leaf_t *lp) { @@ -104,7 +102,6 @@ xfs_dir2_leaf_tail_p(struct xfs_mount *m /* * Get address of the bests array in the single-leaf block. */ -#define XFS_DIR2_LEAF_BESTS_P(ltp) xfs_dir2_leaf_bests_p(ltp) static inline __be16 * xfs_dir2_leaf_bests_p(xfs_dir2_leaf_tail_t *ltp) { @@ -114,7 +111,6 @@ xfs_dir2_leaf_bests_p(xfs_dir2_leaf_tail /* * Convert dataptr to byte in file space */ -#define XFS_DIR2_DATAPTR_TO_BYTE(mp,dp) xfs_dir2_dataptr_to_byte(mp, dp) static inline xfs_dir2_off_t xfs_dir2_dataptr_to_byte(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { @@ -124,7 +120,6 @@ xfs_dir2_dataptr_to_byte(struct xfs_moun /* * Convert byte in file space to dataptr. It had better be aligned. 
*/ -#define XFS_DIR2_BYTE_TO_DATAPTR(mp,by) xfs_dir2_byte_to_dataptr(mp,by) static inline xfs_dir2_dataptr_t xfs_dir2_byte_to_dataptr(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -134,7 +129,6 @@ xfs_dir2_byte_to_dataptr(struct xfs_moun /* * Convert byte in space to (DB) block */ -#define XFS_DIR2_BYTE_TO_DB(mp,by) xfs_dir2_byte_to_db(mp, by) static inline xfs_dir2_db_t xfs_dir2_byte_to_db(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -145,17 +139,15 @@ xfs_dir2_byte_to_db(struct xfs_mount *mp /* * Convert dataptr to a block number */ -#define XFS_DIR2_DATAPTR_TO_DB(mp,dp) xfs_dir2_dataptr_to_db(mp, dp) static inline xfs_dir2_db_t xfs_dir2_dataptr_to_db(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { - return XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_DATAPTR_TO_BYTE(mp, dp)); + return xfs_dir2_byte_to_db(mp, xfs_dir2_dataptr_to_byte(mp, dp)); } /* * Convert byte in space to offset in a block */ -#define XFS_DIR2_BYTE_TO_OFF(mp,by) xfs_dir2_byte_to_off(mp, by) static inline xfs_dir2_data_aoff_t xfs_dir2_byte_to_off(struct xfs_mount *mp, xfs_dir2_off_t by) { @@ -166,18 +158,15 @@ xfs_dir2_byte_to_off(struct xfs_mount *m /* * Convert dataptr to a byte offset in a block */ -#define XFS_DIR2_DATAPTR_TO_OFF(mp,dp) xfs_dir2_dataptr_to_off(mp, dp) static inline xfs_dir2_data_aoff_t xfs_dir2_dataptr_to_off(struct xfs_mount *mp, xfs_dir2_dataptr_t dp) { - return XFS_DIR2_BYTE_TO_OFF(mp, XFS_DIR2_DATAPTR_TO_BYTE(mp, dp)); + return xfs_dir2_byte_to_off(mp, xfs_dir2_dataptr_to_byte(mp, dp)); } /* * Convert block and offset to byte in space */ -#define XFS_DIR2_DB_OFF_TO_BYTE(mp,db,o) \ - xfs_dir2_db_off_to_byte(mp, db, o) static inline xfs_dir2_off_t xfs_dir2_db_off_to_byte(struct xfs_mount *mp, xfs_dir2_db_t db, xfs_dir2_data_aoff_t o) @@ -189,7 +178,6 @@ xfs_dir2_db_off_to_byte(struct xfs_mount /* * Convert block (DB) to block (dablk) */ -#define XFS_DIR2_DB_TO_DA(mp,db) xfs_dir2_db_to_da(mp, db) static inline xfs_dablk_t xfs_dir2_db_to_da(struct xfs_mount *mp, xfs_dir2_db_t db) { @@ 
-199,29 +187,25 @@ xfs_dir2_db_to_da(struct xfs_mount *mp, /* * Convert byte in space to (DA) block */ -#define XFS_DIR2_BYTE_TO_DA(mp,by) xfs_dir2_byte_to_da(mp, by) static inline xfs_dablk_t xfs_dir2_byte_to_da(struct xfs_mount *mp, xfs_dir2_off_t by) { - return XFS_DIR2_DB_TO_DA(mp, XFS_DIR2_BYTE_TO_DB(mp, by)); + return xfs_dir2_db_to_da(mp, xfs_dir2_byte_to_db(mp, by)); } /* * Convert block and offset to dataptr */ -#define XFS_DIR2_DB_OFF_TO_DATAPTR(mp,db,o) \ - xfs_dir2_db_off_to_dataptr(mp, db, o) static inline xfs_dir2_dataptr_t xfs_dir2_db_off_to_dataptr(struct xfs_mount *mp, xfs_dir2_db_t db, xfs_dir2_data_aoff_t o) { - return XFS_DIR2_BYTE_TO_DATAPTR(mp, XFS_DIR2_DB_OFF_TO_BYTE(mp, db, o)); + return xfs_dir2_byte_to_dataptr(mp, xfs_dir2_db_off_to_byte(mp, db, o)); } /* * Convert block (dablk) to block (DB) */ -#define XFS_DIR2_DA_TO_DB(mp,da) xfs_dir2_da_to_db(mp, da) static inline xfs_dir2_db_t xfs_dir2_da_to_db(struct xfs_mount *mp, xfs_dablk_t da) { @@ -231,11 +215,10 @@ xfs_dir2_da_to_db(struct xfs_mount *mp, /* * Convert block (dablk) to byte offset in space */ -#define XFS_DIR2_DA_TO_BYTE(mp,da) xfs_dir2_da_to_byte(mp, da) static inline xfs_dir2_off_t xfs_dir2_da_to_byte(struct xfs_mount *mp, xfs_dablk_t da) { - return XFS_DIR2_DB_OFF_TO_BYTE(mp, XFS_DIR2_DA_TO_DB(mp, da), 0); + return xfs_dir2_db_off_to_byte(mp, xfs_dir2_da_to_db(mp, da), 0); } /* Index: linux-2.6/fs/xfs/xfs_dir2_node.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_node.c 2007-04-13 13:49:23.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_node.c 2007-04-13 14:08:15.000000000 +0200 @@ -136,14 +136,14 @@ xfs_dir2_leaf_to_node( /* * Get the buffer for the new freespace block. 
*/ - if ((error = xfs_da_get_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, fdb), -1, &fbp, + if ((error = xfs_da_get_buf(tp, dp, xfs_dir2_db_to_da(mp, fdb), -1, &fbp, XFS_DATA_FORK))) { return error; } ASSERT(fbp != NULL); free = fbp->data; leaf = lbp->data; - ltp = XFS_DIR2_LEAF_TAIL_P(mp, leaf); + ltp = xfs_dir2_leaf_tail_p(mp, leaf); /* * Initialize the freespace block header. */ @@ -155,7 +155,7 @@ xfs_dir2_leaf_to_node( * Copy freespace entries from the leaf block to the new block. * Count active entries. */ - for (i = n = 0, from = XFS_DIR2_LEAF_BESTS_P(ltp), to = free->bests; + for (i = n = 0, from = xfs_dir2_leaf_bests_p(ltp), to = free->bests; i < be32_to_cpu(ltp->bestcount); i++, from++, to++) { if ((off = be16_to_cpu(*from)) != NULLDATAOFF) n++; @@ -215,7 +215,7 @@ xfs_dir2_leafn_add( * a compact. */ - if (be16_to_cpu(leaf->hdr.count) == XFS_DIR2_MAX_LEAF_ENTS(mp)) { + if (be16_to_cpu(leaf->hdr.count) == xfs_dir2_max_leaf_ents(mp)) { if (!leaf->hdr.stale) return XFS_ERROR(ENOSPC); compact = be16_to_cpu(leaf->hdr.stale) > 1; @@ -327,7 +327,7 @@ xfs_dir2_leafn_add( * Insert the new entry, log everything. */ lep->hashval = cpu_to_be32(args->hashval); - lep->address = cpu_to_be32(XFS_DIR2_DB_OFF_TO_DATAPTR(mp, + lep->address = cpu_to_be32(xfs_dir2_db_off_to_dataptr(mp, args->blkno, args->index)); xfs_dir2_leaf_log_header(tp, bp); xfs_dir2_leaf_log_ents(tp, bp, lfloglow, lfloghigh); @@ -352,7 +352,7 @@ xfs_dir2_leafn_check( leaf = bp->data; mp = dp->i_mount; ASSERT(be16_to_cpu(leaf->hdr.info.magic) == XFS_DIR2_LEAFN_MAGIC); - ASSERT(be16_to_cpu(leaf->hdr.count) <= XFS_DIR2_MAX_LEAF_ENTS(mp)); + ASSERT(be16_to_cpu(leaf->hdr.count) <= xfs_dir2_max_leaf_ents(mp)); for (i = stale = 0; i < be16_to_cpu(leaf->hdr.count); i++) { if (i + 1 < be16_to_cpu(leaf->hdr.count)) { ASSERT(be32_to_cpu(leaf->ents[i].hashval) <= @@ -440,7 +440,7 @@ xfs_dir2_leafn_lookup_int( if (args->addname) { curfdb = curbp ? 
state->extrablk.blkno : -1; curdb = -1; - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + length = xfs_dir2_data_entsize(args->namelen); if ((free = (curbp ? curbp->data : NULL))) ASSERT(be32_to_cpu(free->hdr.magic) == XFS_DIR2_FREE_MAGIC); } @@ -465,7 +465,7 @@ xfs_dir2_leafn_lookup_int( /* * Pull the data block number from the entry. */ - newdb = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + newdb = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); /* * For addname, we're looking for a place to put the new entry. * We want to use a data block with an entry of equal @@ -482,7 +482,7 @@ xfs_dir2_leafn_lookup_int( * Convert the data block to the free block * holding its freespace information. */ - newfdb = XFS_DIR2_DB_TO_FDB(mp, newdb); + newfdb = xfs_dir2_db_to_fdb(mp, newdb); /* * If it's not the one we have in hand, * read it in. @@ -497,7 +497,7 @@ xfs_dir2_leafn_lookup_int( * Read the free block. */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, + xfs_dir2_db_to_da(mp, newfdb), -1, &curbp, XFS_DATA_FORK))) { @@ -517,7 +517,7 @@ xfs_dir2_leafn_lookup_int( /* * Get the index for our entry. */ - fi = XFS_DIR2_DB_TO_FDINDEX(mp, curdb); + fi = xfs_dir2_db_to_fdindex(mp, curdb); /* * If it has room, return it. */ @@ -561,7 +561,7 @@ xfs_dir2_leafn_lookup_int( */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, newdb), -1, + xfs_dir2_db_to_da(mp, newdb), -1, &curbp, XFS_DATA_FORK))) { return error; } @@ -573,7 +573,7 @@ xfs_dir2_leafn_lookup_int( */ dep = (xfs_dir2_data_entry_t *) ((char *)curbp->data + - XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* * Compare the entry, return it if it matches. */ @@ -876,9 +876,9 @@ xfs_dir2_leafn_remove( /* * Extract the data block and offset from the entry. 
*/ - db = XFS_DIR2_DATAPTR_TO_DB(mp, be32_to_cpu(lep->address)); + db = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address)); ASSERT(dblk->blkno == db); - off = XFS_DIR2_DATAPTR_TO_OFF(mp, be32_to_cpu(lep->address)); + off = xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)); ASSERT(dblk->index == off); /* * Kill the leaf entry by marking it stale. @@ -898,7 +898,7 @@ xfs_dir2_leafn_remove( longest = be16_to_cpu(data->hdr.bestfree[0].length); needlog = needscan = 0; xfs_dir2_data_make_free(tp, dbp, off, - XFS_DIR2_DATA_ENTSIZE(dep->namelen), &needlog, &needscan); + xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan); /* * Rescan the data block freespaces for bestfree. * Log the data block header if needed. @@ -924,8 +924,8 @@ xfs_dir2_leafn_remove( * Convert the data block number to a free block, * read in the free block. */ - fdb = XFS_DIR2_DB_TO_FDB(mp, db); - if ((error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, fdb), + fdb = xfs_dir2_db_to_fdb(mp, db); + if ((error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, fdb), -1, &fbp, XFS_DATA_FORK))) { return error; } @@ -937,7 +937,7 @@ xfs_dir2_leafn_remove( /* * Calculate which entry we need to fix. */ - findex = XFS_DIR2_DB_TO_FDINDEX(mp, db); + findex = xfs_dir2_db_to_fdindex(mp, db); longest = be16_to_cpu(data->hdr.bestfree[0].length); /* * If the data block is now empty we can get rid of it @@ -1073,7 +1073,7 @@ xfs_dir2_leafn_split( /* * Initialize the new leaf block. */ - error = xfs_dir2_leaf_init(args, XFS_DIR2_DA_TO_DB(mp, blkno), + error = xfs_dir2_leaf_init(args, xfs_dir2_da_to_db(mp, blkno), &newblk->bp, XFS_DIR2_LEAFN_MAGIC); if (error) { return error; @@ -1385,7 +1385,7 @@ xfs_dir2_node_addname_int( dp = args->dp; mp = dp->i_mount; tp = args->trans; - length = XFS_DIR2_DATA_ENTSIZE(args->namelen); + length = xfs_dir2_data_entsize(args->namelen); /* * If we came in with a freespace block that means that lookup * found an entry with our hash value. 
This is the freespace @@ -1438,7 +1438,7 @@ xfs_dir2_node_addname_int( if ((error = xfs_bmap_last_offset(tp, dp, &fo, XFS_DATA_FORK))) return error; - lastfbno = XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)fo); + lastfbno = xfs_dir2_da_to_db(mp, (xfs_dablk_t)fo); fbno = ifbno; } /* @@ -1474,7 +1474,7 @@ xfs_dir2_node_addname_int( * to avoid it. */ if ((error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), -2, &fbp, + xfs_dir2_db_to_da(mp, fbno), -2, &fbp, XFS_DATA_FORK))) { return error; } @@ -1550,9 +1550,9 @@ xfs_dir2_node_addname_int( * Get the freespace block corresponding to the data block * that was just allocated. */ - fbno = XFS_DIR2_DB_TO_FDB(mp, dbno); + fbno = xfs_dir2_db_to_fdb(mp, dbno); if (unlikely(error = xfs_da_read_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), -2, &fbp, + xfs_dir2_db_to_da(mp, fbno), -2, &fbp, XFS_DATA_FORK))) { xfs_da_buf_done(dbp); return error; @@ -1567,14 +1567,14 @@ xfs_dir2_node_addname_int( return error; } - if (unlikely(XFS_DIR2_DB_TO_FDB(mp, dbno) != fbno)) { + if (unlikely(xfs_dir2_db_to_fdb(mp, dbno) != fbno)) { cmn_err(CE_ALERT, "xfs_dir2_node_addname_int: dir ino " "%llu needed freesp block %lld for\n" " data block %lld, got %lld\n" " ifbno %llu lastfbno %d\n", (unsigned long long)dp->i_ino, - (long long)XFS_DIR2_DB_TO_FDB(mp, dbno), + (long long)xfs_dir2_db_to_fdb(mp, dbno), (long long)dbno, (long long)fbno, (unsigned long long)ifbno, lastfbno); if (fblk) { @@ -1598,7 +1598,7 @@ xfs_dir2_node_addname_int( * Get a buffer for the new block. */ if ((error = xfs_da_get_buf(tp, dp, - XFS_DIR2_DB_TO_DA(mp, fbno), + xfs_dir2_db_to_da(mp, fbno), -1, &fbp, XFS_DATA_FORK))) { return error; } @@ -1623,7 +1623,7 @@ xfs_dir2_node_addname_int( /* * Set the freespace block index from the data block number. */ - findex = XFS_DIR2_DB_TO_FDINDEX(mp, dbno); + findex = xfs_dir2_db_to_fdindex(mp, dbno); /* * If it's after the end of the current entries in the * freespace block, extend that table. 
@@ -1669,7 +1669,7 @@ xfs_dir2_node_addname_int( * Read the data block in. */ if (unlikely( - error = xfs_da_read_buf(tp, dp, XFS_DIR2_DB_TO_DA(mp, dbno), + error = xfs_da_read_buf(tp, dp, xfs_dir2_db_to_da(mp, dbno), -1, &dbp, XFS_DATA_FORK))) { if ((fblk == NULL || fblk->bp == NULL) && fbp != NULL) xfs_da_buf_done(fbp); @@ -1698,7 +1698,7 @@ xfs_dir2_node_addname_int( dep->inumber = cpu_to_be64(args->inumber); dep->namelen = args->namelen; memcpy(dep->name, args->name, dep->namelen); - tagp = XFS_DIR2_DATA_ENTRY_TAG_P(dep); + tagp = xfs_dir2_data_entry_tag_p(dep); *tagp = cpu_to_be16((char *)dep - (char *)data); xfs_dir2_data_log_entry(tp, dbp, dep); /* @@ -1904,7 +1904,7 @@ xfs_dir2_node_replace( ASSERT(be32_to_cpu(data->hdr.magic) == XFS_DIR2_DATA_MAGIC); dep = (xfs_dir2_data_entry_t *) ((char *)data + - XFS_DIR2_DATAPTR_TO_OFF(state->mp, be32_to_cpu(lep->address))); + xfs_dir2_dataptr_to_off(state->mp, be32_to_cpu(lep->address))); ASSERT(inum != be64_to_cpu(dep->inumber)); /* * Fill in the new inode number and log the entry. @@ -1980,7 +1980,7 @@ xfs_dir2_node_trim_free( * Blow the block away. 
*/ if ((error = - xfs_dir2_shrink_inode(args, XFS_DIR2_DA_TO_DB(mp, (xfs_dablk_t)fo), + xfs_dir2_shrink_inode(args, xfs_dir2_da_to_db(mp, (xfs_dablk_t)fo), bp))) { /* * Can't fail with ENOSPC since that only happens with no Index: linux-2.6/fs/xfs/xfs_dir2_node.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_node.h 2007-04-13 13:55:47.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_node.h 2007-04-13 14:04:32.000000000 +0200 @@ -36,7 +36,7 @@ struct xfs_trans; #define XFS_DIR2_FREE_SPACE 2 #define XFS_DIR2_FREE_OFFSET (XFS_DIR2_FREE_SPACE * XFS_DIR2_SPACE_SIZE) #define XFS_DIR2_FREE_FIRSTDB(mp) \ - XFS_DIR2_BYTE_TO_DB(mp, XFS_DIR2_FREE_OFFSET) + xfs_dir2_byte_to_db(mp, XFS_DIR2_FREE_OFFSET) #define XFS_DIR2_FREE_MAGIC 0x58443246 /* XD2F */ @@ -60,7 +60,6 @@ typedef struct xfs_dir2_free { /* * Convert data space db to the corresponding free db. */ -#define XFS_DIR2_DB_TO_FDB(mp,db) xfs_dir2_db_to_fdb(mp, db) static inline xfs_dir2_db_t xfs_dir2_db_to_fdb(struct xfs_mount *mp, xfs_dir2_db_t db) { @@ -70,7 +69,6 @@ xfs_dir2_db_to_fdb(struct xfs_mount *mp, /* * Convert data space db to the corresponding index in a free db. */ -#define XFS_DIR2_DB_TO_FDINDEX(mp,db) xfs_dir2_db_to_fdindex(mp, db) static inline int xfs_dir2_db_to_fdindex(struct xfs_mount *mp, xfs_dir2_db_t db) { Index: linux-2.6/fs/xfs/xfs_dir2_sf.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_sf.c 2007-04-13 13:47:23.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_sf.c 2007-04-13 14:08:17.000000000 +0200 @@ -89,8 +89,8 @@ xfs_dir2_block_sfsize( mp = dp->i_mount; count = i8count = namelen = 0; - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); - blp = XFS_DIR2_BLOCK_LEAF_P(btp); + btp = xfs_dir2_block_tail_p(mp, block); + blp = xfs_dir2_block_leaf_p(btp); /* * Iterate over the block's data entries by using the leaf pointers. 
@@ -102,7 +102,7 @@ xfs_dir2_block_sfsize( * Calculate the pointer to the entry at hand. */ dep = (xfs_dir2_data_entry_t *) - ((char *)block + XFS_DIR2_DATAPTR_TO_OFF(mp, addr)); + ((char *)block + xfs_dir2_dataptr_to_off(mp, addr)); /* * Detect . and .., so we can special-case them. * . is not included in sf directories. @@ -124,7 +124,7 @@ xfs_dir2_block_sfsize( /* * Calculate the new size, see if we should give up yet. */ - size = XFS_DIR2_SF_HDR_SIZE(i8count) + /* header */ + size = xfs_dir2_sf_hdr_size(i8count) + /* header */ count + /* namelen */ count * (uint)sizeof(xfs_dir2_sf_off_t) + /* offset */ namelen + /* name */ @@ -139,7 +139,7 @@ xfs_dir2_block_sfsize( */ sfhp->count = count; sfhp->i8count = i8count; - XFS_DIR2_SF_PUT_INUMBER((xfs_dir2_sf_t *)sfhp, &parent, &sfhp->parent); + xfs_dir2_sf_put_inumber((xfs_dir2_sf_t *)sfhp, &parent, &sfhp->parent); return size; } @@ -199,15 +199,15 @@ xfs_dir2_block_to_sf( * Copy the header into the newly allocate local space. */ sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - memcpy(sfp, sfhp, XFS_DIR2_SF_HDR_SIZE(sfhp->i8count)); + memcpy(sfp, sfhp, xfs_dir2_sf_hdr_size(sfhp->i8count)); dp->i_d.di_size = size; /* * Set up to loop over the block's entries. */ - btp = XFS_DIR2_BLOCK_TAIL_P(mp, block); + btp = xfs_dir2_block_tail_p(mp, block); ptr = (char *)block->u; - endptr = (char *)XFS_DIR2_BLOCK_LEAF_P(btp); - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + endptr = (char *)xfs_dir2_block_leaf_p(btp); + sfep = xfs_dir2_sf_firstentry(sfp); /* * Loop over the active and unused entries. * Stop when we reach the leaf/tail portion of the block. @@ -233,22 +233,22 @@ xfs_dir2_block_to_sf( else if (dep->namelen == 2 && dep->name[0] == '.' && dep->name[1] == '.') ASSERT(be64_to_cpu(dep->inumber) == - XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent)); + xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent)); /* * Normal entry, copy it into shortform. 
*/ else { sfep->namelen = dep->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, + xfs_dir2_sf_put_offset(sfep, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)block)); memcpy(sfep->name, dep->name, dep->namelen); temp = be64_to_cpu(dep->inumber); - XFS_DIR2_SF_PUT_INUMBER(sfp, &temp, - XFS_DIR2_SF_INUMBERP(sfep)); - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + xfs_dir2_sf_put_inumber(sfp, &temp, + xfs_dir2_sf_inumberp(sfep)); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } - ptr += XFS_DIR2_DATA_ENTSIZE(dep->namelen); + ptr += xfs_dir2_data_entsize(dep->namelen); } ASSERT((char *)sfep - (char *)sfp == size); xfs_dir2_sf_check(args); @@ -294,11 +294,11 @@ xfs_dir2_sf_addname( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Compute entry (and change in) size. */ - add_entsize = XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen); + add_entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen); incr_isize = add_entsize; objchange = 0; #if XFS_BIG_INUMS @@ -392,7 +392,7 @@ xfs_dir2_sf_addname_easy( /* * Grow the in-inode space. */ - xfs_idata_realloc(dp, XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen), + xfs_idata_realloc(dp, xfs_dir2_sf_entsize_byname(sfp, args->namelen), XFS_DATA_FORK); /* * Need to set up again due to realloc of the inode data. @@ -403,10 +403,10 @@ xfs_dir2_sf_addname_easy( * Fill in the new entry. */ sfep->namelen = args->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, offset); + xfs_dir2_sf_put_offset(sfep, offset); memcpy(sfep->name, args->name, sfep->namelen); - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); /* * Update the header and inode. 
*/ @@ -463,14 +463,14 @@ xfs_dir2_sf_addname_hard( * If it's going to end up at the end then oldsfep will point there. */ for (offset = XFS_DIR2_DATA_FIRST_OFFSET, - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp), - add_datasize = XFS_DIR2_DATA_ENTSIZE(args->namelen), + oldsfep = xfs_dir2_sf_firstentry(oldsfp), + add_datasize = xfs_dir2_data_entsize(args->namelen), eof = (char *)oldsfep == &buf[old_isize]; !eof; - offset = new_offset + XFS_DIR2_DATA_ENTSIZE(oldsfep->namelen), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep), + offset = new_offset + xfs_dir2_data_entsize(oldsfep->namelen), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep), eof = (char *)oldsfep == &buf[old_isize]) { - new_offset = XFS_DIR2_SF_GET_OFFSET(oldsfep); + new_offset = xfs_dir2_sf_get_offset(oldsfep); if (offset + add_datasize <= new_offset) break; } @@ -495,10 +495,10 @@ xfs_dir2_sf_addname_hard( * Fill in the new entry, and update the header counts. */ sfep->namelen = args->namelen; - XFS_DIR2_SF_PUT_OFFSET(sfep, offset); + xfs_dir2_sf_put_offset(sfep, offset); memcpy(sfep->name, args->name, sfep->namelen); - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); sfp->hdr.count++; #if XFS_BIG_INUMS if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && !objchange) @@ -508,7 +508,7 @@ xfs_dir2_sf_addname_hard( * If there's more left to copy, do that. */ if (!eof) { - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); memcpy(sfep, oldsfep, old_isize - nbytes); } kmem_free(buf, old_isize); @@ -544,9 +544,9 @@ xfs_dir2_sf_addname_pick( mp = dp->i_mount; sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - size = XFS_DIR2_DATA_ENTSIZE(args->namelen); + size = xfs_dir2_data_entsize(args->namelen); offset = XFS_DIR2_DATA_FIRST_OFFSET; - sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + sfep = xfs_dir2_sf_firstentry(sfp); holefit = 0; /* * Loop over sf entries. 
@@ -555,10 +555,10 @@ xfs_dir2_sf_addname_pick( */ for (i = 0; i < sfp->hdr.count; i++) { if (!holefit) - holefit = offset + size <= XFS_DIR2_SF_GET_OFFSET(sfep); - offset = XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(sfep->namelen); - sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep); + holefit = offset + size <= xfs_dir2_sf_get_offset(sfep); + offset = xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(sfep->namelen); + sfep = xfs_dir2_sf_nextentry(sfp, sfep); } /* * Calculate data bytes used excluding the new entry, if this @@ -617,18 +617,18 @@ xfs_dir2_sf_check( sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; offset = XFS_DIR2_DATA_FIRST_OFFSET; - ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); i8count = ino > XFS_DIR2_MAX_SHORT_INUM; - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { - ASSERT(XFS_DIR2_SF_GET_OFFSET(sfep) >= offset); - ino = XFS_DIR2_SF_GET_INUMBER(sfp, XFS_DIR2_SF_INUMBERP(sfep)); + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { + ASSERT(xfs_dir2_sf_get_offset(sfep) >= offset); + ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep)); i8count += ino > XFS_DIR2_MAX_SHORT_INUM; offset = - XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(sfep->namelen); + xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(sfep->namelen); } ASSERT(i8count == sfp->hdr.i8count); ASSERT(XFS_BIG_INUMS || i8count == 0); @@ -671,7 +671,7 @@ xfs_dir2_sf_create( ASSERT(dp->i_df.if_flags & XFS_IFINLINE); ASSERT(dp->i_df.if_bytes == 0); i8count = pino > XFS_DIR2_MAX_SHORT_INUM; - size = XFS_DIR2_SF_HDR_SIZE(i8count); + size = xfs_dir2_sf_hdr_size(i8count); /* * Make a buffer for the data. */ @@ -684,7 +684,7 @@ xfs_dir2_sf_create( /* * Now can put in the inode number, since i8count is set. 
*/ - XFS_DIR2_SF_PUT_INUMBER(sfp, &pino, &sfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &pino, &sfp->hdr.parent); sfp->hdr.count = 0; dp->i_d.di_size = size; xfs_dir2_sf_check(args); @@ -727,12 +727,12 @@ xfs_dir2_sf_getdents( sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * If the block number in the offset is out of range, we're done. */ - if (XFS_DIR2_DATAPTR_TO_DB(mp, dir_offset) > mp->m_dirdatablk) { + if (xfs_dir2_dataptr_to_db(mp, dir_offset) > mp->m_dirdatablk) { *eofp = 1; return 0; } @@ -747,9 +747,9 @@ xfs_dir2_sf_getdents( * Put . entry unless we're starting past it. */ if (dir_offset <= - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOT_OFFSET)) { - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, 0, + p.cook = xfs_dir2_db_off_to_dataptr(mp, 0, XFS_DIR2_DATA_DOTDOT_OFFSET); p.ino = dp->i_ino; #if XFS_BIG_INUMS @@ -762,7 +762,7 @@ xfs_dir2_sf_getdents( if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOT_OFFSET); return error; } @@ -772,11 +772,11 @@ xfs_dir2_sf_getdents( * Put .. entry unless we're starting past it. 
*/ if (dir_offset <= - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOTDOT_OFFSET)) { - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_FIRST_OFFSET); - p.ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + p.ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); #if XFS_BIG_INUMS p.ino += mp->m_inoadd; #endif @@ -787,7 +787,7 @@ xfs_dir2_sf_getdents( if (!p.done) { uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, XFS_DIR2_DATA_DOTDOT_OFFSET); return error; } @@ -796,23 +796,23 @@ xfs_dir2_sf_getdents( /* * Loop while there are more entries and put'ing works. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { - off = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, - XFS_DIR2_SF_GET_OFFSET(sfep)); + off = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, + xfs_dir2_sf_get_offset(sfep)); if (dir_offset > off) continue; p.namelen = sfep->namelen; - p.cook = XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk, - XFS_DIR2_SF_GET_OFFSET(sfep) + - XFS_DIR2_DATA_ENTSIZE(p.namelen)); + p.cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, + xfs_dir2_sf_get_offset(sfep) + + xfs_dir2_data_entsize(p.namelen)); - p.ino = XFS_DIR2_SF_GET_INUMBER(sfp, XFS_DIR2_SF_INUMBERP(sfep)); + p.ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep)); #if XFS_BIG_INUMS p.ino += mp->m_inoadd; #endif @@ -832,7 +832,7 @@ xfs_dir2_sf_getdents( *eofp = 1; uio->uio_offset = - XFS_DIR2_DB_OFF_TO_DATAPTR(mp, mp->m_dirdatablk + 1, 0); + xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk + 1, 0); return 0; } @@ -865,7 +865,7 @@ xfs_dir2_sf_lookup( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); 
ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Special case for . */ @@ -878,21 +878,21 @@ xfs_dir2_sf_lookup( */ if (args->namelen == 2 && args->name[0] == '.' && args->name[1] == '.') { - args->inumber = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + args->inumber = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); return XFS_ERROR(EEXIST); } /* * Loop over all the entries trying to match ours. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(args->name, sfep->name, args->namelen) == 0) { args->inumber = - XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)); return XFS_ERROR(EEXIST); } } @@ -934,19 +934,19 @@ xfs_dir2_sf_removename( ASSERT(dp->i_df.if_bytes == oldsize); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(oldsize >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(oldsize >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); /* * Loop over the old directory entries. * Find the one we're deleting. 
*/ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(sfep->name, args->name, args->namelen) == 0) { - ASSERT(XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)) == + ASSERT(xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)) == args->inumber); break; } @@ -961,7 +961,7 @@ xfs_dir2_sf_removename( * Calculate sizes. */ byteoff = (int)((char *)sfep - (char *)sfp); - entsize = XFS_DIR2_SF_ENTSIZE_BYNAME(sfp, args->namelen); + entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen); newsize = oldsize - entsize; /* * Copy the part if any after the removed entry, sliding it down. @@ -1027,7 +1027,7 @@ xfs_dir2_sf_replace( ASSERT(dp->i_df.if_bytes == dp->i_d.di_size); ASSERT(dp->i_df.if_u1.if_data != NULL); sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data; - ASSERT(dp->i_d.di_size >= XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count)); + ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count)); #if XFS_BIG_INUMS /* * New inode number is large, and need to convert to 8-byte inodes. @@ -1067,28 +1067,28 @@ xfs_dir2_sf_replace( if (args->namelen == 2 && args->name[0] == '.' && args->name[1] == '.') { #if XFS_BIG_INUMS || defined(DEBUG) - ino = XFS_DIR2_SF_GET_INUMBER(sfp, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent); ASSERT(args->inumber != ino); #endif - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, &sfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, &sfp->hdr.parent); } /* * Normal entry, look for the name. 
*/ else { - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) { if (sfep->namelen == args->namelen && sfep->name[0] == args->name[0] && memcmp(args->name, sfep->name, args->namelen) == 0) { #if XFS_BIG_INUMS || defined(DEBUG) - ino = XFS_DIR2_SF_GET_INUMBER(sfp, - XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(sfp, + xfs_dir2_sf_inumberp(sfep)); ASSERT(args->inumber != ino); #endif - XFS_DIR2_SF_PUT_INUMBER(sfp, &args->inumber, - XFS_DIR2_SF_INUMBERP(sfep)); + xfs_dir2_sf_put_inumber(sfp, &args->inumber, + xfs_dir2_sf_inumberp(sfep)); break; } } @@ -1189,22 +1189,22 @@ xfs_dir2_sf_toino4( */ sfp->hdr.count = oldsfp->hdr.count; sfp->hdr.i8count = 0; - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, &oldsfp->hdr.parent); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent); /* * Copy the entries field by field. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp), - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp), + oldsfep = xfs_dir2_sf_firstentry(oldsfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) { sfep->namelen = oldsfep->namelen; sfep->offset = oldsfep->offset; memcpy(sfep->name, oldsfep->name, sfep->namelen); - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, - XFS_DIR2_SF_INUMBERP(oldsfep)); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(oldsfp, + xfs_dir2_sf_inumberp(oldsfep)); + xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep)); } /* * Clean up the inode. 
@@ -1266,22 +1266,22 @@ xfs_dir2_sf_toino8( */ sfp->hdr.count = oldsfp->hdr.count; sfp->hdr.i8count = 1; - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, &oldsfp->hdr.parent); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, &sfp->hdr.parent); + ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent); + xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent); /* * Copy the entries field by field. */ - for (i = 0, sfep = XFS_DIR2_SF_FIRSTENTRY(sfp), - oldsfep = XFS_DIR2_SF_FIRSTENTRY(oldsfp); + for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp), + oldsfep = xfs_dir2_sf_firstentry(oldsfp); i < sfp->hdr.count; - i++, sfep = XFS_DIR2_SF_NEXTENTRY(sfp, sfep), - oldsfep = XFS_DIR2_SF_NEXTENTRY(oldsfp, oldsfep)) { + i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep), + oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) { sfep->namelen = oldsfep->namelen; sfep->offset = oldsfep->offset; memcpy(sfep->name, oldsfep->name, sfep->namelen); - ino = XFS_DIR2_SF_GET_INUMBER(oldsfp, - XFS_DIR2_SF_INUMBERP(oldsfep)); - XFS_DIR2_SF_PUT_INUMBER(sfp, &ino, XFS_DIR2_SF_INUMBERP(sfep)); + ino = xfs_dir2_sf_get_inumber(oldsfp, + xfs_dir2_sf_inumberp(oldsfep)); + xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep)); } /* * Clean up the inode. 
Index: linux-2.6/fs/xfs/xfs_dir2_sf.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_dir2_sf.h 2007-04-13 13:57:01.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_dir2_sf.h 2007-04-13 14:01:37.000000000 +0200 @@ -90,7 +90,6 @@ typedef struct xfs_dir2_sf { xfs_dir2_sf_entry_t list[1]; /* shortform entries */ } xfs_dir2_sf_t; -#define XFS_DIR2_SF_HDR_SIZE(i8count) xfs_dir2_sf_hdr_size(i8count) static inline int xfs_dir2_sf_hdr_size(int i8count) { return ((uint)sizeof(xfs_dir2_sf_hdr_t) - \ @@ -98,14 +97,11 @@ static inline int xfs_dir2_sf_hdr_size(i ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_INUMBERP(sfep) xfs_dir2_sf_inumberp(sfep) static inline xfs_dir2_inou_t *xfs_dir2_sf_inumberp(xfs_dir2_sf_entry_t *sfep) { return (xfs_dir2_inou_t *)&(sfep)->name[(sfep)->namelen]; } -#define XFS_DIR2_SF_GET_INUMBER(sfp, from) \ - xfs_dir2_sf_get_inumber(sfp, from) static inline xfs_intino_t xfs_dir2_sf_get_inumber(xfs_dir2_sf_t *sfp, xfs_dir2_inou_t *from) { @@ -114,8 +110,6 @@ xfs_dir2_sf_get_inumber(xfs_dir2_sf_t *s (xfs_intino_t)XFS_GET_DIR_INO8((from)->i8)); } -#define XFS_DIR2_SF_PUT_INUMBER(sfp,from,to) \ - xfs_dir2_sf_put_inumber(sfp,from,to) static inline void xfs_dir2_sf_put_inumber(xfs_dir2_sf_t *sfp, xfs_ino_t *from, xfs_dir2_inou_t *to) { @@ -125,24 +119,18 @@ static inline void xfs_dir2_sf_put_inumb XFS_PUT_DIR_INO8(*(from), (to)->i8); } -#define XFS_DIR2_SF_GET_OFFSET(sfep) \ - xfs_dir2_sf_get_offset(sfep) static inline xfs_dir2_data_aoff_t xfs_dir2_sf_get_offset(xfs_dir2_sf_entry_t *sfep) { return INT_GET_UNALIGNED_16_BE(&(sfep)->offset.i); } -#define XFS_DIR2_SF_PUT_OFFSET(sfep,off) \ - xfs_dir2_sf_put_offset(sfep,off) static inline void xfs_dir2_sf_put_offset(xfs_dir2_sf_entry_t *sfep, xfs_dir2_data_aoff_t off) { INT_SET_UNALIGNED_16_BE(&(sfep)->offset.i, off); } -#define XFS_DIR2_SF_ENTSIZE_BYNAME(sfp,len) \ - xfs_dir2_sf_entsize_byname(sfp,len) static inline int 
xfs_dir2_sf_entsize_byname(xfs_dir2_sf_t *sfp, int len) { return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (len) - \ @@ -150,8 +138,6 @@ static inline int xfs_dir2_sf_entsize_by ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_ENTSIZE_BYENTRY(sfp,sfep) \ - xfs_dir2_sf_entsize_byentry(sfp,sfep) static inline int xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep) { @@ -160,19 +146,17 @@ xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_ ((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t))); } -#define XFS_DIR2_SF_FIRSTENTRY(sfp) xfs_dir2_sf_firstentry(sfp) static inline xfs_dir2_sf_entry_t *xfs_dir2_sf_firstentry(xfs_dir2_sf_t *sfp) { return ((xfs_dir2_sf_entry_t *) \ - ((char *)(sfp) + XFS_DIR2_SF_HDR_SIZE(sfp->hdr.i8count))); + ((char *)(sfp) + xfs_dir2_sf_hdr_size(sfp->hdr.i8count))); } -#define XFS_DIR2_SF_NEXTENTRY(sfp,sfep) xfs_dir2_sf_nextentry(sfp,sfep) static inline xfs_dir2_sf_entry_t * xfs_dir2_sf_nextentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep) { return ((xfs_dir2_sf_entry_t *) \ - ((char *)(sfep) + XFS_DIR2_SF_ENTSIZE_BYENTRY(sfp,sfep))); + ((char *)(sfep) + xfs_dir2_sf_entsize_byentry(sfp,sfep))); } /* From owner-xfs@oss.sgi.com Wed Apr 18 16:03:56 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:03:59 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3IN3sfB029878 for ; Wed, 18 Apr 2007 16:03:56 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 32A134E457A; Wed, 18 Apr 2007 17:03:53 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 7158B4141; Wed, 18 Apr 2007 17:03:50 -0600 (MDT) Date: Wed, 18 Apr 2007 17:03:50 -0600 From: Andreas Dilger To: Timothy Shimmin 
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070418230349.GJ5967@schatzie.adilger.int> Mail-Followup-To: Timothy Shimmin , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <31588A06562720FE1E0F93DF@timothy-shimmins-power-mac-g5.local> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11110 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 16, 2007 18:01 +1000, Timothy Shimmin wrote: > --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger > wrote: > >struct fiemap_extent { > > __u64 fe_start; /* starting offset in bytes */ > > __u64 fe_len; /* length in bytes */ > >} > > > >struct fiemap { > > struct fiemap_extent fm_start; /* offset, length of desired mapping > > */ > > __u32 fm_extent_count; /* number of extents in array */ > > __u32 fm_flags; /* flags (similar to > > XFS_IOC_GETBMAP) */ > > __u64 unused; > > struct fiemap_extent fm_extents[0]; > >} > > > ># define FIEMAP_LEN_MASK 0xff000000000000 > ># define FIEMAP_LEN_HOLE 0x01000000000000 > ># define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > > >All offsets are in bytes to allow cases where filesystems are not going > >block-aligned/sized allocations (e.g. tail packing). The fm_extents array > >returned contains the packed list of allocation extents for the file, > >including entries for holes (which have fe_start == 0, and a flag). 
> > > >The ->fm_extents[] array includes all of the holes in addition to > >allocated extents because this avoids the need to return both the logical > >and physical address for every extent and does not make processing any > >harder. > > Well, that's what stood out for me. I was wondering where the "fe_block" > field had gone - the "physical address". > So is your "fe_start; /* starting offset */" actually the disk location > (not a logical file offset) > _except_ in the header (fiemap) where it is the desired logical offset. Correct. The fm_extent in the request contains the logical start offset and length in bytes of the requested fiemap region. In the returned header it represents the logical start offset of the extent that contained the requested start offset, and the logical length of all the returned extents. I haven't decided whether the returned length should be until EOF, or have the "virtual hole" at the end of the file. I think EOF makes more sense. The fe_start + fe_len in the fm_extents represent the physical location on the block device for that extent. fm_extent[i].fe_start (per Anton) is undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole. > Okay, looking at your example use below that's what it looks like. > And when you refer to fm_start below, you mean fm_start.fe_start? > Sorry, I realise this is just an approximation but this part confused me. Right, I'll write up a new RFC based on feedback here, and correcting the various errors in the original proposal. > So you get rid of all the logical file offsets in the extents because we > report holes explicitly (and we know everything is contiguous if you > include the holes). Correct. It saves space in the common case. 
> >Caller works something like: > > > > char buf[4096]; > > struct fiemap *fm = (struct fiemap *)buf; > > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > > > fm->fm_start.fe_start = 0; /* start of file */ > > fm->fm_start.fe_len = -1; /* end of file */ > > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > > > fd = open(path, O_RDONLY); > > printf("logical\t\tphysical\t\tbytes\n"); > > > > /* The last entry will have less extents than the maximum */ > > while (fm->fm_extent_count == count) { > > rc = ioctl(fd, FIEMAP, fm); > > if (rc) > > break; > > > > /* kernel filled in fm_extents[] array, set fm_extent_count > > * to be actual number of extents returned, leaves > > * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */ > > > > for (i = 0; i < fm->fm_extent_count; i++) { > > __u64 len = fm->fm_extents[i].fe_len & > > FIEMAP_LEN_MASK; > > __u64 fm_next = fm->fm_start.fe_start + len; > > int hole = fm->fm_extents[i].fe_len & > > FIEMAP_LEN_HOLE; > > int unwr = fm->fm_extents[i].fe_len & > > FIEMAP_LEN_UNWRITTEN; > > > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > > fm->fm_start.fe_start, fm_next - 1, > > hole ? 0 : fm->fm_extents[i].fe_start, > > hole ? 0 : fm->fm_extents[i].fe_start + > > fm->fm_extents[i].fe_len - 1, > > len, hole ? "(hole) " : "", > > unwr ? "(unwritten) " : ""); > > > > /* get ready for printing next extent, or next ioctl > > */ > > fm->fm_start.fe_start = fm_next; > > } > > } > > Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
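Editorial sketch of the offset bookkeeping described in this message: since holes are reported explicitly and the extents are contiguous in logical space, a logical offset can be resolved by summing extent lengths. This is a minimal illustration, not code from the thread. It assumes the first-draft layout quoted above (flag bits stored inside fe_len), and it assumes the byte length is what remains after clearing FIEMAP_LEN_MASK; the quoted caller masks the other way round, which may be one of the errors the follow-up RFC is meant to correct. The helper names extent_bytes and logical_to_physical are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants as quoted in the first draft of the RFC. */
#define FIEMAP_LEN_MASK      0xff000000000000ULL
#define FIEMAP_LEN_HOLE      0x01000000000000ULL
#define FIEMAP_LEN_UNWRITTEN 0x02000000000000ULL

/* First-draft extent record: physical start plus length-with-flags. */
struct fiemap_extent {
	uint64_t fe_start;	/* physical offset in bytes */
	uint64_t fe_len;	/* length in bytes, flag bits in the mask */
};

/* Assumption: the byte length is fe_len with the flag bits cleared. */
static uint64_t extent_bytes(const struct fiemap_extent *fe)
{
	return fe->fe_len & ~FIEMAP_LEN_MASK;
}

/*
 * Hypothetical helper: resolve a logical file offset against a returned
 * extent array whose first entry begins at logical offset map_start.
 * Returns 0 and stores the physical byte offset on success, or -1 if the
 * offset falls in a hole or outside the mapped range.
 */
static int logical_to_physical(const struct fiemap_extent *ext, size_t count,
			       uint64_t map_start, uint64_t logical,
			       uint64_t *physical)
{
	uint64_t pos = map_start;	/* logical start of ext[i] */
	size_t i;

	assert(ext != NULL && physical != NULL);
	if (logical < map_start)
		return -1;

	for (i = 0; i < count; i++) {
		uint64_t len = extent_bytes(&ext[i]);

		/* logical >= pos always holds here, so this cannot wrap */
		if (logical - pos < len) {
			if (ext[i].fe_len & FIEMAP_LEN_HOLE)
				return -1;	/* fe_start is undefined for holes */
			*physical = ext[i].fe_start + (logical - pos);
			return 0;
		}
		pos += len;
	}
	return -1;
}
```

The walk only works because every hole has its own entry; without them the caller would need a logical offset in each record, which is the space saving discussed above.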
From owner-xfs@oss.sgi.com Wed Apr 18 16:51:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:51:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3INp8fB014489 for ; Wed, 18 Apr 2007 16:51:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA12048; Thu, 19 Apr 2007 09:51:00 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3INowAf70438188; Thu, 19 Apr 2007 09:50:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3INouZT70152189; Thu, 19 Apr 2007 09:50:56 +1000 (AEST) Date: Thu, 19 Apr 2007 09:50:56 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] remove various useless min/max macros Message-ID: <20070418235056.GJ48531920@melbourne.sgi.com> References: <20070418175730.GA18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175730.GA18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11111 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:57:30PM +0200, Christoph Hellwig wrote: > xfs_btree.h has various macros to calculate a min/max after casting > it's arguments to a specific type. This can be done much simpler > by using min_t/max_t with the type as first argument. Sure, but I NACKed that last October for good reason. 
http://marc.info/?t=116116017600003&r=1&w=2 Specifically: http://marc.info/?l=linux-kernel&m=116122285309389&w=2 I still have no objection to changing the implementation of these macros or even changing them to non-shouting static inlines but I don't want them removed.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 16:57:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 16:57:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3INvofB016054 for ; Wed, 18 Apr 2007 16:57:52 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA12162; Thu, 19 Apr 2007 09:57:45 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3INvhAf69889917; Thu, 19 Apr 2007 09:57:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3INvfAQ69832729; Thu, 19 Apr 2007 09:57:41 +1000 (AEST) Date: Thu, 19 Apr 2007 09:57:41 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070418235741.GK48531920@melbourne.sgi.com> References: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175859.GB18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11112 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:59:00PM +0200, Christoph Hellwig wrote: > Remove all the macros that just give inline functions uppercase names. 
> > Signed-off-by: Christoph Hellwig Added to my QA tree. Thanks, Christoph. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 17:09:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 17:09:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J09qfB019240 for ; Wed, 18 Apr 2007 17:09:53 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA12511; Thu, 19 Apr 2007 10:09:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J09gAf69610411; Thu, 19 Apr 2007 10:09:42 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J09ekX69103188; Thu, 19 Apr 2007 10:09:40 +1000 (AEST) Date: Thu, 19 Apr 2007 10:09:40 +1000 From: David Chinner To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070419000940.GL48531920@melbourne.sgi.com> References: <20070418175859.GB18315@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418175859.GB18315@lst.de> User-Agent: Mutt/1.4.2.1i X-archive-position: 11113 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:59:00PM +0200, Christoph Hellwig wrote: > Remove all the macros that just give inline functions uppercase names. > > Signed-off-by: Christoph Hellwig BTW, you'll need this patch to make debug kernels build.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfsidbg.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfsidbg.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfsidbg.c 2007-03-30 09:30:01.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfsidbg.c 2007-04-19 10:02:29.565671598 +1000 @@ -5490,7 +5490,7 @@ xfs_dir2data(void *addr, int size) /* XFS_DIR2_BLOCK_TAIL_P */ tail = (xfs_dir2_block_tail_t *) ((char *)bb + size - sizeof(xfs_dir2_block_tail_t)); - l = XFS_DIR2_BLOCK_LEAF_P(tail); + l = xfs_dir2_block_leaf_p(tail); t = (char *)l; } for (p = (char *)(h + 1); p < t; ) { @@ -5500,7 +5500,7 @@ xfs_dir2data(void *addr, int size) (unsigned long) (p - (char *)addr), INT_GET(u->freetag, ARCH_CONVERT), INT_GET(u->length, ARCH_CONVERT), - INT_GET(*XFS_DIR2_DATA_UNUSED_TAG_P(u), ARCH_CONVERT)); + INT_GET(*xfs_dir2_data_unused_tag_p(u), ARCH_CONVERT)); p += INT_GET(u->length, ARCH_CONVERT); continue; } @@ -5511,8 +5511,8 @@ xfs_dir2data(void *addr, int size) e->namelen); for (k = 0; k < e->namelen; k++) kdb_printf("%c", e->name[k]); - kdb_printf("\" tag 0x%x\n", INT_GET(*XFS_DIR2_DATA_ENTRY_TAG_P(e), ARCH_CONVERT)); - p += XFS_DIR2_DATA_ENTSIZE(e->namelen); + kdb_printf("\" tag 0x%x\n", INT_GET(*xfs_dir2_data_entry_tag_p(e), ARCH_CONVERT)); + p += xfs_dir2_data_entsize(e->namelen); } if (INT_GET(h->magic, ARCH_CONVERT) == XFS_DIR2_DATA_MAGIC) return; @@ -5557,7 +5557,7 @@ xfs_dir2leaf(xfs_dir2_leaf_t *leaf, int return; /* XFS_DIR2_LEAF_TAIL_P */ t = (xfs_dir2_leaf_tail_t *)((char *)leaf + size - sizeof(*t)); - b = XFS_DIR2_LEAF_BESTS_P(t); + b = xfs_dir2_leaf_bests_p(t); for (j = 0; j < INT_GET(t->bestcount, ARCH_CONVERT); j++, b++) { kdb_printf("0x%lx best %d 0x%x\n", (unsigned long) ((char *)b - (char *)leaf), j, @@ -5578,19 +5578,19 @@ xfsidbg_xdir2sf(xfs_dir2_sf_t *s) int i, j; sfh = &s->hdr; - ino = XFS_DIR2_SF_GET_INUMBER(s, 
&sfh->parent); + ino = xfs_dir2_sf_get_inumber(s, &sfh->parent); kdb_printf("hdr count %d i8count %d parent %llu\n", sfh->count, sfh->i8count, (unsigned long long) ino); - for (i = 0, sfe = XFS_DIR2_SF_FIRSTENTRY(s); i < sfh->count; i++) { - ino = XFS_DIR2_SF_GET_INUMBER(s, XFS_DIR2_SF_INUMBERP(sfe)); + for (i = 0, sfe = xfs_dir2_sf_firstentry(s); i < sfh->count; i++) { + ino = xfs_dir2_sf_get_inumber(s, xfs_dir2_sf_inumberp(sfe)); kdb_printf("entry %d inumber %llu offset 0x%x namelen %d name \"", i, (unsigned long long) ino, - XFS_DIR2_SF_GET_OFFSET(sfe), + xfs_dir2_sf_get_offset(sfe), sfe->namelen); for (j = 0; j < sfe->namelen; j++) kdb_printf("%c", sfe->name[j]); kdb_printf("\"\n"); - sfe = XFS_DIR2_SF_NEXTENTRY(s, sfe); + sfe = xfs_dir2_sf_nextentry(s, sfe); } } From owner-xfs@oss.sgi.com Wed Apr 18 17:21:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 17:21:50 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J0LkfB023372 for ; Wed, 18 Apr 2007 17:21:47 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id 8472C4E457A; Wed, 18 Apr 2007 18:21:41 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id AF2124141; Wed, 18 Apr 2007 18:21:39 -0600 (MDT) Date: Wed, 18 Apr 2007 18:21:39 -0600 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070419002139.GK5967@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> 
<20070416112252.GJ48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070416112252.GJ48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11114 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 16, 2007 21:22 +1000, David Chinner wrote: > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > > struct fiemap_extent { > > __u64 fe_start; /* starting offset in bytes */ > > __u64 fe_len; /* length in bytes */ > > } > > > > struct fiemap { > > struct fiemap_extent fm_start; /* offset, length of desired mapping */ > > __u32 fm_extent_count; /* number of extents in array */ > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > __u64 unused; > > struct fiemap_extent fm_extents[0]; > > } > > > > #define FIEMAP_LEN_MASK 0xff000000000000 > > #define FIEMAP_LEN_HOLE 0x01000000000000 > > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > I'm not sure I like stealing bits from the length to use a flags - > I'd prefer an explicit field per fiemap_extent for this. Christoph expressed the same concern. I'm not dead set against having an extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may mean the need for 50% more ioctls if the file is large. 
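Editorial sketch of the "50% more ioctls" estimate: growing each extent record from 16 to 24 bytes reduces how many records fit in a fixed buffer. The helpers below are hypothetical and only do the arithmetic. The 32-byte header size is an assumption derived from the fields as listed in the proposal (16 bytes of start/length, two __u32 fields and one __u64), with no padding.

```c
#include <assert.h>
#include <stddef.h>

/* Number of extent records that fit in one buffer after the header. */
static size_t extents_per_call(size_t buf_size, size_t hdr_size,
			       size_t ext_size)
{
	assert(ext_size > 0 && buf_size >= hdr_size);
	return (buf_size - hdr_size) / ext_size;
}

/* Number of ioctl calls needed to map n_extents, rounded up. */
static size_t calls_needed(size_t n_extents, size_t per_call)
{
	assert(per_call > 0);
	return (n_extents + per_call - 1) / per_call;
}
```

With a 4096-byte buffer, 16-byte records give 254 extents per call and 24-byte records give 169. For a file with 10000 extents that is 40 calls versus 60, which matches the 50% figure; a larger buffer reduces the call count for either record size.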
Below is an aggregation of the comments in this thread: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_lun; /* logical storage device number in array */ } struct fiemap { __u64 fm_start; /* logical start offset of mapping (in/out) */ __u64 fm_len; /* logical length of mapping (in/out) */ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ __u64 fm_unused; struct fiemap_extent fm_extents[0]; } /* flags for the fiemap request */ #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ /* flags for the returned extents */ #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */ #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */ #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */ #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */ #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */ SUMMARY OF CHANGES ================== - use fm_* fields directly in request instead of making it a fiemap_extent (though they are layed out identically) - separate flags word for fm_flags: - FIEMAP_FLAG_SYNC = range should be synced to disk before returning mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future if there is agreement on whether that is desirable to have or if it is better to call ioctl(FIEMAP) on an XATTR fd. 
- FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it - __u64 fm_unused does not take up an extra space on all power-of-two buffer sizes (would otherwise be at end of buffer), and may be handy in the future. - add separate fe_flags word with flags from various suggestions: - FIEMAP_EXTENT_HOLE = extent has no space allocation - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown (e.g. HSM, delalloc awaiting sync, etc) - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno? - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data encrypted, compressed, etc), may want separate flags for these? - add new fe_lun word per extent for filesystems that manage multiple devices (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused. > Given that xfs_bmap uses extra information from the filesystem > (geometry) to display extra (and frequently used) information > about the alignment of extents. ie: > > chook 681% xfs_bmap -vv fred > fred: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010 > FLAG Values: > 010000 Unwritten preallocated extent > 001000 Doesn't begin on stripe unit > 000100 Doesn't end on stripe unit > 000010 Doesn't begin on stripe width > 000001 Doesn't end on stripe width Can you clarify the terminology here? What is a "stripe unit" and what is a "stripe width"? Are there "N * stripe_unit = stripe_width" in e.g. a RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa? I don't mind adding this, as long as it's clear that some filesystems don't have this kind of information. > This information could be easily passed up in the flags fields if the > filesystem has geometry information (there go 4 more flags ;). 
Got lots of flag bits now. > Also - what are the explicit sync semantics of this ioctl? The > XFS ioctl causes a fsync of the file first to convert delalloc > extents to real extents before returning the bmap. Is this functionality > going to be the same? If not, then we need a DELALLOC flag to indicate > extents that haven't been allocated yet. This might be handy to > have, anyway.... Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care, and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc. > > The fm_extents array > > returned contains the packed list of allocation extents for the file, > > including entries for holes (which have fe_start == 0, and a flag). > > Internalling in XFS, we pass these around as: > > #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL) > #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) We could do this too, instead of having flags, but many of the proposed flags are orthogonal so we'd end up needing a lot of separate values here and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested. > > required expanding the per-extent struct from 32 to 48 bytes per extent, > > not sure I follow your maths here? That was the case for XFS getbmap vs. getbmapx. For FIEMAP it increases the extent size from 16 to 24 bytes. > > Caller works something like: > > > > char buf[4096]; > > struct fiemap *fm = (struct fiemap *)buf; > > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > > > fm->fm_start.fe_start = 0; /* start of file */ > > fm->fm_start.fe_len = -1; /* end of file */ > > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */ > > > > fd = open(path, O_RDONLY); > > printf("logical\t\tphysical\t\tbytes\n"); > > > > /* The last entry will have less extents than the maximum */ > > while (fm->fm_extent_count == count) { > > fm_extent_count is an in/out parameter? Correct. 
> > > rc = ioctl(fd, FIEMAP, fm); > > if (rc) > > break; > > > > /* kernel filled in fm_extents[] array, set fm_extent_count > > * to be actual number of extents returned, leaves fm_start > > * alone (unlike XFS_IOC_GETBMAP). */ > > Ok, it is. > > > for (i = 0; i < fm->fm_extent_count; i++) { > > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > > __u64 fm_next = fm->fm_start.fe_start + len; > > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > > fm->fm_start.fe_start, fm_next - 1, > > hole ? 0 : fm->fm_extents[i].fe_start, > > hole ? 0 : fm->fm_extents[i].fe_start + > > fm->fm_extents[i].fe_len - 1, > > len, hole ? "(hole) " : "", > > unwr ? "(unwritten) " : ""); > > > > /* get ready for printing next extent, or next ioctl */ > > fm->fm_start.fe_start = fm_next; > > Ok, so the only way you can determine where you are in the file > is by adding up the length of each extent. What happens if the file > is changing underneath you e.g. someone punches out a hole > in the file, or truncates and extends it again between ioctl() > calls? Well, that is always true with data once it is out of the caller. > Also, what happens if you ask for an offset/len that doesn't map to > any extent boundaries - are you truncating the extents returned to > the off/len passed in? The request offset will be returned as the start of the actual extent that it falls inside. And the returned extents will end with the extent that ends at or after the requested fm_start + fm_len. > xfs_bmap gets around this by finding out how many extents there are in the > file and allocating a buffer that big to hold all the extents so they > are gathered in a single atomic call (think sparse matrix files).... Yeah, except this might be persistent for a long time if it isn't fully read with a single ioctl and the app never continues reading but doesn't close the fd.
> > I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP. > > I'm quite open to suggestions at this point, both in terms of how usable > > the fiemap data structures are by the caller, and if we need to add anything > > to make them more flexible for the future. > > ioctl is fine by me. perhaps a version number in the structure header > would be handy so we can modify the interface easily in the future > without having to worry about breaking userspace.... Yeah, but premature optimization and such. Would rather have INCOMPAT flags instead of version numbers. > > In terms of implementing this in the kernel, there was originally code for > > this during the development of the ext3 extent patches and it was done via > > a callback in the extent tree iterator so it is very efficient. I believe > > it implements all that is needed to allow this interface to be mapped > > onto XFS_IOC_BMAP internally (or vice versa). > > I wouldn't map the ioctls - I'd just write another interface to > xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP > interface. is there any code yet? Up to you, I was just suggesting "mapping" in the generic sense. The flags and values would all have to be changed anyways. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
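Editorial sketch of the INCOMPAT-flag idea preferred in this message over a version number: a request fails only when it sets a flag inside the FIEMAP_FLAG_INCOMPAT mask that the implementation does not recognise. The flag values are the ones proposed in this message. The function name fiemap_check_flags and the FIEMAP_FLAGS_UNDERSTOOD set are hypothetical, and silently ignoring unknown flags outside the mask is an assumption about the intended semantics, not something the thread states outright.

```c
#include <errno.h>
#include <stdint.h>

/* Request flags as proposed in this message. */
#define FIEMAP_FLAG_SYNC     0x00000001u /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ 0x00000002u /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff000000u /* must understand these flags */

/* Hypothetical: the flags this particular implementation understands. */
#define FIEMAP_FLAGS_UNDERSTOOD (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_HSM_READ)

/*
 * Hypothetical request check. Unknown flags inside the INCOMPAT mask
 * fail the call with -EOPNOTSUPP; unknown flags outside the mask are
 * ignored (assumption).
 */
static int fiemap_check_flags(uint32_t fm_flags)
{
	uint32_t unknown = fm_flags & ~FIEMAP_FLAGS_UNDERSTOOD;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EOPNOTSUPP;
	return 0;
}
```

Under this scheme a new feature that older kernels must not silently ignore gets a bit in the top byte, and userspace learns from the error that the feature is unavailable, without a structure version field.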
From owner-xfs@oss.sgi.com Wed Apr 18 18:54:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 18:54:46 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J1sdfB017095 for ; Wed, 18 Apr 2007 18:54:41 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA14925; Thu, 19 Apr 2007 11:54:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J1sSAf70586857; Thu, 19 Apr 2007 11:54:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J1sQCO70511966; Thu, 19 Apr 2007 11:54:26 +1000 (AEST) Date: Thu, 19 Apr 2007 11:54:26 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070419015426.GM48531920@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419002139.GK5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11115 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 06:21:39PM -0600, Andreas Dilger wrote: > On Apr 16, 2007 21:22 +1000, David Chinner wrote: > > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > > > struct fiemap_extent { > > > __u64 fe_start; /* starting offset in bytes */ > > > __u64 fe_len; /* length in bytes */ > > > } > > > > > > struct 
fiemap { > > > struct fiemap_extent fm_start; /* offset, length of desired mapping */ > > > __u32 fm_extent_count; /* number of extents in array */ > > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > > __u64 unused; > > > struct fiemap_extent fm_extents[0]; > > > } > > > > > > #define FIEMAP_LEN_MASK 0xff000000000000 > > > #define FIEMAP_LEN_HOLE 0x01000000000000 > > > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000 > > > > I'm not sure I like stealing bits from the length to use a flags - > > I'd prefer an explicit field per fiemap_extent for this. > > Christoph expressed the same concern. I'm not dead set against having an > extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may > mean the need for 50% more ioctls if the file is large. I don't think this overhead is a huge problem - just pass in a larger buffer (e.g. xfs_bmap can ask for thousands of extents in a single ioctl call as we can extract the number of extents in an inode via XFS_IOC_FSGETXATTRA). > Below is an aggregation of the comments in this thread: > > struct fiemap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > __u32 fe_lun; /* logical storage device number in array */ > } Oh, I missed the bit about the fe_lun - I was thinking something like that might be useful in future.... 
> struct fiemap { > __u64 fm_start; /* logical start offset of mapping (in/out) */ > __u64 fm_len; /* logical length of mapping (in/out) */ > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > __u64 fm_unused; > struct fiemap_extent fm_extents[0]; > } > > /* flags for the fiemap request */ > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > /* flags for the returned extents */ > #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */ > #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */ > #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */ > #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */ > #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */ SO, there's a HSM_READ flag above. If we are going to make this interface useful for filesystems that have HSMs interacting with their extents, the HSM needs to be able to query whether the extent is online (on disk), has been migrated offline (on tape) or in dual-state (i.e. both online and offline). 
> SUMMARY OF CHANGES > ================== > - use fm_* fields directly in request instead of making it a fiemap_extent > (though they are layed out identically) > > - separate flags word for fm_flags: > - FIEMAP_FLAG_SYNC = range should be synced to disk before returning > mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise > - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified > (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) > - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future > if there is agreement on whether that is desirable to have or if it is > better to call ioctl(FIEMAP) on an XATTR fd. > - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel > must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we > don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it > > - __u64 fm_unused does not take up an extra space on all power-of-two buffer > sizes (would otherwise be at end of buffer), and may be handy in the future. > > - add separate fe_flags word with flags from various suggestions: > - FIEMAP_EXTENT_HOLE = extent has no space allocation > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown > (e.g. HSM, delalloc awaiting sync, etc) I'd like an explicit delalloc flag, not lumping it in with "unknown". we *know* the extent is delalloc ;) > - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno? > - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data > encrypted, compressed, etc), may want separate flags for these? > > - add new fe_lun word per extent for filesystems that manage multiple devices > (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused. 
> > > > Given that xfs_bmap uses extra information from the filesystem > > (geometry) to display extra (and frequently used) information > > about the alignment of extents. ie: > > > > chook 681% xfs_bmap -vv fred > > fred: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010 > > FLAG Values: > > 010000 Unwritten preallocated extent > > 001000 Doesn't begin on stripe unit > > 000100 Doesn't end on stripe unit > > 000010 Doesn't begin on stripe width > > 000001 Doesn't end on stripe width > > Can you clarify the terminology here? What is a "stripe unit" and what is > a "stripe width"? Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount of data that is written to each lun in a stripe before moving onto the next stripe element. > Are there "N * stripe_unit = stripe_width" in e.g. a > RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa? Yes, on simple configurations. In more complex HW RAID configurations, we'll typically set the stripe unit to the width of the RAID5 lun (N * segment size) and the stripe width to the number of luns we've striped across. The reason I want this to come out of the filesystem is that one of the driving factors for multi-device support in XFS is to allow multiple devices of different geometries to co-exist efficiently in the one namespace (another reason I'm happy about the fe_lun addition). Passing this information out with the extent is far simpler than trying to find what device it lies on from userspace, then querying for the geometry of that device and then converting it. Especially when extents could lie on different devices with differing geometries.... > I don't mind adding this, as long as it's clear that some filesystems don't > have this kind of information. Sure. > > This information could be easily passed up in the flags fields if the > > filesystem has geometry information (there go 4 more flags ;). 
> > Got lots of flag bits now. Time to start using them all up ;) > > Also - what are the explicit sync semantics of this ioctl? The > > XFS ioctl causes a fsync of the file first to convert delalloc > > extents to real extents before returning the bmap. Is this functionality > > going to be the same? If not, then we need a DELALLOC flag to indicate > > extents that haven't been allocated yet. This might be handy to > > have, anyway.... > > Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care, OK. > and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc. I'd prefer explicit enumeration of them, as I said before... > > > The fm_extents array > > > returned contains the packed list of allocation extents for the file, > > > including entries for holes (which have fe_start == 0, and a flag). > > > > Internally in XFS, we pass these around as: > > > > #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL) > > #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) > > We could do this too, instead of having flags, but many of the proposed > flags are orthogonal so we'd end up needing a lot of separate values here > and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested. Yeah, fair enough. > > > for (i = 0; i < fm->fm_extent_count; i++) { > > > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; > > > __u64 fm_next = fm->fm_start.fe_start + len; > > > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; > > > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; > > > > > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", > > > fm->fm_start.fe_start, fm_next - 1, > > > hole ? 0 : fm->fm_extents[i].fe_start, > > > hole ? 0 : fm->fm_extents[i].fe_start + > > > fm->fm_extents[i].fe_len - 1, > > > len, hole ? "(hole) " : "", > > > unwr ? 
"(unwritten) " : ""); > > > > > > /* get ready for printing next extent, or next ioctl */ > > > fm->fm_start.fe_start = fm_next; > > > > Ok, so the only way you can determine where you are in the file > > is by adding up the length of each extent. What happens if the file > > is changing underneath you e.g. someone punches out a hole > > in the file, or truncates and extends it again between ioctl() > > calls? > > Well, that is always true with data once it is out of the caller. Sure, but this interface requires iterative calls where the n+1 call is reliant on nothing changing since the first call to be accurate. My question is how do you use this interface to reliably and accurately get all the extents if you are using iterative summing like this? > > Also, what happens if you ask for an offset/len that doesn't map to > > any extent boundaries - are you truncating the extents returned to > > the off/len passed in? > > The request offset will be returned as the start of the actual extent that > it falls inside. And the returned extents will end with the extent that > ends at or after the requested fm_start + fm_len. Ok, so you round the start inwards and round the end outwards. Can you ensure that this is documented in the header file that describes this interface? > > xfs_bmap gets around this by finding out how many extents there are in the > > file and allocating a buffer that big to hold all the extents so they > > are gathered in a single atomic call (think sparse matrix files).... > > Yeah, except this might be persistent for a long time if it isn't fully > read with a single ioctl and the app never continues reading but doesn't > close the fd. Not sure I follow you here... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 18 20:04:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 20:04:48 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J34efB001974 for ; Wed, 18 Apr 2007 20:04:42 -0700 Received: from pc-bnaujok.melbourne.sgi.com (pc-bnaujok.melbourne.sgi.com [134.14.55.58]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA16453 for ; Thu, 19 Apr 2007 13:04:39 +1000 Date: Thu, 19 Apr 2007 13:10:08 +1000 To: xfs@oss.sgi.com Subject: [PATCH] Fix xfs_fsr so the temp dir is not world readable/writable From: "Barry Naujok" Organization: SGI Content-Type: multipart/mixed; boundary=----------Z2DvzjHEAFgw2CcUZZOzh8 MIME-Version: 1.0 Message-ID: User-Agent: Opera Mail/9.10 (Win32) X-archive-position: 11116 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: bnaujok@melbourne.sgi.com Precedence: bulk X-list: xfs ------------Z2DvzjHEAFgw2CcUZZOzh8 Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Just changed the ".fsr" directory to 0700. I also improved the usage text to give more information. Barry. 
------------Z2DvzjHEAFgw2CcUZZOzh8 Content-Disposition: attachment; filename=xfs_fsr.patch Content-Type: application/octet-stream; name=xfs_fsr.patch Content-Transfer-Encoding: Base64 LS0tIGEveGZzZHVtcC9mc3IveGZzX2Zzci5jCTIwMDctMDQtMTkgMTM6MDI6 MDMuMDAwMDAwMDAwICsxMDAwCisrKyBiL3hmc2R1bXAvZnNyL3hmc19mc3Iu YwkyMDA3LTA0LTE5IDEyOjM4OjMyLjkzNTMyMjcxMCArMTAwMApAQCAtMTks NyArMTksNyBAQAogLyoKICAqIGZzciAtIGZpbGUgc3lzdGVtIHJlb3JnYW5p emVyCiAgKgotICogZnNyIFstZF0gWy12XSBbLW5dIFstc10gWy1nXSBbLXQg bWluc10gWy1mIGxlZnRmXSBbLW0gbXRhYl0KKyAqIGZzciBbLWRdIFstdl0g Wy1uXSBbLXNdIFstZ10gWy10IHNlY3NdIFstZiBsZWZ0Zl0gWy1tIG10YWJd CiAgKiBmc3IgWy1kXSBbLXZdIFstbl0gWy1zXSBbLWddIHhmc2RldiB8IGRp ciB8IGZpbGUgLi4uCiAgKgogICogSWYgaW52b2tlZCBpbiB0aGUgZmlyc3Qg Zm9ybSBmc3IgZG9lcyB0aGUgZm9sbG93aW5nOiBzdGFydGluZyB3aXRoIHRo ZQpAQCAtMTAwLDcgKzEwMCw3IEBAIHN0YXRpYyBfX2ludDY0X3QJbWluaW11 bWZyZWUgPSAyMDQ4Owogc3RhdGljIHRpbWVfdCBob3dsb25nID0gNzIwMDsJ CS8qIGRlZmF1bHQgc2Vjb25kcyBvZiByZW9yZ2FuaXppbmcgKi8KIHN0YXRp YyBjaGFyICpsZWZ0b2ZmZmlsZSA9ICIvdmFyL3RtcC8uZnNybGFzdF94ZnMi Oy8qIHdoZXJlIHdlIGxlZnQgb2ZmIGxhc3QgKi8KIHN0YXRpYyBjaGFyICpt dGFiID0gTU9VTlRFRDsKLXN0YXRpYyB0aW1lX3QgZW5kdGltZTsKK3N0YXRp YyB0aW1lX3QgZW5kdGltZSA9IDA7CiBzdGF0aWMgdGltZV90IHN0YXJ0dGlt ZTsKIHN0YXRpYyB4ZnNfaW5vX3QJbGVmdG9mZmlubyA9IDA7CiBzdGF0aWMg aW50CXBhZ2VzaXplOwpAQCAtMzU4LDcgKzM1OCwyMSBAQCBtYWluKGludCBh cmdjLCBjaGFyICoqYXJndikKIHZvaWQKIHVzYWdlKGludCByZXQpCiB7Ci0J ZnByaW50ZihzdGRlcnIsIF8oIlVzYWdlOiAlcyBbeGZzZmlsZV0gLi4uXG4i KSwgcHJvZ25hbWUpOworCWZwcmludGYoc3RkZXJyLCBfKAorIlVzYWdlOiAl cyBbLWRdIFstdl0gWy1uXSBbLXNdIFstZ10gWy10IHRpbWVdIFstcCBwYXNz ZXNdIFstZiBsZWZ0Zl0gWy1tIG10YWJdXG4iCisiICAgICAgICVzIFstZF0g Wy12XSBbLW5dIFstc10gWy1nXSB4ZnNkZXYgfCBkaXIgfCBmaWxlIC4uLlxu XG4iCisiT3B0aW9uczpcbiIKKyIgICAgICAgLW4gICAgICAgICAgICAgIERv IG5vdGhpbmcsIG9ubHkgaW50ZXJlc3Rpbmcgd2l0aCAtdi4gTm90XG4iCisi ICAgICAgICAgICAgICAgICAgICAgICBlZmZlY3RpdmUgd2l0aCBpbiBtdGFi IG1vZGUuXG4iCisiICAgICAgIC1zCQlQcmludCBzdGF0aXN0aWNzIG9ubHku 
XG4iCisiICAgICAgIC1nICAgICAgICAgICAgICBQcmludCB0byBzeXNsb2cg KGRlZmF1bHQgaWYgc3Rkb3V0IG5vdCBhIHR0eSkuXG4iCisiICAgICAgIC10 IHRpbWUgICAgICAgICBIb3cgbG9uZyB0byBydW4gaW4gc2Vjb25kcy5cbiIK KyIgICAgICAgLXAgcGFzc2VzCU51bWJlciBvZiBwYXNzZXMgYmVmb3JlIHRl cm1pbmF0aW5nIGdsb2JhbCByZS1vcmcuXG4iCisiICAgICAgIC1mIGxlZnRv ZmYgICAgICBVc2UgdGhpcyBpbnN0ZWFkIG9mIC9ldGMvZnNybGFzdC5cbiIK KyIgICAgICAgLW0gbXRhYiAgICAgICAgIFVzZSBzb21ldGhpbmcgb3RoZXIg dGhhbiAvZXRjL210YWIuXG4iCisiICAgICAgIC1kICAgICAgICAgICAgICBE ZWJ1ZywgcHJpbnQgZXZlbiBtb3JlLlxuIgorIiAgICAgICAtdgkJVmVyYm9z ZSwgbW9yZSAtdidzIG1vcmUgdmVyYm9zZS5cbiIKKwkJKSwgcHJvZ25hbWUs IHByb2duYW1lKTsKIAlleGl0KHJldCk7CiB9CiAKQEAgLTkxNSw3ICs5Mjks NyBAQCBmc3JmaWxlX2NvbW1vbigKIAl9CiAJaWYgKGZzeC5mc3hfeGZsYWdz ICYgWEZTX1hGTEFHX05PREVGUkFHKSB7CiAJCWlmICh2ZmxhZykKLQkJCWZz cnByaW50ZihfKCIlczogbWFya2VkIGFzIGRvbid0IGRlZnJhZywgaWdub3Jp bmdcbiIpLCAKKwkJCWZzcnByaW50ZihfKCIlczogbWFya2VkIGFzIGRvbid0 IGRlZnJhZywgaWdub3JpbmdcbiIpLAogCQkJICAgIGZuYW1lKTsKIAkJcmV0 dXJuKDApOwogCX0KQEAgLTE1MzMsNyArMTU0Nyw3IEBAIHRtcF9pbml0KGNo YXIgKm1udCkKIAlzcHJpbnRmKGJ1ZiwgIiVzLy5mc3IiLCBtbnQpOwogCiAJ bWFzayA9IHVtYXNrKDApOwotCWlmIChta2RpcihidWYsIDA3NzcpIDwgMCkg eworCWlmIChta2RpcihidWYsIDA3MDApIDwgMCkgewogCQlpZiAoZXJybm8g PT0gRUVYSVNUKSB7CiAJCQlpZiAoZGZsYWcpCiAJCQkJZnNycHJpbnRmKF8o InRtcGRpciBhbHJlYWR5IGV4aXN0czogJXNcbiIpLAo= ------------Z2DvzjHEAFgw2CcUZZOzh8-- From owner-xfs@oss.sgi.com Wed Apr 18 20:41:27 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 20:41:31 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J3fPfB013491 for ; Wed, 18 Apr 2007 20:41:27 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 07A6FAAC419; Thu, 19 Apr 2007 13:41:23 +1000 (EST) Subject: Re: [PATCH] Fix xfs_fsr so the temp dir is not world readable/writable From: Nathan Scott Reply-To: nscott@aconex.com To: Barry Naujok Cc: xfs@oss.sgi.com 
In-Reply-To: References: Content-Type: text/plain Organization: Aconex Date: Thu, 19 Apr 2007 13:42:35 +1000 Message-Id: <1176954155.6273.143.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11117 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-19 at 13:10 +1000, Barry Naujok wrote: > Looks good. > -static time_t endtime; > +static time_t endtime = 0; This line of the change is unnecessary though. cheers. -- Nathan From owner-xfs@oss.sgi.com Wed Apr 18 23:21:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 18 Apr 2007 23:21:33 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J6LQfB018388 for ; Wed, 18 Apr 2007 23:21:28 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA20363; Thu, 19 Apr 2007 16:21:08 +1000 Date: Thu, 19 Apr 2007 16:23:16 +1000 From: Timothy Shimmin To: Andreas Dilger , David Chinner cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <60F23AB8D50382586C1E0BFC@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070419002139.GK5967@schatzie.adilger.int> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11118 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs --On 18 April 2007 6:21:39 PM -0600 Andreas Dilger wrote: > Below is an aggregation of the comments in this thread: > > struct fiemap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > __u32 fe_lun; /* logical storage device number in array */ > } > > struct fiemap { > __u64 fm_start; /* logical start offset of mapping (in/out) */ > __u64 fm_len; /* logical length of mapping (in/out) */ > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > __u64 fm_unused; > struct fiemap_extent fm_extents[0]; > } > > /* flags for the fiemap request */ > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ > > /* flags for the returned extents */ > #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */ > #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */ > #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */ > #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */ > #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */ > > > > SUMMARY OF CHANGES > ================== > - use fm_* fields directly in request instead of making it a fiemap_extent > (though they are laid out identically) I much prefer that - it makes it a lot clearer to me to have fiemap_extent just for fm_extents (no different meanings now). 
(Don't like the word "offset" in comment without "physical" or some such but whatever;-) I also prefer the flags as separate fields too :) --Tim From owner-xfs@oss.sgi.com Thu Apr 19 00:19:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:19:10 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7J4fB001557 for ; Thu, 19 Apr 2007 00:19:06 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22249; Thu, 19 Apr 2007 17:18:58 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7IvAf70629110; Thu, 19 Apr 2007 17:18:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7IuCf70659502; Thu, 19 Apr 2007 17:18:56 +1000 (AEST) Date: Thu, 19 Apr 2007 17:18:56 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: make xfs_dm_punch_hole() atomic when punching EOF Message-ID: <20070419071856.GR48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11119 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Currently punching a hole to EOF via xfs_dm_punch_hole() truncates the file and then extends it. This leaves a small window where applications can see an incorrect file size while the punch is in progress. This can cause problems with DMF leading to premature completion of recalls and hence data corruption. Use the UNRESVSP ioctl rather than FREESP+setattr to punch the hole at EOF. 
This can leave speculative allocations past EOF, so truncate them off so that we don't leave blocks in the filesystem that can't be migrated away. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/dmapi/xfs_dm.c | 47 +++++++++++++++++++++++++++---------------- fs/xfs/linux-2.6/xfs_ksyms.c | 1 fs/xfs/xfs_rw.h | 14 ++++++++++-- fs/xfs/xfs_vnodeops.c | 28 ++++++++++++++++--------- 4 files changed, 60 insertions(+), 30 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/dmapi/xfs_dm.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/dmapi/xfs_dm.c 2007-04-19 16:55:44.345586509 +1000 +++ 2.6.x-xfs-new/fs/xfs/dmapi/xfs_dm.c 2007-04-19 17:18:05.818466833 +1000 @@ -2601,9 +2601,9 @@ xfs_dm_punch_hole( xfs_inode_t *xip; xfs_mount_t *mp; u_int bsize; - int cmd = XFS_IOC_UNRESVSP; /* punch */ xfs_fsize_t realsize; bhv_vnode_t *vp = vn_from_inode(inode); + int punch_to_eof = 0; /* Returns negative errors to DMAPI */ @@ -2638,12 +2638,24 @@ xfs_dm_punch_hole( down_rw_sems(inode, DM_SEM_FLAG_WR); xfs_ilock(xip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); - if ((off >= xip->i_size) || ((off+len) > xip->i_size)) { + realsize = xip->i_size; + + if ((off >= realsize) || ((off + len) > realsize)) { xfs_iunlock(xip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); error = -E2BIG; goto up_and_out; } - realsize = xip->i_size; + if (len == 0) + punch_to_eof = 1; + + /* + * When we are punching to EOF, we have to make sure we punch the + * last partial block that contains EOF. Round up the length to + * make sure we punch the block and not just zero it. 
+ */ + if (punch_to_eof) + len = roundup((realsize - off), bsize); + xfs_iunlock(xip, XFS_ILOCK_EXCL); bf.l_type = 0; @@ -2651,20 +2663,21 @@ xfs_dm_punch_hole( bf.l_start = (xfs_off_t)off; bf.l_len = (xfs_off_t)len; - if (len == 0) - cmd = XFS_IOC_FREESP; /* truncate */ - error = xfs_change_file_space(xbdp, cmd, &bf, (xfs_off_t)off, - sys_cred, - ATTR_DMI|ATTR_NOLOCK); - - /* If truncate, grow it back to its original size. */ - if ((error == 0) && (len == 0)) { - bhv_vattr_t va; - - va.va_mask = XFS_AT_SIZE; - va.va_size = realsize; - error = xfs_setattr(xbdp, &va, ATTR_DMI|ATTR_NOLOCK, - sys_cred); + error = xfs_change_file_space(xbdp, XFS_IOC_UNRESVSP, &bf, + (xfs_off_t)off, sys_cred, ATTR_DMI|ATTR_NOLOCK); + + /* + * if punching to end of file, kill any blocks past EOF that + * may have been (speculatively) preallocated. No point in + * leaving them around if we are migrating the file.... + */ + if (!error && punch_to_eof) { + error = xfs_free_eofblocks(mp, xip, XFS_FREE_EOF_NOLOCK); + if (!error) { + /* Update linux inode block count after free above */ + inode->i_blocks = XFS_FSB_TO_BB(mp, + xip->i_d.di_nblocks + xip->i_delayed_blks); + } } /* Let threads in send_data_event know we punched the file. 
*/ Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ksyms.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_ksyms.c 2007-04-19 16:56:33.471205020 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ksyms.c 2007-04-19 17:18:05.082563433 +1000 @@ -332,3 +332,4 @@ EXPORT_SYMBOL(xfs_xlatesb); EXPORT_SYMBOL(xfs_zero_eof); EXPORT_SYMBOL(xlog_recover_process_iunlinks); EXPORT_SYMBOL(xfs_ichgtime_fast); +EXPORT_SYMBOL(xfs_free_eofblocks); Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-04-19 16:56:33.655181121 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-04-19 17:18:04.890588633 +1000 @@ -1207,13 +1207,15 @@ xfs_fsync( } /* - * This is called by xfs_inactive to free any blocks beyond eof, - * when the link count isn't zero. + * This is called by xfs_inactive to free any blocks beyond eof + * when the link count isn't zero and by xfs_dm_punch_hole() when + * punching a hole to EOF. */ -STATIC int -xfs_inactive_free_eofblocks( +int +xfs_free_eofblocks( xfs_mount_t *mp, - xfs_inode_t *ip) + xfs_inode_t *ip, + int flags) { xfs_trans_t *tp; int error; @@ -1222,6 +1224,7 @@ xfs_inactive_free_eofblocks( xfs_filblks_t map_len; int nimaps; xfs_bmbt_irec_t imap; + int use_iolock = (flags & XFS_FREE_EOF_LOCK); /* * Figure out if there are any blocks beyond the end @@ -1262,11 +1265,13 @@ xfs_inactive_free_eofblocks( * cache and we can't * do that within a transaction. 
*/ - xfs_ilock(ip, XFS_IOLOCK_EXCL); + if (use_iolock) + xfs_ilock(ip, XFS_IOLOCK_EXCL); error = xfs_itruncate_start(ip, XFS_ITRUNC_DEFINITE, ip->i_size); if (error) { - xfs_iunlock(ip, XFS_IOLOCK_EXCL); + if (use_iolock) + xfs_iunlock(ip, XFS_IOLOCK_EXCL); return error; } @@ -1303,7 +1308,8 @@ xfs_inactive_free_eofblocks( error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES); } - xfs_iunlock(ip, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL); + xfs_iunlock(ip, (use_iolock ? (XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL) + : XFS_ILOCK_EXCL)); } return error; } @@ -1579,7 +1585,8 @@ xfs_release( (ip->i_df.if_flags & XFS_IFEXTENTS)) && (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) { - if ((error = xfs_inactive_free_eofblocks(mp, ip))) + error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK); + if (error) return error; /* Update linux inode block count after free above */ vn_to_inode(vp)->i_blocks = XFS_FSB_TO_BB(mp, @@ -1660,7 +1667,8 @@ xfs_inactive( (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) || (ip->i_delayed_blks != 0)))) { - if ((error = xfs_inactive_free_eofblocks(mp, ip))) + error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK); + if (error) return VN_INACTIVE_CACHE; /* Update linux inode block count after free above */ vn_to_inode(vp)->i_blocks = XFS_FSB_TO_BB(mp, Index: 2.6.x-xfs-new/fs/xfs/xfs_rw.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_rw.h 2007-04-19 16:55:44.373582872 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_rw.h 2007-04-19 16:56:33.839157222 +1000 @@ -72,6 +72,12 @@ xfs_fsb_to_db_io(struct xfs_iocore *io, } /* + * Flags for xfs_free_eofblocks + */ +#define XFS_FREE_EOF_LOCK (1<<0) +#define XFS_FREE_EOF_NOLOCK (1<<1) + +/* * Prototypes for functions in xfs_rw.c. 
*/ extern int xfs_write_clear_setuid(struct xfs_inode *ip); @@ -91,10 +97,12 @@ extern void xfs_ioerror_alert(char *func extern int xfs_rwlock(bhv_desc_t *bdp, bhv_vrwlock_t write_lock); extern void xfs_rwunlock(bhv_desc_t *bdp, bhv_vrwlock_t write_lock); extern int xfs_setattr(bhv_desc_t *, bhv_vattr_t *vap, int flags, - cred_t *credp); + cred_t *credp); extern int xfs_change_file_space(bhv_desc_t *bdp, int cmd, xfs_flock64_t *bf, - xfs_off_t offset, cred_t *credp, int flags); + xfs_off_t offset, cred_t *credp, int flags); extern int xfs_set_dmattrs(bhv_desc_t *bdp, u_int evmask, u_int16_t state, - cred_t *credp); + cred_t *credp); +extern int xfs_free_eofblocks(struct xfs_mount *mp, struct xfs_inode *ip, + int flags); #endif /* __XFS_RW_H__ */ From owner-xfs@oss.sgi.com Thu Apr 19 00:25:16 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:25:18 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7PCfB006340 for ; Thu, 19 Apr 2007 00:25:14 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22351; Thu, 19 Apr 2007 17:25:07 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7P6Af70618500; Thu, 19 Apr 2007 17:25:06 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7P51e58858094; Thu, 19 Apr 2007 17:25:05 +1000 (AEST) Date: Thu, 19 Apr 2007 17:25:05 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: allocate bmapi args Message-ID: <20070419072505.GS48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11120 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com 
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Save some stack space (64 bytes on 32bit systems, 80 bytes on 64bit systems) in a critical path by allocating the xfs_bmalloca_t structure rather than putting it on the stack. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_bmap.c | 62 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-04-19 13:26:49.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-04-19 13:47:03.161553644 +1000 @@ -4710,7 +4710,7 @@ xfs_bmapi( xfs_fsblock_t abno; /* allocated block number */ xfs_extlen_t alen; /* allocated extent length */ xfs_fileoff_t aoff; /* allocated file offset */ - xfs_bmalloca_t bma; /* args for xfs_bmap_alloc */ + xfs_bmalloca_t *bma; /* args for xfs_bmap_alloc */ xfs_btree_cur_t *cur; /* bmap btree cursor */ xfs_fileoff_t end; /* end of mapped file region */ int eof; /* we've hit the end of extents */ @@ -4763,6 +4763,9 @@ xfs_bmapi( } if (XFS_FORCED_SHUTDOWN(mp)) return XFS_ERROR(EIO); + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); + if (!bma) + return XFS_ERROR(ENOMEM); rt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip); ifp = XFS_IFORK_PTR(ip, whichfork); ASSERT(ifp->if_ext_max == @@ -4816,7 +4819,7 @@ xfs_bmapi( n = 0; end = bno + len; obno = bno; - bma.ip = NULL; + bma->ip = NULL; if (delta) { delta->xed_startoff = NULLFILEOFF; delta->xed_blockcount = 0; @@ -4960,34 +4963,34 @@ xfs_bmapi( * If first time, allocate and fill in * once-only bma fields. 
*/ - if (bma.ip == NULL) { - bma.tp = tp; - bma.ip = ip; - bma.prevp = &prev; - bma.gotp = &got; - bma.total = total; - bma.userdata = 0; + if (bma->ip == NULL) { + bma->tp = tp; + bma->ip = ip; + bma->prevp = &prev; + bma->gotp = &got; + bma->total = total; + bma->userdata = 0; } /* Indicate if this is the first user data * in the file, or just any user data. */ if (!(flags & XFS_BMAPI_METADATA)) { - bma.userdata = (aoff == 0) ? + bma->userdata = (aoff == 0) ? XFS_ALLOC_INITIAL_USER_DATA : XFS_ALLOC_USERDATA; } /* * Fill in changeable bma fields. */ - bma.eof = eof; - bma.firstblock = *firstblock; - bma.alen = alen; - bma.off = aoff; - bma.conv = !!(flags & XFS_BMAPI_CONVERT); - bma.wasdel = wasdelay; - bma.minlen = minlen; - bma.low = flist->xbf_low; - bma.minleft = minleft; + bma->eof = eof; + bma->firstblock = *firstblock; + bma->alen = alen; + bma->off = aoff; + bma->conv = !!(flags & XFS_BMAPI_CONVERT); + bma->wasdel = wasdelay; + bma->minlen = minlen; + bma->low = flist->xbf_low; + bma->minleft = minleft; /* * Only want to do the alignment at the * eof if it is userdata and allocation length @@ -4997,30 +5000,30 @@ xfs_bmapi( (!(flags & XFS_BMAPI_METADATA)) && (whichfork == XFS_DATA_FORK)) { if ((error = xfs_bmap_isaeof(ip, aoff, - whichfork, &bma.aeof))) + whichfork, &bma->aeof))) goto error0; } else - bma.aeof = 0; + bma->aeof = 0; /* * Call allocator. */ - if ((error = xfs_bmap_alloc(&bma))) + if ((error = xfs_bmap_alloc(bma))) goto error0; /* * Copy out result fields. 
*/ - abno = bma.rval; - if ((flist->xbf_low = bma.low)) + abno = bma->rval; + if ((flist->xbf_low = bma->low)) minleft = 0; - alen = bma.alen; - aoff = bma.off; + alen = bma->alen; + aoff = bma->off; ASSERT(*firstblock == NULLFSBLOCK || XFS_FSB_TO_AGNO(mp, *firstblock) == - XFS_FSB_TO_AGNO(mp, bma.firstblock) || + XFS_FSB_TO_AGNO(mp, bma->firstblock) || (flist->xbf_low && XFS_FSB_TO_AGNO(mp, *firstblock) < - XFS_FSB_TO_AGNO(mp, bma.firstblock))); - *firstblock = bma.firstblock; + XFS_FSB_TO_AGNO(mp, bma->firstblock))); + *firstblock = bma->firstblock; if (cur) cur->bc_private.b.firstblock = *firstblock; @@ -5290,6 +5293,7 @@ error0: if (!error) xfs_bmap_validate_ret(orig_bno, orig_len, orig_flags, orig_mval, orig_nmap, *nmap); + kmem_free(bma, sizeof(xfs_bmalloca_t)); return error; } From owner-xfs@oss.sgi.com Thu Apr 19 00:32:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:32:34 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7WPfB008491 for ; Thu, 19 Apr 2007 00:32:27 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22571; Thu, 19 Apr 2007 17:32:18 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7WHAf70655162; Thu, 19 Apr 2007 17:32:18 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7WGYQ68515725; Thu, 19 Apr 2007 17:32:16 +1000 (AEST) Date: Thu, 19 Apr 2007 17:32:16 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: allocate alloc args Message-ID: <20070419073216.GT48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11121 X-ecartis-version: Ecartis 
v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Save some stack space in the critical allocator paths by allocating the xfs_alloc_arg_t structures (104 bytes on 64bit, 88 bytes on 32bit systems) rather than placing them on the stack. There can be more than one of these structures on the stack through the critical allocation path (e.g. xfs_bmap_btalloc() and xfs_alloc_fix_freelist()) so there are significant savings to be had here... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_alloc.c | 81 +++++++------ fs/xfs/xfs_bmap.c | 276 ++++++++++++++++++++++++---------------------- fs/xfs/xfs_bmap_btree.c | 131 +++++++++++---------- fs/xfs/xfs_ialloc.c | 163 ++++++++++++++------------- fs/xfs/xfs_ialloc_btree.c | 120 ++++++++++---------- 5 files changed, 412 insertions(+), 359 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c 2007-03-30 11:32:07.613682556 +1000 @@ -1826,7 +1826,7 @@ xfs_alloc_fix_freelist( xfs_mount_t *mp; /* file system mount point structure */ xfs_extlen_t need; /* total blocks needed in freelist */ xfs_perag_t *pag; /* per-ag information structure */ - xfs_alloc_arg_t targs; /* local allocation arguments */ + xfs_alloc_arg_t *targs; /* local allocation arguments */ xfs_trans_t *tp; /* transaction pointer */ mp = args->mp; @@ -1934,54 +1934,60 @@ xfs_alloc_fix_freelist( /* * Initialize the args structure. 
*/ - targs.tp = tp; - targs.mp = mp; - targs.agbp = agbp; - targs.agno = args->agno; - targs.mod = targs.minleft = targs.wasdel = targs.userdata = - targs.minalignslop = 0; - targs.alignment = targs.minlen = targs.prod = targs.isfl = 1; - targs.type = XFS_ALLOCTYPE_THIS_AG; - targs.pag = pag; - if ((error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp))) - return error; + targs = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!targs) + return XFS_ERROR(ENOMEM); + targs->tp = tp; + targs->mp = mp; + targs->agbp = agbp; + targs->agno = args->agno; + targs->mod = targs->minleft = targs->wasdel = targs->userdata = + targs->minalignslop = 0; + targs->alignment = targs->minlen = targs->prod = targs->isfl = 1; + targs->type = XFS_ALLOCTYPE_THIS_AG; + targs->pag = pag; + if ((error = xfs_alloc_read_agfl(mp, tp, targs->agno, &agflbp))) + goto out_error; /* * Make the freelist longer if it's too short. */ while (be32_to_cpu(agf->agf_flcount) < need) { - targs.agbno = 0; - targs.maxlen = need - be32_to_cpu(agf->agf_flcount); + targs->agbno = 0; + targs->maxlen = need - be32_to_cpu(agf->agf_flcount); /* * Allocate as many blocks as possible at once. */ - if ((error = xfs_alloc_ag_vextent(&targs))) { + if ((error = xfs_alloc_ag_vextent(targs))) { xfs_trans_brelse(tp, agflbp); - return error; + goto out_error; } /* * Stop if we run out. Won't happen if callers are obeying * the restrictions correctly. Can happen for free calls * on a completely full ag. */ - if (targs.agbno == NULLAGBLOCK) { + if (targs->agbno == NULLAGBLOCK) { if (flags & XFS_ALLOC_FLAG_FREEING) break; xfs_trans_brelse(tp, agflbp); args->agbp = NULL; - return 0; + error = 0; + goto out_error; } /* * Put each allocated block on the list. 
*/ - for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { + for (bno = targs->agbno; bno < targs->agbno + targs->len; bno++) { if ((error = xfs_alloc_put_freelist(tp, agbp, agflbp, bno, 0))) - return error; + goto out_error; } } xfs_trans_brelse(tp, agflbp); args->agbp = agbp; - return 0; +out_error: + kmem_free(targs, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -2480,28 +2486,31 @@ xfs_free_extent( xfs_fsblock_t bno, /* starting block number of extent */ xfs_extlen_t len) /* length of extent */ { - xfs_alloc_arg_t args; + xfs_alloc_arg_t *args; int error; ASSERT(len != 0); - memset(&args, 0, sizeof(xfs_alloc_arg_t)); - args.tp = tp; - args.mp = tp->t_mountp; - args.agno = XFS_FSB_TO_AGNO(args.mp, bno); - ASSERT(args.agno < args.mp->m_sb.sb_agcount); - args.agbno = XFS_FSB_TO_AGBNO(args.mp, bno); - down_read(&args.mp->m_peraglock); - args.pag = &args.mp->m_perag[args.agno]; - if ((error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING))) + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = tp->t_mountp; + args->agno = XFS_FSB_TO_AGNO(args->mp, bno); + ASSERT(args->agno < args->mp->m_sb.sb_agcount); + args->agbno = XFS_FSB_TO_AGBNO(args->mp, bno); + down_read(&args->mp->m_peraglock); + args->pag = &args->mp->m_perag[args->agno]; + if ((error = xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING))) goto error0; #ifdef DEBUG - ASSERT(args.agbp != NULL); - ASSERT((args.agbno + len) <= - be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length)); + ASSERT(args->agbp != NULL); + ASSERT((args->agbno + len) <= + be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length)); #endif - error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0); + error = xfs_free_ag_extent(tp, args->agbp, args->agno, args->agbno, len, 0); error0: - up_read(&args.mp->m_peraglock); + up_read(&args->mp->m_peraglock); + kmem_free(args, sizeof(xfs_alloc_arg_t)); return error; } Index: 
2.6.x-xfs-new/fs/xfs/xfs_bmap.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap.c 2007-03-30 11:33:25.711487339 +1000 @@ -2701,7 +2701,7 @@ xfs_bmap_btalloc( xfs_agnumber_t ag; xfs_agnumber_t fb_agno; /* ag number of ap->firstblock */ xfs_agnumber_t startag; - xfs_alloc_arg_t args; + xfs_alloc_arg_t *args; xfs_extlen_t blen; xfs_extlen_t delta; xfs_extlen_t longest; @@ -2712,8 +2712,11 @@ xfs_bmap_btalloc( int isaligned; int notinit; int tryagain; - int error; + int error = 0; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); mp = ap->ip->i_mount; align = (ap->userdata && ap->ip->i_d.di_extsize && (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE)) ? @@ -2746,29 +2749,29 @@ xfs_bmap_btalloc( * Normal allocation, done through xfs_alloc_vextent. */ tryagain = isaligned = 0; - args.tp = ap->tp; - args.mp = mp; - args.fsbno = ap->rval; - args.maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks); - args.firstblock = ap->firstblock; + args->tp = ap->tp; + args->mp = mp; + args->fsbno = ap->rval; + args->maxlen = MIN(ap->alen, mp->m_sb.sb_agblocks); + args->firstblock = ap->firstblock; blen = 0; if (nullfb) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.total = ap->total; + args->type = XFS_ALLOCTYPE_START_BNO; + args->total = ap->total; /* * Find the longest available space. * We're going to try for the whole allocation at once. */ - startag = ag = XFS_FSB_TO_AGNO(mp, args.fsbno); + startag = ag = XFS_FSB_TO_AGNO(mp, args->fsbno); notinit = 0; down_read(&mp->m_peraglock); while (blen < ap->alen) { pag = &mp->m_perag[ag]; if (!pag->pagf_init && - (error = xfs_alloc_pagf_init(mp, args.tp, + (error = xfs_alloc_pagf_init(mp, args->tp, ag, XFS_ALLOC_FLAG_TRYLOCK))) { up_read(&mp->m_peraglock); - return error; + goto out_error; } /* * See xfs_alloc_fix_freelist... 
@@ -2796,39 +2799,39 @@ xfs_bmap_btalloc( * possible that there is space for this request. */ if (notinit || blen < ap->minlen) - args.minlen = ap->minlen; + args->minlen = ap->minlen; /* * If the best seen length is less than the request * length, use the best as the minimum. */ else if (blen < ap->alen) - args.minlen = blen; + args->minlen = blen; /* * Otherwise we've seen an extent as big as alen, * use that as the minimum. */ else - args.minlen = ap->alen; + args->minlen = ap->alen; } else if (ap->low) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.total = args.minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_START_BNO; + args->total = args->minlen = ap->minlen; } else { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.total = ap->total; - args.minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->total = ap->total; + args->minlen = ap->minlen; } if (unlikely(ap->userdata && ap->ip->i_d.di_extsize && (ap->ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE))) { - args.prod = ap->ip->i_d.di_extsize; - if ((args.mod = (xfs_extlen_t)do_mod(ap->off, args.prod))) - args.mod = (xfs_extlen_t)(args.prod - args.mod); + args->prod = ap->ip->i_d.di_extsize; + if ((args->mod = (xfs_extlen_t)do_mod(ap->off, args->prod))) + args->mod = (xfs_extlen_t)(args->prod - args->mod); } else if (mp->m_sb.sb_blocksize >= NBPP) { - args.prod = 1; - args.mod = 0; + args->prod = 1; + args->mod = 0; } else { - args.prod = NBPP >> mp->m_sb.sb_blocklog; - if ((args.mod = (xfs_extlen_t)(do_mod(ap->off, args.prod)))) - args.mod = (xfs_extlen_t)(args.prod - args.mod); + args->prod = NBPP >> mp->m_sb.sb_blocklog; + if ((args->mod = (xfs_extlen_t)(do_mod(ap->off, args->prod)))) + args->mod = (xfs_extlen_t)(args->prod - args->mod); } /* * If we are not low on available data blocks, and the @@ -2841,25 +2844,25 @@ xfs_bmap_btalloc( */ if (!ap->low && ap->aeof) { if (!ap->off) { - args.alignment = mp->m_dalign; - atype = args.type; + args->alignment = mp->m_dalign; + atype = args->type; isaligned = 1; 
/* * Adjust for alignment */ - if (blen > args.alignment && blen <= ap->alen) - args.minlen = blen - args.alignment; - args.minalignslop = 0; + if (blen > args->alignment && blen <= ap->alen) + args->minlen = blen - args->alignment; + args->minalignslop = 0; } else { /* * First try an exact bno allocation. * If it fails then do a near or start bno * allocation with alignment turned on. */ - atype = args.type; + atype = args->type; tryagain = 1; - args.type = XFS_ALLOCTYPE_THIS_BNO; - args.alignment = 1; + args->type = XFS_ALLOCTYPE_THIS_BNO; + args->alignment = 1; /* * Compute the minlen+alignment for the * next case. Set slop so that the value @@ -2869,75 +2872,75 @@ xfs_bmap_btalloc( if (blen > mp->m_dalign && blen <= ap->alen) nextminlen = blen - mp->m_dalign; else - nextminlen = args.minlen; - if (nextminlen + mp->m_dalign > args.minlen + 1) - args.minalignslop = + nextminlen = args->minlen; + if (nextminlen + mp->m_dalign > args->minlen + 1) + args->minalignslop = nextminlen + mp->m_dalign - - args.minlen - 1; + args->minlen - 1; else - args.minalignslop = 0; + args->minalignslop = 0; } } else { - args.alignment = 1; - args.minalignslop = 0; + args->alignment = 1; + args->minalignslop = 0; } - args.minleft = ap->minleft; - args.wasdel = ap->wasdel; - args.isfl = 0; - args.userdata = ap->userdata; - if ((error = xfs_alloc_vextent(&args))) - return error; - if (tryagain && args.fsbno == NULLFSBLOCK) { + args->minleft = ap->minleft; + args->wasdel = ap->wasdel; + args->isfl = 0; + args->userdata = ap->userdata; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + if (tryagain && args->fsbno == NULLFSBLOCK) { /* * Exact allocation failed. Now try with alignment * turned on. 
*/ - args.type = atype; - args.fsbno = ap->rval; - args.alignment = mp->m_dalign; - args.minlen = nextminlen; - args.minalignslop = 0; + args->type = atype; + args->fsbno = ap->rval; + args->alignment = mp->m_dalign; + args->minlen = nextminlen; + args->minalignslop = 0; isaligned = 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } - if (isaligned && args.fsbno == NULLFSBLOCK) { + if (isaligned && args->fsbno == NULLFSBLOCK) { /* * allocation failed, so turn off alignment and * try again. */ - args.type = atype; - args.fsbno = ap->rval; - args.alignment = 0; - if ((error = xfs_alloc_vextent(&args))) - return error; - } - if (args.fsbno == NULLFSBLOCK && nullfb && - args.minlen > ap->minlen) { - args.minlen = ap->minlen; - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = ap->rval; - if ((error = xfs_alloc_vextent(&args))) - return error; - } - if (args.fsbno == NULLFSBLOCK && nullfb) { - args.fsbno = 0; - args.type = XFS_ALLOCTYPE_FIRST_AG; - args.total = ap->minlen; - args.minleft = 0; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->type = atype; + args->fsbno = ap->rval; + args->alignment = 0; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + } + if (args->fsbno == NULLFSBLOCK && nullfb && + args->minlen > ap->minlen) { + args->minlen = ap->minlen; + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = ap->rval; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + } + if (args->fsbno == NULLFSBLOCK && nullfb) { + args->fsbno = 0; + args->type = XFS_ALLOCTYPE_FIRST_AG; + args->total = ap->minlen; + args->minleft = 0; + if ((error = xfs_alloc_vextent(args))) + goto out_error; ap->low = 1; } - if (args.fsbno != NULLFSBLOCK) { - ap->firstblock = ap->rval = args.fsbno; - ASSERT(nullfb || fb_agno == args.agno || - (ap->low && fb_agno < args.agno)); - ap->alen = args.len; - ap->ip->i_d.di_nblocks += args.len; + if (args->fsbno != NULLFSBLOCK) { + ap->firstblock = ap->rval =
args->fsbno; + ASSERT(nullfb || fb_agno == args->agno || + (ap->low && fb_agno < args->agno)); + ap->alen = args->len; + ap->ip->i_d.di_nblocks += args->len; xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE); if (ap->wasdel) - ap->ip->i_delayed_blks -= args.len; + ap->ip->i_delayed_blks -= args->len; /* * Adjust the disk quota also. This was reserved * earlier. @@ -2945,12 +2948,14 @@ xfs_bmap_btalloc( XFS_TRANS_MOD_DQUOT_BYINO(mp, ap->tp, ap->ip, ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT, - (long) args.len); + (long) args->len); } else { ap->rval = NULLFSBLOCK; ap->alen = 0; } - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -3395,7 +3400,7 @@ xfs_bmap_extents_to_btree( { xfs_bmbt_block_t *ablock; /* allocated (child) bt block */ xfs_buf_t *abp; /* buffer for ablock */ - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_bmbt_rec_t *arp; /* child record pointer */ xfs_bmbt_block_t *block; /* btree root block */ xfs_btree_cur_t *cur; /* bmap btree cursor */ @@ -3408,6 +3413,9 @@ xfs_bmap_extents_to_btree( xfs_extnum_t nextents; /* number of file extents */ xfs_bmbt_ptr_t *pp; /* root block address pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); ifp = XFS_IFORK_PTR(ip, whichfork); ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS); ASSERT(ifp->if_ext_max == @@ -3439,42 +3447,42 @@ xfs_bmap_extents_to_btree( * Convert to a btree with two levels, one record in root. 
*/ XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_BTREE); - args.tp = tp; - args.mp = mp; - args.firstblock = *firstblock; + args->tp = tp; + args->mp = mp; + args->firstblock = *firstblock; if (*firstblock == NULLFSBLOCK) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino); + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = XFS_INO_TO_FSB(mp, ip->i_ino); } else if (flist->xbf_low) { - args.type = XFS_ALLOCTYPE_START_BNO; - args.fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = *firstblock; } else { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->fsbno = *firstblock; } - args.minlen = args.maxlen = args.prod = 1; - args.total = args.minleft = args.alignment = args.mod = args.isfl = - args.minalignslop = 0; - args.wasdel = wasdel; + args->minlen = args->maxlen = args->prod = 1; + args->total = args->minleft = args->alignment = args->mod = args->isfl = + args->minalignslop = 0; + args->wasdel = wasdel; *logflagsp = 0; - if ((error = xfs_alloc_vextent(&args))) { + if ((error = xfs_alloc_vextent(args))) { xfs_iroot_realloc(ip, -1, whichfork); xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } /* * Allocation can't fail, the space was reserved. 
*/ - ASSERT(args.fsbno != NULLFSBLOCK); + ASSERT(args->fsbno != NULLFSBLOCK); ASSERT(*firstblock == NULLFSBLOCK || - args.agno == XFS_FSB_TO_AGNO(mp, *firstblock) || + args->agno == XFS_FSB_TO_AGNO(mp, *firstblock) || (flist->xbf_low && - args.agno > XFS_FSB_TO_AGNO(mp, *firstblock))); - *firstblock = cur->bc_private.b.firstblock = args.fsbno; + args->agno > XFS_FSB_TO_AGNO(mp, *firstblock))); + *firstblock = cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; ip->i_d.di_nblocks++; XFS_TRANS_MOD_DQUOT_BYINO(mp, tp, ip, XFS_TRANS_DQ_BCOUNT, 1L); - abp = xfs_btree_get_bufl(mp, tp, args.fsbno, 0); + abp = xfs_btree_get_bufl(mp, tp, args->fsbno, 0); /* * Fill in the child block. */ @@ -3502,7 +3510,7 @@ xfs_bmap_extents_to_btree( arp = XFS_BMAP_REC_IADDR(ablock, 1, cur); kp->br_startoff = cpu_to_be64(xfs_bmbt_disk_get_startoff(arp)); pp = XFS_BMAP_PTR_IADDR(block, 1, cur); - *pp = cpu_to_be64(args.fsbno); + *pp = cpu_to_be64(args->fsbno); /* * Do all this logging at the end so that * the root is at the right level. @@ -3512,7 +3520,9 @@ xfs_bmap_extents_to_btree( ASSERT(*curp == NULL); *curp = cur; *logflagsp = XFS_ILOG_CORE | XFS_ILOG_FBROOT(whichfork); - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -3572,13 +3582,16 @@ xfs_bmap_local_to_extents( flags = 0; error = 0; if (ifp->if_bytes) { - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_buf_t *bp; /* buffer for extent block */ xfs_bmbt_rec_t *ep; /* extent record pointer */ - args.tp = tp; - args.mp = ip->i_mount; - args.firstblock = *firstblock; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = ip->i_mount; + args->firstblock = *firstblock; ASSERT((ifp->if_flags & (XFS_IFINLINE|XFS_IFEXTENTS|XFS_IFEXTIREC)) == XFS_IFINLINE); /* @@ -3586,39 +3599,42 @@ xfs_bmap_local_to_extents( * file currently fits in an inode. 
*/ if (*firstblock == NULLFSBLOCK) { - args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino); - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = XFS_INO_TO_FSB(args->mp, ip->i_ino); + args->type = XFS_ALLOCTYPE_START_BNO; } else { - args.fsbno = *firstblock; - args.type = XFS_ALLOCTYPE_NEAR_BNO; + args->fsbno = *firstblock; + args->type = XFS_ALLOCTYPE_NEAR_BNO; } - args.total = total; - args.mod = args.minleft = args.alignment = args.wasdel = - args.isfl = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - if ((error = xfs_alloc_vextent(&args))) + args->total = total; + args->mod = args->minleft = args->alignment = args->wasdel = + args->isfl = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + if ((error = xfs_alloc_vextent(args))) { + kmem_free(args, sizeof(xfs_alloc_arg_t)); goto done; + } /* * Can't fail, the space was reserved. */ - ASSERT(args.fsbno != NULLFSBLOCK); - ASSERT(args.len == 1); - *firstblock = args.fsbno; - bp = xfs_btree_get_bufl(args.mp, tp, args.fsbno, 0); + ASSERT(args->fsbno != NULLFSBLOCK); + ASSERT(args->len == 1); + *firstblock = args->fsbno; + bp = xfs_btree_get_bufl(args->mp, tp, args->fsbno, 0); memcpy((char *)XFS_BUF_PTR(bp), ifp->if_u1.if_data, ifp->if_bytes); xfs_trans_log_buf(tp, bp, 0, ifp->if_bytes - 1); - xfs_bmap_forkoff_reset(args.mp, ip, whichfork); + xfs_bmap_forkoff_reset(args->mp, ip, whichfork); xfs_idata_realloc(ip, -ifp->if_bytes, whichfork); xfs_iext_add(ifp, 0, 1); ep = xfs_iext_get_ext(ifp, 0); - xfs_bmbt_set_allf(ep, 0, args.fsbno, 1, XFS_EXT_NORM); + xfs_bmbt_set_allf(ep, 0, args->fsbno, 1, XFS_EXT_NORM); xfs_bmap_trace_post_update(fname, "new", ip, 0, whichfork); XFS_IFORK_NEXT_SET(ip, whichfork, 1); ip->i_d.di_nblocks = 1; - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, tp, ip, + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, tp, ip, XFS_TRANS_DQ_BCOUNT, 1L); flags |= XFS_ILOG_FEXT(whichfork); + kmem_free(args, sizeof(xfs_alloc_arg_t)); } else { ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0); 
xfs_bmap_forkoff_reset(ip->i_mount, ip, whichfork); Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap_btree.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c 2007-03-30 11:32:42.257159915 +1000 @@ -1490,7 +1490,7 @@ xfs_bmbt_split( xfs_btree_cur_t **curp, int *stat) /* success/failure */ { - xfs_alloc_arg_t args; /* block allocation args */ + xfs_alloc_arg_t *args; /* block allocation args */ int error; /* error return value */ #ifdef XFS_BMBT_TRACE static char fname[] = "xfs_bmbt_split"; @@ -1510,50 +1510,54 @@ xfs_bmbt_split( xfs_buf_t *rrbp; /* right-right buffer pointer */ xfs_bmbt_rec_t *rrp; /* right record pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); XFS_BMBT_TRACE_CURSOR(cur, ENTRY); XFS_BMBT_TRACE_ARGIFK(cur, level, *bnop, *startoff); - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; lbp = cur->bc_bufs[level]; - lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp)); + lbno = XFS_DADDR_TO_FSB(args->mp, XFS_BUF_ADDR(lbp)); left = XFS_BUF_TO_BMBT_BLOCK(lbp); - args.fsbno = cur->bc_private.b.firstblock; - args.firstblock = args.fsbno; - if (args.fsbno == NULLFSBLOCK) { - args.fsbno = lbno; - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = cur->bc_private.b.firstblock; + args->firstblock = args->fsbno; + if (args->fsbno == NULLFSBLOCK) { + args->fsbno = lbno; + args->type = XFS_ALLOCTYPE_START_BNO; } else - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.mod = args.minleft = args.alignment = args.total = args.isfl = - args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; - if (!args.wasdel && xfs_trans_get_block_res(args.tp) == 0) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->mod = args->minleft = args->alignment = args->total 
= args->isfl = + args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; + if (!args->wasdel && xfs_trans_get_block_res(args->tp) == 0) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); + kmem_free(args, sizeof(xfs_alloc_arg_t)); return XFS_ERROR(ENOSPC); } - if ((error = xfs_alloc_vextent(&args))) { + if ((error = xfs_alloc_vextent(args))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - cur->bc_private.b.firstblock = args.fsbno; + ASSERT(args->len == 1); + cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; cur->bc_private.b.ip->i_d.di_nblocks++; - xfs_trans_log_inode(args.tp, cur->bc_private.b.ip, XFS_ILOG_CORE); - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, args.tp, cur->bc_private.b.ip, + xfs_trans_log_inode(args->tp, cur->bc_private.b.ip, XFS_ILOG_CORE); + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, args->tp, cur->bc_private.b.ip, XFS_TRANS_DQ_BCOUNT, 1L); - rbp = xfs_btree_get_bufl(args.mp, args.tp, args.fsbno, 0); + rbp = xfs_btree_get_bufl(args->mp, args->tp, args->fsbno, 0); right = XFS_BUF_TO_BMBT_BLOCK(rbp); #ifdef DEBUG if ((error = xfs_btree_check_lblock(cur, left, level, rbp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif right->bb_magic = cpu_to_be32(XFS_BMAP_MAGIC); @@ -1572,7 +1576,7 @@ xfs_bmbt_split( for (i = 0; i < be16_to_cpu(right->bb_numrecs); i++) { if ((error = xfs_btree_check_lptr_disk(cur, lpp[i], level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } } #endif @@ -1590,23 +1594,23 @@ xfs_bmbt_split( } be16_add(&left->bb_numrecs, -(be16_to_cpu(right->bb_numrecs))); right->bb_rightsib = left->bb_rightsib; - left->bb_rightsib = cpu_to_be64(args.fsbno); + left->bb_rightsib = 
cpu_to_be64(args->fsbno); right->bb_leftsib = cpu_to_be64(lbno); xfs_bmbt_log_block(cur, rbp, XFS_BB_ALL_BITS); xfs_bmbt_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); if (be64_to_cpu(right->bb_rightsib) != NULLDFSBNO) { - if ((error = xfs_btree_read_bufl(args.mp, args.tp, + if ((error = xfs_btree_read_bufl(args->mp, args->tp, be64_to_cpu(right->bb_rightsib), 0, &rrbp, XFS_BMAP_BTREE_REF))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } rrblock = XFS_BUF_TO_BMBT_BLOCK(rrbp); if ((error = xfs_btree_check_lblock(cur, rrblock, level, rrbp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - rrblock->bb_leftsib = cpu_to_be64(args.fsbno); + rrblock->bb_leftsib = cpu_to_be64(args->fsbno); xfs_bmbt_log_block(cur, rrbp, XFS_BB_LEFTSIB); } if (cur->bc_ptrs[level] > be16_to_cpu(left->bb_numrecs) + 1) { @@ -1616,14 +1620,16 @@ xfs_bmbt_split( if (level + 1 < cur->bc_nlevels) { if ((error = xfs_btree_dup_cursor(cur, curp))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } (*curp)->bc_ptrs[level + 1]++; } - *bnop = args.fsbno; + *bnop = args->fsbno; XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } @@ -2238,7 +2244,7 @@ xfs_bmbt_newroot( int *logflags, /* logging flags for inode */ int *stat) /* return status - 0 fail */ { - xfs_alloc_arg_t args; /* allocation arguments */ + xfs_alloc_arg_t *args; /* allocation arguments */ xfs_bmbt_block_t *block; /* bmap btree block */ xfs_buf_t *bp; /* buffer for block */ xfs_bmbt_block_t *cblock; /* child btree block */ @@ -2255,48 +2261,51 @@ xfs_bmbt_newroot( int level; /* btree level */ xfs_bmbt_ptr_t *pp; /* pointer to bmap block addr */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); XFS_BMBT_TRACE_CURSOR(cur, ENTRY); level = cur->bc_nlevels - 1; block = xfs_bmbt_get_block(cur, level, &bp); /* * Copy the root into a real 
block. */ - args.mp = cur->bc_mp; + args->mp = cur->bc_mp; pp = XFS_BMAP_PTR_IADDR(block, 1, cur); - args.tp = cur->bc_tp; - args.fsbno = cur->bc_private.b.firstblock; - args.mod = args.minleft = args.alignment = args.total = args.isfl = - args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; - args.firstblock = args.fsbno; - if (args.fsbno == NULLFSBLOCK) { + args->tp = cur->bc_tp; + args->fsbno = cur->bc_private.b.firstblock; + args->mod = args->minleft = args->alignment = args->total = args->isfl = + args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL; + args->firstblock = args->fsbno; + if (args->fsbno == NULLFSBLOCK) { #ifdef DEBUG if ((error = xfs_btree_check_lptr_disk(cur, *pp, level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif - args.fsbno = be64_to_cpu(*pp); - args.type = XFS_ALLOCTYPE_START_BNO; + args->fsbno = be64_to_cpu(*pp); + args->type = XFS_ALLOCTYPE_START_BNO; } else - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { XFS_BMBT_TRACE_CURSOR(cur, EXIT); *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - cur->bc_private.b.firstblock = args.fsbno; + ASSERT(args->len == 1); + cur->bc_private.b.firstblock = args->fsbno; cur->bc_private.b.allocated++; cur->bc_private.b.ip->i_d.di_nblocks++; - XFS_TRANS_MOD_DQUOT_BYINO(args.mp, args.tp, cur->bc_private.b.ip, + XFS_TRANS_MOD_DQUOT_BYINO(args->mp, args->tp, cur->bc_private.b.ip, XFS_TRANS_DQ_BCOUNT, 1L); - bp = xfs_btree_get_bufl(args.mp, cur->bc_tp, args.fsbno, 0); + bp = xfs_btree_get_bufl(args->mp, 
cur->bc_tp, args->fsbno, 0); cblock = XFS_BUF_TO_BMBT_BLOCK(bp); *cblock = *block; be16_add(&block->bb_level, 1); @@ -2311,18 +2320,18 @@ xfs_bmbt_newroot( for (i = 0; i < be16_to_cpu(cblock->bb_numrecs); i++) { if ((error = xfs_btree_check_lptr_disk(cur, pp[i], level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } } #endif memcpy(cpp, pp, be16_to_cpu(cblock->bb_numrecs) * sizeof(*pp)); #ifdef DEBUG - if ((error = xfs_btree_check_lptr(cur, args.fsbno, level))) { + if ((error = xfs_btree_check_lptr(cur, args->fsbno, level))) { XFS_BMBT_TRACE_CURSOR(cur, ERROR); - return error; + goto out_error; } #endif - *pp = cpu_to_be64(args.fsbno); + *pp = cpu_to_be64(args->fsbno); xfs_iroot_realloc(cur->bc_private.b.ip, 1 - be16_to_cpu(cblock->bb_numrecs), cur->bc_private.b.whichfork); xfs_btree_setbuf(cur, level, bp); @@ -2337,6 +2346,8 @@ xfs_bmbt_newroot( *logflags |= XFS_ILOG_CORE | XFS_ILOG_FBROOT(cur->bc_private.b.whichfork); *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2007-03-30 11:31:24.239345301 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2007-03-30 11:32:50.168127184 +1000 @@ -119,7 +119,7 @@ xfs_ialloc_ag_alloc( int *alloc) { xfs_agi_t *agi; /* allocation group header */ - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ int blks_per_cluster; /* fs blocks per inode cluster */ xfs_btree_cur_t *cur; /* inode btree cursor */ xfs_daddr_t d; /* disk addr of buffer */ @@ -138,18 +138,23 @@ xfs_ialloc_ag_alloc( int isaligned = 0; /* inode allocation at stripe unit */ /* boundary */ - args.tp = tp; - args.mp = tp->t_mountp; + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); + args->tp = tp; + args->mp = tp->t_mountp; /* * Locking will ensure that we
don't have two callers in here * at one time. */ - newlen = XFS_IALLOC_INODES(args.mp); - if (args.mp->m_maxicount && - args.mp->m_sb.sb_icount + newlen > args.mp->m_maxicount) + newlen = XFS_IALLOC_INODES(args->mp); + if (args->mp->m_maxicount && + args->mp->m_sb.sb_icount + newlen > args->mp->m_maxicount) { + kmem_free(args, sizeof(xfs_alloc_arg_t)); return XFS_ERROR(ENOSPC); - args.minlen = args.maxlen = XFS_IALLOC_BLOCKS(args.mp); + } + args->minlen = args->maxlen = XFS_IALLOC_BLOCKS(args->mp); /* * First try to allocate inodes contiguous with the last-allocated * chunk of inodes. If the filesystem is striped, this will fill @@ -157,27 +162,27 @@ xfs_ialloc_ag_alloc( */ agi = XFS_BUF_TO_AGI(agbp); newino = be32_to_cpu(agi->agi_newino); - args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) + - XFS_IALLOC_BLOCKS(args.mp); + args->agbno = XFS_AGINO_TO_AGBNO(args->mp, newino) + + XFS_IALLOC_BLOCKS(args->mp); if (likely(newino != NULLAGINO && - (args.agbno < be32_to_cpu(agi->agi_length)))) { - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); - args.type = XFS_ALLOCTYPE_THIS_BNO; - args.mod = args.total = args.wasdel = args.isfl = - args.userdata = args.minalignslop = 0; - args.prod = 1; - args.alignment = 1; + (args->agbno < be32_to_cpu(agi->agi_length)))) { + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); + args->type = XFS_ALLOCTYPE_THIS_BNO; + args->mod = args->total = args->wasdel = args->isfl = + args->userdata = args->minalignslop = 0; + args->prod = 1; + args->alignment = 1; /* * Allow space for the inode btree to split. 
*/ - args.minleft = XFS_IN_MAXLEVELS(args.mp) - 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->minleft = XFS_IN_MAXLEVELS(args->mp) - 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } else - args.fsbno = NULLFSBLOCK; + args->fsbno = NULLFSBLOCK; - if (unlikely(args.fsbno == NULLFSBLOCK)) { + if (unlikely(args->fsbno == NULLFSBLOCK)) { /* * Set the alignment for the allocation. * If stripe alignment is turned on then align at stripe unit @@ -187,82 +192,82 @@ xfs_ialloc_ag_alloc( * pieces, so don't need alignment anyway. */ isaligned = 0; - if (args.mp->m_sinoalign) { - ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN)); - args.alignment = args.mp->m_dalign; + if (args->mp->m_sinoalign) { + ASSERT(!(args->mp->m_flags & XFS_MOUNT_NOALIGN)); + args->alignment = args->mp->m_dalign; isaligned = 1; - } else if (XFS_SB_VERSION_HASALIGN(&args.mp->m_sb) && - args.mp->m_sb.sb_inoalignmt >= - XFS_B_TO_FSBT(args.mp, - XFS_INODE_CLUSTER_SIZE(args.mp))) - args.alignment = args.mp->m_sb.sb_inoalignmt; + } else if (XFS_SB_VERSION_HASALIGN(&args->mp->m_sb) && + args->mp->m_sb.sb_inoalignmt >= + XFS_B_TO_FSBT(args->mp, + XFS_INODE_CLUSTER_SIZE(args->mp))) + args->alignment = args->mp->m_sb.sb_inoalignmt; else - args.alignment = 1; + args->alignment = 1; /* * Need to figure out where to allocate the inode blocks. * Ideally they should be spaced out through the a.g. * For now, just allocate blocks up front. */ - args.agbno = be32_to_cpu(agi->agi_root); - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); + args->agbno = be32_to_cpu(agi->agi_root); + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); /* * Allocate a fixed-size extent of inodes. 
*/ - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.mod = args.total = args.wasdel = args.isfl = - args.userdata = args.minalignslop = 0; - args.prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->mod = args->total = args->wasdel = args->isfl = + args->userdata = args->minalignslop = 0; + args->prod = 1; /* * Allow space for the inode btree to split. */ - args.minleft = XFS_IN_MAXLEVELS(args.mp) - 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->minleft = XFS_IN_MAXLEVELS(args->mp) - 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } /* * If stripe alignment is turned on, then try again with cluster * alignment. */ - if (isaligned && args.fsbno == NULLFSBLOCK) { - args.type = XFS_ALLOCTYPE_NEAR_BNO; - args.agbno = be32_to_cpu(agi->agi_root); - args.fsbno = XFS_AGB_TO_FSB(args.mp, - be32_to_cpu(agi->agi_seqno), args.agbno); - if (XFS_SB_VERSION_HASALIGN(&args.mp->m_sb) && - args.mp->m_sb.sb_inoalignmt >= - XFS_B_TO_FSBT(args.mp, XFS_INODE_CLUSTER_SIZE(args.mp))) - args.alignment = args.mp->m_sb.sb_inoalignmt; + if (isaligned && args->fsbno == NULLFSBLOCK) { + args->type = XFS_ALLOCTYPE_NEAR_BNO; + args->agbno = be32_to_cpu(agi->agi_root); + args->fsbno = XFS_AGB_TO_FSB(args->mp, + be32_to_cpu(agi->agi_seqno), args->agbno); + if (XFS_SB_VERSION_HASALIGN(&args->mp->m_sb) && + args->mp->m_sb.sb_inoalignmt >= + XFS_B_TO_FSBT(args->mp, XFS_INODE_CLUSTER_SIZE(args->mp))) + args->alignment = args->mp->m_sb.sb_inoalignmt; else - args.alignment = 1; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->alignment = 1; + if ((error = xfs_alloc_vextent(args))) + goto out_error; } - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { *alloc = 0; - return 0; + goto out_error; } - ASSERT(args.len == args.minlen); + ASSERT(args->len == args->minlen); /* * Convert the results. 
*/ - newino = XFS_OFFBNO_TO_AGINO(args.mp, args.agbno, 0); + newino = XFS_OFFBNO_TO_AGINO(args->mp, args->agbno, 0); /* * Loop over the new block(s), filling in the inodes. * For small block sizes, manipulate the inodes in buffers * which are multiples of the blocks size. */ - if (args.mp->m_sb.sb_blocksize >= XFS_INODE_CLUSTER_SIZE(args.mp)) { + if (args->mp->m_sb.sb_blocksize >= XFS_INODE_CLUSTER_SIZE(args->mp)) { blks_per_cluster = 1; - nbufs = (int)args.len; - ninodes = args.mp->m_sb.sb_inopblock; + nbufs = (int)args->len; + ninodes = args->mp->m_sb.sb_inopblock; } else { - blks_per_cluster = XFS_INODE_CLUSTER_SIZE(args.mp) / - args.mp->m_sb.sb_blocksize; - nbufs = (int)args.len / blks_per_cluster; - ninodes = blks_per_cluster * args.mp->m_sb.sb_inopblock; + blks_per_cluster = XFS_INODE_CLUSTER_SIZE(args->mp) / + args->mp->m_sb.sb_blocksize; + nbufs = (int)args->len / blks_per_cluster; + ninodes = blks_per_cluster * args->mp->m_sb.sb_inopblock; } /* * Figure out what version number to use in the inodes we create. @@ -271,7 +276,7 @@ xfs_ialloc_ag_alloc( * use the old version so that old kernels will continue to be * able to use the file system. */ - if (XFS_SB_VERSION_HASNLINK(&args.mp->m_sb)) + if (XFS_SB_VERSION_HASNLINK(&args->mp->m_sb)) version = XFS_DINODE_VERSION_2; else version = XFS_DINODE_VERSION_1; @@ -280,19 +285,19 @@ xfs_ialloc_ag_alloc( /* * Get the block. */ - d = XFS_AGB_TO_DADDR(args.mp, be32_to_cpu(agi->agi_seqno), - args.agbno + (j * blks_per_cluster)); - fbuf = xfs_trans_get_buf(tp, args.mp->m_ddev_targp, d, - args.mp->m_bsize * blks_per_cluster, + d = XFS_AGB_TO_DADDR(args->mp, be32_to_cpu(agi->agi_seqno), + args->agbno + (j * blks_per_cluster)); + fbuf = xfs_trans_get_buf(tp, args->mp->m_ddev_targp, d, + args->mp->m_bsize * blks_per_cluster, XFS_BUF_LOCK); ASSERT(fbuf); ASSERT(!XFS_BUF_GETERROR(fbuf)); /* * Set initial values for the inodes in this buffer. 
*/ - xfs_biozero(fbuf, 0, ninodes << args.mp->m_sb.sb_inodelog); + xfs_biozero(fbuf, 0, ninodes << args->mp->m_sb.sb_inodelog); for (i = 0; i < ninodes; i++) { - free = XFS_MAKE_IPTR(args.mp, fbuf, i); + free = XFS_MAKE_IPTR(args->mp, fbuf, i); INT_SET(free->di_core.di_magic, ARCH_CONVERT, XFS_DINODE_MAGIC); INT_SET(free->di_core.di_version, ARCH_CONVERT, version); INT_SET(free->di_next_unlinked, ARCH_CONVERT, NULLAGINO); @@ -304,14 +309,14 @@ xfs_ialloc_ag_alloc( be32_add(&agi->agi_count, newlen); be32_add(&agi->agi_freecount, newlen); agno = be32_to_cpu(agi->agi_seqno); - down_read(&args.mp->m_peraglock); - args.mp->m_perag[agno].pagi_freecount += newlen; - up_read(&args.mp->m_peraglock); + down_read(&args->mp->m_peraglock); + args->mp->m_perag[agno].pagi_freecount += newlen; + up_read(&args->mp->m_peraglock); agi->agi_newino = cpu_to_be32(newino); /* * Insert records describing the new inode chunk into the btree. */ - cur = xfs_btree_init_cursor(args.mp, tp, agbp, agno, + cur = xfs_btree_init_cursor(args->mp, tp, agbp, agno, XFS_BTNUM_INO, (xfs_inode_t *)0, 0); for (thisino = newino; thisino < newino + newlen; @@ -319,12 +324,12 @@ xfs_ialloc_ag_alloc( if ((error = xfs_inobt_lookup_eq(cur, thisino, XFS_INODES_PER_CHUNK, XFS_INOBT_ALL_FREE, &i))) { xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } ASSERT(i == 0); if ((error = xfs_inobt_insert(cur, &i))) { xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); - return error; + goto out_error; } ASSERT(i == 1); } @@ -340,7 +345,9 @@ xfs_ialloc_ag_alloc( xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, (long)newlen); xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, (long)newlen); *alloc = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } STATIC_INLINE xfs_agnumber_t Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc_btree.c 2007-03-30 11:31:24.239345301 +1000 +++ 
2.6.x-xfs-new/fs/xfs/xfs_ialloc_btree.c 2007-03-30 11:32:30.678671441 +1000 @@ -1185,7 +1185,7 @@ xfs_inobt_newroot( int *stat) /* success/failure */ { xfs_agi_t *agi; /* a.g. inode header */ - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ xfs_inobt_block_t *block; /* one half of the old root block */ xfs_buf_t *bp; /* buffer containing block */ int error; /* error return value */ @@ -1207,33 +1207,36 @@ xfs_inobt_newroot( /* * Get a block & a buffer. */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); agi = XFS_BUF_TO_AGI(cur->bc_private.i.agbp); - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; - args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.i.agno, + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; + args->fsbno = XFS_AGB_TO_FSB(args->mp, cur->bc_private.i.agno, be32_to_cpu(agi->agi_root)); - args.mod = args.minleft = args.alignment = args.total = args.wasdel = - args.isfl = args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) - return error; + args->mod = args->minleft = args->alignment = args->total = args->wasdel = + args->isfl = args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) + goto out_error; /* * None available, we fail. */ - if (args.fsbno == NULLFSBLOCK) { + if (args->fsbno == NULLFSBLOCK) { *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - nbp = xfs_btree_get_bufs(args.mp, args.tp, args.agno, args.agbno, 0); + ASSERT(args->len == 1); + nbp = xfs_btree_get_bufs(args->mp, args->tp, args->agno, args->agbno, 0); new = XFS_BUF_TO_INOBT_BLOCK(nbp); /* * Set the root data in the a.g. inode structure. 
*/ - agi->agi_root = cpu_to_be32(args.agbno); + agi->agi_root = cpu_to_be32(args->agbno); be32_add(&agi->agi_level, 1); - xfs_ialloc_log_agi(args.tp, cur->bc_private.i.agbp, + xfs_ialloc_log_agi(args->tp, cur->bc_private.i.agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL); /* * At the previous root level there are now two blocks: the old @@ -1245,41 +1248,41 @@ xfs_inobt_newroot( block = XFS_BUF_TO_INOBT_BLOCK(bp); #ifdef DEBUG if ((error = xfs_btree_check_sblock(cur, block, cur->bc_nlevels - 1, bp))) - return error; + goto out_error; #endif if (be32_to_cpu(block->bb_rightsib) != NULLAGBLOCK) { /* * Our block is left, pick up the right block. */ lbp = bp; - lbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(lbp)); + lbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(lbp)); left = block; rbno = be32_to_cpu(left->bb_rightsib); - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, rbno, 0, &rbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; bp = rbp; right = XFS_BUF_TO_INOBT_BLOCK(rbp); if ((error = xfs_btree_check_sblock(cur, right, cur->bc_nlevels - 1, rbp))) - return error; + goto out_error; nptr = 1; } else { /* * Our block is right, pick up the left block. 
*/ rbp = bp; - rbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(rbp)); + rbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(rbp)); right = block; lbno = be32_to_cpu(right->bb_leftsib); - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, lbno, 0, &lbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; bp = lbp; left = XFS_BUF_TO_INOBT_BLOCK(lbp); if ((error = xfs_btree_check_sblock(cur, left, cur->bc_nlevels - 1, lbp))) - return error; + goto out_error; nptr = 2; } /* @@ -1290,7 +1293,7 @@ xfs_inobt_newroot( new->bb_numrecs = cpu_to_be16(2); new->bb_leftsib = cpu_to_be32(NULLAGBLOCK); new->bb_rightsib = cpu_to_be32(NULLAGBLOCK); - xfs_inobt_log_block(args.tp, nbp, XFS_BB_ALL_BITS); + xfs_inobt_log_block(args->tp, nbp, XFS_BB_ALL_BITS); ASSERT(lbno != NULLAGBLOCK && rbno != NULLAGBLOCK); /* * Fill in the key data in the new root. @@ -1320,7 +1323,9 @@ xfs_inobt_newroot( cur->bc_ptrs[cur->bc_nlevels] = nptr; cur->bc_nlevels++; *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* @@ -1466,7 +1471,7 @@ xfs_inobt_split( xfs_btree_cur_t **curp, /* output: new cursor */ int *stat) /* success/failure */ { - xfs_alloc_arg_t args; /* allocation argument structure */ + xfs_alloc_arg_t *args; /* allocation argument structure */ int error; /* error return value */ int i; /* loop index/record number */ xfs_agblock_t lbno; /* left (current) block number */ @@ -1481,30 +1486,33 @@ xfs_inobt_split( xfs_inobt_ptr_t *rpp; /* right btree address pointer */ xfs_inobt_rec_t *rrp; /* right btree record pointer */ + args = kmem_zalloc(sizeof(xfs_alloc_arg_t), KM_SLEEP); + if (!args) + return XFS_ERROR(ENOMEM); /* * Set up left block (current one). 
*/ lbp = cur->bc_bufs[level]; - args.tp = cur->bc_tp; - args.mp = cur->bc_mp; - lbno = XFS_DADDR_TO_AGBNO(args.mp, XFS_BUF_ADDR(lbp)); + args->tp = cur->bc_tp; + args->mp = cur->bc_mp; + lbno = XFS_DADDR_TO_AGBNO(args->mp, XFS_BUF_ADDR(lbp)); /* * Allocate the new block. * If we can't do it, we're toast. Give up. */ - args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.i.agno, lbno); - args.mod = args.minleft = args.alignment = args.total = args.wasdel = - args.isfl = args.userdata = args.minalignslop = 0; - args.minlen = args.maxlen = args.prod = 1; - args.type = XFS_ALLOCTYPE_NEAR_BNO; - if ((error = xfs_alloc_vextent(&args))) - return error; - if (args.fsbno == NULLFSBLOCK) { + args->fsbno = XFS_AGB_TO_FSB(args->mp, cur->bc_private.i.agno, lbno); + args->mod = args->minleft = args->alignment = args->total = args->wasdel = + args->isfl = args->userdata = args->minalignslop = 0; + args->minlen = args->maxlen = args->prod = 1; + args->type = XFS_ALLOCTYPE_NEAR_BNO; + if ((error = xfs_alloc_vextent(args))) + goto out_error; + if (args->fsbno == NULLFSBLOCK) { *stat = 0; - return 0; + goto out_error; } - ASSERT(args.len == 1); - rbp = xfs_btree_get_bufs(args.mp, args.tp, args.agno, args.agbno, 0); + ASSERT(args->len == 1); + rbp = xfs_btree_get_bufs(args->mp, args->tp, args->agno, args->agbno, 0); /* * Set up the new block as "right". */ @@ -1515,7 +1523,7 @@ xfs_inobt_split( left = XFS_BUF_TO_INOBT_BLOCK(lbp); #ifdef DEBUG if ((error = xfs_btree_check_sblock(cur, left, level, lbp))) - return error; + goto out_error; #endif /* * Fill in the btree header for the new block. 
@@ -1542,7 +1550,7 @@ xfs_inobt_split( #ifdef DEBUG for (i = 0; i < be16_to_cpu(right->bb_numrecs); i++) { if ((error = xfs_btree_check_sptr(cur, be32_to_cpu(lpp[i]), level))) - return error; + goto out_error; } #endif memcpy(rkp, lkp, be16_to_cpu(right->bb_numrecs) * sizeof(*rkp)); @@ -1567,10 +1575,10 @@ xfs_inobt_split( */ be16_add(&left->bb_numrecs, -(be16_to_cpu(right->bb_numrecs))); right->bb_rightsib = left->bb_rightsib; - left->bb_rightsib = cpu_to_be32(args.agbno); + left->bb_rightsib = cpu_to_be32(args->agbno); right->bb_leftsib = cpu_to_be32(lbno); - xfs_inobt_log_block(args.tp, rbp, XFS_BB_ALL_BITS); - xfs_inobt_log_block(args.tp, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); + xfs_inobt_log_block(args->tp, rbp, XFS_BB_ALL_BITS); + xfs_inobt_log_block(args->tp, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); /* * If there's a block to the new block's right, make that block * point back to right instead of to left. @@ -1579,15 +1587,15 @@ xfs_inobt_split( xfs_inobt_block_t *rrblock; /* rr btree block */ xfs_buf_t *rrbp; /* buffer for rrblock */ - if ((error = xfs_btree_read_bufs(args.mp, args.tp, args.agno, + if ((error = xfs_btree_read_bufs(args->mp, args->tp, args->agno, be32_to_cpu(right->bb_rightsib), 0, &rrbp, XFS_INO_BTREE_REF))) - return error; + goto out_error; rrblock = XFS_BUF_TO_INOBT_BLOCK(rrbp); if ((error = xfs_btree_check_sblock(cur, rrblock, level, rrbp))) - return error; - rrblock->bb_leftsib = cpu_to_be32(args.agbno); - xfs_inobt_log_block(args.tp, rrbp, XFS_BB_LEFTSIB); + goto out_error; + rrblock->bb_leftsib = cpu_to_be32(args->agbno); + xfs_inobt_log_block(args->tp, rrbp, XFS_BB_LEFTSIB); } /* * If the cursor is really in the right block, move it there. 
@@ -1604,12 +1612,14 @@ xfs_inobt_split( */ if (level + 1 < cur->bc_nlevels) { if ((error = xfs_btree_dup_cursor(cur, curp))) - return error; + goto out_error; (*curp)->bc_ptrs[level + 1]++; } - *bnop = args.agbno; + *bnop = args->agbno; *stat = 1; - return 0; +out_error: + kmem_free(args, sizeof(xfs_alloc_arg_t)); + return error; } /* From owner-xfs@oss.sgi.com Thu Apr 19 00:37:25 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:37:26 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7bLfB009730 for ; Thu, 19 Apr 2007 00:37:23 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA22700; Thu, 19 Apr 2007 17:37:16 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7bFAf68298774; Thu, 19 Apr 2007 17:37:15 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7bEMo69779756; Thu, 19 Apr 2007 17:37:14 +1000 (AEST) Date: Thu, 19 Apr 2007 17:37:14 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: handle barriers being switched off dynamically. Message-ID: <20070419073714.GU48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11122 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs As pointed out by Neil Brown, MD can switch barriers off dynamically underneath a mounted filesystem. If this happens to XFS, it will shutdown the filesystem immediately. 
Handle this more sanely by yelling into the syslog, retrying the I/O without barriers and, if that is successful, turning off barriers. Also remove an unnecessary check when first checking to see if the underlying device supports barriers. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_buf.c | 13 ++++++++++++- fs/xfs/linux-2.6/xfs_super.c | 8 -------- fs/xfs/xfs_log.c | 13 +++++++++++++ 3 files changed, 25 insertions(+), 9 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-04-19 13:26:49.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-04-19 13:27:01.733786992 +1000 @@ -1000,7 +1000,18 @@ xfs_buf_iodone_work( xfs_buf_t *bp = container_of(work, xfs_buf_t, b_iodone_work); - if (bp->b_iodone) + /* + * We can get an EOPNOTSUPP to ordered writes. Here we clear the + * ordered flag and reissue them. Because we can't tell the higher + * layers directly that they should not issue ordered I/O anymore, they + * need to check if the ordered flag was cleared during I/O completion. + */ + if ((bp->b_error == EOPNOTSUPP) && + (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) { + XB_TRACE(bp, "ordered_retry", bp->b_iodone); + bp->b_flags &= ~XBF_ORDERED; + xfs_buf_iorequest(bp); + } else if (bp->b_iodone) (*(bp->b_iodone))(bp); else if (bp->b_flags & XBF_ASYNC) xfs_buf_relse(bp); Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-19 13:27:00.245980891 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-19 13:27:01.753784386 +1000 @@ -961,6 +961,19 @@ xlog_iodone(xfs_buf_t *bp) l = iclog->ic_log; /* + * If the ordered flag has been removed by a lower + * layer, it means the underlying device no longer supports + * barrier I/O. Warn loudly and turn off barriers.
+ */ + if ((l->l_mp->m_flags & XFS_MOUNT_BARRIER) && !XFS_BUF_ORDERED(bp)) { + l->l_mp->m_flags &= ~XFS_MOUNT_BARRIER; + xfs_fs_cmn_err(CE_WARN, l->l_mp, + "xlog_iodone: Barriers are no longer supported" + " by device. Disabling barriers\n"); + xfs_buftrace("XLOG_IODONE BARRIERS OFF", bp); + } + + /* * Race to shutdown the filesystem if we see an error. */ if (XFS_TEST_ERROR((XFS_BUF_GETERROR(bp)), l->l_mp, Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c 2007-04-19 13:27:00.277976721 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c 2007-04-19 13:27:01.757783865 +1000 @@ -314,14 +314,6 @@ xfs_mountfs_check_barriers(xfs_mount_t * return; } - if (mp->m_ddev_targp->bt_bdev->bd_disk->queue->ordered == - QUEUE_ORDERED_NONE) { - xfs_fs_cmn_err(CE_NOTE, mp, - "Disabling barriers, not supported by the underlying device"); - mp->m_flags &= ~XFS_MOUNT_BARRIER; - return; - } - if (xfs_readonly_buftarg(mp->m_ddev_targp)) { xfs_fs_cmn_err(CE_NOTE, mp, "Disabling barriers, underlying device is readonly"); From owner-xfs@oss.sgi.com Thu Apr 19 00:49:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:49:53 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J7nnfB011915 for ; Thu, 19 Apr 2007 00:49:50 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id AE5ABAAC37A; Thu, 19 Apr 2007 17:49:48 +1000 (EST) Subject: Re: review: allocate bmapi args From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: xfs-dev , xfs-oss In-Reply-To: <20070419072505.GS48531920@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> Content-Type: text/plain Organization: Aconex Date: Thu, 19 Apr 2007 17:51:02 +1000 Message-Id: <1176969062.6273.169.camel@edge> Mime-Version: 1.0 
X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11123 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-19 at 17:25 +1000, David Chinner wrote: > > + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); > + if (!bma) > + return XFS_ERROR(ENOMEM); I guess you meant KM_NOSLEEP? Are you sure this is legit though? (are all callers going to be able to handle this?) I'm thinking of the writeout paths where we're doing space allocation (unwritten extent conversion comes through here too) in order to free up some page cache so other memory allocs elsewhere can proceed. I don't see any other memory allocations in this area of the code, so I guess I'd be treading really carefully here.. (Oh, and why the _zalloc? Could just do an _alloc, since previous code was using non-zeroed memory - so, should have been filling in all fields). cheers. -- Nathan From owner-xfs@oss.sgi.com Thu Apr 19 00:53:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 00:53:49 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J7rifB013002 for ; Thu, 19 Apr 2007 00:53:46 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id RAA23138; Thu, 19 Apr 2007 17:53:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J7rdAf70617956; Thu, 19 Apr 2007 17:53:39 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J7rc1f70015918; Thu, 19 Apr 2007 17:53:38 +1000 (AEST) Date: Thu, 19 Apr 2007 17:53:38 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: fix use after free of log buffers 
on shutdown. Message-ID: <20070419075338.GV48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11124 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs When unmounting the filesystem we write an unmount record into the log just before we start freeing up in memory structures. When we wait for the unmount record to hit the disk, we don't wait for the log buffers to be finished with, we only wait for part of the iodone callback to be run - the bit that processes the unmount record completion. Hence when the unmount wakes up, it races with the remainder of the log io completion and pretty much the first thing it does is free the log buffers. As a result, when iodone processing completes and we check the buffer's async status, the buffer can already have been freed. Luckily, all log I/O is issued asynchronously, so we don't really need the async check and so we can avoid this use after free easily. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_log.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-19 17:18:14.097380099 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-19 17:51:03.017078512 +1000 @@ -988,14 +988,16 @@ xlog_iodone(xfs_buf_t *bp) } else if (iclog->ic_state & XLOG_STATE_IOERROR) { aborted = XFS_LI_ABORTED; } + + /* log I/O is always issued ASYNC */ + ASSERT(XFS_BUF_ISASYNC(bp)); xlog_state_done_syncing(iclog, aborted); - if (!(XFS_BUF_ISASYNC(bp))) { - /* - * Corresponding psema() will be done in bwrite(). If we don't - * vsema() here, panic. 
- */ - XFS_BUF_V_IODONESEMA(bp); - } + /* + * do not reference the buffer (bp) here as we could race + * with it being freed after writing the unmount record to the + * log. + */ + } /* xlog_iodone */ /* From owner-xfs@oss.sgi.com Thu Apr 19 01:23:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:23:42 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3J8NafB019495 for ; Thu, 19 Apr 2007 01:23:38 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA24557; Thu, 19 Apr 2007 18:23:34 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3J8NXAf70514203; Thu, 19 Apr 2007 18:23:33 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3J8NVJY70575441; Thu, 19 Apr 2007 18:23:31 +1000 (AEST) Date: Thu, 19 Apr 2007 18:23:31 +1000 From: David Chinner To: Nathan Scott Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review: allocate bmapi args Message-ID: <20070419082331.GW48531920@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> <1176969062.6273.169.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1176969062.6273.169.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11125 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 05:51:02PM +1000, Nathan Scott wrote: > On Thu, 2007-04-19 at 17:25 +1000, David Chinner wrote: > > > > + bma = kmem_zalloc(sizeof(xfs_bmalloca_t), KM_SLEEP); > > + if (!bma) > > + return XFS_ERROR(ENOMEM); > > I guess you meant KM_NOSLEEP? 
No, I meant a sleeping allocation. I guess that means I don't need the error handling.... > Are you sure this is legit though? It *must* be. We already rely on being able to do substantial amounts of allocation in this path.... > (are all callers going to be able to handle this?) I'm thinking > of the writeout paths where we're doing space allocation (unwritten > extent conversion comes through here too) in order to free up some > page cache so other memory allocs elsewhere can proceed. I don't > see any other memory allocations in this area of the code, so I > guess I'd be treading really carefully here.. We modify the incore extent list as it grows and shrinks in this path. It is critical that we are able to allocate at least small amounts of memory in these writeback paths, and we currently do that with kmem_alloc(KM_SLEEP). A quick search of the xfs_iext_* functions shows lots of allocations are done in manipulating the incore extents.... Then there's needing new pages in the page cache and xfs_buf_t's if we trigger a btree split during the allocation, and so on. IOW, there's plenty of far larger allocations down through this path already, and if any one of them doesn't succeed we are pretty much fscked. Given that we haven't had any reports of writeback deadlocks since the new incore extent handling went in, I think small allocations like this are not a problem. FWIW, I have done low memory testing and I wasn't able to trigger any problems..... > (Oh, and why the _zalloc? Could just do an _alloc, since previous > code was using non-zeroed memory - so, should have been filling in > all fields). Habit. And it doesn't hurt performance at all - we've got to take that cache miss somewhere along the line.... Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu Apr 19 01:38:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:38:06 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J8c2fB023026 for ; Thu, 19 Apr 2007 01:38:04 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3J8c0nB006255 for ; Thu, 19 Apr 2007 17:38:00 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.193]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3J8bK1d011041 for ; Thu, 19 Apr 2007 17:37:20 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3J8bKn00848 for xfs@oss.sgi.com; Thu, 19 Apr 2007 17:37:20 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv3.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3J8bJL27926 for ; Thu, 19 Apr 2007 17:37:19 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070419.173719.50002280 for ; Thu, 19 Apr 2007 17:37:19 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Thu Apr 19 17:37:19 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 57B77AE4B3; Thu, 19 Apr 2007 17:37:10 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3J8bJrN008943; Thu, 19 Apr 2007 17:37:19 +0900 Message-Id: <200704190837.AA05238@TNESG9305.tnes.nec.co.jp> Date: Thu, 19 Apr 2007 17:37:11 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix "quota -n" command in xfs_quota. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11126 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, The "quota -n" command in xfs_quota doesn't work when a project id is specified. This patch fixes it. Example: # ./xfs_quota -x -c 'quota -p -n 42' ~utako/mpnt Disk quotas for Project logfiles (42) Filesystem Blocks Quota Limit Warn/Time Mounted on /dev/sda6 52 0 0 00 [--------] /home/utako/mpnt Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/quota.orig 2007-04-18 10:36:38.000000000 +0900 +++ xfsprogs-2.8.20/quota/quota.c 2007-04-18 11:09:10.000000000 +0900 @@ -312,7 +312,7 @@ getprojectname( static char buffer[32]; fs_project_t *p; - if ((p = getprprid(prid))) + if (!numeric && (p = getprprid(prid))) return p->pr_name; snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid); return &buffer[0]; From owner-xfs@oss.sgi.com Thu Apr 19 01:41:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 01:41:07 -0700 (PDT) Received: from ex-osl-dc03.exense.int (exense-pdc.exense.com [195.204.47.129]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3J8f2fB024171 for ; Thu, 19 Apr 2007 01:41:03 -0700 Received: from [127.0.0.1] ([10.1.3.13]) by ex-osl-dc03.exense.int with Microsoft SMTPSVC(6.0.3790.1830); Thu, 19 Apr 2007 10:28:45 +0200 Message-ID: <4627283E.7060000@start.no> Date: Thu, 19 Apr 2007 10:28:46 +0200 From: "Stein M.
Hugubakken" User-Agent: Thunderbird 1.5.0.10 (X11/20070317) MIME-Version: 1.0 To: xfs@oss.sgi.com Subject: Inode usage Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 19 Apr 2007 08:28:45.0181 (UTC) FILETIME=[BEB9FAD0:01C7825C] X-archive-position: 11127 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dulci@start.no Precedence: bulk X-list: xfs Hi! I have a lot of free inodes on my xfs-partitions and was wondering about what impact this has on performance or memory? Here is output from 'df': df -ih Filesystem Inodes IUsed IFree IUse% Mounted on /dev/hda2 5,1M 129K 5,0M 3% / /dev/hda3 31M 54K 31M 1% /home df -h Filesystem Size Used Avail Use% Mounted on /dev/hda2 5,1G 2,8G 2,4G 55% / /dev/hda3 31G 18G 13G 59% /home With xfs_growfs -m I can adjust the amount of free inodes, but it seems I can't change it for the root-partition, why is that a problem? Kind regards Stein From owner-xfs@oss.sgi.com Thu Apr 19 06:14:54 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 06:14:57 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JDEpfB023951 for ; Thu, 19 Apr 2007 06:14:52 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id XAA00078; Thu, 19 Apr 2007 23:14:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JDEhAf70798677; Thu, 19 Apr 2007 23:14:43 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JDEgnG70177793; Thu, 19 Apr 2007 23:14:42 +1000 (AEST) Date: Thu, 19 Apr 2007 23:14:42 +1000 From: David Chinner To: "Stein M. 
Hugubakken" Cc: xfs@oss.sgi.com Subject: Re: Inode usage Message-ID: <20070419131442.GC32602149@melbourne.sgi.com> References: <4627283E.7060000@start.no> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4627283E.7060000@start.no> User-Agent: Mutt/1.4.2.1i X-archive-position: 11128 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 10:28:46AM +0200, Stein M. Hugubakken wrote: > Hi! > > I have a lot of free inodes on my xfs-partitions and was wondering about > what impact this has on performance or memory? None. The number of free inodes is a made up number. ;) XFS dynamically allocates and frees inodes, so the number of free inodes is determined by working out how many inodes could be allocated in the remaining free space you have. It's a theoretical maximum.... > Here is output from 'df': > df -ih > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/hda2 5,1M 129K 5,0M 3% / > /dev/hda3 31M 54K 31M 1% /home > > df -h > Filesystem Size Used Avail Use% Mounted on > /dev/hda2 5,1G 2,8G 2,4G 55% / > /dev/hda3 31G 18G 13G 59% /home > > With xfs_growfs -m I can adjust the amount of free inodes, but it seems > I can't change it for the root-partition, why is that a problem? Works for me.... Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu Apr 19 06:43:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 06:43:37 -0700 (PDT) Received: from mail.interline.it (mail.interline.it [195.182.241.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3JDhWfB032373 for ; Thu, 19 Apr 2007 06:43:35 -0700 Received: from localhost (localhost [127.0.0.1]) by mail.interline.it (Postfix) with ESMTP id 599D1EE7 for ; Thu, 19 Apr 2007 15:42:20 +0200 (CEST) Received: from mail.interline.it ([127.0.0.1]) by localhost (pin [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 23588-03 for ; Thu, 19 Apr 2007 15:41:43 +0200 (CEST) From: "Daniele P." Organization: Interline To: xfs@oss.sgi.com Subject: Re: Inode usage Date: Thu, 19 Apr 2007 15:42:53 +0200 User-Agent: KMail/1.9.5 References: <4627283E.7060000@start.no> <20070419131442.GC32602149@melbourne.sgi.com> In-Reply-To: <20070419131442.GC32602149@melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704191542.53366.daniele@interline.it> X-archive-position: 11129 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: daniele@interline.it Precedence: bulk X-list: xfs On Thursday 19 April 2007 15:14, you wrote: > On Thu, Apr 19, 2007 at 10:28:46AM +0200, Stein M. Hugubakken wrote: > > Hi! > > > > I have a lot of free inodes on my xfs-partitions and was wondering > > about what impact this has on performance or memory? > > None. The number of free inodes is a made up number. ;) That's for normal usage. Well, David, don't forget what you told me some time ago about xfs_repair: In terms of inode count, I generally use the rule that for every 10million inodes you need a gigabyte of RAM for repair - you needed about 500MB for 6million inodes.
The inode count also affects xfsdump/xfsrestore operations, but I don't know the details. There was a huge improvement in performance for partial dumps with xfsprogs-2.8.16, IIRC. Regards, Daniele P. From owner-xfs@oss.sgi.com Thu Apr 19 06:50:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 06:50:19 -0700 (PDT) Received: from mail.interline.it (mail.interline.it [195.182.241.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3JDoBfB001378 for ; Thu, 19 Apr 2007 06:50:14 -0700 Received: from localhost (localhost [127.0.0.1]) by mail.interline.it (Postfix) with ESMTP id 0FD73F52 for ; Thu, 19 Apr 2007 15:49:00 +0200 (CEST) Received: from mail.interline.it ([127.0.0.1]) by localhost (pin [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 24272-02 for ; Thu, 19 Apr 2007 15:48:53 +0200 (CEST) From: "Daniele P." Organization: Interline To: xfs@oss.sgi.com Subject: Re: Inode usage Date: Thu, 19 Apr 2007 15:50:03 +0200 User-Agent: KMail/1.9.5 References: <4627283E.7060000@start.no> <20070419131442.GC32602149@melbourne.sgi.com> <200704191542.53366.daniele@interline.it> In-Reply-To: <200704191542.53366.daniele@interline.it> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200704191550.03134.daniele@interline.it> X-archive-position: 11130 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: daniele@interline.it Precedence: bulk X-list: xfs On Thursday 19 April 2007 15:42, you wrote: > On Thursday 19 April 2007 15:14, you wrote: > > On Thu, Apr 19, 2007 at 10:28:46AM +0200, Stein M. Hugubakken wrote: > > > Hi! > > > > > > I have a lot of free inodes on my xfs-partitions and was > > > wondering about what impact this has on performance or memory? > > > > None. The number of free inodes is a made up number. ;) Sorry. I was really blind. I didn't read "free". Apologies, Daniele P.
From owner-xfs@oss.sgi.com Thu Apr 19 07:00:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 07:00:16 -0700 (PDT) Received: from mr3.cc.ic.ac.uk (mr3.cc.ic.ac.uk [155.198.5.113]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3JE0CfB003297 for ; Thu, 19 Apr 2007 07:00:13 -0700 Received: from icexp2.cc.ic.ac.uk ([155.198.3.42] helo=icex.imperial.ac.uk) by mr3.cc.ic.ac.uk with smtp (Exim 4.63) (envelope-from ) id 1HeWhO-0000uL-45 for xfs@oss.sgi.com; Thu, 19 Apr 2007 14:29:22 +0100 Received: from icex1.ic.ac.uk ([155.198.3.1]) by icex.imperial.ac.uk with Microsoft SMTPSVC(6.0.3790.1830); Thu, 19 Apr 2007 14:28:24 +0100 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: XFS internal error XFS_WANT_CORRUPTED_GOTO Date: Thu, 19 Apr 2007 14:28:24 +0100 Message-ID: <735C1873E656C24699818814048F8FB0054C43B4@icex1.ic.ac.uk> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS internal error XFS_WANT_CORRUPTED_GOTO Thread-Index: AceChptrVSYxch2yTEukdzyeziz/tg== From: "Burbidge, Simon A" To: X-OriginalArrivalTime: 19 Apr 2007 13:28:24.0700 (UTC) FILETIME=[9B5ADBC0:01C78286] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l3JE0EfB003302 X-archive-position: 11131 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: s.burbidge@imperial.ac.uk Precedence: bulk X-list: xfs Hi, We've had a couple of occurrences of XFS shutdowns on one of our fileservers. The latest had the message: Apr 19 10:35:00 fs3 kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1745 of file fs/xfs/xfs_alloc.c.
Caller 0xffffffff8819bc7c Apr 19 10:35:00 fs3 kernel: Apr 19 10:35:00 fs3 kernel: Call Trace:{:xfs:xfs_free_ag_extent+1449} {:xfs:xfs_free_extent+188} Apr 19 10:35:00 fs3 kernel: {:xfs:xfs_efd_init+71} {:xfs:xfs_bmap_finish+264} Apr 19 10:35:00 fs3 kernel: {:xfs:xfs_itruncate_finish+439} {:xfs:xfs_inactive+622} Apr 19 10:35:00 fs3 kernel: {:xfs:validate_fields+39} {:xfs:vn_rele+99} Apr 19 10:35:00 fs3 kernel: {:xfs:linvfs_clear_inode+18} {clear_inode+228} Apr 19 10:35:00 fs3 kernel: {generic_delete_inode+253} {d_delete+202} Apr 19 10:35:00 fs3 kernel: {vfs_unlink+500} {:nfsd:nfsd_unlink+520} Apr 19 10:35:00 fs3 kernel: {:nfsd:nfsd3_proc_remove+200} {:nfsd:nfsd_dispatch+240} Apr 19 10:35:00 fs3 kernel: {svc_process+1049} {__down_read+49} Apr 19 10:35:00 fs3 kernel: {:nfsd:nfsd+472} {schedule_tail+64} Apr 19 10:35:00 fs3 kernel: {child_rip+8} {:nfsd:nfsd+0} Apr 19 10:35:00 fs3 kernel: {child_rip+0} Apr 19 10:35:00 fs3 kernel: xfs_force_shutdown(dm-0,0x8) called from line 4086 of file fs/xfs/xfs_bmap.c. Return address = 0xffffffff881a7b7f Apr 19 10:35:00 fs3 kernel: Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0 Apr 19 10:35:00 fs3 kernel: Please umount the filesystem, and rectify the problem(s) xfs_repair didn't find any errors. An earlier occurrence had a bunch of corrupt dinode messages, followed by : Apr 8 07:23:30 fs3 kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1583 of file fs/xfs/xfs_alloc.c. 
Caller 0xffffffff88225c7c Apr 8 07:23:30 fs3 kernel: Apr 8 07:23:30 fs3 kernel: Call Trace:{:xfs:xfs_free_ag_extent+436} {:xfs:xfs_free_extent+188} Apr 8 07:23:30 fs3 kernel: {:xfs:xfs_efd_init+71} {:xfs:xfs_trans_get_efd+43} Apr 8 07:23:30 fs3 kernel: {:xfs:xfs_bmap_finish+264} {:xfs:xfs_itruncate_finish+439} Apr 8 07:23:30 fs3 kernel: {:xfs:xfs_inactive_free_eofblocks+415} Apr 8 07:23:30 fs3 kernel: {:xfs:xfs_inactive+339} {:xfs:xfs_iunlock+56} Apr 8 07:23:30 fs3 kernel: {:xfs:xfs_reclaim+188} {:xfs:vn_purge+475} Apr 8 07:23:30 fs3 kernel: {:xfs:vn_rele+99} {:xfs:linvfs_clear_inode+18} Apr 8 07:23:30 fs3 kernel: {clear_inode+228} {dispose_list+104} Apr 8 07:23:30 fs3 kernel: {shrink_icache_memory+490} {shrink_slab+220} Apr 8 07:23:30 fs3 kernel: {balance_pgdat+619} {kswapd+343} Apr 8 07:23:30 fs3 kernel: {autoremove_wake_function+0} {schedule_tail+64} Apr 8 07:23:30 fs3 kernel: {child_rip+8} {kswapd+0} Apr 8 07:23:30 fs3 kernel: {child_rip+0} Apr 8 07:23:30 fs3 kernel: xfs_force_shutdown(dm-0,0x8) called from line 4086 of file fs/xfs/xfs_bmap.c. Return address = 0xffffffff88231b7f Apr 8 07:23:30 fs3 kernel: Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0 Apr 8 07:23:30 fs3 kernel: Please umount the filesystem, and rectify the problem(s) xfs_repair did find and correct some bits for this. Suggestions welcomed! 
Thanks in advance, Simon From owner-xfs@oss.sgi.com Thu Apr 19 07:18:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 07:18:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JEIZfB007662 for ; Thu, 19 Apr 2007 07:18:38 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA02006; Fri, 20 Apr 2007 00:18:29 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JEISAf70976015; Fri, 20 Apr 2007 00:18:28 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JEIRWs70418417; Fri, 20 Apr 2007 00:18:27 +1000 (AEST) Date: Fri, 20 Apr 2007 00:18:27 +1000 From: David Chinner To: "Burbidge, Simon A" Cc: xfs@oss.sgi.com Subject: Re: XFS internal error XFS_WANT_CORRUPTED_GOTO Message-ID: <20070419141827.GF32602149@melbourne.sgi.com> References: <735C1873E656C24699818814048F8FB0054C43B4@icex1.ic.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <735C1873E656C24699818814048F8FB0054C43B4@icex1.ic.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11132 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 02:28:24PM +0100, Burbidge, Simon A wrote: > > Hi, > > We've had a couple of occurrences of XFS shutdowns on one of our > fileservers. > The latest had the message: > > Apr 19 10:35:00 fs3 kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO > at line 1745 of file fs/xfs/xfs_alloc.c.
Caller 0xffffffff8819bc7c > Apr 19 10:35:00 fs3 kernel: > Apr 19 10:35:00 fs3 kernel: Call > Trace:{:xfs:xfs_free_ag_extent+1449} > {:xfs:xfs_free_extent+188} So you've got a corrupted freespace btree. What is the filesystem hosted on - a normal block device, iscsi, nbd? What kernel? Are there any I/O errors in the log? What were you running at the time of the shutdowns? Anything common between the occurrences? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu Apr 19 07:37:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 07:37:06 -0700 (PDT) Received: from mr4.cc.ic.ac.uk (mr4.cc.ic.ac.uk [155.198.5.114]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3JEb1fB012213 for ; Thu, 19 Apr 2007 07:37:03 -0700 Received: from icexp1.cc.ic.ac.uk ([155.198.3.41] helo=icex.imperial.ac.uk) by mr4.cc.ic.ac.uk with smtp (Exim 4.63) (envelope-from ) id 1HeXkp-0006SK-LY; Thu, 19 Apr 2007 15:36:59 +0100 Received: from icex1.ic.ac.uk ([155.198.3.1]) by icex.imperial.ac.uk with Microsoft SMTPSVC(6.0.3790.1830); Thu, 19 Apr 2007 15:36:58 +0100 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: RE: XFS internal error XFS_WANT_CORRUPTED_GOTO Date: Thu, 19 Apr 2007 15:36:58 +0100 Message-ID: <735C1873E656C24699818814048F8FB0054C43B8@icex1.ic.ac.uk> In-Reply-To: <20070419141827.GF32602149@melbourne.sgi.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: XFS internal error XFS_WANT_CORRUPTED_GOTO Thread-Index: AceCjZ5eSBw8KQFcT82U/nY+cEZbfAAAKdDg From: "Burbidge, Simon A" To: "David Chinner" Cc: X-OriginalArrivalTime: 19 Apr 2007 14:36:58.0841 (UTC) FILETIME=[2F92F490:01C78290] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id l3JEb4fB012220 X-archive-position: 11133 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com X-original-sender: s.burbidge@imperial.ac.uk Precedence: bulk X-list: xfs Hi Dave, Thanks for the response. No I/O errors reported in the message log or on the RAID box. It's an Infortrend SATA RAID5 array, with a fibre channel connection to the server. The filesystem is built on an LVM volume. Kernel is 2.6.13-15-smp running on an x86_64 dual CPU Xeon server with hyper-threading enabled. The most significant feature of the load is that it is part of an HPC cluster, and has a large number of nodes NFS mounting the filesystem across Gigabit ethernet. I did notice that in the first incident, a user had a directory with 700000 files in it, and xfs_repair found fault with that directory. The user has revised their workflow since and removed the files. It is very difficult to spot common traits in the workload between the 2 incidents. Cheers, Simon > -----Original Message----- > From: David Chinner [mailto:dgc@sgi.com] > Sent: 19 April 2007 15:18 > To: Burbidge, Simon A > Cc: xfs@oss.sgi.com > Subject: Re: XFS internal error XFS_WANT_CORRUPTED_GOTO > > On Thu, Apr 19, 2007 at 02:28:24PM +0100, Burbidge, Simon A wrote: > > > > Hi, > > > > We've had a couple of occurrences of XFS shutdowns on one of our > > fileservers. > > The latest had the message: > > > > Apr 19 10:35:00 fs3 kernel: XFS internal error > XFS_WANT_CORRUPTED_GOTO > > at line 1745 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff8819bc7c > > Apr 19 10:35:00 fs3 kernel: > > Apr 19 10:35:00 fs3 kernel: Call > > Trace:{:xfs:xfs_free_ag_extent+1449} > > {:xfs:xfs_free_extent+188} > > So you've got a corrupted freespace btree. What is the filesystem > hosted on - a normal block device, iscsi, nbd? What kernel? > > Are there any I/O errors in the log? > > What were you running at the time of the shutdowns? Anything > common between the occurrences? > > Cheers, > > Dave.
> -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > From owner-xfs@oss.sgi.com Thu Apr 19 15:11:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 15:11:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JMBDfB016311 for ; Thu, 19 Apr 2007 15:11:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA13616; Fri, 20 Apr 2007 08:11:03 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JMB2Af71268124; Fri, 20 Apr 2007 08:11:02 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JMAxpO71372721; Fri, 20 Apr 2007 08:10:59 +1000 (AEST) Date: Fri, 20 Apr 2007 08:10:59 +1000 From: David Chinner To: "Burbidge, Simon A" Cc: David Chinner , xfs@oss.sgi.com Subject: Re: XFS internal error XFS_WANT_CORRUPTED_GOTO Message-ID: <20070419221059.GI32602149@melbourne.sgi.com> References: <20070419141827.GF32602149@melbourne.sgi.com> <735C1873E656C24699818814048F8FB0054C43B8@icex1.ic.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <735C1873E656C24699818814048F8FB0054C43B8@icex1.ic.ac.uk> User-Agent: Mutt/1.4.2.1i X-archive-position: 11134 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 03:36:58PM +0100, Burbidge, Simon A wrote: > Hi Dave, > Thanks for the response. > No I/O errors reported in the message log or on the RAID box. OK. > It's an Infortrend SATA RAID5 array, with a fibre channel connection to > the server. > The filesystem is build on an LVM volume. 
> Kernel is 2.6.13-15-smp running on an x86_64 dual CPU Xeon server with > hyper-threading enabled. That's a relatively old kernel. It's possible that what you are seeing has been fixed since that kernel was released. > The most significant feature of the load is that it is part of an HPC > cluster, and has a large number of nodes NFS mounting the filesystem > across Gigabit ethernet. Not uncommon - we do that all the time ;) > I did notice that in the first incident, a user had a directory with > 700000 files in it, and xfs_repair found fault with that directory. The > user has revised their workflow since and removed the files. > Very difficult to spot common traits in the workload between the 2 > incidents. Ok, so that makes it kind of hard to start tracking this down. If it keeps occurring and you can't isolate the workload that is causing the problem, you might want to upgrade to a more recent kernel and see if that helps..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Thu Apr 19 16:15:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 16:15:17 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JNF8fB028708 for ; Thu, 19 Apr 2007 16:15:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA15234; Fri, 20 Apr 2007 09:15:02 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JNF1Af68501064; Fri, 20 Apr 2007 09:15:01 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JNF07D71145666; Fri, 20 Apr 2007 09:15:00 +1000 (AEST) Date: Fri, 20 Apr 2007 09:15:00 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review [1 of 3]: lazy 
superblock counters - core kernel Message-ID: <20070419231459.GX48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11135 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Introduce lazy superblock accounting. When we have a couple of hundred transactions in flight at once, they all typically modify the on disk superblock in some way. create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify free block counts. When these counts are modified in a transaction, they must eventually lock the superblock buffer and apply the mods. The buffer then remains locked until the transaction is committed into the incore log buffer. The result of this is that with enough transactions in flight the incore superblock buffer becomes a bottleneck. The result of contention on the incore superblock buffer is that transaction rates fall - the more pressure that is put on the superblock buffer, the slower things go.
Some of you might remember the dbench test the Samba folks did a couple of years back that showed these sorts of results: http://samba.org/~tridge/xattr_results/noxattr.png

Here are some numbers from Irix XFS from 2 years ago when I did this work originally (the first major piece of work I did on XFS!):

         rate MB/s       CPU (%)        csw/s
Loads   stock    lazy  stock   lazy  stock   lazy
-----  ------  ------  -----  -----  -----  -----
    1   82.93   75.07   87.8   88.9    701    692
    2   94.98   88.11  183.8  184.7    299    287
    3  110.66  101.62  278.6  279.2    431    353
    4  120.75  113.75  365.5  367.7    599    492
    8  101.12  114.00  274.4  369.7   1989    545
   15   96.44  116.57  255.9  370.5   1874    453
   30   95.52  116.22  246.8  370.6   1831    465
   50   93.35   94.63  238.7  272.9   2195    744
   75   92.43  110.67  235.7  369.4   2599   2689
  100   89.95  113.07  236.5  367.2   2847   1940
  125   84.47  112.72  214.8  369.8   2786   2316
  150   83.38  109.36  215.0  366.4   2902   3567
  175   87.56  109.25  223.5  363.3   3088   3141
  200   86.03  104.97  234.7  346.3   3282   3470
  250   72.82  106.10  246.7  342.3   3374   4470
  300   62.21  100.86  249.7  319.4   3649   5175
  400   40.48  110.38  193.7  327.2   3885   6251
  500   58.53  119.91  162.4  304.0   3682   6098
  750   51.44  124.05  159.0  240.2   3592   6217
 1000   65.18  144.17  179.6  219.1   4732   6345

As you can see, performance starts to drop off under pretty low numbers of concurrent processes on a normal filesystem, but with lazy superblock accounting the performance does not drop off - we get a change of behaviour instead and we start contending on different, more parallelised objects (e.g. AGF and AGI buffers) and hence we don't hit a single point of contention that limits us.

[ FWIW, Nathan addressed the xattr performance problems shown by these dbench tests with attr2. ]

It should also be noted that this modification does not improve peak performance - it provides graceful behaviour in overload conditions more than anything else.

The key to removing the contention is to not require the superblock fields in question to be locked. We do that by not marking the superblock dirty in the transaction.
IOWs, we modify the incore superblock but do not modify the cached superblock buffer. In short, we do not log superblock modifications to critical fields in the superblock on every transaction. In fact we only do it just before we write the superblock to disk every sync period or just before unmount. This creates an interesting problem - if we don't log or write out the fields in every transaction, then how do the values get recovered after a crash? The answer is simple - we keep enough duplicate, logged information in other structures that we can reconstruct the correct count after log recovery has been performed. It is the AGF and AGI structures that contain the duplicate information; after recovery, we walk every AGI and AGF and sum their individual counters to get the correct value, and we do a transaction into the log to correct them. An optimisation of this is that if we have a clean unmount record, we know the value in the superblock is correct, so we can avoid the summation walk under normal conditions and so mount/recovery times do not change under normal operation. One wrinkle that was discovered during development was that the blocks used in the freespace btrees are never accounted for in the AGF counters. This was once a valid optimisation to make; when the filesystem is full, the free space btrees are empty and consume no space. Hence when it matters, the "accounting" is correct. But that means that when we do the AGF summations, we would not have a correct count and xfs_check would complain. Hence a new counter was added to track the number of blocks used by the free space btrees. This is an *on-disk format change*. As a result of this change, lazy superblock counters are a mkfs option and at the moment on Linux there is no simple way to convert an old filesystem. Conversion is possible - xfs_db can be used to twiddle the right bits and then xfs_repair will do the format conversion for you. Similarly, you can convert backwards as well.
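The post-recovery summation described above can be sketched roughly as follows. This is a simplified, userspace-only illustration in C, not the code from the patch: the struct and function names (ag_counters, sb_counters, sum_ag_counters) are hypothetical, and the choice to count free blocks as agf_freeblks + agf_flcount + agf_btreeblks is an assumption made for the example. The patch itself is the authority on which fields are summed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-AG counters; not the real xfs_perag_t. */
struct ag_counters {
    uint32_t agi_count;     /* allocated inodes in this AG (from the AGI) */
    uint32_t agi_freecount; /* free inodes in this AG (from the AGI) */
    uint32_t agf_freeblks;  /* free blocks in this AG (from the AGF) */
    uint32_t agf_flcount;   /* blocks on the AG freelist (from the AGF) */
    uint32_t agf_btreeblks; /* blocks held by the free space btrees */
};

/* Hypothetical stand-in for the three lazily-updated superblock fields. */
struct sb_counters {
    uint64_t icount;
    uint64_t ifree;
    uint64_t fdblocks;
};

/*
 * Rebuild the global counters by walking every AG and summing its
 * logged, per-AG counters. Assumption: free space is taken to be
 * free blocks + freelist blocks + free space btree blocks.
 */
struct sb_counters sum_ag_counters(const struct ag_counters *ags, size_t nags)
{
    struct sb_counters sb = { 0, 0, 0 };

    assert(ags != NULL || nags == 0);

    for (size_t i = 0; i < nags; i++) {
        sb.icount += ags[i].agi_count;
        sb.ifree += ags[i].agi_freecount;
        sb.fdblocks += (uint64_t)ags[i].agf_freeblks +
                       ags[i].agf_flcount +
                       ags[i].agf_btreeblks;
    }
    return sb;
}
```

The point of the sketch is only that the global values are derivable from per-AG data that is already logged, which is why the superblock copies do not need to be logged on every transaction. Without the btree block counter, the free block sum in this model would come up short by the blocks the free space btrees occupy.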
At some point we'll add functionality to xfs_admin to do the bit twiddling easily.... This first patch is the core kernel code from the original port Nathan did from Irix a long while ago. The second patch is a recent fix needed to work with the per-cpu incore superblock counters, and the third patch is all the userspace tool changes needed, also derived from that port. [nathans - the first and third patches are pretty much unchanged from when you last saw them. ] Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.c 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_trans.c 2006-04-11 16:07:46.511510919 +1000 @@ -439,6 +439,14 @@ undo_blocks: * * Mark the transaction structure to indicate that the superblock * needs to be updated before committing. + * + * Because we may not be keeping track of allocated/free inodes and + * used filesystem blocks in the superblock, we do not mark the + * superblock dirty in this transaction if we modify these fields. + * We still need to update the transaction deltas so that they get + * applied to the incore superblock, but we don't want them to + * cause the superblock to get locked and logged if these are the + * only fields in the superblock that the transaction modifies.
*/ void xfs_trans_mod_sb( @@ -446,13 +454,19 @@ xfs_trans_mod_sb( uint field, long delta) { + uint32_t flags = (XFS_TRANS_DIRTY|XFS_TRANS_SB_DIRTY); + xfs_mount_t *mp = tp->t_mountp; switch (field) { case XFS_TRANS_SB_ICOUNT: tp->t_icount_delta += delta; + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_IFREE: tp->t_ifree_delta += delta; + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_FDBLOCKS: /* @@ -465,6 +479,8 @@ xfs_trans_mod_sb( ASSERT(tp->t_blk_res_used <= tp->t_blk_res); } tp->t_fdblocks_delta += delta; + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_RES_FDBLOCKS: /* @@ -474,6 +490,8 @@ xfs_trans_mod_sb( */ ASSERT(delta < 0); tp->t_res_fdblocks_delta += delta; + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_FREXTENTS: /* @@ -556,18 +574,23 @@ xfs_trans_apply_sb_deltas( (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); - if (tp->t_icount_delta != 0) { - INT_MOD(sbp->sb_icount, ARCH_CONVERT, tp->t_icount_delta); - } - if (tp->t_ifree_delta != 0) { - INT_MOD(sbp->sb_ifree, ARCH_CONVERT, tp->t_ifree_delta); - } + /* + * Only update the superblock counters if we are logging them + */ + if (!XFS_SB_VERSION_LAZYSBCOUNT(&(tp->t_mountp->m_sb))) { + if (tp->t_icount_delta != 0) { + INT_MOD(sbp->sb_icount, ARCH_CONVERT, tp->t_icount_delta); + } + if (tp->t_ifree_delta != 0) { + INT_MOD(sbp->sb_ifree, ARCH_CONVERT, tp->t_ifree_delta); + } - if (tp->t_fdblocks_delta != 0) { - INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_fdblocks_delta); - } - if (tp->t_res_fdblocks_delta != 0) { - INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_res_fdblocks_delta); + if (tp->t_fdblocks_delta != 0) { + INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_fdblocks_delta); + } + if (tp->t_res_fdblocks_delta != 0) { + INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, 
tp->t_res_fdblocks_delta); + } } if (tp->t_frextents_delta != 0) { @@ -639,6 +662,7 @@ xfs_trans_unreserve_and_mod_sb( { xfs_mod_sb_t msb[14]; /* If you add cases, add entries */ xfs_mod_sb_t *msbp; + xfs_mount_t *mp = tp->t_mountp; /* REFERENCED */ int error; int rsvd; @@ -671,8 +695,15 @@ xfs_trans_unreserve_and_mod_sb( * The t_res_fdblocks_delta and t_res_frextents_delta fields are * explicitly NOT applied to the in-core superblock. * The idea is that that has already been done. + * + * If we are not logging superblock counters, then the inode + * allocated/free and used block counts are not updated in the + * on disk superblock. In this case, XFS_TRANS_SB_DIRTY will + * not be set when the transaction is updated but we still need + * to update the incore superblock with the changes. */ - if (tp->t_flags & XFS_TRANS_SB_DIRTY) { + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb) || + (tp->t_flags & XFS_TRANS_SB_DIRTY)) { if (tp->t_icount_delta != 0) { msbp->msb_field = XFS_SBS_ICOUNT; msbp->msb_delta = (int)tp->t_icount_delta; @@ -688,6 +719,9 @@ xfs_trans_unreserve_and_mod_sb( msbp->msb_delta = (int)tp->t_fdblocks_delta; msbp++; } + } + + if (tp->t_flags & XFS_TRANS_SB_DIRTY) { if (tp->t_frextents_delta != 0) { msbp->msb_field = XFS_SBS_FREXTENTS; msbp->msb_delta = (int)tp->t_frextents_delta; Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c 2006-04-03 08:50:15.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c 2006-04-11 16:07:46.688244167 +1000 @@ -1449,7 +1449,8 @@ xfs_alloc_ag_vextent_small( else if (args->minlen == 1 && args->alignment == 1 && !args->isfl && (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount) > args->minleft)) { - if ((error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno))) + error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0); + if (error) goto error0; if (fbno != NULLAGBLOCK) { if (args->userdata) { @@ -1912,7 +1913,7 @@ 
xfs_alloc_fix_freelist( while (be32_to_cpu(agf->agf_flcount) > need) { xfs_buf_t *bp; - if ((error = xfs_alloc_get_freelist(tp, agbp, &bno))) + if ((error = xfs_alloc_get_freelist(tp, agbp, &bno, 0))) return error; if ((error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1))) return error; @@ -1956,7 +1957,7 @@ xfs_alloc_fix_freelist( */ for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { if ((error = xfs_alloc_put_freelist(tp, agbp, agflbp, - bno))) + bno, 0))) return error; } } @@ -1972,13 +1973,15 @@ int /* error */ xfs_alloc_get_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop) /* block address retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk) /* destination is a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. freelist structure */ xfs_buf_t *agflbp;/* buffer for a.g. freelist structure */ xfs_agblock_t bno; /* block number returned */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_get_freelist"; #endif @@ -2013,8 +2016,16 @@ xfs_alloc_get_freelist( be32_add(&agf->agf_flcount, -1); xfs_trans_agflist_delta(tp, -1); pag->pagf_flcount--; - TRACE_MODAGF(NULL, agf, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); + + logflags = XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, 1); + pag->pagf_btreeblks++; + logflags |= XFS_AGF_BTREEBLKS; + } + + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); *bnop = bno; /* @@ -2052,6 +2063,7 @@ xfs_alloc_log_agf( offsetof(xfs_agf_t, agf_flcount), offsetof(xfs_agf_t, agf_freeblks), offsetof(xfs_agf_t, agf_longest), + offsetof(xfs_agf_t, agf_btreeblks), sizeof(xfs_agf_t) }; @@ -2087,12 +2099,14 @@ xfs_alloc_put_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* 
buffer for a.g. freelist header */ xfs_buf_t *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno) /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk) /* block came from a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. free block array */ xfs_agblock_t *blockp;/* pointer to array entry */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_put_freelist"; #endif @@ -2113,11 +2127,22 @@ xfs_alloc_put_freelist( be32_add(&agf->agf_flcount, 1); xfs_trans_agflist_delta(tp, 1); pag->pagf_flcount++; + + logflags = XFS_AGF_FLLAST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, -1); + pag->pagf_btreeblks--; + logflags |= XFS_AGF_BTREEBLKS; + } + + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); + ASSERT(be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp)); blockp = &agfl->agfl_bno[be32_to_cpu(agf->agf_fllast)]; INT_SET(*blockp, ARCH_CONVERT, bno); - TRACE_MODAGF(NULL, agf, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); xfs_trans_log_buf(tp, agflbp, (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl), (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl + @@ -2177,6 +2202,7 @@ xfs_alloc_read_agf( pag = &mp->m_perag[agno]; if (!pag->pagf_init) { pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks); + pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks); pag->pagf_flcount = be32_to_cpu(agf->agf_flcount); pag->pagf_longest = be32_to_cpu(agf->agf_longest); pag->pagf_levels[XFS_BTNUM_BNOi] = Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2006-04-11 16:07:46.694102727 +1000 @@ -125,6 +125,7 @@ xfs_ialloc_ag_alloc( int 
blks_per_cluster; /* fs blocks per inode cluster */ xfs_btree_cur_t *cur; /* inode btree cursor */ xfs_daddr_t d; /* disk addr of buffer */ + xfs_agnumber_t agno; int error; xfs_buf_t *fbuf; /* new free inodes' buffer */ xfs_dinode_t *free; /* new free inode structure */ @@ -303,15 +304,15 @@ xfs_ialloc_ag_alloc( } be32_add(&agi->agi_count, newlen); be32_add(&agi->agi_freecount, newlen); + agno = be32_to_cpu(agi->agi_seqno); down_read(&args.mp->m_peraglock); - args.mp->m_perag[be32_to_cpu(agi->agi_seqno)].pagi_freecount += newlen; + args.mp->m_perag[agno].pagi_freecount += newlen; up_read(&args.mp->m_peraglock); agi->agi_newino = cpu_to_be32(newino); /* * Insert records describing the new inode chunk into the btree. */ - cur = xfs_btree_init_cursor(args.mp, tp, agbp, - be32_to_cpu(agi->agi_seqno), + cur = xfs_btree_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO, (xfs_inode_t *)0, 0); for (thisino = newino; thisino < newino + newlen; @@ -1384,6 +1385,7 @@ xfs_ialloc_read_agi( pag = &mp->m_perag[agno]; if (!pag->pagi_init) { pag->pagi_freecount = be32_to_cpu(agi->agi_freecount); + pag->pagi_count = be32_to_cpu(agi->agi_count); pag->pagi_init = 1; } else { /* @@ -1407,3 +1409,22 @@ xfs_ialloc_read_agi( *bpp = bp; return 0; } + +/* + * Read in the agi to initialise the per-ag data in the mount structure + */ +int +xfs_ialloc_pagi_init( + xfs_mount_t *mp, /* file system mount structure */ + xfs_trans_t *tp, /* transaction pointer */ + xfs_agnumber_t agno) /* allocation group number */ +{ + xfs_buf_t *bp = NULL; + int error; + + if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp))) + return error; + if (bp) + xfs_trans_brelse(tp, bp); + return 0; +} Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2006-04-03 08:50:15.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2006-04-11 16:07:46.704843422 +1000 @@ -68,6 +68,7 @@ typedef struct xfs_agf { __be32 agf_flcount; /* count 
of blocks in freelist */ __be32 agf_freeblks; /* total free blocks */ __be32 agf_longest; /* longest free space */ + __be32 agf_btreeblks; /* # of blocks held in AGF btrees */ } xfs_agf_t; #define XFS_AGF_MAGICNUM 0x00000001 @@ -81,7 +82,8 @@ typedef struct xfs_agf { #define XFS_AGF_FLCOUNT 0x00000100 #define XFS_AGF_FREEBLKS 0x00000200 #define XFS_AGF_LONGEST 0x00000400 -#define XFS_AGF_NUM_BITS 11 +#define XFS_AGF_BTREEBLKS 0x00000800 +#define XFS_AGF_NUM_BITS 12 #define XFS_AGF_ALL_BITS ((1 << XFS_AGF_NUM_BITS) - 1) /* disk block (xfs_daddr_t) in the AG */ @@ -186,11 +188,13 @@ typedef struct xfs_perag __uint32_t pagf_flcount; /* count of blocks in freelist */ xfs_extlen_t pagf_freeblks; /* total free blocks */ xfs_extlen_t pagf_longest; /* longest free space */ + __uint32_t pagf_btreeblks; /* # of blocks held in AGF btrees */ xfs_agino_t pagi_freecount; /* number of free inodes */ + xfs_agino_t pagi_count; /* number of allocated inodes */ + int pagb_count; /* pagb slots in use */ #ifdef __KERNEL__ lock_t pagb_lock; /* lock for pagb_list */ #endif - int pagb_count; /* pagb slots in use */ xfs_perag_busy_t *pagb_list; /* unstable blocks */ } xfs_perag_t; Index: 2.6.x-xfs-new/fs/xfs/xfs_sb.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_sb.h 2006-03-20 11:41:02.814713148 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_sb.h 2006-04-11 16:07:46.712654836 +1000 @@ -78,7 +78,7 @@ struct xfs_mount; */ #define XFS_SB_VERSION2_REALFBITS 0x00ffffff /* Mask: features */ #define XFS_SB_VERSION2_RESERVED1BIT 0x00000001 -#define XFS_SB_VERSION2_RESERVED2BIT 0x00000002 +#define XFS_SB_VERSION2_LAZYSBCOUNTBIT 0x00000002 /* Superblk counters */ #define XFS_SB_VERSION2_RESERVED4BIT 0x00000004 #define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */ #define XFS_SB_VERSION2_SASHFBITS 0xff000000 /* Mask: features that @@ -86,7 +86,8 @@ struct xfs_mount; PROM and SASH */ #define XFS_SB_VERSION2_OKREALFBITS \ - 
(XFS_SB_VERSION2_ATTR2BIT) + (XFS_SB_VERSION2_LAZYSBCOUNTBIT | \ + XFS_SB_VERSION2_ATTR2BIT) #define XFS_SB_VERSION2_OKSASHFBITS \ (0) #define XFS_SB_VERSION2_OKREALBITS \ @@ -188,6 +189,9 @@ typedef enum { #define XFS_SB_SHARED_VN XFS_SB_MVAL(SHARED_VN) #define XFS_SB_UNIT XFS_SB_MVAL(UNIT) #define XFS_SB_WIDTH XFS_SB_MVAL(WIDTH) +#define XFS_SB_ICOUNT XFS_SB_MVAL(ICOUNT) +#define XFS_SB_IFREE XFS_SB_MVAL(IFREE) +#define XFS_SB_FDBLOCKS XFS_SB_MVAL(FDBLOCKS) #define XFS_SB_FEATURES2 XFS_SB_MVAL(FEATURES2) #define XFS_SB_NUM_BITS ((int)XFS_SBS_FIELDCOUNT) #define XFS_SB_ALL_BITS ((1LL << XFS_SB_NUM_BITS) - 1) @@ -195,7 +199,7 @@ typedef enum { (XFS_SB_UUID | XFS_SB_ROOTINO | XFS_SB_RBMINO | XFS_SB_RSUMINO | \ XFS_SB_VERSIONNUM | XFS_SB_UQUOTINO | XFS_SB_GQUOTINO | \ XFS_SB_QFLAGS | XFS_SB_SHARED_VN | XFS_SB_UNIT | XFS_SB_WIDTH | \ - XFS_SB_FEATURES2) + XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2) /* @@ -427,6 +431,13 @@ static inline int xfs_sb_version_hasmore * ((sbp)->sb_features2 & XFS_SB_VERSION2_FUNBIT) */ +#define XFS_SB_VERSION_LAZYSBCOUNT(sbp) xfs_sb_version_haslazysbcount(sbp) +static inline int xfs_sb_version_haslazysbcount(xfs_sb_t *sbp) +{ + return (XFS_SB_VERSION_HASMOREBITS(sbp) && \ + ((sbp)->sb_features2 & XFS_SB_VERSION2_LAZYSBCOUNTBIT)); +} + #define XFS_SB_VERSION_HASATTR2(sbp) xfs_sb_version_hasattr2(sbp) static inline int xfs_sb_version_hasattr2(xfs_sb_t *sbp) { Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.c 2006-04-11 10:05:35.457757222 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.c 2006-04-11 16:07:46.796627539 +1000 @@ -635,6 +635,60 @@ xfs_mount_common(xfs_mount_t *mp, xfs_sb sbp->sb_inopblock); mp->m_ialloc_blks = mp->m_ialloc_inos >> sbp->sb_inopblog; } + +/* + * xfs_initialize_perag_data + * + * Read in each per-ag structure so we can count up the number of + * allocated inodes, free inodes and used filesystem blocks 
as this + * information is no longer persistent in the superblock. Once we have + * this information, write it into the in-core superblock structure. + */ +STATIC int +xfs_initialize_perag_data(xfs_mount_t *mp, xfs_agnumber_t agcount) +{ + xfs_agnumber_t index; + xfs_perag_t *pag; + xfs_sb_t *sbp = &mp->m_sb; + uint64_t ifree = 0; + uint64_t ialloc = 0; + uint64_t bfree = 0; + uint64_t bfreelst = 0; + uint64_t btree = 0; + int error; + int s; + + for (index = 0; index < agcount; index++) { + /* + * read the agf, then the agi. This gets us + * all the inforamtion we need and populates the + * per-ag structures for us. + */ + error = xfs_alloc_pagf_init(mp, NULL, index, 0); + if (error) + return error; + + error = xfs_ialloc_pagi_init(mp, NULL, index); + if (error) + return error; + pag = &mp->m_perag[index]; + ifree += pag->pagi_freecount; + ialloc += pag->pagi_count; + bfree += pag->pagf_freeblks; + bfreelst += pag->pagf_flcount; + btree += pag->pagf_btreeblks; + } + /* + * Overwrite incore superblock counters with just-read data + */ + s = XFS_SB_LOCK(mp); + sbp->sb_ifree = ifree; + sbp->sb_icount = ialloc; + sbp->sb_fdblocks = bfree + bfreelst + btree; + XFS_SB_UNLOCK(mp, s); + return 0; +} + /* * xfs_mountfs * @@ -1051,6 +1105,28 @@ xfs_mountfs( } /* + * Now recovery has completed, we can initialise incore + * superblock counters from the per-ag data. These may not + * be correct if the filesystem was not cleanly unmounted, + * so we need to wait for recovery to finish before doing this. + * + * If the filesystem was cleanly unmounted, then we can trust + * the values in the superblock to be correct and we don't need + * to do anything here. + * + * If we are currently making the filesystem, the initialisation + * will fail as the perag data is in an undefined state. 
+ */ + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb) && + !XFS_LAST_UNMOUNT_WAS_CLEAN(mp) && + !mp->m_sb.sb_inprogress) { + error = xfs_initialize_perag_data(mp, sbp->sb_agcount); + if (error) { + goto error4; + } + } + + /* * Complete the quota initialisation, post-log-replay component. */ if ((error = XFS_QM_MOUNT(mp, quotamount, quotaflags, mfsi_flags))) @@ -1160,27 +1236,102 @@ xfs_unmountfs_wait(xfs_mount_t *mp) xfs_wait_buftarg(mp->m_ddev_targp); } +/* + * xfs_log_sbcount + * + * Called either periodically to keep the on disk superblock values + * roughly up to date or from unmount to make sure the values are + * correct on a clean unmount. + */ +int +xfs_log_sbcount( + xfs_mount_t *mp, + uint flags) /* 0 = no log force, + non-zero = log force flags */ +{ + xfs_trans_t *tp; + int error; + + if (flags & XFS_LOG_SYNC) + xfs_icsb_sync_counters(mp); + else + xfs_icsb_sync_counters_lazy(mp); + + /* + * we don't need to write the superblock if we are updating the + * counters on every modification. + */ + if (!XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) + return 0; + + tp = xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT); + error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0, + XFS_DEFAULT_LOG_COUNT); + if (error) { + xfs_trans_cancel(tp, 0); + return error; + } + + /* + * modify the superblock and commit the transaction + * to disk - flags will be set to indicate if and how + * we need to force the log. + */ + xfs_mod_sb(tp, (XFS_SB_IFREE | + XFS_SB_ICOUNT | + XFS_SB_FDBLOCKS)); + + (void)xfs_trans_commit(tp, 0, NULL); + + if (flags) { + xfs_log_force(mp, (xfs_lsn_t)0, flags); + } + + return 0; +} + int xfs_unmountfs_writesb(xfs_mount_t *mp) { xfs_buf_t *sbp; xfs_sb_t *sb; int error = 0; + int s; /* * skip superblock write if fs is read-only, or * if we are doing a forced umount. 
*/ - sbp = xfs_getsb(mp, 0); if (!(XFS_MTOVFS(mp)->vfs_flag & VFS_RDONLY || XFS_FORCED_SHUTDOWN(mp))) { + /* + * Write into superblock the fields that we haven't + * been logging - allocated/free inode and free block + * counts - from the incore superblock. + */ + error = xfs_log_sbcount(mp, (XFS_LOG_FORCE|XFS_LOG_SYNC)); - xfs_icsb_sync_counters(mp); + sbp = xfs_getsb(mp, 0); + sb = XFS_BUF_TO_SBP(sbp); + if (error) { + /* + * Hmmm - failed to get log reservations so just + * do the mod without a transaction. Whine about + * it, too. + */ + ASSERT(XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)); + xfs_fs_cmn_err(CE_NOTE, mp, + "Unmounting, non-transactional sb update"); + s = XFS_SB_LOCK(mp); + INT_SET(sb->sb_icount, ARCH_CONVERT, mp->m_sb.sb_icount); + INT_SET(sb->sb_ifree, ARCH_CONVERT, mp->m_sb.sb_ifree); + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); + XFS_SB_UNLOCK(mp, s); + } /* * mark shared-readonly if desired */ - sb = XFS_BUF_TO_SBP(sbp); if (mp->m_mk_sharedro) { if (!(sb->sb_flags & XFS_SBF_READONLY)) sb->sb_flags |= XFS_SBF_READONLY; @@ -1189,6 +1340,7 @@ xfs_unmountfs_writesb(xfs_mount_t *mp) xfs_fs_cmn_err(CE_NOTE, mp, "Unmounting, marking shared read-only"); } + XFS_BUF_UNDONE(sbp); XFS_BUF_UNREAD(sbp); XFS_BUF_UNDELAYWRITE(sbp); @@ -1203,8 +1355,8 @@ xfs_unmountfs_writesb(xfs_mount_t *mp) mp, sbp, XFS_BUF_ADDR(sbp)); if (error && mp->m_mk_sharedro) xfs_fs_cmn_err(CE_ALERT, mp, "Superblock write error detected while unmounting. Filesystem may not be marked shared readonly"); + xfs_buf_relse(sbp); } - xfs_buf_relse(sbp); return error; } Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2006-04-11 10:05:35.439205062 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2006-04-11 16:07:46.804438954 +1000 @@ -1524,12 +1524,19 @@ xfs_syncsub( * If this is the periodic sync, then kick some entries out of * the reference cache. 
This ensures that idle entries are * eventually kicked out of the cache. + * + * We also update the disk superblock with incore counter + * values if we are using non-persistent counters so that + * they don't get too far out of sync if we crash or get a + * forced shutdown. We don't want to force this to disk, + * just get a transaction into the iclogs.... */ if (flags & SYNC_REFCACHE) { if (flags & SYNC_WAIT) xfs_refcache_purge_mp(mp); else xfs_refcache_purge_some(mp); + xfs_log_sbcount(mp, 0); } /* Index: 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_recover.c 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c 2006-04-11 16:07:46.823967489 +1000 @@ -929,6 +929,14 @@ xlog_find_tail( ASSIGN_ANY_LSN_HOST(log->l_last_sync_lsn, log->l_curr_cycle, after_umount_blk); *tail_blk = after_umount_blk; + + /* + * Note that the unmount was clean. If the unmount + * was not clean, we need to know this to rebuild the + * superblock counters from the perag headers if we + * have a filesystem using non-persistent counters. + */ + log->l_mp->m_flags |= XFS_MOUNT_WAS_CLEAN; } } Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc_btree.c 2006-03-20 11:41:02.853770346 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc_btree.c 2006-04-11 16:07:46.987030762 +1000 @@ -226,7 +226,7 @@ xfs_alloc_delrec( * Put this buffer/block on the ag's freelist. */ if ((error = xfs_alloc_put_freelist(cur->bc_tp, - cur->bc_private.a.agbp, NULL, bno))) + cur->bc_private.a.agbp, NULL, bno, 1))) return error; /* * Since blocks move to the free list without the @@ -551,7 +551,7 @@ xfs_alloc_delrec( * Free the deleting block by putting it on the freelist. 
*/ if ((error = xfs_alloc_put_freelist(cur->bc_tp, cur->bc_private.a.agbp, - NULL, rbno))) + NULL, rbno, 1))) return error; /* * Since blocks move to the free list without the coordination @@ -1317,7 +1317,7 @@ xfs_alloc_newroot( * Get a buffer from the freelist blocks, for the new root. */ if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &nbno))) + &nbno, 1))) return error; /* * None available, we fail. @@ -1601,7 +1601,7 @@ xfs_alloc_split( * If we can't do it, we're toast. Give up. */ if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &rbno))) + &rbno, 1))) return error; if (rbno == NULLAGBLOCK) { *stat = 0; Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.h 2006-04-03 08:50:15.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.h 2006-04-11 16:07:47.092484854 +1000 @@ -114,7 +114,8 @@ int /* error */ xfs_alloc_get_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop); /* block address retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk); /* destination is a AGF btree */ /* * Log the given fields from the agf structure. @@ -143,7 +144,8 @@ xfs_alloc_put_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer for a.g. freelist header */ struct xfs_buf *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno); /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk); /* owner was a AGF btree */ /* * Read in the allocation group header (free/alloc section). 
Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.h 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_trans.h 2006-04-11 16:07:47.096390561 +1000 @@ -98,7 +98,8 @@ typedef struct xfs_trans_header { #define XFS_TRANS_GROWFSRT_ZERO 38 #define XFS_TRANS_GROWFSRT_FREE 39 #define XFS_TRANS_SWAPEXT 40 -#define XFS_TRANS_TYPE_MAX 40 +#define XFS_TRANS_SB_COUNT 41 +#define XFS_TRANS_TYPE_MAX 41 /* new transaction types need to be reflected in xfs_logprint(8) */ Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.h 2006-03-20 11:41:02.857676066 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.h 2006-04-11 16:07:47.097366988 +1000 @@ -149,6 +149,16 @@ xfs_ialloc_read_agi( xfs_agnumber_t agno, /* allocation group number */ struct xfs_buf **bpp); /* allocation group hdr buf */ +/* + * Read in the allocation group header to initialise the per-ag data + * in the mount structure + */ +int +xfs_ialloc_pagi_init( + struct xfs_mount *mp, /* file system mount structure */ + struct xfs_trans *tp, /* transaction pointer */ + xfs_agnumber_t agno); /* allocation group number */ + #endif /* __KERNEL__ */ #endif /* __XFS_IALLOC_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfsidbg.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfsidbg.c 2006-04-11 10:05:35.450922216 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfsidbg.c 2006-04-11 16:07:47.110060536 +1000 @@ -6482,6 +6482,8 @@ xfsidbg_print_trans_type(unsigned int t_ case XFS_TRANS_GROWFSRT_ALLOC: kdb_printf("GROWFSRT_ALLOC"); break; case XFS_TRANS_GROWFSRT_ZERO: kdb_printf("GROWFSRT_ZERO"); break; case XFS_TRANS_GROWFSRT_FREE: kdb_printf("GROWFSRT_FREE"); break; + case XFS_TRANS_SWAPEXT: kdb_printf("SWAPEXT"); break; + case XFS_TRANS_SB_COUNT: kdb_printf("SB_COUNT"); break; default: kdb_printf("unknown(0x%x)", 
t_type); break; } } Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.h 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.h 2006-04-11 16:07:47.112989816 +1000 @@ -415,7 +415,7 @@ typedef struct xfs_mount { for space allocations */ #define XFS_MOUNT_INO64 (1ULL << 1) /* (1ULL << 2) -- currently unused */ - /* (1ULL << 3) -- currently unused */ +#define XFS_MOUNT_WAS_CLEAN (1ULL << 3) #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem operations, typically for disk errors in metadata */ @@ -492,6 +492,8 @@ xfs_preferred_iosize(xfs_mount_t *mp) #define XFS_MAXIOFFSET(mp) ((mp)->m_maxioffset) +#define XFS_LAST_UNMOUNT_WAS_CLEAN(mp) \ + ((mp)->m_flags & XFS_MOUNT_WAS_CLEAN) #define XFS_FORCED_SHUTDOWN(mp) ((mp)->m_flags & XFS_MOUNT_FS_SHUTDOWN) #define xfs_force_shutdown(m,f) \ VFS_FORCE_SHUTDOWN((XFS_MTOVFS(m)), f, __FILE__, __LINE__) @@ -570,6 +572,7 @@ typedef struct xfs_mod_sb { extern xfs_mount_t *xfs_mount_init(void); extern void xfs_mod_sb(xfs_trans_t *, __int64_t); +extern int xfs_log_sbcount(xfs_mount_t *, uint); extern void xfs_mount_free(xfs_mount_t *mp, int remove_bhv); extern int xfs_mountfs(struct vfs *, xfs_mount_t *mp, int); extern void xfs_mountfs_check_barriers(xfs_mount_t *mp); Index: 2.6.x-xfs-new/fs/xfs/xfs_fsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fsops.c 2006-04-03 08:50:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fsops.c 2006-04-11 16:07:47.114942670 +1000 @@ -96,6 +96,8 @@ xfs_fs_geometry( XFS_FSOP_GEOM_FLAGS_DIRV2 : 0) | (XFS_SB_VERSION_HASSECTOR(&mp->m_sb) ? XFS_FSOP_GEOM_FLAGS_SECTOR : 0) | + (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb) ? + XFS_FSOP_GEOM_FLAGS_LAZYSB : 0) | (XFS_SB_VERSION_HASATTR2(&mp->m_sb) ? XFS_FSOP_GEOM_FLAGS_ATTR2 : 0); geo->logsectsize = XFS_SB_VERSION_HASSECTOR(&mp->m_sb) ? 
Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h 2006-03-20 11:41:02.868416796 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h 2006-04-11 16:07:47.126659791 +1000 @@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_LOGV2 0x0100 /* log format version 2 */ #define XFS_FSOP_GEOM_FLAGS_SECTOR 0x0200 /* sector sizes >1BB */ #define XFS_FSOP_GEOM_FLAGS_ATTR2 0x0400 /* inline attributes rework */ +#define XFS_FSOP_GEOM_FLAGS_LAZYSB 0x4000 /* lazy superblock counters */ /* Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c 2006-04-03 08:50:15.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c 2006-04-11 16:07:47.182316118 +1000 @@ -557,7 +557,7 @@ xfs_flush_device( xfs_log_force(ip->i_mount, (xfs_lsn_t)0, XFS_LOG_FORCE|XFS_LOG_SYNC); } -#define SYNCD_FLAGS (SYNC_FSDATA|SYNC_BDFLUSH|SYNC_ATTR) +#define SYNCD_FLAGS (SYNC_FSDATA|SYNC_BDFLUSH|SYNC_ATTR|SYNC_REFCACHE) STATIC void vfs_sync_worker( vfs_t *vfsp, From owner-xfs@oss.sgi.com Thu Apr 19 16:19:08 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 16:19:11 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JNJ5fB000769 for ; Thu, 19 Apr 2007 16:19:07 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA15420; Fri, 20 Apr 2007 09:19:01 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JNJ0Af71272351; Fri, 20 Apr 2007 09:19:00 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JNIxRH71364971; Fri, 20 Apr 2007 09:18:59 
+1000 (AEST) Date: Fri, 20 Apr 2007 09:18:59 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review [1 of 3]: lazy superblock counters - fix interaction with per-cpu incore counters Message-ID: <20070419231859.GY48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11136 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs We need to reinitialise the per-cpu superblock counters after we have corrected them in the new recovery phase (summing the AGF/AGI counters). -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_mount.c | 53 ++++++++++++++++++++++++++++++++--------------------- 1 file changed, 32 insertions(+), 21 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.c 2007-04-19 13:46:42.536211213 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.c 2007-04-19 13:47:41.272642686 +1000 @@ -676,6 +676,10 @@ xfs_initialize_perag_data(xfs_mount_t *m sbp->sb_icount = ialloc; sbp->sb_fdblocks = bfree + bfreelst + btree; XFS_SB_UNLOCK(mp, s); + + /* Fixup the per-cpu counters as well. */ + xfs_icsb_reinit_counters(mp); + return 0; } @@ -1022,6 +1026,34 @@ xfs_mountfs( } /* + * Now the log is mounted, we know if it was an unclean shutdown or + * not. If it was, with the first phase of recovery has completed, we + * have consistent AG blocks on disk. We have not recovered EFIs yet, + * but they are recovered transactionally in the second recovery phase + * later. + * + * Hence we can safely re-initialise incore superblock counters from + * the per-ag data. These may not be correct if the filesystem was not + * cleanly unmounted, so we need to wait for recovery to finish before + * doing this. 
+ * + * If the filesystem was cleanly unmounted, then we can trust the + * values in the superblock to be correct and we don't need to do + * anything here. + * + * If we are currently making the filesystem, the initialisation will + * fail as the perag data is in an undefined state. + */ + + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb) && + !XFS_LAST_UNMOUNT_WAS_CLEAN(mp) && + !mp->m_sb.sb_inprogress) { + error = xfs_initialize_perag_data(mp, sbp->sb_agcount); + if (error) { + goto error2; + } + } + /* * Get and sanity-check the root inode. * Save the pointer to it in the mount structure. */ @@ -1084,27 +1116,6 @@ xfs_mountfs( goto error4; } - /* - * Now recovery has completed, we can initialise incore - * superblock counters from the per-ag data. These may not - * be correct if the filesystem was not cleanly unmounted, - * so we need to wait for recovery to finish before doing this. - * - * If the filesystem was cleanly unmounted, then we can trust - * the values in the superblock to be correct and we don't need - * to do anything here. - * - * If we are currently making the filesystem, the initialisation - * will fail as the perag data is in an undefined state. - */ - if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb) && - !XFS_LAST_UNMOUNT_WAS_CLEAN(mp) && - !mp->m_sb.sb_inprogress) { - error = xfs_initialize_perag_data(mp, sbp->sb_agcount); - if (error) { - goto error4; - } - } /* * Complete the quota initialisation, post-log-replay component. 
From owner-xfs@oss.sgi.com Thu Apr 19 16:21:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 16:21:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3JNLgfB001650 for ; Thu, 19 Apr 2007 16:21:44 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA15499; Fri, 20 Apr 2007 09:21:35 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3JNLYAf71378703; Fri, 20 Apr 2007 09:21:35 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3JNLXQW71341069; Fri, 20 Apr 2007 09:21:33 +1000 (AEST) Date: Fri, 20 Apr 2007 09:21:33 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review [3 of 3]: lazy superblock counters - userspace bits Message-ID: <20070419232133.GZ48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11137 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs userspace tool support for lazy superblock accounting. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- xfsprogs/db/agf.c | 1 xfsprogs/db/check.c | 25 +++++++++++++++++ xfsprogs/db/sb.c | 2 + xfsprogs/growfs/xfs_growfs.c | 13 ++++++--- xfsprogs/include/libxfs.h | 1 xfsprogs/include/xfs_ag.h | 8 ++++- xfsprogs/include/xfs_alloc.h | 6 ++-- xfsprogs/include/xfs_fs.h | 1 xfsprogs/include/xfs_ialloc.h | 10 +++++++ xfsprogs/include/xfs_mount.h | 5 ++- xfsprogs/include/xfs_sb.h | 15 +++++++++- xfsprogs/include/xfs_trans.h | 3 +- xfsprogs/libxfs/init.c | 14 ++++++++++ xfsprogs/libxfs/xfs.h | 6 ++-- xfsprogs/libxfs/xfs_alloc.c | 49 +++++++++++++++++++++++++++-------- xfsprogs/libxfs/xfs_alloc_btree.c | 8 ++--- xfsprogs/libxfs/xfs_ialloc.c | 21 +++++++++++++++ xfsprogs/libxfs/xfs_mount.c | 53 ++++++++++++++++++++++++++++++++++++++ xfsprogs/man/man8/mkfs.xfs.8 | 23 +++++++++++++++- xfsprogs/mkfs/xfs_mkfs.c | 22 ++++++++++++--- xfsprogs/repair/phase5.c | 22 +++++++++++++++ 21 files changed, 273 insertions(+), 35 deletions(-) Index: xfs-cmds/xfsprogs/db/agf.c =================================================================== --- xfs-cmds.orig/xfsprogs/db/agf.c 2006-11-15 19:00:28.000000000 +1100 +++ xfs-cmds/xfsprogs/db/agf.c 2007-04-03 14:52:51.659780004 +1000 @@ -68,6 +68,7 @@ const field_t agf_flds[] = { { "flcount", FLDT_UINT32D, OI(OFF(flcount)), C1, 0, TYP_NONE }, { "freeblks", FLDT_EXTLEN, OI(OFF(freeblks)), C1, 0, TYP_NONE }, { "longest", FLDT_EXTLEN, OI(OFF(longest)), C1, 0, TYP_NONE }, + { "btreeblks", FLDT_UINT32D, OI(OFF(btreeblks)), C1, 0, TYP_NONE }, { NULL } }; Index: xfs-cmds/xfsprogs/db/check.c =================================================================== --- xfs-cmds.orig/xfsprogs/db/check.c 2006-12-22 11:49:54.000000000 +1100 +++ xfs-cmds/xfsprogs/db/check.c 2007-04-03 14:52:51.675777913 +1000 @@ -110,6 +110,9 @@ typedef struct dirhash { static xfs_extlen_t agffreeblks; static xfs_extlen_t agflongest; +static __uint64_t agf_aggr_freeblks; /* aggregate count over all */ +static 
__uint32_t agfbtreeblks; +static int lazycount; static xfs_agino_t agicount; static xfs_agino_t agifreecount; static xfs_fsblock_t *blist; @@ -854,6 +857,12 @@ blockget_f( mp->m_sb.sb_fdblocks, fdblocks); error++; } + if (lazycount && mp->m_sb.sb_fdblocks != agf_aggr_freeblks) { + if (!sflag) + dbprintf("sb_fdblocks %lld, aggregate AGF count %lld\n", + mp->m_sb.sb_fdblocks, agf_aggr_freeblks); + error++; + } if (mp->m_sb.sb_frextents != frextents) { if (!sflag) dbprintf("sb_frextents %lld, counted %lld\n", @@ -3886,6 +3895,7 @@ scan_ag( xfs_sb_t *sb=&tsb; agffreeblks = agflongest = 0; + agfbtreeblks = -2; agicount = agifreecount = 0; push_cur(); set_cur(&typtab[TYP_SB], @@ -3914,6 +3924,9 @@ scan_ag( error++; sbver_err++; } + if (!lazycount && XFS_SB_VERSION_LAZYSBCOUNT(sb)) { + lazycount = 1; + } if (agno == 0 && sb->sb_inprogress != 0) { if (!sflag) dbprintf("mkfs not completed successfully\n"); @@ -4010,6 +4023,15 @@ scan_ag( agflongest, agno); error++; } + if (lazycount && + INT_GET(agf->agf_btreeblks, ARCH_CONVERT) != agfbtreeblks) { + if (!sflag) + dbprintf("agf_btreeblks %u, counted %u in ag %u\n", + INT_GET(agf->agf_btreeblks, ARCH_CONVERT), + agfbtreeblks, agno); + error++; + } + agf_aggr_freeblks += agffreeblks + agfbtreeblks; if (INT_GET(agi->agi_count, ARCH_CONVERT) != agicount) { if (!sflag) dbprintf("agi_count %u, counted %u in ag %u\n", @@ -4086,6 +4108,7 @@ scan_freelist( error++; } fdblocks += count; + agf_aggr_freeblks += count; pop_cur(); } @@ -4242,6 +4265,7 @@ scanfunc_bno( return; } fdblocks++; + agfbtreeblks++; if (INT_GET(block->bb_level, ARCH_CONVERT) != level) { if (!sflag) dbprintf("expected level %d got %d in btbno block " @@ -4317,6 +4341,7 @@ scanfunc_cnt( return; } fdblocks++; + agfbtreeblks++; if (INT_GET(block->bb_level, ARCH_CONVERT) != level) { if (!sflag) dbprintf("expected level %d got %d in btcnt block " Index: xfs-cmds/xfsprogs/db/sb.c =================================================================== --- 
xfs-cmds.orig/xfsprogs/db/sb.c 2006-11-15 19:00:29.000000000 +1100 +++ xfs-cmds/xfsprogs/db/sb.c 2007-04-03 14:52:51.683776868 +1000 @@ -608,6 +608,8 @@ version_string( strcat(s, ",MOREBITS"); if (XFS_SB_VERSION_HASATTR2(sbp)) strcat(s, ",ATTR2"); + if (XFS_SB_VERSION_LAZYSBCOUNT(sbp)) + strcat(s, ",LAZYSBCOUNT"); return s; } Index: xfs-cmds/xfsprogs/growfs/xfs_growfs.c =================================================================== --- xfs-cmds.orig/xfsprogs/growfs/xfs_growfs.c 2006-12-22 11:49:54.000000000 +1100 +++ xfs-cmds/xfsprogs/growfs/xfs_growfs.c 2007-04-03 14:52:51.703774255 +1000 @@ -59,6 +59,7 @@ report_info( char *logname, char *rtname, int unwritten, + int lazycount, int dirversion, int logversion, int attrversion) @@ -70,7 +71,7 @@ report_info( " =%-22s sunit=%-6u swidth=%u blks, unwritten=%u\n" "naming =version %-14u bsize=%-6u\n" "log =%-22s bsize=%-6u blocks=%u, version=%u\n" - " =%-22s sectsz=%-5u sunit=%u blks\n" + " =%-22s sectsz=%-5u sunit=%u blks, lazy-count=%u\n" "realtime =%-22s extsz=%-6u blocks=%llu, rtextents=%llu\n"), mntpoint, geo.inodesize, geo.agcount, geo.agblocks, @@ -81,7 +82,7 @@ report_info( dirversion, geo.dirblocksize, isint ? _("internal") : logname ? logname : _("external"), geo.blocksize, geo.logblocks, logversion, - "", geo.logsectsize, geo.logsunit / geo.blocksize, + "", geo.logsectsize, geo.logsunit / geo.blocksize, lazycount, !geo.rtblocks ? _("none") : rtname ? rtname : _("external"), geo.rtextsize * geo.blocksize, (unsigned long long)geo.rtblocks, (unsigned long long)geo.rtextents); @@ -115,6 +116,7 @@ main(int argc, char **argv) int rflag; /* -r flag */ long long rsize; /* new rt size in fs blocks */ int unwritten; /* unwritten extent flag */ + int lazycount; /* lazy superblock counters */ int xflag; /* -x flag */ char *fname; /* mount point name */ char *datadev; /* data device name */ @@ -235,6 +237,7 @@ main(int argc, char **argv) } isint = geo.logstart > 0; unwritten = geo.flags & XFS_FSOP_GEOM_FLAGS_EXTFLG ? 
1 : 0; + lazycount = geo.flags & XFS_FSOP_GEOM_FLAGS_LAZYSB ? 1 : 0; dirversion = geo.flags & XFS_FSOP_GEOM_FLAGS_DIRV2 ? 2 : 1; logversion = geo.flags & XFS_FSOP_GEOM_FLAGS_LOGV2 ? 2 : 1; attrversion = geo.flags & XFS_FSOP_GEOM_FLAGS_ATTR2 ? 2 : \ @@ -242,7 +245,8 @@ main(int argc, char **argv) if (nflag) { report_info(geo, datadev, isint, logdev, rtdev, - unwritten, dirversion, logversion, attrversion); + unwritten, lazycount, dirversion, logversion, + attrversion); exit(0); } @@ -278,7 +282,8 @@ main(int argc, char **argv) } report_info(geo, datadev, isint, logdev, rtdev, - unwritten, dirversion, logversion, attrversion); + unwritten, lazycount, dirversion, logversion, + attrversion); ddsize = xi.dsize; dlsize = ( xi.logBBsize? xi.logBBsize : Index: xfs-cmds/xfsprogs/include/xfs_fs.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_fs.h 2007-02-08 00:43:21.546422620 +1100 +++ xfs-cmds/xfsprogs/include/xfs_fs.h 2007-04-03 14:52:51.731770597 +1000 @@ -240,6 +240,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_LOGV2 0x0100 /* log format version 2 */ #define XFS_FSOP_GEOM_FLAGS_SECTOR 0x0200 /* sector sizes >1BB */ #define XFS_FSOP_GEOM_FLAGS_ATTR2 0x0400 /* inline attributes rework */ +#define XFS_FSOP_GEOM_FLAGS_LAZYSB 0x4000 /* lazy superblock counters */ /* Index: xfs-cmds/xfsprogs/include/xfs_sb.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_sb.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_sb.h 2007-04-03 14:52:51.739769552 +1000 @@ -87,7 +87,8 @@ struct xfs_mount; PROM and SASH */ #define XFS_SB_VERSION2_OKREALFBITS \ - (XFS_SB_VERSION2_ATTR2BIT) + (XFS_SB_VERSION2_ATTR2BIT | \ + XFS_SB_VERSION2_LAZYSBCOUNTBIT) #define XFS_SB_VERSION2_OKSASHFBITS \ (0) #define XFS_SB_VERSION2_OKREALBITS \ @@ -189,6 +190,9 @@ typedef enum { #define XFS_SB_SHARED_VN XFS_SB_MVAL(SHARED_VN) #define XFS_SB_UNIT 
XFS_SB_MVAL(UNIT) #define XFS_SB_WIDTH XFS_SB_MVAL(WIDTH) +#define XFS_SB_ICOUNT XFS_SB_MVAL(ICOUNT) +#define XFS_SB_IFREE XFS_SB_MVAL(IFREE) +#define XFS_SB_FDBLOCKS XFS_SB_MVAL(FDBLOCKS) #define XFS_SB_FEATURES2 XFS_SB_MVAL(FEATURES2) #define XFS_SB_NUM_BITS ((int)XFS_SBS_FIELDCOUNT) #define XFS_SB_ALL_BITS ((1LL << XFS_SB_NUM_BITS) - 1) @@ -196,7 +200,7 @@ typedef enum { (XFS_SB_UUID | XFS_SB_ROOTINO | XFS_SB_RBMINO | XFS_SB_RSUMINO | \ XFS_SB_VERSIONNUM | XFS_SB_UQUOTINO | XFS_SB_GQUOTINO | \ XFS_SB_QFLAGS | XFS_SB_SHARED_VN | XFS_SB_UNIT | XFS_SB_WIDTH | \ - XFS_SB_FEATURES2) + XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2) /* @@ -428,6 +432,13 @@ static inline int xfs_sb_version_hasmore * ((sbp)->sb_features2 & XFS_SB_VERSION2_FUNBIT) */ +#define XFS_SB_VERSION_LAZYSBCOUNT(sbp) xfs_sb_version_haslazysbcount(sbp) +static inline int xfs_sb_version_haslazysbcount(xfs_sb_t *sbp) +{ + return (XFS_SB_VERSION_HASMOREBITS(sbp) && + ((sbp)->sb_features2 & XFS_SB_VERSION2_LAZYSBCOUNTBIT)); +} + #define XFS_SB_VERSION_HASATTR2(sbp) xfs_sb_version_hasattr2(sbp) static inline int xfs_sb_version_hasattr2(xfs_sb_t *sbp) { Index: xfs-cmds/xfsprogs/mkfs/xfs_mkfs.c =================================================================== --- xfs-cmds.orig/xfsprogs/mkfs/xfs_mkfs.c 2006-12-22 11:49:54.000000000 +1100 +++ xfs-cmds/xfsprogs/mkfs/xfs_mkfs.c 2007-04-03 14:52:51.771765371 +1000 @@ -118,6 +118,8 @@ char *lopts[] = { "file", #define L_NAME 10 "name", +#define L_LAZYSBCNTR 11 + "lazy-count", NULL }; @@ -602,6 +604,7 @@ main( libxfs_init_t xi; int xlv_dsunit; int xlv_dswidth; + int lazy_sb_counters; progname = basename(argv[0]); setlocale(LC_ALL, ""); @@ -631,6 +634,7 @@ main( extent_flagging = 1; force_overwrite = 0; worst_freelist = 0; + lazy_sb_counters = 0; bzero(&fsx, sizeof(fsx)); bzero(&xi, sizeof(xi)); @@ -1082,6 +1086,15 @@ main( libxfs_highbit32(lsectorsize); lssflag = 1; break; + case L_LAZYSBCNTR: + if (!value) + reqval('l', lopts, + 
L_LAZYSBCNTR); + c = atoi(value); + if (c < 0 || c > 1) + illegal(value, "l lazy-count"); + lazy_sb_counters = c; + break; default: unknown('l', value); } @@ -1914,7 +1927,7 @@ an AG size that is one stripe unit small " =%-22s sunit=%-6u swidth=%u blks, unwritten=%u\n" "naming =version %-14u bsize=%-6u\n" "log =%-22s bsize=%-6d blocks=%lld, version=%d\n" - " =%-22s sectsz=%-5u sunit=%d blks\n" + " =%-22s sectsz=%-5u sunit=%d blks, lazy-count=%d\n" "realtime =%-22s extsz=%-6d blocks=%lld, rtextents=%lld\n"), dfile, isize, (long long)agcount, (long long)agsize, "", sectorsize, attrversion, @@ -1923,7 +1936,7 @@ an AG size that is one stripe unit small "", dsunit, dswidth, extent_flagging, dirversion, dirversion == 1 ? blocksize : dirblocksize, logfile, 1 << blocklog, (long long)logblocks, - logversion, "", lsectorsize, lsunit, + logversion, "", lsectorsize, lsunit, lazy_sb_counters, rtfile, rtextblocks << blocklog, (long long)rtblocks, (long long)rtextents); if (Nflag) @@ -1985,7 +1998,7 @@ an AG size that is one stripe unit small sbp->sb_logsectlog = 0; sbp->sb_logsectsize = 0; } - sbp->sb_features2 = XFS_SB_VERSION2_MKFS(0, attrversion == 2, 0); + sbp->sb_features2 = XFS_SB_VERSION2_MKFS(lazy_sb_counters, attrversion == 2, 0); sbp->sb_versionnum = XFS_SB_VERSION_MKFS( iaflag, dsunit != 0, extent_flagging, dirversion == 2, logversion == 2, attrversion == 1, @@ -2457,7 +2470,8 @@ usage( void ) sectlog=n|sectsize=num,unwritten=0|1]\n\ /* inode size */ [-i log=n|perblock=n|size=num,maxpct=n,attr=0|1|2]\n\ /* log subvol */ [-l agnum=n,internal,size=num,logdev=xxx,version=n\n\ - sunit=value|su=num,sectlog=n|sectsize=num]\n\ + sunit=value|su=num,sectlog=n|sectsize=num,\n\ + lazy-count=0|1]\n\ /* label */ [-L label (maximum 12 characters)]\n\ /* naming */ [-n log=n|size=num,version=n]\n\ /* prototype file */ [-p fname]\n\ Index: xfs-cmds/xfsprogs/repair/phase5.c =================================================================== --- xfs-cmds.orig/xfsprogs/repair/phase5.c 
2006-11-15 19:00:34.000000000 +1100 +++ xfs-cmds/xfsprogs/repair/phase5.c 2007-04-03 14:52:51.795762235 +1000 @@ -333,6 +333,10 @@ write_cursor(bt_status_t *curs) #endif if (curs->level[i].prev_buf_p != NULL) { ASSERT(curs->level[i].prev_agbno != NULLAGBLOCK); +#if defined(XR_BLD_FREE_TRACE) || defined(XR_BLD_INO_TRACE) + fprintf(stderr, "writing bt prev block %u\n", + curs->level[i].prev_agbno); +#endif libxfs_writebuf(curs->level[i].prev_buf_p, 0); } libxfs_writebuf(curs->level[i].buf_p, 0); @@ -1285,6 +1289,24 @@ build_agf_agfl(xfs_mount_t *mp, bcnt_bt->num_levels); INT_SET(agf->agf_freeblks, ARCH_CONVERT, freeblks); + /* + * Count and record the number of btree blocks consumed if required. + */ + if (XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) { + /* + * Don't count the root blocks as they are already + * accounted for. + */ + INT_SET(agf->agf_btreeblks, ARCH_CONVERT, + (bno_bt->num_tot_blocks - bno_bt->num_free_blocks) + + (bcnt_bt->num_tot_blocks - bcnt_bt->num_free_blocks) - + 2); +#ifdef XR_BLD_FREE_TRACE + fprintf(stderr, "agf->agf_btreeblks = %u\n", + INT_GET(agf->agf_btreeblks, ARCH_CONVERT)); +#endif + } + #ifdef XR_BLD_FREE_TRACE fprintf(stderr, "bno root = %u, bcnt root = %u, indices = %u %u\n", INT_GET(agf->agf_roots[XFS_BTNUM_BNO], ARCH_CONVERT), Index: xfs-cmds/xfsprogs/include/xfs_mount.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_mount.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_mount.h 2007-04-03 14:52:51.799761713 +1000 @@ -383,7 +383,7 @@ typedef struct xfs_mount { for space allocations */ #define XFS_MOUNT_INO64 (1ULL << 1) /* (1ULL << 2) -- currently unused */ - /* (1ULL << 3) -- currently unused */ +#define XFS_MOUNT_WAS_CLEAN (1ULL << 3) #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem operations, typically for disk errors in metadata */ @@ -460,6 +460,8 @@ xfs_preferred_iosize(xfs_mount_t *mp) #define XFS_MAXIOFFSET(mp) 
((mp)->m_maxioffset) +#define XFS_LAST_UNMOUNT_WAS_CLEAN(mp) \ + ((mp)->m_flags & XFS_MOUNT_WAS_CLEAN) #define XFS_FORCED_SHUTDOWN(mp) ((mp)->m_flags & XFS_MOUNT_FS_SHUTDOWN) #define xfs_force_shutdown(m,f) \ VFS_FORCE_SHUTDOWN((XFS_MTOVFS(m)), f, __FILE__, __LINE__) @@ -540,6 +542,7 @@ typedef struct xfs_mod_sb { extern xfs_mount_t *xfs_mount_init(void); extern void xfs_mod_sb(xfs_trans_t *, __int64_t); +extern int xfs_log_sbcount(xfs_mount_t *, uint); extern void xfs_mount_free(xfs_mount_t *mp, int remove_bhv); extern int xfs_mountfs(struct vfs *, xfs_mount_t *mp, int); extern void xfs_mountfs_check_barriers(xfs_mount_t *mp); Index: xfs-cmds/xfsprogs/include/xfs_trans.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_trans.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_trans.h 2007-04-03 14:52:51.803761190 +1000 @@ -98,7 +98,8 @@ typedef struct xfs_trans_header { #define XFS_TRANS_GROWFSRT_ZERO 38 #define XFS_TRANS_GROWFSRT_FREE 39 #define XFS_TRANS_SWAPEXT 40 -#define XFS_TRANS_TYPE_MAX 40 +#define XFS_TRANS_SB_COUNT 41 +#define XFS_TRANS_TYPE_MAX 41 /* new transaction types need to be reflected in xfs_logprint(8) */ Index: xfs-cmds/xfsprogs/include/xfs_ag.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_ag.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_ag.h 2007-04-03 14:52:51.811760145 +1000 @@ -68,6 +68,7 @@ typedef struct xfs_agf { __be32 agf_flcount; /* count of blocks in freelist */ __be32 agf_freeblks; /* total free blocks */ __be32 agf_longest; /* longest free space */ + __be32 agf_btreeblks; /* # of blocks held in AGF btrees */ } xfs_agf_t; #define XFS_AGF_MAGICNUM 0x00000001 @@ -81,7 +82,8 @@ typedef struct xfs_agf { #define XFS_AGF_FLCOUNT 0x00000100 #define XFS_AGF_FREEBLKS 0x00000200 #define XFS_AGF_LONGEST 0x00000400 -#define XFS_AGF_NUM_BITS 11 +#define XFS_AGF_BTREEBLKS 0x00000800 
+#define XFS_AGF_NUM_BITS 12 #define XFS_AGF_ALL_BITS ((1 << XFS_AGF_NUM_BITS) - 1) /* disk block (xfs_daddr_t) in the AG */ @@ -186,11 +188,13 @@ typedef struct xfs_perag __uint32_t pagf_flcount; /* count of blocks in freelist */ xfs_extlen_t pagf_freeblks; /* total free blocks */ xfs_extlen_t pagf_longest; /* longest free space */ + __uint32_t pagf_btreeblks; /* # of blocks held in AGF btrees */ xfs_agino_t pagi_freecount; /* number of free inodes */ + xfs_agino_t pagi_count; /* number of allocated inodes */ + int pagb_count; /* pagb slots in use */ #ifdef __KERNEL__ lock_t pagb_lock; /* lock for pagb_list */ #endif - int pagb_count; /* pagb slots in use */ xfs_perag_busy_t *pagb_list; /* unstable blocks */ } xfs_perag_t; Index: xfs-cmds/xfsprogs/include/xfs_ialloc.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_ialloc.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_ialloc.h 2007-04-03 14:52:51.815759622 +1000 @@ -149,6 +149,16 @@ xfs_ialloc_read_agi( xfs_agnumber_t agno, /* allocation group number */ struct xfs_buf **bpp); /* allocation group hdr buf */ +/* + * Read in the allocation group header to initialise the per-ag data + * in the mount structure + */ +int +xfs_ialloc_pagi_init( + struct xfs_mount *mp, /* file system mount structure */ + struct xfs_trans *tp, /* transaction pointer */ + xfs_agnumber_t agno); /* allocation group number */ + #endif /* __KERNEL__ */ #endif /* __XFS_IALLOC_H__ */ Index: xfs-cmds/xfsprogs/include/xfs_alloc.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/xfs_alloc.h 2006-11-15 19:00:30.000000000 +1100 +++ xfs-cmds/xfsprogs/include/xfs_alloc.h 2007-04-03 14:52:51.815759622 +1000 @@ -114,7 +114,8 @@ int /* error */ xfs_alloc_get_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop); /* block address 
retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk); /* destination is a AGF btree */ /* * Log the given fields from the agf structure. @@ -143,7 +144,8 @@ xfs_alloc_put_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer for a.g. freelist header */ struct xfs_buf *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno); /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk); /* owner was a AGF btree */ /* * Read in the allocation group header (free/alloc section). Index: xfs-cmds/xfsprogs/libxfs/xfs_alloc_btree.c =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/xfs_alloc_btree.c 2006-11-15 19:00:31.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/xfs_alloc_btree.c 2007-04-03 14:52:51.839756487 +1000 @@ -188,7 +188,7 @@ xfs_alloc_delrec( * Put this buffer/block on the ag's freelist. */ if ((error = xfs_alloc_put_freelist(cur->bc_tp, - cur->bc_private.a.agbp, NULL, bno))) + cur->bc_private.a.agbp, NULL, bno, 1))) return error; /* * Since blocks move to the free list without the @@ -513,7 +513,7 @@ xfs_alloc_delrec( * Free the deleting block by putting it on the freelist. */ if ((error = xfs_alloc_put_freelist(cur->bc_tp, cur->bc_private.a.agbp, - NULL, rbno))) + NULL, rbno, 1))) return error; /* * Since blocks move to the free list without the coordination @@ -1279,7 +1279,7 @@ xfs_alloc_newroot( * Get a buffer from the freelist blocks, for the new root. */ if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &nbno))) + &nbno, 1))) return error; /* * None available, we fail. @@ -1563,7 +1563,7 @@ xfs_alloc_split( * If we can't do it, we're toast. Give up. 
*/ if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &rbno))) + &rbno, 1))) return error; if (rbno == NULLAGBLOCK) { *stat = 0; Index: xfs-cmds/xfsprogs/libxfs/xfs_alloc.c =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/xfs_alloc.c 2006-11-15 19:00:31.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/xfs_alloc.c 2007-04-03 14:52:51.843755964 +1000 @@ -1355,7 +1355,8 @@ xfs_alloc_ag_vextent_small( else if (args->minlen == 1 && args->alignment == 1 && !args->isfl && (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount) > args->minleft)) { - if ((error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno))) + error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0); + if (error) goto error0; if (fbno != NULLAGBLOCK) { if (args->userdata) { @@ -1819,7 +1820,7 @@ xfs_alloc_fix_freelist( while (be32_to_cpu(agf->agf_flcount) > need) { xfs_buf_t *bp; - if ((error = xfs_alloc_get_freelist(tp, agbp, &bno))) + if ((error = xfs_alloc_get_freelist(tp, agbp, &bno, 0))) return error; if ((error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1))) return error; @@ -1865,7 +1866,7 @@ xfs_alloc_fix_freelist( */ for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { if ((error = xfs_alloc_put_freelist(tp, agbp, agflbp, - bno))) + bno, 0))) return error; } } @@ -1882,13 +1883,15 @@ int /* error */ xfs_alloc_get_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop) /* block address retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk) /* destination is a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. freelist structure */ xfs_buf_t *agflbp;/* buffer for a.g. 
freelist structure */ xfs_agblock_t bno; /* block number returned */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_get_freelist"; #endif @@ -1923,8 +1926,16 @@ xfs_alloc_get_freelist( be32_add(&agf->agf_flcount, -1); xfs_trans_agflist_delta(tp, -1); pag->pagf_flcount--; - TRACE_MODAGF(NULL, agf, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); + + logflags = XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, 1); + pag->pagf_btreeblks++; + logflags |= XFS_AGF_BTREEBLKS; + } + + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); *bnop = bno; /* @@ -1962,6 +1973,7 @@ xfs_alloc_log_agf( offsetof(xfs_agf_t, agf_flcount), offsetof(xfs_agf_t, agf_freeblks), offsetof(xfs_agf_t, agf_longest), + offsetof(xfs_agf_t, agf_btreeblks), sizeof(xfs_agf_t) }; @@ -1997,12 +2009,14 @@ xfs_alloc_put_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* buffer for a.g. freelist header */ xfs_buf_t *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno) /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk) /* block came from a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. 
free block array */ xfs_agblock_t *blockp;/* pointer to array entry */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_put_freelist"; #endif @@ -2023,11 +2037,19 @@ xfs_alloc_put_freelist( be32_add(&agf->agf_flcount, 1); xfs_trans_agflist_delta(tp, 1); pag->pagf_flcount++; + + logflags = XFS_AGF_FLLAST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, -1); + pag->pagf_btreeblks--; + logflags |= XFS_AGF_BTREEBLKS; + } + ASSERT(be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp)); blockp = &agfl->agfl_bno[be32_to_cpu(agf->agf_fllast)]; - INT_SET(*blockp, ARCH_CONVERT, bno); - TRACE_MODAGF(NULL, agf, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); + INT_SET(*blockp, ARCH_CONVERT, bno); + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); xfs_trans_log_buf(tp, agflbp, (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl), (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl + @@ -2048,6 +2070,7 @@ xfs_alloc_read_agf( { xfs_agf_t *agf; /* ag freelist header */ int agf_ok; /* set if agf is consistent */ + int agf_length; /* ag length from agf */ xfs_buf_t *bp; /* return value */ xfs_perag_t *pag; /* per allocation group data */ int error; @@ -2070,10 +2093,12 @@ xfs_alloc_read_agf( * Validate the magic number of the agf block. 
*/ agf = XFS_BUF_TO_AGF(bp); + agf_length = be32_to_cpu(agf->agf_length); agf_ok = be32_to_cpu(agf->agf_magicnum) == XFS_AGF_MAGIC && XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) && - be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) && + be32_to_cpu(agf->agf_freeblks) <= agf_length && + be32_to_cpu(agf->agf_btreeblks) <= agf_length && be32_to_cpu(agf->agf_flfirst) < XFS_AGFL_SIZE(mp) && be32_to_cpu(agf->agf_fllast) < XFS_AGFL_SIZE(mp) && be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp); @@ -2087,6 +2112,7 @@ xfs_alloc_read_agf( pag = &mp->m_perag[agno]; if (!pag->pagf_init) { pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks); + pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks); pag->pagf_flcount = be32_to_cpu(agf->agf_flcount); pag->pagf_longest = be32_to_cpu(agf->agf_longest); pag->pagf_levels[XFS_BTNUM_BNOi] = @@ -2101,6 +2127,7 @@ xfs_alloc_read_agf( #ifdef DEBUG else if (!XFS_FORCED_SHUTDOWN(mp)) { ASSERT(pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks)); + ASSERT(pag->pagf_btreeblks == be32_to_cpu(agf->agf_btreeblks)); ASSERT(pag->pagf_flcount == be32_to_cpu(agf->agf_flcount)); ASSERT(pag->pagf_longest == be32_to_cpu(agf->agf_longest)); ASSERT(pag->pagf_levels[XFS_BTNUM_BNOi] == Index: xfs-cmds/xfsprogs/libxfs/xfs_ialloc.c =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/xfs_ialloc.c 2006-11-15 19:00:32.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/xfs_ialloc.c 2007-04-03 14:52:51.851754919 +1000 @@ -1128,6 +1128,7 @@ xfs_ialloc_read_agi( pag = &mp->m_perag[agno]; if (!pag->pagi_init) { pag->pagi_freecount = be32_to_cpu(agi->agi_freecount); + pag->pagi_count = be32_to_cpu(agi->agi_count); pag->pagi_init = 1; } else { /* @@ -1135,6 +1136,7 @@ xfs_ialloc_read_agi( * we are in the middle of a forced shutdown. 
*/ ASSERT(pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || + pag->pagi_count == be32_to_cpu(agi->agi_count) || XFS_FORCED_SHUTDOWN(mp)); } @@ -1151,3 +1153,22 @@ xfs_ialloc_read_agi( *bpp = bp; return 0; } + +/* + * Read in the agi to initialise the per-ag data in the mount structure + */ +int +xfs_ialloc_pagi_init( + xfs_mount_t *mp, /* file system mount structure */ + xfs_trans_t *tp, /* transaction pointer */ + xfs_agnumber_t agno) /* allocation group number */ +{ + xfs_buf_t *bp = NULL; + int error; + + if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp))) + return error; + if (bp) + xfs_trans_brelse(tp, bp); + return 0; +} Index: xfs-cmds/xfsprogs/libxfs/xfs.h =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/xfs.h 2006-11-15 19:00:31.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/xfs.h 2007-04-03 14:52:51.851754919 +1000 @@ -86,6 +86,7 @@ #define xfs_mount_common libxfs_mount_common #define xfs_initialize_perag libxfs_initialize_perag +#define xfs_initialize_perag_data libxfs_initialize_perag_data #define xfs_rtmount_init libxfs_rtmount_init #define xfs_alloc_fix_freelist libxfs_alloc_fix_freelist #define xfs_idata_realloc libxfs_idata_realloc @@ -352,10 +353,10 @@ static inline int __do_div(unsigned long */ /* xfs_alloc.c */ -int xfs_alloc_get_freelist (xfs_trans_t *, xfs_buf_t *, xfs_agblock_t *); +int xfs_alloc_get_freelist (xfs_trans_t *, xfs_buf_t *, xfs_agblock_t *, int); void xfs_alloc_log_agf (xfs_trans_t *, xfs_buf_t *, int); int xfs_alloc_put_freelist (xfs_trans_t *, xfs_buf_t *, xfs_buf_t *, - xfs_agblock_t); + xfs_agblock_t, int); int xfs_alloc_read_agf (xfs_mount_t *, xfs_trans_t *, xfs_agnumber_t, int, xfs_buf_t **); int xfs_alloc_vextent (xfs_alloc_arg_t *); @@ -372,6 +373,7 @@ int xfs_dialloc (xfs_trans_t *, xfs_ino void xfs_ialloc_log_agi (xfs_trans_t *, xfs_buf_t *, int); int xfs_ialloc_read_agi (xfs_mount_t *, xfs_trans_t *, xfs_agnumber_t, xfs_buf_t **); +int xfs_ialloc_pagi_init 
(xfs_mount_t *, xfs_trans_t *, xfs_agnumber_t); int xfs_dilocate (xfs_mount_t *, xfs_trans_t *, xfs_ino_t, xfs_fsblock_t *, int *, int *, uint); Index: xfs-cmds/xfsprogs/libxfs/xfs_mount.c =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/xfs_mount.c 2006-11-15 19:00:32.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/xfs_mount.c 2007-04-03 14:52:51.859753874 +1000 @@ -313,3 +313,56 @@ xfs_initialize_perag(xfs_mount_t *mp, xf } return index; } + +/* + * xfs_initialize_perag_data + * + * Read in each per-ag structure so we can count up the number of + * allocated inodes, free inodes and used filesystem blocks as this + * information is no longer persistent in the superblock. Once we have + * this information, write it into the in-core superblock structure. + */ +STATIC int +xfs_initialize_perag_data(xfs_mount_t *mp, xfs_agnumber_t agcount) +{ + xfs_agnumber_t index; + xfs_perag_t *pag; + xfs_sb_t *sbp = &mp->m_sb; + __uint64_t ifree = 0; + __uint64_t ialloc = 0; + __uint64_t bfree = 0; + __uint64_t bfreelst = 0; + __uint64_t btree = 0; + int error; + int s; + + for (index = 0; index < agcount; index++) { + /* + * read the agf, then the agi. This gets us + * all the information we need and populates the + * per-ag structures for us. 
+ */ + error = xfs_alloc_pagf_init(mp, NULL, index, 0); + if (error) + return error; + + error = xfs_ialloc_pagi_init(mp, NULL, index); + if (error) + return error; + pag = &mp->m_perag[index]; + ifree += pag->pagi_freecount; + ialloc += pag->pagi_count; + bfree += pag->pagf_freeblks; + bfreelst += pag->pagf_flcount; + btree += pag->pagf_btreeblks; + } + /* + * Overwrite incore superblock counters with just-read data + */ + s = XFS_SB_LOCK(mp); + sbp->sb_ifree = ifree; + sbp->sb_icount = ialloc; + sbp->sb_fdblocks = bfree + bfreelst + btree; + XFS_SB_UNLOCK(mp, s); + return 0; +} Index: xfs-cmds/xfsprogs/include/libxfs.h =================================================================== --- xfs-cmds.orig/xfsprogs/include/libxfs.h 2006-12-22 11:49:54.000000000 +1100 +++ xfs-cmds/xfsprogs/include/libxfs.h 2007-04-03 14:52:51.863753351 +1000 @@ -197,6 +197,7 @@ extern xfs_mount_t *libxfs_mount (xfs_mo dev_t, dev_t, dev_t, int); extern void libxfs_mount_common (xfs_mount_t *, xfs_sb_t *); extern xfs_agnumber_t libxfs_initialize_perag (xfs_mount_t *, xfs_agnumber_t); +extern int libxfs_initialize_perag_data (xfs_mount_t *, xfs_agnumber_t); extern void libxfs_umount (xfs_mount_t *); extern int libxfs_rtmount_init (xfs_mount_t *); extern void libxfs_rtmount_destroy (xfs_mount_t *); Index: xfs-cmds/xfsprogs/libxfs/init.c =================================================================== --- xfs-cmds.orig/xfsprogs/libxfs/init.c 2006-12-22 11:49:54.000000000 +1100 +++ xfs-cmds/xfsprogs/libxfs/init.c 2007-04-03 14:52:51.879751261 +1000 @@ -649,6 +649,20 @@ libxfs_mount( libxfs_iput(mp->m_rootip, 0); return NULL; } + + /* + * mkfs calls mount before the AGF/AGI structures are written. 
+ */ + if ((flags & LIBXFS_MOUNT_ROOTINOS) && sbp->sb_rootino != NULLFSINO && + XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)) { + error = libxfs_initialize_perag_data(mp, sbp->sb_agcount); + if (error) { + fprintf(stderr, _("%s: cannot init perag data (%d)\n"), + progname, error); + return NULL; + } + } + return mp; } Index: xfs-cmds/xfsprogs/man/man8/mkfs.xfs.8 =================================================================== --- xfs-cmds.orig/xfsprogs/man/man8/mkfs.xfs.8 2006-11-15 19:00:33.000000000 +1100 +++ xfs-cmds/xfsprogs/man/man8/mkfs.xfs.8 2007-04-03 14:52:51.907747603 +1000 @@ -335,8 +335,9 @@ The valid suboptions are: \f3logdev=\f1\f2device\f1, \f3size=\f1\f2value\f1, \f3version=\f1\f2[1|2]\f1, -\f3sunit=\f1\f2value\f1, and -\f3su=\f1\f2value\f1. +\f3sunit=\f1\f2value\f1, +\f3su=\f1\f2value\f1 and +\f3lazy-count=\f1\f2[0|1]\f1. .IP The .B internal @@ -415,6 +416,24 @@ The suboption value has to be specified This value must be a multiple of the filesystem block size. Version 2 logs are automatically selected if the log \f3su\f1 suboption is specified. +.IP +The +.B lazy-count +suboption changes the method of logging various persistent counters +in the superblock. Under metadata intensive workloads, these +counters are updated and logged frequently enough that the +superblock updates become a serialisation point in the filesystem. +.IP +With +.BR lazy-count=1 , +the superblock is not modified or logged on every change of the +persistent counters. Instead, enough information is kept in +other parts of the filesystem to be able to maintain the persistent +counter values without needed to keep them in the superblock. +This gives significant improvements in performance on some configurations. +The default value is 0 (off) so you must specify +.B lazy-count=1 +if you want to make use of this feature. .TP .B \-n Naming options. 
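For readers following the libxfs part of the patch: the counter reconstruction done by xfs_initialize_perag_data() can be sketched in standalone C roughly as below. This is only an illustration under stated assumptions, not the patch's code. The struct and function names (perag_counts, sb_counts, sum_perag_counts) are hypothetical stand-ins for the xfs_perag_t fields and the in-core superblock, and the AGF/AGI reads and the XFS_SB_LOCK locking are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-AG fields the patch reads. */
struct perag_counts {
	uint32_t freecount;	/* like pagi_freecount: free inodes */
	uint32_t count;		/* like pagi_count: allocated inodes */
	uint32_t freeblks;	/* like pagf_freeblks: free blocks */
	uint32_t flcount;	/* like pagf_flcount: blocks on the freelist */
	uint32_t btreeblks;	/* like pagf_btreeblks: blocks held in AGF btrees */
};

/* Hypothetical stand-in for the three in-core superblock counters. */
struct sb_counts {
	uint64_t icount;
	uint64_t ifree;
	uint64_t fdblocks;
};

/*
 * Sum the per-AG counters into superblock-wide totals.  As in the
 * patch, free data blocks are the sum of free blocks, freelist blocks
 * and btree blocks across all allocation groups.
 */
static struct sb_counts
sum_perag_counts(const struct perag_counts *pag, size_t agcount)
{
	struct sb_counts sb = { 0, 0, 0 };
	size_t i;

	assert(pag != NULL || agcount == 0);

	for (i = 0; i < agcount; i++) {
		sb.ifree += pag[i].freecount;
		sb.icount += pag[i].count;
		sb.fdblocks += (uint64_t)pag[i].freeblks +
			       pag[i].flcount + pag[i].btreeblks;
	}
	return sb;
}
```

The accumulators are 64-bit even though each per-AG field is 32-bit, mirroring the __uint64_t locals in the patch, so a filesystem with many AGs cannot overflow the totals.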
From owner-xfs@oss.sgi.com Thu Apr 19 21:40:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 21:40:48 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3K4eefB009871 for ; Thu, 19 Apr 2007 21:40:41 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id D7CCE92C2D2; Fri, 20 Apr 2007 12:34:20 +1000 (EST) Subject: Re: review: allocate bmapi args From: Nathan Scott Reply-To: nscott@aconex.com To: David Chinner Cc: xfs-dev , xfs-oss In-Reply-To: <20070419082331.GW48531920@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> <1176969062.6273.169.camel@edge> <20070419082331.GW48531920@melbourne.sgi.com> Content-Type: text/plain Organization: Aconex Date: Fri, 20 Apr 2007 14:41:57 +1000 Message-Id: <1177044117.6273.203.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11138 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-19 at 18:23 +1000, David Chinner wrote: > ... > > Are you sure this is legit though? > > It *must* be. We already rely on being able to do substantial > amounts of allocation in this path.... Not necessarily, we only sometimes (for some values of BMAPI flags I mean) do memory allocations. > ... > We modify the incore extent list as it grows and shrinks in this > path. It is critical that we are able to allocate at least small Well, not always. For cases where we modify the extents we must call in here with Steve's funky I'm-in-a-transaction process-flag set, which is the secret handshake for the memory allocator to not use GFP_FS. For cases where we are only reading the extent list, we would not be doing allocations before and we'd not be protected by that extra magic. 
So now those paths can start getting ENOMEM in places where that wouldn't have happened before.. I guess that filesystem shutdowns could result, possibly. > FWIW, I have done low memory testing and I wasn't about to trigger > any problems..... It would be a once-in-a-blue-moon type problem though, unfortunately; one of the really-really-hard to trigger types of problem. > > (Oh, and why the _zalloc? Could just do an _alloc, since previous > > code was using non-zeroed memory - so, should have been filling in > > all fields). > > Habit. And it doesn't hurt performance at all - we've got to take Hrmmm... is there any point in having a non-zeroing interface at all then? I thought the non-zeroing version was all about using the fact that you know you're going to overwrite all the fields anyway shortly, so there's no point zeroing initially... cheers. -- Nathan From owner-xfs@oss.sgi.com Thu Apr 19 22:34:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 19 Apr 2007 22:34:21 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3K5YFfB025985 for ; Thu, 19 Apr 2007 22:34:17 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA23562; Fri, 20 Apr 2007 15:34:14 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3K5YCAf70534614; Fri, 20 Apr 2007 15:34:13 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3K5YBb367492209; Fri, 20 Apr 2007 15:34:11 +1000 (AEST) Date: Fri, 20 Apr 2007 15:34:11 +1000 From: David Chinner To: Nathan Scott Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review: allocate bmapi args Message-ID: <20070420053411.GM32602149@melbourne.sgi.com> References: <20070419072505.GS48531920@melbourne.sgi.com> 
<1176969062.6273.169.camel@edge> <20070419082331.GW48531920@melbourne.sgi.com> <1177044117.6273.203.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177044117.6273.203.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11139 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, Apr 20, 2007 at 02:41:57PM +1000, Nathan Scott wrote: > On Thu, 2007-04-19 at 18:23 +1000, David Chinner wrote: > > ... > > > Are you sure this is legit though? > > > > It *must* be. We already rely on being able to do substantial > > amounts of allocation in this path.... > > Not necessarily, we only sometimes (for some values of BMAPI flags > I mean) do memory allocations. > > > ... > > We modify the incore extent list as it grows and shrinks in this > > path. It is critical that we are able to allocate at least small > > Well, not always. For cases where we modify the extents we must > call in here with Steve's funky I'm-in-a-transaction process-flag > set, which is the secret handshake for the memory allocator to not > use GFP_FS. *nod* > For cases where we are only reading the extent list, > we would not be doing allocations before and we'd not be protected > by that extra magic. So now those paths can start getting ENOMEM > in places where that wouldn't have happened before.. I guess that > filesystem shutdowns could result, possibly. Well, with a sleeping allocation we'll get hangs, not shutdowns. Quite frankly, hangs are far easier to debug than shutdowns. If it really does become an issue, we could use mempools here - we can guarantee that we will return the object to the pool if we don't hang on some other allocation and hence we'd always be able to make progress. 
However, I'm not sure I want to go that far without having actually seen a normal allocation cause a problem here and right now I think that saving ~250 bytes of stack (~10-15% of XFS's stack usage on ia32!) through the paths we know blow the stack is substantial. > > FWIW, I have done low memory testing and I wasn't about to trigger > > any problems..... > > It would be a once-in-a-blue-moon type problem though, unfortunately; > one of the really-really-hard to trigger types of problem. Yes, that it is. But a sysrq-t will point out the problem immediately as we'll see processes hung trying to do memory allocation. > > > (Oh, and why the _zalloc? Could just do an _alloc, since previous > > > code was using non-zeroed memory - so, should have been filling in > > > all fields). > > > > Habit. And it doesn't hurt performance at all - we've got to take > > Hrmmm... is there any point in having a non-zeroing interface at > all then? Sorry - I should have said that "for small allocations like this it doesn't hurt performance" - the cycles consumed by the allocation and lost in the initial cacheline fetch are far, far greater than those spent in zeroing part of a cacheline once it's accessible. > I thought the non-zeroing version was all about using > the fact that you know you're going to overwrite all the fields > anyway shortly, so there's no point zeroing initially... Zeroing causes sequential access to cachelines and hence hardware prefetchers can operate quickly and reduce the number of CPU stalls compared to filling out the structure in random order. And some CPUs can zero-fill cachelines without having fetched them from memory (PPC can do this IIRC) so you don't even stall the CPU.... But there are still plenty of cases where you don't want to touch some or all of the allocated region (e.g. you're about to memcpy() something into it) so we still need the non-zeroing version of the allocator... Cheers, Dave. 
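A userspace analogy of the two allocator variants discussed above, as a rough sketch only: calloc() and malloc() stand in for the kernel's zeroing and non-zeroing allocation calls, and the struct and function names (small_args, alloc_args_zeroed, dup_region) are hypothetical and do not come from the XFS code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical small argument structure, a few words in size. */
struct small_args {
	long	offset;
	long	len;
	int	flags;
};

/*
 * Small structure: a zeroing allocation.  Zeroing a structure this
 * size is cheap next to the allocation itself, and the caller gets
 * predictable contents for every field.
 */
static struct small_args *
alloc_args_zeroed(void)
{
	return calloc(1, sizeof(struct small_args));
}

/*
 * Region that is overwritten in full straight away: a non-zeroing
 * allocation, since zeroing first would be wasted work.
 */
static void *
dup_region(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p != NULL)
		memcpy(p, src, len);
	return p;
}
```

Both helpers return NULL on allocation failure, so callers still need to check the result before use.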
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri Apr 20 07:23:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 20 Apr 2007 07:23:47 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3KENefB010382 for ; Fri, 20 Apr 2007 07:23:41 -0700 Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3KDpkB1006040 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Fri, 20 Apr 2007 09:51:47 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e36.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3KDpgf7004800 for ; Fri, 20 Apr 2007 09:51:42 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3KDpgsB169698 for ; Fri, 20 Apr 2007 07:51:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3KDpfI1023636 for ; Fri, 20 Apr 2007 07:51:42 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3KDpeeD023497; Fri, 20 Apr 2007 07:51:40 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id C95A829EE1C; Fri, 20 Apr 2007 19:21:47 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3KDpkuk000327; Fri, 20 Apr 2007 19:21:46 +0530 Date: Fri, 20 Apr 2007 19:21:46 +0530 From: "Amit K. 
Arora" To: Andreas Dilger Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com, Andrew Morton , Jakub Jelinek , torvalds@linux-foundation.org Subject: Re: Interface for the new fallocate() system call Message-ID: <20070420135146.GA21352@amitarora.in.ibm.com> References: <20070301183445.GA7911@amitarora.in.ibm.com> <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418130600.GW5967@schatzie.adilger.int> User-Agent: Mutt/1.4.1i X-archive-position: 11140 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Wed, Apr 18, 2007 at 07:06:00AM -0600, Andreas Dilger wrote: > On Apr 17, 2007 18:25 +0530, Amit K. Arora wrote: > > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote: > > > Wouldn't > > > int fallocate(loff_t offset, loff_t len, int fd, int mode) > > > work on both s390 and ppc/arm? glibc will certainly wrap it and > > > reorder the arguments as needed, so there is no need to keep fd first. > > > > I think more people are comfirtable with this approach. > > Really? I thought from the last postings that "fd first, wrap on s390" > was better. > > > Since glibc > > will wrap the system call and export the "conventional" interface > > (with fd first) to applications, we may not worry about keeping fd first > > in kernel code. I am personally fine with this approach. 
> > It would seem to make more sense to wrap the syscall on those architectures > that can't handle the "conventional" interface (fd first). Ok. In this case we may have to consider following things: 1) Obviously, for this glibc will have to call fallocate() syscall with different arguments on s390, than other archs. I think this should be doable and should not be an issue with glibc folks (right?). 2) we also need to see how strace behaves in this case. With little knowledge that I have of strace, I don't think it should depend on argument ordering of a system call on different archs (since it uses ptrace internally and that should take care of it). But, it will be nice if someone can confirm this. Thanks! -- Regards, Amit Arora From owner-xfs@oss.sgi.com Fri Apr 20 08:00:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 20 Apr 2007 08:00:40 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3KF0VfB019084 for ; Fri, 20 Apr 2007 08:00:33 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l3KExlaK001147; Fri, 20 Apr 2007 10:59:47 -0400 Received: from devserv.devel.redhat.com (devserv.devel.redhat.com [172.16.58.1]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3KExkJZ014872; Fri, 20 Apr 2007 10:59:46 -0400 Received: from devserv.devel.redhat.com (localhost.localdomain [127.0.0.1]) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l3KExkK0005918; Fri, 20 Apr 2007 10:59:46 -0400 Received: (from jakub@localhost) by devserv.devel.redhat.com (8.12.11.20060308/8.12.11/Submit) id l3KExIC0005908; Fri, 20 Apr 2007 10:59:18 -0400 Date: Fri, 20 Apr 2007 10:59:18 -0400 From: Jakub Jelinek To: "Amit K. 
Arora" Cc: Andreas Dilger , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com, Andrew Morton , torvalds@linux-foundation.org Subject: Re: Interface for the new fallocate() system call Message-ID: <20070420145918.GY355@devserv.devel.redhat.com> Reply-To: Jakub Jelinek References: <20070316143101.GA10152@amitarora.in.ibm.com> <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070420135146.GA21352@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11141 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jakub@redhat.com Precedence: bulk X-list: xfs On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote: > Ok. > In this case we may have to consider following things: > > 1) Obviously, for this glibc will have to call fallocate() syscall with > different arguments on s390, than other archs. I think this should be > doable and should not be an issue with glibc folks (right?). glibc can cope with this easily, will just add sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override the generic Linux implementation. > 2) we also need to see how strace behaves in this case. With little > knowledge that I have of strace, I don't think it should depend on > argument ordering of a system call on different archs (since it uses > ptrace internally and that should take care of it). 
But, it will be > nice if someone can confirm this. strace would solve this with #ifdef mess, it already does that in many places so guess another few lines don't make it significantly worse. Jakub From owner-xfs@oss.sgi.com Sun Apr 22 07:33:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 07:33:37 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3MEXTfB011841 for ; Sun, 22 Apr 2007 07:33:32 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3MEXRLD015786 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Sun, 22 Apr 2007 16:33:27 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3MEXQZR015784; Sun, 22 Apr 2007 16:33:26 +0200 Date: Sun, 22 Apr 2007 16:33:26 +0200 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs@oss.sgi.com Subject: Re: [PATCH] kill macro noise in xfs_dir2*.h Message-ID: <20070422143326.GA15747@lst.de> References: <20070418175859.GB18315@lst.de> <20070419000940.GL48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419000940.GL48531920@melbourne.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11142 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 10:09:40AM +1000, David Chinner wrote: > On Wed, Apr 18, 2007 at 07:59:00PM +0200, Christoph Hellwig wrote: > > Remove all the macros that just give inline functions uppercase names. > > > > Signed-off-by: Christoph Hellwig > > BTW, you'll need this patch to make debug kernels build.... Well, there's not xfsidb in mainline and that's the tree I tend to work with for all my patches. 
I suspect you'll have to fixup xfsidb for whatever patches come from normal mainline-using developers :) From owner-xfs@oss.sgi.com Sun Apr 22 07:35:51 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 07:35:56 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3MEZnfB013651 for ; Sun, 22 Apr 2007 07:35:51 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3MEZmLD015868 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Sun, 22 Apr 2007 16:35:48 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3MEZm5F015866; Sun, 22 Apr 2007 16:35:48 +0200 Date: Sun, 22 Apr 2007 16:35:48 +0200 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs@oss.sgi.com Subject: Re: [PATCH] remove various useless min/max macros Message-ID: <20070422143548.GB15747@lst.de> References: <20070418175730.GA18315@lst.de> <20070418235056.GJ48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070418235056.GJ48531920@melbourne.sgi.com> User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11143 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 09:50:56AM +1000, David Chinner wrote: > On Wed, Apr 18, 2007 at 07:57:30PM +0200, Christoph Hellwig wrote: > > xfs_btree.h has various macros to calculate a min/max after casting > > it's arguments to a specific type. This can be done much simpler > > by using min_t/max_t with the type as first argument. > > Sure, but I NACKed that last October for good reason. 
>
> http://marc.info/?t=116116017600003&r=1&w=2
>
> Specifically:
>
> http://marc.info/?l=linux-kernel&m=116122285309389&w=2
>
> I still have no objection to changing the implementation of these
> macros or even changing them to non-shouting static inlines but
> I don't want them removed....

Oh, I don't remember that thread anymore. Anyway, I disagree.
min_t/max_t say just as clearly as the existing macros that we want
to do the comparison as the type passed as the first argument. That's
the whole point of these macros. I would agree if your objection applied
to the first patch posted in that thread, which uses plain min/max.

But anyway, you're the maintainer, so..

From owner-xfs@oss.sgi.com Sun Apr 22 16:03:18 2007
Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 16:03:21 -0700 (PDT)
Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3MN3FfB003682 for ; Sun, 22 Apr 2007 16:03:17 -0700
Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA29480; Mon, 23 Apr 2007 09:03:06 +1000
Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3MN35Af73694234; Mon, 23 Apr 2007 09:03:05 +1000 (AEST)
Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3MN335E73969261; Mon, 23 Apr 2007 09:03:03 +1000 (AEST)
Date: Mon, 23 Apr 2007 09:03:03 +1000
From: David Chinner
To: xfs-dev
Cc: xfs-oss
Subject: review: don't hold ilock when calling vn_iowait
Message-ID: <20070422230303.GX32602149@melbourne.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11144
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs Regression introduced by recent freezing fixes - we should not hold the ilock while waiting for I/O completion. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_vfsops.c | 73 +++++++++++++++++++--------------------------------- 1 file changed, 28 insertions(+), 45 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2007-04-19 17:51:09.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2007-04-20 17:38:39.946453274 +1000 @@ -1169,58 +1169,41 @@ xfs_sync_inodes( * in the inode list. */ - if ((flags & SYNC_CLOSE) && (vp != NULL)) { - /* - * This is the shutdown case. We just need to - * flush and invalidate all the pages associated - * with the inode. Drop the inode lock since - * we can't hold it across calls to the buffer - * cache. - * - * We don't set the VREMAPPING bit in the vnode - * here, because we don't hold the vnode lock - * exclusively. It doesn't really matter, though, - * because we only come here when we're shutting - * down anyway. - */ - xfs_iunlock(ip, XFS_ILOCK_SHARED); - - if (XFS_FORCED_SHUTDOWN(mp)) { - bhv_vop_toss_pages(vp, 0, -1, FI_REMAPF); - } else { - error = bhv_vop_flushinval_pages(vp, 0, -1, FI_REMAPF); + /* + * If we have to flush data or wait for I/O completion + * we need to drop the ilock that we currently hold. + * If we need to drop the lock, insert a marker if we + * have not already done so. + */ + if ((flags & (SYNC_CLOSE|SYNC_IOWAIT)) || + ((flags & SYNC_DELWRI) && VN_DIRTY(vp))) { + if (mount_locked) { + IPOINTER_INSERT(ip, mp); } + xfs_iunlock(ip, XFS_ILOCK_SHARED); - xfs_ilock(ip, XFS_ILOCK_SHARED); - - } else if ((flags & SYNC_DELWRI) && (vp != NULL)) { - if (VN_DIRTY(vp)) { - /* We need to have dropped the lock here, - * so insert a marker if we have not already - * done so. 
- */ - if (mount_locked) { - IPOINTER_INSERT(ip, mp); - } - - /* - * Drop the inode lock since we can't hold it - * across calls to the buffer cache. - */ - xfs_iunlock(ip, XFS_ILOCK_SHARED); + if (flags & SYNC_CLOSE) { + /* Shutdown case. Flush and invalidate. */ + if (XFS_FORCED_SHUTDOWN(mp)) + bhv_vop_toss_pages(vp, 0, -1, FI_REMAPF); + else + error = bhv_vop_flushinval_pages(vp, 0, + -1, FI_REMAPF); + } else if ((flags & SYNC_DELWRI) && VN_DIRTY(vp)) { error = bhv_vop_flush_pages(vp, (xfs_off_t)0, -1, fflag, FI_NONE); - xfs_ilock(ip, XFS_ILOCK_SHARED); } + /* + * When freezing, we need to wait ensure all I/O (including direct + * I/O) is complete to ensure no further data modification can take + * place after this point + */ + if (flags & SYNC_IOWAIT) + vn_iowait(vp); + + xfs_ilock(ip, XFS_ILOCK_SHARED); } - /* - * When freezing, we need to wait ensure all I/O (including direct - * I/O) is complete to ensure no further data modification can take - * place after this point - */ - if (flags & SYNC_IOWAIT) - vn_iowait(vp); if (flags & SYNC_BDFLUSH) { if ((flags & SYNC_ATTR) && From owner-xfs@oss.sgi.com Sun Apr 22 17:26:26 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 17:26:28 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3N0QMfB026825 for ; Sun, 22 Apr 2007 17:26:24 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA01019; Mon, 23 Apr 2007 10:26:18 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3N0QHAf72814019; Mon, 23 Apr 2007 10:26:17 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3N0QGpT73783311; Mon, 23 Apr 2007 10:26:16 +1000 (AEST) Date: Mon, 23 Apr 2007 10:26:16 +1000 From: David 
Chinner To: xfs-dev Cc: xfs-oss Subject: review: don't block non-blocking writes when frozen Message-ID: <20070423002616.GY32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11145 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Prevent nfsds for blocking trying to write to a frozen filesystem or a filesystem in the process of freezing. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_lrw.c | 4 ++++ 1 file changed, 4 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_lrw.c 2007-03-29 19:03:30.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c 2007-03-29 19:08:06.262169809 +1000 @@ -684,6 +684,10 @@ xfs_write( io = &xip->i_iocore; mp = io->io_mount; + if (FILP_DELAY_FLAG(file) && vfs_test_for_freeze(vp->v_vfsp)) { + /* so nfsd can return EJUKEBOX to clients during a freeze */ + return -EAGAIN; + } vfs_wait_for_freeze(vp->v_vfsp, SB_FREEZE_WRITE); if (XFS_FORCED_SHUTDOWN(mp)) From owner-xfs@oss.sgi.com Sun Apr 22 17:31:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 17:31:54 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3N0VnfB028244 for ; Sun, 22 Apr 2007 17:31:51 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA01176; Mon, 23 Apr 2007 10:31:44 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3N0VhAf72583323; Mon, 23 Apr 2007 10:31:44 +1000 
(AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3N0VgBa73953403; Mon, 23 Apr 2007 10:31:42 +1000 (AEST) Date: Mon, 23 Apr 2007 10:31:42 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: review: don't block non-blocking setattr when frozen Message-ID: <20070423003142.GZ32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11146 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Prevent nfsds from blocking doing on setattr calls (i.e. truncates) when the filesystem is frozen or being frozen. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/xfs_vnodeops.c | 6 ++++++ 1 file changed, 6 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c 2007-01-31 13:56:03.781573432 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c 2007-01-31 13:56:13.428316729 +1100 @@ -281,6 +281,12 @@ xfs_setattr( return XFS_ERROR(EINVAL); } + /* + * Don't block if the filesystem is frozen. 
+ */ + if ((flags & ATTR_NONBLOCK) && vfs_test_for_freeze(vp->v_vfsp)) + return XFS_ERROR(EAGAIN); + ip = XFS_BHVTOI(bdp); mp = ip->i_mount; From owner-xfs@oss.sgi.com Sun Apr 22 23:26:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 22 Apr 2007 23:26:05 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3N6PwfB022830 for ; Sun, 22 Apr 2007 23:26:00 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA07957; Mon, 23 Apr 2007 16:25:52 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 16346) id 7B12B5910FDF; Mon, 23 Apr 2007 16:25:52 +1000 (EST) To: xfs@oss.sgi.com, sgi.bugs.xfs@engr.sgi.com Subject: PARTIAL TAKE 940392 - unwritten extents aren't converted when modified through mmap Message-Id: <20070423062552.7B12B5910FDF@chook.melbourne.sgi.com> Date: Mon, 23 Apr 2007 16:25:52 +1000 (EST) From: dgc@sgi.com (David Chinner) X-archive-position: 11147 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs QA test to demonstrate unwritten extent/mmap write problem. 
Date: Mon Apr 23 16:24:57 AEST 2007 Workarea: chook.melbourne.sgi.com:/build/dgc/isms/xfs-cmds Inspected by: ddiss The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/xfs-cmds/master-melb Modid: master-melb:xfs-cmds:28456a xfstests/166 - 1.1 - new http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/166 xfstests/166.out - 1.1 - new http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/166.out xfstests/src/unwritten_mmap.c - 1.1 - new http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/src/unwritten_mmap.c xfstests/group - 1.103 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/group.diff?r1=text&tr1=1.103&r2=text&tr2=1.102&f=h xfstests/src/Makefile - 1.39 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/src/Makefile.diff?r1=text&tr1=1.39&r2=text&tr2=1.38&f=h - mmap vs unwritten extents test. From owner-xfs@oss.sgi.com Mon Apr 23 09:08:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 09:08:10 -0700 (PDT) Received: from mail.g-house.de (ns2.g-housing.de [81.169.133.75]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NG84fB021680 for ; Mon, 23 Apr 2007 09:08:05 -0700 Received: from [85.211.139.192] (helo=sheep.housecafe.de) by mail.g-house.de with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1Hg15D-0003JF-Ji for xfs@oss.sgi.com; Mon, 23 Apr 2007 18:08:07 +0200 Received: from localhost ([127.0.0.1] helo=derchris.gotdns.org) by sheep.housecafe.de with esmtp (Exim 4.63) (envelope-from ) id 1Hg156-000417-6z for xfs@oss.sgi.com; Mon, 23 Apr 2007 17:08:00 +0100 Received: from 194.246.123.250 (SquirrelMail authenticated user evil) by derchris.gotdns.org:8080 with HTTP; Mon, 23 Apr 2007 17:08:00 +0100 (BST) Message-ID: <20603.194.246.123.250.1177344480.squirrel@derchris.gotdns.org:8080> In-Reply-To: References: Date: Mon, 23 Apr 2007 17:08:00 +0100 (BST) Subject: Re: possible recursive locking detected From: "Christian Kujau" To: xfs@oss.sgi.com User-Agent: 
SquirrelMail/1.5.2 [SVN] MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-15 Content-Transfer-Encoding: 8bit X-archive-position: 11148 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lists@nerdbynature.de Precedence: bulk X-list: xfs Hi there, On Mon, April 2, 2007 19:18, Christian Kujau wrote: > when I enabled a few more debug-options in the kernel (vanilla > 2.6.21-rc5), I came across: > > [ INFO: possible recursive locking detected ] > 2.6.21-rc5 #2 The same happened with -rc7, see below. Can anyone comment if this is/could lead to a problem? Thanks, Christian. Please see http://nerdbynature.de/bits/2.6.21-rc7/ for full details: [37380.435689] ============================================= [37380.435703] [ INFO: possible recursive locking detected ] [37380.435707] 2.6.21-rc7 #6 [37380.435710] --------------------------------------------- [37380.435715] rm/14081 is trying to acquire lock: [37380.435719] (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x71/0xa0 [37380.435734] [37380.435735] but task is already holding lock: [37380.435739] (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x71/0xa0 [37380.435749] [37380.435750] other info that might help us debug this: [37380.435755] 3 locks held by rm/14081: [37380.435758] #0: (&inode->i_mutex/1){--..}, at: [] do_unlinkat+0x96/0x160 [37380.435770] #1: (&inode->i_mutex){--..}, at: [] vfs_unlink+0x75/0xe0 [37380.435782] #2: (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x71/0xa0 [37380.435792] [37380.435792] stack backtrace: [37380.435798] [] __lock_acquire+0xa99/0x1010 [37380.435808] [] lock_acquire+0x57/0x70 [37380.435814] [] xfs_ilock+0x71/0xa0 [37380.435820] [] down_write+0x38/0x50 [37380.435828] [] xfs_ilock+0x71/0xa0 [37380.435833] [] xfs_ilock+0x71/0xa0 [37380.435839] [] xfs_lock_dir_and_entry+0xf6/0x100 [37380.435847] [] xfs_remove+0x197/0x4e0 [37380.435853] [] d_instantiate+0x19/0x40 [37380.435860] [] d_rehash+0x20/0x50 
[37380.435868] [] vfs_unlink+0x75/0xe0 [37380.435875] [] xfs_vn_unlink+0x23/0x60 [37380.435882] [] __mutex_lock_slowpath+0x13f/0x280 [37380.435889] [] mark_held_locks+0x6b/0x90 [37380.435894] [] __mutex_lock_slowpath+0x13f/0x280 [37380.435900] [] __mutex_lock_slowpath+0x13f/0x280 [37380.435906] [] trace_hardirqs_on+0xb9/0x160 [37380.435913] [] vfs_unlink+0x75/0xe0 [37380.435919] [] __mutex_lock_slowpath+0x132/0x280 [37380.435925] [] vfs_unlink+0x75/0xe0 [37380.435931] [] permission+0x91/0xf0 [37380.435938] [] vfs_unlink+0x89/0xe0 [37380.435945] [] do_unlinkat+0xd2/0x160 [37380.435953] [] restore_nocheck+0x12/0x15 [37380.435959] [] trace_hardirqs_on+0xb9/0x160 [37380.435967] [] sysenter_past_esp+0x5d/0x99 [37380.435976] ======================= -- BOFH excuse #442: Trojan horse ran out of hay From owner-xfs@oss.sgi.com Mon Apr 23 12:51:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 12:51:42 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NJpZfB024622 for ; Mon, 23 Apr 2007 12:51:37 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l3NJpW1K020319; Mon, 23 Apr 2007 15:51:32 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3NJpVKC027113; Mon, 23 Apr 2007 15:51:31 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3NJpUMp012968; Mon, 23 Apr 2007 15:51:31 -0400 Message-ID: <462D0D23.7010803@sandeen.net> Date: Mon, 23 Apr 2007 14:46:43 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Christian Kujau CC: xfs@oss.sgi.com Subject: Re: possible recursive locking detected References: <20603.194.246.123.250.1177344480.squirrel@derchris.gotdns.org:8080> In-Reply-To: 
<20603.194.246.123.250.1177344480.squirrel@derchris.gotdns.org:8080> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit X-archive-position: 11149 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Christian Kujau wrote: > Hi there, > > On Mon, April 2, 2007 19:18, Christian Kujau wrote: >> when I enabled a few more debug-options in the kernel (vanilla >> 2.6.21-rc5), I came across: >> >> [ INFO: possible recursive locking detected ] >> 2.6.21-rc5 #2 > > The same happened with -rc7, see below. Can anyone comment if this > is/could lead to a problem? > The consensus seems to be that it is cosmetic. -Eric From owner-xfs@oss.sgi.com Mon Apr 23 14:43:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:43:48 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLhgfB011391 for ; Mon, 23 Apr 2007 14:43:44 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg6Ju-0004lT-RM; Mon, 23 Apr 2007 22:43:38 +0100 Date: Mon, 23 Apr 2007 22:43:38 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review: don't hold ilock when calling vn_iowait Message-ID: <20070423214338.GA17561@infradead.org> References: <20070422230303.GX32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070422230303.GX32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11150 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 
at 09:03:03AM +1000, David Chinner wrote:
>
> Regression introduced by recent freezing fixes - we should
> not hold the ilock while waiting for I/O completion.

Looks good, and actually simplifies the twisted maze that xfs_sync_inodes
is a little bit. And adding the previously missing IPOINTER_INSERT in the
SYNC_CLOSE case looks like an actual bugfix.

Of course in the end I'd still like to see all pagecache writeout
driven by sync_sb_inodes() instead of the fs code, but it'll probably
take a little longer until that is done.

From owner-xfs@oss.sgi.com Mon Apr 23 14:49:31 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:33 -0700 (PDT)
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnTfB014530 for ; Mon, 23 Apr 2007 14:49:31 -0700
Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg62w-0004CO-2a; Mon, 23 Apr 2007 22:26:06 +0100
Date: Mon, 23 Apr 2007 22:26:06 +0100
From: Christoph Hellwig
To: Utako Kusaka
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH] Fix "quota -n" command in xfs_quota.
Message-ID: <20070423212606.GE13572@infradead.org>
References: <200704190837.AA05238@TNESG9305.tnes.nec.co.jp>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200704190837.AA05238@TNESG9305.tnes.nec.co.jp>
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html
X-archive-position: 11151
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@infradead.org
Precedence: bulk
X-list: xfs

On Thu, Apr 19, 2007 at 05:37:11PM +0900, Utako Kusaka wrote:
> Hi,
>
> "quota -n" command in xfs_quota don't work when specifying the project id.
> This patch fixes it.
>
> Example:
> # ./xfs_quota -x -c 'quota -p -n 42' ~utako/mpnt
> Disk quotas for Project logfiles (42)
> Filesystem   Blocks  Quota  Limit  Warn/Time    Mounted on
> /dev/sda6        52      0      0   00 [--------] /home/utako/mpnt

Looks good to me, but even the original code could be a little
bit cleaner:

> --- xfsprogs-2.8.20/quota/quota.orig 2007-04-18 10:36:38.000000000 +0900
> +++ xfsprogs-2.8.20/quota/quota.c 2007-04-18 11:09:10.000000000 +0900
> @@ -312,7 +312,7 @@ getprojectname(
>  	static char buffer[32];
>  	fs_project_t *p;
>
> -	if ((p = getprprid(prid)))
> +	if (!numeric && (p = getprprid(prid)))
>  		return p->pr_name;
>  	snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid);
>  	return &buffer[0];

	if (!numeric) {
		fs_project_t *p = getprprid(prid);
		if (p)
			return p->pr_name;
	}
	snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid);
	return &buffer[0];

From owner-xfs@oss.sgi.com Mon Apr 23 14:49:33 2007
Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:34 -0700 (PDT)
Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnWfB014540 for ; Mon, 23 Apr 2007 14:49:32 -0700
Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg643-0004Fs-P2; Mon, 23 Apr 2007 22:27:15 +0100
Date: Mon, 23 Apr 2007 22:27:15 +0100
From: Christoph Hellwig
To: David Chinner
Cc: xfs-dev , xfs-oss
Subject: Re: review: don't block non-blocking writes when frozen
Message-ID: <20070423212715.GF13572@infradead.org>
References: <20070423002616.GY32602149@melbourne.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070423002616.GY32602149@melbourne.sgi.com>
User-Agent: Mutt/1.4.2.2i
X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html
X-archive-position: 11152
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to:
xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 10:26:16AM +1000, David Chinner wrote: > > Prevent nfsds for blocking trying to write to a frozen filesystem > or a filesystem in the process of freezing. Looks good for trees actually having support for non-blocking file I/O, which doesn't include mainline. (So please don't send this upstream) From owner-xfs@oss.sgi.com Mon Apr 23 14:49:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:42 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLncfB014659 for ; Mon, 23 Apr 2007 14:49:39 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg64R-0004G8-Jk; Mon, 23 Apr 2007 22:27:39 +0100 Date: Mon, 23 Apr 2007 22:27:39 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review: don't block non-blocking setattr when frozen Message-ID: <20070423212739.GG13572@infradead.org> References: <20070423003142.GZ32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423003142.GZ32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11156 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 10:31:42AM +1000, David Chinner wrote: > > Prevent nfsds from blocking doing on setattr calls (i.e. truncates) > when the filesystem is frozen or being frozen. Ok with same caveat as the last one.
From owner-xfs@oss.sgi.com Mon Apr 23 14:49:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:38 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnZfB014590 for ; Mon, 23 Apr 2007 14:49:36 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg5yz-0003yk-4J; Mon, 23 Apr 2007 22:22:01 +0100 Date: Mon, 23 Apr 2007 22:22:01 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review: allocate alloc args Message-ID: <20070423212201.GB13572@infradead.org> References: <20070419073216.GT48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419073216.GT48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11154 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 05:32:16PM +1000, David Chinner wrote: > > Save some stack space in the critical allocator paths > by allocating the xfs_alloc_arg_t structures (104 bytes > on 64bit, 88 bytes on 32bit systems) rather than placing > them on the stack. > > There can be more than one of these structures on the stack > through the critical allocation path (e.g. xfs_bmap_btalloc() > and xfs_alloc_fix_freelist()) so there are significant > savings to be had here... I don't like doing even more dynamic allocations that deep down in the stack. Can we try another approach and see if we really need the full structure in all places but could rather pass down a few arguments or a smaller structure instead? 
I've done some work on that in the dir code with quite good results a while ago (and yes, I need to bring it up to date and send it out) From owner-xfs@oss.sgi.com Mon Apr 23 14:49:41 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:43 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnefB014695 for ; Mon, 23 Apr 2007 14:49:41 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg60W-000429-99; Mon, 23 Apr 2007 22:23:36 +0100 Date: Mon, 23 Apr 2007 22:23:36 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review: handle barriers being switched off dynamically. Message-ID: <20070423212336.GC13572@infradead.org> References: <20070419073714.GU48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419073714.GU48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11157 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 05:37:14PM +1000, David Chinner wrote: > > As pointed out by Neil Brown, MD can switch barriers off > dynamically underneath a mounted filesystem. If this happens > to XFS, it will shutdown the filesystem immediately. > > Handle this more sanely by yelling into the syslog, retrying > the I/O without barriers and if that is successful, turn > off barriers. > > Also remove an unnecessary check when first checking to > see if the underlying device supports barriers. Looks good to me (well, not really good, but as good as it can be given the circumstances..)
> + /* > + * We can get an EOPNOTSUPP to ordered writes. Here we clear the > + * ordered flag and reissue them. Because we can't tell the higher > + * layers directly that they should not issue ordered I/O anymore, they > + * need to check if the ordered flag was cleared during I/O completion. > + */ > + if ((bp->b_error == EOPNOTSUPP) && no need for the additional braces here, though. From owner-xfs@oss.sgi.com Mon Apr 23 14:49:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:40 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnbfB014624 for ; Mon, 23 Apr 2007 14:49:37 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg60x-00043C-If; Mon, 23 Apr 2007 22:24:03 +0100 Date: Mon, 23 Apr 2007 22:24:03 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review: fix use after free of log buffers on shutdown. Message-ID: <20070423212403.GD13572@infradead.org> References: <20070419075338.GV48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419075338.GV48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11155 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 19, 2007 at 05:53:38PM +1000, David Chinner wrote: > > When unmounting the filesystem we write an unmount record into the > log just before we start freeing up in memory structures. 
> > When we wait for the unmount record to hit the disk, we don't > wait for the log buffers to be finished with, we only wait for part of > the iodone callback to be run - the bit that processes the > unmount record completion. > > Hence when the unmount wakes up, it races with the remainder of the > log io completion and pretty much the first thing it does is free > the log buffers. > > As a result, when iodone processing completes and we check the > buffer's async status, the buffer can already have been freed. > > Luckily, all log I/O is issued asynchronously, so we don't really > need the async check and so we can avoid this use after free > easily. Looks good to me. > Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c > =================================================================== > --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-19 17:18:14.097380099 +1000 > +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-19 17:51:03.017078512 +1000 > @@ -988,14 +988,16 @@ xlog_iodone(xfs_buf_t *bp) > } else if (iclog->ic_state & XLOG_STATE_IOERROR) { > aborted = XFS_LI_ABORTED; > } > + > + /* log I/O is always issued ASYNC */ > + ASSERT(XFS_BUF_ISASYNC(bp)); > xlog_state_done_syncing(iclog, aborted); > - if (!(XFS_BUF_ISASYNC(bp))) { > - /* > - * Corresponding psema() will be done in bwrite(). If we don't > - * vsema() here, panic. > - */ > - XFS_BUF_V_IODONESEMA(bp); > - } > + /* > + * do not reference the buffer (bp) here as we could race > + * with it being freed after writing the unmount record to the > + * log. 
> + */ > + > } /* xlog_iodone */ > > /* > > ---end quoted text--- From owner-xfs@oss.sgi.com Mon Apr 23 14:49:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 14:49:36 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NLnXfB014564 for ; Mon, 23 Apr 2007 14:49:34 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg5wu-0003oI-5g; Mon, 23 Apr 2007 22:19:52 +0100 Date: Mon, 23 Apr 2007 22:19:52 +0100 From: Christoph Hellwig To: Eric Sandeen Cc: Christian Kujau , xfs@oss.sgi.com Subject: Re: possible recursive locking detected Message-ID: <20070423211952.GA13572@infradead.org> References: <20603.194.246.123.250.1177344480.squirrel@derchris.gotdns.org:8080> <462D0D23.7010803@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <462D0D23.7010803@sandeen.net> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11153 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 02:46:43PM -0500, Eric Sandeen wrote: > Christian Kujau wrote: > > Hi there, > > > > On Mon, April 2, 2007 19:18, Christian Kujau wrote: > >> when I enabled a few more debug-options in the kernel (vanilla > >> 2.6.21-rc5), I came across: > >> > >> [ INFO: possible recursive locking detected ] > >> 2.6.21-rc5 #2 > > > > The same happened with -rc7, see below. Can anyone comment if this > > is/could lead to a problem? > > > > The consensus seems to be that it is cosmetic. It's not really cosmetic. It means i_lock and i_iolock are beeing acquired without an order that is detectable by lockdep. 
At the very least it means annotations for lockdep are missing, because acquiring two per-inode locks at the same time is a basic fact in unix filesystems. But deeper than that the rules for taking both locks are not very well defined in XFS. These rules at least need documentation in form of lockdep annotations, and possibly some fixes and cleanups around the more dirty areas like xfs_lock_for_rename() or xfs_lock_dir_and_entry() From owner-xfs@oss.sgi.com Mon Apr 23 15:00:13 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 15:00:15 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NM0BfB020909 for ; Mon, 23 Apr 2007 15:00:13 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg6Zu-0005Ts-Ri; Mon, 23 Apr 2007 23:00:10 +0100 Date: Mon, 23 Apr 2007 23:00:10 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070423220010.GA18325@infradead.org> References: <20070419231459.GX48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419231459.GX48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11158 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs > - if ((error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno))) > + error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0); > + if (error) Nice cleanup here. > - if ((error = xfs_alloc_get_freelist(tp, agbp, &bno))) > + if ((error = xfs_alloc_get_freelist(tp, agbp, &bno, 0))) but not here.
Any chance you could use the linux kernel preferred style in the first hunk everywhere? Especially in new code where the patch uses the second style as well. > +#define XFS_SB_VERSION2_LAZYSBCOUNTBIT 0x00000002 /* Superblk counters */ The flag seems a bit misnamed to me. It's really about counting the freelist blocks, not the lazy counters that require it. But given that it's been in IRIX for a while that's probably not something we could change. > +#define XFS_SB_VERSION_LAZYSBCOUNT(sbp) xfs_sb_version_haslazysbcount(sbp) > +static inline int xfs_sb_version_haslazysbcount(xfs_sb_t *sbp) Please just use the inline version everywhere and don't introduce the superfluous uppercase macro. > + * Write into superblock the fields that we haven't > + * been logging - allocated/free inode and free block > + * counts - from the incore superblock. > + */ > + error = xfs_log_sbcount(mp, (XFS_LOG_FORCE|XFS_LOG_SYNC)); > > - xfs_icsb_sync_counters(mp); > + sbp = xfs_getsb(mp, 0); > + sb = XFS_BUF_TO_SBP(sbp); > + if (error) { > + /* > + * Hmmm - failed to get log reservations so just > + * do the mod without a transaction. Whine about > + * it, too. > + */ > + ASSERT(XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)); > + xfs_fs_cmn_err(CE_NOTE, mp, > + "Unmounting, non-transactional sb update"); > + s = XFS_SB_LOCK(mp); > + INT_SET(sb->sb_icount, ARCH_CONVERT, mp->m_sb.sb_icount); > + INT_SET(sb->sb_ifree, ARCH_CONVERT, mp->m_sb.sb_ifree); > + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); > + XFS_SB_UNLOCK(mp, s); This is really quite nasty. Should we at least force a cache flush here? > + * > + * We also update the disk superblock with incore counter > + * values if we are using non-persistent counters so that > + * they don't get too far out of sync if we crash or get a > + * forced shutdown. We don't want to force this to disk, > + * just get a transaction into the iclogs....
> */ > if (flags & SYNC_REFCACHE) { > if (flags & SYNC_WAIT) > xfs_refcache_purge_mp(mp); > else > xfs_refcache_purge_some(mp); > + xfs_log_sbcount(mp, 0); Can you please give this a SYNC_ flag of it's own? SYNC_REFCACHE is misnamed for this, and I hope it will go away once we stop pretending to support 2.4 builds. From owner-xfs@oss.sgi.com Mon Apr 23 15:00:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 15:00:47 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NM0hfB021201 for ; Mon, 23 Apr 2007 15:00:45 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg6aR-0005UF-4D; Mon, 23 Apr 2007 23:00:43 +0100 Date: Mon, 23 Apr 2007 23:00:43 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - fix interaction with per-cpu incore counters Message-ID: <20070423220043.GB18325@infradead.org> References: <20070419231859.GY48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419231859.GY48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11159 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Fri, Apr 20, 2007 at 09:18:59AM +1000, David Chinner wrote: > > We need to reinitialise the per-cpu superblock counters after > we have corrected them in the new recovery phase (summing hte AGF/AGI > counters. I think this should be merged into the first patch. It mostly moves code from that patch around, and the tree is in a buggy state without it. 
From owner-xfs@oss.sgi.com Mon Apr 23 15:16:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 15:16:37 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3NMGSfB027455 for ; Mon, 23 Apr 2007 15:16:31 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA29081; Tue, 24 Apr 2007 08:16:25 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3NMGOAf74864464; Tue, 24 Apr 2007 08:16:24 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3NMGNfa74921501; Tue, 24 Apr 2007 08:16:23 +1000 (AEST) Date: Tue, 24 Apr 2007 08:16:23 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070423221622.GL32602149@melbourne.sgi.com> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423220010.GA18325@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11160 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 11:00:10PM +0100, Christoph Hellwig wrote: > > - if ((error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno))) > > + error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0); > > + if (error) > > Nice cleanup here. > > > - if ((error = xfs_alloc_get_freelist(tp, agbp, &bno))) > > + if ((error = xfs_alloc_get_freelist(tp, agbp, &bno, 0))) > > but not here. 
Any chance you could use the linux kernel prefered style in the first > hunk everywhere? Especially in new code where the patch uses the second style aswell. Yeah, I can convert them - the first would have been done because the line would have been > 80 chars.... > > +#define XFS_SB_VERSION2_LAZYSBCOUNTBIT 0x00000002 /* Superblk counters */ > > The flag seems a bit misnamed to me. It's really about counting the freelist > blocks, not the lazy counters that require it. But given that it's been in > IRIX for a while that's probably not something we could change. Though in the places where it is checked in the transaction code, it does actually make sense to call it that (i.e. if lazysb don't mark the sb dirty) which is where it came from. The need to count freelist blocks came up after that code was done and working. So yes, it doesn't describe the on disk format change closely, just the functionality that uses it.... > > +#define XFS_SB_VERSION_LAZYSBCOUNT(sbp) xfs_sb_version_haslazysbcount(sbp) > > +static inline int xfs_sb_version_haslazysbcount(xfs_sb_t *sbp) > > Please just use the inline version everywhere and don't introduce the > uppercase superflous macro. Will do. > > + * Write into superblock the fields that we haven't > > + * been logging - allocated/free inode and free block > > + * counts - from the incore superblock. > > + */ > > + error = xfs_log_sbcount(mp, (XFS_LOG_FORCE|XFS_LOG_SYNC)); > > > > - xfs_icsb_sync_counters(mp); > > + sbp = xfs_getsb(mp, 0); > > + sb = XFS_BUF_TO_SBP(sbp); > > + if (error) { > > + /* > > + * Hmmm - failed to get log reservations so just > > + * do the mod without a transaction. Whine about > > + * it, too. 
> > + */ > > + ASSERT(XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)); > > + xfs_fs_cmn_err(CE_NOTE, mp, > > + "Unmounting, non-transactional sb update"); > > + s = XFS_SB_LOCK(mp); > > + INT_SET(sb->sb_icount, ARCH_CONVERT, mp->m_sb.sb_icount); > > + INT_SET(sb->sb_ifree, ARCH_CONVERT, mp->m_sb.sb_ifree); > > + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); > > + XFS_SB_UNLOCK(mp, s); > > This is really quite nasty. Should we at least force a cache flush here? Well, that is what it's doing - xfs_log_sbcount() flushes the counters and logs the changes to the superblock. If that fails (very rare) we've already got the current values in mp->m_sb and so all we need to do is push them into the disk superblock and write it. > > + * > > + * We also update the disk superblock with incore counter > > + * values if we are using non-persistent counters so that > > + * they don't get too far out of sync if we crash or get a > > + * forced shutdown. We don't want to force this to disk, > > + * just get a transaction into the iclogs.... > > */ > > if (flags & SYNC_REFCACHE) { > > if (flags & SYNC_WAIT) > > xfs_refcache_purge_mp(mp); > > else > > xfs_refcache_purge_some(mp); > > + xfs_log_sbcount(mp, 0); > > Can you please give this a SYNC_ flag of it's own? SYNC_REFCACHE is > misnamed for this, and I hope it will go away once we stop pretending > to support 2.4 builds. Will do. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 15:23:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 15:23:45 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3NMNffB031057 for ; Mon, 23 Apr 2007 15:23:43 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hg6we-0006Di-UW; Mon, 23 Apr 2007 23:23:40 +0100 Date: Mon, 23 Apr 2007 23:23:40 +0100 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070423222340.GA23870@infradead.org> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> <20070423221622.GL32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423221622.GL32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11161 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 08:16:23AM +1000, David Chinner wrote: > > > + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); > > > + XFS_SB_UNLOCK(mp, s); > > > > This is really quite nasty. Should we at least force a cache flush here? > > Well, that is what it's doing - xfs_log_sbcount() flushes the counters and > logs the changes to the superblock. If that fails (very rare) we've already > got the current values in mp->m_sb and so all we need to do is push them > into the disk superblock and write it. Sorry, should have been more detailed. 
I meant the disk cache, as in blkdev_issue_flush, to make sure the data hits the disk, even if it doesn't go through a transaction which would normally do that. (in the barriers case) From owner-xfs@oss.sgi.com Mon Apr 23 15:33:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 15:33:13 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3NMX7fB003514 for ; Mon, 23 Apr 2007 15:33:10 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id IAA29628; Tue, 24 Apr 2007 08:32:59 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3NMWwAf74145286; Tue, 24 Apr 2007 08:32:58 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3NMWvT175001017; Tue, 24 Apr 2007 08:32:57 +1000 (AEST) Date: Tue, 24 Apr 2007 08:32:57 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review: allocate alloc args Message-ID: <20070423223257.GM32602149@melbourne.sgi.com> References: <20070419073216.GT48531920@melbourne.sgi.com> <20070423212201.GB13572@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423212201.GB13572@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11162 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 10:22:01PM +0100, Christoph Hellwig wrote: > On Thu, Apr 19, 2007 at 05:32:16PM +1000, David Chinner wrote: > > > > Save some stack space in the critical allocator paths > > by allocating the xfs_alloc_arg_t structures (104 bytes > > on 64bit, 88 bytes on 32bit systems) 
rather than placing > > them on the stack. > > > > There can be more than one of these structures on the stack > > through the critical allocation path (e.g. xfs_bmap_btalloc() > > and xfs_alloc_fix_freelist()) so there are significant > > savings to be had here... > > I don't like doing even more dynamic allocations that deep > down in the stack. I'm not a big fan of it either, but I don't really see any other option here. We need a bunch of temporary space for structures *somewhere*, and if there isn't enough stack space then it's got to come from somewhere else. > Can we try another approach and see if > we really need the full structure in all places but could > rather pass down a few arguments or a smaller structure instead? We use all the variables in the xfs_bmalloca_t structure when it is passed, and we use most (all?) of the args in xfs_alloc_args_t in the critical spots so I'm not sure whether this approach could gain us much. Given that both you and Nathan have expressed concern about these two alloc arg patches, I'm going to hold off doing anything with them. We're caught between a rock and a hard place here - if we use too much stack and then can't dynamically allocate safely then I don't really see that there is that much we can do to reduce stack usage.... > I've done some work on that in the dir good with quite good results > a while ago (and yes, I need to bring it up to date and send it out) Sounds interesting - I'll have a closer look at this in the allocator context. Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 16:13:52 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 16:13:54 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3NNDnfB021234 for ; Mon, 23 Apr 2007 16:13:51 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA00924; Tue, 24 Apr 2007 09:13:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3NNDdAf74879383; Tue, 24 Apr 2007 09:13:39 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3NNDbA774966022; Tue, 24 Apr 2007 09:13:37 +1000 (AEST) Date: Tue, 24 Apr 2007 09:13:37 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , gnb@sgi.com, xfs-oss , nfs@lists.sourceforge.net Subject: Re: review: don't block non-blocking writes when frozen Message-ID: <20070423231337.GN32602149@melbourne.sgi.com> References: <20070423002616.GY32602149@melbourne.sgi.com> <20070423212715.GF13572@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423212715.GF13572@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11163 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 10:27:15PM +0100, Christoph Hellwig wrote: > On Mon, Apr 23, 2007 at 10:26:16AM +1000, David Chinner wrote: > > > > Prevent nfsds for blocking trying to write to a frozen filesystem > > or a filesystem in the process of freezing. 
> > Looks good for trees actually having support for non-blocking file > I/O, which doesn't include mainline. (So please don't send this > upstead) Yeah - you NACKed that a year ago based on the fact it would never get used by mainline code: http://marc.info/?l=linux-nfs&m=114683005119982&w=2 So, given the catch-22 you've just presented us can we revisit the nfsd non-blocking I/O issue again? This affects anyone using DM snapshots on their NFS servers and has nothing to do with HSMs or DMAPI... FWIW, you can still do non-blocking userspace I/O to a file, so this XFS patch is still valid for mainline (that's how I tested it). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 16:17:14 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 16:17:16 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3NNHBfB022721 for ; Mon, 23 Apr 2007 16:17:13 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA01238; Tue, 24 Apr 2007 09:17:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3NNH7Af74765876; Tue, 24 Apr 2007 09:17:07 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3NNH6JP74995522; Tue, 24 Apr 2007 09:17:06 +1000 (AEST) Date: Tue, 24 Apr 2007 09:17:06 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review: don't hold ilock when calling vn_iowait Message-ID: <20070423231706.GO32602149@melbourne.sgi.com> References: <20070422230303.GX32602149@melbourne.sgi.com> <20070423214338.GA17561@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline 
In-Reply-To: <20070423214338.GA17561@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11164 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 10:43:38PM +0100, Christoph Hellwig wrote: > On Mon, Apr 23, 2007 at 09:03:03AM +1000, David Chinner wrote: > > > > Regression introduced by recent freezing fixes - we should > > not hold the ilock while waiting for I/O completion. > > Looks good, and actually simplies the twisted maze the xfs_sync_inodes is > a little bit. And the missing IPOINTER_INSERT in the SYNC_CLOSE case > looks like an actual bugfix. I had to look closely at that IPOINTER_INSERT case with SYNC_CLOSE; it was actually working properly because you'd always end up in the SYNC_CLOSE case having inserted a pointer earlier on in the flow of the function. It certainly wasn't obvious that it was doing the right thing, though. > Of course in the end I'd still like to see all pagecache-writeout to > be driven by sync_sb_inodes() instead of the fs code, but it'll probably > take a little longer until that is done. Agreed on both counts. Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 16:20:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 16:20:10 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3NNK6fB027382 for ; Mon, 23 Apr 2007 16:20:07 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA01403; Tue, 24 Apr 2007 09:20:04 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3NNK3Af74197441; Tue, 24 Apr 2007 09:20:03 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3NNK2Jb74997309; Tue, 24 Apr 2007 09:20:02 +1000 (AEST) Date: Tue, 24 Apr 2007 09:20:02 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070423232002.GP32602149@melbourne.sgi.com> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> <20070423221622.GL32602149@melbourne.sgi.com> <20070423222340.GA23870@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423222340.GA23870@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11165 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 11:23:40PM +0100, Christoph Hellwig wrote: > On Tue, Apr 24, 2007 at 08:16:23AM +1000, David Chinner wrote: > > > > + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); > > > > + XFS_SB_UNLOCK(mp, s); > > > > > > This is really quite nasty. 
Should we at least force a cache flush here? > > > > Well, that is what it's doing - xfs_log_sbcount() flushes the counters and > > logs the changes to the superblock. If that fails (very rare) we've already > > got the current values in mp->m_sb and so all we need to do is push them > > into the disk superblock and write it. > > Sorry, should have been more detailed. I meant the disk cache, as in > blkdev_issue_flush, to make sure the data hits the disk, even if it doesn't > go through a transaction which would normally do that. (in the barriers case) Ah, gotcha. Hmmm - if this is necessary, I may as well add the flush to the closing of the buftargs - that way we will always be certain that an unmount leaves everything on disk and not in disk caches. That sounds like a better approach to me than putting an explicit flush in this particular case. Sound fair? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 18:28:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 18:28:24 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3O1SEfB026569 for ; Mon, 23 Apr 2007 18:28:16 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA04443; Tue, 24 Apr 2007 11:28:10 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3O1S9Af74902958; Tue, 24 Apr 2007 11:28:09 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3O1S87f73627054; Tue, 24 Apr 2007 11:28:08 +1000 (AEST) Date: Tue, 24 Apr 2007 11:28:08 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel 
Message-ID: <20070424012808.GD48531920@melbourne.sgi.com> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423220010.GA18325@infradead.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11166 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 23, 2007 at 11:00:10PM +0100, Christoph Hellwig wrote: > > + * Write into superblock the fields that we haven't > > + * been logging - allocated/free inode and free block > > + * counts - from the incore superblock. > > + */ > > + error = xfs_log_sbcount(mp, (XFS_LOG_FORCE|XFS_LOG_SYNC)); > > > > - xfs_icsb_sync_counters(mp); > > + sbp = xfs_getsb(mp, 0); > > + sb = XFS_BUF_TO_SBP(sbp); > > + if (error) { > > + /* > > + * Hmmm - failed to get log reservations so just > > + * do the mod without a transaction. Whine about > > + * it, too. > > + */ > > + ASSERT(XFS_SB_VERSION_LAZYSBCOUNT(&mp->m_sb)); > > + xfs_fs_cmn_err(CE_NOTE, mp, > > + "Unmounting, non-transactional sb update"); > > + s = XFS_SB_LOCK(mp); > > + INT_SET(sb->sb_icount, ARCH_CONVERT, mp->m_sb.sb_icount); > > + INT_SET(sb->sb_ifree, ARCH_CONVERT, mp->m_sb.sb_ifree); > > + INT_SET(sb->sb_fdblocks, ARCH_CONVERT, mp->m_sb.sb_fdblocks); > > + XFS_SB_UNLOCK(mp, s); > > This is really quite nasty. Should we at least force a cache flush here? Ok, so the patch I sent out was an older version that had a very similar name to the current patch in my series (xfs-lazy-sb vs xfs_lazy_sb). This code doesn't exist in the version I should have sent out. The latest version, plus the changes suggested here and with the second patch folded back into it is attached. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.4/xfs_super.c | 5 - fs/xfs/linux-2.4/xfs_vfs.h | 1 fs/xfs/linux-2.6/xfs_vfs.h | 1 fs/xfs/xfs_ag.h | 8 +- fs/xfs/xfs_alloc.c | 48 ++++++++++--- fs/xfs/xfs_alloc.h | 6 + fs/xfs/xfs_alloc_btree.c | 20 +++-- fs/xfs/xfs_fs.h | 1 fs/xfs/xfs_fsops.c | 2 fs/xfs/xfs_ialloc.c | 28 ++++++- fs/xfs/xfs_ialloc.h | 10 ++ fs/xfs/xfs_log.c | 4 - fs/xfs/xfs_log_recover.c | 8 ++ fs/xfs/xfs_mount.c | 154 +++++++++++++++++++++++++++++++++++++++++-- fs/xfs/xfs_mount.h | 10 +- fs/xfs/xfs_sb.h | 16 +++- fs/xfs/xfs_trans.c | 58 ++++++++++++---- fs/xfs/xfs_trans.h | 3 fs/xfs/xfs_vfsops.c | 11 +++ fs/xfs/xfsidbg.c | 1 20 files changed, 340 insertions(+), 55 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2007-04-24 09:55:36.295884906 +1000 @@ -68,6 +68,7 @@ typedef struct xfs_agf { __be32 agf_flcount; /* count of blocks in freelist */ __be32 agf_freeblks; /* total free blocks */ __be32 agf_longest; /* longest free space */ + __be32 agf_btreeblks; /* # of blocks held in AGF btrees */ } xfs_agf_t; #define XFS_AGF_MAGICNUM 0x00000001 @@ -81,7 +82,8 @@ typedef struct xfs_agf { #define XFS_AGF_FLCOUNT 0x00000100 #define XFS_AGF_FREEBLKS 0x00000200 #define XFS_AGF_LONGEST 0x00000400 -#define XFS_AGF_NUM_BITS 11 +#define XFS_AGF_BTREEBLKS 0x00000800 +#define XFS_AGF_NUM_BITS 12 #define XFS_AGF_ALL_BITS ((1 << XFS_AGF_NUM_BITS) - 1) /* disk block (xfs_daddr_t) in the AG */ @@ -186,11 +188,13 @@ typedef struct xfs_perag __uint32_t pagf_flcount; /* count of blocks in freelist */ xfs_extlen_t pagf_freeblks; /* total free blocks */ xfs_extlen_t pagf_longest; /* longest free space */ + __uint32_t pagf_btreeblks; /* # of blocks held in AGF btrees */ xfs_agino_t pagi_freecount; /* number of free inodes */ + xfs_agino_t pagi_count; /* number of 
allocated inodes */ + int pagb_count; /* pagb slots in use */ #ifdef __KERNEL__ lock_t pagb_lock; /* lock for pagb_list */ #endif - int pagb_count; /* pagb slots in use */ xfs_perag_busy_t *pagb_list; /* unstable blocks */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c 2007-04-24 09:55:36.299884381 +1000 @@ -1447,7 +1447,8 @@ xfs_alloc_ag_vextent_small( else if (args->minlen == 1 && args->alignment == 1 && !args->isfl && (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount) > args->minleft)) { - if ((error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno))) + error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0); + if (error) goto error0; if (fbno != NULLAGBLOCK) { if (args->userdata) { @@ -1923,7 +1924,8 @@ xfs_alloc_fix_freelist( while (be32_to_cpu(agf->agf_flcount) > need) { xfs_buf_t *bp; - if ((error = xfs_alloc_get_freelist(tp, agbp, &bno))) + error = xfs_alloc_get_freelist(tp, agbp, &bno, 0); + if (error) return error; if ((error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1))) return error; @@ -1973,8 +1975,9 @@ xfs_alloc_fix_freelist( * Put each allocated block on the list. */ for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { - if ((error = xfs_alloc_put_freelist(tp, agbp, agflbp, - bno))) + error = xfs_alloc_put_freelist(tp, agbp, + agflbp, bno, 0); + if (error) return error; } } @@ -1991,13 +1994,15 @@ int /* error */ xfs_alloc_get_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop) /* block address retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk) /* destination is a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. 
freelist structure */ xfs_buf_t *agflbp;/* buffer for a.g. freelist structure */ xfs_agblock_t bno; /* block number returned */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_get_freelist"; #endif @@ -2032,8 +2037,16 @@ xfs_alloc_get_freelist( be32_add(&agf->agf_flcount, -1); xfs_trans_agflist_delta(tp, -1); pag->pagf_flcount--; - TRACE_MODAGF(NULL, agf, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT); + + logflags = XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, 1); + pag->pagf_btreeblks++; + logflags |= XFS_AGF_BTREEBLKS; + } + + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); *bnop = bno; /* @@ -2071,6 +2084,7 @@ xfs_alloc_log_agf( offsetof(xfs_agf_t, agf_flcount), offsetof(xfs_agf_t, agf_freeblks), offsetof(xfs_agf_t, agf_longest), + offsetof(xfs_agf_t, agf_btreeblks), sizeof(xfs_agf_t) }; @@ -2106,12 +2120,14 @@ xfs_alloc_put_freelist( xfs_trans_t *tp, /* transaction pointer */ xfs_buf_t *agbp, /* buffer for a.g. freelist header */ xfs_buf_t *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno) /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk) /* block came from a AGF btree */ { xfs_agf_t *agf; /* a.g. freespace structure */ xfs_agfl_t *agfl; /* a.g. 
free block array */ __be32 *blockp;/* pointer to array entry */ int error; + int logflags; #ifdef XFS_ALLOC_TRACE static char fname[] = "xfs_alloc_put_freelist"; #endif @@ -2132,11 +2148,22 @@ xfs_alloc_put_freelist( be32_add(&agf->agf_flcount, 1); xfs_trans_agflist_delta(tp, 1); pag->pagf_flcount++; + + logflags = XFS_AGF_FLLAST | XFS_AGF_FLCOUNT; + if (btreeblk) { + be32_add(&agf->agf_btreeblks, -1); + pag->pagf_btreeblks--; + logflags |= XFS_AGF_BTREEBLKS; + } + + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); + ASSERT(be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp)); blockp = &agfl->agfl_bno[be32_to_cpu(agf->agf_fllast)]; *blockp = cpu_to_be32(bno); - TRACE_MODAGF(NULL, agf, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); - xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); + TRACE_MODAGF(NULL, agf, logflags); + xfs_alloc_log_agf(tp, agbp, logflags); xfs_trans_log_buf(tp, agflbp, (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl), (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl + @@ -2196,6 +2223,7 @@ xfs_alloc_read_agf( pag = &mp->m_perag[agno]; if (!pag->pagf_init) { pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks); + pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks); pag->pagf_flcount = be32_to_cpu(agf->agf_flcount); pag->pagf_longest = be32_to_cpu(agf->agf_longest); pag->pagf_levels[XFS_BTNUM_BNOi] = Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.h 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.h 2007-04-24 09:55:36.299884381 +1000 @@ -136,7 +136,8 @@ int /* error */ xfs_alloc_get_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer containing the agf structure */ - xfs_agblock_t *bnop); /* block address retrieved from freelist */ + xfs_agblock_t *bnop, /* block address retrieved from freelist */ + int btreeblk); /* destination is a AGF btree */ /* * Log the given 
fields from the agf structure. @@ -165,7 +166,8 @@ xfs_alloc_put_freelist( struct xfs_trans *tp, /* transaction pointer */ struct xfs_buf *agbp, /* buffer for a.g. freelist header */ struct xfs_buf *agflbp,/* buffer for a.g. free block array */ - xfs_agblock_t bno); /* block being freed */ + xfs_agblock_t bno, /* block being freed */ + int btreeblk); /* owner was a AGF btree */ /* * Read in the allocation group header (free/alloc section). Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc_btree.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc_btree.c 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_alloc_btree.c 2007-04-24 09:55:36.303883856 +1000 @@ -226,8 +226,9 @@ xfs_alloc_delrec( /* * Put this buffer/block on the ag's freelist. */ - if ((error = xfs_alloc_put_freelist(cur->bc_tp, - cur->bc_private.a.agbp, NULL, bno))) + error = xfs_alloc_put_freelist(cur->bc_tp, + cur->bc_private.a.agbp, NULL, bno, 1); + if (error) return error; /* * Since blocks move to the free list without the @@ -549,8 +550,9 @@ xfs_alloc_delrec( /* * Free the deleting block by putting it on the freelist. */ - if ((error = xfs_alloc_put_freelist(cur->bc_tp, cur->bc_private.a.agbp, - NULL, rbno))) + error = xfs_alloc_put_freelist(cur->bc_tp, + cur->bc_private.a.agbp, NULL, rbno, 1); + if (error) return error; /* * Since blocks move to the free list without the coordination @@ -1320,8 +1322,9 @@ xfs_alloc_newroot( /* * Get a buffer from the freelist blocks, for the new root. */ - if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &nbno))) + error = xfs_alloc_get_freelist(cur->bc_tp, + cur->bc_private.a.agbp, &nbno, 1); + if (error) return error; /* * None available, we fail. @@ -1604,8 +1607,9 @@ xfs_alloc_split( * Allocate the new block from the freelist. * If we can't do it, we're toast. Give up. 
*/ - if ((error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp, - &rbno))) + error = xfs_alloc_get_freelist(cur->bc_tp, + cur->bc_private.a.agbp, &rbno, 1); + if (error) return error; if (rbno == NULLAGBLOCK) { *stat = 0; Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h 2007-04-24 09:55:36.303883856 +1000 @@ -238,6 +238,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_LOGV2 0x0100 /* log format version 2 */ #define XFS_FSOP_GEOM_FLAGS_SECTOR 0x0200 /* sector sizes >1BB */ #define XFS_FSOP_GEOM_FLAGS_ATTR2 0x0400 /* inline attributes rework */ +#define XFS_FSOP_GEOM_FLAGS_LAZYSB 0x4000 /* lazy superblock counters */ /* Index: 2.6.x-xfs-new/fs/xfs/xfs_fsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_fsops.c 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_fsops.c 2007-04-24 09:55:36.303883856 +1000 @@ -94,6 +94,8 @@ xfs_fs_geometry( XFS_FSOP_GEOM_FLAGS_DIRV2 : 0) | (XFS_SB_VERSION_HASSECTOR(&mp->m_sb) ? XFS_FSOP_GEOM_FLAGS_SECTOR : 0) | + (xfs_sb_version_haslazysbcount(&mp->m_sb) ? + XFS_FSOP_GEOM_FLAGS_LAZYSB : 0) | (XFS_SB_VERSION_HASATTR2(&mp->m_sb) ? XFS_FSOP_GEOM_FLAGS_ATTR2 : 0); geo->logsectsize = XFS_SB_VERSION_HASSECTOR(&mp->m_sb) ? 
Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2007-04-24 09:55:36.307883331 +1000 @@ -123,6 +123,7 @@ xfs_ialloc_ag_alloc( int blks_per_cluster; /* fs blocks per inode cluster */ xfs_btree_cur_t *cur; /* inode btree cursor */ xfs_daddr_t d; /* disk addr of buffer */ + xfs_agnumber_t agno; int error; xfs_buf_t *fbuf; /* new free inodes' buffer */ xfs_dinode_t *free; /* new free inode structure */ @@ -302,15 +303,15 @@ xfs_ialloc_ag_alloc( } be32_add(&agi->agi_count, newlen); be32_add(&agi->agi_freecount, newlen); + agno = be32_to_cpu(agi->agi_seqno); down_read(&args.mp->m_peraglock); - args.mp->m_perag[be32_to_cpu(agi->agi_seqno)].pagi_freecount += newlen; + args.mp->m_perag[agno].pagi_freecount += newlen; up_read(&args.mp->m_peraglock); agi->agi_newino = cpu_to_be32(newino); /* * Insert records describing the new inode chunk into the btree. 
*/ - cur = xfs_btree_init_cursor(args.mp, tp, agbp, - be32_to_cpu(agi->agi_seqno), + cur = xfs_btree_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO, (xfs_inode_t *)0, 0); for (thisino = newino; thisino < newino + newlen; @@ -1460,6 +1461,7 @@ xfs_ialloc_read_agi( pag = &mp->m_perag[agno]; if (!pag->pagi_init) { pag->pagi_freecount = be32_to_cpu(agi->agi_freecount); + pag->pagi_count = be32_to_cpu(agi->agi_count); pag->pagi_init = 1; } else { /* @@ -1483,3 +1485,23 @@ xfs_ialloc_read_agi( *bpp = bp; return 0; } + +/* + * Read in the agi to initialise the per-ag data in the mount structure + */ +int +xfs_ialloc_pagi_init( + xfs_mount_t *mp, /* file system mount structure */ + xfs_trans_t *tp, /* transaction pointer */ + xfs_agnumber_t agno) /* allocation group number */ +{ + xfs_buf_t *bp = NULL; + int error; + + error = xfs_ialloc_read_agi(mp, tp, agno, &bp); + if (error) + return error; + if (bp) + xfs_trans_brelse(tp, bp); + return 0; +} Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.h 2007-04-24 09:53:43.718660132 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.h 2007-04-24 09:55:36.307883331 +1000 @@ -149,6 +149,16 @@ xfs_ialloc_read_agi( xfs_agnumber_t agno, /* allocation group number */ struct xfs_buf **bpp); /* allocation group hdr buf */ +/* + * Read in the allocation group header to initialise the per-ag data + * in the mount structure + */ +int +xfs_ialloc_pagi_init( + struct xfs_mount *mp, /* file system mount structure */ + struct xfs_trans *tp, /* transaction pointer */ + xfs_agnumber_t agno); /* allocation group number */ + #endif /* __KERNEL__ */ #endif /* __XFS_IALLOC_H__ */ Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-04-24 09:55:36.311882807 +1000 @@ -827,10 +827,8 @@ 
xfs_log_need_covered(xfs_mount_t *mp) SPLDECL(s); int needed = 0; xlog_t *log = mp->m_log; - bhv_vfs_t *vfsp = XFS_MTOVFS(mp); - if (vfs_test_for_freeze(vfsp) || XFS_FORCED_SHUTDOWN(mp) || - (vfsp->vfs_flag & VFS_RDONLY)) + if (!xfs_fs_writable(mp)) return 0; s = LOG_LOCK(log); Index: 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_recover.c 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log_recover.c 2007-04-24 09:55:36.311882807 +1000 @@ -927,6 +927,14 @@ xlog_find_tail( ASSIGN_ANY_LSN_HOST(log->l_last_sync_lsn, log->l_curr_cycle, after_umount_blk); *tail_blk = after_umount_blk; + + /* + * Note that the unmount was clean. If the unmount + * was not clean, we need to know this to rebuild the + * superblock counters from the perag headers if we + * have a filesystem using non-persistent counters. + */ + log->l_mp->m_flags |= XFS_MOUNT_WAS_CLEAN; } } Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.c 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.c 2007-04-24 11:13:53.568072816 +1000 @@ -625,6 +625,64 @@ xfs_mount_common(xfs_mount_t *mp, xfs_sb sbp->sb_inopblock); mp->m_ialloc_blks = mp->m_ialloc_inos >> sbp->sb_inopblog; } + +/* + * xfs_initialize_perag_data + * + * Read in each per-ag structure so we can count up the number of + * allocated inodes, free inodes and used filesystem blocks as this + * information is no longer persistent in the superblock. Once we have + * this information, write it into the in-core superblock structure. 
+ */ +STATIC int +xfs_initialize_perag_data(xfs_mount_t *mp, xfs_agnumber_t agcount) +{ + xfs_agnumber_t index; + xfs_perag_t *pag; + xfs_sb_t *sbp = &mp->m_sb; + uint64_t ifree = 0; + uint64_t ialloc = 0; + uint64_t bfree = 0; + uint64_t bfreelst = 0; + uint64_t btree = 0; + int error; + int s; + + for (index = 0; index < agcount; index++) { + /* + * read the agf, then the agi. This gets us + * all the information we need and populates the + * per-ag structures for us. + */ + error = xfs_alloc_pagf_init(mp, NULL, index, 0); + if (error) + return error; + + error = xfs_ialloc_pagi_init(mp, NULL, index); + if (error) + return error; + pag = &mp->m_perag[index]; + ifree += pag->pagi_freecount; + ialloc += pag->pagi_count; + bfree += pag->pagf_freeblks; + bfreelst += pag->pagf_flcount; + btree += pag->pagf_btreeblks; + } + /* + * Overwrite incore superblock counters with just-read data + */ + s = XFS_SB_LOCK(mp); + sbp->sb_ifree = ifree; + sbp->sb_icount = ialloc; + sbp->sb_fdblocks = bfree + bfreelst + btree; + XFS_SB_UNLOCK(mp, s); + + /* Fixup the per-cpu counters as well. */ + xfs_icsb_reinit_counters(mp); + + return 0; +} + /* * xfs_mountfs * @@ -968,6 +1026,34 @@ xfs_mountfs( } /* + * Now the log is mounted, we know if it was an unclean shutdown or + * not. If it was, then with the first phase of recovery completed, we + * have consistent AG blocks on disk. We have not recovered EFIs yet, + * but they are recovered transactionally in the second recovery phase + * later. + * + * Hence we can safely re-initialise incore superblock counters from + * the per-ag data. These may not be correct if the filesystem was not + * cleanly unmounted, so we need to wait for recovery to finish before + * doing this. + * + * If the filesystem was cleanly unmounted, then we can trust the + * values in the superblock to be correct and we don't need to do + * anything here. 
+ * + * If we are currently making the filesystem, the initialisation will + * fail as the perag data is in an undefined state. + */ + + if (xfs_sb_version_haslazysbcount(&mp->m_sb) && + !XFS_LAST_UNMOUNT_WAS_CLEAN(mp) && + !mp->m_sb.sb_inprogress) { + error = xfs_initialize_perag_data(mp, sbp->sb_agcount); + if (error) { + goto error2; + } + } + /* * Get and sanity-check the root inode. * Save the pointer to it in the mount structure. */ @@ -1030,6 +1116,7 @@ xfs_mountfs( goto error4; } + /* * Complete the quota initialisation, post-log-replay component. */ @@ -1091,10 +1178,9 @@ xfs_unmountfs(xfs_mount_t *mp, struct cr xfs_binval(mp->m_rtdev_targp); } + xfs_log_sbcount(mp, 1); xfs_unmountfs_writesb(mp); - xfs_unmountfs_wait(mp); /* wait for async bufs */ - xfs_log_unmount(mp); /* Done! No more fs ops. */ xfs_freesb(mp); @@ -1141,6 +1227,62 @@ xfs_unmountfs_wait(xfs_mount_t *mp) } int +xfs_fs_writable(xfs_mount_t *mp) +{ + bhv_vfs_t *vfsp = XFS_MTOVFS(mp); + + return !(vfs_test_for_freeze(vfsp) || XFS_FORCED_SHUTDOWN(mp) || + (vfsp->vfs_flag & VFS_RDONLY)); +} + +/* + * xfs_log_sbcount + * + * Called either periodically to keep the on disk superblock values + * roughly up to date or from unmount to make sure the values are + * correct on a clean unmount. + * + * Note this code can be called during the process of freezing, so + * we may need to use the transaction allocator which does not + * block when the transaction subsystem is in its frozen state. + */ +int +xfs_log_sbcount( + xfs_mount_t *mp, + uint sync) +{ + xfs_trans_t *tp; + int error; + + if (!xfs_fs_writable(mp)) + return 0; + + xfs_icsb_sync_counters(mp); + + /* + * we don't need to do this if we are updating the superblock + * counters on every modification. 
+ */ + if (!xfs_sb_version_haslazysbcount(&mp->m_sb)) + return 0; + + tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT); + error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0, + XFS_DEFAULT_LOG_COUNT); + if (error) { + xfs_trans_cancel(tp, 0); + return error; + } + + xfs_mod_sb(tp, XFS_SB_IFREE | XFS_SB_ICOUNT | XFS_SB_FDBLOCKS); + if (sync) + xfs_trans_set_sync(tp); + xfs_trans_commit(tp, 0); + + return 0; +} + +int xfs_unmountfs_writesb(xfs_mount_t *mp) { xfs_buf_t *sbp; @@ -1151,16 +1293,15 @@ xfs_unmountfs_writesb(xfs_mount_t *mp) * skip superblock write if fs is read-only, or * if we are doing a forced umount. */ - sbp = xfs_getsb(mp, 0); if (!(XFS_MTOVFS(mp)->vfs_flag & VFS_RDONLY || XFS_FORCED_SHUTDOWN(mp))) { - xfs_icsb_sync_counters(mp); + sbp = xfs_getsb(mp, 0); + sb = XFS_BUF_TO_SBP(sbp); /* * mark shared-readonly if desired */ - sb = XFS_BUF_TO_SBP(sbp); if (mp->m_mk_sharedro) { if (!(sb->sb_flags & XFS_SBF_READONLY)) sb->sb_flags |= XFS_SBF_READONLY; @@ -1169,6 +1310,7 @@ xfs_unmountfs_writesb(xfs_mount_t *mp) xfs_fs_cmn_err(CE_NOTE, mp, "Unmounting, marking shared read-only"); } + XFS_BUF_UNDONE(sbp); XFS_BUF_UNREAD(sbp); XFS_BUF_UNDELAYWRITE(sbp); @@ -1183,8 +1325,8 @@ xfs_unmountfs_writesb(xfs_mount_t *mp) mp, sbp, XFS_BUF_ADDR(sbp)); if (error && mp->m_mk_sharedro) xfs_fs_cmn_err(CE_ALERT, mp, "Superblock write error detected while unmounting. Filesystem may not be marked shared readonly"); + xfs_buf_relse(sbp); } - xfs_buf_relse(sbp); return error; } Index: 2.6.x-xfs-new/fs/xfs/xfs_mount.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_mount.h 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_mount.h 2007-04-24 09:55:36.315882282 +1000 @@ -441,12 +441,12 @@ typedef struct xfs_mount { /* * Flags for m_flags. 
*/ -#define XFS_MOUNT_WSYNC (1ULL << 0) /* for nfs - all metadata ops +#define XFS_MOUNT_WSYNC (1ULL << 0) /* for nfs - all metadata ops must be synchronous except for space allocations */ -#define XFS_MOUNT_INO64 (1ULL << 1) +#define XFS_MOUNT_INO64 (1ULL << 1) /* (1ULL << 2) -- currently unused */ - /* (1ULL << 3) -- currently unused */ +#define XFS_MOUNT_WAS_CLEAN (1ULL << 3) #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem operations, typically for disk errors in metadata */ @@ -523,6 +523,8 @@ xfs_preferred_iosize(xfs_mount_t *mp) #define XFS_MAXIOFFSET(mp) ((mp)->m_maxioffset) +#define XFS_LAST_UNMOUNT_WAS_CLEAN(mp) \ + ((mp)->m_flags & XFS_MOUNT_WAS_CLEAN) #define XFS_FORCED_SHUTDOWN(mp) ((mp)->m_flags & XFS_MOUNT_FS_SHUTDOWN) #define xfs_force_shutdown(m,f) \ bhv_vfs_force_shutdown((XFS_MTOVFS(m)), f, __FILE__, __LINE__) @@ -614,6 +616,7 @@ typedef struct xfs_mod_sb { extern xfs_mount_t *xfs_mount_init(void); extern void xfs_mod_sb(xfs_trans_t *, __int64_t); +extern int xfs_log_sbcount(xfs_mount_t *, uint); extern void xfs_mount_free(xfs_mount_t *mp, int remove_bhv); extern int xfs_mountfs(struct bhv_vfs *, xfs_mount_t *mp, int); extern void xfs_mountfs_check_barriers(xfs_mount_t *mp); @@ -630,6 +633,7 @@ extern int xfs_mod_incore_sb_batch(xfs_m extern struct xfs_buf *xfs_getsb(xfs_mount_t *, int); extern int xfs_readsb(xfs_mount_t *, int); extern void xfs_freesb(xfs_mount_t *); +extern int xfs_fs_writable(xfs_mount_t *); extern void xfs_do_force_shutdown(bhv_desc_t *, int, char *, int); extern int xfs_syncsub(xfs_mount_t *, int, int *); extern int xfs_sync_inodes(xfs_mount_t *, int, int *); Index: 2.6.x-xfs-new/fs/xfs/xfs_sb.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_sb.h 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_sb.h 2007-04-24 09:55:36.315882282 +1000 @@ -74,12 +74,13 @@ struct xfs_mount; */ #define XFS_SB_VERSION2_REALFBITS 0x00ffffff /* Mask: 
features */ #define XFS_SB_VERSION2_RESERVED1BIT 0x00000001 -#define XFS_SB_VERSION2_RESERVED2BIT 0x00000002 +#define XFS_SB_VERSION2_LAZYSBCOUNTBIT 0x00000002 /* Superblk counters */ #define XFS_SB_VERSION2_RESERVED4BIT 0x00000004 #define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */ #define XFS_SB_VERSION2_OKREALFBITS \ - (XFS_SB_VERSION2_ATTR2BIT) + (XFS_SB_VERSION2_LAZYSBCOUNTBIT | \ + XFS_SB_VERSION2_ATTR2BIT) #define XFS_SB_VERSION2_OKSASHFBITS \ (0) #define XFS_SB_VERSION2_OKREALBITS \ @@ -181,6 +182,9 @@ typedef enum { #define XFS_SB_SHARED_VN XFS_SB_MVAL(SHARED_VN) #define XFS_SB_UNIT XFS_SB_MVAL(UNIT) #define XFS_SB_WIDTH XFS_SB_MVAL(WIDTH) +#define XFS_SB_ICOUNT XFS_SB_MVAL(ICOUNT) +#define XFS_SB_IFREE XFS_SB_MVAL(IFREE) +#define XFS_SB_FDBLOCKS XFS_SB_MVAL(FDBLOCKS) #define XFS_SB_FEATURES2 XFS_SB_MVAL(FEATURES2) #define XFS_SB_NUM_BITS ((int)XFS_SBS_FIELDCOUNT) #define XFS_SB_ALL_BITS ((1LL << XFS_SB_NUM_BITS) - 1) @@ -188,7 +192,7 @@ typedef enum { (XFS_SB_UUID | XFS_SB_ROOTINO | XFS_SB_RBMINO | XFS_SB_RSUMINO | \ XFS_SB_VERSIONNUM | XFS_SB_UQUOTINO | XFS_SB_GQUOTINO | \ XFS_SB_QFLAGS | XFS_SB_SHARED_VN | XFS_SB_UNIT | XFS_SB_WIDTH | \ - XFS_SB_FEATURES2) + XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2) /* @@ -414,6 +418,12 @@ static inline int xfs_sb_version_hasmore * ((sbp)->sb_features2 & XFS_SB_VERSION2_FUNBIT) */ +static inline int xfs_sb_version_haslazysbcount(xfs_sb_t *sbp) +{ + return (XFS_SB_VERSION_HASMOREBITS(sbp) && \ + ((sbp)->sb_features2 & XFS_SB_VERSION2_LAZYSBCOUNTBIT)); +} + #define XFS_SB_VERSION_HASATTR2(sbp) xfs_sb_version_hasattr2(sbp) static inline int xfs_sb_version_hasattr2(xfs_sb_t *sbp) { Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.c 2007-04-24 09:53:43.730658557 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_trans.c 2007-04-24 09:55:36.315882282 +1000 @@ -427,6 +427,14 @@ undo_blocks: * * Mark the 
transaction structure to indicate that the superblock * needs to be updated before committing. + * + * Because we may not be keeping track of allocated/free inodes and + * used filesystem blocks in the superblock, we do not mark the + * superblock dirty in this transaction if we modify these fields. + * We still need to update the transaction deltas so that they get + * applied to the incore superblock, but we don't want them to + * cause the superblock to get locked and logged if these are the + * only fields in the superblock that the transaction modifies. */ void xfs_trans_mod_sb( @@ -434,13 +442,19 @@ xfs_trans_mod_sb( uint field, int64_t delta) { + uint32_t flags = (XFS_TRANS_DIRTY|XFS_TRANS_SB_DIRTY); + xfs_mount_t *mp = tp->t_mountp; switch (field) { case XFS_TRANS_SB_ICOUNT: tp->t_icount_delta += delta; + if (xfs_sb_version_haslazysbcount(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_IFREE: tp->t_ifree_delta += delta; + if (xfs_sb_version_haslazysbcount(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_FDBLOCKS: /* @@ -453,6 +467,8 @@ xfs_trans_mod_sb( ASSERT(tp->t_blk_res_used <= tp->t_blk_res); } tp->t_fdblocks_delta += delta; + if (xfs_sb_version_haslazysbcount(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_RES_FDBLOCKS: /* @@ -462,6 +478,8 @@ xfs_trans_mod_sb( */ ASSERT(delta < 0); tp->t_res_fdblocks_delta += delta; + if (xfs_sb_version_haslazysbcount(&mp->m_sb)) + flags &= ~XFS_TRANS_SB_DIRTY; break; case XFS_TRANS_SB_FREXTENTS: /* @@ -544,18 +562,23 @@ xfs_trans_apply_sb_deltas( (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); - if (tp->t_icount_delta != 0) { - INT_MOD(sbp->sb_icount, ARCH_CONVERT, tp->t_icount_delta); - } - if (tp->t_ifree_delta != 0) { - INT_MOD(sbp->sb_ifree, ARCH_CONVERT, tp->t_ifree_delta); - } + /* + * Only update the superblock counters if we are logging them + */ + if (!xfs_sb_version_haslazysbcount(&(tp->t_mountp->m_sb))) { + if 
(tp->t_icount_delta != 0) { + INT_MOD(sbp->sb_icount, ARCH_CONVERT, tp->t_icount_delta); + } + if (tp->t_ifree_delta != 0) { + INT_MOD(sbp->sb_ifree, ARCH_CONVERT, tp->t_ifree_delta); + } - if (tp->t_fdblocks_delta != 0) { - INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_fdblocks_delta); - } - if (tp->t_res_fdblocks_delta != 0) { - INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_res_fdblocks_delta); + if (tp->t_fdblocks_delta != 0) { + INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_fdblocks_delta); + } + if (tp->t_res_fdblocks_delta != 0) { + INT_MOD(sbp->sb_fdblocks, ARCH_CONVERT, tp->t_res_fdblocks_delta); + } } if (tp->t_frextents_delta != 0) { @@ -627,6 +650,7 @@ xfs_trans_unreserve_and_mod_sb( { xfs_mod_sb_t msb[14]; /* If you add cases, add entries */ xfs_mod_sb_t *msbp; + xfs_mount_t *mp = tp->t_mountp; /* REFERENCED */ int error; int rsvd; @@ -659,8 +683,15 @@ xfs_trans_unreserve_and_mod_sb( * The t_res_fdblocks_delta and t_res_frextents_delta fields are * explicitly NOT applied to the in-core superblock. * The idea is that that has already been done. + * + * If we are not logging superblock counters, then the inode + * allocated/free and used block counts are not updated in the + * on disk superblock. In this case, XFS_TRANS_SB_DIRTY will + * not be set when the transaction is updated but we still need + * to update the incore superblock with the changes. 
*/ - if (tp->t_flags & XFS_TRANS_SB_DIRTY) { + if (xfs_sb_version_haslazysbcount(&mp->m_sb) || + (tp->t_flags & XFS_TRANS_SB_DIRTY)) { if (tp->t_icount_delta != 0) { msbp->msb_field = XFS_SBS_ICOUNT; msbp->msb_delta = tp->t_icount_delta; @@ -676,6 +707,9 @@ xfs_trans_unreserve_and_mod_sb( msbp->msb_delta = tp->t_fdblocks_delta; msbp++; } + } + + if (tp->t_flags & XFS_TRANS_SB_DIRTY) { if (tp->t_frextents_delta != 0) { msbp->msb_field = XFS_SBS_FREXTENTS; msbp->msb_delta = tp->t_frextents_delta; Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.h 2007-04-24 09:53:43.734658033 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_trans.h 2007-04-24 09:55:36.319881757 +1000 @@ -94,7 +94,8 @@ typedef struct xfs_trans_header { #define XFS_TRANS_GROWFSRT_ZERO 38 #define XFS_TRANS_GROWFSRT_FREE 39 #define XFS_TRANS_SWAPEXT 40 -#define XFS_TRANS_TYPE_MAX 40 +#define XFS_TRANS_SB_COUNT 41 +#define XFS_TRANS_TYPE_MAX 41 /* new transaction types need to be reflected in xfs_logprint(8) */ Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2007-04-24 09:53:43.734658033 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2007-04-24 11:14:01.962980516 +1000 @@ -684,6 +684,7 @@ xfs_mntupdate( } else if (!(vfsp->vfs_flag & VFS_RDONLY)) { /* rw -> ro */ bhv_vfs_sync(vfsp, SYNC_FSDATA|SYNC_BDFLUSH|SYNC_ATTR, NULL); xfs_quiesce_fs(mp); + xfs_log_sbcount(mp, 1); xfs_log_unmount_write(mp); xfs_unmountfs_writesb(mp); vfsp->vfs_flag |= VFS_RDONLY; @@ -1535,6 +1536,15 @@ xfs_syncsub( } /* + * If asked, update the disk superblock with incore counter values if we + * are using non-persistent counters so that they don't get too far out + * of sync if we crash or get a forced shutdown. We don't want to force + * this to disk, just get a transaction into the iclogs.... 
+ */ + if (flags & SYNC_SUPER) + xfs_log_sbcount(mp, 0); + + /* * Now check to see if the log needs a "dummy" transaction. */ @@ -2000,6 +2010,7 @@ xfs_freeze( ASSERT_ALWAYS(atomic_read(&mp->m_active_trans) == 0); /* Push the superblock and write an unmount record */ + xfs_log_sbcount(mp, 1); xfs_log_unmount_write(mp); xfs_unmountfs_writesb(mp); xfs_fs_log_dummy(mp); Index: 2.6.x-xfs-new/fs/xfs/xfsidbg.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfsidbg.c 2007-04-24 09:53:43.866640708 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfsidbg.c 2007-04-24 09:55:36.323881232 +1000 @@ -6042,6 +6042,7 @@ xfsidbg_print_trans_type(unsigned int t_ case XFS_TRANS_GROWFSRT_ZERO: kdb_printf("GROWFSRT_ZERO"); break; case XFS_TRANS_GROWFSRT_FREE: kdb_printf("GROWFSRT_FREE"); break; case XFS_TRANS_SWAPEXT: kdb_printf("SWAPEXT"); break; + case XFS_TRANS_SB_COUNT: kdb_printf("SB_COUNT"); break; case XFS_TRANS_DUMMY1: kdb_printf("DUMMY1"); break; case XFS_TRANS_DUMMY2: kdb_printf("DUMMY2"); break; case XLOG_UNMOUNT_REC_TYPE: kdb_printf("UNMOUNT"); break; Index: 2.6.x-xfs-new/fs/xfs/linux-2.4/xfs_super.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.4/xfs_super.c 2007-04-24 11:13:22.564106903 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.4/xfs_super.c 2007-04-24 11:14:01.894989362 +1000 @@ -504,7 +504,8 @@ vfs_sync_worker( { if (!(vfsp->vfs_flag & VFS_RDONLY)) bhv_vfs_sync(vfsp, SYNC_FSDATA | SYNC_BDFLUSH | \ - SYNC_ATTR | SYNC_REFCACHE, NULL); + SYNC_ATTR | SYNC_REFCACHE | SYNC_SUPER, + NULL); } STATIC int @@ -680,7 +681,7 @@ struct super_block *freeze_bdev(struct b wmb(); /* Flush any remaining inodes into buffers */ - bhv_vfs_sync(vfsp, SYNC_ATTR | SYNC_WAIT, NULL); + bhv_vfs_sync(vfsp, SYNC_SUPER | SYNC_ATTR | SYNC_WAIT, NULL); /* Push all buffers out to disk */ sync_buffers(sb->s_dev, 1); Index: 2.6.x-xfs-new/fs/xfs/linux-2.4/xfs_vfs.h 
=================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.4/xfs_vfs.h 2007-04-24 11:13:22.556107943 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.4/xfs_vfs.h 2007-04-24 11:14:01.918986240 +1000 @@ -93,6 +93,7 @@ typedef enum { #define SYNC_REFCACHE 0x0040 /* prune some of the nfs ref cache */ #define SYNC_REMOUNT 0x0080 /* remount readonly, no dummy LRs */ #define SYNC_IOWAIT 0x0100 /* wait for all I/O to complete */ +#define SYNC_SUPER 0x0200 /* flush superblock to disk */ #define SHUTDOWN_META_IO_ERROR 0x0001 /* write attempt to metadata failed */ #define SHUTDOWN_LOG_IO_ERROR 0x0002 /* write attempt to the log failed */ Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vfs.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_vfs.h 2007-04-24 11:13:22.540110025 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vfs.h 2007-04-24 11:14:01.946982597 +1000 @@ -92,6 +92,7 @@ typedef enum { #define SYNC_REFCACHE 0x0040 /* prune some of the nfs ref cache */ #define SYNC_REMOUNT 0x0080 /* remount readonly, no dummy LRs */ #define SYNC_IOWAIT 0x0100 /* wait for all I/O to complete */ +#define SYNC_SUPER 0x0200 /* flush superblock to disk */ #define SHUTDOWN_META_IO_ERROR 0x0001 /* write attempt to metadata failed */ #define SHUTDOWN_LOG_IO_ERROR 0x0002 /* write attempt to the log failed */ From owner-xfs@oss.sgi.com Mon Apr 23 18:43:01 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 18:43:03 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3O1gvfB030392 for ; Mon, 23 Apr 2007 18:42:59 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA04753; Tue, 24 Apr 2007 11:42:51 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by 
snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3O1goAf75054993; Tue, 24 Apr 2007 11:42:50 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3O1gnhI74029037; Tue, 24 Apr 2007 11:42:49 +1000 (AEST) Date: Tue, 24 Apr 2007 11:42:49 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: Review - flush blockdev on close Message-ID: <20070424014249.GF48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-archive-position: 11167 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs As suggested by Christoph, we probably should flush the block device as we complete the process of unmounting the filesystem. Patch attached. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_buf.c | 1 + 1 file changed, 1 insertion(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-04-24 09:32:22.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-04-24 11:12:42.181361346 +1000 @@ -1464,6 +1464,7 @@ xfs_free_buftarg( int external) { xfs_flush_buftarg(btp, 1); + xfs_blkdev_issue_flush(btp); if (external) xfs_blkdev_put(btp->bt_bdev); xfs_free_bufhash(btp); From owner-xfs@oss.sgi.com Mon Apr 23 18:58:29 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 18:58:31 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3O1wPfB001884 for ; Mon, 23 Apr 2007 18:58:27 -0700 Received: from boing.melbourne.sgi.com (boing.melbourne.sgi.com [134.14.55.141]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA05122; Tue, 24 
Apr 2007 11:58:10 +1000 Date: Tue, 24 Apr 2007 12:00:38 +1000 From: Timothy Shimmin To: David Chinner , Christoph Hellwig cc: xfs-dev , xfs-oss Subject: Re: review: don't hold ilock when calling vn_iowait Message-ID: <1A5D0CA3BA5C7CF8B7241F39@timothy-shimmins-power-mac-g5.local> In-Reply-To: <20070423231706.GO32602149@melbourne.sgi.com> References: <20070422230303.GX32602149@melbourne.sgi.com> <20070423214338.GA17561@infradead.org> <20070423231706.GO32602149@melbourne.sgi.com> X-Mailer: Mulberry/4.0.8 (Mac OS X) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-archive-position: 11168 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: tes@sgi.com Precedence: bulk X-list: xfs --On 24 April 2007 9:17:06 AM +1000 David Chinner wrote: > On Mon, Apr 23, 2007 at 10:43:38PM +0100, Christoph Hellwig wrote: >> On Mon, Apr 23, 2007 at 09:03:03AM +1000, David Chinner wrote: >> > >> > Regression introduced by recent freezing fixes - we should >> > not hold the ilock while waiting for I/O completion. >> >> Looks good, and actually simplifies the twisted maze that xfs_sync_inodes is >> a little bit. And the missing IPOINTER_INSERT in the SYNC_CLOSE case >> looks like an actual bugfix. > > I had to look closely at that IPOINTER_INSERT case with SYNC_CLOSE; > it was actually working properly because you'd always end up in > the SYNC_CLOSE case having inserted a pointer earlier on in the flow > of the function. It certainly wasn't obvious that it was doing the > right thing, though. > I find this existing code and the use of marker pointer macros a bit hard to follow. Can you explain where "earlier on in the flow of the function" we've inserted the marker pointer (and unlocked the inode-list lock)? Obviously whenever we release the inode-list lock, we have to insert the marker first (which is what IPOINTER_INSERT does).
But in what cases do we need to release the inode-list lock (m_ilock). Having a stab in looking at the code: * before xfs_finish_reclaim * before VN_RELE which can call xfs_inactive * in the vnode reference case, prior to locking inode ???? * prior to unlocking the inode and calling one of the flush or toss pages routines. * prior to unlocking the inode and reading in its buffer * Prior to flushing the inode (xfs_iflush) * Every so often if we loop a lot in the code (preempt variable and mask) We don't seem to remove the marker afterwards always although we do so on each iteration if we make it to the end of the loop. It would be nice if this could be clearer somehow. --Tim From owner-xfs@oss.sgi.com Mon Apr 23 20:08:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 20:08:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3O38XfB029221 for ; Mon, 23 Apr 2007 20:08:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA06634; Tue, 24 Apr 2007 13:08:28 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3O38RAf75054893; Tue, 24 Apr 2007 13:08:27 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3O38QeC74882064; Tue, 24 Apr 2007 13:08:26 +1000 (AEST) Date: Tue, 24 Apr 2007 13:08:26 +1000 From: David Chinner To: Timothy Shimmin Cc: David Chinner , Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review: don't hold ilock when calling vn_iowait Message-ID: <20070424030826.GG48531920@melbourne.sgi.com> References: <20070422230303.GX32602149@melbourne.sgi.com> <20070423214338.GA17561@infradead.org> <20070423231706.GO32602149@melbourne.sgi.com> <1A5D0CA3BA5C7CF8B7241F39@timothy-shimmins-power-mac-g5.local> 
Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1A5D0CA3BA5C7CF8B7241F39@timothy-shimmins-power-mac-g5.local> User-Agent: Mutt/1.4.2.1i X-archive-position: 11169 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 12:00:38PM +1000, Timothy Shimmin wrote: > > > --On 24 April 2007 9:17:06 AM +1000 David Chinner wrote: > > >On Mon, Apr 23, 2007 at 10:43:38PM +0100, Christoph Hellwig wrote: > >>On Mon, Apr 23, 2007 at 09:03:03AM +1000, David Chinner wrote: > >>> > >>> Regression introduced by recent freezing fixes - we should > >>> not hold the ilock while waiting for I/O completion. > >> > >>Looks good, and actually simplifies the twisted maze that xfs_sync_inodes is > >>a little bit. And the missing IPOINTER_INSERT in the SYNC_CLOSE case > >>looks like an actual bugfix. > > > >I had to look closely at that IPOINTER_INSERT case with SYNC_CLOSE; > >it was actually working properly because you'd always end up in > >the SYNC_CLOSE case having inserted a pointer earlier on in the flow > >of the function. It certainly wasn't obvious that it was doing the > >right thing, though. > > > I find this existing code and the use of marker pointer macros a bit hard > to follow. Doesn't everyone? > Can you explain where "earlier on in the flow of the function" we've inserted > the marker pointer (and unlocked the inode-list lock). I confused that with the vp == NULL checks I removed. Too many things, too little time. So yes, this probably does fix a bug. > It would be nice if this could be clearer somehow. Yes, we should be looking to rip all this cruft out because most of it is redundant - the generic inode writeback does most of this for us anyway. Cheers, Dave.
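The marker scheme discussed in this thread can be illustrated in isolation. What follows is a minimal userspace sketch, not the XFS code: the list, the fake lock, and every name in it (struct node, walk_with_marker, slow_work) are assumptions made up for illustration. It shows the rule Tim states, that a marker goes into the list before the list lock is dropped, and it identifies markers by a NULL owner in the same spirit as the ip->i_mount == NULL test used to skip markers inserted by sync.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; a NULL owner identifies a marker, not real data. */
struct node {
	struct node *next;
	struct node *prev;
	const void  *owner;    /* NULL for markers and for the list head */
	int          visited;  /* how many times slow_work() saw this node */
};

struct list {
	struct node head;      /* sentinel of a circular doubly linked list */
	int         locked;    /* single-threaded stand-in for the list lock */
};

static void list_init(struct list *l)
{
	l->head.next = l->head.prev = &l->head;
	l->head.owner = NULL;
	l->head.visited = 0;
	l->locked = 0;
}

static void list_lock(struct list *l)   { assert(!l->locked); l->locked = 1; }
static void list_unlock(struct list *l) { assert(l->locked);  l->locked = 0; }

static void insert_after(struct node *pos, struct node *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

static void remove_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

static void list_add_tail(struct list *l, struct node *n, const void *owner)
{
	n->owner = owner;
	n->visited = 0;
	insert_after(l->head.prev, n);
}

/* Stands in for work that must not run under the list lock. */
static void slow_work(struct list *l, struct node *n)
{
	assert(!l->locked);
	n->visited++;
}

/* Visit every real node once, dropping the lock around slow_work(). */
static int walk_with_marker(struct list *l)
{
	struct node marker = { NULL, NULL, NULL, 0 };
	struct node *n;
	int count = 0;

	list_lock(l);
	n = l->head.next;
	while (n != &l->head) {
		if (n->owner == NULL) {         /* skip other walkers' markers */
			n = n->next;
			continue;
		}
		insert_after(n, &marker);       /* remember our position first */
		list_unlock(l);
		slow_work(l, n);
		count++;
		list_lock(l);
		n = marker.next;                /* resume after the marker */
		remove_node(&marker);
	}
	list_unlock(l);
	return count;
}
```

Because the walk resumes from the marker rather than from the node it was working on, that node may be unlinked by someone else while the lock is dropped without derailing the traversal. The sketch removes the marker on every iteration, which is the part Tim notes the existing code does not always do.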
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 23 23:18:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 23:18:09 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3O6I4fB025882 for ; Mon, 23 Apr 2007 23:18:06 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3O6I0nL029873 for ; Tue, 24 Apr 2007 15:18:02 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.162]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3O6AGlG002306 for ; Tue, 24 Apr 2007 15:10:16 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3O6AGM11436 for xfs@oss.sgi.com; Tue, 24 Apr 2007 15:10:16 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id l3O6AFO04731 for ; Tue, 24 Apr 2007 15:10:15 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070424.151015.85902548 for ; Tue, 24 Apr 2007 15:10:15 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Tue Apr 24 15:10:15 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 2B980AE4B3; Tue, 24 Apr 2007 15:09:58 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3O6AF4S006656; Tue, 24 Apr 2007 15:10:15 +0900 Message-Id: <200704240610.AA05249@TNESG9305.tnes.nec.co.jp> From: Utako Kusaka Date: Tue, 24 Apr 2007 15:10:21 +0900 To: Christoph Hellwig Cc: xfs@oss.sgi.com Subject: Re: [PATCH] Fix "quota -n" command in xfs_quota. 
In-Reply-To: <20070423212606.GE13572@infradead.org> References: <20070423212606.GE13572@infradead.org> MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=iso-2022-jp X-archive-position: 11170 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, Thanks for your comment. I have updated my patch as below. Signed-off-by: Utako Kusaka --- --- quota.orig 2007-04-18 10:36:38.000000000 +0900 +++ quota.c 2007-04-24 11:42:18.000000000 +0900 @@ -202,10 +202,12 @@ getusername( int numeric) { static char buffer[32]; - struct passwd *u; - if (!numeric && (u = getpwuid(uid))) - return u->pw_name; + if (!numeric) { + struct passwd *u = getpwuid(uid); + if (u) + return u->pw_name; + } snprintf(buffer, sizeof(buffer), "#%u", uid); return &buffer[0]; } @@ -247,10 +249,12 @@ getgroupname( int numeric) { static char buffer[32]; - struct group *g; - if (!numeric && (g = getgrgid(gid))) - return g->gr_name; + if (!numeric) { + struct group *g = getgrgid(gid); + if (g) + return g->gr_name; + } snprintf(buffer, sizeof(buffer), "#%u", gid); return &buffer[0]; } @@ -310,10 +314,12 @@ getprojectname( int numeric) { static char buffer[32]; - fs_project_t *p; - if ((p = getprprid(prid))) - return p->pr_name; + if (!numeric) { + fs_project_t *p = getprprid(prid); + if (p) + return p->pr_name; + } snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid); return &buffer[0]; } Mon, 23 Apr 2007 22:26:06 +0100 Christoph Hellwig wrote: > >Looks good to me, but even the original code could be a little bit cleaner: > >> --- xfsprogs-2.8.20/quota/quota.orig 2007-04-18 10:36:38.000000000 +0900 >> +++ xfsprogs-2.8.20/quota/quota.c 2007-04-18 11:09:10.000000000 +0900 >> @@ -312,7 +312,7 @@ getprojectname( >> static char buffer[32]; >> fs_project_t *p; >> >> - if ((p = getprprid(prid))) >> + if (!numeric && (p = getprprid(prid))) >> return
p->pr_name; >> snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid); >> return &buffer[0]; > > if (!numeric) { > fs_project_t *p = getprprid(prid); > if (p) > return p->pr_name; > } > > snprintf(buffer, sizeof(buffer), "#%u", (unsigned int)prid); > return &buffer[0]; > > From owner-xfs@oss.sgi.com Mon Apr 23 23:46:37 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 23 Apr 2007 23:46:40 -0700 (PDT) Received: from mail.g-house.de (ns2.g-housing.de [81.169.133.75]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3O6kZfB032255 for ; Mon, 23 Apr 2007 23:46:37 -0700 Received: from [85.211.139.192] (helo=sheep.housecafe.de) by mail.g-house.de with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1HgEnF-0000f2-7s; Tue, 24 Apr 2007 08:46:29 +0200 Received: from localhost ([127.0.0.1] helo=derchris.gotdns.org) by sheep.housecafe.de with esmtp (Exim 4.63) (envelope-from ) id 1HgEnA-00071L-Qq; Tue, 24 Apr 2007 07:46:24 +0100 Received: from 194.246.123.250 (SquirrelMail authenticated user evil) by derchris.gotdns.org:8080 with HTTP; Tue, 24 Apr 2007 07:46:24 +0100 (BST) Message-ID: <21485.194.246.123.250.1177397184.squirrel@derchris.gotdns.org:8080> In-Reply-To: <20070423211952.GA13572@infradead.org> References: <20603.194.246.123.250.1177344480.squirrel@derchris.gotdns.org:8080> <462D0D23.7010803@sandeen.net> <20070423211952.GA13572@infradead.org> Date: Tue, 24 Apr 2007 07:46:24 +0100 (BST) Subject: Re: possible recursive locking detected From: "Christian Kujau" To: "Christoph Hellwig" Cc: "Eric Sandeen" , xfs@oss.sgi.com User-Agent: SquirrelMail/1.5.2 [SVN] MIME-Version: 1.0 Content-Type: text/plain;charset=iso-8859-15 Content-Transfer-Encoding: 8bit X-archive-position: 11171 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lists@nerdbynature.de Precedence: bulk X-list: xfs On Mon, April 23, 2007 22:19, Christoph Hellwig wrote: > It's not really cosmetic. 
It means i_lock and i_iolock are being > acquired without an order that is detectable by lockdep. At the very least > it means annotations for lockdep are missing, because acquiring two > per-inode locks at the same time is a basic fact in unix filesystems. Thank you both for your comments, now I can sleep better again ;) Christian. -- BOFH excuse #442: Trojan horse ran out of hay From owner-xfs@oss.sgi.com Tue Apr 24 02:04:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 02:04:33 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3O94TfB004786 for ; Tue, 24 Apr 2007 02:04:30 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HgGkV-0007o4-4t; Tue, 24 Apr 2007 09:51:47 +0100 Date: Tue, 24 Apr 2007 09:51:47 +0100 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070424085147.GA28820@infradead.org> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> <20070424012808.GD48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070424012808.GD48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11172 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 11:28:08AM +1000, David Chinner wrote: > > This is really quite nasty. Should we at least force a cache flush here?
> > Ok, so the patch I sent out was an older version that had a very similar > name to the current patch in my series (xfs-lazy-sb vs xfs_lazy_sb). > This code doesn't exist in the version I should have sent out. > > The latest version, plus the changes suggested here and with the > second patch folded back into it is attached. Looks like in the new code we simply ignore log reservation failures in xfs_log_sbcount? Otherwise this looks good to me. converting all sb feature checks to use the inlines would be a nice cleanup opportunity for someone bored :) From owner-xfs@oss.sgi.com Tue Apr 24 02:04:33 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 02:04:37 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3O94VfB004804 for ; Tue, 24 Apr 2007 02:04:33 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HgGpL-0007ya-8H; Tue, 24 Apr 2007 09:56:47 +0100 Date: Tue, 24 Apr 2007 09:56:47 +0100 From: Christoph Hellwig To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: Review - flush blockdev on close Message-ID: <20070424085647.GB28820@infradead.org> References: <20070424014249.GF48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070424014249.GF48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11173 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 11:42:49AM +1000, David Chinner wrote: > > As suggested by Christoph, we probably should flush the block > device as we complete the process of unmounting the filesystem. > Patch attached. Looks good. 
> > Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c > =================================================================== > --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-04-24 09:32:22.000000000 +1000 > +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-04-24 11:12:42.181361346 +1000 > @@ -1464,6 +1464,7 @@ xfs_free_buftarg( > int external) > { > xfs_flush_buftarg(btp, 1); > + xfs_blkdev_issue_flush(btp); > if (external) > xfs_blkdev_put(btp->bt_bdev); > xfs_free_bufhash(btp); From owner-xfs@oss.sgi.com Tue Apr 24 02:10:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 02:10:08 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3O9A4fB007257 for ; Tue, 24 Apr 2007 02:10:05 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HgH28-0008VK-9J; Tue, 24 Apr 2007 10:10:00 +0100 Date: Tue, 24 Apr 2007 10:10:00 +0100 From: Christoph Hellwig To: David Chinner Cc: Timothy Shimmin , Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review: don't hold ilock when calling vn_iowait Message-ID: <20070424091000.GA31652@infradead.org> References: <20070422230303.GX32602149@melbourne.sgi.com> <20070423214338.GA17561@infradead.org> <20070423231706.GO32602149@melbourne.sgi.com> <1A5D0CA3BA5C7CF8B7241F39@timothy-shimmins-power-mac-g5.local> <20070424030826.GG48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070424030826.GG48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11174 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 01:08:26PM +1000, David 
Chinner wrote: > > It would be nice if this could be clearer somehow. > > Yes, we should be looking to rip all this cruft out because most of > it is redundant - the generic inode writeback does most of this > for us anyway. In theory it does the same thing. The problem is that it's really hard to verify. Btw, before starting with this bit there's another item on my TODO list to simplify xfs_sync_inodes, and that's getting rid of the vp == NULL case totally. By definition all vp == NULL inodes are on mp->m_del_inodes. So instead of letting xfs_sync_inodes deal with them we should always call into xfs_finish_reclaim_all after cleaning the latter up a little and verifying we get the same behaviour. From owner-xfs@oss.sgi.com Tue Apr 24 03:12:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 03:13:00 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OACvfB026156 for ; Tue, 24 Apr 2007 03:12:58 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3OACqLD029679 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Tue, 24 Apr 2007 12:12:52 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3OACqor029677 for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:12:52 +0200 Date: Tue, 24 Apr 2007 12:12:52 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [RFC PATCH 3/3] replace xfs_iflush_all with xfs_reclaim_all Message-ID: <20070424101252.GA28314@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11175 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs xfs_iflush_all is another variant of the loop all deleted inodes without vnode and
flush them scheme. Use xfs_finish_reclaim_all without the nonblock flag and with XFS_IFLUSH_ASYNC as sync_flag instead. Btw, shouldn't xfs_quiesce_fs also pass XFS_IFLUSH_ASYNC to xfs_finish_reclaim_all? Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_inode.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_inode.c 2007-04-24 10:58:55.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_inode.c 2007-04-24 11:00:42.000000000 +0200 @@ -3518,45 +3518,6 @@ corrupt_out: return XFS_ERROR(EFSCORRUPTED); } - -/* - * Flush all inactive inodes in mp. - */ -void -xfs_iflush_all( - xfs_mount_t *mp) -{ - xfs_inode_t *ip; - bhv_vnode_t *vp; - - again: - XFS_MOUNT_ILOCK(mp); - ip = mp->m_inodes; - if (ip == NULL) - goto out; - - do { - /* Make sure we skip markers inserted by sync */ - if (ip->i_mount == NULL) { - ip = ip->i_mnext; - continue; - } - - vp = XFS_ITOV_NULL(ip); - if (!vp) { - XFS_MOUNT_IUNLOCK(mp); - xfs_finish_reclaim(ip, 0, XFS_IFLUSH_ASYNC); - goto again; - } - - ASSERT(vn_count(vp) == 0); - - ip = ip->i_mnext; - } while (ip != mp->m_inodes); - out: - XFS_MOUNT_IUNLOCK(mp); -} - /* * xfs_iaccess: check accessibility of inode for mode. 
*/ Index: linux-2.6/fs/xfs/xfs_inode.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_inode.h 2007-04-24 10:58:55.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_inode.h 2007-04-24 11:00:22.000000000 +0200 @@ -499,7 +499,6 @@ void xfs_ipin(xfs_inode_t *); void xfs_iunpin(xfs_inode_t *); int xfs_iextents_copy(xfs_inode_t *, xfs_bmbt_rec_t *, int); int xfs_iflush(xfs_inode_t *, uint); -void xfs_iflush_all(struct xfs_mount *); int xfs_iaccess(xfs_inode_t *, mode_t, cred_t *); uint xfs_iroundup(uint); void xfs_ichgtime(xfs_inode_t *, int); Index: linux-2.6/fs/xfs/xfs_mount.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_mount.c 2007-04-24 10:58:53.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_mount.c 2007-04-24 11:00:22.000000000 +0200 @@ -1084,7 +1084,7 @@ xfs_unmountfs(xfs_mount_t *mp, struct cr int64_t fsid; #endif - xfs_iflush_all(mp); + xfs_finish_reclaim_all(mp, 0, XFS_IFLUSH_ASYNC); XFS_QM_DQPURGEALL(mp, XFS_QMOPT_QUOTALL | XFS_QMOPT_UMOUNTING); From owner-xfs@oss.sgi.com Tue Apr 24 03:13:38 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 03:13:41 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OADZfB026317 for ; Tue, 24 Apr 2007 03:13:37 -0700 Received: from verein.lst.de (localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3OADVLD029752 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Tue, 24 Apr 2007 12:13:31 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3OADVE8029750 for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:13:31 +0200 Date: Tue, 24 Apr 2007 12:13:31 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [RFC PATCH 0/3] deleted inode reclaim cleanup Message-ID: <20070424101331.GA29731@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline
User-Agent: Mutt/1.3.28i
X-Scanned-By: MIMEDefang 2.39
X-archive-position: 11176
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@lst.de
Precedence: bulk
X-list: xfs

I've ported forward my old patches related to reclaiming deleted inodes
on m_del_inodes. This hasn't gotten testing with the current kernel,
although it completed xfsqa in the version against a kernel from about
a year ago.

From owner-xfs@oss.sgi.com Tue Apr 24 03:13:40 2007
Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 03:13:44 -0700 (PDT)
Received: from mail.lst.de (verein.lst.de [213.95.11.210])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OADcfB026337
	for ; Tue, 24 Apr 2007 03:13:39 -0700
Received: from verein.lst.de (localhost [127.0.0.1])
	by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3OADaLD029777
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO)
	for ; Tue, 24 Apr 2007 12:13:37 +0200
Received: (from hch@localhost)
	by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3OADauW029775
	for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:13:36 +0200
Date: Tue, 24 Apr 2007 12:13:36 +0200
From: Christoph Hellwig
To: xfs@oss.sgi.com
Subject: [RFC PATCH 1/3]: cleanup xfs_finish_reclaim_all
Message-ID: <20070424101336.GB29731@lst.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.3.28i
X-Scanned-By: MIMEDefang 2.39
X-archive-position: 11177
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: hch@lst.de
Precedence: bulk
X-list: xfs

xfs_finish_reclaim_all currently has a rather odd style, using a while
loop and two local variables to implement a restart of the list walk.
Convert this to use a plain goto restart style as used all over the
kernel.
Also add a sync_mode argument instead of always using XFS_IFLUSH_DELWRI_ELSE_ASYNC - a new caller added in patch three will pass in a different argument. Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_inode.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_inode.h 2007-04-24 10:44:38.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_inode.h 2007-04-24 10:44:42.000000000 +0200 @@ -461,7 +461,7 @@ void xfs_iunlock_map_shared(xfs_inode_t void xfs_ifunlock(xfs_inode_t *); void xfs_ireclaim(xfs_inode_t *); int xfs_finish_reclaim(xfs_inode_t *, int, int); -int xfs_finish_reclaim_all(struct xfs_mount *, int); +int xfs_finish_reclaim_all(struct xfs_mount *, int, int); /* * xfs_inode.c prototypes. Index: linux-2.6/fs/xfs/xfs_vfsops.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_vfsops.c 2007-04-24 10:44:38.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_vfsops.c 2007-04-24 10:45:50.000000000 +0200 @@ -628,7 +628,7 @@ xfs_quiesce_fs( xfs_refcache_purge_mp(mp); xfs_flush_buftarg(mp->m_ddev_targp, 0); - xfs_finish_reclaim_all(mp, 0); + xfs_finish_reclaim_all(mp, 0, XFS_IFLUSH_DELWRI_ELSE_ASYNC); /* This loop must run at least twice. 
* The first instance of the loop will flush @@ -1435,7 +1435,8 @@ xfs_syncsub( if (flags & (SYNC_ATTR|SYNC_DELWRI)) { if (flags & SYNC_BDFLUSH) - xfs_finish_reclaim_all(mp, 1); + xfs_finish_reclaim_all(mp, 1, + XFS_IFLUSH_DELWRI_ELSE_ASYNC); else error = xfs_sync_inodes(mp, flags, bypassed); } Index: linux-2.6/fs/xfs/xfs_vnodeops.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c 2007-04-24 10:44:38.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_vnodeops.c 2007-04-24 10:47:37.000000000 +0200 @@ -3908,36 +3908,29 @@ xfs_finish_reclaim( } int -xfs_finish_reclaim_all(xfs_mount_t *mp, int noblock) +xfs_finish_reclaim_all( + xfs_mount_t *mp, + int noblock, + int sync_mode) { - int purged; xfs_inode_t *ip, *n; - int done = 0; - while (!done) { - purged = 0; - XFS_MOUNT_ILOCK(mp); - list_for_each_entry_safe(ip, n, &mp->m_del_inodes, i_reclaim) { - if (noblock) { - if (xfs_ilock_nowait(ip, XFS_ILOCK_EXCL) == 0) - continue; - if (xfs_ipincount(ip) || - !xfs_iflock_nowait(ip)) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - continue; - } + restart: + XFS_MOUNT_ILOCK(mp); + list_for_each_entry_safe(ip, n, &mp->m_del_inodes, i_reclaim) { + if (noblock) { + if (xfs_ilock_nowait(ip, XFS_ILOCK_EXCL) == 0) + continue; + if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip)) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + continue; } - XFS_MOUNT_IUNLOCK(mp); - if (xfs_finish_reclaim(ip, noblock, - XFS_IFLUSH_DELWRI_ELSE_ASYNC)) - delay(1); - purged = 1; - break; } - - done = !purged; + XFS_MOUNT_IUNLOCK(mp); + if (xfs_finish_reclaim(ip, noblock, sync_mode)) + delay(1); + goto restart; } - XFS_MOUNT_IUNLOCK(mp); return 0; } From owner-xfs@oss.sgi.com Tue Apr 24 03:13:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 03:13:50 -0700 (PDT) Received: from mail.lst.de (verein.lst.de [213.95.11.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OADkfB026425 for ; Tue, 24 Apr 2007 03:13:48 -0700 Received: from verein.lst.de 
(localhost [127.0.0.1]) by mail.lst.de (8.12.3/8.12.3/Debian-7.1) with ESMTP id l3OADgLD029799 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Tue, 24 Apr 2007 12:13:42 +0200 Received: (from hch@localhost) by verein.lst.de (8.12.3/8.12.3/Debian-6.6) id l3OADgmL029797 for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:13:42 +0200 Date: Tue, 24 Apr 2007 12:13:42 +0200 From: Christoph Hellwig To: xfs@oss.sgi.com Subject: [RFC PATCH 2/3] use xfs_reclaim_all in xfs_syncsub Message-ID: <20070424101342.GC29731@lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Scanned-By: MIMEDefang 2.39 X-archive-position: 11178 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@lst.de Precedence: bulk X-list: xfs xfs_sync_inodes currently opencodes xfs_finish_reclaim_all, except using an m_inodes walk and checking for a NULL vnode instead of walking m_del_inodes. Change the code in xfs_syncsub to always call xfs_finish_reclaim_all, and remove the code dealing with deleted inodes without vnode from xfs_sync_inodes. Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/xfs/xfs_vfsops.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_vfsops.c 2007-04-24 10:58:55.000000000 +0200 +++ linux-2.6/fs/xfs/xfs_vfsops.c 2007-04-24 11:00:14.000000000 +0200 @@ -1018,36 +1018,15 @@ xfs_sync_inodes( continue; } - vp = XFS_ITOV_NULL(ip); - /* - * If the vnode is gone then this is being torn down, - * call reclaim if it is flushed, else let regular flush - * code deal with it later in the loop. + * Inodes on the delayed reclaim list are beeing dealt + * with in xfs_reclaim_all. One may have sneaked in + * here after we dropped the mount ilock, but we'll leave + * them to the next xfs_reclaim_all call. 
*/ - + vp = XFS_ITOV_NULL(ip); if (vp == NULL) { - /* Skip ones already in reclaim */ - if (ip->i_flags & XFS_IRECLAIM) { - ip = ip->i_mnext; - continue; - } - if (xfs_ilock_nowait(ip, XFS_ILOCK_EXCL) == 0) { - ip = ip->i_mnext; - } else if ((xfs_ipincount(ip) == 0) && - xfs_iflock_nowait(ip)) { - IPOINTER_INSERT(ip, mp); - - xfs_finish_reclaim(ip, 1, - XFS_IFLUSH_DELWRI_ELSE_ASYNC); - - XFS_MOUNT_ILOCK(mp); - mount_locked = B_TRUE; - IPOINTER_REMOVE(ip, mp); - } else { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - ip = ip->i_mnext; - } + ip = ip->i_mnext; continue; } @@ -1434,10 +1413,8 @@ xfs_syncsub( xfs_log_force(mp, (xfs_lsn_t)0, log_flags); if (flags & (SYNC_ATTR|SYNC_DELWRI)) { - if (flags & SYNC_BDFLUSH) - xfs_finish_reclaim_all(mp, 1, - XFS_IFLUSH_DELWRI_ELSE_ASYNC); - else + xfs_finish_reclaim_all(mp, 1, XFS_IFLUSH_DELWRI_ELSE_ASYNC); + if (!(flags & SYNC_BDFLUSH)) error = xfs_sync_inodes(mp, flags, bypassed); } From owner-xfs@oss.sgi.com Tue Apr 24 05:16:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 05:16:43 -0700 (PDT) Received: from e5.ny.us.ibm.com (e5.ny.us.ibm.com [32.97.182.145]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OCGZfB029237 for ; Tue, 24 Apr 2007 05:16:36 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e5.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3OCGTUN015289 for ; Tue, 24 Apr 2007 08:16:29 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3OCGTkK373858 for ; Tue, 24 Apr 2007 08:16:29 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3OCGSju010126 for ; Tue, 24 Apr 2007 08:16:28 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3OCGRRr009970; Tue, 24 Apr 2007 08:16:28 -0400 Received: from 
amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 086AC29EC6D; Tue, 24 Apr 2007 17:46:35 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3OCGXfl011310; Tue, 24 Apr 2007 17:46:33 +0530 Date: Tue, 24 Apr 2007 17:46:33 +0530 From: "Amit K. Arora" To: Jakub Jelinek Cc: Andreas Dilger , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, cmm@us.ibm.com, suparna@in.ibm.com, Andrew Morton , torvalds@linux-foundation.org Subject: Re: Interface for the new fallocate() system call Message-ID: <20070424121632.GA10136@amitarora.in.ibm.com> References: <20070316161704.GE8525@osiris.boeblingen.de.ibm.com> <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070420145918.GY355@devserv.devel.redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 11179 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote: > On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote: > > Ok. > > In this case we may have to consider following things: > > > > 1) Obviously, for this glibc will have to call fallocate() syscall with > > different arguments on s390, than other archs. I think this should be > > doable and should not be an issue with glibc folks (right?). 
> > glibc can cope with this easily, will just add > sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override > the generic Linux implementation. > > > 2) we also need to see how strace behaves in this case. With little > > knowledge that I have of strace, I don't think it should depend on > > argument ordering of a system call on different archs (since it uses > > ptrace internally and that should take care of it). But, it will be > > nice if someone can confirm this. > > strace would solve this with #ifdef mess, it already does that in many > places so guess another few lines don't make it significantly worse. I will work on the revised fallocate patchset and will post it soon. Thanks! -- Regards, Amit Arora From owner-xfs@oss.sgi.com Tue Apr 24 07:16:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 07:16:39 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3OEGXfB029223 for ; Tue, 24 Apr 2007 07:16:35 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id AAA22314; Wed, 25 Apr 2007 00:16:26 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3OEGPAf74928958; Wed, 25 Apr 2007 00:16:25 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3OEGO2575446170; Wed, 25 Apr 2007 00:16:24 +1000 (AEST) Date: Wed, 25 Apr 2007 00:16:24 +1000 From: David Chinner To: Christoph Hellwig Cc: David Chinner , xfs-dev , xfs-oss Subject: Re: review [1 of 3]: lazy superblock counters - core kernel Message-ID: <20070424141624.GU32602149@melbourne.sgi.com> References: <20070419231459.GX48531920@melbourne.sgi.com> <20070423220010.GA18325@infradead.org> <20070424012808.GD48531920@melbourne.sgi.com> 
<20070424085147.GA28820@infradead.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20070424085147.GA28820@infradead.org>
User-Agent: Mutt/1.4.2.1i
X-archive-position: 11180
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Tue, Apr 24, 2007 at 09:51:47AM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 11:28:08AM +1000, David Chinner wrote:
> > > This is really quite nasty. Should we at least force a cache flush here?
> >
> > Ok, so the patch I sent out was an older version that had a very similar
> > name to the current patch in my series (xfs-lazy-sb vs xfs_lazy_sb).
> > This code doesn't exist in the version I should have sent out.
> >
> > The latest version, plus the changes suggested here and with the
> > second patch folded back into it is attached.
>
> Looks like in the new code we simply ignore log reservation
> failures in xfs_log_sbcount?

AFAICT, the only way we can get that error is a filesystem shutdown,
which means we've got an unclean shutdown and so there's not much
point in syncing the superblock counters because we'll have to
recover them anyway....

> Otherwise this looks good to me.

Thanks for the reviews, Christoph.

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 24 07:55:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 07:55:31 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OEtQfB008321 for ; Tue, 24 Apr 2007 07:55:27 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1HgMQP-0005Ev-DM; Tue, 24 Apr 2007 15:55:25 +0100 Date: Tue, 24 Apr 2007 15:55:25 +0100 From: Christoph Hellwig To: David Chinner Cc: Christoph Hellwig , gnb@sgi.com, xfs-oss , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: review: don't block non-blocking writes when frozen Message-ID: <20070424145525.GA18918@infradead.org> Mail-Followup-To: Christoph Hellwig , David Chinner , gnb@sgi.com, xfs-oss , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org References: <20070423002616.GY32602149@melbourne.sgi.com> <20070423212715.GF13572@infradead.org> <20070423231337.GN32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070423231337.GN32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11181 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 09:13:37AM +1000, David Chinner wrote: > So, given the catch-22 you've just presented us can we revisit the > nfsd non-blocking I/O issue again? This affects anyone using DM > snapshots on their NFS servers and has nothing to do with HSMs > or DMAPI... 
>
> FWIW, you can still do non-blocking userspace I/O to a file, so this
> XFS patch is still valid for mainline (that's how I tested it).

I've been investigating the situation a little bit more, and here's
what's going on:

 - SuS allows for O_NONBLOCK on regular files as per
   http://www.opengroup.org/onlinepubs/007908799/xsh/write.html
 - actually implementing O_NONBLOCK semantics for regular files breaks
   userspace when poll/select claims files are ready to read/write
   but they aren't, see http://lkml.org/lkml/2004/10/17/17

So we can't really expose O_NONBLOCK on regular files to userspace,
and we need to make sure in common code this does not happen.

EJUKEBOX on snapshots does make sense, though. Can you please send a
full patch series for nfsd, the common code and the xfs writepath so
that this actually gets used and behaviour is consistent for all (or
at least most) filesystems?

Also, now that the patch goes to mainline, please kill the ugly
FILP_DELAY_FLAG and just check the flags directly. And it should
probably only check O_NONBLOCK. The only architecture having O_NDELAY
different from O_NONBLOCK is sparc, and it already translates the
value for us.
From owner-xfs@oss.sgi.com Tue Apr 24 10:03:28 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 10:03:31 -0700 (PDT) Received: from fall-lakeland.atl.sa.earthlink.net (fall-lakeland.atl.sa.earthlink.net [207.69.195.103]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OH3QfB028594 for ; Tue, 24 Apr 2007 10:03:28 -0700 Received: from pop-satin.atl.sa.earthlink.net ([207.69.195.63]) by fall-lakeland.atl.sa.earthlink.net with esmtp (Exim 4.34) id 1HgOEH-0005Y0-5U for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:51:01 -0400 Received: from user-11fb6mm.dsl.mindspring.com ([66.245.154.214] helo=[192.168.0.10]) by pop-satin.atl.sa.earthlink.net with esmtp (Exim 3.36 #1) id 1HgOEF-0006ft-00 for xfs@oss.sgi.com; Tue, 24 Apr 2007 12:50:59 -0400 Subject: system freeze with xfs From: Jurgen Schulz To: xfs@oss.sgi.com Content-Type: text/plain Date: Tue, 24 Apr 2007 09:50:57 -0700 Message-Id: <1177433458.3360.12.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 (2.6.3-2.fc5) Content-Transfer-Encoding: 7bit X-archive-position: 11182 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jmschulz@earthlink.net Precedence: bulk X-list: xfs When I create an xfs filesystem and run 'stress' (http://weather.ou.edu/~apw/projects/stress/stress.html) I can induce a system freeze (hang, no panic). There are no issues if I create a jfs filesystem (on the same disk) or run 'stress' on a different disk (ext3) I'd like to know how I could go about debugging this, otherwise I will have to switch filesystem types. 
Details: % stress -d 2 --hdd-noclean --hdd-bytes 3G --verbose Fedora Core 5 2.6.20-1.2312.fc5 #1 Tue Apr 10 15:09:44 EDT 2007 i686 athlon i386 GNU/Linux # /sbin/mkfs.xfs -f /dev/VolTest/test meta-data=/dev/VolTest/test isize=256 agcount=16, agsize=198656 blks = sectsz=512 data = bsize=4096 blocks=3178496, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=2560, version=1 = sectsz=512 sunit=0 blks realtime =none extsz=65536 blocks=0, rtextents=0 # /sbin/hdparm /dev/hdk /dev/hdk: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 25228/16/63, sectors = 25429824, start = 0 From owner-xfs@oss.sgi.com Tue Apr 24 10:32:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 10:32:18 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OHWEfB008448 for ; Tue, 24 Apr 2007 10:32:15 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHW52C020254; Tue, 24 Apr 2007 13:32:05 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHW5NE029709; Tue, 24 Apr 2007 13:32:05 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHW4vl015957; Tue, 24 Apr 2007 13:32:04 -0400 Message-ID: <462E3DEA.1070907@sandeen.net> Date: Tue, 24 Apr 2007 12:27:06 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: David Chinner CC: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review: allocate alloc args References: <20070419073216.GT48531920@melbourne.sgi.com> <20070423212201.GB13572@infradead.org> 
<20070423223257.GM32602149@melbourne.sgi.com> In-Reply-To: <20070423223257.GM32602149@melbourne.sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11183 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs David Chinner wrote: > On Mon, Apr 23, 2007 at 10:22:01PM +0100, Christoph Hellwig wrote: >> I don't like doing even more dynamic allocations that deep >> down in the stack. > > I'm not a big fan of it either, but I don't really see any other > option here. We need a bunch of temporary space for structures > *somewhere*, and if there isn't enough stack space then it's > got to come frm somewhere else. How about a global array of such structures which can be accessed as needed. :) /me runs I think Christoph is on the right track here; find ways to make the functions use less stack down the chain, either by breaking them up, breaking up the large structures into what's actually needed, or something along those types of refactoring lines... I'm so burned out on this stuff though. 
:) -Eric From owner-xfs@oss.sgi.com Tue Apr 24 10:35:15 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 10:35:21 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3OHZCfB010141 for ; Tue, 24 Apr 2007 10:35:15 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHYx1a022547; Tue, 24 Apr 2007 13:34:59 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHYrBQ030532; Tue, 24 Apr 2007 13:34:53 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3OHYqak016254; Tue, 24 Apr 2007 13:34:53 -0400 Message-ID: <462E3E93.8090400@sandeen.net> Date: Tue, 24 Apr 2007 12:29:55 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jurgen Schulz CC: xfs@oss.sgi.com Subject: Re: system freeze with xfs References: <1177433458.3360.12.camel@localhost.localdomain> In-Reply-To: <1177433458.3360.12.camel@localhost.localdomain> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11184 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Jurgen Schulz wrote: > When I create an xfs filesystem and run > 'stress' (http://weather.ou.edu/~apw/projects/stress/stress.html) I can > induce a system freeze (hang, no panic). There are no issues if I create > a jfs filesystem (on the same disk) or run 'stress' on a different disk > (ext3) > > I'd like to know how I could go about debugging this, otherwise I will > have to switch filesystem types. is that a threat? 
;-) Try enabling sysrq, and do sysrq-t when it freezes, as a first pass, to see where all the threads are (stuck) at. You might also try this on a stock upstream kernel, and see if you have the same problem with 8k stacks vs. 4k (as is in FC6) You could also set up kdump and/or netdump (whatever FC5 supports...) and get a system dump at the time it freezes up, via sysrq-c. Thanks, -Eric From owner-xfs@oss.sgi.com Tue Apr 24 15:55:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Tue, 24 Apr 2007 15:55:58 -0700 (PDT) Received: from smtp104.sbc.mail.mud.yahoo.com (smtp104.sbc.mail.mud.yahoo.com [68.142.198.203]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3OMtsfB023573 for ; Tue, 24 Apr 2007 15:55:55 -0700 Received: (qmail 77826 invoked from network); 24 Apr 2007 22:55:53 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp104.sbc.mail.mud.yahoo.com with SMTP; 24 Apr 2007 22:55:52 -0000 X-YMail-OSG: DMgglAQVM1kljmf__vDrRSD7V6n8ocZbhagG_gkvf9uJuOri Received: by tuatara.stupidest.org (Postfix, from userid 10000) id A1E061826127; Tue, 24 Apr 2007 15:55:51 -0700 (PDT) Date: Tue, 24 Apr 2007 15:55:51 -0700 From: Chris Wedgwood To: Jurgen Schulz Cc: xfs@oss.sgi.com Subject: Re: system freeze with xfs Message-ID: <20070424225551.GA17832@tuatara.stupidest.org> References: <1177433458.3360.12.camel@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177433458.3360.12.camel@localhost.localdomain> X-archive-position: 11185 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 09:50:57AM -0700, Jurgen Schulz wrote: > Fedora Core 5 2.6.20-1.2312.fc5 #1 Tue Apr 10 15:09:44 EDT 2007 i686 athlon i386 GNU/Linux Smells like 4k stack on i386 blowing up; try w/o the 4k stacks option and see how that hold up From 
owner-xfs@oss.sgi.com Wed Apr 25 00:09:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 25 Apr 2007 00:09:35 -0700 (PDT) Received: from m01.elite.ru (m01.elite.ru [89.108.83.249]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3P79RfB004823 for ; Wed, 25 Apr 2007 00:09:30 -0700 Received: from [10.0.0.6] (195.225.129.41 [195.225.129.41]) by m01.elite.ru (ISMS) with ESMTP id 1F2D3116A9 for ; Wed, 25 Apr 2007 10:52:02 +0400 (MSD) Date: Wed, 25 Apr 2007 10:52:07 +0400 From: Arkadiy Kulev X-Mailer: The Bat! (v3.85.03) Professional Reply-To: Arkadiy Kulev X-Priority: 3 (Normal) Message-ID: <981086804.20070425105207@ethaniel.com> To: xfs@oss.sgi.com Subject: switching back from 2.6.20 to 2.6.18 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 11186 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eth@ethaniel.com Precedence: bulk X-list: xfs Hello everyone, I have a strange error that is crashing my system (not involving XFS) so I want to switch back from 2.6.20 to 2.6.18. I have been using 2.6.20 for about 2 weeks now and I can't find the version of XFS, that comes with it. Is it safe to switch back from 2.6.20 to 2.6.18? I want to test and see if this fixes my crash bug. Will this break my data? 
Best regards, Arkadiy mailto:eth@ethaniel.com From owner-xfs@oss.sgi.com Wed Apr 25 01:35:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 25 Apr 2007 01:35:42 -0700 (PDT) Received: from smtp101.sbc.mail.mud.yahoo.com (smtp101.sbc.mail.mud.yahoo.com [68.142.198.200]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3P8ZcfB015121 for ; Wed, 25 Apr 2007 01:35:39 -0700 Received: (qmail 96566 invoked from network); 25 Apr 2007 08:35:37 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp101.sbc.mail.mud.yahoo.com with SMTP; 25 Apr 2007 08:35:37 -0000 X-YMail-OSG: 3oHbGQkVM1lB5uwfEGlWHPLQCPfJRlj_uGgqssfCpguub7o5Mg7bqTi2he0ZsngsBXk6yeXZPA-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id E4AFC1826127; Wed, 25 Apr 2007 01:35:35 -0700 (PDT) Date: Wed, 25 Apr 2007 01:35:35 -0700 From: Chris Wedgwood To: Arkadiy Kulev Cc: xfs@oss.sgi.com Subject: Re: switching back from 2.6.20 to 2.6.18 Message-ID: <20070425083535.GA10808@tuatara.stupidest.org> References: <981086804.20070425105207@ethaniel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <981086804.20070425105207@ethaniel.com> X-archive-position: 11187 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Wed, Apr 25, 2007 at 10:52:07AM +0400, Arkadiy Kulev wrote: > Is it safe to switch back from 2.6.20 to 2.6.18? I want to test and > see if this fixes my crash bug. Will this break my data? 
it should be fine From owner-xfs@oss.sgi.com Wed Apr 25 16:11:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 25 Apr 2007 16:11:48 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3PNBefB005748 for ; Wed, 25 Apr 2007 16:11:44 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA06986; Thu, 26 Apr 2007 09:11:35 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3PNBXAf74067575; Thu, 26 Apr 2007 09:11:34 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3PNBV0x75252164; Thu, 26 Apr 2007 09:11:31 +1000 (AEST) Date: Thu, 26 Apr 2007 09:11:31 +1000 From: David Chinner To: Eric Sandeen Cc: David Chinner , Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review: allocate alloc args Message-ID: <20070425231131.GM48531920@melbourne.sgi.com> References: <20070419073216.GT48531920@melbourne.sgi.com> <20070423212201.GB13572@infradead.org> <20070423223257.GM32602149@melbourne.sgi.com> <462E3DEA.1070907@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <462E3DEA.1070907@sandeen.net> User-Agent: Mutt/1.4.2.1i X-archive-position: 11188 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, Apr 24, 2007 at 12:27:06PM -0500, Eric Sandeen wrote: > David Chinner wrote: > > On Mon, Apr 23, 2007 at 10:22:01PM +0100, Christoph Hellwig wrote: > > >> I don't like doing even more dynamic allocations that deep > >> down in the stack. > > > > I'm not a big fan of it either, but I don't really see any other > > option here. 
We need a bunch of temporary space for structures
> > *somewhere*, and if there isn't enough stack space then it's
> > got to come from somewhere else.
>
> How about a global array of such structures which can be accessed as
> needed. :)
>
> /me runs
>
> I think Christoph is on the right track here; find ways to make the
> functions use less stack down the chain, either by breaking them up,
> breaking up the large structures into what's actually needed, or
> something along those types of refactoring lines...

If that is our only option, then I can't see us making any significant
impact on the stack usage without a substantial rewrite of the code.
Given how critical it is for this code to be correct, QA time for any
substantial code change here is going to be measured in months....

So this approach is not going to give us any relief in the short/medium
term and hence I have to question the value of doing such a rework
because in the medium/long term ia32 is no longer important.

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Wed Apr 25 18:41:20 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 25 Apr 2007 18:41:22 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3Q1fJfB002387 for ; Wed, 25 Apr 2007 18:41:20 -0700 Received: from [10.0.0.4] (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 3D70318022E01; Wed, 25 Apr 2007 20:41:18 -0500 (CDT) Message-ID: <4630033A.5050500@sandeen.net> Date: Wed, 25 Apr 2007 20:41:14 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (Macintosh/20070221) MIME-Version: 1.0 To: David Chinner CC: Christoph Hellwig , xfs-dev , xfs-oss Subject: Re: review: allocate alloc args References: <20070419073216.GT48531920@melbourne.sgi.com> <20070423212201.GB13572@infradead.org> <20070423223257.GM32602149@melbourne.sgi.com> <462E3DEA.1070907@sandeen.net> <20070425231131.GM48531920@melbourne.sgi.com> In-Reply-To: <20070425231131.GM48531920@melbourne.sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11189 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs David Chinner wrote: > So this approach is not going to give us any relief in the short/medium > term and hence I have to question the value of doing such a rework > because in the medium/long term ia32 is no longer important. It's been this way for so long already... I probably agree with you. 
-Eric From owner-xfs@oss.sgi.com Wed Apr 25 23:28:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Wed, 25 Apr 2007 23:28:25 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3Q6SJfB014323 for ; Wed, 25 Apr 2007 23:28:20 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 83D5C92CE3A for ; Thu, 26 Apr 2007 14:20:19 +1000 (EST) Subject: [PATCH] make growfs check device size limits too From: Nathan Scott Reply-To: nscott@aconex.com To: xfs@oss.sgi.com Content-Type: multipart/mixed; boundary="=-Nuo0430tAw3aLPKSo1ct" Organization: Aconex Date: Thu, 26 Apr 2007 16:30:14 +1000 Message-Id: <1177569014.6273.367.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 X-archive-position: 11190 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs --=-Nuo0430tAw3aLPKSo1ct Content-Type: text/plain Content-Transfer-Encoding: 7bit On the mount path we check for a superblock that describes a filesystem too large for the running kernel to handle. This catches the case of an attempt to mount a >16TB filesystem on i386 (where we are limited by the page->index size, for XFS metadata buffers in xfs_buf.c). This patch makes similar checks on the growfs code paths for regular and realtime growth, else we can end up with filesystem corruption, it would seem (from #xfs chatter). Untested patch follows; probably better to do this as a macro, in a header, and call that in each place...? cheers. 
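For illustration only, the page-index limit described above could be factored into one helper along these lines. This is a minimal userspace sketch, not code from the XFS tree: the name fs_blocks_exceed_index(), its parameters, and the explicit index_max argument (standing in for ULONG_MAX so the 32-bit case can be exercised on any host) are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (name and signature are illustrative assumptions).
 * Returns true when a filesystem of nblocks blocks, each 2^blocklog bytes,
 * spans more pages than index_max, mirroring the "> ULONG_MAX" comparison
 * used for the page cache index limit.
 */
static bool fs_blocks_exceed_index(uint64_t nblocks, unsigned int blocklog,
                                   unsigned int page_shift, uint64_t index_max)
{
    /* This sketch assumes block size <= page size. */
    assert(blocklog <= page_shift);

    /* Convert filesystem blocks to whole pages, rounding down. */
    uint64_t npages = nblocks >> (page_shift - blocklog);

    return npages > index_max;
}
```

With 4k blocks, 4k pages and a 32-bit index, the first block count rejected is 2^32, i.e. 2^32 * 4096 bytes = 16TB, which matches the i386 limit mentioned above.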
-- Nathan --=-Nuo0430tAw3aLPKSo1ct Content-Disposition: attachment; filename=growfs.patch Content-Type: text/x-patch; name=growfs.patch; charset=UTF-8 Content-Transfer-Encoding: 7bit --- fs/xfs/xfs_fsops.c.orig 2007-04-26 16:05:38.126936000 +1000 +++ fs/xfs/xfs_fsops.c 2007-04-26 16:17:03.385762000 +1000 @@ -148,6 +148,20 @@ return error; ASSERT(bp); xfs_buf_relse(bp); + /* + * Device drivers seem to be pathological liars... so, guess we + * better check that the size isn't something completely insane. + * Same check is done during mount, so we wont create something + * here that we cannot later mount, at least. + */ +#if XFS_BIG_BLKNOS /* Limited by ULONG_MAX of page cache index */ + if (unlikely( + (nb >> (PAGE_CACHE_SHIFT - sbp->sb_blocklog)) > ULONG_MAX)) +#else /* Limited by UINT_MAX of sectors */ + if (unlikely( + (nb << (sbp->sb_blocklog - BBSHIFT)) > UINT_MAX)) +#endif + return XFS_ERROR(E2BIG); new = nb; /* use new as a temporary here */ nb_mod = do_div(new, mp->m_sb.sb_agblocks); --- fs/xfs/xfs_rtalloc.c.orig 2007-04-26 16:16:34.695969000 +1000 +++ fs/xfs/xfs_rtalloc.c 2007-04-26 16:22:43.227000750 +1000 @@ -1893,6 +1893,20 @@ ASSERT(bp); xfs_buf_relse(bp); /* + * Device drivers seem to be pathological liars... so, guess we + * better check that the size isn't something completely insane. + * Same check is done during mount, so we wont create something + * here that we cannot later mount, at least. + */ +#if XFS_BIG_BLKNOS /* Limited by ULONG_MAX of page cache index */ + if (unlikely( + (nrblocks >> (PAGE_CACHE_SHIFT - sbp->sb_blocklog)) > ULONG_MAX)) +#else /* Limited by UINT_MAX of sectors */ + if (unlikely( + (nrblocks << (sbp->sb_blocklog - BBSHIFT)) > UINT_MAX)) +#endif + return XFS_ERROR(E2BIG); + /* * Calculate new parameters. These are the final values to be reached. 
*/ nrextents = nrblocks; --=-Nuo0430tAw3aLPKSo1ct-- From owner-xfs@oss.sgi.com Thu Apr 26 00:34:31 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 00:34:35 -0700 (PDT) Received: from pentafluge.infradead.org (pentafluge.infradead.org [213.146.154.40]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3Q7YUfB032254 for ; Thu, 26 Apr 2007 00:34:31 -0700 Received: from hch by pentafluge.infradead.org with local (Exim 4.63 #1 (Red Hat Linux)) id 1Hgy7z-0006VT-Cp; Thu, 26 Apr 2007 08:10:55 +0100 Date: Thu, 26 Apr 2007 08:10:55 +0100 From: Christoph Hellwig To: Nathan Scott Cc: xfs@oss.sgi.com Subject: Re: [PATCH] make growfs check device size limits too Message-ID: <20070426071055.GA24411@infradead.org> References: <1177569014.6273.367.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177569014.6273.367.camel@edge> User-Agent: Mutt/1.4.2.2i X-SRS-Rewrite: SMTP reverse-path rewritten from by pentafluge.infradead.org See http://www.infradead.org/rpr.html X-archive-position: 11191 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: hch@infradead.org Precedence: bulk X-list: xfs On Thu, Apr 26, 2007 at 04:30:14PM +1000, Nathan Scott wrote: > On the mount path we check for a superblock that describes a filesystem > too large for the running kernel to handle. This catches the case of an > attempt to mount a >16TB filesystem on i386 (where we are limited by the > page->index size, for XFS metadata buffers in xfs_buf.c). > > This patch makes similar checks on the growfs code paths for regular and > realtime growth, else we can end up with filesystem corruption, it would > seem (from #xfs chatter). Untested patch follows; probably better to do > this as a macro, in a header, and call that in each place...? Yeah, the check should probably be in one place only. 
Given that it's only used in slow paths, a function would probably do it. From owner-xfs@oss.sgi.com Thu Apr 26 03:44:18 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 03:44:21 -0700 (PDT) Received: from nz-out-0506.google.com (nz-out-0506.google.com [64.233.162.225]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QAiFfB007454 for ; Thu, 26 Apr 2007 03:44:18 -0700 Received: by nz-out-0506.google.com with SMTP id m22so615477nzf for ; Thu, 26 Apr 2007 03:44:15 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=MnzWPaFKQqKlZg5A8idOBXbfBLmFQPqKC3HAeRZ1fpqkiK0ATb4ZRC+c8hxk6jbM892z7N0Iby16PC8v9Q5OEDksHFeT4ZVf5nJYWKnPNlqaN3//9myMtN18E4rsIM1Z12bN0qQvpUuPycyd+mXxrRQCcDW+YXH0o+dEbztKyuU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=sXU/karJkdi5tkXk5Xv27Yy45R5FIvR5y7j/D9LBmDL8trBBP1C4nISZFoG9yav0dhyCAbTpTtzgj6ah0AqrfhVHCYJK0RRmjkzQoLnTMZFD2GhilZe/ZjSeZL6SfyLgdMEpPt15csF+qrqTFxtPa2i7Imoe4CjxgCjx93lMfYI= Received: by 10.115.110.6 with SMTP id n6mr536240wam.1177582672521; Thu, 26 Apr 2007 03:17:52 -0700 (PDT) Received: by 10.115.108.13 with HTTP; Thu, 26 Apr 2007 03:17:47 -0700 (PDT) Message-ID: <9a8748490704260317u211c244au3c5478ac02c12c43@mail.gmail.com> Date: Thu, 26 Apr 2007 12:17:47 +0200 From: "Jesper Juhl" To: "David Chinner" Subject: Re: 2.6.20.3 - possible recursive locking detected - in XFS Cc: linux-kernel@vger.kernel.org, xfs-masters@oss.sgi.com, xfs@oss.sgi.com, viro@zeniv.linux.org.uk In-Reply-To: <20070425233345.GN48531920@melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: 
inline References: <200704251116.57737.jesper.juhl@gmail.com> <20070425233345.GN48531920@melbourne.sgi.com> X-archive-position: 11192 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jesper.juhl@gmail.com Precedence: bulk X-list: xfs On 26/04/07, David Chinner wrote: > On Wed, Apr 25, 2007 at 11:16:57AM +0200, Jesper Juhl wrote: > > > > Hi, > > > > For your information : > > > > Once in a while I see the message below after I've just created a new XFS filesystem, mount it and then start copying data to it. > > ..... > > > ============================================= > > [ INFO: possible recursive locking detected ] > > 2.6.20.3generic #1 > > --------------------------------------------- > > xfs_fsr/6117 is trying to acquire lock: > > (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x7d/0xa0 [xfs] > > > > but task is already holding lock: > > (&(&ip->i_lock)->mr_lock){----}, at: [] xfs_ilock+0x7d/0xa0 [xfs] > > Known false positive - XFS doesn't have the annotations needed for > this yet; we've got a patch that will probably make it's way into 6.5.22 that > should fix most of these issues. > Ok. Thanks for the feedback. 
-- Jesper Juhl Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html From owner-xfs@oss.sgi.com Thu Apr 26 10:13:53 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 10:13:56 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QHDqfB021432 for ; Thu, 26 Apr 2007 10:13:53 -0700 Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l3QHDoed014138; Thu, 26 Apr 2007 13:13:50 -0400 Received: from pobox-2.corp.redhat.com (pobox-2.corp.redhat.com [10.11.255.15]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3QHDols018294; Thu, 26 Apr 2007 13:13:50 -0400 Received: from [10.15.80.10] (neon.msp.redhat.com [10.15.80.10]) by pobox-2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l3QHDmIx021489; Thu, 26 Apr 2007 13:13:49 -0400 Message-ID: <4630DDC2.2040705@sandeen.net> Date: Thu, 26 Apr 2007 12:13:38 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (X11/20070302) MIME-Version: 1.0 To: Jesper Juhl CC: David Chinner , linux-kernel@vger.kernel.org, xfs-masters@oss.sgi.com, xfs@oss.sgi.com, viro@zeniv.linux.org.uk Subject: Re: 2.6.20.3 - possible recursive locking detected - in XFS References: <200704251116.57737.jesper.juhl@gmail.com> <20070425233345.GN48531920@melbourne.sgi.com> <9a8748490704260317u211c244au3c5478ac02c12c43@mail.gmail.com> In-Reply-To: <9a8748490704260317u211c244au3c5478ac02c12c43@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-archive-position: 11193 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Jesper Juhl wrote: > On 26/04/07, David Chinner wrote: >> Known false positive - XFS doesn't have the annotations needed for >> this 
yet; we've got a patch that will probably make it's way into 6.5.22 that >> should fix most of these issues. wow... 6.5.22.... do we really have to wait that long? :) -Eric From owner-xfs@oss.sgi.com Thu Apr 26 10:50:58 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 10:51:01 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QHotfB000706 for ; Thu, 26 Apr 2007 10:50:57 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3QHoqfb002909 for ; Thu, 26 Apr 2007 13:50:52 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QHoqGr478240 for ; Thu, 26 Apr 2007 13:50:52 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QHoqvK017740 for ; Thu, 26 Apr 2007 13:50:52 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QHooKx017630; Thu, 26 Apr 2007 13:50:51 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id B712629EBC1; Thu, 26 Apr 2007 23:20:57 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QHougr004216; Thu, 26 Apr 2007 23:20:57 +0530 Date: Thu, 26 Apr 2007 23:20:56 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 0/5] fallocate system call Message-ID: <20070426175056.GA25321@amitarora.in.ibm.com> References: <20070317111036.GC29931@parisc-linux.org> <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070424121632.GA10136@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11194 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs Based on the discussion, this new patchset uses following as the interface for fallocate() system call: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) It seems that only s390 architecture has a problem with such a layout of arguments in fallocate(). Thus for s390, we plan to have a wrapper (say, sys_s390_fallocate()) for the sys_fallocate(), which will get called by glibc when an application issues a fallocate() system call on s390. The s390 arch specific changes will be part of a separate patch (PATCH 2/5). It will be great if some s390 expert can verify the patch, since I have not been able to test it on s390 so far. It was also noted that minor changes might be required to strace code to take care of "different arguments on s390" issue. 
Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for preallocation and deallocation of preallocated blocks respectively. More modes can be added, when required. ToDos: ===== 1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)) 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> Changes to glibc, a) to support fallocate() system call b) so that posix_fallocate() and posix_fallocate64() call fallocate() system call 4> Changes to XFS to implement the fallocate inode operation Following patches follow: Patch 1/5 : fallocate() implementation in i86, x86_64 and powerpc Patch 2/5 : fallocate() on s390 Patch 3/5 : ext4: Extent overlap bugfix Patch 4/5 : ext4: fallocate support in ext4 Patch 5/5 : ext4: write support for preallocated blocks -- Regards, Amit Arora From owner-xfs@oss.sgi.com Thu Apr 26 11:03:30 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:03:34 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QI3SfB004499 for ; Thu, 26 Apr 2007 11:03:29 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3QI3RbF027433 for ; Thu, 26 Apr 2007 14:03:27 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QI3RXU551114 for ; Thu, 26 Apr 2007 14:03:27 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QI3QDP020688 for ; Thu, 26 Apr 2007 14:03:27 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av03.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QI3Pw7020631; Thu, 26 Apr 2007 14:03:25 -0400 Received: from amitarora.in.ibm.com 
(localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 7428B29EBC1; Thu, 26 Apr 2007 23:33:33 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QI3XvT009314; Thu, 26 Apr 2007 23:33:33 +0530 Date: Thu, 26 Apr 2007 23:33:32 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Message-ID: <20070426180332.GA7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11195 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. NOTE: It is based on 2.6.21 kernel version. 
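As a rough aid to reading the patch below, the offset/len range checks in sys_fallocate() can be modelled in userspace as follows. This is only a sketch under stated assumptions: the function name fallocate_check_range() is made up for illustration, maxbytes stands in for inode->i_sb->s_maxbytes, and the fd, write-mode and file-type checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the offset/len checks in the posted
 * sys_fallocate(); the function name is an assumption for illustration.
 * maxbytes stands in for inode->i_sb->s_maxbytes.
 */
static long fallocate_check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
    assert(maxbytes >= 0);

    /* Zero length or negative offset is rejected first. */
    if (len == 0 || offset < 0)
        return -EINVAL;

    /*
     * Same condition as "offset + len > maxbytes", rearranged so the
     * addition cannot overflow in this sketch (offset >= 0 here).
     */
    if (len > maxbytes - offset)
        return -EFBIG;

    return 0;
}
```

For example, with maxbytes = 100, a request for offset 50 and len 50 passes, while offset 50 and len 51 returns -EFBIG.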
Signed-off-by: Amit Arora --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 ++++++ arch/x86_64/kernel/functionlist | 1 fs/open.c | 41 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 +- include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 +- include/asm-x86_64/unistd.h | 4 ++- include/linux/fs.h | 7 ++++++ include/linux/syscalls.h | 1 10 files changed, 66 insertions(+), 3 deletions(-) Index: linux-2.6.21/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S +++ linux-2.6.21/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_fallocate /* 320 */ Index: linux-2.6.21/arch/x86_64/kernel/functionlist =================================================================== --- linux-2.6.21.orig/arch/x86_64/kernel/functionlist +++ linux-2.6.21/arch/x86_64/kernel/functionlist @@ -931,6 +931,7 @@ *(.text.sys_getitimer) *(.text.sys_getgroups) *(.text.sys_ftruncate) +*(.text.sys_fallocate) *(.text.sysfs_lookup) *(.text.sys_exit_group) *(.text.stub_fork) Index: linux-2.6.21/fs/open.c =================================================================== --- linux-2.6.21.orig/fs/open.c +++ linux-2.6.21/fs/open.c @@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned } #endif +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset < 0) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + if (!S_ISREG(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + if (offset + len > inode->i_sb->s_maxbytes) + goto out_fput; 
+ + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; +out_fput: + fput(file); +out: + return ret; +} +EXPORT_SYMBOL(sys_fallocate); + /* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and Index: linux-2.6.21/include/asm-i386/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-i386/unistd.h +++ linux-2.6.21/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages 317 #define __NR_getcpu 318 #define __NR_epoll_pwait 319 +#define __NR_fallocate 320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/asm-powerpc/systbl.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/systbl.h +++ linux-2.6.21/include/asm-powerpc/systbl.h @@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list) COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) +COMPAT_SYS(fallocate) Index: linux-2.6.21/include/asm-powerpc/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-powerpc/unistd.h +++ linux-2.6.21/include/asm-powerpc/unistd.h @@ -326,10 +326,11 @@ #define __NR_move_pages 301 #define __NR_getcpu 302 #define __NR_epoll_pwait 303 +#define __NR_fallocate 304 #ifdef __KERNEL__ -#define __NR_syscalls 304 +#define __NR_syscalls 305 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6.21/include/asm-x86_64/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-x86_64/unistd.h +++ linux-2.6.21/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define 
__NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_fallocate 280 +__SYSCALL(__NR_fallocate, sys_fallocate) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_fallocate #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.21/include/linux/fs.h =================================================================== --- linux-2.6.21.orig/include/linux/fs.h +++ linux-2.6.21/include/linux/fs.h @@ -264,6 +264,12 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * fallocate() modes + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1125,6 +1131,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *, int, loff_t, loff_t); }; struct seq_file; Index: linux-2.6.21/include/linux/syscalls.h =================================================================== --- linux-2.6.21.orig/include/linux/syscalls.h +++ linux-2.6.21/include/linux/syscalls.h @@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); Index: linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c =================================================================== --- linux-2.6.21.orig/arch/powerpc/kernel/sys_ppc32.c +++ linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c @@ -777,6 +777,13 @@ asmlinkage int compat_sys_truncate64(con return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, 
+ u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { From owner-xfs@oss.sgi.com Thu Apr 26 11:10:57 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:11:00 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QIAufB007236 for ; Thu, 26 Apr 2007 11:10:57 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3QIAt0h001522 for ; Thu, 26 Apr 2007 14:10:55 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QIAtlK507150 for ; Thu, 26 Apr 2007 14:10:55 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QIAsd8006026 for ; Thu, 26 Apr 2007 14:10:55 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QIArVU005924; Thu, 26 Apr 2007 14:10:54 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 7FCA629EBC1; Thu, 26 Apr 2007 23:41:01 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QIB1Sl012259; Thu, 26 Apr 2007 23:41:01 +0530 Date: Thu, 26 Apr 2007 23:41:01 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 3/5] ext4: Extent overlap bugfix Message-ID: <20070426181101.GC7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11196 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This is a fix for an extent-overlap bug. The fallocate() implementation on ext4 depends on this bugfix. Though this fix had been posted earlier, but because it is still not part of mainline code, I have attached it here too. Signed-off-by: Amit Arora --- fs/ext4/extents.c | 50 ++++++++++++++++++++++++++++++++++++++-- include/linux/ext4_fs_extents.h | 1 2 files changed, 49 insertions(+), 2 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -1129,6 +1129,45 @@ ext4_can_extents_be_merged(struct inode } /* + * ext4_ext_check_overlap: + * check if a portion of the "newext" extent overlaps with an + * existing extent. 
+ * + * If there is an overlap discovered, it updates the length of the newext + * such that there will be no overlap, and then returns 1. + * If there is no overlap found, it returns 0. + */ +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + + b1 = le32_to_cpu(newext->ee_block); + len1 = le16_to_cpu(newext->ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext->ee_block); + + /* get the next allocated block if the extent in the path + * is before the requested block(s) */ + if (b2 < b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + if (b1 + len1 > b2) { + newext->ee_len = cpu_to_le16(b2 - b1); + return 1; + } +out: + return 0; +} + +/* * ext4_ext_insert_extent: * tries to merge requsted extent into the existing extent or * inserts requested extent as new one into the tree, @@ -2032,7 +2071,15 @@ int ext4_ext_get_blocks(handle_t *handle /* allocate new block */ goal = ext4_ext_find_goal(inode, path, iblock); - allocated = max_blocks; + + /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */ + newex.ee_block = cpu_to_le32(iblock); + newex.ee_len = cpu_to_le16(max_blocks); + err = ext4_ext_check_overlap(inode, &newex, path); + if (err) + allocated = le16_to_cpu(newex.ee_len); + else + allocated = max_blocks; newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err); if (!newblock) goto out2; @@ -2040,7 +2087,6 @@ int ext4_ext_get_blocks(handle_t *handle goal, newblock, allocated); /* try to insert new extent into found leaf and return */ - newex.ee_block = cpu_to_le32(iblock); ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, &newex); Index: linux-2.6.21/include/linux/ext4_fs_extents.h 
=================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *); From owner-xfs@oss.sgi.com Thu Apr 26 11:16:21 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:16:24 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QIGIfB009023 for ; Thu, 26 Apr 2007 11:16:20 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3QIGI3W009891 for ; Thu, 26 Apr 2007 14:16:18 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QIGIRY514426 for ; Thu, 26 Apr 2007 14:16:18 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QIGHjr023135 for ; Thu, 26 Apr 2007 14:16:18 -0400 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d01av01.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QIGFos023094; Thu, 26 Apr 2007 14:16:16 -0400 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 131A129EBC1; Thu, 26 Apr 2007 
23:46:24 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QIGNKf014395; Thu, 26 Apr 2007 23:46:23 +0530 Date: Thu, 26 Apr 2007 23:46:23 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 5/5] ext4: write support for preallocated blocks/extents Message-ID: <20070426181623.GE7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11197 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch adds write support for preallocated (using fallocate system call) blocks/extents. The preallocated extents in ext4 are marked "uninitialized", hence they need special handling especially while writing to them. This patch takes care of that. 
Signed-off-by: Amit Arora --- fs/ext4/extents.c | 228 +++++++++++++++++++++++++++++++++++----- include/linux/ext4_fs_extents.h | 1 2 files changed, 202 insertions(+), 27 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode } /* + * ext4_ext_try_to_merge: + * tries to merge the "ex" extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass "ex - 1" as argument instead of "ex". + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done=0, uninitialized = 0; + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex < EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 < EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); + merge_done = 1; + BUG_ON(eh->eh_entries == 0); + } + + return merge_done; +} + + +/* * ext4_ext_check_overlap: * check if a portion of the "newext" extent overlaps with an * existing extent. 
@@ -1316,25 +1361,7 @@ has_space: merge: /* try to merge extents to the right */ - while (nearex < EXT_LAST_EXTENT(eh)) { - if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) - break; - /* merge with next extent! */ - if (ext4_ext_is_uninitialized(nearex)) - uninitialized = 1; - nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) - + ext4_ext_get_actual_len(nearex + 1)); - if (uninitialized) - ext4_ext_mark_uninitialized(nearex); - - if (nearex + 1 < EXT_LAST_EXTENT(eh)) { - len = (EXT_LAST_EXTENT(eh) - nearex - 1) - * sizeof(struct ext4_extent); - memmove(nearex + 1, nearex + 2, len); - } - eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1); - BUG_ON(eh->eh_entries == 0); - } + ext4_ext_try_to_merge(inode, path, nearex); /* try to merge extents to the left */ @@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block #endif } +/* + * ext4_ext_convert_to_initialized: + * this function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (up to three). At least one initialized extent + * and at most two uninitialized extents can result. + * There are three possibilities: + * a> No split required: Entire extent should be initialized. + * b> Split into two extents: Only one end of the extent is being written to. + * c> Split into three extents: Someone is writing in the middle of the extent. 
+ */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0, ret = 0; + + depth = ext_depth(inode); + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex->ee_block); + ee_len = ext4_ext_get_actual_len(ex); + allocated = ee_len - (iblock - ee_block); + newblock = iblock - ee_block + ext_pblock(ex); + ex2 = ex; + + /* ex1: ee_block to iblock - 1 : uninitialized */ + if (iblock > ee_block) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* for sanity, update the length of the ex2 extent before + * we insert ex3, if ex1 is NULL. This is to avoid temporary + * overlap of blocks. + */ + if (!ex1 && allocated > max_blocks) + ex2->ee_len = cpu_to_le16(max_blocks); + /* ex3: to ee_block + ee_len : uninitialised */ + if (allocated > max_blocks) { + unsigned int newdepth; + ex3 = &newex; + ex3->ee_block = cpu_to_le32(iblock + max_blocks); + ext4_ext_store_pblock(ex3, newblock + max_blocks); + ex3->ee_len = cpu_to_le16(allocated - max_blocks); + ext4_ext_mark_uninitialized(ex3); + err = ext4_ext_insert_extent(handle, inode, path, ex3); + if (err) + goto out; + /* The depth, and hence eh & ex might change + * as part of the insert above. 
+ */ + newdepth = ext_depth(inode); + if (newdepth != depth) + { + depth=newdepth; + path = ext4_ext_find_extent(inode, iblock, NULL); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + goto out; + } + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + if (ex2 != &newex) + ex2 = ex; + } + allocated = max_blocks; + } + /* If there was a change of depth as part of the + * insertion of ex3 above, we need to update the length + * of the ex1 extent again here + */ + if (ex1 && ex1 != ex) { + ex1 = ex; + ex1->ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = &newex; + } + /* ex2: iblock to iblock + maxblocks-1 : initialised */ + ex2->ee_block = cpu_to_le32(iblock); + ex2->ee_start = cpu_to_le32(newblock); + ext4_ext_store_pblock(ex2, newblock); + ex2->ee_len = cpu_to_le16(allocated); + if (ex2 != ex) + goto insert; + if ((err = ext4_ext_get_access(handle, inode, path + depth))) + goto out; + /* New (initialized) extent starts from the first block + * in the current extent. i.e., ex2 == ex + * We have to see if it can be merged with the extent + * on the left. + */ + if (ex2 > EXT_FIRST_EXTENT(eh)) { + /* To merge left, pass "ex2 - 1" to try_to_merge(), + * since it merges towards right _only_. + */ + ret = ext4_ext_try_to_merge(inode, path, ex2 - 1); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + depth = ext_depth(inode); + ex2--; + } + } + /* Try to Merge towards right. This might be required + * only when the whole extent is being written to. + * i.e. ex2==ex and ex3==NULL. + */ + if (!ex3) { + ret = ext4_ext_try_to_merge(inode, path, ex2); + if (ret) { + err = ext4_ext_correct_indexes(handle, inode, path); + if (err) + goto out; + } + } + /* Mark modified extent as dirty */ + err = ext4_ext_dirty(handle, inode, path + depth); + goto out; +insert: + err = ext4_ext_insert_extent(handle, inode, path, &newex); +out: + return err ? 
err : allocated; +} + int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, int create, int extend_disksize) { struct ext4_ext_path *path = NULL; + struct ext4_extent_header *eh; struct ext4_extent newex, *ex; ext4_fsblk_t goal, newblock; - int err = 0, depth; + int err = 0, depth, ret; unsigned long allocated = 0; __clear_bit(BH_New, &bh_result->b_state); @@ -2055,6 +2216,7 @@ int ext4_ext_get_blocks(handle_t *handle * this is why assert can't be put in ext4_ext_find_extent() */ BUG_ON(path[depth].p_ext == NULL && depth != 0); + eh = path[depth].p_hdr; ex = path[depth].p_ext; if (ex) { @@ -2063,13 +2225,9 @@ int ext4_ext_get_blocks(handle_t *handle unsigned short ee_len; /* - * Allow future support for preallocated extents to be added - * as an RO_COMPAT feature: * Uninitialized extents are treated as holes, except that - * we avoid (fail) allocating new blocks during a write. + * we split out initialized portions during a write. 
*/ - if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) - goto out2; ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { @@ -2078,12 +2236,27 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); + /* Do not put uninitialized extent in the cache */ - if (!ext4_ext_is_uninitialized(ex)) + if (!ext4_ext_is_uninitialized(ex)) { ext4_ext_put_in_cache(inode, ee_block, ee_len, ee_start, EXT4_EXT_CACHE_EXTENT); - goto out; + goto out; + } + if (create == EXT4_CREATE_UNINITIALIZED_EXT) + goto out; + if (!create) + goto out2; + + ret = ext4_ext_convert_to_initialized(handle, inode, + path, iblock, + max_blocks); + if (ret <= 0) + goto out2; + else + allocated = ret; + goto outnew; } } @@ -2135,6 +2308,7 @@ int ext4_ext_get_blocks(handle_t *handle /* previous routine could use block we allocated */ newblock = ext_pblock(&newex); +outnew: __set_bit(BH_New, &bh_result->b_state); /* Cache only when it is _not_ an uninitialized extent */ Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -203,6 +203,7 @@ ext4_ext_invalidate_cache(struct inode * extern int ext4_extent_tree_init(handle_t *, struct inode *); extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *); +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *); From 
owner-xfs@oss.sgi.com Thu Apr 26 11:28:07 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:28:12 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QIS6fB011943 for ; Thu, 26 Apr 2007 11:28:07 -0700 Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.150]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QI8BSC027096 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Thu, 26 Apr 2007 14:08:12 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e32.co.us.ibm.com (8.12.11.20060308/8.13.8) with ESMTP id l3QI4q3b011923 for ; Thu, 26 Apr 2007 14:04:52 -0400 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QI7nFj038186 for ; Thu, 26 Apr 2007 12:07:49 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QI7jvZ025997 for ; Thu, 26 Apr 2007 12:07:48 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QI7gJL025849; Thu, 26 Apr 2007 12:07:44 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id CE53029EBC1; Thu, 26 Apr 2007 23:37:50 +0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QI7oHp011024; Thu, 26 Apr 2007 23:37:50 +0530 Date: Thu, 26 Apr 2007 23:37:50 +0530 From: "Amit K. 
Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 2/5] fallocate() on s390 Message-ID: <20070426180750.GB7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11198 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch implements support of fallocate system call on s390(x) platform. A wrapper is added to address the issue which s390 ABI has with "preferred" ordering of arguments in this system call (i.e. int, int, loff_t, loff_t). I will request s390 experts to please review this code and verify if this patch is correct. Thanks! 
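To make the argument-ordering issue concrete, here is a minimal C sketch (not code from this patch) of what the 31-bit compat path has to do: rebuild each 64-bit loff_t from a high and a low 32-bit word, then forward the arguments in the generic (fd, mode, offset, len) order. All names below are hypothetical stand-ins, and the assert is a sketch-only sanity check rather than anything the patch performs.

```c
#include <assert.h>
#include <stdint.h>

/* Join two 32-bit halves, as passed in two registers, into one 64-bit value. */
static int64_t sketch_join_halves(uint32_t high, uint32_t low)
{
    return (int64_t)(((uint64_t)high << 32) | (uint64_t)low);
}

/* Records what the generic entry point received, for illustration only. */
struct sketch_call {
    int fd;
    int mode;
    int64_t offset;
    int64_t len;
};

/* Stand-in for the generic entry point: (fd, mode, offset, len). */
static struct sketch_call sketch_generic_fallocate(int fd, int mode,
                                                   int64_t offset, int64_t len)
{
    struct sketch_call c = { fd, mode, offset, len };
    return c;
}

/* Stand-in for the s390 entry point: takes (fd, offset, len, mode). */
static struct sketch_call sketch_s390_entry(int fd, int64_t offset,
                                            int64_t len, int mode)
{
    assert(offset >= 0 && len > 0);     /* sketch-only sanity check */
    return sketch_generic_fallocate(fd, mode, offset, len);
}
```

Keeping the two 64-bit values adjacent and moving the small mode argument to the end is the design choice the wrapper relies on.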
Signed-off-by: Amit Arora --- arch/s390/kernel/compat_wrapper.S | 10 ++++++++++ arch/s390/kernel/sys_s390.c | 10 ++++++++++ arch/s390/kernel/syscalls.S | 1 + include/asm-s390/unistd.h | 3 ++- 4 files changed, 23 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper: llgtr %r2,%r2 # char * llgtr %r3,%r3 # struct compat_timeval * jg compat_sys_utimes + + .globl s390_fallocate_wrapper +s390_fallocate_wrapper: + lgfr %r2,%r2 # int + sllg %r3,%r3,32 # get high word of 64bit loff_t + or %r3,%r4 # get low word of 64bit loff_t + sllg %r4,%r5,32 # get high word of 64bit loff_t + or %r4,%r6 # get low word of 64bit loff_t + llgf %r5,164(%r15) # unsigned int + jg s390_fallocate Index: linux-2.6.21/arch/s390/kernel/sys_s390.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c +++ linux-2.6.21/arch/s390/kernel/sys_s390.c @@ -268,6 +268,16 @@ s390_fadvise64_64(struct fadvise64_64_ar } /* + * This is a wrapper to call sys_fallocate(). Since s390 ABI has a problem + * with the int, int, loff_t, loff_t ordering of arguments, this wrapper + * is required. + */ +asmlinkage long s390_fallocate(int fd, loff_t offset, loff_t len, int mode) +{ + return sys_fallocate(fd, mode, offset, len); +} + +/* * Do a system call from kernel instead of calling sys_execve so we * end up with proper pt_regs. 
*/ Index: linux-2.6.21/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.21/arch/s390/kernel/syscalls.S @@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages * SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper) +SYSCALL(s390_fallocate,s390_fallocate,s390_fallocate_wrapper) Index: linux-2.6.21/include/asm-s390/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/unistd.h +++ linux-2.6.21/include/asm-s390/unistd.h @@ -251,8 +251,9 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_fallocate 314 -#define NR_syscalls 314 +#define NR_syscalls 315 /* * There are some system calls that are not present on 64 bit, some From owner-xfs@oss.sgi.com Thu Apr 26 11:45:36 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:45:37 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3QIjXfB016305 for ; Thu, 26 Apr 2007 11:45:35 -0700 Received: from gort (mtv-vpn-sw-corp-0-216.corp.sgi.com [134.15.0.216]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id EAA06218; Fri, 27 Apr 2007 04:04:59 +1000 Reply-To: From: "Mike Gigante" To: "'David Chinner'" , "'Eric Sandeen'" Cc: "'Christoph Hellwig'" , "'xfs-dev'" , "'xfs-oss'" Subject: RE: review: allocate alloc args Date: Thu, 26 Apr 2007 11:06:43 -0700 Organization: SGI Fileserving Technologies Message-ID: <016b01c7882d$a9f09290$d8000f86@gort> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 In-Reply-To: 
<20070425231131.GM48531920@melbourne.sgi.com> Thread-Index: AceHj8w9zt8gsaiaTQeWxgsdQR0DwgAnZq7g X-archive-position: 11199 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: mg@sgi.com Precedence: bulk X-list: xfs > in the medium/long term ia32 is no longer important. Let's not engage in any largish or risky work specifically for ia32 and let's not burn precious QA resources for something which is irrelevant for our products. Mike From owner-xfs@oss.sgi.com Thu Apr 26 11:58:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 11:58:08 -0700 (PDT) Received: from over.ny.us.ibm.com (over.ny.us.ibm.com [32.97.182.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QIw0fB019504 for ; Thu, 26 Apr 2007 11:58:01 -0700 Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.151]) by pokfb.esmtp.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QIDSq6030270 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Thu, 26 Apr 2007 14:13:29 -0400 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e33.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l3QIDRKT027670 for ; Thu, 26 Apr 2007 14:13:27 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3QIDR0J198268 for ; Thu, 26 Apr 2007 12:13:27 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3QIDQw1016398 for ; Thu, 26 Apr 2007 12:13:27 -0600 Received: from amitarora.in.ibm.com (amitarora.in.ibm.com [9.124.31.181]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3QIDOpF016181; Thu, 26 Apr 2007 12:13:25 -0600 Received: from amitarora.in.ibm.com (localhost.localdomain [127.0.0.1]) by amitarora.in.ibm.com (Postfix) with ESMTP id 9BF3529EBC1; Thu, 26 Apr 2007 23:43:32 
+0530 (IST) Received: (from amit@localhost) by amitarora.in.ibm.com (8.13.1/8.13.1/Submit) id l3QIDW1J013266; Thu, 26 Apr 2007 23:43:32 +0530 Date: Thu, 26 Apr 2007 23:43:32 +0530 From: "Amit K. Arora" To: torvalds@osdl.org, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH 4/5] ext4: fallocate support in ext4 Message-ID: <20070426181332.GD7209@amitarora.in.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.1i X-archive-position: 11200 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: aarora@linux.vnet.ibm.com Precedence: bulk X-list: xfs This patch has the ext4 implementation of the fallocate system call. 
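One step of that implementation is turning the caller's byte range into a range of filesystem blocks. A minimal standalone sketch of that arithmetic follows (not code from this patch): the helper names are hypothetical, and 64-bit arithmetic is used throughout as a choice made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Round "size" up to the next multiple of the block size (1 << blkbits). */
static uint64_t sketch_block_align(uint64_t size, unsigned int blkbits)
{
    uint64_t mask;

    assert(blkbits < 64);
    mask = ((uint64_t)1 << blkbits) - 1;
    return (size + mask) & ~mask;
}

/* A block range: first logical block and number of blocks. */
struct sketch_range {
    uint64_t first_block;
    uint64_t nr_blocks;
};

/*
 * Convert the byte range [offset, offset + len) into the blocks that
 * cover it: round the start down and the end up to block boundaries.
 */
static struct sketch_range sketch_bytes_to_blocks(uint64_t offset, uint64_t len,
                                                  unsigned int blkbits)
{
    struct sketch_range r;

    r.first_block = offset >> blkbits;
    r.nr_blocks = (sketch_block_align(offset + len, blkbits) >> blkbits)
                  - r.first_block;
    return r;
}
```

With 4096-byte blocks (blkbits = 12), offset 5000 and length 10000 cover blocks 1 through 3, so three blocks are requested.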
Signed-off-by: Amit Arora --- fs/ext4/extents.c | 201 +++++++++++++++++++++++++++++++--------- fs/ext4/file.c | 1 include/linux/ext4_fs.h | 7 + include/linux/ext4_fs_extents.h | 13 ++ 4 files changed, 179 insertions(+), 43 deletions(-) Index: linux-2.6.21/fs/ext4/extents.c =================================================================== --- linux-2.6.21.orig/fs/ext4/extents.c +++ linux-2.6.21/fs/ext4/extents.c @@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in } else if (path->p_ext) { ext_debug(" %d:%d:%llu ", le32_to_cpu(path->p_ext->ee_block), - le16_to_cpu(path->p_ext->ee_len), + ext4_ext_get_actual_len(path->p_ext), ext_pblock(path->p_ext)); } else ext_debug(" []"); @@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) { ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); } ext_debug("\n"); } @@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, ext_debug(" -> %d:%llu:%d ", le32_to_cpu(path->p_ext->ee_block), ext_pblock(path->p_ext), - le16_to_cpu(path->p_ext->ee_len)); + ext4_ext_get_actual_len(path->p_ext)); #ifdef CHECK_BINSEARCH { @@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand ext_debug("move %d:%llu:%d in new leaf %llu\n", le32_to_cpu(path[depth].p_ext->ee_block), ext_pblock(path[depth].p_ext), - le16_to_cpu(path[depth].p_ext->ee_len), + ext4_ext_get_actual_len(path[depth].p_ext), newblock); /*memmove(ex++, path[depth].p_ext++, sizeof(struct ext4_extent)); @@ -1107,7 +1107,19 @@ static int ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, struct ext4_extent *ex2) { - if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) != + unsigned short ext1_ee_len, ext2_ee_len; + + /* + * Make sure that either both extents are uninitialized, or + * both are _not_. 
+ */ + if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2)) + return 0; + + ext1_ee_len = ext4_ext_get_actual_len(ex1); + ext2_ee_len = ext4_ext_get_actual_len(ex2); + + if (le32_to_cpu(ex1->ee_block) + ext1_ee_len != le32_to_cpu(ex2->ee_block)) return 0; @@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode * as an RO_COMPAT feature, refuse to merge to extents if * this can result in the top bit of ee_len being set. */ - if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN) + if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN) return 0; #ifdef AGGRESSIVE_TEST if (le16_to_cpu(ex1->ee_len) >= 4) return 0; #endif - if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2)) + if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2)) return 1; return 0; } @@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru unsigned int depth, len1; b1 = le32_to_cpu(newext->ee_block); - len1 = le16_to_cpu(newext->ee_len); + len1 = ext4_ext_get_actual_len(newext); depth = ext_depth(inode); if (!path[depth].p_ext) goto out; @@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han struct ext4_extent *ex, *fex; struct ext4_extent *nearex; /* nearest extent */ struct ext4_ext_path *npath = NULL; - int depth, len, err, next; + int depth, len, err, next, uninitialized = 0; - BUG_ON(newext->ee_len == 0); + BUG_ON(ext4_ext_get_actual_len(newext) == 0); depth = ext_depth(inode); ex = path[depth].p_ext; BUG_ON(path[depth].p_hdr == NULL); @@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han /* try to insert block into found extent and return */ if (ex && ext4_can_extents_be_merged(inode, ex, newext)) { ext_debug("append %d block to %d:%d (from %llu)\n", - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), le32_to_cpu(ex->ee_block), - le16_to_cpu(ex->ee_len), ext_pblock(ex)); + ext4_ext_get_actual_len(ex), ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) return err; - ex->ee_len = 
cpu_to_le16(le16_to_cpu(ex->ee_len) - + le16_to_cpu(newext->ee_len)); + + /* ext4_can_extents_be_merged should have checked that either + * both extents are uninitialized, or both aren't. Thus we + * need to check only one of them here. + */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(newext)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); eh = path[depth].p_hdr; nearex = ex; goto merge; @@ -1254,7 +1275,7 @@ has_space: ext_debug("first extent in the leaf: %d:%llu:%d\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len)); + ext4_ext_get_actual_len(newext)); path[depth].p_ext = EXT_FIRST_EXTENT(eh); } else if (le32_to_cpu(newext->ee_block) > le32_to_cpu(nearex->ee_block)) { @@ -1267,7 +1288,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 2, nearex + 1, len); } @@ -1280,7 +1301,7 @@ has_space: "move %d from 0x%p to 0x%p\n", le32_to_cpu(newext->ee_block), ext_pblock(newext), - le16_to_cpu(newext->ee_len), + ext4_ext_get_actual_len(newext), nearex, len, nearex + 1, nearex + 2); memmove(nearex + 1, nearex, len); path[depth].p_ext = nearex; @@ -1299,8 +1320,13 @@ merge: if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1)) break; /* merge with next extent! 
*/ - nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len) - + le16_to_cpu(nearex[1].ee_len)); + if (ext4_ext_is_uninitialized(nearex)) + uninitialized = 1; + nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex) + + ext4_ext_get_actual_len(nearex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(nearex); + if (nearex + 1 < EXT_LAST_EXTENT(eh)) { len = (EXT_LAST_EXTENT(eh) - nearex - 1) * sizeof(struct ext4_extent); @@ -1370,8 +1396,8 @@ int ext4_ext_walk_space(struct inode *in end = le32_to_cpu(ex->ee_block); if (block + num < end) end = block + num; - } else if (block >= - le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) { + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { /* need to allocate space after found extent */ start = block; end = block + num; @@ -1383,7 +1409,8 @@ int ext4_ext_walk_space(struct inode *in * by found extent */ start = block; - end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len); + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); if (block + num < end) end = block + num; exists = 1; @@ -1399,7 +1426,7 @@ int ext4_ext_walk_space(struct inode *in cbex.ec_type = EXT4_EXT_CACHE_GAP; } else { cbex.ec_block = le32_to_cpu(ex->ee_block); - cbex.ec_len = le16_to_cpu(ex->ee_len); + cbex.ec_len = ext4_ext_get_actual_len(ex); cbex.ec_start = ext_pblock(ex); cbex.ec_type = EXT4_EXT_CACHE_EXTENT; } @@ -1472,15 +1499,15 @@ ext4_ext_put_gap_in_cache(struct inode * ext_debug("cache gap(before): %lu [%lu:%lu]", (unsigned long) block, (unsigned long) le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len)); + (unsigned long) ext4_ext_get_actual_len(ex)); } else if (block >= le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len)) { + + ext4_ext_get_actual_len(ex)) { lblock = le32_to_cpu(ex->ee_block) - + le16_to_cpu(ex->ee_len); + + ext4_ext_get_actual_len(ex); len = ext4_ext_next_allocated_block(path); ext_debug("cache gap(after): [%lu:%lu] %lu", (unsigned long) 
le32_to_cpu(ex->ee_block), - (unsigned long) le16_to_cpu(ex->ee_len), + (unsigned long) ext4_ext_get_actual_len(ex), (unsigned long) block); BUG_ON(len == lblock); len = len - lblock; @@ -1610,12 +1637,12 @@ static int ext4_remove_blocks(handle_t * unsigned long from, unsigned long to) { struct buffer_head *bh; + unsigned short ee_len = ext4_ext_get_actual_len(ex); int i; #ifdef EXTENTS_STATS { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - unsigned short ee_len = le16_to_cpu(ex->ee_len); spin_lock(&sbi->s_ext_stats_lock); sbi->s_ext_blocks += ee_len; sbi->s_ext_extents++; @@ -1629,12 +1656,12 @@ static int ext4_remove_blocks(handle_t * } #endif if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { /* tail removal */ unsigned long num; ext4_fsblk_t start; - num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from; - start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num; + num = le32_to_cpu(ex->ee_block) + ee_len - from; + start = ext_pblock(ex) + ee_len - num; ext_debug("free last %lu blocks starting %llu\n", num, start); for (i = 0; i < num; i++) { bh = sb_find_get_block(inode->i_sb, start + i); @@ -1642,12 +1669,12 @@ static int ext4_remove_blocks(handle_t * } ext4_free_blocks(handle, inode, start, num); } else if (from == le32_to_cpu(ex->ee_block) - && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) { + && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } else { printk("strange request: removal(2) %lu-%lu from %u:%u\n", - from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len)); + from, to, le32_to_cpu(ex->ee_block), ee_len); } return 0; } @@ -1661,7 +1688,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc struct ext4_extent_header *eh; unsigned a, b, block, num; 
unsigned long ex_ee_block; - unsigned short ex_ee_len; + unsigned short ex_ee_len, uninitialized = 0; struct ext4_extent *ex; ext_debug("truncate since %lu in leaf\n", start); @@ -1676,7 +1703,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex = EXT_LAST_EXTENT(eh); ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex_ee_len = ext4_ext_get_actual_len(ex); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -1744,6 +1773,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc ex->ee_block = cpu_to_le32(block); ex->ee_len = cpu_to_le16(num); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); err = ext4_ext_dirty(handle, inode, path + depth); if (err) @@ -1753,7 +1784,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc ext_pblock(ex)); ex--; ex_ee_block = le32_to_cpu(ex->ee_block); - ex_ee_len = le16_to_cpu(ex->ee_len); + ex_ee_len = ext4_ext_get_actual_len(ex); } if (correct_index && eh->eh_entries) @@ -2029,7 +2060,7 @@ int ext4_ext_get_blocks(handle_t *handle if (ex) { unsigned long ee_block = le32_to_cpu(ex->ee_block); ext4_fsblk_t ee_start = ext_pblock(ex); - unsigned short ee_len = le16_to_cpu(ex->ee_len); + unsigned short ee_len; /* * Allow future support for preallocated extents to be added @@ -2037,8 +2068,9 @@ int ext4_ext_get_blocks(handle_t *handle * Uninitialized extents are treated as holes, except that * we avoid (fail) allocating new blocks during a write. 
*/ - if (ee_len > EXT_MAX_LEN) + if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN) goto out2; + ee_len = ext4_ext_get_actual_len(ex); /* if found extent covers block, simply return it */ if (iblock >= ee_block && iblock < ee_block + ee_len) { newblock = iblock - ee_block + ee_start; @@ -2046,8 +2078,11 @@ int ext4_ext_get_blocks(handle_t *handle allocated = ee_len - (iblock - ee_block); ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock, ee_block, ee_len, newblock); - ext4_ext_put_in_cache(inode, ee_block, ee_len, - ee_start, EXT4_EXT_CACHE_EXTENT); + /* Do not put uninitialized extent in the cache */ + if (!ext4_ext_is_uninitialized(ex)) + ext4_ext_put_in_cache(inode, ee_block, + ee_len, ee_start, + EXT4_EXT_CACHE_EXTENT); goto out; } } @@ -2089,6 +2124,8 @@ int ext4_ext_get_blocks(handle_t *handle /* try to insert new extent into found leaf and return */ ext4_ext_store_pblock(&newex, newblock); newex.ee_len = cpu_to_le16(allocated); + if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */ + ext4_ext_mark_uninitialized(&newex); err = ext4_ext_insert_extent(handle, inode, path, &newex); if (err) goto out2; @@ -2100,8 +2137,10 @@ int ext4_ext_get_blocks(handle_t *handle newblock = ext_pblock(&newex); __set_bit(BH_New, &bh_result->b_state); - ext4_ext_put_in_cache(inode, iblock, allocated, newblock, - EXT4_EXT_CACHE_EXTENT); + /* Cache only when it is _not_ an uninitialized extent */ + if (create!=EXT4_CREATE_UNINITIALIZED_EXT) + ext4_ext_put_in_cache(inode, iblock, allocated, newblock, + EXT4_EXT_CACHE_EXTENT); out: if (allocated > max_blocks) allocated = max_blocks; @@ -2205,10 +2244,86 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +/* + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. 
+ */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + int ret, ret2, nblocks = 0, retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode->i_blkbits; + + /* Currently supporting (pre)allocate mode _only_ */ + if (mode != FA_ALLOCATE) + return -EOPNOTSUPP; + + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return -ENOTTY; + + block = offset >> blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) + - block; + mutex_lock(&EXT4_I(inode)->truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(&EXT4_I(inode)->truncate_mutex); + handle=ext4_journal_start(inode, credits + + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1); + if (IS_ERR(handle)) + return PTR_ERR(handle); +retry: + ret = 0; + while (ret >= 0 && ret < max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, &map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret > 0 && test_bit(BH_New, &map_bh.b_state) + && ((block + ret) > (i_size_read(inode) << blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry; + + /* Time to update the file size. + * Update only when preallocation was requested beyond the file size. 
+ */ + if ((offset + len) > i_size_read(inode)) { + if (ret > 0) { + /* if no error, we assume preallocation succeeded completely */ + mutex_lock(&inode->i_mutex); + i_size_write(inode, offset + len); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } else if (ret < 0 && nblocks) { + /* Handle partial allocation scenario */ + loff_t newsize; + mutex_lock(&inode->i_mutex); + newsize = (nblocks << blkbits) + i_size_read(inode); + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits)); + EXT4_I(inode)->i_disksize = i_size_read(inode); + mutex_unlock(&inode->i_mutex); + } + } + ext4_mark_inode_dirty(handle, inode); + ret2 = ext4_journal_stop(handle); + if (ret > 0) + ret = ret2; + + return ret > 0 ? 0 : ret; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); EXPORT_SYMBOL(ext4_ext_walk_space); EXPORT_SYMBOL(ext4_ext_find_goal); EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); +EXPORT_SYMBOL(ext4_fallocate); Index: linux-2.6.21/fs/ext4/file.c =================================================================== --- linux-2.6.21.orig/fs/ext4/file.c +++ linux-2.6.21/fs/ext4/file.c @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_ .removexattr = generic_removexattr, #endif .permission = ext4_permission, + .fallocate = ext4_fallocate, }; Index: linux-2.6.21/include/linux/ext4_fs.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs.h +++ linux-2.6.21/include/linux/ext4_fs.h @@ -102,6 +102,8 @@ EXT4_GOOD_OLD_FIRST_INO : \ (s)->s_first_ino) #endif +#define EXT4_BLOCK_ALIGN(size, blkbits) (((size)+(1 << blkbits)-1) & \ + (~((1 << blkbits)-1))) /* * Macro-instructions used to manage fragments @@ -225,6 +227,10 @@ struct ext4_new_group_data { __u32 free_blocks_count; }; +/* Following is used by preallocation logic to tell get_blocks() that we + * want uninitialzed extents. 
+ */ +#define EXT4_CREATE_UNINITIALIZED_EXT 2 /* * ioctl commands @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t extern void ext4_ext_truncate(struct inode *, struct page *); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t); static inline int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, Index: linux-2.6.21/include/linux/ext4_fs_extents.h =================================================================== --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h +++ linux-2.6.21/include/linux/ext4_fs_extents.h @@ -125,6 +125,19 @@ struct ext4_ext_path { #define EXT4_EXT_CACHE_EXTENT 2 /* + * Macro-instructions to handle (mark/unmark/check/create) unitialized + * extents. Applications can issue an IOCTL for preallocation, which results + * in assigning unitialized extents to the file. + */ +#define ext4_ext_mark_uninitialized(ext) ((ext)->ee_len |= \ + cpu_to_le16(0x8000)) +#define ext4_ext_is_uninitialized(ext) ((le16_to_cpu((ext)->ee_len))& \ + 0x8000) +#define ext4_ext_get_actual_len(ext) ((le16_to_cpu((ext)->ee_len))& \ + 0x7FFF) + + +/* * to be called by ext4_ext_walk_space() * negative retcode - error * positive retcode - signal for ext4_ext_walk_space(), see below From owner-xfs@oss.sgi.com Thu Apr 26 12:47:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 12:47:10 -0700 (PDT) Received: from mail.uni-bamberg.de (ldap.uni-bamberg.de [141.13.240.52]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QJl3fB000638 for ; Thu, 26 Apr 2007 12:47:05 -0700 Received: from [192.168.1.100] [217.229.50.12] by mail.uni-bamberg.de with ESMTP (SMTPD-8.22) id A1B501D4; Thu, 26 Apr 2007 21:47:01 +0200 From: Martin Eisenhardt Organization: =?iso-8859-1?q?Otto-Friedrich-Universit=E4t?= Bamberg To: linux-xfs@oss.sgi.com, xfs@oss.sgi.com Subject: 
Unexpected XFS SB number 0x00000000 Date: Thu, 26 Apr 2007 21:46:55 +0200 User-Agent: KMail/1.9.6 MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart3543304.phMfCJkCqD"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <200704262146.59921.martin.eisenhardt@wiai.uni-bamberg.de> X-archive-position: 11201 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: martin.eisenhardt@wiai.uni-bamberg.de Precedence: bulk X-list: xfs --nextPart3543304.phMfCJkCqD Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hello list(s), I run XFS on a software raid on Linux 2.6.19. When I invoke xfs_db in=20 read-only mode, I get: # xfs_db -r /dev/md0 xfs_db: unexpected XFS SB magic number 0x00000000 xfs_db: read failed: Invalid argument xfs_db: data size check failed Segmentation fault The system is still running, the filesystem seems to be fine (except for th= e=20 above): files are created, written, and deleted without any problem. So, I have two questions: * Is there a real problem, or might a quick reboot solve this? * If there is a real problem with the file system: What steps do you recomm= end=20 to overcome this problem? * How safe is it to run xfs_check and xfs_repair? Thanks in advance! Kind regards Martin Eisenhardt P.S.: Sorry for cross-posting, I just figure that maybe the XFS users on=20 non-linux systems might have a hint or two for me ... 
;-) --nextPart3543304.phMfCJkCqD Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.3 (GNU/Linux) iD8DBQBGMQGzVUsW4y0BHEIRAgz0AJ9b1WM82QzdM+XJw24JVEppa/kMhQCeOM/7 nFHLi8PkV2WrBLZj1WedVP4= =HQDp -----END PGP SIGNATURE----- --nextPart3543304.phMfCJkCqD-- From owner-xfs@oss.sgi.com Thu Apr 26 15:55:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 15:55:14 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QMt7fB014517 for ; Thu, 26 Apr 2007 15:55:09 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 4BFE792C53E; Fri, 27 Apr 2007 06:47:09 +1000 (EST) Subject: Re: Unexpected XFS SB number 0x00000000 From: Nathan Scott Reply-To: nscott@aconex.com To: Martin Eisenhardt Cc: xfs@oss.sgi.com In-Reply-To: <200704262146.59921.martin.eisenhardt@wiai.uni-bamberg.de> References: <200704262146.59921.martin.eisenhardt@wiai.uni-bamberg.de> Content-Type: text/plain Organization: Aconex Date: Fri, 27 Apr 2007 08:57:19 +1000 Message-Id: <1177628239.6273.374.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11202 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Thu, 2007-04-26 at 21:46 +0200, Martin Eisenhardt wrote: > Hello list(s), > > I run XFS on a software raid on Linux 2.6.19. When I invoke xfs_db in > read-only mode, I get: > > # xfs_db -r /dev/md0 > xfs_db: unexpected XFS SB magic number 0x00000000 > xfs_db: read failed: Invalid argument > xfs_db: data size check failed > Segmentation fault I think this segfault is fixed in recent xfs_db versions. > The system is still running, the filesystem seems to be fine (except for the > above): files are created, written, and deleted without any problem. 
>
> So, I have two questions:
>
> * Is there a real problem, or might a quick reboot solve this?

It looks like a real problem to me - something has written zeroes
to the start of your partition, where the primary XFS superblock
should be.

If the filesystem is still mounted(?), I'd
a/ make a backup copy of anything/everything precious there
b/ try to get the incore copy of the XFS superblock flushed out
   (this assumes still mounted) - create a file & use sync(1) -
   you might get lucky.

> * If there is a real problem with the file system: What steps do you recommend
> to overcome this problem?
> * How safe is it to run xfs_check and xfs_repair?

If you really have zeroes over your primary superblock, xfs_repair
is your only option to fix that really (after unmounting). You won't
get much joy from xfs_check, as it's just a shell script wrapper
around the xfs_db "check" command.

> P.S.: Sorry for cross-posting, I just figure that maybe the XFS users on
> non-linux systems might have a hint or two for me ... ;-)

There's only one list (both addresses point to the same place).

cheers.
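The "unexpected XFS SB magic number 0x00000000" message in this thread means the first four bytes of the device did not hold the superblock magic, which is the ASCII string "XFSB" (0x58465342) stored big-endian. A minimal sketch of that check follows. The helper names are made up for illustration and are not part of xfsprogs, and the code only reads the device or image, it never writes to it.

```c
#include <stdint.h>
#include <stdio.h>

/* XFS superblock magic: the bytes 'X' 'F' 'S' 'B', big-endian on disk. */
#define XFS_SB_MAGIC 0x58465342u

/* Decode a 32-bit big-endian value from a byte buffer. */
uint32_t be32_decode(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: 1 if buf starts with the XFS magic, else 0. */
int sb_magic_ok(const unsigned char *buf)
{
    return be32_decode(buf) == XFS_SB_MAGIC;
}

/*
 * Hypothetical helper: read the first four bytes of a device or image
 * file and store the value found in *found.
 * Returns 1 if it is the XFS magic, 0 if not, -1 on I/O error.
 */
int check_device_magic(const char *path, uint32_t *found)
{
    unsigned char buf[4];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (!fp)
        return -1;
    n = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);
    if (n != sizeof(buf))
        return -1;
    *found = be32_decode(buf);
    return *found == XFS_SB_MAGIC;
}
```

A device whose first sector has been zeroed would yield 0x00000000 in *found, which is the value xfs_db reported above.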
-- Nathan From owner-xfs@oss.sgi.com Thu Apr 26 16:43:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 16:43:12 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3QNh7fB030577 for ; Thu, 26 Apr 2007 16:43:09 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 8000E92C50E; Fri, 27 Apr 2007 07:35:09 +1000 (EST) Subject: Re: [PATCH] make growfs check device size limits too From: Nathan Scott Reply-To: nscott@aconex.com To: Christoph Hellwig Cc: xfs@oss.sgi.com In-Reply-To: <20070426071055.GA24411@infradead.org> References: <1177569014.6273.367.camel@edge> <20070426071055.GA24411@infradead.org> Content-Type: multipart/mixed; boundary="=-KqqObupthjG8eQxoXnzv" Organization: Aconex Date: Fri, 27 Apr 2007 09:45:20 +1000 Message-Id: <1177631120.6273.380.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 X-archive-position: 11203 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs --=-KqqObupthjG8eQxoXnzv Content-Type: text/plain Content-Transfer-Encoding: 7bit On Thu, 2007-04-26 at 08:10 +0100, Christoph Hellwig wrote: > On Thu, Apr 26, 2007 at 04:30:14PM +1000, Nathan Scott wrote: > > On the mount path we check for a superblock that describes a filesystem > > to large for the running kernel to handle. This catches the case of an > > attempt to mount a >16TB filesystem on i386 (where we are limited by the > > page->index size, for XFS metadata buffers in xfs_buf.c). > > > > This patch makes similar checks on the growfs code paths for regular and > > realtime growth, else we can end up with filesystem corruption, it would > > seem (from #xfs chatter). Untested patch follows; probably better to do > > this as a macro, in a header, and call that in each place...? 
> > Yeah, the check should probably we in one place only. Given that's it's > only used in slow pathes a function would probably do it. Here's a revised version... cheers. -- Nathan --=-KqqObupthjG8eQxoXnzv Content-Disposition: attachment; filename=fix-sb-size-checks Content-Type: text/x-patch; name=fix-sb-size-checks; charset=UTF-8 Content-Transfer-Encoding: 7bit Index: linux/fs/xfs/xfs_fsops.c =================================================================== --- linux.orig/fs/xfs/xfs_fsops.c 2007-04-27 09:00:57.306146750 +1000 +++ linux/fs/xfs/xfs_fsops.c 2007-04-27 09:41:22.897736750 +1000 @@ -140,6 +140,8 @@ xfs_growfs_data_private( pct = in->imaxpct; if (nb < mp->m_sb.sb_dblocks || pct < 0 || pct > 100) return XFS_ERROR(EINVAL); + if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb))) + return error; dpct = pct - mp->m_sb.sb_imax_pct; error = xfs_read_buf(mp, mp->m_ddev_targp, XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1), Index: linux/fs/xfs/xfs_rtalloc.c =================================================================== --- linux.orig/fs/xfs/xfs_rtalloc.c 2007-04-27 09:16:57.558158750 +1000 +++ linux/fs/xfs/xfs_rtalloc.c 2007-04-27 09:38:03.705288000 +1000 @@ -1882,11 +1882,13 @@ xfs_growfs_rt( (nrblocks = in->newblocks) <= sbp->sb_rblocks || (sbp->sb_rblocks && (in->extsize != sbp->sb_rextsize))) return XFS_ERROR(EINVAL); + if ((error = xfs_sb_validate_fsb_count(sbp, nrblocks))) + return error; /* * Read in the last block of the device, make sure it exists. 
*/ error = xfs_read_buf(mp, mp->m_rtdev_targp, - XFS_FSB_TO_BB(mp, in->newblocks - 1), + XFS_FSB_TO_BB(mp, nrblocks - 1), XFS_FSB_TO_BB(mp, 1), 0, &bp); if (error) return error; Index: linux/fs/xfs/xfs_mount.c =================================================================== --- linux.orig/fs/xfs/xfs_mount.c 2007-04-27 09:00:57.354149750 +1000 +++ linux/fs/xfs/xfs_mount.c 2007-04-27 09:42:07.700536750 +1000 @@ -202,6 +202,27 @@ xfs_mount_free( kmem_free(mp, sizeof(xfs_mount_t)); } +/* + * Check size of device based on the (data/realtime) block count. + * Note: this check is used by the growfs code as well as mount. + */ +int +xfs_sb_validate_fsb_count( + xfs_sb_t *sbp, + __uint64_t nblocks) +{ + ASSERT(PAGE_SHIFT >= sbp->sb_blocklog); + ASSERT(sbp->sb_blocklog >= BBSHIFT); + +#if XFS_BIG_BLKNOS /* Limited by ULONG_MAX of page cache index */ + if (nblocks >> (PAGE_CACHE_SHIFT - sbp->sb_blocklog) > ULONG_MAX) + return E2BIG; +#else /* Limited by UINT_MAX of sectors */ + if (nblocks << (sbp->sb_blocklog - BBSHIFT) > UINT_MAX) + return E2BIG; +#endif + return 0; +} /* * Check the validity of the SB found. 
@@ -284,18 +305,8 @@ xfs_mount_validate_sb( return XFS_ERROR(EFSCORRUPTED); } - ASSERT(PAGE_SHIFT >= sbp->sb_blocklog); - ASSERT(sbp->sb_blocklog >= BBSHIFT); - -#if XFS_BIG_BLKNOS /* Limited by ULONG_MAX of page cache index */ - if (unlikely( - (sbp->sb_dblocks >> (PAGE_SHIFT - sbp->sb_blocklog)) > ULONG_MAX || - (sbp->sb_rblocks >> (PAGE_SHIFT - sbp->sb_blocklog)) > ULONG_MAX)) { -#else /* Limited by UINT_MAX of sectors */ - if (unlikely( - (sbp->sb_dblocks << (sbp->sb_blocklog - BBSHIFT)) > UINT_MAX || - (sbp->sb_rblocks << (sbp->sb_blocklog - BBSHIFT)) > UINT_MAX)) { -#endif + if (xfs_sb_validate_fsb_count(sbp, sbp->sb_dblocks) || + xfs_sb_validate_fsb_count(sbp, sbp->sb_rblocks)) { xfs_fs_mount_cmn_err(flags, "file system too large to be mounted on this system."); return XFS_ERROR(E2BIG); Index: linux/fs/xfs/xfs_mount.h =================================================================== --- linux.orig/fs/xfs/xfs_mount.h 2007-04-27 09:25:44.667101000 +1000 +++ linux/fs/xfs/xfs_mount.h 2007-04-27 09:37:43.448022000 +1000 @@ -624,6 +624,7 @@ extern int xfs_sync_inodes(xfs_mount_t * extern xfs_agnumber_t xfs_initialize_perag(struct bhv_vfs *, xfs_mount_t *, xfs_agnumber_t); extern void xfs_xlatesb(void *, struct xfs_sb *, int, __int64_t); +extern int xfs_sb_validate_fsb_count(struct xfs_sb *, __uint64_t); extern struct xfs_dmops xfs_dmcore_stub; extern struct xfs_qmops xfs_qmcore_stub; --=-KqqObupthjG8eQxoXnzv-- From owner-xfs@oss.sgi.com Thu Apr 26 19:24:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 19:24:10 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3R2O7fB021749 for ; Thu, 26 Apr 2007 19:24:09 -0700 Received: from [10.0.0.4] (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id 633B41806A301; Thu, 26 Apr 2007 21:24:05 -0500 (CDT) 
Message-ID: <46315EC0.9080401@sandeen.net> Date: Thu, 26 Apr 2007 21:24:00 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (Macintosh/20070221) MIME-Version: 1.0 To: nscott@aconex.com CC: Christoph Hellwig , xfs@oss.sgi.com Subject: Re: [PATCH] make growfs check device size limits too References: <1177569014.6273.367.camel@edge> <20070426071055.GA24411@infradead.org> <1177631120.6273.380.camel@edge> In-Reply-To: <1177631120.6273.380.camel@edge> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11204 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Nathan Scott wrote: > On Thu, 2007-04-26 at 08:10 +0100, Christoph Hellwig wrote: >> On Thu, Apr 26, 2007 at 04:30:14PM +1000, Nathan Scott wrote: >>> On the mount path we check for a superblock that describes a filesystem >>> to large for the running kernel to handle. This catches the case of an >>> attempt to mount a >16TB filesystem on i386 (where we are limited by the >>> page->index size, for XFS metadata buffers in xfs_buf.c). >>> >>> This patch makes similar checks on the growfs code paths for regular and >>> realtime growth, else we can end up with filesystem corruption, it would >>> seem (from #xfs chatter). Untested patch follows; probably better to do >>> this as a macro, in a header, and call that in each place...? >> Yeah, the check should probably we in one place only. Given that's it's >> only used in slow pathes a function would probably do it. > > Here's a revised version... > > cheers. 
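The limit check that the quoted patch factors out can be tried in userspace. Below is a minimal sketch of the XFS_BIG_BLKNOS branch only. It assumes the page shift and the page cache index limit are passed in as parameters (the kernel code takes them from PAGE_CACHE_SHIFT and ULONG_MAX), and the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Userspace sketch of the page cache index limit: a block count is
 * too big if, converted to pages, it no longer fits in the index.
 * page_shift and index_max are parameters (an assumption of this
 * sketch) so the i386 case can be modelled on any build host.
 * Returns 1 if too big, 0 if it fits, -1 if blocklog > page_shift.
 */
int fsb_count_too_big(uint64_t nblocks, unsigned int blocklog,
                      unsigned int page_shift, uint64_t index_max)
{
    /* Mirrors the ASSERT(PAGE_SHIFT >= sbp->sb_blocklog) precondition. */
    if (blocklog > page_shift)
        return -1;
    return (nblocks >> (page_shift - blocklog)) > index_max;
}
```

With 4 KiB blocks, 4 KiB pages and a 32-bit index, the first block count rejected is 2^32 blocks, i.e. 16 TiB, which is consistent with the 16TB limit on i386 mentioned in the thread.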
> Looks good to me -Eric From owner-xfs@oss.sgi.com Thu Apr 26 19:48:03 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 19:48:05 -0700 (PDT) Received: from tyo200.gate.nec.co.jp (TYO200.gate.nec.co.jp [210.143.35.50]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3R2m2fB024454 for ; Thu, 26 Apr 2007 19:48:03 -0700 Received: from tyo202.gate.nec.co.jp ([10.7.69.202]) by tyo200.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3R2m1ic023792 for ; Fri, 27 Apr 2007 11:48:01 +0900 (JST) Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.162]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3R2f3XZ015422 for ; Fri, 27 Apr 2007 11:41:03 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3R2f2e09027 for xfs@oss.sgi.com; Fri, 27 Apr 2007 11:41:02 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv3.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3R2f2L10152 for ; Fri, 27 Apr 2007 11:41:02 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070427.114102.87502660 for ; Fri, 27 Apr 2007 11:41:03 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Fri Apr 27 11:41:02 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp (Postfix) with ESMTP id 03D6CAE4B3; Fri, 27 Apr 2007 11:41:00 +0900 (JST) Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3R2f16X016831; Fri, 27 Apr 2007 11:41:01 +0900 Message-Id: <200704270240.AA05279@TNESG9305.tnes.nec.co.jp> Date: Fri, 27 Apr 2007 11:40:51 +0900 To: xfs@oss.sgi.com Subject: [PATCH] Fix segmentation fault when using xfs_quota state command. 
From: Utako Kusaka MIME-Version: 1.0 X-Mailer: AL-Mail32 Version 1.13 Content-Type: text/plain; charset=us-ascii X-archive-position: 11205 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: utako@tnes.nec.co.jp Precedence: bulk X-list: xfs Hi, state command in xfs_quota causes segmentation fault when the path list is empty. This patch fixes it. Example: # ./xfs_quota -x xfs_quota_orig> path xfs_quota_orig> state Segmentation fault Signed-off-by: Utako Kusaka --- --- xfsprogs-2.8.20/quota/state.orig 2007-04-19 13:07:38.000000000 +0900 +++ xfsprogs-2.8.20/quota/state.c 2007-04-27 11:23:30.000000000 +0900 @@ -226,6 +226,9 @@ state_f( if (argc != optind) return command_usage(&state_cmd); + if (!fs_count) + return 0; + if ((fp = fopen_write_secure(fname)) == NULL) return 0; From owner-xfs@oss.sgi.com Thu Apr 26 20:56:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 20:56:15 -0700 (PDT) Received: from postoffice.aconex.com (mail.app.aconex.com [203.89.192.138]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3R3u8fB005288 for ; Thu, 26 Apr 2007 20:56:11 -0700 Received: from edge (unknown [203.89.192.141]) by postoffice.aconex.com (Postfix) with ESMTP id 6E56C92C554; Fri, 27 Apr 2007 11:48:07 +1000 (EST) Subject: Re: [PATCH] Fix segmentation fault when using xfs_quota state command. 
From: Nathan Scott Reply-To: nscott@aconex.com To: Utako Kusaka Cc: xfs@oss.sgi.com In-Reply-To: <200704270240.AA05279@TNESG9305.tnes.nec.co.jp> References: <200704270240.AA05279@TNESG9305.tnes.nec.co.jp> Content-Type: text/plain Organization: Aconex Date: Fri, 27 Apr 2007 13:58:22 +1000 Message-Id: <1177646302.6273.386.camel@edge> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 Content-Transfer-Encoding: 7bit X-archive-position: 11206 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: nscott@aconex.com Precedence: bulk X-list: xfs On Fri, 2007-04-27 at 11:40 +0900, Utako Kusaka wrote: > + if (!fs_count) > + return 0; > + Is the segfault due to this line in state.c? 237 else if (fs_path->fs_flags & FS_MOUNT_POINT) If so, a fix that is more clear to the reader might be: - else if (fs_path->fs_flags & FS_MOUNT_POINT) + else if (fs_path && fs_path->fs_flags & FS_MOUNT_POINT) This keeps the fix alongside the problematic access to fs_path. cheers. 
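The guard suggested above relies on && short-circuiting, so fs_path->fs_flags is only read when fs_path is non-NULL. A minimal sketch of the idiom follows. The struct and the flag value are stand-ins invented for this example, not the real xfsprogs definitions.

```c
#include <stddef.h>

/* Illustrative stand-ins (assumptions); the real definitions live in
 * xfsprogs and are not reproduced here. */
#define FS_MOUNT_POINT (1u << 0)

struct fs_path_sketch {
    unsigned int fs_flags;
};

/*
 * Returns 1 only when fs_path is non-NULL and flagged as a mount point.
 * The left operand of && is evaluated first; if it is NULL, the right
 * operand is never evaluated, so there is no NULL dereference.
 * Note that & binds tighter than &&, so the parentheses are optional.
 */
int is_mount_point(const struct fs_path_sketch *fs_path)
{
    return fs_path && (fs_path->fs_flags & FS_MOUNT_POINT);
}
```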
-- Nathan From owner-xfs@oss.sgi.com Thu Apr 26 23:16:45 2007 Received: with ECARTIS (v1.0.0; list xfs); Thu, 26 Apr 2007 23:16:48 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3R6GdfB001234 for ; Thu, 26 Apr 2007 23:16:44 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id QAA24927; Fri, 27 Apr 2007 16:16:27 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3R6GQAf77571095; Fri, 27 Apr 2007 16:16:27 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3R6GNfL76927662; Fri, 27 Apr 2007 16:16:23 +1000 (AEST) Date: Fri, 27 Apr 2007 16:16:23 +1000 From: David Chinner To: Nathan Scott Cc: Christoph Hellwig , xfs@oss.sgi.com Subject: Re: [PATCH] make growfs check device size limits too Message-ID: <20070427061623.GB77450368@melbourne.sgi.com> References: <1177569014.6273.367.camel@edge> <20070426071055.GA24411@infradead.org> <1177631120.6273.380.camel@edge> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177631120.6273.380.camel@edge> User-Agent: Mutt/1.4.2.1i X-archive-position: 11207 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Fri, Apr 27, 2007 at 09:45:20AM +1000, Nathan Scott wrote: > On Thu, 2007-04-26 at 08:10 +0100, Christoph Hellwig wrote: > > On Thu, Apr 26, 2007 at 04:30:14PM +1000, Nathan Scott wrote: > > > On the mount path we check for a superblock that describes a filesystem > > > to large for the running kernel to handle. 
This catches the case of an > > > attempt to mount a >16TB filesystem on i386 (where we are limited by the > > > page->index size, for XFS metadata buffers in xfs_buf.c). > > > > > > This patch makes similar checks on the growfs code paths for regular and > > > realtime growth, else we can end up with filesystem corruption, it would > > > seem (from #xfs chatter). Untested patch follows; probably better to do > > > this as a macro, in a header, and call that in each place...? > > > > Yeah, the check should probably we in one place only. Given that's it's > > only used in slow pathes a function would probably do it. > > Here's a revised version... Added to my qa tree. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Fri Apr 27 00:42:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 00:42:46 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3R7gbfB031035 for ; Fri, 27 Apr 2007 00:42:39 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id l3R7gZSt002420 for ; Fri, 27 Apr 2007 16:42:35 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id l3R7gZf29503 for xfs@oss.sgi.com; Fri, 27 Apr 2007 16:42:35 +0900 (JST) Received: from secsv3.tnes.nec.co.jp (tnesvc2.tnes.nec.co.jp [10.1.101.15]) by mailsv3.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id l3R7gZL16455 for ; Fri, 27 Apr 2007 16:42:35 +0900 (JST) Received: from tnesvc2.tnes.nec.co.jp ([10.1.101.15]) by secsv3.tnes.nec.co.jp (ExpressMail 5.10) with SMTP id 20070427.164234.75002200 for ; Fri, 27 Apr 2007 16:42:35 +0900 Received: FROM tnessv1.tnes.nec.co.jp BY tnesvc2.tnes.nec.co.jp ; Fri Apr 27 16:42:34 2007 +0900 Received: from rifu.bsd.tnes.nec.co.jp (rifu.bsd.tnes.nec.co.jp [10.1.104.1]) by tnessv1.tnes.nec.co.jp 
(Postfix) with ESMTP id E5224AE4B7; Fri, 27 Apr 2007 16:42:27 +0900 (JST)
Received: from TNESG9305.tnes.nec.co.jp (TNESG9305.bsd.tnes.nec.co.jp [10.1.104.199]) by rifu.bsd.tnes.nec.co.jp (8.12.11/3.7W/BSD-TNES-MX01) with SMTP id l3R7gXtT011058; Fri, 27 Apr 2007 16:42:33 +0900
Message-Id: <200704270742.AA05287@TNESG9305.tnes.nec.co.jp>
From: Utako Kusaka
Date: Fri, 27 Apr 2007 16:42:27 +0900
To: nscott@aconex.com
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH] Fix segmentation fault when using xfs_quota state command.
In-Reply-To: <1177646302.6273.386.camel@edge>
References: <1177646302.6273.386.camel@edge>
MIME-Version: 1.0
X-Mailer: AL-Mail32 Version 1.13
Content-Type: text/plain; charset=iso-2022-jp
X-archive-position: 11208
X-ecartis-version: Ecartis v1.0.0
Sender: xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: utako@tnes.nec.co.jp
Precedence: bulk
X-list: xfs

Thanks for your comment.

I thought the processing after fopen_write_secure() was not necessary
when the path list is empty.
But is creating an empty file necessary when the -f option is specified?
This option isn't documented in the man page :)

--
Utako

Fri, 27 Apr 2007 13:58:22 +1000 Nathan Scott wrote:
>On Fri, 2007-04-27 at 11:40 +0900, Utako Kusaka wrote:
>> + if (!fs_count)
>> + return 0;
>> +
>
>Is the segfault due to this line in state.c?
>237 else if (fs_path->fs_flags & FS_MOUNT_POINT)
>
>If so, a fix that is more clear to the reader might be:
>
>- else if (fs_path->fs_flags & FS_MOUNT_POINT)
>+ else if (fs_path && fs_path->fs_flags & FS_MOUNT_POINT)
>
>This keeps the fix alongside the problematic access to fs_path.
>
>cheers.
> >-- >Nathan From owner-xfs@oss.sgi.com Fri Apr 27 02:15:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 02:15:15 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3R9F9fB023007 for ; Fri, 27 Apr 2007 02:15:11 -0700 Received: from chook.melbourne.sgi.com (chook.melbourne.sgi.com [134.14.54.237]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id SAA28429 for ; Fri, 27 Apr 2007 18:50:46 +1000 Received: by chook.melbourne.sgi.com (Postfix, from userid 44625) id D7C6E5910FF9; Fri, 27 Apr 2007 18:50:45 +1000 (EST) To: xfs@oss.sgi.com Subject: TAKE 963965 - Add lockdep support for XFS Message-Id: <20070427085045.D7C6E5910FF9@chook.melbourne.sgi.com> Date: Fri, 27 Apr 2007 18:50:45 +1000 (EST) From: lachlan@sgi.com (Lachlan McIlroy) X-archive-position: 11209 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: lachlan@sgi.com Precedence: bulk X-list: xfs Add lockdep support for XFS Date: Fri Apr 27 18:47:16 AEST 2007 Workarea: vpn-emea-sw-emea-160-20.emea.sgi.com:/home/lachlan/isms/2.6.x-lockdep Inspected by: tes dgc Author: lachlan The following file(s) were checked into: longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb Modid: xfs-linux-melb:xfs-kern:28485a fs/xfs/xfs_vnodeops.c - 1.695 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vnodeops.c.diff?r1=text&tr1=1.695&r2=text&tr2=1.694&f=h fs/xfs/xfs_vfsops.c - 1.518 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_vfsops.c.diff?r1=text&tr1=1.518&r2=text&tr2=1.517&f=h fs/xfs/xfs_iget.c - 1.225 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_iget.c.diff?r1=text&tr1=1.225&r2=text&tr2=1.224&f=h fs/xfs/xfs_inode.h - 1.220 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_inode.h.diff?r1=text&tr1=1.220&r2=text&tr2=1.219&f=h fs/xfs/linux-2.6/mrlock.h - 
1.22 - changed http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/mrlock.h.diff?r1=text&tr1=1.22&r2=text&tr2=1.21&f=h - Add lockdep support for XFS From owner-xfs@oss.sgi.com Fri Apr 27 05:10:47 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 05:10:52 -0700 (PDT) Received: from mtagate8.uk.ibm.com (mtagate8.uk.ibm.com [195.212.29.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3RCAhfB012803 for ; Fri, 27 Apr 2007 05:10:44 -0700 Received: from d06nrmr1407.portsmouth.uk.ibm.com (d06nrmr1407.portsmouth.uk.ibm.com [9.149.38.185]) by mtagate8.uk.ibm.com (8.13.8/8.13.8) with ESMTP id l3RCAf3w135080 for ; Fri, 27 Apr 2007 12:10:41 GMT Received: from d06av04.portsmouth.uk.ibm.com (d06av04.portsmouth.uk.ibm.com [9.149.37.216]) by d06nrmr1407.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3RCAfVW2519164 for ; Fri, 27 Apr 2007 13:10:41 +0100 Received: from d06av04.portsmouth.uk.ibm.com (loopback [127.0.0.1]) by d06av04.portsmouth.uk.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3RCAeHQ009263 for ; Fri, 27 Apr 2007 13:10:41 +0100 Received: from localhost (dyn-9-152-198-55.boeblingen.de.ibm.com [9.152.198.55]) by d06av04.portsmouth.uk.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3RCAeVm009238; Fri, 27 Apr 2007 13:10:40 +0100 Date: Fri, 27 Apr 2007 14:10:03 +0200 From: Heiko Carstens To: "Amit K. 
Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070427121003.GA7808@osiris.boeblingen.de.ibm.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: mutt-ng/devel-r804 (Linux) X-archive-position: 11210 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: heiko.carstens@de.ibm.com Precedence: bulk X-list: xfs On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote: > Based on the discussion, this new patchset uses following as the > interface for fallocate() system call: > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > It seems that only s390 architecture has a problem with such a layout of > arguments in fallocate(). Thus for s390, we plan to have a wrapper > (say, sys_s390_fallocate()) for the sys_fallocate(), which will get > called by glibc when an application issues a fallocate() system call > on s390. The s390 arch specific changes will be part of a separate > patch (PATCH 2/5). It will be great if some s390 expert can verify the > patch, since I have not been able to test it on s390 so far. 
After long discussions where at least two possible implementations were suggested that would work on _all_ architectures you chose one which doesn't and causes extra effort. > It was also noted that minor changes might be required to strace code > to take care of "different arguments on s390" issue. This is not limited to strace... Besides that the s390 backend looks ok. From owner-xfs@oss.sgi.com Fri Apr 27 07:47:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 07:47:46 -0700 (PDT) Received: from longford.lazybastard.org (lazybastard.de [212.112.238.170]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3RElgfB027622 for ; Fri, 27 Apr 2007 07:47:43 -0700 Received: from joern by longford.lazybastard.org with local (Exim 4.50) id 1HhRfX-0006Jd-55; Fri, 27 Apr 2007 16:43:31 +0200 Date: Fri, 27 Apr 2007 16:43:28 +0200 From: =?utf-8?B?SsO2cm4=?= Engel To: Heiko Carstens Cc: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070427144327.GC22949@lazybastard.org> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070427121003.GA7808@osiris.boeblingen.de.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070427121003.GA7808@osiris.boeblingen.de.ibm.com> User-Agent: Mutt/1.5.9i X-archive-position: 11211 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: joern@lazybastard.org Precedence: bulk X-list: xfs On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote: > > After long discussions where at least two possible implementations > were suggested that would work on _all_ architectures you chose one > which doesn't and causes extra effort. I believe the long discussion also showed that every possible implementation has drawbacks. To me this one appeared to be the best of many bad choices. Is this implementation worse than we thought? Jörn -- The grand essentials of happiness are: something to do, something to love, and something to hope for. -- Allan K. Chalmers From owner-xfs@oss.sgi.com Fri Apr 27 10:47:10 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 10:47:15 -0700 (PDT) Received: from mtagate1.de.ibm.com (mtagate1.de.ibm.com [195.212.29.150]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3RHl5fB018972 for ; Fri, 27 Apr 2007 10:47:06 -0700 Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) by mtagate1.de.ibm.com (8.13.8/8.13.8) with ESMTP id l3RHl15a130500 for ; Fri, 27 Apr 2007 17:47:01 GMT Received: from d12av02.megacenter.de.ibm.com (d12av02.megacenter.de.ibm.com [9.149.165.228]) by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l3RHl1944079788 for ; Fri, 27 Apr 2007 19:47:01 +0200 Received: from d12av02.megacenter.de.ibm.com (loopback [127.0.0.1]) by d12av02.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l3RHl02S018720 for ; Fri, 27 Apr 2007 19:47:01 +0200 Received: from localhost (ICON-9-164-185-178.megacenter.de.ibm.com [9.164.185.178]) by d12av02.megacenter.de.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l3RHl0pJ018714; Fri, 27 Apr 2007 19:47:00 +0200 Date: Fri, 27 Apr 2007 19:46:13 +0200 From: Heiko Carstens To: =?iso-8859-1?Q?J=F6rn?= Engel Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070427174613.GA8228@osiris.ibm.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070427121003.GA7808@osiris.boeblingen.de.ibm.com> <20070427144327.GC22949@lazybastard.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20070427144327.GC22949@lazybastard.org> User-Agent: mutt-ng/devel-r804 (Linux) X-archive-position: 11212 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: heiko.carstens@de.ibm.com Precedence: bulk X-list: xfs On Fri, Apr 27, 2007 at 04:43:28PM +0200, Jörn Engel wrote: > On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote: > > > > After long discussions where at least two possible implementations > > were suggested that would work on _all_ architectures you chose one > > which doesn't and causes extra effort. > > I believe the long discussion also showed that every possible > implementation has drawbacks. To me this one appeared to be the best of > many bad choices. If one insists to have fd at first argument, what is wrong with having u32 arguments only? It's not that this syscall comes even close to what can be considered performance critical... > Is this implementation worse than we thought? It adds userspace overhead for one architecture. 
Every *trace and *libc needs special handling on s390 for this syscall. I would prefer to avoid this. From owner-xfs@oss.sgi.com Fri Apr 27 13:49:48 2007 Received: with ECARTIS (v1.0.0; list xfs); Fri, 27 Apr 2007 13:49:55 -0700 (PDT) Received: from smtp113.sbc.mail.mud.yahoo.com (smtp113.sbc.mail.mud.yahoo.com [68.142.198.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3RKnlfB010551 for ; Fri, 27 Apr 2007 13:49:48 -0700 Received: (qmail 58363 invoked from network); 27 Apr 2007 20:49:46 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp113.sbc.mail.mud.yahoo.com with SMTP; 27 Apr 2007 20:49:46 -0000 X-YMail-OSG: y9u4GFAVM1lSnko0_de631CasXG6lBCufWGg6FUaKWyV7RGcY3HE1AuQnrNRKR17G1sFFB4t3g-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 7D66F181371A; Fri, 27 Apr 2007 13:42:07 -0700 (PDT) Date: Fri, 27 Apr 2007 13:42:07 -0700 From: Chris Wedgwood To: Heiko Carstens Cc: =?iso-8859-1?Q?J=F6rn?= Engel , "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070427204207.GA29551@tuatara.stupidest.org> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070427121003.GA7808@osiris.boeblingen.de.ibm.com> <20070427144327.GC22949@lazybastard.org> <20070427174613.GA8228@osiris.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070427174613.GA8228@osiris.ibm.com> X-archive-position: 11213 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs

On Fri, Apr 27, 2007 at 07:46:13PM +0200, Heiko Carstens wrote:

> If one insists to have fd at first argument, what is wrong with
> having u32 arguments only?

Well, I was one of those who objected as it seems *UGLY* to me.

> It's not that this syscall comes even close to what can be
> considered performance critical...

Right.

> It adds userspace overhead for one architecture. Every *trace and
> *libc needs special handling on s390 for this syscall. I would
> prefer to avoid this.

I'm not that bothered about it. I would prefer it did use clean
64-bit arguments, but given it's a non-critical syscall I don't
think the aesthetics are worth imposing crud on s390 for.
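For readers following the argument-layout debate: the "u32 arguments only" idea Heiko raises would presumably mean passing each 64-bit value (offset, len) as two 32-bit halves and reassembling it on the other side. Below is a minimal sketch of that split and join in plain userspace C. The function names are hypothetical and nothing here is taken from the posted patches; it only illustrates the arithmetic.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the fallocate patches): split a 64-bit
 * offset into the two 32-bit halves a "u32 only" interface would carry.
 */
static void split_offset(int64_t value, uint32_t *high, uint32_t *low)
{
	*high = (uint32_t)((uint64_t)value >> 32);
	*low = (uint32_t)((uint64_t)value & 0xffffffffu);
}

/*
 * Hypothetical helper: rebuild the 64-bit offset from its two halves,
 * as the receiving side of such an interface would have to do.
 */
static int64_t join_offset(uint32_t high, uint32_t low)
{
	return (int64_t)(((uint64_t)high << 32) | (uint64_t)low);
}
```

The cost Chris objects to is visible even in this toy form: every caller and every tracer has to know which argument slot holds which half, whereas clean 64-bit arguments need no such convention.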
From owner-xfs@oss.sgi.com Sat Apr 28 03:36:42 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 03:36:48 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SAaffB012975 for ; Sat, 28 Apr 2007 03:36:42 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 00B7CF06BDC3; Sat, 28 Apr 2007 06:36:40 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id D84FF7167FE6; Sat, 28 Apr 2007 06:36:40 -0400 (EDT) Date: Sat, 28 Apr 2007 06:36:40 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: xfs@oss.sgi.com cc: linux-raid@vger.kernel.org Subject: Re: XFS on x86_64 Linux Question In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11215 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs With correct CC'd address. On Sat, 28 Apr 2007, Justin Piszcz wrote: > Hello-- > > Had a quick question, if I re-provision a host with an Intel Core Duo CPU > with x86_64 Linux; I create a software raid array and use the XFS > filesystem-- all in 64bit space... > > If I boot a recovery image such as Knoppix, it will not be able to work on > the filesystem correct? I would need a 64bit live CD? > > Does the same apply to software raid? Can I mount a software raid created in > a 64bit environment in a 32bit environment? > > Justin. 
> From owner-xfs@oss.sgi.com Sat Apr 28 03:35:50 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 03:36:18 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SAZmfB012588 for ; Sat, 28 Apr 2007 03:35:50 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 44A99F06BDC3; Sat, 28 Apr 2007 06:35:47 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 4048A7167FE6; Sat, 28 Apr 2007 06:35:47 -0400 (EDT) Date: Sat, 28 Apr 2007 06:35:47 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: xfs@oss.sgi.com cc: linux-raid@linux-kernel.vger.kernel.org Subject: XFS on x86_64 Linux Question Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11214 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Hello-- Had a quick question, if I re-provision a host with an Intel Core Duo CPU with x86_64 Linux; I create a software raid array and use the XFS filesystem-- all in 64bit space... If I boot a recovery image such as Knoppix, it will not be able to work on the filesystem correct? I would need a 64bit live CD? Does the same apply to software raid? Can I mount a software raid created in a 64bit environment in a 32bit environment? Justin. 
From owner-xfs@oss.sgi.com Sat Apr 28 05:09:55 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 05:09:57 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SC9sfB009720 for ; Sat, 28 Apr 2007 05:09:55 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 733C7F06BDC3; Sat, 28 Apr 2007 08:09:53 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 6E7D27167FE6; Sat, 28 Apr 2007 08:09:53 -0400 (EDT) Date: Sat, 28 Apr 2007 08:09:53 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: "Raz Ben-Jehuda(caro)" cc: xfs@oss.sgi.com, linux-raid@vger.kernel.org Subject: Re: XFS on x86_64 Linux Question In-Reply-To: <5d96567b0704280457t464e5c3fm81a9e22a5608751@mail.gmail.com> Message-ID: References: <5d96567b0704280457t464e5c3fm81a9e22a5608751@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11216 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Wow, probably better to stick with 32bit then? On Sat, 28 Apr 2007, Raz Ben-Jehuda(caro) wrote: > Justin hello > I have tested 32 to 64 bit porting of linux raid5 and xfs and LVM > it worked. though i cannot say I have tested throughly. it was a POC. > > On 4/28/07, Justin Piszcz wrote: >> With correct CC'd address. >> >> >> On Sat, 28 Apr 2007, Justin Piszcz wrote: >> >> > Hello-- >> > >> > Had a quick question, if I re-provision a host with an Intel Core Duo CPU >> > with x86_64 Linux; I create a software raid array and use the XFS >> > filesystem-- all in 64bit space... >> > >> > If I boot a recovery image such as Knoppix, it will not be able to work >> on >> > the filesystem correct? I would need a 64bit live CD? >> > >> > Does the same apply to software raid? 
Can I mount a software raid >> created in >> > a 64bit environment in a 32bit environment? >> > >> > Justin. >> > >> - >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > > -- > Raz > From owner-xfs@oss.sgi.com Sat Apr 28 05:26:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 05:26:14 -0700 (PDT) Received: from nz-out-0506.google.com (nz-out-0506.google.com [64.233.162.225]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SCQAfB016487 for ; Sat, 28 Apr 2007 05:26:11 -0700 Received: by nz-out-0506.google.com with SMTP id m22so1307820nzf for ; Sat, 28 Apr 2007 05:26:09 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=C5goPCNRVzt32t+KvvFRZIqNhTZND+VHrDi9Q9tkugEQb4oEC6V6x6fWVOsCk7mPWkO3PjhgHBQKeT8KEvFWpbX/unApoTnCJarnxDA+WaWDWIiWmVNE6DJW8mAC1oGXoqIK+5L8NaRp9VV2e7qEGZz7coY5PuCk7Z53/GC7mvU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=GUylVCQS9nlj6SHBJgHXuejbsjulUK8bS2UsO9pnuAikrNsEeP8KjxBKHdTx6l64ayAu8SVGAKcrdDxxIqMvoey2PhqcR6ueUqsH212X85jVG0b3Ivaow1NoXNfscn5xoqwNBG5GIL5zN4LIthj7Um+PuMRRh/jGLLatwPgTAUw= Received: by 10.115.33.1 with SMTP id l1mr1370389waj.1177761439356; Sat, 28 Apr 2007 04:57:19 -0700 (PDT) Received: by 10.114.13.6 with HTTP; Sat, 28 Apr 2007 04:57:19 -0700 (PDT) Message-ID: <5d96567b0704280457t464e5c3fm81a9e22a5608751@mail.gmail.com> Date: Sat, 28 Apr 2007 14:57:19 +0300 From: "Raz Ben-Jehuda(caro)" To: "Justin Piszcz" Subject: Re: XFS on x86_64 Linux Question Cc: xfs@oss.sgi.com, linux-raid@vger.kernel.org 
In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: X-archive-position: 11217 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: raziebe@gmail.com Precedence: bulk X-list: xfs Justin hello I have tested 32 to 64 bit porting of linux raid5 and xfs and LVM it worked. though i cannot say I have tested throughly. it was a POC. On 4/28/07, Justin Piszcz wrote: > With correct CC'd address. > > > On Sat, 28 Apr 2007, Justin Piszcz wrote: > > > Hello-- > > > > Had a quick question, if I re-provision a host with an Intel Core Duo CPU > > with x86_64 Linux; I create a software raid array and use the XFS > > filesystem-- all in 64bit space... > > > > If I boot a recovery image such as Knoppix, it will not be able to work on > > the filesystem correct? I would need a 64bit live CD? > > > > Does the same apply to software raid? Can I mount a software raid created in > > a 64bit environment in a 32bit environment? > > > > Justin. 
> > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Raz From owner-xfs@oss.sgi.com Sat Apr 28 09:41:44 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 09:41:51 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SGfhfB008799 for ; Sat, 28 Apr 2007 09:41:44 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 27476F06BDC3; Sat, 28 Apr 2007 12:41:42 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 2247B7167FE6; Sat, 28 Apr 2007 12:41:42 -0400 (EDT) Date: Sat, 28 Apr 2007 12:41:42 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: Andi Kleen cc: xfs@oss.sgi.com, Alan Piszcz , linux-raid@linux-kernel.vger.kernel.org Subject: Re: XFS on x86_64 Linux Question In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11218 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs On Sat, 28 Apr 2007, Andi Kleen wrote: > Justin Piszcz writes: > >> Had a quick question, if I re-provision a host with an Intel Core Duo >> CPU with x86_64 Linux; I create a software raid array and use the XFS >> filesystem-- all in 64bit space... >> >> If I boot a recovery image such as Knoppix, it will not be able to >> work on the filesystem correct? I would need a 64bit live CD? > > There used to be a XFS bug that 32bit journals couldn't > be replayed on 64bit and vice versa. This means you couldn't mount > a dirty file system on the other architecture. But this bug has been > long fixed and mounting clean file systems always worked. 
> > -Andi > Thanks for the prompt reply! I have more confidence in trying x86_64 now :) Justin. From owner-xfs@oss.sgi.com Sat Apr 28 09:47:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Sat, 28 Apr 2007 09:48:02 -0700 (PDT) Received: from mx1.suse.de (mx1.suse.de [195.135.220.2]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3SGlvfB010655 for ; Sat, 28 Apr 2007 09:47:58 -0700 Received: from Relay1.suse.de (mail2.suse.de [195.135.221.8]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.suse.de (Postfix) with ESMTP id D407E122FA; Sat, 28 Apr 2007 18:34:29 +0200 (CEST) To: Justin Piszcz Cc: xfs@oss.sgi.com, linux-raid@linux-kernel.vger.kernel.org Subject: Re: XFS on x86_64 Linux Question References: From: Andi Kleen Date: 28 Apr 2007 19:32:11 +0200 In-Reply-To: Message-ID: Lines: 15 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 11219 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: andi@firstfloor.org Precedence: bulk X-list: xfs Justin Piszcz writes: > Had a quick question, if I re-provision a host with an Intel Core Duo > CPU with x86_64 Linux; I create a software raid array and use the XFS > filesystem-- all in 64bit space... > > If I boot a recovery image such as Knoppix, it will not be able to > work on the filesystem correct? I would need a 64bit live CD? There used to be a XFS bug that 32bit journals couldn't be replayed on 64bit and vice versa. This means you couldn't mount a dirty file system on the other architecture. But this bug has been long fixed and mounting clean file systems always worked. 
-Andi From owner-xfs@oss.sgi.com Sun Apr 29 09:20:34 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 09:20:36 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3TGKXfB006291 for ; Sun, 29 Apr 2007 09:20:34 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id C5268B17FD53; Sun, 29 Apr 2007 12:20:26 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id BF6CC5000165 for ; Sun, 29 Apr 2007 12:20:26 -0400 (EDT) Date: Sun, 29 Apr 2007 12:20:26 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: xfs@oss.sgi.com Subject: mkfs.xfs cannot make sector size 64 KiB? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11220 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs From the manpage: -b Block size options. This option specifies the fundamental block size of the filesys- tem. The valid suboptions are: log=value and size=value; only one can be supplied. The block size is specified either as a base two logarithm value with log=, or in bytes with size=. The default value is 4096 bytes (4 KiB), the minimum is 512, and the maximum is 65536 (64 KiB). XFS on Linux currently only supports pagesize or smaller blocks. The maximum size is 64 KiB, yet it seems only up to 32 KiB is valid? I am running x86_64. $ uname -m x86_64 p34:~# mkfs.xfs -b size=512 /dev/md3 mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). mkfs.xfs: Use the -f option to force overwrite. p34:~# mkfs.xfs -b size=4096 /dev/md3 mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). mkfs.xfs: Use the -f option to force overwrite. 
p34:~# mkfs.xfs -b size=8192 /dev/md3 mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). mkfs.xfs: Use the -f option to force overwrite. p34:~# mkfs.xfs -b size=16384 /dev/md3 mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). mkfs.xfs: Use the -f option to force overwrite. p34:~# mkfs.xfs -b size=32768 /dev/md3 mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). mkfs.xfs: Use the -f option to force overwrite. p34:~# mkfs.xfs -b size=65536 /dev/md3 illegal sector size 65536 Usage: mkfs.xfs /* blocksize */ [-b log=n|size=num] /* data subvol */ [-d agcount=n,agsize=n,file,name=xxx,size=num, (sunit=value,swidth=value|su=num,sw=num), sectlog=n|sectsize=num,unwritten=0|1] /* inode size */ [-i log=n|perblock=n|size=num,maxpct=n,attr=0|1|2] /* log subvol */ [-l agnum=n,internal,size=num,logdev=xxx,version=n sunit=value|su=num,sectlog=n|sectsize=num] /* label */ [-L label (maximum 12 characters)] /* naming */ [-n log=n|size=num,version=n] /* prototype file */ [-p fname] /* quiet */ [-q] /* realtime subvol */ [-r extsize=num,size=num,rtdev=xxx] /* sectorsize */ [-s log=n|size=num] /* version */ [-V] devicename is required unless -d name=xxx is given. is xxx (bytes), xxxs (sectors), xxxb (fs blocks), xxxk (xxx KiB), xxxm (xxx MiB), xxxg (xxx GiB), xxxt (xxx TiB) or xxxp (xxx PiB). is xxx (512 byte blocks). p34:~# Unless, the page size is not <= 64 for x86_64? Justin. 
From owner-xfs@oss.sgi.com Sun Apr 29 09:25:24 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 09:25:27 -0700 (PDT) Received: from lucidpixels.com (lucidpixels.com [75.144.35.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3TGPNfB007168 for ; Sun, 29 Apr 2007 09:25:24 -0700 Received: by lucidpixels.com (Postfix, from userid 1001) id 506C2B17FD53; Sun, 29 Apr 2007 12:25:19 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by lucidpixels.com (Postfix) with ESMTP id 4B7A05000165; Sun, 29 Apr 2007 12:25:19 -0400 (EDT) Date: Sun, 29 Apr 2007 12:25:19 -0400 (EDT) From: Justin Piszcz X-X-Sender: jpiszcz@p34.internal.lan To: xfs@oss.sgi.com cc: linux-kernel@vger.kernel.org Subject: Re: mkfs.xfs cannot make sector size 64 KiB? In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-archive-position: 11221 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: jpiszcz@lucidpixels.com Precedence: bulk X-list: xfs Adding LKML to cc list. On Sun, 29 Apr 2007, Justin Piszcz wrote: >> From the manpage: > > -b Block size options. > > This option specifies the fundamental block size of the > filesys- > tem. The valid suboptions are: log=value and size=value; > only > one can be supplied. The block size is specified either as > a > base two logarithm value with log=, or in bytes with size=. > The > default value is 4096 bytes (4 KiB), the minimum is 512, and > the > maximum is 65536 (64 KiB). XFS on Linux currently only > supports > pagesize or smaller blocks. > > The maximum size is 64 KiB, yet it seems only up to 32 KiB is valid? > > I am running x86_64. > > $ uname -m > x86_64 > > > p34:~# mkfs.xfs -b size=512 /dev/md3 > mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). > mkfs.xfs: Use the -f option to force overwrite. 
> p34:~# mkfs.xfs -b size=4096 /dev/md3 > mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). > mkfs.xfs: Use the -f option to force overwrite. > p34:~# mkfs.xfs -b size=8192 /dev/md3 > mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). > mkfs.xfs: Use the -f option to force overwrite. > p34:~# mkfs.xfs -b size=16384 /dev/md3 > mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). > mkfs.xfs: Use the -f option to force overwrite. > p34:~# mkfs.xfs -b size=32768 /dev/md3 > mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs). > mkfs.xfs: Use the -f option to force overwrite. > p34:~# mkfs.xfs -b size=65536 /dev/md3 > illegal sector size 65536 > Usage: mkfs.xfs > /* blocksize */ [-b log=n|size=num] > /* data subvol */ [-d agcount=n,agsize=n,file,name=xxx,size=num, > (sunit=value,swidth=value|su=num,sw=num), > sectlog=n|sectsize=num,unwritten=0|1] > /* inode size */ [-i log=n|perblock=n|size=num,maxpct=n,attr=0|1|2] > /* log subvol */ [-l agnum=n,internal,size=num,logdev=xxx,version=n > sunit=value|su=num,sectlog=n|sectsize=num] > /* label */ [-L label (maximum 12 characters)] > /* naming */ [-n log=n|size=num,version=n] > /* prototype file */ [-p fname] > /* quiet */ [-q] > /* realtime subvol */ [-r extsize=num,size=num,rtdev=xxx] > /* sectorsize */ [-s log=n|size=num] > /* version */ [-V] > devicename > is required unless -d name=xxx is given. > is xxx (bytes), xxxs (sectors), xxxb (fs blocks), xxxk (xxx KiB), > xxxm (xxx MiB), xxxg (xxx GiB), xxxt (xxx TiB) or xxxp (xxx PiB). > is xxx (512 byte blocks). > p34:~# > > Unless, the page size is not <= 64 for x86_64? > > Justin. 
> > From owner-xfs@oss.sgi.com Sun Apr 29 10:22:40 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 10:22:43 -0700 (PDT) Received: from sandeen.net (sandeen.net [209.173.210.139]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3THMdfB014945 for ; Sun, 29 Apr 2007 10:22:40 -0700 Received: from [10.0.0.4] (liberator.sandeen.net [10.0.0.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by sandeen.net (Postfix) with ESMTP id C7BC41806FDAE; Sun, 29 Apr 2007 12:22:37 -0500 (CDT) Message-ID: <4634D45D.5030807@sandeen.net> Date: Sun, 29 Apr 2007 12:22:37 -0500 From: Eric Sandeen User-Agent: Thunderbird 1.5.0.10 (Macintosh/20070221) MIME-Version: 1.0 To: Justin Piszcz CC: xfs@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: mkfs.xfs cannot make sector size 64 KiB? References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 11222 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: sandeen@sandeen.net Precedence: bulk X-list: xfs Justin Piszcz wrote: > Adding LKML to cc list. > > On Sun, 29 Apr 2007, Justin Piszcz wrote: > >>> From the manpage: >> >> -b Block size options. >> >> This option specifies the fundamental block size of the >> filesys- >> tem. The valid suboptions are: log=value and >> size=value; only >> one can be supplied. The block size is specified >> either as a >> base two logarithm value with log=, or in bytes with >> size=. The >> default value is 4096 bytes (4 KiB), the minimum is 512, >> and the >> maximum is 65536 (64 KiB). XFS on Linux currently only >> supports >> pagesize or smaller blocks. >> >> The maximum size is 64 KiB, yet it seems only up to 32 KiB is valid? above is block size, not sector size >> I am running x86_64. then you are limited to blocks (and therefore sectors) <= page size, so <= 4096. 
You can -mkfs- something bigger, but you won't be able to mount it.

>> $ uname -m
>> x86_64
>>
>>
>> p34:~# mkfs.xfs -b size=512 /dev/md3
>> mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs).
>> mkfs.xfs: Use the -f option to force overwrite.
>> p34:~# mkfs.xfs -b size=4096 /dev/md3
>> mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs).
>> mkfs.xfs: Use the -f option to force overwrite.
>> p34:~# mkfs.xfs -b size=8192 /dev/md3
>> mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs).
>> mkfs.xfs: Use the -f option to force overwrite.
>> p34:~# mkfs.xfs -b size=16384 /dev/md3
>> mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs).
>> mkfs.xfs: Use the -f option to force overwrite.
>> p34:~# mkfs.xfs -b size=32768 /dev/md3
>> mkfs.xfs: /dev/md3 appears to contain an existing filesystem (xfs).
>> mkfs.xfs: Use the -f option to force overwrite.
>> p34:~# mkfs.xfs -b size=65536 /dev/md3
>> illegal sector size 65536

This is mkfs.xfs trying to be smart about making larger "sectors" (== blocksize) on an md device, so that it does not switch the size of the IO requests between data & metadata, slowing things down significantly.

however,

  -s  Sector size options.

      This option specifies the fundamental sector size of the filesystem. The valid suboptions are: log=value and size=value; only one can be supplied. The sector size is specified either as a base two logarithm value with log=, or in bytes with size=. The default value is 512 bytes. The minimum value for sector size is 512; the maximum is 32768 (32 KiB). The sector size must be a power of 2 size and cannot be made larger than the filesystem block size.

looks like a buglet where it is trying to make a block == sector == 64k, but sectors are limited to 32k. But this is not what you want anyway, assuming you want to actually *mount* your new filesystem on x86_64. Just take the default blocksize (4k) and be happy.
>> Usage: mkfs.xfs >> /* blocksize */ [-b log=n|size=num] >> /* data subvol */ [-d agcount=n,agsize=n,file,name=xxx,size=num, >> (sunit=value,swidth=value|su=num,sw=num), >> sectlog=n|sectsize=num,unwritten=0|1] >> /* inode size */ [-i >> log=n|perblock=n|size=num,maxpct=n,attr=0|1|2] >> /* log subvol */ [-l >> agnum=n,internal,size=num,logdev=xxx,version=n >> sunit=value|su=num,sectlog=n|sectsize=num] >> /* label */ [-L label (maximum 12 characters)] >> /* naming */ [-n log=n|size=num,version=n] >> /* prototype file */ [-p fname] >> /* quiet */ [-q] >> /* realtime subvol */ [-r extsize=num,size=num,rtdev=xxx] >> /* sectorsize */ [-s log=n|size=num] >> /* version */ [-V] >> devicename >> is required unless -d name=xxx is given. >> is xxx (bytes), xxxs (sectors), xxxb (fs blocks), xxxk (xxx KiB), >> xxxm (xxx MiB), xxxg (xxx GiB), xxxt (xxx TiB) or xxxp (xxx PiB). >> is xxx (512 byte blocks). >> p34:~# >> >> Unless, the page size is not <= 64 for x86_64? it's not, but that's not why this broke. -Eric >> Justin. 
>> >> > > From owner-xfs@oss.sgi.com Sun Apr 29 16:07:39 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 16:07:41 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3TN7afB010345 for ; Sun, 29 Apr 2007 16:07:37 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA09489; Mon, 30 Apr 2007 09:07:30 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3TN7TAf79886867; Mon, 30 Apr 2007 09:07:29 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3TN7REM80094545; Mon, 30 Apr 2007 09:07:27 +1000 (AEST) Date: Mon, 30 Apr 2007 09:07:27 +1000 From: David Chinner To: Justin Piszcz Cc: xfs@oss.sgi.com Subject: Re: mkfs.xfs cannot make sector size 64 KiB? Message-ID: <20070429230727.GL32602149@melbourne.sgi.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-archive-position: 11223 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sun, Apr 29, 2007 at 12:20:26PM -0400, Justin Piszcz wrote: > >From the manpage: > > -b Block size options. > > This option specifies the fundamental block size of the > filesys- > tem. The valid suboptions are: log=value and size=value; > only > one can be supplied. The block size is specified either as > a > base two logarithm value with log=, or in bytes with size=. > The > default value is 4096 bytes (4 KiB), the minimum is 512, and > the > maximum is 65536 (64 KiB). XFS on Linux currently only > supports > pagesize or smaller blocks. 
>
> The maximum size is 64 KiB, yet it seems only up to 32 KiB is valid?
> I am running x86_64.

First question - What's the page size on x86_64?

Answer: 4k. So while mkfs will allow you to make >4k block size
filesystems, you can't mount them on x86_64 (yet).

> p34:~# mkfs.xfs -b size=65536 /dev/md3
> illegal sector size 65536

That's not an illegal block size it's complaining about - that's a
sector size that it thinks is wrong. By the time we check this, we've
already validated the block size:

    if (sectorsize < XFS_MIN_SECTORSIZE ||
        sectorsize > XFS_MAX_SECTORSIZE || sectorsize > blocksize) {
            fprintf(stderr, _("illegal sector size %d\n"), sectorsize);
            usage();
    }

But then we probe the underlying volume and get a "sectoralign"
variable:

    if (!nodsflag && !xi.disfile)
            get_subvol_stripe_wrapper(dfile, SVTYPE_DATA,
                            &xlv_dsunit, &xlv_dswidth, &sectoralign);
    if (sectoralign) {
            sectorsize = blocksize;
            sectorlog = libxfs_highbit32(sectorsize);
            if (loginternal) {
                    lsectorsize = sectorsize;
                    lsectorlog = sectorlog;
            }
    }
    if (sectorsize < XFS_MIN_SECTORSIZE ||
        sectorsize > XFS_MAX_SECTORSIZE || sectorsize > blocksize) {
            fprintf(stderr, _("illegal sector size %d\n"), sectorsize);
            usage();
    }

And if we have sectoralign returned, we adjust the sector size to the
block size and then we fail due to (xfs_alloc_btree.h):

    #define XFS_MAX_SECTORSIZE_LOG  15      /* i.e. 32768 bytes */

Hmmmm - I bet this is because you are using md raid here - this is
probably the code that ensures that XFS doesn't use 512 byte writes
that cause md raid cache flushes. This triggers on:

    *sectalign = (md.level == 4 || md.level == 5 || md.level == 6);

So I bet that you're using RAID4/5/6 on your md device.

Cheers,

Dave.
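To make the failure described above concrete, here is a minimal standalone sketch of the same check. It is not the mkfs.xfs source: the limits mirror the values quoted in this thread (512-byte minimum from the man page, XFS_MAX_SECTORSIZE_LOG of 15), while the function name effective_sectorsize and its plain-int parameters are made up for illustration.

```c
/* Limits as quoted in this thread; everything else here is hypothetical. */
#define XFS_MIN_SECTORSIZE_LOG  9   /* i.e. 512 bytes */
#define XFS_MAX_SECTORSIZE_LOG  15  /* i.e. 32768 bytes */
#define XFS_MIN_SECTORSIZE      (1 << XFS_MIN_SECTORSIZE_LOG)
#define XFS_MAX_SECTORSIZE      (1 << XFS_MAX_SECTORSIZE_LOG)

/*
 * Return the sector size that would be used, or -1 where mkfs would
 * print "illegal sector size".  "sectoralign" stands in for the result
 * of probing the underlying volume (set for md RAID4/5/6).
 */
static int effective_sectorsize(int sectorsize, int blocksize, int sectoralign)
{
    /* On an aligned volume the sector size is forced up to the block size. */
    if (sectoralign)
        sectorsize = blocksize;

    /* Same three-way range check as in the quoted code. */
    if (sectorsize < XFS_MIN_SECTORSIZE ||
        sectorsize > XFS_MAX_SECTORSIZE || sectorsize > blocksize)
        return -1;

    return sectorsize;
}
```

With sectoralign set, a 64 KiB block size returns -1, which corresponds to the "illegal sector size 65536" message, while a 32 KiB block size passes. Without sectoralign the default 512-byte sector size passes for any block size of 512 bytes or more.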
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun Apr 29 17:47:35 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 17:47:38 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U0lVfB022362 for ; Sun, 29 Apr 2007 17:47:34 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA11847; Mon, 30 Apr 2007 10:47:14 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3U0l9Af76806579; Mon, 30 Apr 2007 10:47:09 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3U0l2Rh80266961; Mon, 30 Apr 2007 10:47:02 +1000 (AEST) Date: Mon, 30 Apr 2007 10:47:02 +1000 From: David Chinner To: "Amit K. Arora" Cc: torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070430004702.GM32602149@melbourne.sgi.com> References: <20070321120425.GA27273@amitarora.in.ibm.com> <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070426175056.GA25321@amitarora.in.ibm.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11224 X-ecartis-version: Ecartis 
v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote: > Based on the discussion, this new patchset uses following as the > interface for fallocate() system call: > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Ok, so now for the hard questions - what are the semantics of FA_ALLOCATE and FA_DEALLOCATE? For FA_ALLOCATE, it's supposed to change the file size if we allocate past EOF, right? What's the return value supposed to be? Zero for success, error otherwise? Does this update a/m/ctime at all? How persistent is this preallocation? Should it be there "forever" or for the lifetime of the currently open fd that it was preallocated on? For FA_DEALLOCATE, does it change the filesize at all? Or does it just punch a hole in the file? If it does change file size, what happens when you punch out preallocation beyond EOF? What's the return value supposed to be? > Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for > preallocation and deallocation of preallocated blocks respectively. More > modes can be added, when required. FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does not change file size) so we can preallocate beyond EOF for apps which use O_APPEND (i.e. changing file size would cause problems for them). > ToDos: > ===== > 1> Implementation on other architectures (other than i386, x86_64, > ppc64 and s390(x)) I'll have ia64 soon. > 2> A generic file system operation to handle fallocate > (generic_fallocate), for filesystems that do _not_ have the fallocate > inode operation implemented. 
> 3> Changes to glibc, > a) to support fallocate() system call > b) so that posix_fallocate() and posix_fallocate64() call > fallocate() system call > 4> Changes to XFS to implement the fallocate inode operation And that's what I'm doing now, hence all the questions ;) BTW, do you have a test program for this, or will I need to write one myself? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun Apr 29 20:09:59 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 20:10:04 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U39tfB005951 for ; Sun, 29 Apr 2007 20:09:58 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA14743; Mon, 30 Apr 2007 13:09:38 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3U39WAf80281160; Mon, 30 Apr 2007 13:09:32 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3U39OWQ80262647; Mon, 30 Apr 2007 13:09:24 +1000 (AEST) Date: Mon, 30 Apr 2007 13:09:24 +1000 From: David Chinner To: David Chinner Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH] ia64 fallocate syscall Message-ID: <20070430030924.GN32602149@melbourne.sgi.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430004702.GM32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11225 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs ia64 fallocate syscall support. 
Signed-Off-By: Dave Chinner --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S =================================================================== --- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S 2007-03-29 19:01:41.000000000 +1000 +++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S 2007-04-27 19:12:56.829396661 +1000 @@ -1612,5 +1612,6 @@ sys_call_table: data8 sys_vmsplice data8 sys_ni_syscall // reserved for move_pages data8 sys_getcpu + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h =================================================================== --- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h 2007-03-29 19:03:37.000000000 +1000 +++ 2.6.x-xfs-new/include/asm-ia64/unistd.h 2007-04-27 19:18:18.215568425 +1000 @@ -293,11 +293,12 @@ #define __NR_vmsplice 1302 /* 1303 reserved for move_pages */ #define __NR_getcpu 1304 +#define __NR_fallocate 1305 #ifdef __KERNEL__ -#define NR_syscalls 281 /* length of syscall table */ +#define NR_syscalls 282 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION From owner-xfs@oss.sgi.com Sun Apr 29 20:11:49 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 20:11:51 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U3BifB006476 for ; Sun, 29 Apr 2007 20:11:47 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA14802; Mon, 30 Apr 2007 13:11:34 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3U3BTAf79538282; Mon, 30 Apr 2007 13:11:30 +1000 (AEST) Received: (from dgc@localhost) by 
snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3U3BLkY80203546; Mon, 30 Apr 2007 13:11:21 +1000 (AEST) Date: Mon, 30 Apr 2007 13:11:21 +1000 From: David Chinner To: David Chinner Cc: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH] XFS ->fallocate() support Message-ID: <20070430031121.GO32602149@melbourne.sgi.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430004702.GM32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11226 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Add XFS support for ->fallocate() vector. 
Signed-Off-By: Dave Chinner --- fs/xfs/linux-2.6/xfs_iops.c | 48 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-02-07 13:24:32.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.225095992 +1000 @@ -812,6 +812,53 @@ xfs_vn_removexattr( return namesp->attr_remove(vp, attr, xflags); } +STATIC long +xfs_vn_fallocate( + struct inode *inode, + int mode, + loff_t offset, + loff_t len) +{ + long error = -EOPNOTSUPP; + bhv_vnode_t *vp = vn_from_inode(inode); + bhv_desc_t *bdp; + int do_setattr = 0; + xfs_flock64_t bf; + + bf.l_whence = 0; + bf.l_start = offset; + bf.l_len = len; + + bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS, + VNODE_POSITION_XFS); + + switch (mode) { + case FA_ALLOCATE: /* changes file size */ + error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, + &bf, 0, NULL, 0); + if (offset + len > i_size_read(inode)) + do_setattr = offset + len; + break; + case FA_DEALLOCATE: + /* XXX: changes file size? 
this just punches a hole */ + error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP, + &bf, 0, NULL, 0); + break; + default: + break; + } + + /* Change file size if needed */ + if (!error && do_setattr) { + bhv_vattr_t va; + + va.va_mask = XFS_AT_SIZE; + va.va_size = do_setattr; + error = bhv_vop_setattr(vp, &va, 0, NULL); + } + + return error; +} struct inode_operations xfs_inode_operations = { .permission = xfs_vn_permission, @@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat .getxattr = xfs_vn_getxattr, .listxattr = xfs_vn_listxattr, .removexattr = xfs_vn_removexattr, + .fallocate = xfs_vn_fallocate, }; struct inode_operations xfs_dir_inode_operations = { From owner-xfs@oss.sgi.com Sun Apr 29 20:15:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 20:15:14 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U3F4fB007735 for ; Sun, 29 Apr 2007 20:15:07 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA14929; Mon, 30 Apr 2007 13:14:54 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3U3EmAf79787909; Mon, 30 Apr 2007 13:14:48 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3U3EhCe80249947; Mon, 30 Apr 2007 13:14:43 +1000 (AEST) Date: Mon, 30 Apr 2007 13:14:43 +1000 From: David Chinner To: David Chinner Cc: "Amit K. 
Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: [PATCH] Add preallocation beyond EOF to fallocate Message-ID: <20070430031443.GP32602149@melbourne.sgi.com> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430004702.GM32602149@melbourne.sgi.com> User-Agent: Mutt/1.4.2.1i X-archive-position: 11227 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Add new mode to ->fallocate() to allow allocation to occur beyond the current EOF without changing the file size. Implement in XFS ->fallocate() vector. 
Signed-Off-By: Dave Chinner --- fs/xfs/linux-2.6/xfs_iops.c | 8 +++++--- include/linux/fs.h | 1 + 2 files changed, 6 insertions(+), 3 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:09:48.233375382 +1000 @@ -833,11 +833,13 @@ xfs_vn_fallocate( VNODE_POSITION_XFS); switch (mode) { - case FA_ALLOCATE: /* changes file size */ - error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, - &bf, 0, NULL, 0); + case FA_ALLOCATE: /* changes file size */ if (offset + len > i_size_read(inode)) do_setattr = offset + len; + /* FALL THROUGH */ + case FA_PREALLOCATE: /* no filesize change */ + error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, + &bf, 0, NULL, 0); break; case FA_DEALLOCATE: /* XXX: changes file size? this just punches a hole */ Index: 2.6.x-xfs-new/include/linux/fs.h =================================================================== --- 2.6.x-xfs-new.orig/include/linux/fs.h 2007-04-27 18:48:01.000000000 +1000 +++ 2.6.x-xfs-new/include/linux/fs.h 2007-04-30 11:08:05.790903661 +1000 @@ -269,6 +269,7 @@ extern int dir_notify_enable; */ #define FA_ALLOCATE 0x1 #define FA_DEALLOCATE 0x2 +#define FA_PREALLOCATE 0x3 #ifdef __KERNEL__ From owner-xfs@oss.sgi.com Sun Apr 29 22:26:04 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 22:26:08 -0700 (PDT) Received: from smtp102.sbc.mail.mud.yahoo.com (smtp102.sbc.mail.mud.yahoo.com [68.142.198.201]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U5Q3fB024514 for ; Sun, 29 Apr 2007 22:26:04 -0700 Received: (qmail 30176 invoked from network); 30 Apr 2007 05:26:02 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp102.sbc.mail.mud.yahoo.com with SMTP; 30 Apr 2007 05:26:02 -0000 X-YMail-OSG: 
8AC.XHkVM1k1OIUaS66VuE4A9ak4Zz3i5RldyffApDgJ2fcxY0x7uKF83Ic1n_vsKfUDfpVNYQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id EEBF61827262; Sun, 29 Apr 2007 22:25:59 -0700 (PDT) Date: Sun, 29 Apr 2007 22:25:59 -0700 From: Chris Wedgwood To: David Chinner Cc: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070430052559.GA13145@tuatara.stupidest.org> References: <20070329115126.GB7374@amitarora.in.ibm.com> <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430004702.GM32602149@melbourne.sgi.com> X-archive-position: 11228 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > For FA_ALLOCATE, it's supposed to change the file size if we > allocate past EOF, right? I would argue no. Use truncate for that. > For FA_DEALLOCATE, does it change the filesize at all? Same as above. > Or does > it just punch a hole in the file? Yes. > FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does > not change file size) so we can preallocate beyond EOF for apps > which use O_APPEND (i.e. changing file size would cause problems for > them). FA_ALLOCATE should be able to allocate past-EOF I would argue. 
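For reference, the file-size behaviour being debated here can be modelled in a few lines. The sketch below reflects only what the XFS patches posted earlier in this thread do (FA_ALLOCATE extends the size when the range ends past EOF, FA_PREALLOCATE and FA_DEALLOCATE leave it alone); it is not an agreed definition of the semantics. The mode values are the ones from the patches; the helper name size_after_fallocate is hypothetical.

```c
#include <stdint.h>

/* Mode values as defined in the patches in this thread. */
#define FA_ALLOCATE     0x1
#define FA_DEALLOCATE   0x2
#define FA_PREALLOCATE  0x3

/*
 * Hypothetical model: given the current file size and a request
 * covering [offset, offset + len), return the file size afterwards.
 */
static int64_t size_after_fallocate(int mode, int64_t isize,
                                    int64_t offset, int64_t len)
{
    /* Only FA_ALLOCATE grows the file, and only when it reaches past EOF. */
    if (mode == FA_ALLOCATE && offset + len > isize)
        return offset + len;

    /* FA_PREALLOCATE allocates beyond EOF silently; FA_DEALLOCATE punches a hole. */
    return isize;
}
```

Under the alternative argued for in this message, the FA_ALLOCATE branch would disappear and the caller would use truncate to change the size.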
From owner-xfs@oss.sgi.com Sun Apr 29 22:57:19 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 22:57:22 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U5vGfB027585 for ; Sun, 29 Apr 2007 22:57:18 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA18721; Mon, 30 Apr 2007 15:56:51 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3U5uhAf80336303; Mon, 30 Apr 2007 15:56:44 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3U5uWKB80342056; Mon, 30 Apr 2007 15:56:32 +1000 (AEST) Date: Mon, 30 Apr 2007 15:56:32 +1000 From: David Chinner To: Chris Wedgwood Cc: David Chinner , "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070430055632.GR32602149@melbourne.sgi.com> References: <20070329101010.7a2b8783.akpm@linux-foundation.org> <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430052559.GA13145@tuatara.stupidest.org> User-Agent: Mutt/1.4.2.1i X-archive-position: 11229 X-ecartis-version: Ecartis v1.0.0 Sender: 
xfs-bounce@oss.sgi.com
Errors-to: xfs-bounce@oss.sgi.com
X-original-sender: dgc@sgi.com
Precedence: bulk
X-list: xfs

On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
>
> > For FA_ALLOCATE, it's supposed to change the file size if we
> > allocate past EOF, right?
>
> I would argue no. Use truncate for that.

I'm going from the ext4 implementation because the semantics have not
been documented yet.

IIRC, the argument for FA_ALLOCATE changing file size is that
posix_fallocate() is supposed to change the file size. I think that
having a mode for real preallocation and another for posix_fallocate
is a valid thing to do...

Note that the way XFS implements growing the file size after the
allocation is via a truncate....

> > For FA_DEALLOCATE, does it change the filesize at all?
>
> Same as above.
>
> > Or does
> > it just punch a hole in the file?
>
> Yes.

That's what I did, because otherwise you'd use ftruncate64(). Without
documented behaviour or an ext4 implementation, I have to ask what
it's supposed to do, though ;)

Cheers,

Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Sun Apr 29 23:01:32 2007 Received: with ECARTIS (v1.0.0; list xfs); Sun, 29 Apr 2007 23:01:35 -0700 (PDT) Received: from smtp107.sbc.mail.mud.yahoo.com (smtp107.sbc.mail.mud.yahoo.com [68.142.198.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3U61UfB028760 for ; Sun, 29 Apr 2007 23:01:32 -0700 Received: (qmail 28837 invoked from network); 30 Apr 2007 06:01:28 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp107.sbc.mail.mud.yahoo.com with SMTP; 30 Apr 2007 06:01:28 -0000 X-YMail-OSG: DGv5zKIVM1nmz1f4yOPRwTsF3vHIB18l1eb1mqARs0tf8qBatECgPKeMGtte1rF9_Ftw_D5M5w-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 865531827261; Sun, 29 Apr 2007 23:01:26 -0700 (PDT) Date: Sun, 29 Apr 2007 23:01:26 -0700 From: Chris Wedgwood To: David Chinner Cc: "Amit K. Arora" , torvalds@osdl.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org, xfs@oss.sgi.com, suparna@in.ibm.com, cmm@us.ibm.com Subject: Re: [PATCH 0/5] fallocate system call Message-ID: <20070430060126.GA15038@tuatara.stupidest.org> References: <20070330071417.GI355@devserv.devel.redhat.com> <20070417125514.GA7574@amitarora.in.ibm.com> <20070418130600.GW5967@schatzie.adilger.int> <20070420135146.GA21352@amitarora.in.ibm.com> <20070420145918.GY355@devserv.devel.redhat.com> <20070424121632.GA10136@amitarora.in.ibm.com> <20070426175056.GA25321@amitarora.in.ibm.com> <20070430004702.GM32602149@melbourne.sgi.com> <20070430052559.GA13145@tuatara.stupidest.org> <20070430055632.GR32602149@melbourne.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430055632.GR32602149@melbourne.sgi.com> X-archive-position: 11230 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com 
X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote: > On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > IIRC, the argument for FA_ALLOCATE changing file size is that > posix_fallocate() is supposed to change the file size. But it's not posix_fallocate; it's something more generic. glibc can do posix_fallocate using truncate + fallocate. > Note that the way XFS implements growing the file size after the > allocation is via a truncate.... What's wrong with that? That seems very reasonable. > That's would what I did because otherwise you'd use ftruncate64(). > Without documented behaviour or an ext4 implementation, I have to > ask what it's supposed to do, though ;) How many *real* users are there for ext4? Why does 'what ext4 does' define 'the semantics'? Surely semantics should be decided either by precedent (if there is an existing relevant userbase) or sensible thought and some debate? From owner-xfs@oss.sgi.com Mon Apr 30 05:31:11 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 05:31:13 -0700 (PDT) Received: from mailsecure1.itc.griffith.edu.au (mailsecure1-out.itc.griffith.edu.au [132.234.242.61]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3UCV9fB027347 for ; Mon, 30 Apr 2007 05:31:10 -0700 Received: from mailsecure1.itc.griffith.edu.au (unknown [127.0.0.1]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id C7D4218F for ; Mon, 30 Apr 2007 22:01:43 +1000 (EST) X-AuditID: 84eaf23c-afaf3bb000004912-ad-4635daa76f99 Received: from nox-1.itc.griffith.edu.au (sc2bigip02-242.nms.griffith.edu.au [132.234.242.254]) by mailsecure1.itc.griffith.edu.au (Symantec Mail Security) with ESMTP id 9269F3018B for ; Mon, 30 Apr 2007 22:01:43 +1000 (EST) Received: from [132.234.242.254] (helo=studentemail.griffith.edu.au) by nox-1.itc.griffith.edu.au with esmtp (Exim 4.63) (envelope-from ) id 1HiUZb-000053-Hs for xfs@oss.sgi.com; Mon, 30 
Apr 2007 22:01:43 +1000 Received: from [132.234.200.137] by studentemail.griffith.edu.au (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0JHB00G9182VYL70@studentemail.griffith.edu.au> for xfs@oss.sgi.com; Mon, 30 Apr 2007 22:01:43 +1000 (EST) Date: Mon, 30 Apr 2007 22:01:40 +1000 From: Stephen So Subject: Slow performance when extracting tarballs To: xfs@oss.sgi.com Message-id: <4635DAA4.4070402@griffith.edu.au> Organization: Griffith School of Engineering, Griffith University MIME-version: 1.0 Content-type: text/plain; charset=UTF-8 Content-transfer-encoding: 7BIT X-Enigmail-Version: 0.95.0 User-Agent: Thunderbird 2.0.0.0 (Windows/20070326) X-Brightmail-Tracker: AAAAAA== X-archive-position: 11231 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: s.so@griffith.edu.au Precedence: bulk X-list: xfs Hi everyone, Just a question about XFS when extracting bzipped tarballs containing lots of little files (e.g. the linux kernel source). I've noticed on my new laptop (Intel Core 2 Duo @ 2.16 GHz), which has FC6 i386, which has XFS partitions that when I extract these types of tarballs, my system becomes rather non responsive, my mp3 player starts skipping etc. After looking at top during the extraction process, bzip2 uses about 80-90% of CPU initially and extraction seems quite fast but after a few seconds, it drops to 30-40%, the system becomes non responsive, and extraction is much slower. I created a new partition of ext3 and reiserfs, did the same tarball extraction, and on both filesystems, bzip2 uses at least 90%, extraction is fast the whole way through, and the system is quite responsive. 
I created my XFS partition using the following command (I made a larger log file size since I heard that improve delete performance a bit): mkfs.xfs -l size=64m /dev/sda10 Then to mount this partition, I have these switches in my /etc/fstab file noatime, nodiratime, logbufs=8 I'm using kernel 2.6.20 that came from the FC6 updates repositories. So is there something wrong with my XFS setup? Is my log file too small? Or is this "normal" behaviour of XFS (i.e. that it excels best when working with very large files but not lots of little files)? Many thanks and best regards, Steve. -- ______________________________________________________ Dr Stephen So, PhD, MIEEE Griffith School of Engineering & Institute for Integrated and Intelligent Systems Science, Environment, Engineering and Technology Group Griffith University, Gold Coast Campus PMB 50, Gold Coast Mail Centre Gold Coast, QLD, 9726, Australia E-mail: s.so@griffith.edu.au Phone (Fax): +61 7 5552 8663 (8065) Homepage: http://maxwell.me.gu.edu.au/sso ______________________________________________________ From owner-xfs@oss.sgi.com Mon Apr 30 14:35:43 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 14:35:47 -0700 (PDT) Received: from smtp111.sbc.mail.mud.yahoo.com (smtp111.sbc.mail.mud.yahoo.com [68.142.198.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3ULZgfB003268 for ; Mon, 30 Apr 2007 14:35:43 -0700 Received: (qmail 19353 invoked from network); 30 Apr 2007 21:35:41 -0000 Received: from unknown (HELO stupidest.org) (cwedgwood@sbcglobal.net@24.5.75.45 with login) by smtp111.sbc.mail.mud.yahoo.com with SMTP; 30 Apr 2007 21:35:41 -0000 X-YMail-OSG: xM.1fZcVM1k_Jhq5MZ1Q.3s9nx5lMEHxHQMpSFNeSg2AWYEBrhGqshGh8EvMWEmTlAReDlIkqQ-- Received: by tuatara.stupidest.org (Postfix, from userid 10000) id 298A01827262; Mon, 30 Apr 2007 14:35:39 -0700 (PDT) Date: Mon, 30 Apr 2007 14:35:38 -0700 From: Chris Wedgwood To: Stephen So Cc: xfs@oss.sgi.com Subject: Re: Slow performance when extracting 
tarballs Message-ID: <20070430213538.GA30809@tuatara.stupidest.org> References: <4635DAA4.4070402@griffith.edu.au> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4635DAA4.4070402@griffith.edu.au> X-archive-position: 11232 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 10:01:40PM +1000, Stephen So wrote: > After looking at top during the extraction process, bzip2 uses about > 80-90% of CPU initially and extraction seems quite fast but after a > few seconds, it drops to 30-40%, the system becomes non responsive, > and extraction is much slower. what does "vmstat 1" look like during this? > noatime, nodiratime, logbufs=8 have you also tried setting (increasing) logbsize? (i think you need v2 logs to make that work) From owner-xfs@oss.sgi.com Mon Apr 30 15:44:06 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 15:44:08 -0700 (PDT) Received: from mail.clusterfs.com (mail.clusterfs.com [206.168.112.78]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l3UMi4fB015712 for ; Mon, 30 Apr 2007 15:44:06 -0700 Received: from localhost.adilger.int (S01060004e23cfc51.cg.shawcable.net [68.147.252.160]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.clusterfs.com (Postfix) with ESMTP id F10157BA31E; Mon, 30 Apr 2007 16:44:02 -0600 (MDT) Received: by localhost.adilger.int (Postfix, from userid 1000) id 68F704174; Mon, 30 Apr 2007 16:44:01 -0600 (MDT) Date: Mon, 30 Apr 2007 16:44:01 -0600 From: Andreas Dilger To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070430224401.GX5967@schatzie.adilger.int> Mail-Followup-To: David Chinner , linux-ext4@vger.kernel.org, 
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070419015426.GM48531920@melbourne.sgi.com> User-Agent: Mutt/1.4.1i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 X-archive-position: 11233 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: adilger@clusterfs.com Precedence: bulk X-list: xfs On Apr 19, 2007 11:54 +1000, David Chinner wrote: > > struct fiemap { > > __u64 fm_start; /* logical start offset of mapping (in/out) */ > > __u64 fm_len; /* logical length of mapping (in/out) */ > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > > __u64 fm_unused; > > struct fiemap_extent fm_extents[0]; > > } > > > > /* flags for the fiemap request */ > > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ > > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? This is actually for future use. Any flags that are added into this range must be understood by both sides or it should be considered an error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. If it turns out that 8 bits is too small a range for INCOMPAT flags, then we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are also incompat flags also. I'm assuming that all flags that will be in the original FIEMAP proposal will be understood by the implementations. 
Most filesystems can safely ignore FLAG_HSM_READ, for example, since they don't support HSM, and for that matter FLAG_SYNC is probably moot for most filesystems also because they do block allocation at preprw time. > SO, there's a HSM_READ flag above. If we are going to make this interface > useful for filesystems that have HSMs interacting with their extents, the > HSM needs to be able to query whether the extent is online (on disk), > has been migrated offline (on tape) or in dual-state (i.e. both online and > offline). Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't consider files that are both on disk and on secondary storage (which is no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE, but that has a confusing connotation that the extent is inaccessible, instead of just saying it is also on offline storage. What about FIEMAP_EXTENT_SECONDARY? Other proposals welcome. FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped. That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN, while a dual-location file would be EXTENT_SECONDARY only. > > SUMMARY OF CHANGES > > ================== > > - add separate fe_flags word with flags from various suggestions: > > - FIEMAP_EXTENT_HOLE = extent has no space allocation > > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data > > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown > > (e.g. HSM, delalloc awaiting sync, etc) > > I'd like an explicit delalloc flag, not lumping it in with "unknown". > we *know* the extent is delalloc ;) Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in addition to UNKNOWN). I'd like to keep a generic "UNKNOWN" flag that can be used by applications that don't really care about why it is unmapped and in case there are other reasons in the future that an extent might be unmapped (e.g. 
fsck or storage layer reporting corruption or loss of that part of the file). > > > chook 681% xfs_bmap -vv fred > > > fred: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010 > > > FLAG Values: > > > 010000 Unwritten preallocated extent > > > 001000 Doesn't begin on stripe unit > > > 000100 Doesn't end on stripe unit > > > 000010 Doesn't begin on stripe width > > > 000001 Doesn't end on stripe width > > > > Can you clarify the terminology here? What is a "stripe unit" and what is > > a "stripe width"? > > Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount > of data that is written to each lun in a stripe before moving onto the > next stripe element. > > > Are there "N * stripe_unit = stripe_width" in e.g. a > > RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa? > > Yes, on simple configurations. In more complex HW RAID > configurations, we'll typically set the stripe unit to the width of > the RAID5 lun (N * segment size) and the stripe width to the number > of luns we've striped across. Can you propose reasonable flag names for these (I can't think of anything very good) and a clear explanation of what they mean. I suspect it will only be XFS that uses them initially. In mke2fs and ext4+mballoc there is the concept of stripe unit and stripe width, but as yet they are not communicated between the two very well. I'd be much happier if this info could be queried in a standard way from the block layer instead of the user having to specify it and the filesystem having to track it. > > > Ok, so the only way you can determine where you are in the file > > > is by adding up the length of each extent. What happens if the file > > > is changing underneath you e.g. someone punches out a hole > > > in teh file, or truncates and extends it again between ioctl() > > > calls? > > > > Well, that is always true with data once it is out of the caller. 
> > Sure, but this interface requires iterative calls where the n+1 > call is reliant on nothing changing since the first call to be > accurate. My question is how do you use this interface to reliably > and accurately get all the extents if you using iterative summing > like this? Maybe it wasn't clear, but the semantics of the ioctl are that it will return the first extent that contains the requested byte offset in fm_start. If the file has changed since the last call to FIEMAP then it will restart with the extent that covers this byte and continue on. In most cases the file mapping should be returnable in a single ioctl (assuming a reasonable extent count). > > > Also, what happens if you ask for an offset/len that doesn't map to > > > any extent boundaries - are you truncating the extents returned to > > > teh off/len passed in? > > > > The request offset will be returned as the start of the actual extent that > > it falls inside. And the returned extents will end with the extent that > > ends at or after the requested fm_start + fm_len. > > Ok, so you round the start inwards and the round end outwards. Can > you ensure that this is documented in the header file that describes > this interface? Sure. > > > xfs_bmap gets around this by finding out how many extents there are in the > > > file and allocating a buffer that big to hold all the extents so they > > > are gathered in a single atomic call (think sparse matrix files).... > > > > Yeah, except this might be persistent for a long time if it isn't fully > > read with a single ioctl and the app never continues reading but doesn't > > close the fd. > > Not sure I follow you here... Ah, I was thinking that XFS was keeping a copy of the whole extent mapping in the kernel to handle getting the data with separate calls. 
It does make sense to specify zero for the fm_extent_count array and a new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the extent data itself, for the non-verbose mode of filefrag, and for pre-allocating a buffer large enough to hold the file if that is important. I'm also going to add a FIEMAP_FLAG_LAST to mark the last extent in the file, so that iterators using a small buffer don't need to retry to get the last extent, and it is possible in case of e.g. EINTR (or whatever) to return a short list without signalling EOF. I think this is cleaner than returning a HOLE extent from EOF to ~0ULL. Another question about semantics - - does XFS return an extent for the metadata parts of the file (e.g. btree)? - does XFS return preallocated extents beyond EOF? - does XFS allow non-root users to call xfs_bmap on files they don't own, or use by non-root users at all? The FIBMAP ioctl is for privileged users only, and I wonder if FIEMAP should be the same, or at least disallow mapping files that the user can't access especially with FLAG_SYNC and/or FLAG_HSM_READ. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
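The lookup semantics described in the message above (the first returned extent is the one containing fm_start, the last is the one ending at or after fm_start + fm_len, and the final extent of the file carries a "last" marker) can be sketched in userspace as follows. This is only an illustration under stated assumptions: `struct extent`, its `last` field and `map_range()` are hypothetical names, not part of the proposed ioctl, and the file's extent list is assumed to be sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an extent record. */
struct extent {
    uint64_t start; /* logical byte offset of the extent */
    uint64_t len;   /* length of the extent in bytes */
    int last;       /* set on the final extent of the file */
};

/*
 * Copy into 'out' every extent of 'file_map' that overlaps the byte
 * range [start, start + len). Extents are returned whole, so the
 * first one may begin before 'start' and the final one may end after
 * 'start + len'. Returns the number of extents copied (<= max_out).
 */
static size_t map_range(const struct extent *file_map, size_t n,
                        uint64_t start, uint64_t len,
                        struct extent *out, size_t max_out)
{
    size_t count = 0;
    uint64_t end = start + len;

    assert(out != NULL || max_out == 0);

    for (size_t i = 0; i < n && count < max_out; i++) {
        uint64_t ext_end = file_map[i].start + file_map[i].len;

        if (ext_end <= start)
            continue;               /* entirely before the request */
        if (file_map[i].start >= end)
            break;                  /* entirely after the request */

        out[count] = file_map[i];
        out[count].last = (i == n - 1); /* mark end of file */
        count++;
    }
    return count;
}
```

A caller with a small buffer would resume from the end of the last extent it received (start + len of that extent) and stop once it sees an extent with `last` set, which is the reason given above for wanting such a marker.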
From owner-xfs@oss.sgi.com Mon Apr 30 16:28:05 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 16:28:08 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l3UNS2fB028014 for ; Mon, 30 Apr 2007 16:28:04 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA13665; Tue, 1 May 2007 09:27:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l3UNRuAf81167509; Tue, 1 May 2007 09:27:56 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l3UNRtGS80335177; Tue, 1 May 2007 09:27:55 +1000 (AEST) Date: Tue, 1 May 2007 09:27:55 +1000 From: David Chinner To: Stephen So Cc: xfs@oss.sgi.com Subject: Re: Slow performance when extracting tarballs Message-ID: <20070430232755.GT32602149@melbourne.sgi.com> References: <4635DAA4.4070402@griffith.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4635DAA4.4070402@griffith.edu.au> User-Agent: Mutt/1.4.2.1i X-archive-position: 11234 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 10:01:40PM +1000, Stephen So wrote: > Hi everyone, > > Just a question about XFS when extracting bzipped tarballs containing > lots of little files (e.g. the linux kernel source). I've noticed on my > new laptop (Intel Core 2 Duo @ 2.16 GHz), which has FC6 i386, which has > XFS partitions that when I extract these types of tarballs, my system > becomes rather non responsive, my mp3 player starts skipping etc. 
After > looking at top during the extraction process, bzip2 uses about 80-90% of > CPU initially and extraction seems quite fast but after a few seconds, > it drops to 30-40%, the system becomes non responsive, and extraction is > much slower. The log probably fills up and then it falls back to the speed that your disk can write back all the metadata. > I created a new partition of ext3 and reiserfs, did the same tarball > extraction, and on both filesystems, bzip2 uses at least 90%, extraction > is fast the whole way through, and the system is quite responsive. > > I created my XFS partition using the following command (I made a larger > log file size since I heard that improve delete performance a bit): > > mkfs.xfs -l size=64m /dev/sda10 > > Then to mount this partition, I have these switches in my /etc/fstab file > > noatime, nodiratime, logbufs=8 If you don't care about the filesystem always being able to recover correctly when power fails (i.e. you accept that a power failure can lead to filesystem corruption), you can also use the "nobarrier" mount option, which can significantly speed up metadata performance on XFS. > I'm using kernel 2.6.20 that came from the FC6 updates repositories. So > is there something wrong with my XFS setup? Is my log file too small? > Or is this "normal" behaviour of XFS (i.e. that it excels best when > working with very large files but not lots of little files)? XFS excels at large files and/or lots and lots of files. On small files it performs adequately but is not the fastest filesystem around. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 30 21:23:09 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 21:23:12 -0700 (PDT) Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id l414N5fB026879 for ; Mon, 30 Apr 2007 21:23:07 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id OAA20976; Tue, 1 May 2007 14:22:57 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id l414MtAf81205408; Tue, 1 May 2007 14:22:55 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id l414MsVa81006826; Tue, 1 May 2007 14:22:54 +1000 (AEST) Date: Tue, 1 May 2007 14:22:54 +1000 From: David Chinner To: David Chinner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation Message-ID: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070430224401.GX5967@schatzie.adilger.int> User-Agent: Mutt/1.4.2.1i X-archive-position: 11235 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > On Apr 19, 2007 11:54 +1000, David Chinner wrote: > > > struct fiemap { > > > __u64 fm_start; /* logical start offset 
of mapping (in/out) */ > > > __u64 fm_len; /* logical length of mapping (in/out) */ > > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > > > __u64 fm_unused; > > > struct fiemap_extent fm_extents[0]; > > > } > > > > > > /* flags for the fiemap request */ > > > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > > > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > > > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ > > > > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > > This is actually for future use. Any flags that are added into this range > must be understood by both sides or it should be considered an error. Flags > outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. > If it turns out that 8 bits is too small a range for INCOMPAT flags, then > we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are also > incompat flags also. Ah, ok. So it's not really a set of "compatibility" flags, it's more a "compulsory" set. Under those terms, i don't really see why this is necessary - either the filesystem will understand the flags or it will return EINVAL or ignore them... > I'm assuming that all flags that will be in the original FIEMAP proposal > will be understood by the implementations. Most filesystems can safely > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > that matter FLAG_SYNC is probably moot for most filesystems also because > they do block allocation at preprw time. Exactly my point - so why do we really need to encode a compulsory set of flags in the API? > > SO, there's a HSM_READ flag above. 
If we are going to make this interface > > useful for filesystems that have HSMs interacting with their extents, the > > HSM needs to be able to query whether the extent is online (on disk), > > has been migrated offline (on tape) or in dual-state (i.e. both online and > > offline). > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't I disagree - why would you want to indicate the state is unknown when we know very well that it is offline? > consider files that are both on disk and on secondary storage (which is > no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE, > but that has a confusing connotation that the extent is inaccessible, instead > of just saying it is also on offline storage. What about > FIEMAP_EXTENT_SECONDARY? Other proposals welcome. Effectively, when your extent is offline in the HSM, it is inaccessible, and you have to bring it back from tape so it becomes accessible again, i.e. some action is necessary on behalf of the user to make it accessible. So I think that OFFLINE is a good name for this state because it really is inaccessible. Also, I don't think "secondary" is a good term because most large systems have more than one tier of storage. One possibility is "HSM_RESIDENT", which indicates the extent is current and resident within an HSM's archive.... > FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped. > That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN, > while a dual-location file would be EXTENT_SECONDARY only. I much prefer OFFLINE|HSM_RESIDENT and HSM_RESIDENT as it is far more descriptive as to what the state is (which certainly isn't unknown). 
> > > SUMMARY OF CHANGES > > > ================== > > > - add separate fe_flags word with flags from various suggestions: > > > - FIEMAP_EXTENT_HOLE = extent has no space allocation > > > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data > > > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown > > > (e.g. HSM, delalloc awaiting sync, etc) > > > > I'd like an explicit delalloc flag, not lumping it in with "unknown". > > we *know* the extent is delalloc ;) > > Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with > EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in > addition to UNKNOWN). I disagree that it is redundant - in case you hadn't already noticed I dislike the idea of "unknown" meaning one of several possible known states ;) > I'd like to keep a generic "UNKNOWN" flag that can > be used by applications that don't really care about why it is unmapped > and in case there are other reasons in the future that an extent might > be unmapped (e.g. fsck or storage layer reporting corruption or loss of > that part of the file). Sure. > > > > chook 681% xfs_bmap -vv fred > > > > fred: > > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010 > > > > FLAG Values: > > > > 010000 Unwritten preallocated extent > > > > 001000 Doesn't begin on stripe unit > > > > 000100 Doesn't end on stripe unit > > > > 000010 Doesn't begin on stripe width > > > > 000001 Doesn't end on stripe width > > > > > > Can you clarify the terminology here? What is a "stripe unit" and what is > > > a "stripe width"? > > > > Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount > > of data that is written to each lun in a stripe before moving onto the > > next stripe element. > > > > > Are there "N * stripe_unit = stripe_width" in e.g. a > > > RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa? 
> > > > Yes, on simple configurations. In more complex HW RAID > > configurations, we'll typically set the stripe unit to the width of > > the RAID5 lun (N * segment size) and the stripe width to the number > > of luns we've striped across. > > Can you propose reasonable flag names for these (I can't think of anything > very good) and a clear explanation of what they mean. I suspect it will > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > the concept of stripe unit and stripe width, but as yet they are not > communicated between the two very well. I'd be much happier if this info > could be queried in a standard way from the block layer instead of the > user having to specify it and the filesystem having to track it. My preference is definitely for a separate ioctl to grab the filesystem geometry so this stuff can be calculated in userspace. i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't bother trying to define names until we decide which appraoch we take to implement this. The problem with the block layer is that ther is no standard way of getting this information - every volume manager has a different method. In XFS, mkfs.xfs does the work of getting this information to see in the filesystem superblock. Here's the code for getting sunit/swidth from the underlying block device: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ Not much in common there ;) The other problem is that there is no guarantee the filesystem and the block layer are using the same values as they can be overridden by mount options. e.g the block layer may have a stripe unit of 512k, but it's perfectly valid for the filesystem to use a multiple of this for it's sunit, or to not use a sunit at all. Hence you really do need to track this in the filesystem and query it from there... 
> > > > xfs_bmap gets around this by finding out how many extents there are in the > > > > file and allocating a buffer that big to hold all the extents so they > > > > are gathered in a single atomic call (think sparse matrix files).... > > > > > > Yeah, except this might be persistent for a long time if it isn't fully > > > read with a single ioctl and the app never continues reading but doesn't > > > close the fd. > > > > Not sure I follow you here... > > Ah, I was thinking that XFS was keeping a copy of the whole extent > mapping in the kernel to handle getting the data with separate calls. Actually, it keeps the whole mapping in the kernel to make lookups fast and relatively simple. > It does make sense to specify zero for the fm_extent_count array and a > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > extent data itself, for the non-verbose mode of filefrag, and for > pre-allocating a buffer large enough to hold the file if that is important. Rather than rely on implicit behaviour of "pass in extent count of zero and a don't try to return any extents" to return the number of extents on the file, why not just explicitly define this as a valid input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS > I'm also going to add a FIEMAP_FLAG_LAST to mark the last extent in the file, > so that iterators using a small buffer don't need to retry to get the last > extent, and it is possible in case of e.g. EINTR (or whatever) to return a > short list without signalling EOF. I think this is cleaner than returning > a HOLE extent from EOF to ~0ULL. Yes, good idea. > Another question about semantics - > - does XFS return an extent for the metadata parts of the file (e.g. btree)? No, but we can return the extent map for the attribute fork (i.e. extended attrs) if asked for (XFS_IOC_GETBMAPA). > - does XFS return preallocated extents beyond EOF? Yes - they are part of the extent map for the file. 
> - does XFS allow non-root users to call xfs_bmap on files they don't own, or > use by non-root users at all? Users can run xfs_bmap on any file they have permission to open(O_RDONLY). > The FIBMAP ioctl is for privileged users > only, and I wonder if FIEMAP should be the same, or at least disallow > mapping files that the user can't access especially with FLAG_SYNC and/or > FLAG_HSM_READ. I see little reason for restricting FI[BE]MAP to privileged users - anyone should be able to determine if files they have permission to access are fragmented. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Mon Apr 30 21:44:12 2007 Received: with ECARTIS (v1.0.0; list xfs); Mon, 30 Apr 2007 21:44:15 -0700 (PDT) Received: from rwcrmhc13.comcast.net (rwcrmhc13.comcast.net [216.148.227.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id l414iBfB030262 for ; Mon, 30 Apr 2007 21:44:12 -0700 Received: from [192.168.1.10] (c-67-171-1-120.hsd1.wa.comcast.net[67.171.1.120]) by comcast.net (rwcrmhc13) with SMTP id <20070501043907m130034vcje>; Tue, 1 May 2007 04:39:07 +0000 Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation From: Nicholas Miell To: David Chinner Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org In-Reply-To: <20070501042254.GD77450368@melbourne.sgi.com> References: <20070412110550.GM5967@schatzie.adilger.int> <20070416112252.GJ48531920@melbourne.sgi.com> <20070419002139.GK5967@schatzie.adilger.int> <20070419015426.GM48531920@melbourne.sgi.com> <20070430224401.GX5967@schatzie.adilger.int> <20070501042254.GD77450368@melbourne.sgi.com> Content-Type: text/plain Date: Mon, 30 Apr 2007 21:39:06 -0700 Message-Id: <1177994346.3362.5.camel@entropy> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.0.njm.1) Content-Transfer-Encoding: 7bit X-archive-position: 11236 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: 
xfs-bounce@oss.sgi.com X-original-sender: nmiell@comcast.net Precedence: bulk X-list: xfs On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > On Apr 19, 2007 11:54 +1000, David Chinner wrote: > > > > struct fiemap { > > > > __u64 fm_start; /* logical start offset of mapping (in/out) */ > > > > __u64 fm_len; /* logical length of mapping (in/out) */ > > > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > > > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > > > > __u64 fm_unused; > > > > struct fiemap_extent fm_extents[0]; > > > > } > > > > > > > > /* flags for the fiemap request */ > > > > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/ > > > > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */ > > > > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/ > > > > > > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > > > > This is actually for future use. Any flags that are added into this range > > must be understood by both sides or it should be considered an error. Flags > > outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. > > If it turns out that 8 bits is too small a range for INCOMPAT flags, then > > we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are also > > incompat flags also. > > Ah, ok. So it's not really a set of "compatibility" flags, > it's more a "compulsory" set. Under those terms, i don't really > see why this is necessary - either the filesystem will understand > the flags or it will return EINVAL or ignore them... > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > will be understood by the implementations. 
Most filesystems can safely > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > that matter FLAG_SYNC is probably moot for most filesystems also because > > they do block allocation at preprw time. > > Exactly my point - so why do we really need to encode a compulsory set > of flags in the API? > Because flags have meaning, independent of whether or not the filesystem understands them. And if the filesystem chooses to ignore critically important flags (instead of returning EINVAL), bad things may happen. So, either the filesystem will understand the flag or iff the unknown flag is in the incompat set, it will return EINVAL or else the unknown flag will be safely ignored. -- Nicholas Miell