From owner-xfs@oss.sgi.com Tue Apr 1 03:54:15 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 03:54:24 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,STOX_REPLY_TYPE autolearn=ham version=3.3.0-r574664 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.168.29]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31AsCZf019813 for ; Tue, 1 Apr 2008 03:54:15 -0700 X-ASG-Debug-ID: 1207047285-562601da0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from tyo201.gate.nec.co.jp (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 0868A71FB12 for ; Tue, 1 Apr 2008 03:54:45 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by cuda.sgi.com with ESMTP id kkOsEnCzEHwgECVp for ; Tue, 01 Apr 2008 03:54:45 -0700 (PDT) Received: from mailgate3.nec.co.jp (mailgate53F.nec.co.jp [10.7.69.162]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31Asipx028227; Tue, 1 Apr 2008 19:54:44 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id m31Ashm26568; Tue, 1 Apr 2008 19:54:43 +0900 (JST) Received: from togyo.jp.nec.com (togyo.jp.nec.com [10.26.220.4]) by mailsv4.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31Ash5A016406; Tue, 1 Apr 2008 19:54:43 +0900 (JST) Received: from TNESB07336 ([10.64.168.65] [10.64.168.65]) by mail.jp.nec.com with ESMTP; Tue, 1 Apr 2008 19:54:43 +0900 Message-Id: <2530BB4B166747659C8F65C9C3DE7CFB@nsl.ad.nec.co.jp> From: "Takashi Sato" To: "David Chinner" Cc: "David Chinner" , , , , , References: <20080328180736t-sato@mail.jp.nec.com> <20080331000057.GI108924158@sgi.com> In-Reply-To: <20080331000057.GI108924158@sgi.com> X-ASG-Orig-Subj: Re: [RFC PATCH 2/2] Add timeout feature Subject: Re: [RFC PATCH 2/2] Add timeout feature Date: Tue, 1 Apr 2008 19:54:42 +0900 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Windows Mail 6.0.6000.16480 X-MimeOLE: Produced By Microsoft MimeOLE V6.0.6000.16545 X-Barracuda-Connect: TYO201.gate.nec.co.jp[202.32.8.193] X-Barracuda-Start-Time: 1207047289 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46520 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15116 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: t-sato@yk.jp.nec.com Precedence: bulk X-list: xfs Hi, David Chinner wrote: > The timeout is not for the freeze operation - the timeout is > only set up once the freeze is complete. i.e: > > $ time sudo ~/test_src/xfs_io -f -x -c 'gfreeze 10' /mnt/scratch/test > freezing with level = 10 > > real 0m23.204s > user 0m0.008s > sys 0m0.012s > > The freeze takes 23s, and then the 10s timeout is started. So > this timeout does not protect against freeze_bdev() hangs at all. > All it does is introduce silent unfreezing of the block device that > can not be synchronised with the application that is operating > on the frozen device. Exactly my timeout feature is only for an application, not for freeze_bdev(). I think it is needed for the situation we can't unfreeze from userspace. (e.g. Freezing the root filesystem) > FWIW, resetting this timeout from userspace is unreliable - there's > no guarantee that under load your userspace process will get to run > again inside the timeout to reset it, hence leaving you with a > unfrozen filesystem when you really want it frozen... The timeout period specified to the reset ioctl should be much larger than the interval for calling the reset ioctl repeatedly. (e.g timeout period = 2 minutes, calling interval = 5 seconds) The reset ioctl will work under such setting. If a timeout still occurs before a reset, it would imply that an unexpected problem (e.g. deadlock) occur in an application. Cheers, Takashi From owner-xfs@oss.sgi.com Tue Apr 1 05:00:09 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 05:00:17 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.3 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.168.28]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31C072n001075 for ; Tue, 1 Apr 2008 05:00:09 -0700 X-ASG-Debug-ID: 1207051242-1da2025b0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from smtp7-g19.free.fr (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 589EC126E470 for ; Tue, 1 Apr 2008 05:00:42 -0700 (PDT) Received: from smtp7-g19.free.fr (smtp7-g19.free.fr [212.27.42.64]) by cuda.sgi.com with ESMTP id 2UuA7V0r8wBjabRR for ; Tue, 01 Apr 2008 05:00:42 -0700 (PDT) Received: from smtp7-g19.free.fr (localhost [127.0.0.1]) by smtp7-g19.free.fr (Postfix) with ESMTP id DB223322882; Tue, 1 Apr 2008 14:00:41 +0200 (CEST) Received: from galadriel.home (pla78-1-82-235-234-79.fbx.proxad.net [82.235.234.79]) by smtp7-g19.free.fr (Postfix) with ESMTP id 8F95632286A; Tue, 1 Apr 2008 14:00:41 +0200 (CEST) Date: Tue, 1 Apr 2008 14:00:35 +0200 From: Emmanuel Florac To: David Chinner Cc: xfs@oss.sgi.com X-ASG-Orig-Subj: Re: Serious XFS crash Subject: Re: Serious XFS crash Message-ID: <20080401140035.46470306@galadriel.home> In-Reply-To: <20080325233611.GW103491721@sgi.com> References: <20080325185453.3a1957dd@galadriel.home> <20080325233611.GW103491721@sgi.com> Organization: Intellique X-Mailer: Claws Mail 2.9.1 (GTK+ 2.8.20; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Barracuda-Connect: smtp7-g19.free.fr[212.27.42.64] X-Barracuda-Start-Time: 1207051243 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46522 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id m31C092n001077 X-archive-position: 15117 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: eflorac@intellique.com Precedence: bulk X-list: xfs Le Wed, 26 Mar 2008 10:36:11 +1100 vous écriviez: > What sector size is being used for the XFS filesystem? If it's > not the same as teh filesystem block size, then XFS can't have done > this itself because the offset that this garbage starts at would > not be block aligned..... I've gone thru the logs. This machine had a serious XFS crash on march 6 due to bad blocks (failed drive in the RAID-5). Is it possible that the March 19 XFS crash is related to this, i. e. after running xfs_repair on march 6 it remained some on-disk garbage that provoked a new crash a couple of weeks later? Here is the march 6 crash : Mar 6 10:42:46 system3 kernel: [xfs_alloc_read_agf+244/432] xfs_alloc_read_agf+0xf4/0x1b0 Mar 6 10:42:46 system3 kernel: [xfs_alloc_fix_freelist+1000/1120] xfs_alloc_fix_freelist+0x3e8/0x460 Mar 6 10:42:46 system3 last message repeated 2 times Mar 6 10:42:46 system3 kernel: [_xfs_trans_commit+489/928] _xfs_trans_commit+0x1e9/0x3a0 Mar 6 10:42:46 system3 kernel: [xfs_free_extent+152/224] xfs_free_extent+0x98/0xe0 Mar 6 10:42:46 system3 kernel: [xfs_bmap_finish+263/400] xfs_bmap_finish+0x107/0x190 Mar 6 10:42:46 system3 kernel: [xfs_itruncate_finish+544/976] xfs_itruncate_finish+0x220/0x3d0 Mar 6 10:42:46 system3 kernel: [xfs_trans_ijoin+43/128] xfs_trans_ijoin+0x2b/0x80 Mar 6 10:42:46 system3 kernel: [xfs_inactive+1195/1296] xfs_inactive+0x4ab/0x510 Mar 6 10:42:46 system3 kernel: [xfs_fs_clear_inode+156/192] xfs_fs_clear_inode+0x9c/0xc0 Mar 6 10:42:46 system3 kernel: [invalidate_inode_buffers+21/112] invalidate_inode_buffers+0x15/0x70 Mar 6 10:42:46 system3 kernel: [clear_inode+212/320] clear_inode+0xd4/0x140 Mar 6 10:42:46 system3 kernel: [truncate_inode_pages+23/32] truncate_inode_pages+0x17/0x20 Mar 6 10:42:46 system3 kernel: [generic_delete_inode+264/272] generic_delete_inode+0x108/0x110 Mar 6 10:42:46 system3 kernel: [iput+83/112] iput+0x53/0x70 Mar 6 10:42:46 system3 kernel: [do_unlinkat+186/272] do_unlinkat+0xba/0x110 Mar 6 10:42:46 system3 kernel: [sys_fcntl64+89/144] sys_fcntl64+0x59/0x90 Mar 6 10:42:46 system3 kernel: [syscall_call+7/11] syscall_call+0x7/0xb Mar 6 10:42:46 system3 kernel: xfs_force_shutdown(md0,0x8) called from line 4267 of file fs/xfs/xfs_bmap.c. Return address = 0xc0256b29 Mar 6 10:51:19 system3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=6, LBA=0xE6E00. Mar 6 10:51:20 system3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=6, LBA=0xE6DCA. -- -------------------------------------------------- Emmanuel Florac www.intellique.com -------------------------------------------------- From owner-xfs@oss.sgi.com Tue Apr 1 05:15:47 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 05:15:54 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.4 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_24, J_CHICKENPOX_53 autolearn=no version=3.3.0-r574664 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.168.29]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31CFimi003143 for ; Tue, 1 Apr 2008 05:15:47 -0700 X-ASG-Debug-ID: 1207052180-5002001d0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from tyo201.gate.nec.co.jp (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 2BDFD72047F for ; Tue, 1 Apr 2008 05:16:20 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by cuda.sgi.com with ESMTP id IGMg2GftSWpUw1iF for ; Tue, 01 Apr 2008 05:16:20 -0700 (PDT) Received: from mailgate4.nec.co.jp ([10.7.69.184]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CGIZw000059; Tue, 1 Apr 2008 21:16:18 +0900 (JST) Received: (from root@localhost) by mailgate4.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id m31CGIR06923; Tue, 1 Apr 2008 21:16:18 +0900 (JST) Received: from kaishu.jp.nec.com (kaishu.jp.nec.com [10.26.220.5]) by mailsv.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CGHkb009544; Tue, 1 Apr 2008 21:16:17 +0900 (JST) Received: from TNESB07336 ([10.64.168.65] [10.64.168.65]) by mail.jp.nec.com with ESMTP; Tue, 1 Apr 2008 21:16:15 +0900 To: David Chinner Cc: "linux-fsdevel@vger.kernel.org" , "linux-ext4@vger.kernel.org" , "xfs@oss.sgi.com" , "dm-devel@redhat.com" , "linux-kernel@vger.kernel.org" X-ASG-Orig-Subj: [RFC PATCH 0/3] freeze feature ver 1.1 Subject: [RFC PATCH 0/3] freeze feature ver 1.1 Message-Id: <20080401211614t-sato@mail.jp.nec.com> Mime-Version: 1.0 X-Mailer: WeMail32[2.51] ID:1K0086 From: Takashi Sato Date: Tue, 1 Apr 2008 21:16:14 +0900 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: TYO201.gate.nec.co.jp[202.32.8.193] X-Barracuda-Start-Time: 1207052181 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46524 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15118 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: t-sato@yk.jp.nec.com Precedence: bulk X-list: xfs Hi, David Chinner wrote: > Patch below to remove the XFS specific ioctl interfaces for this > functionality. Thank you very much for your patch to remove the XFS specific code. I have merged your patch into 2.6.25-rc7 and included it in the freeze patch-set as the following [PATCH 2/3]. [PATCH 1/3] Implement generic freeze feature The ioctls for the generic freeze feature are below. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, arg) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 o Unfreeze the filesystem int ioctl(int fd, int FITHAW, arg) fd: The file descriptor of the mountpoint FITHAW: request code for unfreeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 [PATCH 2/3] Remove XFS specific ioctl interfaces for freeze feature It removes XFS specific ioctl interfaces and request codes for freeze feature. This patch has been supplied by David Chinner. [PATCH 3/3] Add timeout feature The timeout feature is added to freeze ioctl. And new ioctl to reset the timeout period is added. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, long *timeval) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze timeval: the timeout period in seconds If it's 0 or 1, the timeout isn't set. This special case of "1" is implemented to keep the compatibility with XFS applications. Return value: 0 if the operation succeeds. Otherwise, -1 o Reset the timeout period This is useful for the application to set the timeval more accurately. For example, the freezer resets the timeval to 10 seconds every 5 seconds. In this approach, even if the freezer causes a deadlock by accessing the frozen filesystem, it will be solved by the timeout in 10 seconds and the freezer can recognize that at the next reset of timeval. int ioctl(int fd, int FIFREEZE_RESET_TIMEOUT, long *timeval) fd:file descriptor of mountpoint FIFREEZE_RESET_TIMEOUT: request code for reset of timeout period timeval: new timeout period in seconds Return value: 0 if the operation succeeds. Otherwise, -1 Error number: If the filesystem has already been unfrozen, errno is set to EINVAL. Any comments are very welcome. Cheers, Takashi From owner-xfs@oss.sgi.com Tue Apr 1 05:17:00 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 05:17:07 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_53 autolearn=no version=3.3.0-r574664 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.168.28]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31CGw9j003456 for ; Tue, 1 Apr 2008 05:17:00 -0700 X-ASG-Debug-ID: 1207052253-1da103a50000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from tyo201.gate.nec.co.jp (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 695FF126E715 for ; Tue, 1 Apr 2008 05:17:33 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by cuda.sgi.com with ESMTP id 5Y3e8tZGWUuXbusR for ; Tue, 01 Apr 2008 05:17:33 -0700 (PDT) Received: from mailgate3.nec.co.jp (mailgate54C.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CHWXX000909; Tue, 1 Apr 2008 21:17:32 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id m31CHWi24315; Tue, 1 Apr 2008 21:17:32 +0900 (JST) Received: from kuichi.jp.nec.com (kuichi.jp.nec.com [10.26.220.17]) by mailsv4.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CHWcX005370; Tue, 1 Apr 2008 21:17:32 +0900 (JST) Received: from TNESB07336 ([10.64.168.65] [10.64.168.65]) by mail.jp.nec.com with ESMTP; Tue, 1 Apr 2008 21:17:29 +0900 To: David Chinner Cc: "linux-ext4@vger.kernel.org" , "xfs@oss.sgi.com" , "linux-fsdevel@vger.kernel.org" , "dm-devel@redhat.com" , "linux-kernel@vger.kernel.org" X-ASG-Orig-Subj: [RFC PATCH 1/3] Implement generic freeze feature Subject: [RFC PATCH 1/3] Implement generic freeze feature Message-Id: <20080401211729t-sato@mail.jp.nec.com> Mime-Version: 1.0 X-Mailer: WeMail32[2.51] ID:1K0086 From: Takashi Sato Date: Tue, 1 Apr 2008 21:17:29 +0900 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: TYO201.gate.nec.co.jp[202.32.8.193] X-Barracuda-Start-Time: 1207052254 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -1.42 X-Barracuda-Spam-Status: No, SCORE=-1.42 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests=MARKETING_SUBJECT X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46525 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.60 MARKETING_SUBJECT Subject contains popular marketing words X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15119 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: t-sato@yk.jp.nec.com Precedence: bulk X-list: xfs The ioctls for the generic freeze feature are below. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, arg) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 o Unfreeze the filesystem int ioctl(int fd, int FITHAW, arg) fd: The file descriptor of the mountpoint FITHAW: request code for unfreeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 Signed-off-by: Takashi Sato Signed-off-by: Masayuki Hamaguchi --- fs/block_dev.c | 3 +++ fs/buffer.c | 25 +++++++++++++++++++++++++ fs/ioctl.c | 35 +++++++++++++++++++++++++++++++++++ fs/super.c | 32 +++++++++++++++++++++++++++++++- include/linux/fs.h | 7 +++++++ 5 files changed, 101 insertions(+), 1 deletion(-) diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7.org/fs/block_dev.c linux-2.6.25-rc7-freeze/fs/block_ dev.c --- linux-2.6.25-rc7.org/fs/block_dev.c 2008-03-26 10:38:14.000000000 +0900 +++ linux-2.6.25-rc7-freeze/fs/block_dev.c 2008-03-27 09:26:36.000000000 +0900 @@ -284,6 +284,9 @@ static void init_once(struct kmem_cache INIT_LIST_HEAD(&bdev->bd_holder_list); #endif inode_init_once(&ei->vfs_inode); + + /* Initialize semaphore for freeze. */ + sema_init(&bdev->bd_freeze_sem, 1); } static inline void __bd_forget(struct inode *inode) diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7.org/fs/buffer.c linux-2.6.25-rc7-freeze/fs/buffer.c --- linux-2.6.25-rc7.org/fs/buffer.c 2008-03-26 10:38:14.000000000 +0900 +++ linux-2.6.25-rc7-freeze/fs/buffer.c 2008-03-26 20:32:23.000000000 +0900 @@ -201,6 +201,19 @@ struct super_block *freeze_bdev(struct b { struct super_block *sb; + down(&bdev->bd_freeze_sem); + sb = get_super_without_lock(bdev); + + /* If super_block has been already frozen, return. */ + if (sb && sb->s_frozen != SB_UNFROZEN) { + put_super(sb); + up(&bdev->bd_freeze_sem); + return sb; + } + + if (sb) + put_super(sb); + down(&bdev->bd_mount_sem); sb = get_super(bdev); if (sb && !(sb->s_flags & MS_RDONLY)) { @@ -219,6 +232,9 @@ struct super_block *freeze_bdev(struct b } sync_blockdev(bdev); + + up(&bdev->bd_freeze_sem); + return sb; /* thaw_bdev releases s->s_umount and bd_mount_sem */ } EXPORT_SYMBOL(freeze_bdev); @@ -232,6 +248,13 @@ EXPORT_SYMBOL(freeze_bdev); */ void thaw_bdev(struct block_device *bdev, struct super_block *sb) { + down(&bdev->bd_freeze_sem); + + if (sb && sb->s_frozen == SB_UNFROZEN) { + up(&bdev->bd_freeze_sem); + return; + } + if (sb) { BUG_ON(sb->s_bdev != bdev); @@ -244,6 +267,8 @@ void thaw_bdev(struct block_device *bdev } up(&bdev->bd_mount_sem); + + up(&bdev->bd_freeze_sem); } EXPORT_SYMBOL(thaw_bdev); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7.org/fs/ioctl.c linux-2.6.25-rc7-freeze/fs/ioctl.c --- linux-2.6.25-rc7.org/fs/ioctl.c 2008-03-26 10:38:14.000000000 +0900 +++ linux-2.6.25-rc7-freeze/fs/ioctl.c 2008-03-26 20:22:17.000000000 +0900 @@ -13,6 +13,7 @@ #include #include #include +#include #include @@ -181,6 +182,40 @@ int do_vfs_ioctl(struct file *filp, unsi } else error = -ENOTTY; break; + + case FIFREEZE: { + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + + if (!capable(CAP_SYS_ADMIN)) { + error = -EPERM; + break; + } + + /* If filesystem doesn't support freeze feature, return. */ + if (sb->s_op->write_super_lockfs == NULL) { + error = -EINVAL; + break; + } + + /* Freeze. */ + freeze_bdev(sb->s_bdev); + + break; + } + + case FITHAW: { + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + + if (!capable(CAP_SYS_ADMIN)) { + error = -EPERM; + break; + } + + /* Thaw. */ + thaw_bdev(sb->s_bdev, sb); + break; + } + default: if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) error = file_ioctl(filp, cmd, arg); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7.org/fs/super.c linux-2.6.25-rc7-freeze/fs/super.c --- linux-2.6.25-rc7.org/fs/super.c 2008-03-26 10:38:14.000000000 +0900 +++ linux-2.6.25-rc7-freeze/fs/super.c 2008-03-26 20:23:21.000000000 +0900 @@ -154,7 +154,7 @@ int __put_super_and_need_restart(struct * Drops a temporary reference, frees superblock if there's no * references left. */ -static void put_super(struct super_block *sb) +void put_super(struct super_block *sb) { spin_lock(&sb_lock); __put_super(sb); @@ -507,6 +507,36 @@ rescan: EXPORT_SYMBOL(get_super); +/* + * get_super_without_lock - Get super_block from block_device without lock. + * @bdev: block device struct + * + * Scan the superblock list and finds the superblock of the file system + * mounted on the block device given. This doesn't lock anyone. + * %NULL is returned if no match is found. + */ +struct super_block *get_super_without_lock(struct block_device *bdev) +{ + struct super_block *sb; + + if (!bdev) + return NULL; + + spin_lock(&sb_lock); + list_for_each_entry(sb, &super_blocks, s_list) { + if (sb->s_bdev == bdev) { + if (sb->s_root) { + sb->s_count++; + spin_unlock(&sb_lock); + return sb; + } + } + } + spin_unlock(&sb_lock); + return NULL; +} +EXPORT_SYMBOL(get_super_without_lock); + struct super_block * user_get_super(dev_t dev) { struct super_block *sb; diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7.org/include/linux/fs.h linux-2.6.25-rc7-freeze/inclu de/linux/fs.h --- linux-2.6.25-rc7.org/include/linux/fs.h 2008-03-26 10:38:14.000000000 +0900 +++ linux-2.6.25-rc7-freeze/include/linux/fs.h 2008-03-26 20:27:44.000000000 +0900 @@ -223,6 +223,8 @@ extern int dir_notify_enable; #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */ #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ +#define FIFREEZE _IOWR('X', 119, int) /* Freeze */ +#define FITHAW _IOWR('X', 120, int) /* Thaw */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -548,6 +550,9 @@ struct block_device { * care to not mess up bd_private for that case. */ unsigned long bd_private; + + /* Semaphore for freeze */ + struct semaphore bd_freeze_sem; }; /* @@ -1926,7 +1931,9 @@ extern int do_vfs_ioctl(struct file *fil extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); extern struct file_system_type *get_fs_type(const char *name); +extern void put_super(struct super_block *sb); extern struct super_block *get_super(struct block_device *); +extern struct super_block *get_super_without_lock(struct block_device *); extern struct super_block *user_get_super(dev_t); extern void drop_super(struct super_block *sb); From owner-xfs@oss.sgi.com Tue Apr 1 05:17:36 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 05:17:48 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.168.28]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31CHWiL003648 for ; Tue, 1 Apr 2008 05:17:36 -0700 X-ASG-Debug-ID: 1207052287-27d8029e0000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from tyo202.gate.nec.co.jp (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id E13E4126E71C for ; Tue, 1 Apr 2008 05:18:08 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO202.gate.nec.co.jp [202.32.8.206]) by cuda.sgi.com with ESMTP id KCrfxBzQz1I9jofA for ; Tue, 01 Apr 2008 05:18:08 -0700 (PDT) Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.193]) by tyo202.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CI7fG016064; Tue, 1 Apr 2008 21:18:07 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id m31CI7S13363; Tue, 1 Apr 2008 21:18:07 +0900 (JST) Received: from kaishu.jp.nec.com (kaishu.jp.nec.com [10.26.220.5]) by mailsv4.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CI6fA005594; Tue, 1 Apr 2008 21:18:06 +0900 (JST) Received: from TNESB07336 ([10.64.168.65] [10.64.168.65]) by mail.jp.nec.com with ESMTP; Tue, 1 Apr 2008 21:18:06 +0900 To: David Chinner Cc: "linux-fsdevel@vger.kernel.org" , "linux-ext4@vger.kernel.org" , "xfs@oss.sgi.com" , "dm-devel@redhat.com" , "linux-kernel@vger.kernel.org" X-ASG-Orig-Subj: [RFC PATCH 2/3] Remove XFS specific ioctl interfaces for freeze feature Subject: [RFC PATCH 2/3] Remove XFS specific ioctl interfaces for freeze feature Message-Id: <20080401211806t-sato@mail.jp.nec.com> Mime-Version: 1.0 X-Mailer: WeMail32[2.51] ID:1K0086 From: Takashi Sato Date: Tue, 1 Apr 2008 21:18:05 +0900 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: TYO202.gate.nec.co.jp[202.32.8.206] X-Barracuda-Start-Time: 1207052288 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46525 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15120 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: t-sato@yk.jp.nec.com Precedence: bulk X-list: xfs It removes XFS specific ioctl interfaces and request codes for freeze feature. This patch has been supplied by David Chinner. Signed-off-by: Dave Chinner Signed-off-by: Takashi Sato --- linux-2.6/xfs_ioctl.c | 15 --------------- linux-2.6/xfs_ioctl32.c | 2 -- xfs_fs.h | 4 ++-- 3 files changed, 2 insertions(+), 19 deletions(-) diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-freeze/fs/xfs/linux-2.6/xfs_ioctl.c linux-2.6.25-rc7 -xfs/fs/xfs/linux-2.6/xfs_ioctl.c --- linux-2.6.25-rc7-freeze/fs/xfs/linux-2.6/xfs_ioctl.c 2008-04-01 13:22:43.000000000 +0900 +++ linux-2.6.25-rc7-xfs/fs/xfs/linux-2.6/xfs_ioctl.c 2008-04-01 14:34:21.000000000 +0900 @@ -906,21 +906,6 @@ xfs_ioctl( return -error; } - case XFS_IOC_FREEZE: - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (inode->i_sb->s_frozen == SB_UNFROZEN) - freeze_bdev(inode->i_sb->s_bdev); - return 0; - - case XFS_IOC_THAW: - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - if (inode->i_sb->s_frozen != SB_UNFROZEN) - thaw_bdev(inode->i_sb->s_bdev, inode->i_sb); - return 0; - case XFS_IOC_GOINGDOWN: { __uint32_t in; diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-freeze/fs/xfs/linux-2.6/xfs_ioctl32.c linux-2.6.25-r c7-xfs/fs/xfs/linux-2.6/xfs_ioctl32.c --- linux-2.6.25-rc7-freeze/fs/xfs/linux-2.6/xfs_ioctl32.c 2008-04-01 13:22:43.000000000 +0900 +++ linux-2.6.25-rc7-xfs/fs/xfs/linux-2.6/xfs_ioctl32.c 2008-04-01 14:31:59.000000000 +0900 @@ -398,8 +398,6 @@ xfs_compat_ioctl( case XFS_IOC_FSGROWFSDATA: case XFS_IOC_FSGROWFSLOG: case XFS_IOC_FSGROWFSRT: - case XFS_IOC_FREEZE: - case XFS_IOC_THAW: case XFS_IOC_GOINGDOWN: case XFS_IOC_ERROR_INJECTION: case XFS_IOC_ERROR_CLEARALL: diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-freeze/fs/xfs/xfs_fs.h linux-2.6.25-rc7-xfs/fs/xfs/x fs_fs.h --- linux-2.6.25-rc7-freeze/fs/xfs/xfs_fs.h 2008-04-01 13:22:48.000000000 +0900 +++ linux-2.6.25-rc7-xfs/fs/xfs/xfs_fs.h 2008-04-01 14:31:59.000000000 +0900 @@ -473,8 +473,8 @@ typedef struct xfs_handle { #define XFS_IOC_ERROR_INJECTION _IOW ('X', 116, struct xfs_error_injection) #define XFS_IOC_ERROR_CLEARALL _IOW ('X', 117, struct xfs_error_injection) /* XFS_IOC_ATTRCTL_BY_HANDLE -- deprecated 118 */ -#define XFS_IOC_FREEZE _IOWR('X', 119, int) -#define XFS_IOC_THAW _IOWR('X', 120, int) +/* XFS_IOC_FREEZE -- FIFREEZE 119 */ +/* XFS_IOC_THAW -- FITHAW 120 */ #define XFS_IOC_FSSETDM_BY_HANDLE _IOW ('X', 121, struct xfs_fsop_setdm_handlereq) #define XFS_IOC_ATTRLIST_BY_HANDLE _IOW ('X', 122, struct xfs_fsop_attrlist_handlereq) #define XFS_IOC_ATTRMULTI_BY_HANDLE _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq) From owner-xfs@oss.sgi.com Tue Apr 1 05:21:31 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 05:21:39 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.4 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_24, J_CHICKENPOX_53 autolearn=no version=3.3.0-r574664 Received: from cuda.sgi.com (cuda1.sgi.com [192.48.168.28]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m31CLVf3004684 for ; Tue, 1 Apr 2008 05:21:31 -0700 X-ASG-Debug-ID: 1207052525-27d802d30000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from tyo201.gate.nec.co.jp (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id EFA11126E7E7 for ; Tue, 1 Apr 2008 05:22:06 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [202.32.8.193]) by cuda.sgi.com with ESMTP id 4l3t4s8J5i8Bqj2k for ; Tue, 01 Apr 2008 05:22:06 -0700 (PDT) Received: from mailgate3.nec.co.jp (mailgate54C.nec.co.jp [10.7.69.197]) by tyo201.gate.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CM58p003950; Tue, 1 Apr 2008 21:22:05 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id m31CM5j28845; Tue, 1 Apr 2008 21:22:05 +0900 (JST) Received: from shoin.jp.nec.com (shoin.jp.nec.com [10.26.220.3]) by mailsv4.nec.co.jp (8.13.8/8.13.4) with ESMTP id m31CM5JV007657; Tue, 1 Apr 2008 21:22:05 +0900 (JST) Received: from TNESB07336 ([10.64.168.65] [10.64.168.65]) by mail.jp.nec.com with ESMTP; Tue, 1 Apr 2008 21:22:04 +0900 To: David Chinner Cc: "linux-ext4@vger.kernel.org" , "xfs@oss.sgi.com" , "dm-devel@redhat.com" , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" X-ASG-Orig-Subj: [RFC PATCH 3/3] Add timeout feature Subject: [RFC PATCH 3/3] Add timeout feature Message-Id: <20080401212204t-sato@mail.jp.nec.com> Mime-Version: 1.0 X-Mailer: WeMail32[2.51] ID:1K0086 From: Takashi Sato Date: Tue, 1 Apr 2008 21:22:04 +0900 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Barracuda-Connect: TYO201.gate.nec.co.jp[202.32.8.193] X-Barracuda-Start-Time: 1207052526 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46525 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15121 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: t-sato@yk.jp.nec.com Precedence: bulk X-list: xfs The timeout feature is added to freeze ioctl. And new ioctl to reset the timeout period is added. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, long *timeval) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze timeval: the timeout period in seconds If it's 0 or 1, the timeout isn't set. This special case of "1" is implemented to keep the compatibility with XFS applications. Return value: 0 if the operation succeeds. Otherwise, -1 o Reset the timeout period int ioctl(int fd, int FIFREEZE_RESET_TIMEOUT, long *timeval) fd:file descriptor of mountpoint FIFREEZE_RESET_TIMEOUT: request code for reset of timeout period timeval: new timeout period in seconds Return value: 0 if the operation succeeds. Otherwise, -1 Error number: If the filesystem has already been unfrozen, errno is set to EINVAL. Signed-off-by: Takashi Sato Signed-off-by: Masayuki Hamaguchi --- drivers/md/dm.c | 2 - fs/block_dev.c | 2 + fs/buffer.c | 14 ++++++++- fs/ioctl.c | 64 +++++++++++++++++++++++++++++++++++++++++++- fs/super.c | 52 +++++++++++++++++++++++++++++++++++ fs/xfs/xfs_fsops.c | 2 - include/linux/buffer_head.h | 2 - include/linux/fs.h | 8 +++++ 8 files changed, 140 insertions(+), 6 deletions(-) diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/drivers/md/dm.c linux-2.6.25-rc7-timeout/drivers /md/dm.c --- linux-2.6.25-rc7-xfs/drivers/md/dm.c 2008-04-01 14:21:37.000000000 +0900 +++ linux-2.6.25-rc7-timeout/drivers/md/dm.c 2008-04-01 13:25:11.000000000 +0900 @@ -1407,7 +1407,7 @@ static int lock_fs(struct mapped_device WARN_ON(md->frozen_sb); - md->frozen_sb = freeze_bdev(md->suspended_bdev); + md->frozen_sb = freeze_bdev(md->suspended_bdev, 0); if (IS_ERR(md->frozen_sb)) { r = PTR_ERR(md->frozen_sb); md->frozen_sb = NULL; diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/fs/block_dev.c linux-2.6.25-rc7-timeout/fs/block _dev.c --- linux-2.6.25-rc7-xfs/fs/block_dev.c 2008-04-01 14:22:34.000000000 +0900 +++ linux-2.6.25-rc7-timeout/fs/block_dev.c 2008-04-01 13:27:38.000000000 +0900 @@ -287,6 +287,8 @@ static void init_once(struct kmem_cache /* Initialize semaphore for freeze. */ sema_init(&bdev->bd_freeze_sem, 1); + /* Setup freeze timeout function. */ + INIT_DELAYED_WORK(&bdev->bd_freeze_timeout, freeze_timeout); } static inline void __bd_forget(struct inode *inode) diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/fs/buffer.c linux-2.6.25-rc7-timeout/fs/buffer.c --- linux-2.6.25-rc7-xfs/fs/buffer.c 2008-04-01 14:22:26.000000000 +0900 +++ linux-2.6.25-rc7-timeout/fs/buffer.c 2008-04-01 13:27:14.000000000 +0900 @@ -190,14 +190,17 @@ int fsync_bdev(struct block_device *bdev /** * freeze_bdev -- lock a filesystem and force it into a consistent state - * @bdev: blockdevice to lock + * @bdev: blockdevice to lock + * @timeout_msec: timeout period * * This takes the block device bd_mount_sem to make sure no new mounts * happen on bdev until thaw_bdev() is called. * If a superblock is found on this device, we take the s_umount semaphore * on it to make sure nobody unmounts until the snapshot creation is done. + * If timeout_msec is bigger than 0, this registers the delayed work for + * timeout of the freeze feature. */ -struct super_block *freeze_bdev(struct block_device *bdev) +struct super_block *freeze_bdev(struct block_device *bdev, long timeout_msec) { struct super_block *sb; @@ -233,6 +236,10 @@ struct super_block *freeze_bdev(struct b sync_blockdev(bdev); + /* Setup unfreeze timer. */ + if (timeout_msec > 0) + add_freeze_timeout(bdev, timeout_msec); + up(&bdev->bd_freeze_sem); return sb; /* thaw_bdev releases s->s_umount and bd_mount_sem */ @@ -255,6 +262,9 @@ void thaw_bdev(struct block_device *bdev return; } + /* Delete unfreeze timer. */ + del_freeze_timeout(bdev); + if (sb) { BUG_ON(sb->s_bdev != bdev); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/fs/ioctl.c linux-2.6.25-rc7-timeout/fs/ioctl.c --- linux-2.6.25-rc7-xfs/fs/ioctl.c 2008-04-01 14:22:38.000000000 +0900 +++ linux-2.6.25-rc7-timeout/fs/ioctl.c 2008-04-01 13:27:46.000000000 +0900 @@ -184,6 +184,8 @@ int do_vfs_ioctl(struct file *filp, unsi break; case FIFREEZE: { + long timeout_sec; + long timeout_msec; struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; if (!capable(CAP_SYS_ADMIN)) { @@ -197,8 +199,31 @@ int do_vfs_ioctl(struct file *filp, unsi break; } + /* arg(sec) to tick value. */ + error = get_user(timeout_sec, (long __user *) arg); + if (error != 0) + break; + /* + * If 1 is specified as the timeout period, + * it will be changed into 0 to keep the compatibility + * of XFS application(xfs_freeze). + */ + if (timeout_sec < 0) { + error = -EINVAL; + break; + } else if (timeout_sec < 2) { + timeout_sec = 0; + } + + timeout_msec = timeout_sec * 1000; + /* overflow case */ + if (timeout_msec < 0) { + error = -EINVAL; + break; + } + /* Freeze. */ - freeze_bdev(sb->s_bdev); + freeze_bdev(sb->s_bdev, timeout_msec); break; } @@ -216,6 +241,43 @@ int do_vfs_ioctl(struct file *filp, unsi break; } + case FIFREEZE_RESET_TIMEOUT: { + long timeout_sec; + long timeout_msec; + struct super_block *sb + = filp->f_path.dentry->d_inode->i_sb; + + if (!capable(CAP_SYS_ADMIN)) { + error = -EPERM; + break; + } + + /* arg(sec) to tick value */ + error = get_user(timeout_sec, (long __user *) arg); + if (error) + break; + timeout_msec = timeout_sec * 1000; + if (timeout_msec < 0) { + error = -EINVAL; + break; + } + + if (sb) { + down(&sb->s_bdev->bd_freeze_sem); + if (sb->s_frozen == SB_UNFROZEN) { + up(&sb->s_bdev->bd_freeze_sem); + error = -EINVAL; + break; + } + /* setup unfreeze timer */ + if (timeout_msec > 0) + add_freeze_timeout(sb->s_bdev, + timeout_msec); + up(&sb->s_bdev->bd_freeze_sem); + } + break; + } + default: if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) error = file_ioctl(filp, cmd, arg); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/fs/super.c linux-2.6.25-rc7-timeout/fs/super.c --- linux-2.6.25-rc7-xfs/fs/super.c 2008-04-01 14:22:34.000000000 +0900 +++ linux-2.6.25-rc7-timeout/fs/super.c 2008-04-01 13:27:41.000000000 +0900 @@ -983,3 +983,55 @@ struct vfsmount *kern_mount_data(struct } EXPORT_SYMBOL_GPL(kern_mount_data); + +/* + * freeze_timeout - Thaw the filesystem. + * + * @work: work queue (delayed_work.work) + * + * Called by the delayed work when elapsing the timeout period. + * Thaw the filesystem. + */ +void freeze_timeout(struct work_struct *work) +{ + struct block_device *bd = container_of(work, + struct block_device, bd_freeze_timeout.work); + + struct super_block *sb = get_super_without_lock(bd); + + thaw_bdev(bd, sb); + + if (sb) + put_super(sb); +} +EXPORT_SYMBOL_GPL(freeze_timeout); + +/* + * add_freeze_timeout - Add timeout for freeze. + * + * @bdev: block device struct + * @timeout_msec: timeout period + * + * Add the delayed work for freeze timeout to the delayed work queue. + */ +void add_freeze_timeout(struct block_device *bdev, long timeout_msec) +{ + s64 timeout_jiffies = msecs_to_jiffies(timeout_msec); + + /* Set delayed work queue */ + cancel_delayed_work(&bdev->bd_freeze_timeout); + schedule_delayed_work(&bdev->bd_freeze_timeout, timeout_jiffies); +} + +/* + * del_freeze_timeout - Delete timeout for freeze. + * + * @bdev: block device struct + * + * Delete the delayed work for freeze timeout from the delayed work queue. + */ +void del_freeze_timeout(struct block_device *bdev) +{ + if (delayed_work_pending(&bdev->bd_freeze_timeout)) + cancel_delayed_work(&bdev->bd_freeze_timeout); +} diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/fs/xfs/xfs_fsops.c linux-2.6.25-rc7-timeout/fs/x fs/xfs_fsops.c --- linux-2.6.25-rc7-xfs/fs/xfs/xfs_fsops.c 2008-04-01 14:22:33.000000000 +0900 +++ linux-2.6.25-rc7-timeout/fs/xfs/xfs_fsops.c 2008-04-01 13:27:35.000000000 +0900 @@ -623,7 +623,7 @@ xfs_fs_goingdown( { switch (inflags) { case XFS_FSOP_GOING_FLAGS_DEFAULT: { - struct super_block *sb = freeze_bdev(mp->m_super->s_bdev); + struct super_block *sb = freeze_bdev(mp->m_super->s_bdev, 0); if (sb && !IS_ERR(sb)) { xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/include/linux/buffer_head.h linux-2.6.25-rc7-tim eout/include/linux/buffer_head.h --- linux-2.6.25-rc7-xfs/include/linux/buffer_head.h 2008-04-01 14:22:39.000000000 +0900 +++ linux-2.6.25-rc7-timeout/include/linux/buffer_head.h 2008-04-01 13:27:53.000000000 +0900 @@ -170,7 +170,7 @@ int sync_blockdev(struct block_device *b void __wait_on_buffer(struct buffer_head *); wait_queue_head_t *bh_waitq_head(struct buffer_head *bh); int fsync_bdev(struct block_device *); -struct super_block *freeze_bdev(struct block_device *); +struct super_block *freeze_bdev(struct block_device *, long timeout_msec); void thaw_bdev(struct block_device *, struct super_block *); int fsync_super(struct super_block *); int fsync_no_super(struct block_device *); diff -uprN -X /home/sho/pub/MC/freeze-set/dontdiff linux-2.6.25-rc7-xfs/include/linux/fs.h linux-2.6.25-rc7-timeout/incl ude/linux/fs.h --- linux-2.6.25-rc7-xfs/include/linux/fs.h 2008-04-01 14:22:39.000000000 +0900 +++ linux-2.6.25-rc7-timeout/include/linux/fs.h 2008-04-01 13:27:53.000000000 +0900 @@ -8,6 +8,7 @@ #include #include +#include /* * It's silly to have NR_OPEN bigger than NR_FILE, but you can change @@ -225,6 +226,7 @@ extern int dir_notify_enable; #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FIFREEZE_RESET_TIMEOUT _IO(0x00, 3) /* Reset freeze timeout */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -551,6 +553,8 @@ struct block_device { */ unsigned long bd_private; + /* Delayed work for freeze */ + struct delayed_work bd_freeze_timeout; /* Semaphore for freeze */ struct semaphore bd_freeze_sem; }; @@ -2104,5 +2108,9 @@ int proc_nr_files(struct ctl_table *tabl int get_filesystem_list(char * buf); +extern void add_freeze_timeout(struct block_device *bdev, long timeout_msec); +extern void del_freeze_timeout(struct block_device *bdev); +extern void freeze_timeout(struct work_struct *work); + #endif /* __KERNEL__ */ #endif /* _LINUX_FS_H */ From owner-xfs@oss.sgi.com Tue Apr 1 18:11:03 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:24 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_64 autolearn=no version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aogu014887 for ; Tue, 1 Apr 2008 18:11:01 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA10735; Wed, 2 Apr 2008 09:15:53 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31NFqsT117416482; Wed, 2 Apr 2008 09:15:53 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31NFqKd118679480; Wed, 2 Apr 2008 09:15:52 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:15:52 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: [Patch] Cacheline align xlog_t Message-ID: <20080401231552.GV103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15129 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Reorganise xlog_t for better cacheline isolation of contention To reduce contention on the log in large CPU count, separate out different parts of the xlog_t structure onto different cachelines. Move each lock onto a different cacheline along with all the members that are accessed/modified while that lock is held. Also, move the debugging code into debug code. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 5 +--- fs/xfs/xfs_log_priv.h | 55 +++++++++++++++++++++++++++----------------------- 2 files changed, 32 insertions(+), 28 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2008-03-13 14:03:38.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2008-03-13 14:20:21.803846380 +1100 @@ -1237,9 +1237,9 @@ xlog_alloc_log(xfs_mount_t *mp, XFS_BUF_SET_FSPRIVATE2(bp, (unsigned long)1); iclog->ic_bp = bp; iclog->hic_data = bp->b_addr; - +#ifdef DEBUG log->l_iclog_bak[i] = (xfs_caddr_t)&(iclog->ic_header); - +#endif head = &iclog->ic_header; memset(head, 0, sizeof(xlog_rec_header_t)); head->h_magicno = cpu_to_be32(XLOG_HEADER_MAGIC_NUM); @@ -1250,7 +1250,6 @@ xlog_alloc_log(xfs_mount_t *mp, head->h_fmt = cpu_to_be32(XLOG_FMT); memcpy(&head->h_fs_uuid, &mp->m_sb.sb_uuid, sizeof(uuid_t)); - iclog->ic_size = XFS_BUF_SIZE(bp) - log->l_iclog_hsize; iclog->ic_state = XLOG_STATE_ACTIVE; iclog->ic_log = log; Index: 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_priv.h 2008-03-13 14:06:58.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h 2008-03-13 14:20:31.478596832 +1100 @@ -402,8 +402,29 @@ typedef struct xlog_in_core { * that round off problems won't occur when releasing partial reservations. */ typedef struct log { + /* The following fields don't need locking */ + struct xfs_mount *l_mp; /* mount point */ + struct xfs_buf *l_xbuf; /* extra buffer for log + * wrapping */ + struct xfs_buftarg *l_targ; /* buftarg of log */ + uint l_flags; + uint l_quotaoffs_flag; /* XFS_DQ_*, for QUOTAOFFs */ + struct xfs_buf_cancel **l_buf_cancel_table; + int l_iclog_hsize; /* size of iclog header */ + int l_iclog_heads; /* # of iclog header sectors */ + uint l_sectbb_log; /* log2 of sector size in BBs */ + uint l_sectbb_mask; /* sector size (in BBs) + * alignment mask */ + int l_iclog_size; /* size of log in bytes */ + int l_iclog_size_log; /* log power size of log */ + int l_iclog_bufs; /* number of iclog buffers */ + xfs_daddr_t l_logBBstart; /* start block of log */ + int l_logsize; /* size of log in bytes */ + int l_logBBsize; /* size of log in BB chunks */ + /* The following block of fields are changed while holding icloglock */ - sema_t l_flushsema; /* iclog flushing semaphore */ + sema_t l_flushsema ____cacheline_aligned_in_smp; + /* iclog flushing semaphore */ int l_flushcnt; /* # of procs waiting on this * sema */ int l_covered_state;/* state of "covering disk @@ -413,27 +434,14 @@ typedef struct log { xfs_lsn_t l_tail_lsn; /* lsn of 1st LR with unflushed * buffers */ xfs_lsn_t l_last_sync_lsn;/* lsn of last LR on disk */ - struct xfs_mount *l_mp; /* mount point */ - struct xfs_buf *l_xbuf; /* extra buffer for log - * wrapping */ - struct xfs_buftarg *l_targ; /* buftarg of log */ - xfs_daddr_t l_logBBstart; /* start block of log */ - int l_logsize; /* size of log in bytes */ - int l_logBBsize; /* size of log in BB chunks */ int l_curr_cycle; /* Cycle number of log writes */ int l_prev_cycle; /* Cycle number before last * block increment */ int l_curr_block; /* current logical log block */ int l_prev_block; /* previous logical log block */ - int l_iclog_size; /* size of log in bytes */ - int l_iclog_size_log; /* log power size of log */ - int l_iclog_bufs; /* number of iclog buffers */ - - /* The following field are used for debugging; need to hold icloglock */ - char *l_iclog_bak[XLOG_MAX_ICLOGS]; /* The following block of fields are changed while holding grant_lock */ - spinlock_t l_grant_lock; + spinlock_t l_grant_lock ____cacheline_aligned_in_smp; xlog_ticket_t *l_reserve_headq; xlog_ticket_t *l_write_headq; int l_grant_reserve_cycle; @@ -441,20 +449,17 @@ typedef struct log { int l_grant_write_cycle; int l_grant_write_bytes; - /* The following fields don't need locking */ #ifdef XFS_LOG_TRACE struct ktrace *l_trace; struct ktrace *l_grant_trace; #endif - uint l_flags; - uint l_quotaoffs_flag; /* XFS_DQ_*, for QUOTAOFFs */ - struct xfs_buf_cancel **l_buf_cancel_table; - int l_iclog_hsize; /* size of iclog header */ - int l_iclog_heads; /* # of iclog header sectors */ - uint l_sectbb_log; /* log2 of sector size in BBs */ - uint l_sectbb_mask; /* sector size (in BBs) - * alignment mask */ -} xlog_t; + + /* The following field are used for debugging; need to hold icloglock */ +#ifdef DEBUG + char *l_iclog_bak[XLOG_MAX_ICLOGS]; +#endif + +} xlog_t ____cacheline_aligned_in_smp; #define XLOG_FORCED_SHUTDOWN(log) ((log)->l_flags & XLOG_IO_ERROR) From owner-xfs@oss.sgi.com Tue Apr 1 18:10:57 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:13 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aogq014887 for ; Tue, 1 Apr 2008 18:10:55 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA12833; Wed, 2 Apr 2008 10:17:59 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m320HwsT118671985; Wed, 2 Apr 2008 10:17:59 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m320HwEw118432079; Wed, 2 Apr 2008 10:17:58 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 10:17:58 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: [Patch] fix lock inversion in forced unmount Message-ID: <20080402001758.GY103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15127 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Fix lock inversion in forced shutdown. Recent changes to xlog_state_release_iclog() placed the grant_lock inside the icloglock. forced unmount of the log does this the opposite way around, but does not depend on the order for correct working. Fix the inversion by changing the order locks are gained in xfs_log_force_umount(). Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2008-04-01 21:00:13.000000000 +1000 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2008-04-02 08:35:20.282633878 +1000 @@ -3502,8 +3502,8 @@ xfs_log_force_umount( * before we mark the filesystem SHUTDOWN and wake * everybody up to tell the bad news. */ - spin_lock(&log->l_grant_lock); spin_lock(&log->l_icloglock); + spin_lock(&log->l_grant_lock); mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN; XFS_BUF_DONE(mp->m_sb_bp); /* From owner-xfs@oss.sgi.com Tue Apr 1 18:11:10 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:41 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-0.4 required=5.0 tests=AWL,BAYES_00,J_CHICKENPOX_52, SUBJ_TICKET autolearn=no version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aoh0014887 for ; Tue, 1 Apr 2008 18:11:08 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA10686; Wed, 2 Apr 2008 09:14:40 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31NEdsT118620777; Wed, 2 Apr 2008 09:14:39 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31NEdLb118697076; Wed, 2 Apr 2008 09:14:39 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:14:39 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: [Patch] Remove xlog_ticket allocator Message-ID: <20080401231439.GU103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15132 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Remove the xlog_ticket allocator The ticket allocator is just a simple slab implementation internal to the log. It requires the icloglock to be held when manipulating it and this contributes to contention on that lock. Just kill the entire allocator and use a memory zone instead. While there, allow us to gracefully fail allocation with ENOMEM. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 137 ++++---------------------------------------------- fs/xfs/xfs_log_priv.h | 9 +-- fs/xfs/xfs_vfsops.c | 12 ++-- fs/xfs/xfsidbg.c | 11 +--- 4 files changed, 25 insertions(+), 144 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2008-03-13 13:58:08.866070224 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2008-03-13 14:03:38.448138656 +1100 @@ -41,6 +41,7 @@ #include "xfs_inode.h" #include "xfs_rw.h" +kmem_zone_t *xfs_log_ticket_zone; #define xlog_write_adv_cnt(ptr, len, off, bytes) \ { (ptr) += (bytes); \ @@ -73,8 +74,6 @@ STATIC int xlog_state_get_iclog_space(x xlog_ticket_t *ticket, int *continued_write, int *logoffsetp); -STATIC void xlog_state_put_ticket(xlog_t *log, - xlog_ticket_t *tic); STATIC int xlog_state_release_iclog(xlog_t *log, xlog_in_core_t *iclog); STATIC void xlog_state_switch_iclogs(xlog_t *log, @@ -101,7 +100,6 @@ STATIC void xlog_ungrant_log_space(xlog_ /* local ticket functions */ -STATIC void xlog_state_ticket_alloc(xlog_t *log); STATIC xlog_ticket_t *xlog_ticket_get(xlog_t *log, int unit_bytes, int count, @@ -330,7 +328,7 @@ xfs_log_done(xfs_mount_t *mp, */ xlog_trace_loggrant(log, ticket, "xfs_log_done: (non-permanent)"); xlog_ungrant_log_space(log, ticket); - xlog_state_put_ticket(log, ticket); + xlog_ticket_put(log, ticket); } else { xlog_trace_loggrant(log, ticket, "xfs_log_done: (permanent)"); xlog_regrant_reserve_log_space(log, ticket); @@ -469,6 +467,8 @@ xfs_log_reserve(xfs_mount_t *mp, /* may sleep if need to allocate more tickets */ internal_ticket = xlog_ticket_get(log, unit_bytes, cnt, client, flags); + if (!internal_ticket) + return XFS_ERROR(ENOMEM); internal_ticket->t_trans_type = t_type; *ticket = internal_ticket; xlog_trace_loggrant(log, internal_ticket, @@ -693,7 +693,7 @@ xfs_log_unmount_write(xfs_mount_t *mp) if (tic) { xlog_trace_loggrant(log, tic, "unmount rec"); xlog_ungrant_log_space(log, tic); - xlog_state_put_ticket(log, tic); + xlog_ticket_put(log, tic); } } else { /* @@ -1208,7 +1208,6 @@ xlog_alloc_log(xfs_mount_t *mp, spin_lock_init(&log->l_icloglock); spin_lock_init(&log->l_grant_lock); initnsema(&log->l_flushsema, 0, "ic-flush"); - xlog_state_ticket_alloc(log); /* wait until after icloglock inited */ /* log record size must be multiple of BBSIZE; see xlog_rec_header_t */ ASSERT((XFS_BUF_SIZE(bp) & BBMASK) == 0); @@ -1541,7 +1540,6 @@ STATIC void xlog_dealloc_log(xlog_t *log) { xlog_in_core_t *iclog, *next_iclog; - xlog_ticket_t *tic, *next_tic; int i; iclog = log->l_iclog; @@ -1562,22 +1560,6 @@ xlog_dealloc_log(xlog_t *log) spinlock_destroy(&log->l_icloglock); spinlock_destroy(&log->l_grant_lock); - /* XXXsup take a look at this again. */ - if ((log->l_ticket_cnt != log->l_ticket_tcnt) && - !XLOG_FORCED_SHUTDOWN(log)) { - xfs_fs_cmn_err(CE_WARN, log->l_mp, - "xlog_dealloc_log: (cnt: %d, total: %d)", - log->l_ticket_cnt, log->l_ticket_tcnt); - /* ASSERT(log->l_ticket_cnt == log->l_ticket_tcnt); */ - - } else { - tic = log->l_unmount_free; - while (tic) { - next_tic = tic->t_next; - kmem_free(tic, PAGE_SIZE); - tic = next_tic; - } - } xfs_buf_free(log->l_xbuf); #ifdef XFS_LOG_TRACE if (log->l_trace != NULL) { @@ -2798,18 +2780,6 @@ xlog_ungrant_log_space(xlog_t *log, /* - * Atomically put back used ticket. - */ -STATIC void -xlog_state_put_ticket(xlog_t *log, - xlog_ticket_t *tic) -{ - spin_lock(&log->l_icloglock); - xlog_ticket_put(log, tic); - spin_unlock(&log->l_icloglock); -} /* xlog_state_put_ticket */ - -/* * Flush iclog to disk if this is the last reference to the given iclog and * the WANT_SYNC bit is set. * @@ -3179,92 +3149,19 @@ xlog_state_want_sync(xlog_t *log, xlog_i */ /* - * Algorithm doesn't take into account page size. ;-( - */ -STATIC void -xlog_state_ticket_alloc(xlog_t *log) -{ - xlog_ticket_t *t_list; - xlog_ticket_t *next; - xfs_caddr_t buf; - uint i = (PAGE_SIZE / sizeof(xlog_ticket_t)) - 2; - - /* - * The kmem_zalloc may sleep, so we shouldn't be holding the - * global lock. XXXmiken: may want to use zone allocator. - */ - buf = (xfs_caddr_t) kmem_zalloc(PAGE_SIZE, KM_SLEEP); - - spin_lock(&log->l_icloglock); - - /* Attach 1st ticket to Q, so we can keep track of allocated memory */ - t_list = (xlog_ticket_t *)buf; - t_list->t_next = log->l_unmount_free; - log->l_unmount_free = t_list++; - log->l_ticket_cnt++; - log->l_ticket_tcnt++; - - /* Next ticket becomes first ticket attached to ticket free list */ - if (log->l_freelist != NULL) { - ASSERT(log->l_tail != NULL); - log->l_tail->t_next = t_list; - } else { - log->l_freelist = t_list; - } - log->l_ticket_cnt++; - log->l_ticket_tcnt++; - - /* Cycle through rest of alloc'ed memory, building up free Q */ - for ( ; i > 0; i--) { - next = t_list + 1; - t_list->t_next = next; - t_list = next; - log->l_ticket_cnt++; - log->l_ticket_tcnt++; - } - t_list->t_next = NULL; - log->l_tail = t_list; - spin_unlock(&log->l_icloglock); -} /* xlog_state_ticket_alloc */ - - -/* - * Put ticket into free list - * - * Assumption: log lock is held around this call. + * Free a used ticket. */ STATIC void xlog_ticket_put(xlog_t *log, xlog_ticket_t *ticket) { sv_destroy(&ticket->t_sema); - - /* - * Don't think caching will make that much difference. It's - * more important to make debug easier. - */ -#if 0 - /* real code will want to use LIFO for caching */ - ticket->t_next = log->l_freelist; - log->l_freelist = ticket; - /* no need to clear fields */ -#else - /* When we debug, it is easier if tickets are cycled */ - ticket->t_next = NULL; - if (log->l_tail) { - log->l_tail->t_next = ticket; - } else { - ASSERT(log->l_freelist == NULL); - log->l_freelist = ticket; - } - log->l_tail = ticket; -#endif /* DEBUG */ - log->l_ticket_cnt++; + kmem_zone_free(xfs_log_ticket_zone, ticket); } /* xlog_ticket_put */ /* - * Grab ticket off freelist or allocation some more + * Allocate and initialise a new log ticket. */ STATIC xlog_ticket_t * xlog_ticket_get(xlog_t *log, @@ -3276,21 +3173,9 @@ xlog_ticket_get(xlog_t *log, xlog_ticket_t *tic; uint num_headers; - alloc: - if (log->l_freelist == NULL) - xlog_state_ticket_alloc(log); /* potentially sleep */ - - spin_lock(&log->l_icloglock); - if (log->l_freelist == NULL) { - spin_unlock(&log->l_icloglock); - goto alloc; - } - tic = log->l_freelist; - log->l_freelist = tic->t_next; - if (log->l_freelist == NULL) - log->l_tail = NULL; - log->l_ticket_cnt--; - spin_unlock(&log->l_icloglock); + tic = kmem_zone_zalloc(xfs_log_ticket_zone, KM_SLEEP|KM_MAYFAIL); + if (!tic) + return NULL; /* * Permanent reservations have up to 'cnt'-1 active log operations Index: 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_priv.h 2008-03-13 13:59:10.806160556 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h 2008-03-13 14:06:58.110733971 +1100 @@ -242,7 +242,7 @@ typedef struct xlog_res { typedef struct xlog_ticket { sv_t t_sema; /* sleep on this semaphore : 20 */ - struct xlog_ticket *t_next; /* :4|8 */ + struct xlog_ticket *t_next; /* :4|8 */ struct xlog_ticket *t_prev; /* :4|8 */ xlog_tid_t t_tid; /* transaction identifier : 4 */ int t_curr_res; /* current reservation in bytes : 4 */ @@ -406,13 +406,8 @@ typedef struct log { sema_t l_flushsema; /* iclog flushing semaphore */ int l_flushcnt; /* # of procs waiting on this * sema */ - int l_ticket_cnt; /* free ticket count */ - int l_ticket_tcnt; /* total ticket count */ int l_covered_state;/* state of "covering disk * log entries" */ - xlog_ticket_t *l_freelist; /* free list of tickets */ - xlog_ticket_t *l_unmount_free;/* kmem_free these addresses */ - xlog_ticket_t *l_tail; /* free list of tickets */ xlog_in_core_t *l_iclog; /* head log queue */ spinlock_t l_icloglock; /* grab to change iclog state */ xfs_lsn_t l_tail_lsn; /* lsn of 1st LR with unflushed @@ -478,6 +473,8 @@ extern struct xfs_buf *xlog_get_bp(xlog_ extern void xlog_put_bp(struct xfs_buf *); extern int xlog_bread(xlog_t *, xfs_daddr_t, int, struct xfs_buf *); +extern kmem_zone_t *xfs_log_ticket_zone; + /* iclog tracing */ #define XLOG_TRACE_GRAB_FLUSH 1 #define XLOG_TRACE_REL_FLUSH 2 Index: 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_vfsops.c 2008-03-13 13:58:08.866070224 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_vfsops.c 2008-03-13 13:59:59.208010688 +1100 @@ -68,15 +68,17 @@ xfs_init(void) /* * Initialize all of the zone allocators we use. */ + xfs_log_ticket_zone = kmem_zone_init(sizeof(xlog_ticket_t), + "xfs_log_ticket"); xfs_bmap_free_item_zone = kmem_zone_init(sizeof(xfs_bmap_free_item_t), - "xfs_bmap_free_item"); + "xfs_bmap_free_item"); xfs_btree_cur_zone = kmem_zone_init(sizeof(xfs_btree_cur_t), - "xfs_btree_cur"); - xfs_trans_zone = kmem_zone_init(sizeof(xfs_trans_t), "xfs_trans"); - xfs_da_state_zone = - kmem_zone_init(sizeof(xfs_da_state_t), "xfs_da_state"); + "xfs_btree_cur"); + xfs_da_state_zone = kmem_zone_init(sizeof(xfs_da_state_t), + "xfs_da_state"); xfs_dabuf_zone = kmem_zone_init(sizeof(xfs_dabuf_t), "xfs_dabuf"); xfs_ifork_zone = kmem_zone_init(sizeof(xfs_ifork_t), "xfs_ifork"); + xfs_trans_zone = kmem_zone_init(sizeof(xfs_trans_t), "xfs_trans"); xfs_acl_zone_init(xfs_acl_zone, "xfs_acl"); xfs_mru_cache_init(); xfs_filestream_init(); Index: 2.6.x-xfs-new/fs/xfs/xfsidbg.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfsidbg.c 2008-03-13 13:07:25.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfsidbg.c 2008-03-13 14:10:13.489855395 +1100 @@ -5607,9 +5607,9 @@ xfsidbg_xiclog(xlog_in_core_t *iclog) be32_to_cpu(iclog->ic_header.h_magicno), be32_to_cpu(iclog->ic_header.h_cycle), be32_to_cpu(iclog->ic_header.h_version), - be64_to_cpu(iclog->ic_header.h_lsn)); + (unsigned long long)be64_to_cpu(iclog->ic_header.h_lsn)); kdb_printf("tail_lsn: 0x%Lx len: %d prev_block: %d num_ops: %d\n", - be64_to_cpu(iclog->ic_header.h_tail_lsn), + (unsigned long long)be64_to_cpu(iclog->ic_header.h_tail_lsn), be32_to_cpu(iclog->ic_header.h_len), be32_to_cpu(iclog->ic_header.h_prev_block), be32_to_cpu(iclog->ic_header.h_num_logops)); @@ -5829,11 +5829,8 @@ xfsidbg_xlog(xlog_t *log) }; kdb_printf("xlog at 0x%p\n", log); - kdb_printf("&flushsm: 0x%p flushcnt: %d tic_cnt: %d tic_tcnt: %d \n", - &log->l_flushsema, log->l_flushcnt, - log->l_ticket_cnt, log->l_ticket_tcnt); - kdb_printf("freelist: 0x%p tail: 0x%p ICLOG: 0x%p \n", - log->l_freelist, log->l_tail, log->l_iclog); + kdb_printf("&flushsm: 0x%p flushcnt: %d ICLOG: 0x%p \n", + &log->l_flushsema, log->l_flushcnt, log->l_iclog); kdb_printf("&icloglock: 0x%p tail_lsn: %s last_sync_lsn: %s \n", &log->l_icloglock, xfs_fmtlsn(&log->l_tail_lsn), xfs_fmtlsn(&log->l_last_sync_lsn)); From owner-xfs@oss.sgi.com Tue Apr 1 18:10:54 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:09 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aogo014887 for ; Tue, 1 Apr 2008 18:10:52 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA13155; Wed, 2 Apr 2008 10:29:43 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m320TgsT118520072; Wed, 2 Apr 2008 10:29:42 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m320TeXR116352804; Wed, 2 Apr 2008 10:29:40 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 10:29:40 +1000 From: David Chinner To: Eric Sandeen Cc: xfs-oss Subject: Re: [PATCH] combined features2 fixup patches (updating/rewriting what was sent in other threads) Message-ID: <20080402002940.GZ103491721@sgi.com> References: <47F0546C.9070709@sandeen.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <47F0546C.9070709@sandeen.net> User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15126 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Sun, Mar 30, 2008 at 10:03:08PM -0500, Eric Sandeen wrote: > Ensure "both" features2 slots are consistent, and set mp attr2 flag. > > Since older kernels may look in the sb_bad_features2 slot for > flags, rather than zeroing it out on fixup, we should make it > equal to the sb_features2 value. > > Also, if the ATTR2 flag was not found prior to features2 > fixup, it was not set in the mount flags, so re-check after the > fixup so that the current session will use the feature. > > Also fix up the comments to reflect these changes. > > Signed-off-by: Eric Sandeen > --- > > Index: linux-2.6-xfs/fs/xfs/xfs_mount.c > =================================================================== > --- linux-2.6-xfs.orig/fs/xfs/xfs_mount.c > +++ linux-2.6-xfs/fs/xfs/xfs_mount.c > @@ -967,22 +967,26 @@ xfs_mountfs( > xfs_mount_common(mp, sbp); > > /* > - * Check for a bad features2 field alignment. This happened on > - * some platforms due to xfs_sb_t not being 64bit size aligned > - * when sb_features was added and hence the compiler put it in > - * the wrong place. > + * Check for a mismatched features2 values. Older kernels > + * read & wrote into the wrong sb offset for sb_features2 > + * on some platforms due to xfs_sb_t not being 64bit size aligned > + * when sb_features2 was added, which made older superblock > + * reading/writing routines swap it as a 64-bit value. > * > - * If we detect a bad field, we or the set bits into the existing > - * features2 field in case it has already been modified and we > - * don't want to lose any features. Zero the bad one and mark > - * the two fields as needing updates once the transaction subsystem > - * is online. > + * For backwards compatibility, we make both slots equal. > + * > + * If we detect a mismatched field, we OR the set bits into the > + * existing features2 field in case it has already been modified; we > + * don't want to lose any features. We then update the bad location > + * with the ORed value so that older kernels will see any features2 > + * flags, and mark the two fields as needing updates once the > + * transaction subsystem is online. > */ > - if (xfs_sb_has_bad_features2(sbp)) { > + if (xfs_sb_has_mismatched_features2(sbp)) { > cmn_err(CE_WARN, > "XFS: correcting sb_features alignment problem"); > sbp->sb_features2 |= sbp->sb_bad_features2; > - sbp->sb_bad_features2 = 0; > + sbp->sb_bad_features2 = sbp->sb_features2; > update_flags |= XFS_SB_FEATURES2 | XFS_SB_BAD_FEATURES2; Probably should update XFS_MOUNT_ATTR2 here, not later. i.e. before we mount he log and start recovery. > @@ -1181,6 +1185,12 @@ xfs_mountfs( > xfs_mount_log_sb(mp, update_flags); > > /* > + * Re-check for ATTR2 in case it was found in bad_features2 slot. > + */ > + if (xfs_sb_version_hasattr2(&mp->m_sb)) > + mp->m_flags |= XFS_MOUNT_ATTR2; > + Rather than here. > /* > - * Detect a bad features2 field > + * Detect a mismatched features2 field. Older kernels read/wrote > + * this into the wrong slot, so to be safe we keep them in sync. > */ > -static inline int xfs_sb_has_bad_features2(xfs_sb_t *sbp) > +static inline int xfs_sb_has_mismatched_features2(xfs_sb_t *sbp) > { > - return (sbp->sb_bad_features2 != 0); > + return (sbp->sb_bad_features2 != sbp->sb_features2); > } Yep, makes sense. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 1 18:11:07 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:41 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aogw014887 for ; Tue, 1 Apr 2008 18:11:04 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA10655; Wed, 2 Apr 2008 09:13:49 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31NDnsT118704719; Wed, 2 Apr 2008 09:13:49 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31NDmb3116282386; Wed, 2 Apr 2008 09:13:48 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:13:48 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: [Patch] Per iclog callback chain lock Message-ID: <20080401231348.GT103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15131 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Introduce an iclog callback chain lock. Rather than use the icloglock for protecting the iclog completion callback chain, use a new per-iclog lock so that walking the callback chain doesn't require holding a global lock. This reduces contention on the icloglock during log buffer I/O completion as the callback chain lock is take for every callback that is issued. On large log buffers, this can number in the hundreds to thousands per iclog so isolating the lock to the iclog makes a lot of sense. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 35 +++++++++++++++++++---------------- fs/xfs/xfs_log_priv.h | 33 ++++++++++++++++++++++++++------- 2 files changed, 45 insertions(+), 23 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2008-03-13 13:10:23.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2008-03-13 19:35:51.251913648 +1100 @@ -397,12 +397,10 @@ xfs_log_notify(xfs_mount_t *mp, /* mo void *iclog_hndl, /* iclog to hang callback off */ xfs_log_callback_t *cb) { - xlog_t *log = mp->m_log; xlog_in_core_t *iclog = (xlog_in_core_t *)iclog_hndl; int abortflg; - cb->cb_next = NULL; - spin_lock(&log->l_icloglock); + spin_lock(&iclog->ic_callback_lock); abortflg = (iclog->ic_state & XLOG_STATE_IOERROR); if (!abortflg) { ASSERT_ALWAYS((iclog->ic_state == XLOG_STATE_ACTIVE) || @@ -411,7 +409,7 @@ xfs_log_notify(xfs_mount_t *mp, /* mo *(iclog->ic_callback_tail) = cb; iclog->ic_callback_tail = &(cb->cb_next); } - spin_unlock(&log->l_icloglock); + spin_unlock(&iclog->ic_callback_lock); return abortflg; } /* xfs_log_notify */ @@ -1257,6 +1255,8 @@ xlog_alloc_log(xfs_mount_t *mp, iclog->ic_size = XFS_BUF_SIZE(bp) - log->l_iclog_hsize; iclog->ic_state = XLOG_STATE_ACTIVE; iclog->ic_log = log; + atomic_set(&iclog->ic_refcnt, 0); + spin_lock_init(&iclog->ic_callback_lock); iclog->ic_callback_tail = &(iclog->ic_callback); iclog->ic_datap = (char *)iclog->hic_data + log->l_iclog_hsize; @@ -1990,7 +1990,7 @@ xlog_state_clean_log(xlog_t *log) if (iclog->ic_state == XLOG_STATE_DIRTY) { iclog->ic_state = XLOG_STATE_ACTIVE; iclog->ic_offset = 0; - iclog->ic_callback = NULL; /* don't need to free */ + ASSERT(iclog->ic_callback == NULL); /* * If the number of ops in this iclog indicate it just * contains the dummy transaction, we can @@ -2193,37 +2193,40 @@ xlog_state_do_callback( be64_to_cpu(iclog->ic_header.h_lsn); spin_unlock(&log->l_grant_lock); - /* - * Keep processing entries in the callback list - * until we come around and it is empty. We - * need to atomically see that the list is - * empty and change the state to DIRTY so that - * we don't miss any more callbacks being added. - */ - spin_lock(&log->l_icloglock); } else { + spin_unlock(&log->l_icloglock); ioerrors++; } - cb = iclog->ic_callback; + /* + * Keep processing entries in the callback list until + * we come around and it is empty. We need to + * atomically see that the list is empty and change the + * state to DIRTY so that we don't miss any more + * callbacks being added. + */ + spin_lock(&iclog->ic_callback_lock); + cb = iclog->ic_callback; while (cb) { iclog->ic_callback_tail = &(iclog->ic_callback); iclog->ic_callback = NULL; - spin_unlock(&log->l_icloglock); + spin_unlock(&iclog->ic_callback_lock); /* perform callbacks in the order given */ for (; cb; cb = cb_next) { cb_next = cb->cb_next; cb->cb_func(cb->cb_arg, aborted); } - spin_lock(&log->l_icloglock); + spin_lock(&iclog->ic_callback_lock); cb = iclog->ic_callback; } loopdidcallbacks++; funcdidcallbacks++; + spin_lock(&log->l_icloglock); ASSERT(iclog->ic_callback == NULL); + spin_unlock(&iclog->ic_callback_lock); if (!(iclog->ic_state & XLOG_STATE_IOERROR)) iclog->ic_state = XLOG_STATE_DIRTY; Index: 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log_priv.h 2008-02-22 13:48:25.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_log_priv.h 2008-03-13 19:34:57.430809151 +1100 @@ -324,6 +324,19 @@ typedef struct xlog_rec_ext_header { * - ic_offset is the current number of bytes written to in this iclog. * - ic_refcnt is bumped when someone is writing to the log. * - ic_state is the state of the iclog. + * + * Because of cacheline contention on large machines, we need to separate + * various resources onto different cachelines. To start with, make the + * structure cacheline aligned. The following fields can be contended on + * by independent processes: + * + * - ic_callback_* + * - ic_refcnt + * - fields protected by the global l_icloglock + * + * so we need to ensure that these fields are located in separate cachelines. + * We'll put all the read-only and l_icloglock fields in the first cacheline, + * and move everything else out to subsequent cachelines. */ typedef struct xlog_iclog_fields { sv_t ic_forcesema; @@ -332,18 +345,23 @@ typedef struct xlog_iclog_fields { struct xlog_in_core *ic_prev; struct xfs_buf *ic_bp; struct log *ic_log; - xfs_log_callback_t *ic_callback; - xfs_log_callback_t **ic_callback_tail; -#ifdef XFS_LOG_TRACE - struct ktrace *ic_trace; -#endif int ic_size; int ic_offset; - atomic_t ic_refcnt; int ic_bwritecnt; ushort_t ic_state; char *ic_datap; /* pointer to iclog data */ -} xlog_iclog_fields_t; +#ifdef XFS_LOG_TRACE + struct ktrace *ic_trace; +#endif + + /* Callback structures need their own cacheline */ + spinlock_t ic_callback_lock ____cacheline_aligned_in_smp; + xfs_log_callback_t *ic_callback; + xfs_log_callback_t **ic_callback_tail; + + /* reference counts need their own cacheline */ + atomic_t ic_refcnt ____cacheline_aligned_in_smp; +} xlog_iclog_fields_t ____cacheline_aligned_in_smp; typedef union xlog_in_core2 { xlog_rec_header_t hic_header; @@ -366,6 +384,7 @@ typedef struct xlog_in_core { #define ic_bp hic_fields.ic_bp #define ic_log hic_fields.ic_log #define ic_callback hic_fields.ic_callback +#define ic_callback_lock hic_fields.ic_callback_lock #define ic_callback_tail hic_fields.ic_callback_tail #define ic_trace hic_fields.ic_trace #define ic_size hic_fields.ic_size From owner-xfs@oss.sgi.com Tue Apr 1 18:11:14 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:27 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aoh2014887 for ; Tue, 1 Apr 2008 18:11:11 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA11016; Wed, 2 Apr 2008 09:24:08 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31NO7sT118601451; Wed, 2 Apr 2008 09:24:07 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31NO6CX118689677; Wed, 2 Apr 2008 09:24:06 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:24:06 +1000 From: David Chinner To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: [Patch] Per iclog callback chain lock Message-ID: <20080401232406.GX103491721@sgi.com> References: <20080401231348.GT103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20080401231348.GT103491721@sgi.com> User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15130 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Wed, Apr 02, 2008 at 09:13:48AM +1000, David Chinner wrote: > Introduce an iclog callback chain lock. > > Rather than use the icloglock for protecting the iclog completion > callback chain, use a new per-iclog lock so that walking the > callback chain doesn't require holding a global lock. > > This reduces contention on the icloglock during log buffer I/O > completion as the callback chain lock is take for every callback > that is issued. This is not accurate - the callback chain is removed in bulk then walked without the lock, but will loop over the iclog chain in case callbacks were added while processing the chain (not sure if that can even happen, though). [mental note - don't write patch descriptions before first coffee completion occurs.] Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 1 18:11:00 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:19 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aogs014887 for ; Tue, 1 Apr 2008 18:10:58 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA10899; Wed, 2 Apr 2008 09:18:16 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31NIGsT118164335; Wed, 2 Apr 2008 09:18:16 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31NIFof118690659; Wed, 2 Apr 2008 09:18:15 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:18:15 +1000 From: David Chinner To: xfs-dev Cc: xfs-oss Subject: [Patch] unique per-AG inode generation number initialisation Message-ID: <20080401231815.GW103491721@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15128 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs Don't initialise new inode generation numbers to zero When we allocation new inode chunks, we initialise the generation numbers to zero. This works fine until we delete a chunk and then reallocate it, resulting in the same inode numbers but with a reset generation count. This can result in inode/generation pairs of different inodes occurring relatively close together. Given that the inode/gen pair makes up the "unique" portion of an NFS filehandle on XFS, this can result in file handles cached on clients being seen on the wire from the server but refer to a different file. This causes .... issues for NFS clients. Hence we need a unique generation number initialisation for each inode to prevent reuse of a small portion of the generation number space. Make this initialiser per-allocation group so that it is not a single point of contention in the filesystem, and increment it on every allocation within an AG to reduce the chance that a generation number is reused for a given inode number if the inode chunk is deleted and reallocated immediately afterwards. It is safe to add the agi_newinogen field to the AGI without using a feature bit. If an older kernel is used, it simply will not update the field on allocation. If the kernel is updated and the field has garbage in it, then it's like having a random seed to the generation number.... Signed-off-by: Dave Chinner --- fs/xfs/xfs_ag.h | 4 +++- fs/xfs/xfs_ialloc.c | 30 ++++++++++++++++++++++-------- 2 files changed, 25 insertions(+), 9 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h 2008-01-18 18:30:06.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h 2008-03-26 13:03:41.122918236 +1100 @@ -121,6 +121,7 @@ typedef struct xfs_agi { * still being referenced. */ __be32 agi_unlinked[XFS_AGI_UNLINKED_BUCKETS]; + __be32 agi_newinogen; /* inode cluster generation */ } xfs_agi_t; #define XFS_AGI_MAGICNUM 0x00000001 @@ -134,7 +135,8 @@ typedef struct xfs_agi { #define XFS_AGI_NEWINO 0x00000100 #define XFS_AGI_DIRINO 0x00000200 #define XFS_AGI_UNLINKED 0x00000400 -#define XFS_AGI_NUM_BITS 11 +#define XFS_AGI_NEWINOGEN 0x00000800 +#define XFS_AGI_NUM_BITS 12 #define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1) /* disk block (xfs_daddr_t) in the AG */ Index: 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/xfs_ialloc.c 2008-03-25 15:41:27.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/xfs_ialloc.c 2008-03-26 14:29:47.998554368 +1100 @@ -309,6 +309,8 @@ xfs_ialloc_ag_alloc( free = XFS_MAKE_IPTR(args.mp, fbuf, i); free->di_core.di_magic = cpu_to_be16(XFS_DINODE_MAGIC); free->di_core.di_version = version; + free->di_core.di_gen = agi->agi_newinogen; + be32_add_cpu(&agi->agi_newinogen, 1); free->di_next_unlinked = cpu_to_be32(NULLAGINO); xfs_ialloc_log_di(tp, fbuf, i, XFS_DI_CORE_BITS | XFS_DI_NEXT_UNLINKED); @@ -347,7 +349,8 @@ xfs_ialloc_ag_alloc( * Log allocation group header fields */ xfs_ialloc_log_agi(tp, agbp, - XFS_AGI_COUNT | XFS_AGI_FREECOUNT | XFS_AGI_NEWINO); + XFS_AGI_COUNT | XFS_AGI_FREECOUNT | + XFS_AGI_NEWINO | XFS_AGI_NEWINOGEN); /* * Modify/log superblock values for inode count and inode free count. */ @@ -896,11 +899,12 @@ nextag: ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset); XFS_INOBT_CLR_FREE(&rec, offset); rec.ir_freecount--; + be32_add_cpu(&agi->agi_newinogen, 1); if ((error = xfs_inobt_update(cur, rec.ir_startino, rec.ir_freecount, rec.ir_free))) goto error0; be32_add(&agi->agi_freecount, -1); - xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT); + xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT | XFS_AGI_NEWINOGEN); down_read(&mp->m_peraglock); mp->m_perag[tagno].pagi_freecount--; up_read(&mp->m_peraglock); @@ -1320,6 +1324,11 @@ xfs_ialloc_compute_maxlevels( /* * Log specified fields for the ag hdr (inode section) + * + * We don't log the unlinked inode fields through here; they + * get logged directly to the buffer. Hence we have a discontinuity + * in the fields we are logging and we need two calls to map all + * the dirtied parts of the agi.... */ void xfs_ialloc_log_agi( @@ -1342,22 +1351,27 @@ xfs_ialloc_log_agi( offsetof(xfs_agi_t, agi_newino), offsetof(xfs_agi_t, agi_dirino), offsetof(xfs_agi_t, agi_unlinked), + offsetof(xfs_agi_t, agi_newinogen), sizeof(xfs_agi_t) }; + int log_newino = fields & XFS_AGI_NEWINOGEN; + #ifdef DEBUG xfs_agi_t *agi; /* allocation group header */ agi = XFS_BUF_TO_AGI(bp); ASSERT(be32_to_cpu(agi->agi_magicnum) == XFS_AGI_MAGIC); #endif - /* - * Compute byte offsets for the first and last fields. - */ + fields &= ~XFS_AGI_NEWINOGEN; + + /* Compute byte offsets for the first and last fields. */ xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS, &first, &last); - /* - * Log the allocation group inode header buffer. - */ xfs_trans_log_buf(tp, bp, first, last); + if (log_newino) { + xfs_btree_offsets(XFS_AGI_NEWINOGEN, offsets, XFS_AGI_NUM_BITS, + &first, &last); + xfs_trans_log_buf(tp, bp, first, last); + } } /* From owner-xfs@oss.sgi.com Tue Apr 1 18:11:17 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:11:42 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321Aoh4014887 for ; Tue, 1 Apr 2008 18:11:15 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id JAA10354; Wed, 2 Apr 2008 09:00:46 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m31N0jsT118685679; Wed, 2 Apr 2008 09:00:46 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m31N0jA4117633674; Wed, 2 Apr 2008 09:00:45 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 09:00:44 +1000 From: David Chinner To: David Chinner Cc: xfs-dev , xfs@oss.sgi.com Subject: Re: [Review] Improve XFS error checking and propagation Message-ID: <20080401230044.GS103491721@sgi.com> References: <20080311010420.GD155407@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20080311010420.GD155407@sgi.com> User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15133 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs ping? On Tue, Mar 11, 2008 at 12:04:21PM +1100, David Chinner wrote: > A recent paper at the FAST08 conference highlighted a large number > of unchecked error paths in Linux filesystems and I/O layers. As a > subsystem, XFS had the highest aggregate numbers of bad error > propagation. A tarball which contains a quilt patch series of 32 > patches aimed at improving this situation can be found here: > > http://oss.sgi.com/~dgc/xfs/error-check/xfs-error-checking.tar.gz > > The paper "EIO: Error Handling is Occasionally Correct" can be found > here: > > http://www.cs.wisc.edu/adsl/Publications/eio-fast08.html > > And the in depth results here: > > http://www.cs.wisc.edu/adsl/Publications/eio-fast08/readme.html > http://www.cs.wisc.edu/adsl/Publications/eio-fast08/ > > The XFS results I've been working from are here: > > http://www.cs.wisc.edu/adsl/Publications/eio-fast08/fullfs-xfs-without-false-positives.txt > > and included below is an annotated version of this file as I've > worked through it. The graph of the XFS error paths is a good > visual representation of how the bad error paths tend to cluster > together: > > http://www.cs.wisc.edu/adsl/Publications/eio-fast08/singlefs-xfs.pdf > > (you'll need at least 800% zoom to be able to read it at all) > > The paper analysed a 2.6.15 kernel, but I've been working against an > xfs-dev tree (~2.6.24). Of the 101 reported problems for the 2.6.15 > kernel that was analysed: > > - 7 did not exist anymore (bhv layer, dirv1, write path changes) > - 11 were false positives that were not modified > - 24 were false positives that have been patched to remove > (e.g. int xfs_foo() to void xfs_foo()) > - 37 real problems where an error needed to be returned and are > fixed in the patch series. > - 3 where there is no error path to return an error and no > point in even warning about it (ENOSPC flushing) > - 10 where there is no error path to return an error, but > patched to warn to the syslog about potential data loss > or metadata I/O errors > - 4 were already fixed in the xfs-dev tree > - 2 where the error is ignored because we must continue anyway > (patched to warn to syslog) > - 4 that I haven't yet fixed (xfs_buf_iostrategy and > xfs_buf_iostart) because I need to think about them more. > > Cheers, > > Dave. > -- > Dave Chinner > Principal Engineer > SGI Australian Software Group > > > ---------------------------------------- fs/xfs/ ---- > > d 1 xfs_write -> _xfs_log_force fs/xfs/linux-2.6/xfs_lrw.c 881 > d 2 xfs_write -> _xfs_log_force fs/xfs/linux-2.6/xfs_lrw.c 884 > F 3 xfs_flush_device -> _xfs_log_force fs/xfs/linux-2.6/xfs_super.c 547 > F 4 xfs_qm_dqflush -> _xfs_log_force fs/xfs/quota/xfs_dquot.c 1294 > F 5 xfs_qm_dqflock_pushbuf_wait -> _xfs_log_force fs/xfs/quota/xfs_dquot.c 1591 > F 6 xfs_qm_dqunpin_wait -> _xfs_log_force fs/xfs/quota/xfs_dquot_item.c 204 > F 7 xfs_qm_dquot_logitem_pushbuf -> _xfs_log_force fs/xfs/quota/xfs_dquot_item.c 267 > F 8 xfs_alloc_search_busy -> _xfs_log_force fs/xfs/xfs_alloc.c 2593 > F 9 xfs_iunpin_wait -> _xfs_log_force fs/xfs/xfs_inode.c 2847 > F 10 xfs_iflush -> _xfs_log_force fs/xfs/xfs_inode.c 3243 > F 11 xfs_inode_item_pushbuf -> _xfs_log_force fs/xfs/xfs_inode_item.c 819 > P 12 xfs_log_unmount_write -> _xfs_log_force fs/xfs/xfs_log.c 529 > F 13 xlog_recover_finish -> _xfs_log_force fs/xfs/xfs_log_recover.c 3961 > F 14 xfs_unmountfs -> _xfs_log_force fs/xfs/xfs_mount.c 1088 > F 15 xfs_trans_push_ail -> _xfs_log_force fs/xfs/xfs_trans_ail.c 198 > F 16 xfs_syncsub -> _xfs_log_force fs/xfs/xfs_vfsops.c 1440 > F 17 xfs_syncsub -> _xfs_log_force fs/xfs/xfs_vfsops.c 1455 > F 18 xfs_syncsub -> _xfs_log_force fs/xfs/xfs_vfsops.c 1491 > F 19 xfs_syncsub -> _xfs_log_force fs/xfs/xfs_vfsops.c 1543 > P 20 xfs_fsync -> _xfs_log_force fs/xfs/xfs_vnodeops.c 1129 > P 21 xfs_qm_write_sb_changes -> _xfs_trans_commit fs/xfs/quota/xfs_qm.c 2414 > P 22 xfs_qm_scall_setqlim -> _xfs_trans_commit fs/xfs/quota/xfs_qm_syscalls.c 739 > P 23 xfs_itruncate_finish -> _xfs_trans_commit fs/xfs/xfs_inode.c 1718 > P 24 xlog_recover_process_efi -> _xfs_trans_commit fs/xfs/xfs_log_recover.c 3047 > P 25 xlog_recover_clear_agi_bucket -> _xfs_trans_commit fs/xfs/xfs_log_recover.c 3174 > PB 26 xfs_mount_log_sbunit -> _xfs_trans_commit fs/xfs/xfs_mount.c 1579 > P 27 xfs_growfs_rt_alloc -> _xfs_trans_commit fs/xfs/xfs_rtalloc.c 154 > P 28 xfs_growfs_rt_alloc -> _xfs_trans_commit fs/xfs/xfs_rtalloc.c 191 > P 29 xfs_growfs_rt -> _xfs_trans_commit fs/xfs/xfs_rtalloc.c 2103 > P 30 xfs_inactive_attrs -> _xfs_trans_commit fs/xfs/xfs_vnodeops.c 1505 > C 31 xfs_inactive -> _xfs_trans_commit fs/xfs/xfs_vnodeops.c 1790 > d 32 xfs_initialize_vnode -> bhv_insert fs/xfs/linux-2.6/xfs_super.c 220 > d 33 vfs_insertops -> bhv_insert fs/xfs/linux-2.6/xfs_vfs.c 259 > N 34 linvfs_truncate -> block_truncate_page fs/xfs/linux-2.6/xfs_iops.c 651 > G 35 fs_flushinval_pages -> filemap_fdatawait fs/xfs/linux-2.6/xfs_fs_subr.c 83 > G 36 fs_flush_pages -> filemap_fdatawait fs/xfs/linux-2.6/xfs_fs_subr.c 108 > G 37 fs_flushinval_pages -> filemap_fdatawrite fs/xfs/linux-2.6/xfs_fs_subr.c 82 > G 38 fs_flush_pages -> filemap_fdatawrite fs/xfs/linux-2.6/xfs_fs_subr.c 105 > n 39 xfs_flush_inode_work -> filemap_flush fs/xfs/linux-2.6/xfs_super.c 508 > PM 40 xlog_sync -> pagebuf_associate_memory fs/xfs/xfs_log.c 1358 > PM 41 xlog_sync -> pagebuf_associate_memory fs/xfs/xfs_log.c 1395 > PM 42 xlog_write_log_records -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 1156 > PM 43 xlog_write_log_records -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 1159 > PM 44 xlog_do_recovery_pass -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 3646 > PM 45 xlog_do_recovery_pass -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 3653 > PM 46 xlog_do_recovery_pass -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 3705 > PM 47 xlog_do_recovery_pass -> pagebuf_associate_memory fs/xfs/xfs_log_recover.c 3711 > M 48 xfs_buf_read_flags -> pagebuf_iostart fs/xfs/linux-2.6/xfs_buf.c 636 > M 49 xfsbufd -> pagebuf_iostrategy fs/xfs/linux-2.6/xfs_buf.c 1755 > M 50 xfs_flush_buftarg -> pagebuf_iostrategy fs/xfs/linux-2.6/xfs_buf.c 1816 > M 51 XFS_bwrite -> pagebuf_iostrategy fs/xfs/linux-2.6/xfs_buf.h 503 > n 52 xfs_flush_device_work -> sync_blockdev fs/xfs/linux-2.6/xfs_super.c 533 > f 53 exit_xfs_fs -> unregister_filesystem fs/xfs/linux-2.6/xfs_super.c 999 > PM 54 xfs_acl_vset -> xfs_acl_vremove fs/xfs/xfs_acl.c 326 > f 55 xfs_ialloc_ag_select -> xfs_alloc_pagf_init fs/xfs/xfs_ialloc.c 411 > P 56 xfs_qm_dqflush -> xfs_bawrite fs/xfs/quota/xfs_dquot.c 1300 > N 57 xfs_qm_dqflock_pushbuf_wait -> xfs_bawrite fs/xfs/quota/xfs_dquot.c 1595 > N 58 xfs_qm_dquot_logitem_pushbuf -> xfs_bawrite fs/xfs/quota/xfs_dquot_item.c 275 > N 59 xfs_buf_item_push -> xfs_bawrite fs/xfs/xfs_buf_item.c 669 > P 60 xfs_iflush -> xfs_bawrite fs/xfs/xfs_inode.c 3249 > N 61 xfs_inode_item_pushbuf -> xfs_bawrite fs/xfs/xfs_inode_item.c 823 > F 62 xfs_qm_dqflush -> xfs_bdwrite fs/xfs/quota/xfs_dquot.c 1298 > F 63 xfs_qm_dqiter_bufs -> xfs_bdwrite fs/xfs/quota/xfs_qm.c 1551 > F 64 xfs_iflush -> xfs_bdwrite fs/xfs/xfs_inode.c 3247 > F 65 xlog_recover_do_buffer_trans -> xfs_bdwrite fs/xfs/xfs_log_recover.c 2271 > F 66 xlog_recover_do_inode_trans -> xfs_bdwrite fs/xfs/xfs_log_recover.c 2535 > F 67 xlog_recover_do_dquot_trans -> xfs_bdwrite fs/xfs/xfs_log_recover.c 2664 > C 68 xfs_inactive -> xfs_bmap_finish fs/xfs/xfs_vnodeops.c 1788 > P 69 xfs_iomap_write_allocate -> xfs_bmap_last_offset fs/xfs/xfs_iomap.c 787 > d 70 xfs_dir_leaf_rebalance -> xfs_dir_leaf_compact fs/xfs/xfs_dir_leaf.c 1146 > d 71 xfs_dir_leaf_rebalance -> xfs_dir_leaf_compact fs/xfs/xfs_dir_leaf.c 1176 > d 72 xfs_dir_leaf_to_shortform -> xfs_dir_shortform_addname fs/xfs/xfs_dir_leaf.c 693 > P 73 xlog_recover_process_efi -> xfs_free_extent fs/xfs/xfs_log_recover.c 3041 > n 74 xfs_inode_item_push -> xfs_iflush fs/xfs/xfs_inode_item.c 879 > P 75 xlog_recover_do_inode_trans -> xfs_imap fs/xfs/xfs_log_recover.c 2320 > f 76 xfs_bmap_add_extent -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 689 > f 77 xfs_bmap_add_extent_hole_delay -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 1918 > f 78 xfs_bmap_del_extent -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 3117 > f 79 xfs_bmapi -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 4801 > f 80 xfs_bmapi -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 4805 > f 81 xfs_bunmapi -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 5452 > f 82 xfs_bunmapi -> xfs_mod_incore_sb fs/xfs/xfs_bmap.c 5458 > f 83 xfs_trans_reserve -> xfs_mod_incore_sb fs/xfs/xfs_trans.c 305 > P 84 xfs_qm_quotacheck -> xfs_mount_reset_sbqflags fs/xfs/quota/xfs_qm.c 1962 > N 85 xfs_qm_dqpurge -> xfs_qm_dqflush fs/xfs/quota/xfs_dquot.c 1505 > NB 86 xfs_qm_dquot_logitem_push -> xfs_qm_dqflush fs/xfs/quota/xfs_dquot_item.c 168 > N 87 xfs_qm_shake_freelist -> xfs_qm_dqflush fs/xfs/quota/xfs_qm.c 2134 > N 88 xfs_qm_dqreclaim_one -> xfs_qm_dqflush fs/xfs/quota/xfs_qm.c 2306 > PM 89 xfs_qm_quotacheck -> xfs_qm_dqflush_all fs/xfs/quota/xfs_qm.c 1930 > P 90 xfs_qm_scall_quotaoff -> xfs_qm_log_quotaoff fs/xfs/quota/xfs_qm_syscalls.c 291 > P 91 xfs_qm_scall_quotaoff -> xfs_qm_log_quotaoff_end fs/xfs/quota/xfs_qm_syscalls.c 347 > F 92 xfs_qm_newmount -> xfs_qm_mount_quotas fs/xfs/quota/xfs_qm_bhv.c 273 > F 93 xfs_qm_endmount -> xfs_qm_mount_quotas fs/xfs/quota/xfs_qm_bhv.c 301 > PM 94 xfs_quiesce_fs -> xfs_syncsub fs/xfs/xfs_vfsops.c 632 > P 95 xlog_recover_process_efi -> xfs_trans_reserve fs/xfs/xfs_log_recover.c 3036 > P 96 xlog_recover_clear_agi_bucket -> xfs_trans_reserve fs/xfs/xfs_log_recover.c 3152 > P 97 xfs_qm_scall_trunc_qfiles -> xfs_truncate_file fs/xfs/quota/xfs_qm_syscalls.c 395 > P 98 xfs_qm_scall_trunc_qfiles -> xfs_truncate_file fs/xfs/quota/xfs_qm_syscalls.c 404 > P 99 xfs_log_unmount_write -> xlog_state_release_iclog fs/xfs/xfs_log.c 570 > P 100 xfs_log_unmount_write -> xlog_state_release_iclog fs/xfs/xfs_log.c 606 > f 101 xfs_log_force_umount -> xlog_state_sync_all fs/xfs/xfs_log.c 3586 > > f = false positive > F = false positive + patch to remove condition > G = patch in mainline git tree already > M = __must_check annotations found this as well > P = real, patch to fix > n = no error path to return error > N = no error path to return error, patch to warn about error added > d = does not exist anymore. > B = some other bug found and fixed at same time > C = error ignored, must continue anyway. If silent, made noisy > > Notes: > > - all the xfs_mod_incore_sb() are false positive because they are freeing > blocks or extents which means there can never be an error returned. The only > error that can be returned is ENOSPC when trying to allocate blocks.... > > - none of the callers of xfs_mount_log_sb() check the return value. > > - new function xfs_log_sbcount failed to check return of xfs_trans_commit. > Callers are failing to check return value. > > - most of the callers to xfs_log_force() are not interested in errors - they'll > get them through other means (i.e. log error implies filesystem shutdown). > Only a handful of callers really should return errors, such as fsync(), > sync writes or synchronous transaction commits. > -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 1 18:24:33 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:24:45 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with SMTP id m321OUSP019932 for ; Tue, 1 Apr 2008 18:24:31 -0700 Received: from snort.melbourne.sgi.com (snort.melbourne.sgi.com [134.14.54.149]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA14823; Wed, 2 Apr 2008 11:24:58 +1000 Received: from snort.melbourne.sgi.com (localhost [127.0.0.1]) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m321OvsT62904127; Wed, 2 Apr 2008 11:24:57 +1000 (AEST) Received: (from dgc@localhost) by snort.melbourne.sgi.com (SGI-8.12.5/8.12.5/Submit) id m321Oqp6116799419; Wed, 2 Apr 2008 11:24:52 +1000 (AEST) X-Authentication-Warning: snort.melbourne.sgi.com: dgc set sender to dgc@sgi.com using -f Date: Wed, 2 Apr 2008 11:24:52 +1000 From: David Chinner To: Michael Nishimoto Cc: David Chinner , XFS Mailing List Subject: Re: Definition of XFS_DQUOT_LOGRES() Message-ID: <20080402012452.GA103491721@sgi.com> References: <47F16988.2080406@agami.com> <20080401012856.GL103491721@sgi.com> <47F2A321.60907@agami.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <47F2A321.60907@agami.com> User-Agent: Mutt/1.4.2.1i X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15134 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: dgc@sgi.com Precedence: bulk X-list: xfs On Tue, Apr 01, 2008 at 02:03:29PM -0700, Michael Nishimoto wrote: > David Chinner wrote: > >On Mon, Mar 31, 2008 at 03:45:28PM -0700, Michael Nishimoto wrote: > >>The comment for XFS_DQUOT_LOGRES states that we need to reserve space > >>for 3 dquots. I can't figure out why we need to add this amount to *all* > >>operations and why this amount wasn't added after doing a runtime > >>quotaon check. > > > >It probably could be done that way. But given that: > > > >>/* > >> * In the worst case, when both user and group quotas are on, > >> * we can have a max of three dquots changing in a single transaction. > >> */ > >>#define XFS_DQUOT_LOGRES(mp) (sizeof(xfs_disk_dquot_t) * 3) > > > >sizeof(xfs_disk_dquot_t) = 104 bytes, > > > >the overall addition to the reservations is minor considering: > > > >[0]kdb> xtrres 0xe0000038055ac6c0 > >write: 109752 truncate: 223672 rename: 305976 > >link: 153144 remove: 153144 symlink: 158520 > >create: 158392 mkdir: 158392 ifree: 58936 > >ichange: 2104 growdata: 45696 swrite: 384 > >addafork: 70584 writeid: 384 attrinval: 179328 > >attrset: 22968 attrrm: 90552 clearagi: 1152 > >growrtalloc: 66048 growrtzero: 4224 growrtfree: 6272 > >[0]kdb> > > > >on a 14GB filesystem most of the transactions this is added to > >are on the far side of 150k and that means we're talking about less > >than 0.2% of the entire reservation comes from the dquot. With > >larger block sizes and/or larger filesystems, these get much > >larger. e.g. same 14GB device, 64k block size instead of 4k: > > > >[0]kdb> xtrres 0xe00000b8027d39f8 > >write: 987576 truncate: 1977272 rename: 2891064 > >link: 1445688 remove: 1445688 symlink: 1512504 > >create: 1511864 mkdir: 1511864 ifree: 470584 > >ichange: 1592 growdata: 395904 swrite: 384 > >addafork: 658616 writeid: 384 attrinval: 1581696 > >attrset: 329656 attrrm: 791480 clearagi: 640 > >growrtalloc: 592640 growrtzero: 65664 growrtfree: 67200 > > > >The rename reservation is *2.8MB* (up from 300k). IOWs, 300 bytes is > >really noise when it comes to reservation space. (OT: See why I want to > >increase the log size now? :) > > > >Is it worth the complexity of adding this dquot reservation at > >runtime for a best case reduction of 0.2% in log space reservation > >usage? Probably not, but patches can be convincing ;) > > > >Cheers, > > > >Dave. > > Here is a patch to fix a sign problem when growing the log to 2G. > Michael, can you repost this patch with a Signed-off-by tag and..... > --- xfs_log.2.c 2008-04-01 11:55:45.000000000 -0700 > +++ xfs_log.3.c 2008-04-01 11:56:53.000000000 -0700 > @@ -230,20 +230,24 @@ > static void > xlog_grant_add_space_write(struct log *log, int bytes) > { > - log->l_grant_write_bytes += bytes; > - if (log->l_grant_write_bytes > log->l_logsize) { > - log->l_grant_write_bytes -= log->l_logsize; > - log->l_grant_write_cycle++; > + int __tmp = (log)->l_logsize - (log)->l_grant_write_bytes; No need for "__" in the tmp var, nor the () around log... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group From owner-xfs@oss.sgi.com Tue Apr 1 18:50:03 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 18:50:12 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-1.2 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from cuda.sgi.com (cuda2.sgi.com [192.48.168.29]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m321o0oJ026399 for ; Tue, 1 Apr 2008 18:50:02 -0700 X-ASG-Debug-ID: 1207101033-62f700850000-NocioJ X-Barracuda-URL: http://cuda.sgi.com:80/cgi-bin/mark.cgi Received: from ext.agami.com (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id D07B67268CC for ; Tue, 1 Apr 2008 18:50:33 -0700 (PDT) Received: from ext.agami.com (64.221.212.177.ptr.us.xo.net [64.221.212.177]) by cuda.sgi.com with ESMTP id oowllHUU0NRSfBDv for ; Tue, 01 Apr 2008 18:50:33 -0700 (PDT) Received: from agami.com (mail [192.168.168.5]) by ext.agami.com (8.12.5/8.12.5) with ESMTP id m31L37eL029945 for ; Tue, 1 Apr 2008 14:03:07 -0700 Received: from mx1.agami.com (mx1.agami.com [10.123.10.30]) by agami.com (8.12.11/8.12.11) with ESMTP id m31L37ec015683 for ; Tue, 1 Apr 2008 14:03:07 -0700 Received: from [10.123.4.142] ([10.123.4.142]) by mx1.agami.com with Microsoft SMTPSVC(6.0.3790.1830); Tue, 1 Apr 2008 14:03:30 -0700 Message-ID: <47F2A321.60907@agami.com> Date: Tue, 01 Apr 2008 14:03:29 -0700 From: Michael Nishimoto User-Agent: Mail/News 1.5.0.4 (X11/20060629) MIME-Version: 1.0 To: David Chinner CC: XFS Mailing List X-ASG-Orig-Subj: Re: Definition of XFS_DQUOT_LOGRES() Subject: Re: Definition of XFS_DQUOT_LOGRES() References: <47F16988.2080406@agami.com> <20080401012856.GL103491721@sgi.com> In-Reply-To: <20080401012856.GL103491721@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 01 Apr 2008 21:03:30.0160 (UTC) FILETIME=[D66E6700:01C8943B] X-Barracuda-Connect: 64.221.212.177.ptr.us.xo.net[64.221.212.177] X-Barracuda-Start-Time: 1207101036 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Virus-Scanned: by cuda.sgi.com at sgi.com X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using per-user scores of TAG_LEVEL=2.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=2.1 tests= X-Barracuda-Spam-Report: Code version 3.1, rules version 3.1.46580 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15135 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: miken@agami.com Precedence: bulk X-list: xfs David Chinner wrote: > On Mon, Mar 31, 2008 at 03:45:28PM -0700, Michael Nishimoto wrote: >> The comment for XFS_DQUOT_LOGRES states that we need to reserve space >> for 3 dquots. I can't figure out why we need to add this amount to *all* >> operations and why this amount wasn't added after doing a runtime >> quotaon check. > > It probably could be done that way. But given that: > >> /* >> * In the worst case, when both user and group quotas are on, >> * we can have a max of three dquots changing in a single transaction. >> */ >> #define XFS_DQUOT_LOGRES(mp) (sizeof(xfs_disk_dquot_t) * 3) > > sizeof(xfs_disk_dquot_t) = 104 bytes, > > the overall addition to the reservations is minor considering: > > [0]kdb> xtrres 0xe0000038055ac6c0 > write: 109752 truncate: 223672 rename: 305976 > link: 153144 remove: 153144 symlink: 158520 > create: 158392 mkdir: 158392 ifree: 58936 > ichange: 2104 growdata: 45696 swrite: 384 > addafork: 70584 writeid: 384 attrinval: 179328 > attrset: 22968 attrrm: 90552 clearagi: 1152 > growrtalloc: 66048 growrtzero: 4224 growrtfree: 6272 > [0]kdb> > > on a 14GB filesystem most of the transactions this is added to > are on the far side of 150k and that means we're talking about less > than 0.2% of the entire reservation comes from the dquot. With > larger block sizes and/or larger filesystems, these get much > larger. e.g. same 14GB device, 64k block size instead of 4k: > > [0]kdb> xtrres 0xe00000b8027d39f8 > write: 987576 truncate: 1977272 rename: 2891064 > link: 1445688 remove: 1445688 symlink: 1512504 > create: 1511864 mkdir: 1511864 ifree: 470584 > ichange: 1592 growdata: 395904 swrite: 384 > addafork: 658616 writeid: 384 attrinval: 1581696 > attrset: 329656 attrrm: 791480 clearagi: 640 > growrtalloc: 592640 growrtzero: 65664 growrtfree: 67200 > > The rename reservation is *2.8MB* (up from 300k). IOWs, 300 bytes is > really noise when it comes to reservation space. (OT: See why I want to > increase the log size now? :) > > Is it worth the complexity of adding this dquot reservation at > runtime for a best case reduction of 0.2% in log space reservation > usage? Probably not, but patches can be convincing ;) > > Cheers, > > Dave. Here is a patch to fix a sign problem when growing the log to 2G. --- xfs_log.2.c 2008-04-01 11:55:45.000000000 -0700 +++ xfs_log.3.c 2008-04-01 11:56:53.000000000 -0700 @@ -230,20 +230,24 @@ static void xlog_grant_add_space_write(struct log *log, int bytes) { - log->l_grant_write_bytes += bytes; - if (log->l_grant_write_bytes > log->l_logsize) { - log->l_grant_write_bytes -= log->l_logsize; - log->l_grant_write_cycle++; + int __tmp = (log)->l_logsize - (log)->l_grant_write_bytes; + if (__tmp > bytes) + (log)->l_grant_write_bytes += bytes; + else { + (log)->l_grant_write_cycle++; + (log)->l_grant_write_bytes = bytes - __tmp; } } static void xlog_grant_add_space_reserve(struct log *log, int bytes) { - log->l_grant_reserve_bytes += bytes; - if (log->l_grant_reserve_bytes > log->l_logsize) { - log->l_grant_reserve_bytes -= log->l_logsize; - log->l_grant_reserve_cycle++; + int __tmp = (log)->l_logsize - (log)->l_grant_reserve_bytes; + if (__tmp > bytes) + (log)->l_grant_reserve_bytes += bytes; + else { + (log)->l_grant_reserve_cycle++; + (log)->l_grant_reserve_bytes = bytes - __tmp; } } From owner-xfs@oss.sgi.com Tue Apr 1 19:57:42 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 19:57:52 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from relay.sgi.com (relay2.corp.sgi.com [192.26.58.22]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m322vfsn003159 for ; Tue, 1 Apr 2008 19:57:42 -0700 Received: from outhouse.melbourne.sgi.com (outhouse.melbourne.sgi.com [134.14.52.145]) by relay2.corp.sgi.com (Postfix) with ESMTP id 7456B30409A; Tue, 1 Apr 2008 19:58:13 -0700 (PDT) Received: from itchy (xaiki@itchy.melbourne.sgi.com [134.14.55.96]) by outhouse.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m322w9jm083470; Wed, 2 Apr 2008 13:58:11 +1100 (AEDT) From: Niv Sardi To: David Chinner Cc: xfs-dev , xfs@oss.sgi.com Subject: Re: [Review] Improve XFS error checking and propagation References: <20080311010420.GD155407@sgi.com> <20080401230044.GS103491721@sgi.com> Date: Wed, 02 Apr 2008 13:58:09 +1100 In-Reply-To: <20080401230044.GS103491721@sgi.com> (David Chinner's message of "Wed, 2 Apr 2008 09:00:44 +1000") Message-ID: User-Agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.60 (i486-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15136 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: xaiki@cxhome.ath.cx Precedence: bulk X-list: xfs David Chinner writes: > On Tue, Mar 11, 2008 at 12:04:21PM +1100, David Chinner wrote: >> A recent paper at the FAST08 conference highlighted a large number >> of unchecked error paths in Linux filesystems and I/O layers. As a >> subsystem, XFS had the highest aggregate numbers of bad error >> propagation. A tarball which contains a quilt patch series of 32 >> patches aimed at improving this situation can be found here: >> >> http://oss.sgi.com/~dgc/xfs/error-check/xfs-error-checking.tar.gz All looks good except some minor typo-editing, and NOK xfs-mustcheck-quotamount.patch # need to check if can happen when forcing quotas I'm not sure what happens if we really DO want quotas (specified on mount line and such). OK xfs-mustcheck-reset-dqcounts.patch OK xfs-mustcheck-dqflushall.patch OK xfs-mustcheck-acl-setmode.patch OK xfs-mustcheck-search-busy.patch EDITED xfs-mustcheck-compute-diff.patch # xfs_fs_cmn_err alignment OK xfs-mustcheck-bmap-adjacent.patch OK xfs-mustcheck-iflush-fork.patch # less error handeling !! OK xfs-mustcheck-bulkstat-dinode.patch OK xfs-mustcheck-quiesce-fs.patch OK xfs-mustcheck-bdstrat.patch OK xfs-fix-error-prototypes.patch # not error handeling related OK xfs-mustcheck-acl-vremove.patch OK xfs-mustcheck-icsb-disable.patch OK xfs-mustcheck-ioend-unwritten.patch OK xfs-mustcheck-buf-associate.patch OK xfs-mustcheck-reserve-blocks.patch EDITED xfs-mustcheck-bawrite.patch # xfs_fs_cmn_err alignment OK xfs-mustcheck-bdwrite.patch OK xfs-mustcheck-truncate-page.patch # might be incomplete EDITED xfs-mustcheck-dqflush.patch # slight style change/typo OK xfs-mustcheck-reset-sbqflags.patch OK xfs-mustcheck-quotaoff.patch EDITED xfs-mustcheck-inactive.patch # slight style change/typo OK xfs-mustcheck-trans-reserve.patch OK xfs-mustcheck-quota-trunc.patch OK xfs-mustcheck-log-unmount.patch OK xfs-mustcheck-free-extent.patch OK xfs-mustcheck-xfs-imap.patch OK xfs-mustcheck-bmap-last-offset.patch OK xfs-mustcheck-trans-commit.patch OK xfs-mustcheck-log-force.patch Cheers, -- Niv Sardi From owner-xfs@oss.sgi.com Tue Apr 1 21:02:18 2008 Received: with ECARTIS (v1.0.0; list xfs); Tue, 01 Apr 2008 21:02:37 -0700 (PDT) X-Spam-Checker-Version: SpamAssassin 3.3.0-r574664 (2007-09-11) on oss.sgi.com X-Spam-Level: X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.3.0-r574664 Received: from relay.sgi.com (netops-testserver-3.corp.sgi.com [192.26.57.72]) by oss.sgi.com (8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id m3242GBC017636 for ; Tue, 1 Apr 2008 21:02:17 -0700 Received: from outhouse.melbourne.sgi.com (outhouse.melbourne.sgi.com [134.14.52.145]) by netops-testserver-3.corp.sgi.com (Postfix) with ESMTP id 8D1DC9088D; Tue, 1 Apr 2008 21:02:46 -0700 (PDT) Received: from itchy (xaiki@itchy.melbourne.sgi.com [134.14.55.96]) by outhouse.melbourne.sgi.com (SGI-8.12.5/8.12.5) with ESMTP id m3242fjm085594; Wed, 2 Apr 2008 15:02:44 +1100 (AEDT) From: Niv Sardi To: David Chinner Cc: xfs-dev , xfs-oss Subject: Re: [Patch] unique per-AG inode generation number initialisation References: <20080401231815.GW103491721@sgi.com> Date: Wed, 02 Apr 2008 15:02:42 +1100 In-Reply-To: <20080401231815.GW103491721@sgi.com> (David Chinner's message of "Wed, 2 Apr 2008 09:18:15 +1000") Message-ID: User-Agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.60 (i486-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Virus-Scanned: ClamAV 0.91.2/6021/Wed Feb 27 15:55:48 2008 on oss.sgi.com X-Virus-Status: Clean X-archive-position: 15137 X-ecartis-version: Ecartis v1.0.0 Sender: xfs-bounce@oss.sgi.com Errors-to: xfs-bounce@oss.sgi.com X-original-sender: xaiki@cxhome.ath.cx Precedence: bulk X-list: xfs David Chinner writes: > Don't initialise new inode generation numbers to zero > > When we allocation new inode chunks, we initialise the generation > numbers to zero. This works fine until we delete a chunk and then > reallocate it, resulting in the same inode numbers but with a > reset generation count. This can result in inode/generation > pairs of different inodes occurring relatively close together. > > Given that the inode/gen pair makes up the "unique" portion of > an NFS filehandle on XFS, this can result in file handles cached > on clients being seen on the wire from the server but refer to > a different file. This causes .... issues for NFS clients. > > Hence we need a unique generation number initialisation for > each inode to prevent reuse of a small portion of the generation > number space. Make this initialiser per-allocation group so > that it is not a single point of contention in the filesystem, > and increment it on every allocation within an AG to reduce the > chance that a generation number is reused for a given inode number > if the inode chunk is deleted and reallocated immediately > afterwards. > > It is safe to add the agi_newinogen field to the AGI without > using a feature bit. If an older kernel is used, it simply > will not update the field on allocation. If the kernel is > updated and the field has garbage in it, then it's like having a > random seed to the generation number.... > > Signed-off-by: Dave Chinner > --- > fs/xfs/xfs_ag.h | 4 +++- > fs/xfs/xfs_ialloc.c | 30 ++++++++++++++++++++++-------- > 2 files changed, 25 insertions(+), 9 deletions(-) Appart from the bit of overhead all seems goo