From yuan@pauli.utmb.edu Wed Nov 3 14:15:44 2004
Received: with ECARTIS (v1.0.0; list pagg); Wed, 03 Nov 2004 14:15:49 -0800 (PST)
Received: from pauli.utmb.edu (pauli.utmb.edu [129.109.59.102]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA3MFhMf017052 for ; Wed, 3 Nov 2004 14:15:44 -0800
Received: from pauli.utmb.edu (pauli.utmb.edu [129.109.59.102]) by pauli.utmb.edu (8.12.11/8.12.11) with ESMTP id iA3MFHXP014281 for ; Wed, 3 Nov 2004 16:15:17 -0600
Subject: CSA instllation
From: Yuan Xu
Reply-To: yuan@pauli.utmb.edu
To: pagg@oss.sgi.com
Content-Type: text/plain
Organization: SCSB
Message-Id: <1099520117.14190.41.camel@pauli.utmb.edu>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 (1.4.5-1)
Date: Wed, 03 Nov 2004 16:15:17 -0600
Content-Transfer-Encoding: 7bit
X-archive-position: 45
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: yuan@pauli.utmb.edu
Precedence: bulk
X-list: pagg

Dear Sir:

I am trying to install CSA on my linux cluster running
kernel: kernel-smp-2.4.20-24.8 and am having trouble getting it to run.

Here is what I did.

1) I installed job-1.2.0-1.i386.rpm and csa-2.1.1-1.i386.rpm
2) They installed without problems
3) I followed the csa man page to start job and csa
4) /etc/init.d/csa start worked and gave no error messages
5) /etc/init.d/job start gave me the following error messages:

Enabling Linux Jobs
Failed to load the job module. Consult /var/log/messages for
additional error messages to determine the cause of failure.

and the log file (messages) showed:

insmod: insmod: job: no module by that name found

6) When I ran /usr/sbin/csaswitch -c on -n csa, I got

Can't open /proc/csa
System Error(2): No such file or directory.
Can't open /proc/csa
System Error(2): No such file or directory.
Unable to get the Kernel and Daemon accounting status information.
Unable to get the Kernel and Daemon accounting status information.
System Error(2): No such file or directory

Now my questions are:

1) What versions of csa and job should I use for my kernel? I know the
kernel is kind of old but we have to live with it.

2) It looks like I have to install patches also. So what versions of
the patches do I have to use?

3) How do I apply patches to my kernel?

4) How do I load the required module?

I would be delighted if you could help me solve my CSA installation
problems.

Thank you very much and have a good day

--
Have a good day
Yuan Xu, PhD
SCSB, UTMB
Galveston TX 77555-0857
4097476805(O), 2816145220(H)
yuan@pauli.utmb.edu
http://www.scsb.utmb.edu
http://www.scsb.utmb.edu/Compu_Core/index.html
http://planck.utmb.edu

From erikj@subway.americas.sgi.com Wed Nov 3 14:44:04 2004
Received: with ECARTIS (v1.0.0; list pagg); Wed, 03 Nov 2004 14:44:11 -0800 (PST)
Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA3Mi4D5017910 for ; Wed, 3 Nov 2004 14:44:04 -0800
Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA3NxMB9007129 for ; Wed, 3 Nov 2004 15:59:22 -0800
Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id iA3Mgfam1322227 for ; Wed, 3 Nov 2004 16:42:41 -0600 (CST)
Received: from subway.americas.sgi.com (subway.americas.sgi.com [128.162.236.152]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id iA3MgftC14408745 for ; Wed, 3 Nov 2004 16:42:41 -0600 (CST)
Received: from subway.americas.sgi.com (localhost [127.0.0.1]) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/erikj-IRIX6519-news) with ESMTP id iA3Mgfcc2397610 for ; Wed, 3 Nov 2004 16:42:41 -0600 (CST)
Received: from localhost (erikj@localhost) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/Submit) with ESMTP id iA3MgfTr2399036 for ;
 Wed, 3 Nov 2004 16:42:41 -0600 (CST)
Date: Wed, 3 Nov 2004 16:42:41 -0600
From: Erik Jacobson
To: pagg@oss.sgi.com
Subject: New pagg patch for 2.6.9
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 46
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: erikj@subway.americas.sgi.com
Precedence: bulk
X-list: pagg

Hi there.

A problem was found with the PAGG patch so I decided to re-spin the patch
for 2.6.9 and include the fix.

In a certain unlikely situation in copy_process (fork.c), it was possible
that a child task could be "aborted". Since this aborted task is torn
back down within copy_process itself, do_exit isn't called. Previously,
the only place we called pagg_detach was in the do_exit function. The end
result is we do a pagg_attach for the process but not a matching
pagg_detach in this rare situation.

Using Linux job as an example here, this could mean a "bogus" task is in
a job container and trying to do certain things to that bogus task could
result in system panics since the task is "mostly gone". For example, if
job_killjid tried to signal this bogus task, it would cause a panic (null
pointer dereference) since the signal handler isn't attached to the bogus
task any more.

The fix was to add a pagg_detach call right under
bad_fork_cleanup_namespace: in copy_process.

Find the 'linux-2.6.9-pagg.patch' patch at the PAGG web site.
http://oss.sgi.com/projects/pagg/
Click on "Download" on the left.

Thank you.
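
The attach/detach imbalance described above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the kernel code: every name in it (fork_sketch_task, fork_sketch_attach, fork_sketch_detach, fork_sketch_copy, fork_sketch_exit) is hypothetical, and the "container" is reduced to a counter. It shows a fork path that attaches early, may abort later, and only balances the attach when the unwind path also detaches.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct fork_sketch_task {
    int attached;               /* 1 while the task is in a container */
};

/* Number of attach calls that have not been matched by a detach. */
int fork_sketch_live;

void fork_sketch_attach(struct fork_sketch_task *t)
{
    assert(t != NULL && !t->attached);
    t->attached = 1;
    fork_sketch_live++;
}

void fork_sketch_detach(struct fork_sketch_task *t)
{
    assert(t != NULL && t->attached);
    t->attached = 0;
    fork_sketch_live--;
}

/*
 * Models a fork path that attaches early and may fail afterwards.
 * fail_late != 0 simulates the aborted child; with_fix selects whether
 * the unwind path detaches before the task is torn down.
 * Returns 0 and stores the task in *out on success, -1 on abort.
 */
int fork_sketch_copy(int fail_late, int with_fix, struct fork_sketch_task **out)
{
    struct fork_sketch_task *t = calloc(1, sizeof(*t));

    *out = NULL;
    if (t == NULL)
        return -1;

    fork_sketch_attach(t);
    if (fail_late)
        goto bad_fork_cleanup;

    *out = t;
    return 0;

bad_fork_cleanup:
    if (with_fix)
        fork_sketch_detach(t);  /* the matching detach on the error path */
    free(t);
    return -1;
}

/* Models the normal exit path, which always detaches. */
void fork_sketch_exit(struct fork_sketch_task *t)
{
    fork_sketch_detach(t);
    free(t);
}
```

Calling fork_sketch_copy(1, 0, &t) leaves fork_sketch_live at 1 even though the task is gone, which corresponds to the "bogus" task left in a container; calling it with with_fix set leaves the counter at 0.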
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From jlan@sgi.com Fri Nov 5 15:43:01 2004
Received: with ECARTIS (v1.0.0; list pagg); Fri, 05 Nov 2004 15:43:06 -0800 (PST)
Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA5Nh0tP027065 for ; Fri, 5 Nov 2004 15:43:01 -0800
Received: from omx2.sgi.com ([198.149.32.25]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA5KXFxT016334 for ; Fri, 5 Nov 2004 14:33:16 -0600
Received: from spindle.corp.sgi.com (spindle.corp.sgi.com [198.29.75.13]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA5Ln7xi019984 for ; Fri, 5 Nov 2004 13:49:07 -0800
Received: from mtv-vpn-hw-jlan-2.corp.sgi.com (mtv-vpn-hw-jlan-2.corp.sgi.com [134.15.18.195]) by spindle.corp.sgi.com (8.12.9/8.12.9/generic_config-1.2) with ESMTP id iA5KWDGI67743596; Fri, 5 Nov 2004 12:32:13 -0800 (PST)
Received: from sgi.com (mtv-vpn-hw-jlan-2.corp.sgi.com [127.0.0.1]) by mtv-vpn-hw-jlan-2.corp.sgi.com (8.12.8/8.12.8) with ESMTP id iA5KXlHV014058; Fri, 5 Nov 2004 12:33:48 -0800
Message-ID: <418BE3AB.1070909@sgi.com>
Date: Fri, 05 Nov 2004 12:33:47 -0800
From: Jay Lan
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
X-Accept-Language: zh-tw, en-us, en, zh-cn, zh-hk
MIME-Version: 1.0
To: yuan@pauli.utmb.edu
CC: pagg@oss.sgi.com
Subject: Re: CSA instllation
References: <1099520117.14190.41.camel@pauli.utmb.edu> <418BCB88.4040308@sgi.com>
In-Reply-To: <418BCB88.4040308@sgi.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 47
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: jlan@sgi.com
Precedence: bulk
X-list: pagg

Hi Yuan,

Here is a pointer to a good HOWTO on building and patching a linux
kernel, if you need it:
http://oss.sgi.com/LDP/HOWTO/Kernel-HOWTO/index.html

Cheers,
 - jay

Jay Lan wrote:
> Yuan Xu wrote:
>
>> Dear Sir:
>>
>> I am trying to install CSA on my linux cluster running
>> kernel: kernel-smp-2.4.20-24.8 and am having trouble getting it to run.
>>
>> Here is what I did.
>>
>> 1) I installed job-1.2.0-1.i386.rpm and csa-2.1.1-1.i386.rpm
>> 2) They installed without problems
>> 3) I followed the csa man page to start job and csa
>> 4) /etc/init.d/csa start worked and gave no error messages
>> 5) /etc/init.d/job start gave me the following error messages:
>>
>> Enabling Linux Jobs
>> Failed to load the job module. Consult /var/log/messages for
>> additional error messages to determine the cause of failure.
>>
>> and the log file (messages) showed:
>>
>> insmod: insmod: job: no module by that name found
>>
>> 6) When I ran /usr/sbin/csaswitch -c on -n csa, I got
>>
>> Can't open /proc/csa
>> System Error(2): No such file or directory.
>> Can't open /proc/csa
>> System Error(2): No such file or directory.
>> Unable to get the Kernel and Daemon accounting status information.
>> Unable to get the Kernel and Daemon accounting status information.
>> System Error(2): No such file or directory
>>
>> Now my questions are:
>
>
> Dear Yuan Xu,
>
> Those csa and job rpms need to have kernel patches applied.
>
>>
>> 1) What versions of csa and job should I use for my kernel? I know the
>> kernel is kind of old but we have to live with it.
>
>
> Please try the 2.4.26 versions of the pagg, job and csa kernel patches,
> since they contain the bug fixes found so far and are the latest versions
> we maintain for the 2.4 kernel. I am sorry we do not have patches for
> 2.4.20. You will likely need to resolve a few conflicts. Please feel free
> to ask if you encounter any problems.
>
> Assuming you are using the 2.4.26 kernel patches, the job and csa rpms
> you already installed are fine.
>
>>
>> 2) It looks like I have to install patches also.
>> So what versions of the patches do I have to use?
>
>
> See above.
>
>>
>> 3) How do I apply patches to my kernel?
>
>
> You need to set up a linux kernel build area. You can download a 2.4
> linux kernel of your choice from:
> ftp://ftp.kernel.org/pub/linux/kernel/v2.4
>
> and then use the 'patch' command to apply the kernel patches to your
> kernel base.
>
> A README file from the kernel source tells you how to build a kernel.
>
>>
>> 4) How do I load the required module?
>
>
> After you build your new kernel and reboot, csa and job should
> get loaded at system startup.
>
>>
>> I would be delighted if you could help me solve my CSA installation
>> problems.
>>
>> Thank you very much and have a good day
>
>
> I apologize for the late response. My email filtering did not work well.
> I already turned off the filtering. :(
>
> Cheers!
> - jay
>
>>
>>

From jlan@sgi.com Fri Nov 5 15:43:03 2004
Received: with ECARTIS (v1.0.0; list pagg); Fri, 05 Nov 2004 15:43:07 -0800 (PST)
Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA5Nh2d4027072 for ; Fri, 5 Nov 2004 15:43:03 -0800
Received: from spindle.corp.sgi.com (spindle.corp.sgi.com [198.29.75.13]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA5InHxT026084 for ; Fri, 5 Nov 2004 12:49:17 -0600
Received: from mtv-vpn-hw-jlan-2.corp.sgi.com (mtv-vpn-hw-jlan-2.corp.sgi.com [134.15.18.195]) by spindle.corp.sgi.com (8.12.9/8.12.9/generic_config-1.2) with ESMTP id iA5InEGI67837962; Fri, 5 Nov 2004 10:49:15 -0800 (PST)
Received: from sgi.com (mtv-vpn-hw-jlan-2.corp.sgi.com [127.0.0.1]) by mtv-vpn-hw-jlan-2.corp.sgi.com (8.12.8/8.12.8) with ESMTP id iA5IomHV013703; Fri, 5 Nov 2004 10:50:49 -0800
Message-ID: <418BCB88.4040308@sgi.com>
Date: Fri, 05 Nov 2004 10:50:48 -0800
From: Jay Lan
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
X-Accept-Language: zh-tw, en-us, en, zh-cn, zh-hk
MIME-Version:
 1.0
To: yuan@pauli.utmb.edu
CC: pagg@oss.sgi.com
Subject: Re: CSA instllation
References: <1099520117.14190.41.camel@pauli.utmb.edu>
In-Reply-To: <1099520117.14190.41.camel@pauli.utmb.edu>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 48
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: jlan@sgi.com
Precedence: bulk
X-list: pagg

Yuan Xu wrote:
> Dear Sir:
>
> I am trying to install CSA on my linux cluster running
> kernel: kernel-smp-2.4.20-24.8 and am having trouble getting it to run.
>
> Here is what I did.
>
> 1) I installed job-1.2.0-1.i386.rpm and csa-2.1.1-1.i386.rpm
> 2) They installed without problems
> 3) I followed the csa man page to start job and csa
> 4) /etc/init.d/csa start worked and gave no error messages
> 5) /etc/init.d/job start gave me the following error messages:
>
> Enabling Linux Jobs
> Failed to load the job module. Consult /var/log/messages for
> additional error messages to determine the cause of failure.
>
> and the log file (messages) showed:
>
> insmod: insmod: job: no module by that name found
>
> 6) When I ran /usr/sbin/csaswitch -c on -n csa, I got
>
> Can't open /proc/csa
> System Error(2): No such file or directory.
> Can't open /proc/csa
> System Error(2): No such file or directory.
> Unable to get the Kernel and Daemon accounting status information.
> Unable to get the Kernel and Daemon accounting status information.
> System Error(2): No such file or directory
>
> Now my questions are:

Dear Yuan Xu,

Those csa and job rpms need to have kernel patches applied.

>
> 1) What versions of csa and job should I use for my kernel? I know the
> kernel is kind of old but we have to live with it.

Please try the 2.4.26 versions of the pagg, job and csa kernel patches,
since they contain the bug fixes found so far and are the latest versions
we maintain for the 2.4 kernel.
I am sorry we do not have patches for 2.4.20. You will likely need to
resolve a few conflicts. Please feel free to ask if you encounter any
problems.

Assuming you are using the 2.4.26 kernel patches, the job and csa rpms
you already installed are fine.

>
> 2) It looks like I have to install patches also. So what versions of
> the patches do I have to use?

See above.

>
> 3) How do I apply patches to my kernel?

You need to set up a linux kernel build area. You can download a 2.4
linux kernel of your choice from:
ftp://ftp.kernel.org/pub/linux/kernel/v2.4

and then use the 'patch' command to apply the kernel patches to your
kernel base.

A README file from the kernel source tells you how to build a kernel.

>
> 4) How do I load the required module?

After you build your new kernel and reboot, csa and job should
get loaded at system startup.

>
> I would be delighted if you could help me solve my CSA installation
> problems.
>
> Thank you very much and have a good day

I apologize for the late response. My email filtering did not work well.
I already turned off the filtering. :(

Cheers!
 - jay
>
>

From kingsley@sw.oz.au Sun Nov 7 19:38:32 2004
Received: with ECARTIS (v1.0.0; list pagg); Sun, 07 Nov 2004 19:38:35 -0800 (PST)
Received: from smtp.sw.oz.au (IDENT:FWUSER@alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA83cLuD012912 for ; Sun, 7 Nov 2004 19:38:31 -0800
Received: from kingsley.sw.oz.au (kingsley.sw.oz.au [192.41.203.97]) by smtp.sw.oz.au with ESMTP id iA83amYL027439 for ; Mon, 8 Nov 2004 14:36:48 +1100 (EST)
Received: from kingsley.sw.oz.au (localhost.localdomain [127.0.0.1]) by kingsley.sw.oz.au (8.12.10/8.12.10) with ESMTP id iA83am28018513 for ; Mon, 8 Nov 2004 14:36:48 +1100
Received: (from kingsley@localhost) by kingsley.sw.oz.au (8.12.10/8.12.10/Submit) id iA83amjl018511 for pagg@oss.sgi.com; Mon, 8 Nov 2004 14:36:48 +1100
Date: Mon, 8 Nov 2004 14:36:48 +1100
From: Kingsley Cheung
To: pagg@oss.sgi.com
Subject: [patch] Registration Check for Compulsory Hooks
Message-ID: <20041108033648.GB18308@aurema.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="zx4FCpZtqtKETZ7O"
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
X-Scanned-By: MIMEDefang 2.44
X-archive-position: 49
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: kingsley@aurema.com
Precedence: bulk
X-list: pagg

--zx4FCpZtqtKETZ7O
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi,

I was reading through the linux-2.6.9-pagg.patch when I noticed that
"init()" and "exec()" are considered optional but "attach()" and
"detach()" are compulsory. If that is true, then perhaps there
should be checks during registration to ensure "attach()" and
"detach()" are defined?

An untested patch against Linux 2.6.5 (that section of code hasn't
changed in 2.6.9) is attached.
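
The idea of rejecting a hook at registration time can be sketched outside the kernel as follows. This is a minimal user-space illustration, not the PAGG code: the struct layout, the callback signatures, the names registration_sketch_hook and registration_sketch_validate, and the length limit REGISTRATION_SKETCH_NAMELN are all assumptions made for the example (the real value of PAGG_NAMELN is not stated in this thread).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit for the example only. */
#define REGISTRATION_SKETCH_NAMELN 32

/* Hypothetical hook descriptor with two compulsory and two optional callbacks. */
struct registration_sketch_hook {
    const char *name;
    int  (*attach)(void *task);   /* compulsory */
    void (*detach)(void *task);   /* compulsory */
    int  (*init)(void *task);     /* optional, may be NULL */
    void (*exec)(void *task);     /* optional, may be NULL */
};

/* Returns 0 if the hook may be registered, -EINVAL otherwise. */
int registration_sketch_validate(const struct registration_sketch_hook *hook)
{
    if (hook == NULL)
        return -EINVAL;
    if (hook->name == NULL || strlen(hook->name) > REGISTRATION_SKETCH_NAMELN)
        return -EINVAL;
    /* Refuse hooks that would later be called through a NULL pointer. */
    if (!hook->attach || !hook->detach)
        return -EINVAL;
    return 0;
}

/* Trivial callbacks so that a valid hook can be built for a quick check. */
int registration_sketch_demo_attach(void *task) { (void)task; return 0; }
void registration_sketch_demo_detach(void *task) { (void)task; }

/* Returns 1 when a complete hook passes and an incomplete one is refused. */
int registration_sketch_selfcheck(void)
{
    struct registration_sketch_hook ok = {
        "demo", registration_sketch_demo_attach,
        registration_sketch_demo_detach, NULL, NULL
    };
    struct registration_sketch_hook missing = ok;

    missing.attach = NULL;
    assert(registration_sketch_validate(&ok) == 0);
    return registration_sketch_validate(&missing) == -EINVAL;
}
```

Failing early with an error code means the faulty client is turned away once, at registration, instead of crashing later when the missing callback would be invoked.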
Thanks,
--
		Kingsley

--zx4FCpZtqtKETZ7O
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="pagg.patch"

Index: linux/kernel/pagg.c
===================================================================
--- linux/kernel/pagg.c 22 Aug 2004 01:19:59 -0000
+++ linux/kernel/pagg.c 8 Nov 2004 03:30:39 -0000
@@ -236,6 +236,8 @@
                return -EINVAL; /* error */
        if (pagg_hook_new->name == NULL || strlen(pagg_hook_new->name) > PAGG_NAMELN)
                return -EINVAL; /* error */
+       if (!pagg_hook_new->attach || !pagg_hook_new->detach)
+               return -EINVAL; /* error */
 
        /* Try to insert new hook entry into the pagg hook list */
        down_write(&pagg_hook_list_sem);

--zx4FCpZtqtKETZ7O--

From erikj@subway.americas.sgi.com Sun Nov 7 20:37:16 2004
Received: with ECARTIS (v1.0.0; list pagg); Sun, 07 Nov 2004 20:37:20 -0800 (PST)
Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA84bGef019497 for ; Sun, 7 Nov 2004 20:37:16 -0800
Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA84awxT006357 for ; Sun, 7 Nov 2004 22:36:58 -0600
Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id iA84alam1601464; Sun, 7 Nov 2004 22:36:47 -0600 (CST)
Received: from subway.americas.sgi.com (subway.americas.sgi.com [128.162.236.152]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id iA84altC14681859; Sun, 7 Nov 2004 22:36:47 -0600 (CST)
Received: from subway.americas.sgi.com (localhost [127.0.0.1]) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/erikj-IRIX6519-news) with ESMTP id iA84alcc2613312; Sun, 7 Nov 2004 22:36:47 -0600 (CST)
Received: from localhost (erikj@localhost) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/Submit) with ESMTP id iA84akcu2613665; Sun, 7 Nov
 2004 22:36:46 -0600 (CST)
Date: Sun, 7 Nov 2004 22:36:46 -0600
From: Erik Jacobson
To: Kingsley Cheung
cc: pagg@oss.sgi.com
Subject: Re: [patch] Registration Check for Compulsory Hooks
In-Reply-To: <20041108033648.GB18308@aurema.com>
Message-ID:
References: <20041108033648.GB18308@aurema.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 50
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: erikj@subway.americas.sgi.com
Precedence: bulk
X-list: pagg

Thank you for pointing this out.

Indeed. If someone set up a pagg hook with attach set to null, the next
process forked after the kernel module was loaded would cause a panic.

So I agree we should add your change. I'll take care of this tomorrow.

Thanks!

Erik

On Mon, 8 Nov 2004, Kingsley Cheung wrote:

> Hi,
>
> I was reading through the linux-2.6.9-pagg.patch when I noticed that
> "init()" and "exec()" are considered optional but "attach()" and
> "detach()" are compulsory. If that is true, then perhaps there
> should be checks during registration to ensure "attach()" and
> "detach()" are defined?
>
> An untested patch against Linux 2.6.5 (that section of code hasn't
> changed in 2.6.9) is attached.
> > Thanks, > -- > Kingsley > -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From erikj@subway.americas.sgi.com Mon Nov 8 08:17:36 2004 Received: with ECARTIS (v1.0.0; list pagg); Mon, 08 Nov 2004 08:17:40 -0800 (PST) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA8GHa3E003077 for ; Mon, 8 Nov 2004 08:17:36 -0800 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA8HXZT8020686 for ; Mon, 8 Nov 2004 09:33:35 -0800 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id iA8GGDam1634672 for ; Mon, 8 Nov 2004 10:16:13 -0600 (CST) Received: from subway.americas.sgi.com (subway.americas.sgi.com [128.162.236.152]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id iA8GGDtC14859302 for ; Mon, 8 Nov 2004 10:16:13 -0600 (CST) Received: from subway.americas.sgi.com (localhost [127.0.0.1]) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/erikj-IRIX6519-news) with ESMTP id iA8GGCcc2682243 for ; Mon, 8 Nov 2004 10:16:12 -0600 (CST) Received: from localhost (erikj@localhost) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/Submit) with ESMTP id iA8GGCva2673015 for ; Mon, 8 Nov 2004 10:16:12 -0600 (CST) Date: Mon, 8 Nov 2004 10:16:12 -0600 From: Erik Jacobson To: pagg@oss.sgi.com Subject: Re: New pagg patch for 2.6.9 In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 51 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@subway.americas.sgi.com Precedence: bulk X-list: pagg I updated the PAGG patch to include Kingsley Cheung's fix to avoid a kernel panic if the kernel module author has a NULL attach or 
detach pointer in the PAGG hook. The kernel module will now fail to load
instead of causing a kernel panic.

Find the 'linux-2.6.9-pagg.patch-2' patch at the PAGG web site.
http://oss.sgi.com/projects/pagg/
Click on "Download" on the left.

Thanks.

--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From kingsley@sw.oz.au Mon Nov 8 15:23:01 2004
Received: with ECARTIS (v1.0.0; list pagg); Mon, 08 Nov 2004 15:23:06 -0800 (PST)
Received: from smtp.sw.oz.au (IDENT:FWUSER@alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA8NMuQd031648 for ; Mon, 8 Nov 2004 15:22:59 -0800
Received: from kingsley.sw.oz.au (kingsley.sw.oz.au [192.41.203.97]) by smtp.sw.oz.au with ESMTP id iA8N0fYL009717; Tue, 9 Nov 2004 10:00:41 +1100 (EST)
Received: from kingsley.sw.oz.au (localhost.localdomain [127.0.0.1]) by kingsley.sw.oz.au (8.12.10/8.12.10) with ESMTP id iA8N0e28019547; Tue, 9 Nov 2004 10:00:41 +1100
Received: (from kingsley@localhost) by kingsley.sw.oz.au (8.12.10/8.12.10/Submit) id iA8N0eqw019545; Tue, 9 Nov 2004 10:00:40 +1100
Date: Tue, 9 Nov 2004 10:00:39 +1100
From: Kingsley Cheung
To: Erik Jacobson
Cc: pagg@oss.sgi.com
Subject: PAGG Threading Issues (was Re: [patch] Registration Check for Compulsory Hooks)
Message-ID: <20041108230039.GE18308@aurema.com>
References: <20041108033648.GB18308@aurema.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.1i
X-Scanned-By: MIMEDefang 2.44
X-archive-position: 52
X-ecartis-version: Ecartis v1.0.0
Sender: pagg-bounce@oss.sgi.com
Errors-to: pagg-bounce@oss.sgi.com
X-original-sender: kingsley@aurema.com
Precedence: bulk
X-list: pagg

Hi Erik,

Some other issues as well. Please correct me if I'm wrong :)

1) I've noticed that pagg_detach() has been moved from
__put_task_struct() to do_exit() between the pagg patches for Linux
2.6.8.
This occurred as described at
http://oss.sgi.com/archives/pagg/2004-08/msg00002.html

Anyway, I think the move might have raised a problem with the
registration code. Below is the relevant segment of code I'd like to
consider:

        get_task_struct(p);
        read_unlock(&tasklist_lock);
        down_write(&p->pagg_sem);
        paggp = pagg_get(p, pagg_hook_new->name);
        if (paggp == NULL) {
                paggp = pagg_alloc(p, pagg_hook_new);
                if (paggp != NULL)
                        init_result = pagg_hook_new->init(p, paggp);
                else
                        init_result = -ENOMEM;
        }
        up_write(&p->pagg_sem);
        read_lock(&tasklist_lock);
        /* Like in remove_client_paggs_from_all_tasks, if the task
         * disappeared on us while we were going through the
         * for_each_process loop, we need to start over with that loop.
         * That's why we have the list_empty here */
        task_exited = list_empty(&p->tasks);
        put_task_struct(p);

Now the tasks on the task list include those that are zombies. This
occurs during exit_notify(), which sets their state to ZOMBIE but
leaves them on the task list. Therefore the above code will allocate
paggs for zombie tasks as well. [Note that task_exited won't trigger
as the task is still on the task list.]

Now unfortunately, by moving pagg_detach() out of __put_task_struct(),
when the zombie has its task structure released, the pagg that was
allocated during registration is no longer detached. Thus we could
have a memory leak once the task structure is dropped and
pagg_detach() is not invoked.

In fact the leak could happen not just for zombies, but for tasks just
about to exit. The race occurs in do_exit(), in which
pagg_detach() could be called before we allocate the pagg.

Perhaps the quickest solution is to place a pagg_detach() call back
into __put_task_struct(). Another suggestion might be to check for
the PF_EXITING flag when allocating the pagg.


2) Is pagg intended to be a container for Linux tasks or thread
groups? I'm avoiding the use of the word "process" as that could be
confusing in this context.

The reason I'm asking is that throughout the pagg registration code
the for_each_process() macro is used. This happens in
remove_client_paggs_from_all_tasks() and pagg_hook_register(). The
problem, however, is that since 2.6 for_each_process() only traverses
thread group leader tasks. Thus the registration and deregistration
code only considers thread group leaders.

In contrast the pagg_attach() and pagg_detach() hooks are placed to
catch all tasks in the fork and exit code. With the way the
registration and deregistration code is behaving, this means that we
won't catch all existing tasks and we won't clean up after all tasks
properly for multi-threaded applications.

So what is pagg intended for? To catch every single task? In that
case the do_each_thread() and while_each_thread() macros should be
used. Or is pagg a container for thread groups only? The changes for
that would require much more thought...


3) Is the "module" pointer in the pagg_hook struct meant
to be used in the pagg code or in the client code? It doesn't seem to
be used in the pagg code at all.


Thanks!
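
Point 2 above can be illustrated with a small user-space model in C. This is a sketch under assumptions, not kernel code: walk_sketch_task and the three functions are hypothetical names, the task list is a plain array, and a task counts as a thread group leader when its pid equals its tgid. The leaders-only walk stands in for the behaviour the message attributes to for_each_process() in 2.6; the every-task walk stands in for do_each_thread()/while_each_thread().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct walk_sketch_task {
    int pid;            /* unique per task */
    int tgid;           /* thread group id; equals pid for the leader */
    int has_container;  /* set once a registration pass has visited the task */
};

/* Registration pass that only visits thread group leaders. */
size_t walk_sketch_mark_leaders(struct walk_sketch_task *tasks, size_t n)
{
    size_t marked = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].pid == tasks[i].tgid) {
            tasks[i].has_container = 1;
            marked++;
        }
    }
    return marked;
}

/* Registration pass that visits every task, leaders and other threads alike. */
size_t walk_sketch_mark_all(struct walk_sketch_task *tasks, size_t n)
{
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        tasks[i].has_container = 1;
    return n;
}

/* Number of tasks that no registration pass has visited yet. */
size_t walk_sketch_count_unmarked(const struct walk_sketch_task *tasks, size_t n)
{
    size_t unmarked = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].has_container)
            unmarked++;
    }
    return unmarked;
}
```

With one three-thread group and one single-threaded task, the leaders-only pass visits two tasks and leaves the two non-leader threads without a container, which is the gap the message describes for multi-threaded applications.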
-- Kingsley From erikj@subway.americas.sgi.com Tue Nov 9 13:02:55 2004 Received: with ECARTIS (v1.0.0; list pagg); Tue, 09 Nov 2004 13:03:01 -0800 (PST) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iA9L2taJ020540 for ; Tue, 9 Nov 2004 13:02:55 -0800 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id iA9L2bxT001504 for ; Tue, 9 Nov 2004 15:02:37 -0600 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id iA9L2Qam1722533; Tue, 9 Nov 2004 15:02:26 -0600 (CST) Received: from subway.americas.sgi.com (subway.americas.sgi.com [128.162.236.152]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id iA9L2QtC14889554; Tue, 9 Nov 2004 15:02:26 -0600 (CST) Received: from subway.americas.sgi.com (localhost [127.0.0.1]) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/erikj-IRIX6519-news) with ESMTP id iA9L2Qcc2743607; Tue, 9 Nov 2004 15:02:26 -0600 (CST) Received: from localhost (erikj@localhost) by subway.americas.sgi.com (SGI-8.12.5/8.12.5/Submit) with ESMTP id iA9L2OTm2747868; Tue, 9 Nov 2004 15:02:25 -0600 (CST) Date: Tue, 9 Nov 2004 15:02:24 -0600 From: Erik Jacobson To: Kingsley Cheung cc: pagg@oss.sgi.com Subject: Re: PAGG Threading Issues (was Re: [patch] Registration Check for Compulsory Hooks) In-Reply-To: <20041108230039.GE18308@aurema.com> Message-ID: References: <20041108033648.GB18308@aurema.com> <20041108230039.GE18308@aurema.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 53 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@subway.americas.sgi.com Precedence: bulk X-list: pagg I think you make some good points here and I 
honestly need to do some research (and probably talk to some co-workers) before I can get you a good response. We've got some deadlines we're up against right now. Would it be ok if I defer this for a couple days? If you want to be sure the ball isn't dropped, you could file a Bugzilla bug: http://oss.sgi.com/bugzilla/ In any case, I hope to get back to you in a few days. If you need a response sooner let me know and I'll try to shuffle some things around. On Tue, 9 Nov 2004, Kingsley Cheung wrote: > Hi Erik, > > Some other issues as well. Please correct me if I'm wrong :) > > 1) I've noticed that pagg_detach() has been moved from > __put_task_struct() to do_exit() between the pagg patches for Linux > 2.6.8. This occurred as described at > http://oss.sgi.com/archives/pagg/2004-08/msg00002.html > > Anyway, I think the move might have raised a problem with the > registration code. Below is the relevant segment of code I'd like to > consider: > > get_task_struct(p); > read_unlock(&tasklist_lock); > down_write(&p->pagg_sem); > paggp = pagg_get(p, pagg_hook_new->name); > if (paggp == NULL) { > paggp = pagg_alloc(p, pagg_hook_new); > if (paggp != NULL) > init_result = pagg_hook_new->init(p, paggp); > else > init_result = -ENOMEM; > } > up_write(&p->pagg_sem); > read_lock(&tasklist_lock); > /* Like in remove_client_paggs_from_all_tasks, if the task > * disappeared on us while we were going through the > * for_each_process loop, we need to start over with that loop. > * That's why we have the list_empty here */ > task_exited = list_empty(&p->tasks); > put_task_struct(p); > > Now the tasks on the task list include those that are zombies. This > occurs during exit_notify(), which sets their state to ZOMBIE but > leaves them on the task list. Therefore the above code will allocate > paggs for zombie tasks as well. [Note that task_exited won't trigger > as the task is still on the task list.] 
> > Now unfortunately, by moving pagg_detach() out of __put_task_struct(), > when the zombie has its task structure released, the pagg that was > allocated during registration is no longer detached. Thus we'll could > have a memory leak once the task structure is dropped and > detach_pagg() is not invoked. > > In fact the leak could happen not just for zombies, but for tasks just > about to exit. The race occurs in do_exit(), in which the > detach_pagg() could be called before we allocate the pagg. > > Perhaps the quickest solution is to place a detach_pagg() call back > into __put_task_struct(). Another suggestion might be to check for > the PF_EXITING flag when allocating the pagg. > > > 2) Is pagg intended to be a container for Linux tasks or thread > groups? I'm avoiding the use of the word "process" as that could be > confusing in this context. > > The reason I'm asking is because throughout the pagg registration code > the for_each_process() macro is used. This happens in > remove_client_paggs_from_all_tasks() and pagg_hook_register(). The > problem is, however, since 2.6 for_each_process() only traverses > thread group leader tasks. Thus the registration and deregistration > code only considers thread group leaders. > > In contrast the pagg_attach() and pagg_detach() hooks are placed to > catch all tasks in the fork and exit code. With the way the > registration and deregistration code is behaving, this means that we > won't catch all existing tasks and we won't clean up after all tasks > properly for multi-threaded applications. > > So what is pagg intended for? To catch every single task? In that > case the do_each_thread() and while_each_thread() macros should be > used. Or is pagg a container for thread groups only? The changes for > that would require much more thought... > > > 3) Is the "module" pointer used anywhere in the pagg_hook struct meant > to be used in the pagg code or in the client code? It doesn't seem to > be used in the pagg code at all. 
> > > Thanks! > -- > Kingsley > -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From kingsley@aurema.com Tue Nov 9 22:45:24 2004 Received: with ECARTIS (v1.0.0; list pagg); Tue, 09 Nov 2004 22:45:28 -0800 (PST) Received: from smtp.sw.oz.au (IDENT:FWUSER@alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.13.0/8.13.0) with ESMTP id iAA6jMRm027095 for ; Tue, 9 Nov 2004 22:45:23 -0800 Received: from smtp.sw.oz.au (localhost [127.0.0.1]) by smtp.sw.oz.au with ESMTP id iAA6esSP023394; Wed, 10 Nov 2004 17:40:54 +1100 (EST) Received: (from kingsley@localhost) by smtp.sw.oz.au id iAA6erkF023387; Wed, 10 Nov 2004 17:40:53 +1100 (EST) Date: Wed, 10 Nov 2004 17:40:53 +1100 From: kingsley@aurema.com To: Erik Jacobson Cc: pagg@oss.sgi.com Subject: Re: PAGG Threading Issues (was Re: [patch] Registration Check for Compulsory Hooks) Message-ID: <20041110064053.GB6092@aurema.com> References: <20041108033648.GB18308@aurema.com> <20041108230039.GE18308@aurema.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i X-Scanned-By: MIMEDefang 2.44 X-archive-position: 54 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kingsley@aurema.com Precedence: bulk X-list: pagg On Tue, Nov 09, 2004 at 03:02:24PM -0600, Erik Jacobson wrote: > I think you make some good points here and I honestly need to do some > research (and probably talk to some co-workers) before I can get you a > good response. We've got some deadlines we're up against right now. > > Would it be ok if I defer this for a couple days? Sure thing, that's fine with me. > > If you want to be sure the ball isn't dropped, you could file a Bugzilla > bug: > > http://oss.sgi.com/bugzilla/ > Okay, I might do that. > In any case, I hope to get back to you in a few days. If you need a > response sooner let me know and I'll try to shuffle some things around. 
> > On Tue, 9 Nov 2004, Kingsley Cheung wrote: > > > Hi Erik, > > > > Some other issues as well. Please correct me if I'm wrong :) (snip) Thanks, -- Kingsley