From kingsley@aurema.com Wed Sep 14 02:09:23 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 14 Sep 2005 02:09:28 -0700 (PDT) Received: from smtp.sw.oz.au (alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8E99KiL028341 for ; Wed, 14 Sep 2005 02:09:22 -0700 Received: from smtp.sw.oz.au (localhost [127.0.0.1]) by smtp.sw.oz.au with ESMTP id j8E96RXk006835; Wed, 14 Sep 2005 19:06:27 +1000 (EST) Received: (from kingsley@localhost) by smtp.sw.oz.au id j8E96QfM006834; Wed, 14 Sep 2005 19:06:26 +1000 (EST) Date: Wed, 14 Sep 2005 19:06:26 +1000 From: kingsley@aurema.com To: Erik Jacobson Cc: pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050914090626.GK13682@aurema.com> References: <20050617014512.GA10285@aurema.com> <20050623143301.GB32764@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050623143301.GB32764@sgi.com> User-Agent: Mutt/1.4.2.1i X-Scanned-By: MIMEDefang 2.52 on 192.41.203.35 X-archive-position: 104 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kingsley@aurema.com Precedence: bulk X-list: pagg On Thu, Jun 23, 2005 at 09:33:01AM -0500, Erik Jacobson wrote: > Kingsley - my first attempt skipped the list, sorry. > > > While testing the propagation of pagg_attach errors to fork() I > > noticed that the detach callback is called again for the client > > I'm sorry it's taking a while for me to get back to you. > I had kicked your patch around to a couple people internally and > I think we want to investigate the error path more before we > take it as part of the PAGG patch. > > Does anybody else on the list have thoughts on this change? > > Thanks for the submission. I'd like to do a bit more research. Hi Erik, Has there been any progress on this? Thanks, -- Kingsley From erikj@sgi.com Fri Sep 16 08:30:12 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 16 Sep 2005 08:30:18 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8GFUCiL014323 for ; Fri, 16 Sep 2005 08:30:12 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8GHTYJ4009509 for ; Fri, 16 Sep 2005 10:29:34 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8GFQ3DN15764529 for ; Fri, 16 Sep 2005 10:26:03 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8GFQ2S93473994 for ; Fri, 16 Sep 2005 10:26:02 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id BD4976022F4A; Fri, 16 Sep 2005 10:26:02 -0500 (CDT) Date: Fri, 16 Sep 2005 10:26:02 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: Future of PAGG? 
Message-ID: <20050916152602.GB4739@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 105 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg To be honest, I've been a bit frustrated at how to proceed with PAGG. Yesterday, we kicked around some ideas to perhaps propose to the community that implement things other ways. As far as we can tell, these things are not as efficient as PAGG. For example, we explored the use of notifier lists that are already available in the kernel. This implements the callback portion of PAGG, but not the portion that associates data per-task. Another co-worker also observed that locking of notifier lists isn't really provided by the notifier list infrastructure itself - so unless you're really careful, a module could remove itself from the list while it's being walked. Another problem with notifier lists is that it probably would have reduced performance. Instead of knowing you can "do nothing" if the pagg list is null per task (and that task is probably already cached on the machine), you have to walk a notifier list each time. It will possibly reduce fork performance. The only reason a pagg-like system using notifier lists would get accepted is because it uses tools already in the kernel instead of our own setup. So this part might be attractive to some in the community. In the past, we had pushed PAGG more for its grouping abilities rather than calling it something like "task notifier list with data" (or some slick name that means that). We described it like this somewhat because the community seems to frown on generic callbacks. Perhaps the world has changed now. After all, notifier lists are generic callouts and are in the kernel now... Of course, they aren't in the fork or exit paths. All of this is compounded by a lack of support from PAGG's users. We know various people outside of SGI use PAGG, but they never have stepped up to say they are users when it counts. If we could have some users, our community position would be improved. I think having callouts in fork, exec, and exit are really needed. If you look at what we do during a fork in the kernel, you can spot a few things in the generic kernel itself that could use generic callouts or PAGG instead. It has the potential to reduce, at least a little bit, the number of calls made in a fork. At a meeting yesterday, I was asked to look in to implementing Linux Job, SGI inescapable Job Containers, without using PAGG. Instead, I was asked to try these notifier lists. Because we don't feel we can get a Job ID in the task struct, I'll need to implement table lookups to associate a task with data about the task. After some discussion today, I'm not sure notifier lists are the answer due to reduced performance and locking issues. So what do you think? Should I try to implement a reduced version of PAGG that uses notifier lists for the callout piece and pagg lists like we have today for the task associated data? (performance issues, locking issues with notifier lists). Again, this is only attractive because it uses tools in the kernel itself. I thought one idea is I could give PAGG, mostly as-is, one more shot. I can reduce it to it's bare essentials, perhaps removing some functionality. 
I can re-name it to something that better describes what it does, and try once again to get it accepted by the community of LKML. I thought I'd start on LSE-tech before LKML to get some ideas. Does this sound like a good approach? I'd like to work with this mailing list to try to organize support for this. If there are PAGG users, and you don't want to see us stop maintaining PAGG, maybe you could join the discussion so people know the patch is used. SGI really needs something that is PAGG-like for its open sourced projects such as Job, CSA, and two open-source but non-pushed projects in-house. But the community is interested in more than "a patch SGI needs for itself." If we can't get PAGG in, we'll have to work out other ways to get our open source projects that use PAGG accepted. In the end, I think lack of users is the biggest problem with getting something PAGG-like accepted. Please let me know if there are other ideas. From erikj@sgi.com Sat Sep 17 08:36:53 2005 Received: with ECARTIS (v1.0.0; list pagg); Sat, 17 Sep 2005 08:36:58 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8HFariL003974 for ; Sat, 17 Sep 2005 08:36:53 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8HHanmx023368 for ; Sat, 17 Sep 2005 10:36:49 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8HFYADN15832790 for ; Sat, 17 Sep 2005 10:34:10 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8HFYAS93528299 for ; Sat, 17 Sep 2005 10:34:10 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 16FAD6028D21; Sat, 17 Sep 2005 10:34:10 -0500 (CDT) Date: Sat, 17 Sep 2005 10:34:10 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: PAGG ideas for next attempt: new docs, new name? Message-ID: <20050917153409.GA17708@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 106 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg I'm looking for feedback on these ideas. I'm sending this to the PAGG list. After I gather feedback from you and some co-workers, I'll be posting this to lse-tech and some other folks as well. I'll then work on the code side of the changes. Please see the justification section for sure. I'm not sure I should say that stuff - so let me know if it is silly to put there. (I'll wait until Monday or Tuesday to send this off to a broader audience to be sure it isn't lost in the weekend). === I am re-working what used to be PAGG to have a new name, better documentation, and better variable names. My hope is that I can present this to the community for inclusion in the kernel and I'm hoping to have a couple of the users of this help by explaining how they use it. I feel one reason PAGG didn't get attention was because it's true function was obscured by its name and the names of functions and variables within. The first step of this for me was to write some new documentation using the new names for the pieces. 
Before I propose this to the broader community, I'd like to get feedback. After that, I plan to re-write the code to match and post it. If a variable seems too long (some are), perhaps provide a suggested shorter name. The name of PN itself is fair game. It turns out it was hard to pick a name for this thing.

Process Notification (PN)
-------------------------

PN provides a method (service) for kernel modules to be notified when certain events happen in the life of a process. Events we support include fork, exit, and exec. A special init event is also supported (see events below). More events could be added. PN also provides a generic data pointer for the modules to work with so that data can be associated per process.

A kernel module will register (pn_register) a service request (pn_service_request) with PN. The request tells PN which notifications the kernel module wants. The kernel module passes along function pointers to be called for these events (exit, fork, exec) in the service request.

From the process point of view, each process has a kernel module subscriber list (pn_module_subscriber_list). These kernel modules are the ones who want notification about the life of the process. As described above, each kernel module subscriber on the list has a generic data pointer to point to data associated with the process.

In the case of fork, PN will allocate the same kernel module subscriber list for the new child that existed for the parent. The kernel module's function pointer for fork is also called so the kernel module can do whatever it needs to do when a parent forks. For exit, similar things happen, but the exit function pointer for each kernel module subscriber is called and the kernel module subscriber list for that task is deleted.

Events
------

Events are stages of a process's life that kernel modules care about. The fork event is a spot in copy_process when a parent forks. The exit event happens when a process is going away. We also support an exec event, which happens when a process execs. Finally, there is an init event. This special event makes it so this kernel module will be associated with all current processes in the system. This is used when a kernel module wants to keep track of all current processes as opposed to just those it associates by itself (and children that follow).

The events a kernel module cares about are set up in the pn_service_request structure - see usage below. When setting up a pn_service_request structure, you designate which events you care about by either associating NULL (meaning you don't care about that event) or a pointer to the function to run when the event is triggered. fork and exit are currently required.

How do processes become associated with kernel modules?
--------------------------------------------------------

Your kernel module itself can use the pn_alloc function to associate a given process with a given pn_service_request structure. This adds your kernel module to the subscriber list of the process. In the case of inescapable job containers making use of PAM, when PAM allows a person to log in, PAM contacts job (via a PAM job module which uses the job userland library) and the kernel Job code will call pn_alloc to associate the process with PN. From that point on, the kernel module will be notified about events in the process's life that the module cares about. Likewise, your kernel module can remove an association between it and a given process by using pn_subscriber_free. A sketch of such a call follows.
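For illustration, here is a minimal sketch of how a kernel module might attach itself to a task it has located (for example, in response to a request from userland). The pn_alloc() prototype used here (task plus service request, returning the new subscriber or NULL) and the write-locking of the task's pn_subscriber_list_sem are assumptions based on the description above and on the Job snippet later in this document; check the actual PN patch for the real signatures.

/*
 * Hypothetical helper: subscribe this module to events for @task.
 * pn_alloc()'s exact prototype is an assumption (see note above); the
 * pn_service_request used is the one shown in Example Usage below.
 */
static int track_task(struct task_struct *task)
{
	struct pn_subscriber *subscriber;

	down_write(&task->pn_subscriber_list_sem);	/* guard the subscriber list */
	subscriber = pn_alloc(task, &pn_service_request);
	if (subscriber)
		subscriber->data = NULL;	/* per-task module data can be hung here */
	up_write(&task->pn_subscriber_list_sem);

	return subscriber ? 0 : -ENOMEM;
}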
Example Usage
-------------

=== filling out the pn_service_request structure ===

A kernel module wishing to use PN needs to set up a pn_service_request structure. This structure tells PN which events you care about and what functions to call when those events are triggered. In addition, you supply a name (usually the kernel module name). The entry is always filled out as shown below. .module is usually set to THIS_MODULE. data can be optionally used to store a pointer with the service request structure.

Example of a filled out pn_service_request:

static struct pn_service_request pn_service_request = {
	.module = THIS_MODULE,
	.name = "test_module",
	.data = NULL,
	.entry = LIST_HEAD_INIT(pn_service_request.entry),
	.init = test_init,
	.fork = test_attach,
	.exit = test_detach,
	.exec = test_exec,
};

The above pn_service_request says the kernel module "test_module" cares about events fork, exit, exec, and init. In fork, call the kernel module's test_attach function. In exec, call test_exec. In exit, call test_detach. The init event is specified, so all processes on the system will be associated with this kernel module and the test_init function will be run for each.

=== Registering with PN ===

You will likely register with PN in your kernel module's module_init function. Here is an example:

static int __init test_module_init(void)
{
	int rc = pn_register(&pn_service_request);
	if (rc < 0) {
		return -1;
	}
	return 0;
}

=== Example init event function ===

Since the init event is defined, it means this kernel module is added to the subscriber list of all processes -- it will receive notification about events it cares about for all processes and all children that follow. Of course, if a kernel module doesn't need to know about all current processes, that module shouldn't implement this and '.init' in the pn_service_request structure would be NULL. This is as opposed to the normal method where the kernel module adds itself to the subscriber list of a process using pn_alloc.

static int test_init(struct task_struct *tsk, struct pn_subscriber *subscriber)
{
	if (pn_get_subscriber(tsk, "test_module") == NULL)
		dprintk("ERROR PN expected \"%s\" PID = %d\n",
			"test_module", tsk->pid);
	dprintk("FYI PN init hook fired for PID = %d\n", tsk->pid);
	atomic_inc(&init_count);
	return 0;
}

=== Example fork (test_attach) function ===

This function is executed when a process forks - this is associated with the pn_callout callout in copy_process. There would be a very similar test_detach function (not shown). PN will add the kernel module to the notification list for the child process automatically and then execute this fork function pointer (test_attach in this example). However, the kernel module can control whether the kernel module stays on the process's subscriber list and wants notification by the return value. A negative value results in the fork failing. Zero is success. >0 means success, but the kernel module doesn't want to be associated with that specific process (doesn't want notification). In other words, if >0 is returned, your kernel module is saying that it doesn't want to be on the subscriber list for this process.

static int test_attach(struct task_struct *tsk, struct pn_subscriber *subscriber, void *vp)
{
	dprintk("PN attach hook fired for PID = %d\n", tsk->pid);
	atomic_inc(&attach_count);
	return 0;
}

=== Example exec event function ===

And here is an example function to run when a task gets to exec. So any time a "tracked" process gets to exec, this would execute.
More hooks/callouts similar to this one could be implemented as there is demand for them.

static void test_exec(struct task_struct *tsk, struct pn_subscriber *subscriber)
{
	dprintk("PN exec hook fired for PID %d\n", tsk->pid);
	atomic_inc(&exec_count);
}

=== Unregistering with PN ===

You will likely wish to unregister with PN in the kernel module's module_exit function. Here is an example:

static void __exit test_module_cleanup(void)
{
	pn_unregister(&pn_service_request);
	printk("detach called %d times...\n", atomic_read(&detach_count));
	printk("attach called %d times...\n", atomic_read(&attach_count));
	printk("init called %d times...\n", atomic_read(&init_count));
	printk("exec called %d times ...\n", atomic_read(&exec_count));
	if (atomic_read(&attach_count) + atomic_read(&init_count) !=
	    atomic_read(&detach_count))
		printk("PN PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
	else
		printk("Good - attach count + init count equals detach count.\n");
}

=== Actually using data associated with the process in your module ===

The above examples show you how to create an example kernel module using PN, but they don't show what you might do with the data pointer associated with a given process. Linux Inescapable Jobs is a good example of making use of PN. Some versions of it use PAGG, which is what PN is based on. A new Job patch should be available soon if not already. See oss.sgi.com/projects/pagg.

A Job is a group of processes from which a process cannot escape. A batch scheduling system such as LSF may use Job to put possibly otherwise unrelated processes together to be tracked and signaled as a set, including any children that follow. If the Job PAM module is used, each login process gets a job ID and the children become part of the job by default.

In Job, we want to know whenever a parent forks a new process or whenever a process exits. So Job gets notified for these events, and adds the process to the list of processes in the job (or removes them in the case of exit). To efficiently add a job, we need to know which Job the parent was in. This information, in our case, is what is stored in the data pointer within the pn_subscriber structure associated with a given process.

pn_get_subscriber is used to retrieve the PN subscriber for a given process and kernel module. Like this:

	subscriber = pn_get_subscriber(task, name);

Where name is your kernel module's name (as provided in the pn_service_request structure) and task is the process you're interested in.

Please be careful about locking. The task structure has a pn_subscriber_list_sem to be used for locking. An example code snip follows:

	/* We have a valid task now */
	get_task_struct(task);          /* Ensure the task doesn't vanish on us */
	read_unlock(&tasklist_lock);    /* Unlock the tasklist */

	down_write(&task->pn_subscriber_list_sem);  /* write lock subscriber list */
	subscriber = pn_get_subscriber(task, pagg_hook.name);
	if (subscriber) {
		detachjid.r_jid = ((struct job_attach *)subscriber->data)->job->jid;
		subscriber->pn_subscriber_request->detach(task, subscriber);
		pn_subscriber_free(subscriber);
	} else {
		errcode = -ENODATA;
	}
	put_task_struct(task);          /* Done accessing the task */
	up_write(&task->pn_subscriber_list_sem);    /* write unlock subscriber list */

In the above snip, we make sure we have a task that won't disappear on us. Then we write lock the pn_subscriber_list_sem to be sure it doesn't change on us. We write lock (rather than read) because we're going to be removing an entry from it.
If there is a subscriber for this kernel module matching the given process, we store the jid (job identifier in Job), we call our own detach function directly (in Job, this associated with the exit event), and we remove the subscriber from the subscriber list. This means this kernel module will no longer get notifications of events for this task. The detachjid.r_jid line above is an example of retrieving data from the data pointer for the given subscriber. History ------- Process Notification used to be known as PAGG (Process Aggregates). It was re-written to be called Process Notification because we believe this better describes its purpose. Structures and functions were re-named to be more clear and to reflect the new name. Why Not Notifier Lists? ----------------------- We investigated the use of notifier lists, available in newer kernels. There were two reasons we didn't use them to implement PAGG. 1) There seems to be some tricky locking issues with notifier lists. For example, if a kernel module exits while the notifier list is walked, we could have trouble. There may be means to work around this 2) Notifier lists would not be as efficient as PN for kernel modules wishing to associate data with processes. With PN, if the pn_subscriber_list of a given task is NULL, we can instantly know there are no kernel modules that care about the process. Further, the callbacks happen in places were the task struct is likely to be cached. So this is a quick operation. With notifier lists, the scope is system wide rather than per process. As long as one kernel module wants to be notified, we have to walk the notifier list and potentially waste cycles. Some Justification ------------------ Some have argued that PAGG in the past shouldn't be used because it will allow interesting things to be implemented outside of the kernel. While this might be a small risk, having these in place allows customers and users to implement kernel components that you don't want to see in the kernel anyway. SGI may have HPC needs that very few other people are interested in. We in fact have 4 open source projects that make use of PAGG (and will convert to PN). At least one of these projects is urgent for our customers but is simply not interesting to enough people to maintain in the kernel itself. In a world where all customers need to run on standard distributions to be supported by the distributor, we're left in a situation where: a) The distributor doesn't want to take patches not accepted in the kernel b) The community wants everything important in the kernel c) The community wants only things having multiple users in the kernel d) SGI has things that are only interesting to SGI systems and it's customers (not multiple users) e) There is no option to re-build kernels while staying in a supported environment. We find it hard to support customers in this catch 22 situation. PN allows us to implement our open source projects outside of the mainline kernel. We do offer things like Job for inclusion, but so far haven't met with success in getting it accepted. We feel PN is very useful for kernel components already in the kernel too. There is a potential to reduce the number of calls in the copy_process path, for example. One could also envision things in the task struct that are used slightly less frequently could be implemented to use PN. 
-- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From pj@sgi.com Sat Sep 17 10:47:23 2005 Received: with ECARTIS (v1.0.0; list pagg); Sat, 17 Sep 2005 10:47:28 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8HHlNiL011877 for ; Sat, 17 Sep 2005 10:47:23 -0700 Received: from nodin.corp.sgi.com (nodin.corp.sgi.com [192.26.51.193]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8HJlKMi012025 for ; Sat, 17 Sep 2005 12:47:20 -0700 Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by nodin.corp.sgi.com (SGI-8.12.5/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8HHhebT110505423 for ; Sat, 17 Sep 2005 10:43:40 -0700 (PDT) Received: from v0 (mtv-vpn-hw-masa-1.corp.sgi.com [134.15.25.210]) by cthulhu.engr.sgi.com (SGI-8.12.5/8.12.5) with SMTP id j8HHgdps14648002; Sat, 17 Sep 2005 10:42:39 -0700 (PDT) Date: Sat, 17 Sep 2005 10:42:39 -0700 From: Paul Jackson To: Erik Jacobson Cc: pagg@oss.sgi.com Subject: Re: PAGG ideas for next attempt: new docs, new name? Message-Id: <20050917104239.26cb7e49.pj@sgi.com> In-Reply-To: <20050917153409.GA17708@sgi.com> References: <20050917153409.GA17708@sgi.com> Organization: SGI X-Mailer: Sylpheed version 2.0.0beta5 (GTK+ 2.4.9; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 107 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: pj@sgi.com Precedence: bulk X-list: pagg Erik wrote: > I'm looking for feedback on these ideas. Oohhh - lots of nice words. A couple of random thoughts on first glance now, then I will try to give this a closer read later today. Let me place in evidence the other notifier thingies currently in the kernel: dnotify - directory (Stephen Rothwell) fsnotify - filesystem (Robert Love - a 'FAM' like thing) inotify - inode based (John McCutchan - basis of fsnotify) notify - generic (Alan Cox needed for network devices) First, observe that these other notifiers don't use two-letter acronyms, but rather pseudo-words, to name themselves and to prefix their kernel global symbols. The "TN" name is too cryptic. You need a pseudo-word, that you use consistently and methodically, every place possible. When someone sees a line of kernel code mentioning "inotify_inode_queue_event", they have a pretty good idea what sort of subsystem is involved. When someone sees a mention of "pn_get_subscriber", they will likely not realize this as quickly. Perhaps a few long standing types in the kernel, such as tasks and inodes, get to use the very short, familiar names of just a letter or two, but the less well known types requirer longer more explicit names. Besides the base name 'notify', two other possibilities that come to my mind for the base part of the name are 'callout' and 'hook'. I'm partial to 'callout'. For one thing, this distinguishes rather nicely between two different mechanisms: 1) Some thread is asking to have notice sent to it of particular kinds of events, and 2) You want threads to callout to an extra piece of code when they undergo particular kinds of events. The rule of thumb I'd suggest is to use 'notify' when the receiver is some other thread, and 'callout' when the receiver is a code snippet executing in the context of the thread originally experiencing the event of interest. 
Beware that the above "notify" mechanisms may or may not follow this rule of thumb; I don't know without thinking harder than I want to right now. There are 268 instances of the 7-char string "callout" in all the kernel source, 5273 instances of the 5-char string "notif", and 2456 instances of the 4-char string "hook". In the 28534 symbols that list in a "nm vmlinux" of a kernel I have at hand, there are 10 instances of the 4-char string "hook", 161 instances of the 5-char string "notif", and zero (0) of "callout". So besides having a suitable meaning, "callout" doesn't collide with existing kernel names. Other words that come to mind that might be worth playing with here: trigger, event, handler and exit (as in IBM's MVS "user exit", "file exit" and "installation exit".) You might want to peruse the literature for IBM-style exits, and look for an opportunity to get someone from IBM with experience in such mechanisms to contemplate what it would take to provide them for Linux in a community acceptable form. However, "callout" will convey the intended meaning to far more Linux hackers than "exit", which will only convey the intended sense to those with IBM background (or at least a beer drinking friend who is expert in such ;). If this were MVS, I'd be recommending "task exit". In any event, you might want to list the other notifier like mechanisms (listed above) in your post, and compare and contrast (whatever happened to Carl Rigg ?) them with your proposed mechanism. Anyhow ... a couple more thoughts, besides the naming issue. If the current notifier lists have technical limitations with locking and efficiency, then what would it take to fix them up, rather than introduce a new mechanism? Are these limitations inherent and unavoidable in any mechanism that has the API of the current notifier lists, or are they an internal accident of the implementation? If the latter, can the implementation be fixed? If the former, can you clearly explain why notifier list, or anything so conceived and so dedicated with such an API, must necessarily suffer from such technical limitations? A key concern, which you face head on (good!) is that such mechanisms as this "allow interesting things to be implemented outside of the kernel." You explain nicely why we need such, but you don't explain how we keep some proprietary competitor of Linux from abusing your mechanism. I'd prefer that this mechanism only allow GPL loadable modules to hook into it, and I wish there were someway to ensure that the portion "outside the kernel" was also GPL. There are legal and competitive business issues here that need to be addressed. -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson 1.925.600.0401 From pj@sgi.com Mon Sep 19 02:41:31 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 02:41:41 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8J9fViL013075 for ; Mon, 19 Sep 2005 02:41:31 -0700 Received: from nodin.corp.sgi.com (nodin.corp.sgi.com [192.26.51.193]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JBfFo6021384 for ; Mon, 19 Sep 2005 04:41:15 -0700 Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by nodin.corp.sgi.com (SGI-8.12.5/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8J9bLbT114872374 for ; Mon, 19 Sep 2005 02:37:21 -0700 (PDT) Received: from v0 (mtv-vpn-hw-masa-1.corp.sgi.com [134.15.25.210]) by cthulhu.engr.sgi.com (SGI-8.12.5/8.12.5) with SMTP id j8J9aKps14991294; Mon, 19 Sep 2005 02:36:20 -0700 (PDT) Date: Mon, 19 Sep 2005 02:36:20 -0700 From: Paul Jackson To: Erik Jacobson Cc: pagg@oss.sgi.com Subject: Re: PAGG ideas for next attempt: new docs, new name? Message-Id: <20050919023620.38ec1820.pj@sgi.com> In-Reply-To: <20050917153409.GA17708@sgi.com> References: <20050917153409.GA17708@sgi.com> Organization: SGI X-Mailer: Sylpheed version 2.0.0beta5 (GTK+ 2.4.9; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 108 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: pj@sgi.com Precedence: bulk X-list: pagg Erik wrote: > I feel one reason PAGG didn't get attention was because it's true function > was obscured by its name and the names of functions and variables within. I tend to agree. > Finally, there is an init event. When does this init event occur - at the beginning of something, I presume. I'm just not clear of what. > fork event is a spot in copy_process when a parent forks. So this event is in the parent, not the child? That seems slightly odd. > The init event is specified, so all processes on the system will > be associated with this kernel module and the test_init function > will be run for each. Ah .. run when? Perhaps the last line above would be clearer as: > will be run for each process in the system when the module is loaded. > int rc = pn_register(&pn_service_request); > if (rc < 0) { > return -1; > } Should that "return -1" be a "return -ERRNO" for some error number? > unrelated processes together to be tracked and signaled as ia set Is that "ia" a typo? > /* We have a valid task now */ > get_task_struct(task); /* Ensure the task doesn't vanish on us */ > read_unlock(&tasklist_lock); /* Unlock the tasklist */ > ... What is the piece of code, beginning with these three lines, doing? > In a world where all customers need to run on standard distributions to > be supported by the distributor, we're left in a situation where: Can this section be turned around into something more positive, and less SGI specific. And can the problems that seem to be associated with this be directly addressed: 1) It could be abused by competitors of Open Source, to leverage Linux kernel work while avoiding GPL constraints on their key code (much as happens with device drivers now, e.g. Nvidia). 2) It opens up a Pandoras box of opportunities for poor quality (or at least inadequately tested) code compromising the stability of the system, with attendant support nightmares. 
For example, "user exit" code for IBM operating systems was one of the areas that had the greatest difficulty adapting to Y2K. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson 1.925.600.0401 From erikj@sgi.com Mon Sep 19 06:23:55 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 06:24:02 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8JDNtiL002736 for ; Mon, 19 Sep 2005 06:23:55 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JFNkF5018782 for ; Mon, 19 Sep 2005 08:23:46 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8JDJnDN15950134; Mon, 19 Sep 2005 08:19:49 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8JDJnS93642837; Mon, 19 Sep 2005 08:19:49 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 0ED146022F4A; Mon, 19 Sep 2005 08:19:49 -0500 (CDT) Date: Mon, 19 Sep 2005 08:19:49 -0500 From: Erik Jacobson To: Paul Jackson Cc: Erik Jacobson , pagg@oss.sgi.com Subject: Re: PAGG ideas for next attempt: new docs, new name? Message-ID: <20050919131948.GA4488@sgi.com> References: <20050917153409.GA17708@sgi.com> <20050917104239.26cb7e49.pj@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050917104239.26cb7e49.pj@sgi.com> User-Agent: Mutt/1.5.6i X-archive-position: 109 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > A couple of random thoughts on first glance now, then I will try to > give this a closer read later today. Hi Paul. I changed over to pnotify, which makes variable names really long. But I understand what you're saying here. > If the current notifier lists have technical limitations with locking > and efficiency, then what would it take to fix them up, rather than > introduce a new mechanism? Are these limitations inherent and > unavoidable in any mechanism that has the API of the current notifier > lists, or are they an internal accident of the implementation? If the > latter, can the implementation be fixed? If the former, can you > clearly explain why notifier list, or anything so conceived and so > dedicated with such an API, must necessarily suffer from such technical > limitations? I'm still munching on this part. > A key concern, which you face head on (good!) is that such mechanisms > as this "allow interesting things to be implemented outside of the > kernel." You explain nicely why we need such, but you don't explain I added a blurb about exporting the symbols with EXPORT_SYMBOL_GPL. I also changed the Justification quite a bit per your suggestions in a separate email. I'll post a new version a little later. 
From erikj@sgi.com Mon Sep 19 07:09:11 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 07:09:26 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8JE9BiL004595 for ; Mon, 19 Sep 2005 07:09:11 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JG95ve024831 for ; Mon, 19 Sep 2005 09:09:05 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8JE59DN15955434; Mon, 19 Sep 2005 09:05:09 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8JE59S93652656; Mon, 19 Sep 2005 09:05:09 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 0E89E6022F4A; Mon, 19 Sep 2005 09:05:09 -0500 (CDT) Date: Mon, 19 Sep 2005 09:05:09 -0500 From: Erik Jacobson To: Paul Jackson Cc: Erik Jacobson , pagg@oss.sgi.com Subject: Re: PAGG ideas for next attempt: new docs, new name? Message-ID: <20050919140508.GA8488@sgi.com> References: <20050917153409.GA17708@sgi.com> <20050917104239.26cb7e49.pj@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050917104239.26cb7e49.pj@sgi.com> User-Agent: Mutt/1.5.6i X-archive-position: 110 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > If the current notifier lists have technical limitations with locking > and efficiency, then what would it take to fix them up, rather than > introduce a new mechanism? Are these limitations inherent and > unavoidable in any mechanism that has the API of the current notifier > lists, or are they an internal accident of the implementation? If the > latter, can the implementation be fixed? If the former, can you > clearly explain why notifier list, or anything so conceived and so > dedicated with such an API, must necessarily suffer from such technical > limitations? I removed the stuff I said about locking issues. They probably exist, but I am not quite sure how they would be solved. So I instead focused on the efficency aspects. The reason they are less efficient is because, as long as there is one subscriber to the notifer list somewhere on the system, you always have a list to walk. With process notification, you only walk the list if a kernel module is interested in a given task. That way, if a kernel module is only associated with a few tasks on the system, we don't end up walking lists all the time. The other piece that is missing is a data pointer associated with a task. Without that, you'd have to add entries to the task struct or implement table lookups to find data associated with processes. One solution that Jack Steiner actually wrote up a prototype for over the weekend is notifier lists in the task struct itself. So if that is interesting to folks, we have at least some data on it. I haven't tried to implement Job on top of it yet but if people think that direction is interesting, I can implement Job on this sooner. 
Otherwise, I'm more comfortable with something closer to PAGG that has received a lot of exposure already and has all the features the community here has requested so far (except for one outstanding request from Kingsley). Erik From erikj@sgi.com Mon Sep 19 08:08:17 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 08:08:31 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8JF8GiL007310 for ; Mon, 19 Sep 2005 08:08:17 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JF5XxT003086 for ; Mon, 19 Sep 2005 10:05:33 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8JF5VDN15957018; Mon, 19 Sep 2005 10:05:32 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8JF5US93659905; Mon, 19 Sep 2005 10:05:30 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id CC4826022F4A; Mon, 19 Sep 2005 10:05:30 -0500 (CDT) Date: Mon, 19 Sep 2005 10:05:30 -0500 From: Erik Jacobson To: Paul Jackson Cc: Erik Jacobson , pagg@oss.sgi.com Subject: Re: PAGG ideas for next attempt: new docs, new name? Message-ID: <20050919150530.GD8488@sgi.com> References: <20050917153409.GA17708@sgi.com> <20050919023620.38ec1820.pj@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050919023620.38ec1820.pj@sgi.com> User-Agent: Mutt/1.5.6i X-archive-position: 111 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > When does this init event occur - at the beginning of something, > I presume. I'm just not clear of what. Ok, I added "at the time of registration": ... Finally, there is an init event. This special event makes it so this kernel module will be associated with all current processes in the system at the time of registration. This is used when a kernel module wants to keep track of all current processes as opposed to just those it associates by itself (and children that follow). > > fork event is a spot in copy_process when a parent forks. > > So this event is in the parent, not the child? That seems > slightly odd. The module gets notified when the parent forks and the child is being created. The child receives the same allocation list that the parent had but the kernel module has some control here based on return value to decide if the new process should really be associated with the kernel module or not. > > int rc = pn_register(&pn_service_request); > > if (rc < 0) { > > return -1; > > } > > Should that "return -1" be a "return -ERRNO" for some error number? I'm not sure; we certainly haven't been doing that so far. > > unrelated processes together to be tracked and signaled as ia set > Is that "ia" a typo? Yes :) > > /* We have a valid task now */ > > get_task_struct(task); /* Ensure the task doesn't vanish on us */ > > read_unlock(&tasklist_lock); /* Unlock the tasklist */ > > ... > What is the piece of code, beginning with these three lines, doing? I'll try to come up with a better generic example. 
It's supposed to show how to use the data pointer because the over-simple examples for using pnotify earlier in the doc aren't sophisticated enough to show that. Erik From erikj@sgi.com Mon Sep 19 08:38:31 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 08:38:38 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8JFcViL012864 for ; Mon, 19 Sep 2005 08:38:31 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JHc6BZ004036 for ; Mon, 19 Sep 2005 10:38:06 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8JFY9DN15961102 for ; Mon, 19 Sep 2005 10:34:09 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8JFY9S93624561 for ; Mon, 19 Sep 2005 10:34:09 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 1534A6022F4A; Mon, 19 Sep 2005 10:34:09 -0500 (CDT) Date: Mon, 19 Sep 2005 10:34:09 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: Revised Process Notification proposed docs Message-ID: <20050919153408.GA13872@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 112 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg Ok, here is a second pass at this... I am re-working what used to be PAGG to have a new name, better documentation, and better variable names. My hope is that I can present this to the community for inclusion in the kernel and I'm hoping to have a couple of the users of this help by explaining how they use it. I feel one reason PAGG didn't get attention was because it's true function was obscured by its name and the names of functions and variables within. The first step of this for me was to write some new documentation using the new names for the pieces. Before I propose this to the broader community, I'd like to get feedback. After that, I plan to re-write the code to match and post it. If a variable seems too long (some are), perhaps provide a suggested shorer name. The name of pnotify itself is fair game. It turns out it was hard to pick a name for this thing. Process Notification (pnotify) -------------------- pnotify provides a method (service) for kernel modules to be notified when certain events happen in the life of a process. Events we support include fork, exit, and exec. A special init event is also supported (see events below). More events could be added. pnotify also provides a generic data pointer for the modules to work with so that data can be associated per process. A kernel module will register (pnotify_register) a service request (pnotify_service_request) with pnotify. The request tells pnotify which notifications the kernel module wants. The kernel module passes along function pointers to be called for these events (exit, fork, exec) in the service request. From the process point of view, each process has a kernel module subscriber list (pnotify_module_subscriber_list). These kernel modules are the ones who want notification about the life of the process. 
As described above, each kernel module subscriber on the list has a generic data pointer to point to data associated with the process. In the case of fork, pnotify will allocate the same kernel module subscriber list for the new child that existed for the parent. The kernel module's function pointer for fork is also called so the kernel module can do what ever it needs to do when a parent forks. For exit, similar things happen but the exit function pointer for each kernel module subscriber is called and the kernel module subscriber list for that task is deleted. Events ------ Events are stages of a processes life that kernel modules care about. The fork event is a spot in copy_process when a parent forks. The exit event happens when a process is going away. We also support an exec event, which happens when a process execs. Finally, there is an init event. This special event makes it so this kernel module will be associated with all current processes in the system at the time of registration. This is used when a kernel module wants to keep track of all current processes as opposed to just those it associates by itself (and children that follow). The events a kernel module cares about are set up in the pnotify_service_request structure - see usage below. When setting up a pnotify_service_request structure, you designate which events you care about by either associating NULL (meaning you don't care about that event) or a pointer to the function to run when the event is triggered. fork and exit are currently required. How do processes become associated with kernel modules? ------------------------------------------------------- Your kernel module itself can use the pnotify_alloc function to associate a given process with a given pnotify_service_request structure. This adds your kernel module to the subscriber list of the process. In the case of inescapable job containers making use of PAM, when PAM allows a person to log in, PAM contacts job (via a PAM job module which uses the job userland library) and the kernel Job code will call pnotify_alloc to associate the process with pnotify. From that point on, the kernel module will be notified about events in the process's life that the module cares about. Likewise, your kernel module can remove an association between it and a given process by using pnotify_subscriber_free. Example Usage ------------- === filling out the pnotify_service_request structure === A kernel module wishing to use pnotify needs to set up a pnotify_service_request structure. This structure tells pnotify which events you care about and what functions to call when those events are triggered. In addition, you supply a name (usually the kernel module name). The entry is always filled out as shown below. .module is usually set to THIS_MODULE. data can be optionally used to store a pointer with the service request structure. Example of a filled out pnotify_service_request: static struct pnotify_service_request pnotify_service_request = { .module = THIS_MODULE, .name = "test_module", .data = NULL, .entry = LIST_HEAD_INIT(pnotify_service_request.entry), .init = test_init, .fork = test_attach, .exit = test_detach, .exec = test_exec, }; The above pnotify_service_request says the kernel module "test_module" cares about events fork, exit, exec, and init. In fork, call the kernel module's test_attach function. In exec, call test_exec. In exit, call test_detach. 
The init event is specified, so all processes on the system will be associated with this kernel module and the test_init function will be run for each. === Registering with pnotify === You will likely register with pnotify in your kernel module's module_init function. Here is an example: static int __init test_module_init(void) { int rc = pnotify_register(&pnotify_service_request); if (rc < 0) { return -1; } return 0; } === Example init event function ==== Since the init event is defined, it means this kernel module is added to the subscriber list of all processes -- it will receive notification about events it cares about for all processes and all children that follow. Of course, if a kernel module doesn't need to know about all current processes, that module shouldn't implement this and '.init' in the pnotify_service_request structure would be NULL. This is as opposed to the normal method where the kernel module adds itself to the subscriber list of a process using pnotify_alloc. static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber) { if (pnotify_get_subscriber(tsk, "test_module") == NULL) dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid); dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid); atomic_inc(&init_count); return 0; } === Example fork (test_attach) function === This function is executed when a process forks - this is associated with the pnotify_callout callout in copy_process. There would be a very similar test_detach function (not shown). pnotify will add the kernel module to the notification list for the child process automatically and then execute this fork function pointer (test_attach in this example). However, the kernel module can control whether the kernel module stays on the process's subscriber list and wants notification by the return value. A negative value results in the fork failing. zero is success. >0 means success, but the kernel module doesn't want the to be associated with that specific process (doesn't want notification). In other words, if >0 is returned, your kernel module is saying that it doesn't want to be on the subscriber list for this process. static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp) { dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid); atomic_inc(&attach_count); return 0; } === Example exec event function === And here is an example function to run when a task gets to exec. So any time a "tracked" process gets to exec, this would execute. More hooks/callouts similar to this one could be implemented as there is demand for them. static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber) { dprintk("pnotify exec hook fired for PID %d\n", tsk->pid); atomic_inc(&exec_count); } === Unregistering with pnotify === You will likely wish to unregister with pnotify in the kernel module's module_exit function. 
Here is an example:

static void __exit test_module_cleanup(void)
{
	pnotify_unregister(&pnotify_service_request);
	printk("detach called %d times...\n", atomic_read(&detach_count));
	printk("attach called %d times...\n", atomic_read(&attach_count));
	printk("init called %d times...\n", atomic_read(&init_count));
	printk("exec called %d times ...\n", atomic_read(&exec_count));
	if (atomic_read(&attach_count) + atomic_read(&init_count) !=
	    atomic_read(&detach_count))
		printk("pnotify PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
	else
		printk("Good - attach count + init count equals detach count.\n");
}

=== Actually using data associated with the process in your module ===

The above examples show you how to create an example kernel module using pnotify, but they didn't show what you might do with the data pointer associated with a given process. Below, find an example of accessing the data pointer for a given task from within a kernel module making use of pnotify.

pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given process and kernel module. Like this:

	subscriber = pnotify_get_subscriber(task, name);

Where name is your kernel module's name (as provided in the pnotify_service_request structure) and task is the process you're interested in.

Please be careful about locking. The task structure has a pnotify_subscriber_list_sem to be used for locking. This example retrieves a given task in a way that ensures it doesn't disappear while we try to access it (that's why we do locking for the tasklist_lock and task). The pnotify subscriber list is locked to ensure the list doesn't change as we search it with pnotify_get_subscriber.

	read_lock(&tasklist_lock);
	get_task_struct(task);          /* Ensure the task doesn't vanish on us */
	read_unlock(&tasklist_lock);    /* Unlock the tasklist */

	down_read(&task->pnotify_subscriber_list_sem);  /* read lock subscriber list */
	subscriber = pnotify_get_subscriber(task, name);
	if (subscriber) {
		/* Get the widgitId associated with this task */
		widgitId = ((widgitId_t *)subscriber->data);
	}
	put_task_struct(task);          /* Done accessing the task */
	up_read(&task->pnotify_subscriber_list_sem);    /* unlock subscriber list */

History
-------

Process Notification used to be known as PAGG (Process Aggregates). It was re-written to be called Process Notification because we believe this better describes its purpose. Structures and functions were re-named to be more clear and to reflect the new name.

Why Not Notifier Lists?
-----------------------

We investigated the use of notifier lists, available in newer kernels. Notifier lists would not be as efficient as pnotify for kernel modules wishing to associate data with processes. With pnotify, if the pnotify_subscriber_list of a given task is NULL, we can instantly know there are no kernel modules that care about the process. Further, the callbacks happen in places where the task struct is likely to be cached. So this is a quick operation. With notifier lists, the scope is system wide rather than per process. As long as one kernel module wants to be notified, we have to walk the notifier list and potentially waste cycles. In the case of pnotify, we only walk lists if we're interested in a specific task. On a system where pnotify is used to track only a few processes, the overhead of walking the notifier list is high compared to the overhead of walking the kernel module subscriber list only when a kernel module is interested in a given process. A short sketch of the difference follows.
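As a rough illustration of the difference, consider what the fork path has to do in each case. This is a conceptual sketch only: pnotify_subscriber_list is the per-task list described above, while fork_notifier_chain and FORK_EVENT are made-up names standing in for a hypothetical global notifier list built on the notifier_call_chain() interface of the time; none of these are symbols from either patch.

/* pnotify-style: the common case in copy_process is a cheap per-task test. */
static inline int pnotify_fork_sketch(struct task_struct *parent,
				      struct task_struct *child)
{
	if (parent->pnotify_subscriber_list == NULL)
		return 0;	/* no subscribers for this task; nothing to do */
	/* ...walk only this task's subscribers and call their fork hooks... */
	return 0;
}

/* notifier-list-style: one registered module anywhere on the system means
 * every fork walks the global chain, whether or not anyone is interested
 * in this particular task. */
static struct notifier_block *fork_notifier_chain;
#define FORK_EVENT 1

static inline void fork_notify_sketch(struct task_struct *child)
{
	notifier_call_chain(&fork_notifier_chain, FORK_EVENT, child);
}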
Overlooking performance issues, notifier lists in and of themselves wouldn't solve the problem pnotify solves anyway. Although you could argue notifier lists can implement the callback portion of pnotify, there is no association of data with a given process. This is needed for kernel modules to efficiently associate a task with a data pointer without cluttering up the task struct.

Some Justification
------------------

We feel that pnotify could be used to reduce the size of the task struct or the number of function calls in copy_process. For example, if another part of the kernel needs to know when a process is forking or exiting, it could use pnotify instead of adding additional code to the task struct, copy_process, or exit.

Some have argued in the past that PAGG shouldn't be used because it allows interesting things to be implemented outside of the kernel. While this might be a small risk, having these hooks in place allows customers and users to implement kernel components that you don't want to see in the kernel anyway. For example, a certain vendor may have an urgent need to implement kernel functionality or special types of accounting that nobody else is interested in. That doesn't mean the code isn't open source; it just means it isn't applicable to all of Linux because it satisfies a niche. All of pnotify's functionality that needs to be exported is exported with EXPORT_SYMBOL_GPL to discourage abuse.

The risk already exists in the kernel for people to implement modules outside the kernel that suffer from less peer review and possibly bad programming practice. pnotify could add more opportunities for out-of-tree kernel module authors to make new modules. I believe this is somewhat mitigated by the already-existing 'tainted' warnings in the kernel.

From erikj@sgi.com Mon Sep 19 09:57:05 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 09:57:17 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8JGv2iL019163 for ; Mon, 19 Sep 2005 09:57:04 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8JIvCG1014897 for ; Mon, 19 Sep 2005 11:57:12 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8JGsFDN15962524; Mon, 19 Sep 2005 11:54:15 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8JGsFS93674713; Mon, 19 Sep 2005 11:54:15 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 03AFF6022F4A; Mon, 19 Sep 2005 11:54:14 -0500 (CDT) Date: Mon, 19 Sep 2005 11:54:14 -0500 From: Erik Jacobson To: pagg@oss.sgi.com, Christoph Lameter Subject: another new rev of the docs... Message-ID: <20050919165414.GA18134@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 113 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

Here is another revision taking in many suggestions from Dean Nelson. The return values for the fork function pointer are defined, and some function names changed.
===

I am re-working what used to be PAGG to have a new name, better documentation, and better variable names. My hope is that I can present this to the community for inclusion in the kernel, and I'm hoping a couple of its users will help by explaining how they use it. I feel one reason PAGG didn't get attention was that its true function was obscured by its name and by the names of the functions and variables within. The first step for me was to write new documentation using the new names for the pieces. Before I propose this to the broader community, I'd like to get feedback. After that, I plan to re-write the code to match and post it. If a variable name seems too long (some are), perhaps suggest a shorter one. The name of pnotify itself is fair game; it turns out it was hard to pick a name for this thing.

Process Notification (pnotify)
------------------------------

pnotify provides a method (service) for kernel modules to be notified when certain events happen in the life of a process. Events we support include fork, exit, and exec. A special init event is also supported (see Events below). More events could be added. pnotify also provides a generic data pointer for the modules to work with so that data can be associated with each process.

A kernel module registers a service request describing the events it cares about (a pnotify_events structure) using pnotify_register. The request tells pnotify which notifications the kernel module wants. The kernel module passes along function pointers to be called for these events (exit, fork, exec) in the pnotify_events service request.

From the process point of view, each process has a kernel module subscriber list (pnotify_module_subscriber_list). These kernel modules are the ones that want notification about the life of the process. As described above, each kernel module subscriber on the list has a generic data pointer to point to data associated with the process.

In the case of fork, pnotify will allocate the same kernel module subscriber list for the new child that existed for the parent. The kernel module's function pointer for fork is also called for the child being constructed so the kernel module can do whatever it needs to do when a parent forks this child. Special return values apply to the fork event that don't apply to the others. They are described in the fork example below.

For exit, similar things happen, but the exit function pointer for each kernel module subscriber is called and the kernel module subscriber entry for that process is deleted.

Events
------

Events are stages of a process's life that kernel modules care about. The fork event is triggered at a certain location in copy_process when a parent forks. The exit event happens when a process is going away. We also support an exec event, which happens when a process execs.

Finally, there is an init event. This special event causes the kernel module to be associated with all current processes in the system at the time of registration. It is used when a kernel module wants to keep track of all current processes, as opposed to just those it associates by itself (and the children that follow).

The events a kernel module cares about are set up in the pnotify_events structure - see usage below. When setting up a pnotify_events structure, you designate which events you care about by associating either NULL (meaning you don't care about that event) or a pointer to the function to run when the event is triggered. The fork event is currently required.
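As a quick illustration of the NULL convention just described (a sketch only; the complete example in the next section shows a module that uses every event), a module that only wants fork and exit notification might fill out its structure like this. The names my_events, my_module, my_fork, and my_exit are made up for this sketch:

/* Sketch: subscribe to fork and exit only.  .init and .exec are NULL
 * because this hypothetical module doesn't care about those events. */
static struct pnotify_events my_events = {
        .module = THIS_MODULE,
        .name   = "my_module",
        .data   = NULL,
        .entry  = LIST_HEAD_INIT(my_events.entry),
        .init   = NULL,         /* don't attach to already-running processes */
        .fork   = my_fork,      /* required */
        .exit   = my_exit,
        .exec   = NULL,         /* not interested in exec */
};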
How do processes become associated with kernel modules?
--------------------------------------------------------

Your kernel module itself can use the pnotify_subscribe function to associate a given process with a given pnotify_events structure. This adds your kernel module to the subscriber list of the process.

In the case of inescapable job containers making use of PAM, when PAM allows a person to log in, PAM contacts job (via a PAM job module which uses the job userland library) and the kernel Job code calls pnotify_subscribe to associate the process with pnotify. From that point on, the kernel module will be notified about the events in the process's life that the module cares about (as well as any children that process may later have).

Likewise, your kernel module can remove an association between itself and a given process by using pnotify_unsubscribe.

Example Usage
-------------

=== Filling out the pnotify_events structure ===

A kernel module wishing to use pnotify needs to set up a pnotify_events structure. This structure tells pnotify which events you care about and what functions to call when those events are triggered. In addition, you supply a name (usually the kernel module name). The .entry field is always filled out as shown below. .module is usually set to THIS_MODULE. .data can optionally be used to store a pointer with the pnotify_events structure.

Example of a filled out pnotify_events:

static struct pnotify_events pnotify_events = {
        .module = THIS_MODULE,
        .name = "test_module",
        .data = NULL,
        .entry = LIST_HEAD_INIT(pnotify_events.entry),
        .init = test_init,
        .fork = test_attach,
        .exit = test_detach,
        .exec = test_exec,
};

The above pnotify_events structure says the kernel module "test_module" cares about the fork, exit, exec, and init events. On fork, call the kernel module's test_attach function. On exec, call test_exec. On exit, call test_detach. The init event is specified, so all processes on the system will be associated with this kernel module during registration and the test_init function will be run for each.

=== Registering with pnotify ===

You will likely register with pnotify in your kernel module's module_init function. Here is an example:

static int __init test_module_init(void)
{
        int rc = pnotify_register(&pnotify_events);

        if (rc < 0) {
                return -1;
        }
        return 0;
}

=== Example init event function ===

Since the init event is defined, this kernel module is added to the subscriber list of all processes -- it will receive notification about the events it cares about for all processes and all children that follow. Of course, if a kernel module doesn't need to know about all current processes, that module shouldn't implement this and '.init' in the pnotify_events structure would be NULL. This is as opposed to the normal method, where the kernel module adds itself to the subscriber list of a process using pnotify_subscribe.

static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
{
        if (pnotify_get_subscriber(tsk, "test_module") == NULL)
                dprintk("ERROR pnotify expected \"%s\" PID = %d\n",
                        "test_module", tsk->pid);
        dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid);
        atomic_inc(&init_count);
        return 0;
}

=== Example fork (test_attach) function ===

This function is executed when a process forks - it is associated with the pnotify_callout callout in copy_process. There would be a very similar test_detach function (not shown; see the sketch below).
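Before continuing with the fork callback, here is a rough sketch of what such a test_detach (exit) function might look like. The argument list and void return type mirror the exec example and are assumptions for illustration only; the real exit callback prototype may differ:

/* Sketch only: a detach (exit) handler analogous to test_attach.  The
 * prototype below is assumed to match the exec example; adjust it to
 * the real pnotify exit callback signature. */
static void test_detach(struct task_struct *tsk,
                        struct pnotify_subscriber *subscriber)
{
        dprintk("pnotify detach hook fired for PID = %d\n", tsk->pid);
        atomic_inc(&detach_count);
}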
pnotify will add the kernel module to the notification list for the child process automatically and then execute this fork function pointer (test_attach in this example). However, through its return value, the kernel module can control whether it stays on the process's subscriber list and keeps receiving notification:

PNOTIFY_ERROR - prevent the process from continuing, failing the fork
PNOTIFY_OK    - good, adds the kernel module to the subscriber list for the process
PNOTIFY_NOSUB - good, but don't add the kernel module to the subscriber list for the process

static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp)
{
        dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid);
        atomic_inc(&attach_count);
        return PNOTIFY_OK;
}

=== Example exec event function ===

And here is an example function to run when a task gets to exec. So any time a "tracked" process gets to exec, this would execute.

static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
{
        dprintk("pnotify exec hook fired for PID %d\n", tsk->pid);
        atomic_inc(&exec_count);
}

=== Unregistering with pnotify ===

You will likely wish to unregister with pnotify in the kernel module's module_exit function. Here is an example:

static void __exit test_module_cleanup(void)
{
        pnotify_unregister(&pnotify_events);
        printk("detach called %d times...\n", atomic_read(&detach_count));
        printk("attach called %d times...\n", atomic_read(&attach_count));
        printk("init called %d times...\n", atomic_read(&init_count));
        printk("exec called %d times ...\n", atomic_read(&exec_count));
        if (atomic_read(&attach_count) + atomic_read(&init_count) !=
            atomic_read(&detach_count))
                printk("pnotify PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
        else
                printk("Good - attach count + init count equals detach count.\n");
}

=== Actually using data associated with the process in your module ===

The above examples show you how to create an example kernel module using pnotify, but they don't show what you might do with the data pointer associated with a given process. Below is an example of accessing the data pointer for a given process from within a kernel module making use of pnotify.

pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given process and kernel module. Like this:

        subscriber = pnotify_get_subscriber(task, name);

Here, name is your kernel module's name (as provided in the pnotify_events structure) and task is the process you're interested in.

Please be careful about locking. The task structure has a pnotify_subscriber_list_sem to be used for locking. This example retrieves a given task in a way that ensures it doesn't disappear while we try to access it (that's why we take the tasklist_lock and a reference on the task). The pnotify subscriber list is locked to ensure the list doesn't change as we search it with pnotify_get_subscriber.

        read_lock(&tasklist_lock);
        get_task_struct(task);          /* Ensure the task doesn't vanish on us */
        read_unlock(&tasklist_lock);    /* Unlock the tasklist */

        down_read(&task->pnotify_subscriber_list_sem);  /* readlock subscriber list */
        subscriber = pnotify_get_subscriber(task, name);
        if (subscriber) {
                /* Get the widgitId associated with this task */
                widgitId = ((widgitId_t *)subscriber->data);
        }
        put_task_struct(task);          /* Done accessing the task */
        up_read(&task->pnotify_subscriber_list_sem);    /* unlock subscriber list */

History
-------

Process Notification used to be known as PAGG (Process Aggregates).
It was re-written and renamed Process Notification because we believe this better describes its purpose. Structures and functions were re-named to be clearer and to reflect the new name.

Why Not Notifier Lists?
-----------------------

We investigated the use of notifier lists, available in newer kernels. Notifier lists would not be as efficient as pnotify for kernel modules wishing to associate data with processes. With pnotify, if the pnotify_subscriber_list of a given task is NULL, we instantly know there are no kernel modules that care about the process. Further, the callbacks happen in places where the task struct is likely to be cached, so this is a quick operation.

With notifier lists, the scope is system wide rather than per process. As long as one kernel module wants to be notified, we have to walk the notifier list and potentially waste cycles. In the case of pnotify, we only walk lists if we're interested in a specific task. On a system where pnotify is used to track only a few processes, the overhead of walking the notifier list is high compared to the overhead of walking the kernel module subscriber list only when a kernel module is interested in a given process.

Overlooking performance issues, notifier lists in and of themselves wouldn't solve the problem pnotify solves anyway. Although you could argue notifier lists can implement the callback portion of pnotify, there is no association of data with a given process. This is needed for kernel modules to efficiently associate a task with a data pointer without cluttering up the task struct.

Some Justification
------------------

We feel that pnotify could be used to reduce the size of the task struct or the number of function calls in copy_process. For example, if another part of the kernel needs to know when a process is forking or exiting, it could use pnotify instead of adding additional code to the task struct, copy_process, or exit.

Some have argued in the past that PAGG shouldn't be used because it allows interesting things to be implemented outside of the kernel. While this might be a small risk, having these hooks in place allows customers and users to implement kernel components that you don't want to see in the kernel anyway. For example, a certain vendor may have an urgent need to implement kernel functionality or special types of accounting that nobody else is interested in. That doesn't mean the code isn't open source; it just means it isn't applicable to all of Linux because it satisfies a niche. All of pnotify's functionality that needs to be exported is exported with EXPORT_SYMBOL_GPL to discourage abuse.

The risk already exists in the kernel for people to implement modules outside the kernel that suffer from less peer review and possibly bad programming practice. pnotify could add more opportunities for out-of-tree kernel module authors to make new modules. I believe this is somewhat mitigated by the already-existing 'tainted' warnings in the kernel.
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From erikj@sgi.com Mon Sep 19 17:22:52 2005 Received: with ECARTIS (v1.0.0; list pagg); Mon, 19 Sep 2005 17:22:54 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8K0MqiL028154 for ; Mon, 19 Sep 2005 17:22:52 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8K0K8xT002815 for ; Mon, 19 Sep 2005 19:20:08 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8K0K8DN15986933; Mon, 19 Sep 2005 19:20:08 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8K0K7S93710952; Mon, 19 Sep 2005 19:20:07 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 91AED6022F4A; Mon, 19 Sep 2005 19:20:07 -0500 (CDT) Date: Mon, 19 Sep 2005 19:20:07 -0500 From: Erik Jacobson To: pagg@oss.sgi.com, steiner@sgi.com, clameter@sgi.com Subject: New pagg patch progress Message-ID: <20050920002007.GA9813@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 114 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

It took me longer than expected to get PAGG turned into pnotify and change Job to use the new pnotify stuff. I just finished but haven't done any testing. So to those of you I told I would post the patch today - I'm sorry; it will be done tomorrow. At the very least, I'm hoping the new patch will spur some discussion and that at least some accepted solution will come about. I think the pnotify stuff has good exposure because it's been around on SGI machines for quite some time, and it serves its purpose well. However, at this point I'd settle for anything that has the basic functionality needed to implement Job. Let's see where tomorrow takes us. I'd appreciate list member involvement in the discussion.
Erik

From erikj@sgi.com Tue Sep 20 08:17:04 2005 Received: with ECARTIS (v1.0.0; list pagg); Tue, 20 Sep 2005 08:17:17 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8KFH2iL031486 for ; Tue, 20 Sep 2005 08:17:04 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8KFEIxT003974 for ; Tue, 20 Sep 2005 10:14:18 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8KFEHDN16030276 for ; Tue, 20 Sep 2005 10:14:17 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8KFEHS93747731 for ; Tue, 20 Sep 2005 10:14:17 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 889C86028D21; Tue, 20 Sep 2005 10:14:17 -0500 (CDT) Date: Tue, 20 Sep 2005 10:14:17 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: job version to be posted, recent job fixes Message-ID: <20050920151417.GA24846@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 115 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

I just wanted people to know that the version of Job I plan to post using the new pnotify version of pagg is not the jobfs variant.

The last time Job got a bunch of community feedback, they suggested using a jobfs implementation instead of the /proc/job ioctl interface. We implemented that. It does work, but for certain customer situations, the overhead of the inode operations to control job is quite costly. Although most customers wouldn't hit this, at least one big customer would have. In one of the test suite tests, we fork around 40,000 processes, maybe more, to see if job suffers from a duplicate JID issue that a customer reported. In that test case, where job controls are issued for each process at least once, the test takes 10 minutes or more to run compared to less than 20 seconds with the old version. The hold-up was due to inode operations in jobfs.

We were trying to decide which way to go -- to try to figure out if there is a way to speed up the inode operations or just go with the tried-and-true kernel implementation. During this time, we found a couple of other bugs that I didn't fix because I didn't know which way we were going - jobfs or the old way.

Some bugs that will be fixed in the version of job I'm planning to post today include:

- Duplicate JIDs were possible when the process table wrapped - we changed JID computation to be based on a counter instead of a PID (see the sketch at the end of this message)
- Some code that never executes was purged from job_sys_create
- A hang (locking logic error) was possible in rare situations in job_sys_create
- send_sig_info doesn't check for signal zero (status check) any more, so we changed to use group_send_sig_info, which requires the tasklist to be locked during the call. The bug here was that an invalid signal ended up being passed that could wake up things that didn't expect to be woken up.

I just wanted folks to know what was going on with the job patch.
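For illustration only, the counter-based JID idea mentioned in the first fix above could look something like the sketch below. This is not the actual Job code; the function and variable names are made up, and the real implementation and its locking may differ.

/* Illustrative sketch: hand out JIDs from a monotonically increasing
 * counter so IDs are not reused even when PIDs wrap.  Names here are
 * hypothetical and not taken from the Job patch. */
static DEFINE_SPINLOCK(jid_lock);
static u64 next_jid = 1;

static u64 alloc_jid(void)
{
        u64 jid;

        spin_lock(&jid_lock);
        jid = next_jid++;
        spin_unlock(&jid_lock);

        return jid;
}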
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From erikj@sgi.com Tue Sep 20 09:32:35 2005 Received: with ECARTIS (v1.0.0; list pagg); Tue, 20 Sep 2005 09:32:43 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8KGWZiL009103 for ; Tue, 20 Sep 2005 09:32:35 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8KGToxT017959 for ; Tue, 20 Sep 2005 11:29:50 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8KGTnDN16033758 for ; Tue, 20 Sep 2005 11:29:50 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8KGTnS93550093 for ; Tue, 20 Sep 2005 11:29:49 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 693B76028D21; Tue, 20 Sep 2005 11:29:49 -0500 (CDT) Date: Tue, 20 Sep 2005 11:29:49 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: Re: job version to be posted, recent job fixes Message-ID: <20050920162949.GA29495@sgi.com> References: <20050920151417.GA24846@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050920151417.GA24846@sgi.com> User-Agent: Mutt/1.5.6i X-archive-position: 116 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

Hi. I got a strong suggestion that I should make both the jobfs version of job and the all-kernel proc ioctl version available for the community to look at and compare. That means I need to port the fixes to the jobfs version and re-test it in addition to changing it to use pnotify. I have some personal stuff to take care of this afternoon, so once again I'm pushing the posting forward until either late tonight or tomorrow to give me time to complete and test the changes in the jobfs version of job, both to convert it to pnotify and to fix the outstanding bugs. I'm sorry my estimated time keeps slipping, but this time it's because I've been asked to do more than I planned :)

PS: The kernel proc/ioctl version of job and the new pnotify did pass my regression tests this morning, so that's good.
Erik

From erikj@sgi.com Wed Sep 21 12:57:47 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 21 Sep 2005 12:57:51 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LJvkiL002981 for ; Wed, 21 Sep 2005 12:57:47 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8LKvIN0006589 for ; Wed, 21 Sep 2005 13:57:19 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8LJt1DN16113553 for ; Wed, 21 Sep 2005 14:55:01 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8LJt1S93765451 for ; Wed, 21 Sep 2005 14:55:01 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id DDB336022F49; Wed, 21 Sep 2005 14:55:00 -0500 (CDT) Date: Wed, 21 Sep 2005 14:55:00 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: jobfs implementation of Linux Job - testing only Message-ID: <20050921195500.GA21918@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 117 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

I have decided to move the jobfs version of Job to "test only" status. It will appear in a test directory. Based on my testing, it just isn't stable and has some performance issues and other problems that need to be fixed if we choose to continue down that path. It isn't clear that the jobfs implementation is the way to go - I'm not sure some of the performance issues are solvable with jobfs.

The non-jobfs implementation (the tried-and-true one) is the one that is stable and supported. Both versions require the job package (libraries and commands) to function. However, the jobfs version moved much of the processing out of the kernel and into the library. Of course, both require the pnotify patch (formerly PAGG) as well.

The ftp site is being re-organized to make it clear what is test, what is stable, etc. I'll post on that shortly.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From erikj@sgi.com Wed Sep 21 14:03:55 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 21 Sep 2005 14:04:02 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LL3siL009822 for ; Wed, 21 Sep 2005 14:03:54 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8LM3R2e016157 for ; Wed, 21 Sep 2005 15:03:27 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8LL09DN16117331 for ; Wed, 21 Sep 2005 16:00:09 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8LL08S93802445 for ; Wed, 21 Sep 2005 16:00:08 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id C19706022F49; Wed, 21 Sep 2005 16:00:08 -0500 (CDT) Date: Wed, 21 Sep 2005 16:00:08 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: ftp site re-organized Message-ID: <20050921210008.GA25218@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 118 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

Hi. As promised, here is some information on the re-organized directories on the pagg ftp site. Later, depending on what happens in the community, I'll probably need to look into changing the web site itself to take PAGG's new name into account.

If you click on the 'download' link on the upper left of this web page:

http://oss.sgi.com/projects/pagg/

Or if you ftp to oss.sgi.com and change to /projects/pagg/download

This is the same location as before. Here is the directory layout:

old.......old pagg and job files
pnotify...New Process Notification implementation (formerly pagg)
job.......job patch and userland pieces - stable version
job-test..job patch and userland pieces - jobfs version, unstable, test

Both the stable job and the jobfs implementation in job-test have been updated to use pnotify. The pnotify patch is in the pnotify directory. The job patches and pnotify were all tested against 2.6.13.2 but should apply to any recent kernel. I provided the source RPMs and pre-built RPMs for ia64 and x86 for the Job userland library.

Documentation for pnotify can be found in the Documentation/pnotify.txt file after applying the pnotify patch. It includes the variable name changes from PAGG to pnotify at the end of the document. The README file in job-test describes some of the current problems with the jobfs implementation.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

From erikj@sgi.com Tue Sep 27 13:14:11 2005 Received: with ECARTIS (v1.0.0; list pagg); Tue, 27 Sep 2005 13:14:22 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8RKEBiL013673 for ; Tue, 27 Sep 2005 13:14:11 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8RLEVij031143 for ; Tue, 27 Sep 2005 14:14:31 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8RKAKDN16521944; Tue, 27 Sep 2005 15:10:20 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8RKAKS94178506; Tue, 27 Sep 2005 15:10:20 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 536FE6022F49; Tue, 27 Sep 2005 15:10:20 -0500 (CDT) Date: Tue, 27 Sep 2005 15:10:20 -0500 From: Erik Jacobson To: Kingsley Cheung Cc: Erik Jacobson , pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050927201020.GA30433@sgi.com> References: <20050617014512.GA10285@aurema.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050617014512.GA10285@aurema.com> User-Agent: Mutt/1.5.6i X-archive-position: 119 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg

I fixed this in the RCU version of pnotify I'm working on per the lse-tech community discussion - thanks for the reminder the other day (in a non-list email).

If the RCU version crashes and burns for some reason and we go back to the non-RCU one, I'll need to make the fix there too. The function now looks like this. I hope this is what you had in mind (untested as of this moment).

/**
 * __pnotify_fork - Add kernel module subscriber to same subscribers as parent
 * @to_task: The child task that will inherit the parent's subscribers
 * @from_task: The parent task
 *
 * Used to attach a new task to the same subscribers the parent has in its
 * subscriber list.
 *
 * The "from" argument is the parent task. The "to" argument is the child
 * task.
 *
 * See Documentation/pnotify.txt for details on
 * how to handle return codes from the attach function pointer.
 *
 * Locking: The to_task is currently in-construction, so we don't
 * need to worry about write-locks. We do need to be sure the parent's
 * subscriber list, which we copy here, doesn't go away on us. This is
 * done via RCU.
 *
 */
int
__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task)
{
        struct pnotify_subscriber *from_subscriber;
        int ret;

        /* We need to be sure the parent's list we copy from doesn't disappear */
        rcu_read_lock();

        list_for_each_entry_rcu(from_subscriber, &from_task->pnotify_subscriber_list, entry) {
                struct pnotify_subscriber *to_subscriber = NULL;

                to_subscriber = pnotify_subscribe(to_task, from_subscriber->events);
                if (!to_subscriber) {
                        ret = -ENOMEM;
                        __pnotify_exit(to_task);
                        rcu_read_unlock();
                        return ret;
                }
                ret = to_subscriber->events->fork(to_task, to_subscriber,
                                                  from_subscriber->data);

                rcu_read_unlock();      /* no more to do with the parent's data */

                if (ret < 0) {
                        /* Propagates to copy_process as a fork failure */
                        /* No __pnotify_exit because there is one in the failure path
                         * for copy_process in fork.c */
                        return ret;     /* Fork failure */
                } else if (ret > 0) {
                        /* Success, but the fork function pointer in the pnotify_events
                         * structure doesn't want the kernel module subscribed */
                        /* Again, this is the in-construction child so no write lock */
                        pnotify_unsubscribe(to_subscriber);
                }
        }

        return 0;       /* success */
}

From kaigai@ak.jp.nec.com Wed Sep 28 04:38:19 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 04:38:33 -0700 (PDT) Received: from tyo202.gate.nec.co.jp (TYO206.gate.nec.co.jp [202.32.8.206]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SBcJiL006649 for ; Wed, 28 Sep 2005 04:38:19 -0700 Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.160] (may be forged)) by tyo202.gate.nec.co.jp (8.11.7/3.7W01080315) with ESMTP id j8SBZAb04220; Wed, 28 Sep 2005 20:35:10 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id j8SBZAg21600; Wed, 28 Sep 2005 20:35:10 +0900 (JST) Received: from mailsv.linux.bs1.fc.nec.co.jp (namesv2.linux.bs1.fc.nec.co.jp [10.34.125.2]) by mailsv4.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id j8SBZ9b26517; Wed, 28 Sep 2005 20:35:09 +0900 (JST) Received: from [10.34.125.249] (sanma.linux.bs1.fc.nec.co.jp [10.34.125.249]) by mailsv.linux.bs1.fc.nec.co.jp (Postfix) with ESMTP id EE7F52FE04; Wed, 28 Sep 2005 20:34:50 +0900 (JST) Message-ID: <433A7FE4.5040109@ak.jp.nec.com> Date: Wed, 28 Sep 2005 20:35:00 +0900 From: Kaigai Kohei User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317) X-Accept-Language: ja, en-us, en MIME-Version: 1.0 To: Erik Jacobson Cc: Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> In-Reply-To: <20050927201020.GA30433@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 120 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kaigai@ak.jp.nec.com Precedence: bulk X-list: pagg

Hi,

Erik Jacobson wrote:
> I fixed this in the RCU version of pnotify I'm working on per the lse-tech community discussion - thanks for the reminder the other day (in a non-list email).
>
> If the RCU version crashes and burns for some reason and we go back to the non-RCU one, I'll need to make the fix there too. The function now looks like this. I hope this is what you had in mind (untested as of this moment).

In my understanding, _write_ operations cannot be implemented without locking, even if we can use RCU.
(In addition, an RCU-conscious writing/update style is required.)

For example, pnotify permits attaching a new pnotify_subscriber object to another task. If someone calls pnotify_subscribe() for another task which is doing fork(), there is a possibility of breaking the pnotify_subscriber_list of the victim task.

Therefore, procedures that perform updates, such as __pnotify_fork(), should be serialized by some locking. RCU is very effective for seldom-write/frequently-read paths, such as SELinux's Access Vector Cache (AVC). But it's not a cure-all, and it restricts the write methodology.

In the past, I made a proposal to apply RCU to PAGG. But it might be inappropriate for pnotify/PAGG as a general framework.

I'd like to draw attention to another point. The current pnotify implementation requires holding pnotify_event_list_sem before calling pnotify_get_events(). Therefore, we must repeat read_lock/unlock(&tasklist_lock) in the do_each_thread()/while_each_thread() loop as follows:

----------------------------
read_lock(&tasklist_lock);
do_each_thread(g, p) {
    get_task_struct(p);
    read_unlock(&tasklist_lock);

    down_read(&p->pnotify_subscriber_list_sem);
    subscriber = pnotify_get_subscriber(p, events->name);
    :
    up_read(&p->pnotify_subscriber_list_sem);
    read_lock(&tasklist_lock);
    << checking, p is dead or not ? >>
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
----------------------------

I would be happy if pnotify_subscriber_list were protected by an rwlock.

If an rwlock is used, we can not implement pnotify_subscribe() with the current spec. But would it be impossible to prepare pnotify_subscribe_atomic() or pnotify_subscribe_bind(), which associates a task_struct with a pre-allocated pnotify_events object?

---- in rwlock world :-) ---
read_lock(&tasklist_lock);
do_each_thread(g, p) {
    read_lock(&p->pnotify_subscriber_list_rwlock);
    subscriber = pnotify_get_subscriber(p, events->name);
    :
    read_unlock(&p->pnotify_subscriber_list_rwlock);
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
----------------------------

Thanks,

> /** > * __pnotify_fork - Add kernel module subscriber to same subscribers as parent > * @to_task: The child task that will inherit the parent's subscribers > * @from_task: The parent task > * > * Used to attach a new task to the same subscribers the parent has in its > * subscriber list. > * > * The "from" argument is the parent task. The "to" argument is the child > * task. > * > * See Documentation/pnotify.txt for details on > * how to handle return codes from the attach function pointer. > * > * Locking: The to_task is currently in-construction, so we don't > * need to worry about write-locks. We do need to be sure the parent's > * subscriber list, which we copy here, doesn't go away on us. This is > * done via RCU. > * > */ > int > __pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) > { > struct pnotify_subscriber *from_subscriber; > int ret; > > /* We need to be sure the parent's list we copy from doesn't disappear */ > rcu_read_lock(); > > list_for_each_entry_rcu(from_subscriber, &from_task->pnotify_subscriber_list, entry) { > struct pnotify_subscriber *to_subscriber = NULL; > > to_subscriber = pnotify_subscribe(to_task, from_subscriber->events); > if (!to_subscriber) { > ret=-ENOMEM; > __pnotify_exit(to_task); > rcu_read_unlock(); > return ret; > } > ret = to_subscriber->events->fork(to_task, to_subscriber, > from_subscriber->data); > > rcu_read_unlock(); /* no more to do with the parent's data */

The rcu_read_unlock() should be deployed outside of the list_for_each_entry_rcu(){...}.
> > if (ret < 0) { > /* Propagates to copy_process as a fork failure */ > /* No __pnotify_exit because there is one in the failure path > * for copy_process in fork.c */ > return ret; /* Fork failure */ > } > else if (ret > 0) { > /* Success, but fork function pointer in the pnotify_events structure > * doesn't want the kenrel module subscribed */ > /* Again, this is the in-construction-child so no write lock */ > pnotify_unsubscribe(to_subscriber); > } > } > > return 0; /* success */ > } -- Linux Promotion Center, NEC KaiGai Kohei From erikj@sgi.com Wed Sep 28 07:21:25 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 07:21:40 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SELNiL021852 for ; Wed, 28 Sep 2005 07:21:24 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8SEIXxT001947 for ; Wed, 28 Sep 2005 09:18:33 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8SEIWDN16572380; Wed, 28 Sep 2005 09:18:33 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8SEIVS94206965; Wed, 28 Sep 2005 09:18:32 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 55A0E6022F4A; Wed, 28 Sep 2005 09:18:31 -0500 (CDT) Date: Wed, 28 Sep 2005 09:18:31 -0500 From: Erik Jacobson To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050928141831.GA24110@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433A7FE4.5040109@ak.jp.nec.com> User-Agent: Mutt/1.5.6i X-archive-position: 121 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > In my understanding, any _write_ operations can not be implemented > without locking, even if we can use RCU. > (In addition, RCU conscious writing/update style is required.) I'm surrounding write operations with a writelock rwsem (not a spinlock, at least not now... since it is common for pnotify users to use semaphores for their own locking). I think this is similar to your example submission, but you used a spinlock in those places. At this moment, I'm close to done with something to look at. I'm just tracking down a bug in the new implementation that showed up in the Job patch. I'll post what I have when I'm ready and maybe we can tear it apart then. This is my first rcu experience so I'd welcome the feedback including "this just won't work with rcu" if that's what it comes down to. I also want to be sure the 'stale data' problem isn't actually a problem for us. > frequentrly-read pass, such as SELinux's Access Vector Cache(AVC). > But it's not omnipotence, and it restricts write methodology. The feeling I had is that most users of pnotify won't be writing super-often. This is a generalization that may be incorrect. 
Taking Job as an example, once the process is made part of a job, not much usually happens in terms of adjusting the data pointer associated with the task struct until Job is done. I could imagine there may be things this isn't the case for, then the writes will be a penalty possibly. > I have attention to another respect. The current pnotify implementation > requires to hold pnotify_event_list_sem before calling pnotify_get_events(). As I recall, that code only would happen at most twice in the life of a kernel module, right? The only time the init function pointer would fire, if it's present, is at pnotify_register time. A similar piece of code happens at unregister time I think. I guess I'm wondering if this happens enough to worry about? Please let me know if I missed your entire point. > Threfore, we must repeat read_lock/unlock(&tasklist_lock) on > do_each_thread()/while_each_thread() loop as follows: > > ---------------------------- > read_lock(&tasklist_lock); > do_each_thread(g, p) { > get_task_struct(p); > read_unlock(&tasklist_lock); > > down_read(&p->pnotify_subscriber_list_sem); > subscriber = pnotify_get_subscriber(p, events->name); > : > up_read(&p->pnotify_subscriber_list_sem); > read_lock(&tasklist_lock); > << checking, p is dead or not ? >> > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > ---------------------------- > > I'm happy, if pnotify_subscriber_list would be protected by rwlock. > > If rwlock is used, we can not implement pnotify_subscribe() with current > spec. But is it impossible to prepare pnotify_subscribe_atomic() or > pnotify_subscribe_bind() which associates task_struct with pre-allocated > pnotify_events object ? > > ---- in rwlock world :-) --- > read_lock(&tasklist_lock); > do_each_thread(g, p) { > read_lock(&p->pnotify_subscriber_list_rwlock); > subscriber = pnotify_get_subscriber(p, events->name); > : > read_unlock(&p->pnotify_subscriber_list_rwlock); > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > ---------------------------- > > Thanks, > > > >/** > > * __pnotify_fork - Add kernel module subscriber to same subscribers as > > parent > > * @to_task: The child task that will inherit the parent's subscribers > > * @from_task: The parent task > > * > > * Used to attach a new task to the same subscribers the parent has in its > > * subscriber list. > > * > > * The "from" argument is the parent task. The "to" argument is the child > > * task. > > * > > * See Documentation/pnotify.txt for details on > > * how to handle return codes from the attach function pointer. > > * > > * Locking: The to_task is currently in-construction, so we don't > > * need to worry about write-locks. We do need to be sure the parent's > > * subscriber list, which we copy here, doesn't go away on us. This is > > * done via RCU. 
> > * > > */ > >int > >__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) > >{ > > struct pnotify_subscriber *from_subscriber; > > int ret; > > > > /* We need to be sure the parent's list we copy from doesn't > > disappear */ > > rcu_read_lock(); > > > > list_for_each_entry_rcu(from_subscriber, > > &from_task->pnotify_subscriber_list, entry) { > > struct pnotify_subscriber *to_subscriber = NULL; > > > > to_subscriber = pnotify_subscribe(to_task, > > from_subscriber->events); > > if (!to_subscriber) { > > ret=-ENOMEM; > > __pnotify_exit(to_task); > > rcu_read_unlock(); > > return ret; > > } > > ret = to_subscriber->events->fork(to_task, to_subscriber, > > from_subscriber->data); > > > > rcu_read_unlock(); /* no more to do with the parent's data */ > > rcu_read_unlovk(); should be deployed outside of the > list_for_each_entry_rcu(){...}. > > > > > if (ret < 0) { > > /* Propagates to copy_process as a fork failure */ > > /* No __pnotify_exit because there is one in the > > failure path > > * for copy_process in fork.c */ > > return ret; /* Fork failure */ > > } > > else if (ret > 0) { > > /* Success, but fork function pointer in the > > pnotify_events structure > > * doesn't want the kenrel module subscribed */ > > /* Again, this is the in-construction-child so no > > write lock */ > > pnotify_unsubscribe(to_subscriber); > > } > > } > > > > return 0; /* success */ > >} > > -- > Linux Promotion Center, NEC > KaiGai Kohei -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From paulmck@us.ibm.com Wed Sep 28 08:05:06 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 08:05:14 -0700 (PDT) Received: from e4.ny.us.ibm.com (e4.ny.us.ibm.com [32.97.182.144]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SF4xiL025662 for ; Wed, 28 Sep 2005 08:05:06 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e4.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8SF2966016020 for ; Wed, 28 Sep 2005 11:02:09 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8SF29bd102902 for ; Wed, 28 Sep 2005 11:02:09 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j8SF28J1006348 for ; Wed, 28 Sep 2005 11:02:09 -0400 Received: from linux.local ([9.47.22.63]) by d01av04.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j8SF24JK005925; Wed, 28 Sep 2005 11:02:08 -0400 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id 665E5148809; Wed, 28 Sep 2005 08:02:50 -0700 (PDT) Date: Wed, 28 Sep 2005 08:02:50 -0700 From: "Paul E. 
McKenney" To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050928150250.GB4925@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433A7FE4.5040109@ak.jp.nec.com> User-Agent: Mutt/1.4.1i X-archive-position: 122 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: pagg On Wed, Sep 28, 2005 at 08:35:00PM +0900, Kaigai Kohei wrote: > Hi, > > Erik Jacobson wrote: > >I fixed this in the RCU version of pnotify I'm working per the lse-tech > >community discussion - thanks for the reminder the other day (in a > >non-list email). > > > >If the RCU version crashes and burns for some reason and we go back to > >the non-CUu one, I'll need to make the fix there too. The function now > >looks like this. I hope this is what you had in mind (untested as of > >this moment). > > In my understanding, any _write_ operations can not be implemented > without locking, even if we can use RCU. > (In addition, RCU conscious writing/update style is required.) You understanding is quite correct. RCU protects readers from writers. Something else must be used to coordinate writers, for example: 1. locking 2. only single designated thread allowed to update 3. carefully crafted sequences of atomic instructions (but only do this is -really- needed!) Thanx, Paul > For example, pnotify permits to attach a new pnotify_subscriber object > to another task. If someone calls pnotify_subscribe() for other task > which is doing fork(), there is a possibility to break the > pnotify_subscriber_list of victim task. > > Therefore, procedures with updates such __pnotify_fork() should be > serialized by somethig locking. RCU is so effective for seldom-write/ > frequentrly-read pass, such as SELinux's Access Vector Cache(AVC). > But it's not omnipotence, and it restricts write methodology. > > In the past, I made a proposition of applying RCU for PAGG. But it might > be inappropriate for pnotify/PAGG as a general framework. > > > I have attention to another respect. The current pnotify implementation > requires to hold pnotify_event_list_sem before calling pnotify_get_events(). > Threfore, we must repeat read_lock/unlock(&tasklist_lock) on > do_each_thread()/while_each_thread() loop as follows: > > ---------------------------- > read_lock(&tasklist_lock); > do_each_thread(g, p) { > get_task_struct(p); > read_unlock(&tasklist_lock); > > down_read(&p->pnotify_subscriber_list_sem); > subscriber = pnotify_get_subscriber(p, events->name); > : > up_read(&p->pnotify_subscriber_list_sem); > read_lock(&tasklist_lock); > << checking, p is dead or not ? >> > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > ---------------------------- > > I'm happy, if pnotify_subscriber_list would be protected by rwlock. > > If rwlock is used, we can not implement pnotify_subscribe() with current > spec. But is it impossible to prepare pnotify_subscribe_atomic() or > pnotify_subscribe_bind() which associates task_struct with pre-allocated > pnotify_events object ? 
> > ---- in rwlock world :-) --- > read_lock(&tasklist_lock); > do_each_thread(g, p) { > read_lock(&p->pnotify_subscriber_list_rwlock); > subscriber = pnotify_get_subscriber(p, events->name); > : > read_unlock(&p->pnotify_subscriber_list_rwlock); > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > ---------------------------- > > Thanks, > > > >/** > > * __pnotify_fork - Add kernel module subscriber to same subscribers as > > parent > > * @to_task: The child task that will inherit the parent's subscribers > > * @from_task: The parent task > > * > > * Used to attach a new task to the same subscribers the parent has in its > > * subscriber list. > > * > > * The "from" argument is the parent task. The "to" argument is the child > > * task. > > * > > * See Documentation/pnotify.txt for details on > > * how to handle return codes from the attach function pointer. > > * > > * Locking: The to_task is currently in-construction, so we don't > > * need to worry about write-locks. We do need to be sure the parent's > > * subscriber list, which we copy here, doesn't go away on us. This is > > * done via RCU. > > * > > */ > >int > >__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) > >{ > > struct pnotify_subscriber *from_subscriber; > > int ret; > > > > /* We need to be sure the parent's list we copy from doesn't > > disappear */ > > rcu_read_lock(); > > > > list_for_each_entry_rcu(from_subscriber, > > &from_task->pnotify_subscriber_list, entry) { > > struct pnotify_subscriber *to_subscriber = NULL; > > > > to_subscriber = pnotify_subscribe(to_task, > > from_subscriber->events); > > if (!to_subscriber) { > > ret=-ENOMEM; > > __pnotify_exit(to_task); > > rcu_read_unlock(); > > return ret; > > } > > ret = to_subscriber->events->fork(to_task, to_subscriber, > > from_subscriber->data); > > > > rcu_read_unlock(); /* no more to do with the parent's data */ > > rcu_read_unlovk(); should be deployed outside of the > list_for_each_entry_rcu(){...}. 
> > > > > if (ret < 0) { > > /* Propagates to copy_process as a fork failure */ > > /* No __pnotify_exit because there is one in the > > failure path > > * for copy_process in fork.c */ > > return ret; /* Fork failure */ > > } > > else if (ret > 0) { > > /* Success, but fork function pointer in the > > pnotify_events structure > > * doesn't want the kenrel module subscribed */ > > /* Again, this is the in-construction-child so no > > write lock */ > > pnotify_unsubscribe(to_subscriber); > > } > > } > > > > return 0; /* success */ > >} > > -- > Linux Promotion Center, NEC > KaiGai Kohei > From paulmck@us.ibm.com Wed Sep 28 08:07:07 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 08:07:17 -0700 (PDT) Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.143]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SF76iL025761 for ; Wed, 28 Sep 2005 08:07:06 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e3.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8SF4FWO008488 for ; Wed, 28 Sep 2005 11:04:16 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8SF4Fbd080022 for ; Wed, 28 Sep 2005 11:04:15 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j8SF4Fe1022673 for ; Wed, 28 Sep 2005 11:04:15 -0400 Received: from linux.local ([9.47.22.63]) by d01av03.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j8SF4EWQ022589; Wed, 28 Sep 2005 11:04:15 -0400 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id C9467148809; Wed, 28 Sep 2005 08:04:55 -0700 (PDT) Date: Wed, 28 Sep 2005 08:04:55 -0700 From: "Paul E. McKenney" To: Erik Jacobson Cc: Kaigai Kohei , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050928150455.GD4925@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050928141831.GA24110@sgi.com> User-Agent: Mutt/1.4.1i X-archive-position: 123 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: pagg On Wed, Sep 28, 2005 at 09:18:31AM -0500, Erik Jacobson wrote: > > In my understanding, any _write_ operations can not be implemented > > without locking, even if we can use RCU. > > (In addition, RCU conscious writing/update style is required.) > > I'm surrounding write operations with a writelock rwsem (not a spinlock, at > least not now... since it is common for pnotify users to use semaphores for > their own locking). I think this is similar to your example submission, but > you used a spinlock in those places. Yes, a semaphore works as well. Whatever it is, there must be something to coordinate the updaters. Thanx, Paul > At this moment, I'm close to done with something to look at. I'm just > tracking down a bug in the new implementation that showed up in the Job > patch. I'll post what I have when I'm ready and maybe we can tear it apart > then. This is my first rcu experience so I'd welcome the feedback including > "this just won't work with rcu" if that's what it comes down to. 
> > I also want to be sure the 'stale data' problem isn't actually a problem > for us. > > > frequentrly-read pass, such as SELinux's Access Vector Cache(AVC). > > But it's not omnipotence, and it restricts write methodology. > > The feeling I had is that most users of pnotify won't be writing super-often. > This is a generalization that may be incorrect. Taking Job as an example, > once the process is made part of a job, not much usually happens in terms of > adjusting the data pointer associated with the task struct until Job is done. > > I could imagine there may be things this isn't the case for, then the writes > will be a penalty possibly. > > > I have attention to another respect. The current pnotify implementation > > requires to hold pnotify_event_list_sem before calling pnotify_get_events(). > > As I recall, that code only would happen at most twice in the life of a kernel > module, right? The only time the init function pointer would fire, if it's > present, is at pnotify_register time. A similar piece of code happens at > unregister time I think. I guess I'm wondering if this happens enough to > worry about? Please let me know if I missed your entire point. > > > Threfore, we must repeat read_lock/unlock(&tasklist_lock) on > > do_each_thread()/while_each_thread() loop as follows: > > > > ---------------------------- > > read_lock(&tasklist_lock); > > do_each_thread(g, p) { > > get_task_struct(p); > > read_unlock(&tasklist_lock); > > > > down_read(&p->pnotify_subscriber_list_sem); > > subscriber = pnotify_get_subscriber(p, events->name); > > : > > up_read(&p->pnotify_subscriber_list_sem); > > read_lock(&tasklist_lock); > > << checking, p is dead or not ? >> > > } while_each_thread(g, p); > > read_unlock(&tasklist_lock); > > ---------------------------- > > > > I'm happy, if pnotify_subscriber_list would be protected by rwlock. > > > > If rwlock is used, we can not implement pnotify_subscribe() with current > > spec. But is it impossible to prepare pnotify_subscribe_atomic() or > > pnotify_subscribe_bind() which associates task_struct with pre-allocated > > pnotify_events object ? > > > > ---- in rwlock world :-) --- > > read_lock(&tasklist_lock); > > do_each_thread(g, p) { > > read_lock(&p->pnotify_subscriber_list_rwlock); > > subscriber = pnotify_get_subscriber(p, events->name); > > : > > read_unlock(&p->pnotify_subscriber_list_rwlock); > > } while_each_thread(g, p); > > read_unlock(&tasklist_lock); > > ---------------------------- > > > > Thanks, > > > > > > >/** > > > * __pnotify_fork - Add kernel module subscriber to same subscribers as > > > parent > > > * @to_task: The child task that will inherit the parent's subscribers > > > * @from_task: The parent task > > > * > > > * Used to attach a new task to the same subscribers the parent has in its > > > * subscriber list. > > > * > > > * The "from" argument is the parent task. The "to" argument is the child > > > * task. > > > * > > > * See Documentation/pnotify.txt for details on > > > * how to handle return codes from the attach function pointer. > > > * > > > * Locking: The to_task is currently in-construction, so we don't > > > * need to worry about write-locks. We do need to be sure the parent's > > > * subscriber list, which we copy here, doesn't go away on us. This is > > > * done via RCU. 
> > > * > > > */ > > >int > > >__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) > > >{ > > > struct pnotify_subscriber *from_subscriber; > > > int ret; > > > > > > /* We need to be sure the parent's list we copy from doesn't > > > disappear */ > > > rcu_read_lock(); > > > > > > list_for_each_entry_rcu(from_subscriber, > > > &from_task->pnotify_subscriber_list, entry) { > > > struct pnotify_subscriber *to_subscriber = NULL; > > > > > > to_subscriber = pnotify_subscribe(to_task, > > > from_subscriber->events); > > > if (!to_subscriber) { > > > ret=-ENOMEM; > > > __pnotify_exit(to_task); > > > rcu_read_unlock(); > > > return ret; > > > } > > > ret = to_subscriber->events->fork(to_task, to_subscriber, > > > from_subscriber->data); > > > > > > rcu_read_unlock(); /* no more to do with the parent's data */ > > > > rcu_read_unlovk(); should be deployed outside of the > > list_for_each_entry_rcu(){...}. > > > > > > > > if (ret < 0) { > > > /* Propagates to copy_process as a fork failure */ > > > /* No __pnotify_exit because there is one in the > > > failure path > > > * for copy_process in fork.c */ > > > return ret; /* Fork failure */ > > > } > > > else if (ret > 0) { > > > /* Success, but fork function pointer in the > > > pnotify_events structure > > > * doesn't want the kenrel module subscribed */ > > > /* Again, this is the in-construction-child so no > > > write lock */ > > > pnotify_unsubscribe(to_subscriber); > > > } > > > } > > > > > > return 0; /* success */ > > >} > > > > -- > > Linux Promotion Center, NEC > > KaiGai Kohei > -- > Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota > From kingsley@sw.oz.au Wed Sep 28 22:19:32 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 22:19:35 -0700 (PDT) Received: from smtp.sw.oz.au (alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8T5JUO0008759 for ; Wed, 28 Sep 2005 22:19:31 -0700 Received: from kingsley.sw.oz.au (kingsley.sw.oz.au [192.41.203.97]) by smtp.sw.oz.au with ESMTP id j8T5GSc2014500; Thu, 29 Sep 2005 15:16:28 +1000 (EST) Received: from kingsley.sw.oz.au (localhost.localdomain [127.0.0.1]) by kingsley.sw.oz.au (8.13.1/8.12.10) with ESMTP id j8T5GSN0020577; Thu, 29 Sep 2005 15:16:28 +1000 Received: (from kingsley@localhost) by kingsley.sw.oz.au (8.13.1/8.13.1/Submit) id j8T5GRvl020576; Thu, 29 Sep 2005 15:16:27 +1000 Date: Thu, 29 Sep 2005 15:16:27 +1000 From: kingsley@aurema.com To: Erik Jacobson Cc: pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050929051627.GC3404@aurema.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050927201020.GA30433@sgi.com> User-Agent: Mutt/1.4.1i X-Scanned-By: MIMEDefang 2.52 on 192.41.203.35 X-archive-position: 124 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kingsley@aurema.com Precedence: bulk X-list: pagg On Tue, Sep 27, 2005 at 03:10:20PM -0500, Erik Jacobson wrote: > I fixed this in the RCU version of pnotify I'm working per the lse-tech > community discussion - thanks for the reminder the other day (in a non-list > email). > > If the RCU version crashes and burns for some reason and we go back to > the non-CUu one, I'll need to make the fix there too. The function now > looks like this. 
I hope this is what you had in mind (untested as of > this moment). Erik, I'm not sure that it does at this moment, not seeing the code for copy_process() or __pnotify_exit(). __pnotify_exit() would need to call the exit callback for all clients except for the client failing the fork call. To do this wouldn't the following be needed in __pnotify_fork()? > int > __pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) > { > struct pnotify_subscriber *from_subscriber; > int ret; > > /* We need to be sure the parent's list we copy from doesn't disappear */ > rcu_read_lock(); > > list_for_each_entry_rcu(from_subscriber, &from_task->pnotify_subscriber_list, entry) { > struct pnotify_subscriber *to_subscriber = NULL; > > to_subscriber = pnotify_subscribe(to_task, from_subscriber->events); > if (!to_subscriber) { > ret=-ENOMEM; > __pnotify_exit(to_task); > rcu_read_unlock(); > return ret; > } > ret = to_subscriber->events->fork(to_task, to_subscriber, > from_subscriber->data); > > rcu_read_unlock(); /* no more to do with the parent's data */ > Then, to make sure the current client does not have his exit callback invoked: ... if (ret != 0) { pnotify_unsubscribe(to_subscriber); if (ret < 0) return ret; } } return 0; } What do you think? -- Kingsley From kaigai@ak.jp.nec.com Wed Sep 28 22:53:42 2005 Received: with ECARTIS (v1.0.0; list pagg); Wed, 28 Sep 2005 22:53:48 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [210.143.35.51]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8T5rfO0011100 for ; Wed, 28 Sep 2005 22:53:42 -0700 Received: from mailgate4.nec.co.jp (mailgate53.nec.co.jp [10.7.69.184]) by tyo201.gate.nec.co.jp (8.11.7/3.7W01080315) with ESMTP id j8T5obE01649; Thu, 29 Sep 2005 14:50:37 +0900 (JST) Received: (from root@localhost) by mailgate4.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id j8T5obB05054; Thu, 29 Sep 2005 14:50:37 +0900 (JST) Received: from mailsv.linux.bs1.fc.nec.co.jp (namesv2.linux.bs1.fc.nec.co.jp [10.34.125.2]) by mailsv5.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id j8T5oan00895; Thu, 29 Sep 2005 14:50:36 +0900 (JST) Received: from [10.34.125.249] (sanma.linux.bs1.fc.nec.co.jp [10.34.125.249]) by mailsv.linux.bs1.fc.nec.co.jp (Postfix) with ESMTP id 6AD9F2FADD; Thu, 29 Sep 2005 14:50:36 +0900 (JST) Message-ID: <433B80B6.2010604@ak.jp.nec.com> Date: Thu, 29 Sep 2005 14:50:46 +0900 From: Kaigai Kohei User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317) X-Accept-Language: ja, en-us, en MIME-Version: 1.0 To: Erik Jacobson Cc: Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> In-Reply-To: <20050928141831.GA24110@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 125 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kaigai@ak.jp.nec.com Precedence: bulk X-list: pagg Hi, Erik Jacobson wrote: >>frequentrly-read pass, such as SELinux's Access Vector Cache(AVC). >>But it's not omnipotence, and it restricts write methodology. > > The feeling I had is that most users of pnotify won't be writing super-often. > This is a generalization that may be incorrect. 
Taking Job as an example, > once the process is made part of a job, not much usually happens in terms of > adjusting the data pointer associated with the task struct until Job is done. > > I could imagine there may be things this isn't the case for, then the writes > will be a penalty possibly. In the fork hook, scanning the parent's list is indeed read-only. But the exit hook contains write operations such as list_del_rcu(), so some locking is still required. The ratio of read-only passes to write passes is probably about 49.9%:50.1% (although I didn't actually measure it). # BTW, more than 99.9% of passes are read-only in SELinux's case. # see /selinux/avc/cache_stats In addition, you must revise the whole Job implementation. For example, job_dispatch_attachpid() must be rewritten using list_update_rcu(). The RCU rules affect the implementation of pnotify's clients quite widely. Is that appropriate for a general framework meant to be widely used? >>I have attention to another respect. The current pnotify implementation >>requires to hold pnotify_event_list_sem before calling pnotify_get_events(). > > > As I recall, that code only would happen at most twice in the life of a kernel > module, right? The only time the init function pointer would fire, if it's > present, is at pnotify_register time. A similar piece of code happens at > unregister time I think. I guess I'm wondering if this happens enough to > worry about? Please let me know if I missed your entire point. When anyone tries to associate a job with a running multithreaded process, every thread in that process must be scanned under read_lock(&tasklist_lock), because a job is an aggregation of processes, not an aggregation of threads. In SGI's Job (job-2.6.13-patch), job_dispatch_attachpid() associates the task_struct specified by PID with an existing job, but it does not pull the siblings of that task into the job. Therefore, this implementation allows some threads of a process to belong to a different job, e.g.:
[BEFORE]
task-X1(PID=100,TGID=100) -- job-Alpha
task-X2(PID=101,TGID=100) -- job-Alpha
task-X3(PID=102,TGID=100) -- job-Alpha
-> job_dispatch_attachpid(PID=100, job-Beta)
[AFTER]
task-X1(PID=100,TGID=100) -- job-Beta   <-- Same process now belongs
task-X2(PID=101,TGID=100) -- job-Alpha  <-- to two different jobs ???
task-X3(PID=102,TGID=100) -- job-Alpha
Do I have any misunderstanding? Because of this, a while_each_thread(){...} loop under tasklist_lock is also needed at times other than initialization and unregistering. A similar problem will happen in the job detach procedure, I think.
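Roughly, attaching the whole thread group would need a loop of this shape (untested sketch; struct job and job_attach_task() are only placeholders for the real Job internals):

    #include <linux/sched.h>

    /* Attach every thread of the thread group containing 'task',
     * not only the single task named by the given PID. */
    static void job_attach_process(struct task_struct *task, struct job *job)
    {
            struct task_struct *t = task;

            read_lock(&tasklist_lock);
            do {
                    job_attach_task(t, job);  /* placeholder for the real attach */
            } while_each_thread(task, t);
            read_unlock(&tasklist_lock);
    }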
Thanks, -- Linux Promotion Center, NEC KaiGai Kohei From kaigai@ak.jp.nec.com Thu Sep 29 00:23:38 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 00:23:43 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [210.143.35.51]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8T7NbO0022956 for ; Thu, 29 Sep 2005 00:23:38 -0700 Received: from mailgate3.nec.co.jp (mailgate54.nec.co.jp [10.7.69.193]) by tyo201.gate.nec.co.jp (8.11.7/3.7W01080315) with ESMTP id j8T7KaE07908; Thu, 29 Sep 2005 16:20:36 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id j8T7Ka215513; Thu, 29 Sep 2005 16:20:36 +0900 (JST) Received: from mailsv.linux.bs1.fc.nec.co.jp (namesv2.linux.bs1.fc.nec.co.jp [10.34.125.2]) by mailsv.nec.co.jp (8.11.7/3.7W-MAILSV-NEC) with ESMTP id j8T7KZ609960; Thu, 29 Sep 2005 16:20:35 +0900 (JST) Received: from [10.34.125.249] (sanma.linux.bs1.fc.nec.co.jp [10.34.125.249]) by mailsv.linux.bs1.fc.nec.co.jp (Postfix) with ESMTP id 25E242FADD; Thu, 29 Sep 2005 16:20:20 +0900 (JST) Message-ID: <433B95BE.4080608@ak.jp.nec.com> Date: Thu, 29 Sep 2005 16:20:30 +0900 From: Kaigai Kohei User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317) X-Accept-Language: ja, en-us, en MIME-Version: 1.0 To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> <433B80B6.2010604@ak.jp.nec.com> In-Reply-To: <433B80B6.2010604@ak.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 126 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kaigai@ak.jp.nec.com Precedence: bulk X-list: pagg > job_dispatch_attachpid() must be written by using list_update_rcu(). 
s/list_update_rcu()/list_replace_rcu()/g |||orz -- Linux Promotion Center, NEC KaiGai Kohei From erikj@sgi.com Thu Sep 29 07:05:10 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 07:05:20 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TE58O0024611 for ; Thu, 29 Sep 2005 07:05:08 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TF5erj026916 for ; Thu, 29 Sep 2005 08:05:40 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TE2EDN16639653; Thu, 29 Sep 2005 09:02:14 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TE2ES94301353; Thu, 29 Sep 2005 09:02:14 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id E444F6028D21; Thu, 29 Sep 2005 09:02:13 -0500 (CDT) Date: Thu, 29 Sep 2005 09:02:13 -0500 From: Erik Jacobson To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050929140213.GE3496@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> <433B80B6.2010604@ak.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433B80B6.2010604@ak.jp.nec.com> User-Agent: Mutt/1.5.6i X-archive-position: 127 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > In fork-hook, scaning the parent's list is indeed read-only. > But exit-hook contains writing-operations like list_del_rcu(), > thus something locking is required. > Probably, the ratio of readonly-pass and writable-pass might be > about 49.9%:50.1%. (Although I didn't actually measure it.) I guess if all tasks have subscribers, this could be true... Even in Job, not all tasks have subscribers although I admit that if you are using it with PAM so each login starts a new job... most would. I tried to send what I had so far twice yesterday but the list server is eating it for some reason. Someone is going to check in to the list server configuration today. I'll respond to more of your post later. 
Erik From erikj@sgi.com Thu Sep 29 07:51:32 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 07:51:43 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TEpUO0027691 for ; Thu, 29 Sep 2005 07:51:30 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TFq5Ie002401 for ; Thu, 29 Sep 2005 08:52:05 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TEldDN16644105; Thu, 29 Sep 2005 09:47:39 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TElcS94287798; Thu, 29 Sep 2005 09:47:38 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 3C4D96028D21; Thu, 29 Sep 2005 09:47:38 -0500 (CDT) Date: Thu, 29 Sep 2005 09:47:38 -0500 From: Erik Jacobson To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050929144738.GA7395@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> <433B80B6.2010604@ak.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433B80B6.2010604@ak.jp.nec.com> User-Agent: Mutt/1.5.6i X-archive-position: 128 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > In addition, you must revise whole job's implementation. For example, > job_dispatch_attachpid() must be written by using list_update_rcu(). > RCU rules affects an implementation of pnotify's clients widely. > Is it appropriate as a general framework widely used ? [ Correctedion in above paragraph from list_update_rcu to list_replace_rcu noted ] Hi. I'm confused with the above paragraph. My initial plan was only to convert how job uses pnotify to be rcu-aware. So this mainly relates to the pnotify_subscriber_list. job_dispatch_attachpid adds the kernel module as a subscriber to the given process. It doesn't access the list directly but does (in my updated version) have the write lock held and the rcu_read_lock/unlock calls. But it calls pnotify_subscribe to actually add the kernel module to the subscriber list of the task. pnotify_subscribe uses list_add_tail_rcu to do this. Maybe there are other examples that need adjustments? On the download site, in pnotify-test, you'll see my first pass at an RCU versoin of pnotify. In job-test, you'll see a first cut at a rcu-pnotify version of Job. > >>I have attention to another respect. The current pnotify implementation > >>requires to hold pnotify_event_list_sem before calling > >>pnotify_get_events(). > > > > > >As I recall, that code only would happen at most twice in the life of a > >kernel > >module, right? The only time the init function pointer would fire, if it's > >present, is at pnotify_register time. A similar piece of code happens at > >unregister time I think. I guess I'm wondering if this happens enough to > >worry about? Please let me know if I missed your entire point. 
> > When anyone tries to associate a job with a running multithread-process, > it's required to scan for each thread in this process under > read_lock(&tasklist_lock), because job is an aggregation of processes, > not an aggregation of threads. OK; I think I see some of what you're saying here now. If it isn't urgent, let's defer this until we know what's happening with pnotify. I guess I'm most interested in any logic problems I have in the way I used pnotify RCU with job, not problems with how Job might have a flaw that has always been there. Let's address those later. Regarding your comments on if RCU is the right answer for pnotify - I don't really know, it makes me uncomfortable but I need to be sure that feeling is for a valid reason, not just because I'm new to RCU. In some performance tests I ran yesterday, it seems (on a 2p ia64 altix box) that system performance as mesaured by AIM and by a fork-bomb type test are nearly identical between 2.6.14-rc2, 2.6.14-rc2 with RCU version of pnotify, and 2.6.14-rc2 with old fully rwsem version of pnotify. The version of job with RCU pnotify was changed to use RCU protections. Also included was the keyring proof of concept where keyring makes use of pnotify. So, there were two pnotify users in the the two tests with pnotify - keyring and job. The AIM tests were run where the process launching the test was part of a job. In the stock 2.6.14-rc2 test, a stock version of keyrings was enabled. If there is no measurable difference, it seems that RCU might not be the best answer because we're increasing complexity for no good resaon. I'll post more formal numbers a bit later. From erikj@sgi.com Thu Sep 29 08:15:58 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 08:16:00 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TFFwO0029305 for ; Thu, 29 Sep 2005 08:15:58 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TGGXu5006521 for ; Thu, 29 Sep 2005 09:16:33 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TFC6DN16639524; Thu, 29 Sep 2005 10:12:06 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TFC6S94259653; Thu, 29 Sep 2005 10:12:06 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 175836028D21; Thu, 29 Sep 2005 10:12:06 -0500 (CDT) Date: Thu, 29 Sep 2005 10:12:06 -0500 From: Erik Jacobson To: kingsley@aurema.com Cc: Erik Jacobson , pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050929151205.GB7395@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <20050929051627.GC3404@aurema.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929051627.GC3404@aurema.com> User-Agent: Mutt/1.5.6i X-archive-position: 129 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > __pnotify_exit() would need to call the exit callback for all clients > except for the client failing the fork 
call. To do this wouldn't the > following be needed in __pnotify_fork()? If the fork fails in copy_process, we go to cleanup in copy_process, not exit.c. So, with the revision, if the the pnotify kernel module subscriber returns a failure for the process, that failure is passed along to the copy_process call. There, we go to bad_fork_cleanup_namespace where a pnotify_exit is done on the process that failed to fork. In the old version, the pnotify_exit would have also been run within __pnotify_fork. In the new version, we don't run the pnotify_exit from __pnotify_fork because we know it will be run in the fork failure path of copy_process. That should mean that pnotify_exit will only execute once. I think that means that your suggestion for adding a pnotify_unsubscribe isn't necessary? The unsubscribe is done by __pnotify_exit in this case. Or am I still missing the point here? Sorry if it isn't getting through my head. See the RCU test version of pnotify in the download site under pnotify-test. My attempts at posting that patch to the list seem to be eaten by the list server right now. When that's fixed, I'll start posting stuff here. Erik From erikj@sgi.com Thu Sep 29 10:01:22 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 10:01:25 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TH1LO0014784 for ; Thu, 29 Sep 2005 10:01:21 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TI1uJf025760 for ; Thu, 29 Sep 2005 11:01:57 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TGwTDN16650273; Thu, 29 Sep 2005 11:58:29 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TGwTS94288983; Thu, 29 Sep 2005 11:58:29 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 193706028D21; Thu, 29 Sep 2005 11:58:29 -0500 (CDT) Date: Thu, 29 Sep 2005 11:58:29 -0500 From: Erik Jacobson To: Christoph Hellwig Cc: Erik Jacobson , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: [PATCH] rcu-pnotify aware version of keyring support Message-ID: <20050929165828.GC15246@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050922151647.GA30784@infradead.org> User-Agent: Mutt/1.5.6i X-archive-position: 130 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg This patch implements some parts of the keyring support using the rcu version of pnotify. It may also be illegal because it could sleep with the rcu_read_lock open. See my performance post to follow shortly for a discussion on these things. 
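The recurring pattern in the diff below reduces to roughly the following fragment (shown only as a digest of the change, not as additional code; note that down_write() may sleep, which is the concern mentioned above):

    struct pnotify_subscriber *sub;
    struct key_task *kt;

    rcu_read_lock();
    down_write(&tsk->pnotify_subscriber_list_sem);  /* may sleep - see above */
    sub = pnotify_get_subscriber(tsk, key_events.name);
    if (sub && sub->data) {
            kt = sub->data;
            /* use kt->thread_keyring / kt->jit_keyring instead of the
             * fields that used to live directly in task_struct */
    }
    up_write(&tsk->pnotify_subscriber_list_sem);
    rcu_read_unlock();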
include/linux/key.h | 21 ++++- include/linux/sched.h | 4 kernel/exit.c | 1 kernel/fork.c | 6 - security/keys/key.c | 24 +++++ security/keys/keyctl.c | 34 +++++++- security/keys/process_keys.c | 173 +++++++++++++++++++++++++++++++++++++------ security/keys/request_key.c | 37 ++++++++- 8 files changed, 257 insertions(+), 43 deletions(-) Index: linux/include/linux/key.h =================================================================== --- linux.orig/include/linux/key.h 2005-09-27 16:01:17.561027702 -0500 +++ linux/include/linux/key.h 2005-09-27 16:05:34.078204518 -0500 @@ -19,6 +19,7 @@ #include #include #include +#include #include #ifdef __KERNEL__ @@ -262,9 +263,9 @@ extern struct key root_user_keyring, root_session_keyring; extern int alloc_uid_keyring(struct user_struct *user); extern void switch_uid_keyring(struct user_struct *new_user); -extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk); +extern int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata); extern int copy_thread_group_keys(struct task_struct *tsk); -extern void exit_keys(struct task_struct *tsk); +extern void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub); extern void exit_thread_group_keys(struct signal_struct *tg); extern int suid_keys(struct task_struct *tsk); extern int exec_keys(struct task_struct *tsk); @@ -279,6 +280,22 @@ old_session; \ }) +/* pnotify subscriber service request */ +static struct pnotify_events key_events = { + .module = NULL, + .name = "key", + .data = NULL, + .entry = LIST_HEAD_INIT(key_events.entry), + .fork = copy_keys, + .exit = exit_keys, +}; + +/* key info associated with the task struct and managed by pnotify */ +struct key_task { + struct key *thread_keyring; /* keyring private to this thread */ + unsigned char jit_keyring; /* default keyring to attach requested keys to */ +}; + #else /* CONFIG_KEYS */ #define key_validate(k) 0 Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-27 16:01:17.562004166 -0500 +++ linux/kernel/exit.c 2005-09-27 16:05:34.082110375 -0500 @@ -857,7 +857,6 @@ exit_namespace(tsk); exit_thread(); cpuset_exit(tsk); - exit_keys(tsk); if (group_dead && tsk->signal->leader) disassociate_ctty(1); Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-27 16:01:17.562004166 -0500 +++ linux/kernel/fork.c 2005-09-27 16:05:34.083086839 -0500 @@ -1009,10 +1009,8 @@ goto bad_fork_cleanup_sighand; if ((retval = copy_mm(clone_flags, p))) goto bad_fork_cleanup_signal; - if ((retval = copy_keys(clone_flags, p))) - goto bad_fork_cleanup_mm; if ((retval = copy_namespace(clone_flags, p))) - goto bad_fork_cleanup_keys; + goto bad_fork_cleanup_mm; retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs); if (retval) goto bad_fork_cleanup_namespace; @@ -1175,8 +1173,6 @@ bad_fork_cleanup_namespace: pnotify_exit(p); exit_namespace(p); -bad_fork_cleanup_keys: - exit_keys(p); bad_fork_cleanup_mm: if (p->mm) mmput(p->mm); Index: linux/security/keys/key.c =================================================================== --- linux.orig/security/keys/key.c 2005-09-27 16:01:17.562980631 -0500 +++ linux/security/keys/key.c 2005-09-27 16:05:34.088945625 -0500 @@ -15,6 +15,7 @@ #include #include #include +#include #include "internal.h" static kmem_cache_t *key_jar; @@ -1009,6 +1010,9 @@ */ void __init key_init(void) { + struct key_task *kt; + struct 
pnotify_subscriber *sub; + /* allocate a slab in which we can store keys */ key_jar = kmem_cache_create("key_jar", sizeof(struct key), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); @@ -1039,4 +1043,24 @@ /* link the two root keyrings together */ key_link(&root_session_keyring, &root_user_keyring); + /* Allocate memory for task assocated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in key_init function.\n"); + return; + } + kt->thread_keyring = NULL; + + /* subscribe this kernel entity to the subscriber list for current task */ + /* Here, there is only one process in existence so we don't do any locking. + */ + sub = pnotify_subscribe(current, &key_events); + if (!sub) { + printk(KERN_ERR "Insufficient memory to add to subscriber list structure" + " in key_init function.\n"); + } + /* Associate the kt structure with this task via pnotify subscriber */ + sub->data = (void *)kt; + } /* end key_init() */ Index: linux/security/keys/process_keys.c =================================================================== --- linux.orig/security/keys/process_keys.c 2005-09-27 16:01:17.562980631 -0500 +++ linux/security/keys/process_keys.c 2005-09-27 16:17:39.829430380 -0500 @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "internal.h" @@ -137,6 +138,8 @@ int install_thread_keyring(struct task_struct *tsk) { struct key *keyring, *old; + struct key_task *kt; + struct pnotify_subscriber *sub; char buf[20]; int ret; @@ -149,9 +152,24 @@ } task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = keyring; + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "install_thread_keyring pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + ret = PTR_ERR(sub); + goto error; + } + kt = (struct key_task *)sub->data; + + old = kt->thread_keyring; + kt->thread_keyring = keyring; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); ret = 0; @@ -267,13 +285,27 @@ /* * copy the keys for fork */ -int copy_keys(unsigned long clone_flags, struct task_struct *tsk) +int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata) { - key_check(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + + /* Allocate memory for task-associated key_task structure */ + kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL); + if (!kt) { + printk(KERN_ERR "Insufficient memory to allocate key_task structure" + " in copy_keys function. 
Task was: %d", tsk->pid); + return PNOTIFY_ERROR; + } + /* Associate key_task structure with the new child via pnotify subscriber */ + /* At this moment, this is an in-construction-task so locking isn't an + * issue */ + sub->data = (void *)kt; + + key_check(kt->thread_keyring); /* no thread keyring yet */ - tsk->thread_keyring = NULL; - return 0; + kt->thread_keyring = NULL; + return PNOTIFY_OK; } /* end copy_keys() */ @@ -292,9 +324,16 @@ /* * dispose of keys upon thread exit */ -void exit_keys(struct task_struct *tsk) +void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub) { - key_put(tsk->thread_keyring); + struct key_task *kt = ((struct key_task *)(sub->data)); + if (kt == NULL) { /* shouldn't ever happen */ + printk(KERN_ERR "exit_keys pnotify subscriber data ptr null, task: %d\n", task->pid); + return; + } + key_put(kt->thread_keyring); + kfree(kt); /* Free pnotify subscriber data for this task */ + sub->data = NULL; } /* end exit_keys() */ @@ -306,12 +345,31 @@ { unsigned long flags; struct key *old; + struct key_task *kt; + struct pnotify_subscriber *sub; - /* newly exec'd tasks don't get a thread keyring */ task_lock(tsk); - old = tsk->thread_keyring; - tsk->thread_keyring = NULL; + /* pnotify doesn't have a compute_creds event at this time, so we + * need to retrieve the data */ + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "exec_keys pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return PNOTIFY_OK; /* key structures not populated yet */ + } + kt = (struct key_task *)sub->data; + + /* newly exec'd tasks don't get a thread keyring */ + old = kt->thread_keyring; + kt->thread_keyring = NULL; task_unlock(tsk); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); key_put(old); @@ -344,12 +402,29 @@ */ void key_fsuid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* no pnotify event for this, so we need to grab the data */ + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsuid_changed pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->uid = tsk->fsuid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->uid = tsk->fsuid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end key_fsuid_changed() */ @@ -359,12 +434,29 @@ */ void key_fsgid_changed(struct task_struct *tsk) { + struct key_task *kt; + struct pnotify_subscriber *sub; + + /* pnotify doesn't have an event for this, so we need to grab the data */ + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(tsk, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "key_fsgid_changed pnotify subscriber or data ptr was null, task: %d\n", 
tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; + /* update the ownership of the thread keyring */ - if (tsk->thread_keyring) { - down_write(&tsk->thread_keyring->sem); - tsk->thread_keyring->gid = tsk->fsgid; - up_write(&tsk->thread_keyring->sem); + if (kt->thread_keyring) { + down_write(&kt->thread_keyring->sem); + kt->thread_keyring->gid = tsk->fsgid; + up_write(&kt->thread_keyring->sem); } + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end key_fsgid_changed() */ @@ -383,6 +475,8 @@ { struct request_key_auth *rka; struct key *key, *ret, *err, *instkey; + struct pnotify_subscriber *sub; + struct key_task *kt; /* we want to return -EAGAIN or -ENOKEY if any of the keyrings were * searchable, but we failed to find a key or we found a negative key; @@ -395,12 +489,26 @@ ret = NULL; err = ERR_PTR(-EAGAIN); + rcu_read_lock(); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + /* search the thread keyring first */ - if (context->thread_keyring) { - key = keyring_search_aux(context->thread_keyring, + if (kt->thread_keyring) { + key = keyring_search_aux(kt->thread_keyring, context, type, description, match); - if (!IS_ERR(key)) + if (!IS_ERR(key)) { + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); goto found; + } switch (PTR_ERR(key)) { case -EAGAIN: /* no key */ @@ -414,6 +522,8 @@ break; } } + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); /* search the process keyring second */ if (context->signal->process_keyring) { @@ -535,15 +645,28 @@ { struct key *key; int ret; + struct pnotify_subscriber *sub; + struct key_task *kt; if (!context) context = current; key = ERR_PTR(-ENOKEY); + rcu_read_lock(); + down_write(&context->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(context, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid); + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return (struct key *)-EFAULT; + } + kt = (struct key_task *)sub->data; + switch (id) { case KEY_SPEC_THREAD_KEYRING: - if (!context->thread_keyring) { + if (!kt->thread_keyring) { if (!create) goto error; @@ -554,7 +677,7 @@ } } - key = context->thread_keyring; + key = kt->thread_keyring; atomic_inc(&key->usage); break; @@ -634,6 +757,8 @@ goto invalid_key; error: + up_write(&context->pnotify_subscriber_list_sem); + rcu_read_unlock(); return key; invalid_key: Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-27 16:01:17.561027702 -0500 +++ linux/include/linux/sched.h 2005-09-27 16:05:34.093827947 -0500 @@ -718,10 +718,6 @@ kernel_cap_t cap_effective, cap_inheritable, cap_permitted; unsigned keep_capabilities:1; struct user_struct *user; -#ifdef CONFIG_KEYS - struct key *thread_keyring; /* keyring private to this thread */ - unsigned char jit_keyring; /* default keyring to attach requested keys to */ -#endif int oomkilladj; /* OOM kill score adjustment 
(bit shift). */ char comm[TASK_COMM_LEN]; /* executable name excluding path - access with [gs]et_task_comm (which lock Index: linux/security/keys/keyctl.c =================================================================== --- linux.orig/security/keys/keyctl.c 2005-09-27 16:01:17.563957095 -0500 +++ linux/security/keys/keyctl.c 2005-09-27 16:18:13.257708245 -0500 @@ -931,31 +931,57 @@ long keyctl_set_reqkey_keyring(int reqkey_defl) { int ret; + unsigned char jit_return; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(¤t->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "keyctl_set_reqkey_keyring pnotify subscriber or data ptr null, task: %d\n", current->pid); + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return -EFAULT; + } + kt = (struct key_task *)sub->data; switch (reqkey_defl) { case KEY_REQKEY_DEFL_THREAD_KEYRING: ret = install_thread_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return ret; + } goto set; case KEY_REQKEY_DEFL_PROCESS_KEYRING: ret = install_process_keyring(current); - if (ret < 0) + if (ret < 0) { + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return ret; + } case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_SESSION_KEYRING: case KEY_REQKEY_DEFL_USER_KEYRING: case KEY_REQKEY_DEFL_USER_SESSION_KEYRING: set: - current->jit_keyring = reqkey_defl; + + kt->jit_keyring = reqkey_defl; case KEY_REQKEY_DEFL_NO_CHANGE: - return current->jit_keyring; + jit_return = kt->jit_keyring; + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return jit_return; case KEY_REQKEY_DEFL_GROUP_KEYRING: default: + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); return -EINVAL; } Index: linux/security/keys/request_key.c =================================================================== --- linux.orig/security/keys/request_key.c 2005-09-27 16:01:17.563957095 -0500 +++ linux/security/keys/request_key.c 2005-09-27 16:18:37.840196240 -0500 @@ -14,6 +14,7 @@ #include #include #include +#include #include "internal.h" struct key_construction { @@ -39,6 +40,19 @@ char *argv[10], *envp[3], uid_str[12], gid_str[12]; char key_str[12], keyring_str[3][12]; int ret, i; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "call_request_key pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return -EFAULT; + } + kt = (struct key_task *)sub->data; kenter("{%d},%s,%s", key->serial, op, callout_info); @@ -58,7 +72,7 @@ /* we specify the process's default keyrings */ sprintf(keyring_str[0], "%d", - tsk->thread_keyring ? tsk->thread_keyring->serial : 0); + kt->thread_keyring ? 
kt->thread_keyring->serial : 0); prkey = 0; if (tsk->signal->process_keyring) @@ -105,6 +119,8 @@ key_put(session_keyring); error: + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); kleave(" = %d", ret); return ret; @@ -300,15 +316,28 @@ { struct task_struct *tsk = current; struct key *drop = NULL; + struct pnotify_subscriber *sub; + struct key_task *kt; + + rcu_read_lock(); + down_write(&tsk->pnotify_subscriber_list_sem); + sub = pnotify_get_subscriber(current, key_events.name); + if (sub == NULL || sub->data == NULL) { /* shouldn't happen */ + printk(KERN_ERR "request_key_link pnotify subscriber or data ptr null, task: %d\n", tsk->pid); + up_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); + return; + } + kt = (struct key_task *)sub->data; kenter("{%d},%p", key->serial, dest_keyring); /* find the appropriate keyring */ if (!dest_keyring) { - switch (tsk->jit_keyring) { + switch (kt->jit_keyring) { case KEY_REQKEY_DEFL_DEFAULT: case KEY_REQKEY_DEFL_THREAD_KEYRING: - dest_keyring = tsk->thread_keyring; + dest_keyring = kt->thread_keyring; if (dest_keyring) break; @@ -347,6 +376,8 @@ key_put(drop); kleave(""); + down_write(&tsk->pnotify_subscriber_list_sem); + rcu_read_unlock(); } /* end request_key_link() */ From erikj@sgi.com Thu Sep 29 10:06:18 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 10:06:20 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TH6HO0015111 for ; Thu, 29 Sep 2005 10:06:17 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TI6rbU026475 for ; Thu, 29 Sep 2005 11:06:53 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TH3QDN16650723; Thu, 29 Sep 2005 12:03:26 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TH3PS94238226; Thu, 29 Sep 2005 12:03:25 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id B0FE96028D21; Thu, 29 Sep 2005 12:03:25 -0500 (CDT) Date: Thu, 29 Sep 2005 12:03:25 -0500 From: Erik Jacobson To: Christoph Hellwig Cc: Erik Jacobson , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Process Notification (pnotify) performance comparisons Message-ID: <20050929170325.GD15246@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050922151647.GA30784@infradead.org> User-Agent: Mutt/1.5.6i X-archive-position: 131 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg In my tests, I have found no mesaurable difference in performance between stock 2.6.14-rc2, pnotify with the subscriber list protected by the rwsem lock like it used to be, and pnotify with read protections implemented with RCU. Problem: Job allocates memory and uses rw semaphores quite frequently. This seems to be illegal from within rcu_read_lock() / rcu_read_unlock(). Running with CONFIG_DEBUG_SPINLOCK_SLEEP confirms this. 
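As an illustration only (not code from the patch), the offending shape is simply a blocking operation bracketed by the RCU read lock; rcu_read_lock() disables preemption in this kernel's classic RCU implementation, so a kmalloc(GFP_KERNEL) or a down_write() in that window is exactly what the debug check flags:

    rcu_read_lock();
    kt = kmalloc(sizeof(*kt), GFP_KERNEL);          /* may sleep */
    down_write(&tsk->pnotify_subscriber_list_sem);  /* may sleep */
    /* ... subscriber work ... */
    up_write(&tsk->pnotify_subscriber_list_sem);
    rcu_read_unlock();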
I'm not sure some of the things we want to do in Job are possible (or at least not easily possible) if this kernel module subscriber can never sleep. Even the keyring infrastructure needs to allocate memory sometimes. My opinion is: Let's not use the RCU version of pnotify and stick with the rwsem version. - Since you aren't supposed to sleep, many (most) users can't make use of pnotify. - The performance data shows no difference between the rwsem version with two subscribers (keyring and job) vs a stock kernel with the non-modified keyring support enabled. - Using RCU for protection adds comlexity Here are the details on the test runs. The RCU version "illegally" can sleep but I didn't hit any problems in my tests. IMPORTANT: The version of AIM7 used here is not currently tracking the community version - ours is closer to the original AIM7. General Test info: - For the two tests with pnotify (old rwsem protected subscriber list and new wsem/rcu protected subscriber list), the tests were always fired off with the shell process being in a job container. This means the tests don't only compare RCU performance, but also some aspects of Linux job performance. All child processes will have Job and keyring as subscribers. Of course, the stock kernel doesn't have pnotify or job and uses the non-modified keyring. - forkexit is simple and just forks and exits the number of times supplied on the command line. It reports the number of times fork returned -1. - The 2.6.14-rc2 kernel was used for each test. The only variations in patches and configuration related to which version of keyring, pnotify, and Job were used (if any). - jobtest is a mini job test suite. One of the tests forks/exits enough processes for PIDs to wrap. A job_create is done for each forked process, which means job is always a subscribed kernenl module to the forked processes. I snipped out the exessive output in the results. - keyring support is enabled in all kernels. For both pnotify kernel versions, keyrings have been changed to support pnotify for task struct entries and fork/exit calls. - All kernels include the kdb patch. Otherwise, all kerenels used the sn2_defconfig as a base. - The test system was a 2-processor 1400 Mhz SGI Altix 350 (ia64) pnotify & linux kernel job with RCU, pnotify-aware keyring ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.916s user 0m0.036s sys 0m0.408s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.914s user 0m0.024s sys 0m0.412s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.909s user 0m0.060s sys 0m0.332s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.911s user 0m0.052s sys 0m0.380s minime1:~ # time jobtest [snip] Great. All tests completed, no errors real 0m11.339s user 0m0.524s sys 0m4.536s minime1:~ # time jobtest [snip] Great. 
All tests completed, no errors real 0m11.385s user 0m0.500s sys 0m4.592s Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #1 SMP PREEMPT Wed Sep 28 12:11:05 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = RCU pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 16:00:45 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1962.2 100 3.0 0.5 32.7040 2 4041.7 99 2.9 0.8 33.6806 3 5734.0 98 3.0 1.2 31.8555 4 7232.1 99 3.2 1.5 30.1336 5 8591.7 96 3.4 1.8 28.6389 10 13655.6 93 4.3 3.4 22.7593 20 19812.8 92 5.9 6.7 16.5106 50 26626.4 90 10.9 16.5 8.8755 100 30021.7 87 19.4 32.8 5.0036 150 31080.9 86 28.1 49.1 3.4534 200 31727.0 84 36.7 65.2 2.6439 500 32047.0 84 90.8 163.1 1.0682 1000 31954.7 83 182.1 325.8 0.5326 2000 32633.4 83 356.7 653.9 0.2719 pnotify & linux kernel job WITHOUT RCU - all rwsem, pnotify-aware keyring ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92468 times real 0m0.913s user 0m0.028s sys 0m0.428s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92470 times real 0m1.032s user 0m0.044s sys 0m0.416s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m1.048s user 0m0.048s sys 0m0.368s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.948s user 0m0.028s sys 0m0.444s minime1:/tmp # time jobtest [snip] Great. All tests completed, no errors real 0m11.620s user 0m0.612s sys 0m4.540s minime1:/tmp # time jobtest [snip] Great. All tests completed, no errors real 0m11.268s user 0m0.548s sys 0m4.488s Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 15:01:33 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = non-RCU pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 17:17:56 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1964.9 100 3.0 0.5 32.7481 2 4020.7 99 2.9 0.8 33.5060 3 5630.4 98 3.1 1.2 31.2802 4 7218.6 97 3.2 1.5 30.0775 5 8637.6 97 3.4 1.8 28.7919 10 13857.1 94 4.2 3.5 23.0952 20 19429.1 93 6.0 6.7 16.1910 50 26126.8 88 11.1 16.5 8.7089 100 29950.6 86 19.4 32.7 4.9918 150 31052.1 86 28.1 49.1 3.4502 200 31817.2 85 36.6 65.4 2.6514 500 32101.1 84 90.7 163.1 1.0700 1000 31908.8 83 182.4 325.8 0.5318 2000 32718.3 82 355.8 653.7 0.2727 linux kernel without pnotify and without job, non-modified keyring enabled ------------------------------------------------------------------------------ forkexit test runs (supplied number is the number of forks fired). 
minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.910s user 0m0.024s sys 0m0.356s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.912s user 0m0.032s sys 0m0.472s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92467 times real 0m0.915s user 0m0.044s sys 0m0.400s minime1:/tmp # time ./forkexit 100000 Fork returned an error: 92469 times real 0m0.923s user 0m0.036s sys 0m0.388s (no jobtest tests since kernel doesn't have pnotify or job) --------------------------------------------------------------------------- Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 14:53:02 CDT 2005 HOST = minime1 CPUS = 2 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = NO pnotify+job Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 28 18:36:09 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1964.2 100 3.0 0.5 32.7371 2 4024.9 99 2.9 0.8 33.5408 3 5539.3 99 3.2 1.2 30.7741 4 7123.6 96 3.3 1.5 29.6818 5 8617.1 95 3.4 1.8 28.7237 10 13713.5 96 4.2 3.4 22.8558 20 19732.2 92 5.9 6.7 16.4435 50 26411.3 88 11.0 16.4 8.8038 100 30001.5 86 19.4 32.7 5.0003 150 31184.1 86 28.0 49.0 3.4649 200 31714.9 84 36.7 65.4 2.6429 500 32160.7 84 90.5 163.0 1.0720 1000 31876.8 83 182.6 325.7 0.5313 2000 32708.1 83 355.9 653.0 0.2726 From dipankar@in.ibm.com Thu Sep 29 10:18:41 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 10:18:43 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8THIeO0015898 for ; Thu, 29 Sep 2005 10:18:40 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8THDOwD006365 for ; Thu, 29 Sep 2005 13:13:24 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8THGRXk368086 for ; Thu, 29 Sep 2005 11:16:27 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8THFmhG030556 for ; Thu, 29 Sep 2005 11:15:49 -0600 Received: from localhost.localdomain ([9.12.229.186]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8THFjgp030284; Thu, 29 Sep 2005 11:15:46 -0600 Received: by localhost.localdomain (Postfix, from userid 500) id BC12414337; Thu, 29 Sep 2005 22:39:46 +0530 (IST) Date: Thu, 29 Sep 2005 22:39:46 +0530 From: Dipankar Sarma To: Erik Jacobson Cc: Christoph Hellwig , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: [Lse-tech] [PATCH] RCU subscriber_list Process Notification (pnotify) Message-ID: <20050929170946.GC6646@in.ibm.com> Reply-To: dipankar@in.ibm.com References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> <20050929165328.GA15246@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929165328.GA15246@sgi.com> User-Agent: Mutt/1.5.10i X-archive-position: 132 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: dipankar@in.ibm.com Precedence: bulk X-list: pagg On Thu, Sep 29, 2005 at 11:53:28AM -0500, Erik Jacobson wrote: > My feeling is we shouldn't use this RCU-protected-subscriber-list version > of pnotify - see my performance 
post to follow. It includes notes about > sleeping while rcu_read_lock is held, for exmaple. On a quick look this patch does look bogus to me. > + > +/** > + * pnotify_unsubscribe_rcu - Free up the pnotify_subscriber when RCU is ready > + * @rcu - the rcu_head to retrieve the pointer to free from > + * > + */ > +static void > +pnotify_unsubscribe_rcu (struct rcu_head *rcu) { > + struct pnotify_subscriber *sub = container_of(rcu, > + struct pnotify_subscriber, rcu); > + > + kfree(sub); > +} > + > +/** > + * > + */ > +void > +pnotify_unsubscribe(struct pnotify_subscriber *subscriber) > +{ > + atomic_dec(&subscriber->events->refcnt); /* decr the ref cnt on events */ > + list_del_rcu(&subscriber->entry); > + call_rcu(&subscriber->rcu, pnotify_unsubscribe_rcu); > +} Could you use a per-subscriber reference count ? That will allow you to drop rcu_read_lock() safely. > +static void > +remove_subscriber_from_all_tasks(struct pnotify_events *events) > +{ > + if (events == NULL) > + return; > + > + /* Because of internal race conditions we can't gaurantee > + * getting every task in just one pass so we just keep going > + * until there are no tasks with subscribers from this events struct > + * attached. The inefficiency of this should be tempered by the fact that > + * this happens at most once for each registered client. > + */ > + while (atomic_read(&events->refcnt) != 0) { > + struct task_struct *g = NULL, *p = NULL; > + > + read_lock(&tasklist_lock); > + do_each_thread(g, p) { > + struct pnotify_subscriber *subscriber; > + int task_exited; > + > + get_task_struct(p); > + read_unlock(&tasklist_lock); > + rcu_read_lock(); > + down_write(&p->pnotify_subscriber_list_sem); Wrong. Will refcounting suscrbiber itself here be costly ? Thanks Dipankar From erikj@sgi.com Thu Sep 29 11:12:11 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 11:12:17 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TIC9O0018902 for ; Thu, 29 Sep 2005 11:12:09 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TI9IxT030558 for ; Thu, 29 Sep 2005 13:09:18 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TI9IDN16654606; Thu, 29 Sep 2005 13:09:18 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TI9GS94333932; Thu, 29 Sep 2005 13:09:16 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 7B0906028D21; Thu, 29 Sep 2005 13:09:16 -0500 (CDT) Date: Thu, 29 Sep 2005 13:09:16 -0500 From: Erik Jacobson To: Dipankar Sarma Cc: Erik Jacobson , Christoph Hellwig , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: [Lse-tech] [PATCH] RCU subscriber_list Process Notification (pnotify) Message-ID: <20050929180916.GA18619@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> <20050929165328.GA15246@sgi.com> <20050929170946.GC6646@in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929170946.GC6646@in.ibm.com> User-Agent: Mutt/1.5.6i X-archive-position: 133 
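For illustration, the per-subscriber reference count suggested above might look roughly like the sketch below. The refcnt field and the _get/_put helpers are hypothetical names, not part of the posted pnotify patch; the point is only that a reader can pin the subscriber inside the RCU read-side section and then sleep after rcu_read_lock() is dropped.

/* Sketch only: assumes a new atomic_t refcnt in struct pnotify_subscriber,
 * initialised to 1 in pnotify_subscribe().  The helper names below are
 * illustrative, not taken from the posted patch.
 */
static inline void pnotify_subscriber_get(struct pnotify_subscriber *sub)
{
        atomic_inc(&sub->refcnt);
}

static inline void pnotify_subscriber_put(struct pnotify_subscriber *sub)
{
        if (atomic_dec_and_test(&sub->refcnt))
                kfree(sub);
}

/* The RCU callback then drops the list's reference instead of freeing
 * the subscriber directly.
 */
static void pnotify_unsubscribe_rcu(struct rcu_head *rcu)
{
        struct pnotify_subscriber *sub =
                container_of(rcu, struct pnotify_subscriber, rcu);

        pnotify_subscriber_put(sub);
}

/* A reader pins the subscriber while still inside the RCU section.  The
 * list's own reference is only dropped in the RCU callback, which cannot
 * run before this read-side section ends, so a plain atomic_inc suffices.
 * After rcu_read_unlock() the caller may sleep (for example take
 * pnotify_subscriber_list_sem) and calls pnotify_subscriber_put() when done.
 */
static struct pnotify_subscriber *
pnotify_pin_subscriber(struct task_struct *task, char *name)
{
        struct pnotify_subscriber *sub;

        rcu_read_lock();
        sub = pnotify_get_subscriber(task, name);
        if (sub)
                pnotify_subscriber_get(sub);
        rcu_read_unlock();

        return sub;
}

Whether the extra atomic operation per lookup actually beats the existing rwsem path is the kind of question the forkexit and AIM7 runs above are meant to answer.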
X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > Could you use a per-subscriber reference count ? That will allow > you to drop rcu_read_lock() safely. In the performance tests, I showed that I really couldn't measure the speed difference in AIM7 and fork bomb tests using the original rwsems and a kernel without pnotify and job. The process for the tests and any children processes all had two subscribers (keyring and job) at the time. I'm just not convinced that RCU is the right fit for pnotify in general since I don't see a speed gain. If my proof of concept rcu pnotify patch is so bad that it can't even be used to gage some performance numbers, I'm happy to try new things. However, it seems to me that most methods for fixing the rcu pnotify patch would decrease efficiency rather than increase it. As was pointed out to me in a discussion in the pagg mailing list, we could be in a situation where we normally have as many writers as readers for many situations. I'm not sure the rule of thumb for writers vs readers points to a good match for RCU. I'm not saying I'm opposed to trying things that you suggest with RCU if you think it's worth the effort. I was just pointing out my thoughts on the matter and welcoming input. Erik From root@oss.sgi.com Thu Sep 29 11:47:51 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 11:50:30 -0700 (PDT) Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TIlpO0021920 for ; Thu, 29 Sep 2005 11:47:51 -0700 Received: (from root@localhost) by oss.sgi.com (8.12.10/8.12.10/Submit) id j8TIlpBd021919 for pagg@oss.sgi.com; Thu, 29 Sep 2005 11:47:51 -0700 Resent-From: root@oss.sgi.com Resent-Date: Thu, 29 Sep 2005 11:47:50 -0700 Resent-Message-ID: <20050929184750.GA21785@oss.sgi.com> Resent-To: pagg@oss.sgi.com Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TGwtO0014608 for ; Thu, 29 Sep 2005 09:58:55 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TGu4xT018569 for ; Thu, 29 Sep 2005 11:56:04 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TGu4DN16652390; Thu, 29 Sep 2005 11:56:04 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TGu3S94297663; Thu, 29 Sep 2005 11:56:03 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 70B306028D21; Thu, 29 Sep 2005 11:56:03 -0500 (CDT) Date: Thu, 29 Sep 2005 11:56:03 -0500 From: Erik Jacobson To: Christoph Hellwig Cc: Erik Jacobson , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: [PATCH] rcu-pnotify-aware Job patch Message-ID: <20050929165603.GB15246@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050922151647.GA30784@infradead.org> User-Agent: Mutt/1.5.6i X-archive-position: 134 X-Approved-By: ralf@linux-mips.org X-ecartis-version: Ecartis 
v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: root@oss.sgi.com Precedence: bulk X-list: pagg This is a version of Job modified for use with the RCU version of pnotify. This patch allocates memory and checks rwsems's while rcu_read_lock is active and is probably illegal in that sense. See my post on performance data to follow shortly for a discussion on that. Documentation/job.txt | 104 ++ include/linux/job_acct.h | 124 +++ include/linux/jobctl.h | 185 ++++ init/Kconfig | 29 kernel/Makefile | 1 kernel/fork.c | 1 kernel/job.c | 1892 +++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 2336 insertions(+) Index: linux/Documentation/job.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/Documentation/job.txt 2005-09-27 17:53:47.957595073 -0500 @@ -0,0 +1,104 @@ +Linux Jobs - A Process Notification (pnotify) Module +---------------------------------------------------- + +1. Overview + +This document provides two additional sections. Section 2 provides a +listing of the manual page that describes the particulars of the Linux +job implementation. Section 3 provides some information about using +the user job library to interface to jobs. + +2. Job Man Page + + +JOB(7) Linux User's Manual JOB(7) + + +NAME + job - Linux Jobs kernel module overview + +DESCRIPTION + A job is a group of related processes all descended from a + point of entry process and identified by a unique job + identifier (jid). A job can contain multiple process + groups or sessions, and all processes in one of these sub- + groups can only be contained within a single job. + + The primary purpose for having jobs is to provide job + based resource limits. The current implementation only + provides the job container and resource limits will be + provided in a later implementation. When an implementa- + tion that provides job limits is available, this descrip- + tion will be expanded to provide further explanation of + job based limits. + + Not every process on the system is part of a job. That + is, only processes which are started by a login initiator + like login, rlogin, rsh and so on, get assigned a job ID. + In the Linux environment, jobs are created via a PAM mod- + ule. + + Jobs on Linux are provided using a loadable kernel module. + Linux jobs have the following characteristics: + + o A job is an inescapable container. A process cannot + leave the job nor can a new process be created outside + the job without explicit action, that is, a system + call with root privilege. + + o Each new process inherits the jid and limits [when + implemented] from its parent process. + + o All point of entry processes (job initiators) create a + new job and set the job limits [when implemented] + appropriately. + + o Job initiation on Linux is performed via a PAM session + module. + + o The job initiator performs authentication and security + checks. + + o Users can raise and lower their own job limits within + maximum values specified by the system administrator + [when implemented]. + + o Not all processes on a system need be members of a job. + + o The process control initialization process (init(1M)) + and startup scripts called by init are not part of a + job. + + + Job initiators can be categorized as either interactive or + batch processes. Limit domain names are defined by the + system administrator when the user limits database (ULDB) + is created. 
[The ULDB will be implemented in conjunction + with future job limits work.] + + Note: The existing command jobs(1) applies to shell "jobs" + and it is not related to the Linux Kernel Module jobs. + The at(1), atd(8), atq(1), batch(1), atrun(8), atrm(1)) + man pages refer to shell scripts as a job. a shell + script. + +SEE ALSO + job(1), jwait(1), jstat(1), jkill(1) + + + + + + + + + +3. User Job Library + +For developers who wish to make software using Linux Jobs, there exists +a user job library. This library contains functions for obtaining information +about running jobs, creating jobs, detaching, etc. + +The library is part of the job package and can be obtained from oss.sgi.com +using anonymous ftp. Look in the /projects/pagg/download directory. See the +README in the job source package for more information. Index: linux/init/Kconfig =================================================================== --- linux.orig/init/Kconfig 2005-09-27 17:46:52.034674237 -0500 +++ linux/init/Kconfig 2005-09-27 17:53:47.961500929 -0500 @@ -170,6 +170,35 @@ Linux Jobs module and the Linux Array Sessions module. If you will not be using such modules, say N. +config JOB + tristate " Process Notification (pnotify) based jobs" + depends on PNOTIFY + help + The Job feature implements a type of process aggregate, + or grouping. A job is the collection of all processes that + are descended from a point-of-entry process. Examples of such + points-of-entry include telnet, rlogin, and console logins. + + Batch schedulers such as LSF also make use of Job for containing, + maintaining, and signaling a job as one entity. + + A job differs from a session and process group since the job + container (or group) is inescapable. Only root level processes, + or those with the CAP_SYS_RESOURCE capability, can create new jobs + or escape from a job. + + A job is identified by a unique job identifier (jid). Currently, + that jid can be used to obtain status information about the job + and the processes it contians. The jid can also be used to send + signals to all processes contained in the job. In addition, + other processes can wait for the completion of a job - the event + where the last process contained in the job has exited. + + If you want to compile support for jobs into the kernel, select + this entry using Y. If you want the support for jobs provided as + a module, select this entry using M. If you do not want support + for jobs, select N. + config SYSCTL bool "Sysctl support" ---help--- Index: linux/kernel/job.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/job.c 2005-09-28 09:56:01.603487672 -0500 @@ -0,0 +1,1892 @@ +/* + * Linux Job kernel module + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Description: This file implements a type of process grouping called jobs. + * For further information about jobs, consult the file + * Documentation/job.txt. Jobs are implemented using Process Notification + * (pnotify). For more information about pnotify, see + * Documentation/pnotify.txt. + */ + +/* + * LOCKING INFO + * + * There are currently two levels of locking in this module. So, we + * have two classes of locks: + * + * (1) job table lock (always, job_table_sem) + * (2) job entry lock (usually, job->sem) + * + * Most of the locking used is read/write sempahores. In rare cases, a + * spinlock is also used. Those cases requiring a spinlock concern when the + * tasklist_lock must be locked (such as when looping over all tasks on the + * system). + * + * There is only one job_table_sem. There is a job->sem for each job + * entry in the job_table. This job module uses Process Notification + * (pnotify). Each task has a special lock that protects its pnotify + * information - this is called the pnotify_subscriber_list lock. There are + * special macros used to lock/unlock a task's subscriber list lock. The + * subscriber list lock is really a semaphore. + * + * Purpose: + * + * (1) The job_table_sem protects all entries in the table. + * (2) The job->sem protects all data and task attachments for the job. + * + * Truths we hold to be self-evident: + * + * Only the holder of a write lock for the job_table_lock may add or + * delete a job entry from the job_table. The job_table includes all job + * entries in the hash table and chains off the hash table locations. + * + * Only the holder of a write lock for a job->lock may attach or detach + * processes/tasks from the attached list for the job. + * + * If you hold a read lock of job_table_lock, you can assume that the + * job entries in the table will not change. The link pointers for + * the chains of job entries will not change, the job ID (jid) value + * will not change, and data changes will be (mostly) atomic. + * + * If you hold a read lock of a job->lock, you can assume that the + * attachments to the job will not change. The link pointers for the + * attachment list will not change and the attachments will not change. + * + * pnotify uses RCU protections instead of read locks. If you require + * write access to the tasks's pnotify_subscriber_list, you need to + * down_write(&task->pnotify_subscriber_list_sem). Anywhere reading + * involving the pnotify_subscriber_list is done needs to be protected + * by rcu_read_lock / rcu_read_unlock. If iterators are used on the + * pnotify_subscriber_list, use the rcu aware versions. + * + * If you are going to grab nested locks, the nesting order is: + * + * down_write/up_write(&task->pnotify_subscriber_list_sem) + * job_table_sem + * job->sem + * + * However, it is not strictly necessary to down the job_table_sem + * before downing job->sem. 
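 + * As an illustration only (a sketch, not text from the original submission),
 + * a fully nested acquisition following the order above would look like:
 + *
 + *	down_write(&task->pnotify_subscriber_list_sem);
 + *	down_write(&job_table_sem);
 + *	down_write(&job->sem);
 + *	... modify the job and its attachment list ...
 + *	up_write(&job->sem);
 + *	up_write(&job_table_sem);
 + *	up_write(&task->pnotify_subscriber_list_sem);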
+ * + * Also, the nesting order allows you to lock in this order: + * + * down_write/up_write(&task->pnotify_subscriber_list_sem) + * job->sem + * + * without locking job_table_sem between the two. + * + */ + +/* standard for kernel modules */ +#include +#include +#include +#include +#include +#include + +#include /* for get_user & put_user */ + +#include /* for current */ +#include /* for the tty declarations */ +#include +#include + +#include + +#include +#include +#include + +#include /* to use process notification service */ +#include +#include + +MODULE_AUTHOR("Silicon Graphics, Inc."); +MODULE_DESCRIPTION("pnotify based inescapable jobs"); +MODULE_LICENSE("GPL"); + +#define HASH_SIZE 1024 + +/* The states for a job */ +#define RUNNING 1 /* Running job */ +#define ZOMBIE 2 /* Dead job */ + +/* Job creation tags for the job HID (host ID) */ +#define DISABLED 0xffffffff /* New job creation disabled */ +#define LOCAL 0x0 /* Only creating local sys jobs */ + + +#ifdef __BIG_ENDIAN +#define iptr_hid(ll) ((u32 *)&(ll)) +#define iptr_sid(ll) (((u32 *)(&(ll) + 1)) - 1) +#else /* __LITTLE_ENDIAN */ +#define iptr_hid(ll) (((u32 *)(&(ll) + 1)) - 1) +#define iptr_sid(ll) ((u32 *)&(ll)) +#endif /* __BIG_ENDIAN */ + +#define jid_hash(ll) (*(iptr_sid(ll)) % HASH_SIZE) + + +/* Job info entry for member tasks */ +struct job_attach { + struct task_struct *task; /* task we are attaching to job */ + struct pnotify_subscriber *subscriber; /* our subscriber entry in task */ + struct job_entry *job; /* the job we are attaching task to */ + struct list_head entry; /* list stuff */ +}; + +struct job_waitinfo { + int status; /* For tasks waiting on job exit */ +}; + +struct job_csainfo { + u64 corehimem; /* Accounting - highpoint, phys mem */ + u64 virthimem; /* Accounting - highpoint, virt mem */ + struct file *acctfile; /* The accounting file for job */ +}; + +/* Job table entry type */ +struct job_entry { + u64 jid; /* Our job ID */ + int refcnt; /* Number of tasks attached to job */ + int state; /* State of job - RUNNING,... 
*/ + struct rw_semaphore sem; /* lock for the job */ + uid_t user; /* user that owns the job */ + time_t start; /* When the job began */ + struct job_csainfo csa; /* CSA accounting info */ + wait_queue_head_t zombie; /* queue last task - during wait */ + wait_queue_head_t wait; /* queue of tasks waiting on job */ + int waitcnt; /* Number of tasks waiting on job */ + struct job_waitinfo waitinfo; /* Status info for waiting tasks */ + struct list_head attached; /* List of attached tasks */ + struct list_head entry; /* List of other jobs - same hash */ +}; + + +/* Job container tables */ +static struct list_head job_table[HASH_SIZE]; +static int job_table_refcnt = 0; +static DECLARE_RWSEM(job_table_sem); + + +/* Accounting subscriber list */ +static struct job_acctmod *acct_list[JOB_ACCT_COUNT]; +static DECLARE_RWSEM(acct_list_sem); + + +/* Host ID for the localhost */ +static u32 jid_hid; + +static char *hid = NULL; +module_param(hid, charp, 0); + +/* Function prototypes */ +static int job_dispatch_create(struct job_create *); +static int job_dispatch_getjid(struct job_getjid *); +static int job_dispatch_waitjid(struct job_waitjid *); +static int job_dispatch_killjid(struct job_killjid *); +static int job_dispatch_getjidcnt(struct job_jidcnt *); +static int job_dispatch_getjidlst(struct job_jidlst *); +static int job_dispatch_getpidcnt(struct job_pidcnt *); +static int job_dispatch_getpidlst(struct job_pidlst *); +static int job_dispatch_getuser(struct job_user *); +static int job_dispatch_getprimepid(struct job_primepid *); +static int job_dispatch_sethid(struct job_sethid *); +static int job_dispatch_detachjid(struct job_detachjid *); +static int job_dispatch_detachpid(struct job_detachpid *); +static int job_dispatch_attachpid(struct job_attachpid *); +static int job_attach(struct task_struct *, struct pnotify_subscriber *, void *); +static void job_detach(struct task_struct *, struct pnotify_subscriber *); +static struct job_entry *job_getjob(u64 jid); +static int job_dispatcher(unsigned int, unsigned long); + +u64 job_getjid(struct task_struct *); + +int job_ioctl(struct inode *, struct file *, unsigned int, unsigned long); + +/* Job container pnotify service request */ +static struct pnotify_events events = { + .module = THIS_MODULE, + .name = PNOTIFY_JOB, + .data = &job_table, + .entry = LIST_HEAD_INIT(events.entry), + .fork = job_attach, + .exit = job_detach, +}; + +/* proc dir entry */ +struct proc_dir_entry *job_proc_entry; + +/* file operations for proc file */ +static struct file_operations job_file_ops = { + .owner = THIS_MODULE, + .ioctl = job_ioctl +}; + + +/* + * job_getjob - return job_entry given a jid + * @jid: The jid of the job entry we wish to retrieve + * + * Given a jid value, find the entry in the job_table and return a pointer + * to the job entry or NULL if not found. + * + * You should normally down_read the job_table_sem before calling this + * function. 
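 + *
 + * A typical caller (a sketch, mirroring job_dispatch_waitjid and friends
 + * below) pins the entry before dropping the table lock:
 + *
 + *	down_read(&job_table_sem);
 + *	job = job_getjob(jid);
 + *	if (job)
 + *		down_read(&job->sem);
 + *	up_read(&job_table_sem);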
+ */ +struct job_entry * +job_getjob(u64 jid) +{ + struct list_head *entry = NULL; + struct job_entry *tjob = NULL; + struct job_entry *job = NULL; + + list_for_each(entry, &job_table[ jid_hash(jid) ]) { + tjob = list_entry(entry, struct job_entry, entry); + if (tjob->jid == jid) { + job = tjob; + break; + } + } + return job; +} + + +/* + * job_attach - Attach a task to a specified job + * @task: Task we want to attach to the job + * @new_subscriber: The already allocated subscriber struct for the task + * @old_data: (struct job_attach *)old_data)->job is the specified job + * + * Attach the task to the job specified in the target data (old_data). + * This function will add the task to the list of attached tasks for the job. + * In addition, a link from the task to the job is created and added to the + * task via the data pointer reference. + * + * The process that owns the target data should be at least read locked (using + * down_read(&task->pnotify_subscriber_list_sem)) during this call. This help + * in ensuring that the job cannot be removed since at least one process will + * still be referencing the job (the one owning the target_data). + * + * It is expected that this function will be called from within the + * pnotify_fork() function in the kernel, when forking (do_fork) a child + * process represented by task. + * + * If this function is called form some other point, then it is possible that + * task and data could be altered while going through this function. In such + * a case, the caller should also lock the pnotify_subscriber_list for the task + * task_struct. + * + * the function returns 0 upon success, and -1 upon failure. + */ +static int +job_attach(struct task_struct *task, struct pnotify_subscriber *new_subscriber, + void *old_data) +{ + struct job_entry *job = ((struct job_attach *)old_data)->job; + struct job_attach *attached = NULL; + int errcode = 0; + + /* + * Lock the job for writing. The task owning target_data has its + * pnotify_subscriber_list_sem locked, so we know there is at least one + * active reference to the job - therefore, it cannot have been removed + * before we have gotten this write lock established. + */ + down_write(&job->sem); + + if (job->state == ZOMBIE) { + /* If the job is a zombie (dying), bail out of the attach */ + printk(KERN_WARNING "Attach task(pid=%d) to job" + " failed - job is ZOMBIE\n", + task->pid); + errcode = -EINPROGRESS; + up_write(&job->sem); + goto error_return; + } + + + /* Allocate memory that we will need */ + + attached = (struct job_attach *)kmalloc(sizeof(struct job_attach), + GFP_KERNEL); + if (!attached) { + /* error */ + printk(KERN_ERR "Attach task(pid=%d) to job" + " failed on memory error in kernel\n", + task->pid); + errcode = -ENOMEM; + up_write(&job->sem); + goto error_return; + } + + + attached->task = task; + attached->subscriber = new_subscriber; + attached->job = job; + new_subscriber->data = (void *)attached; + list_add_tail(&attached->entry, &job->attached); + ++job->refcnt; + + up_write(&job->sem); + + return 0; + +error_return: + kfree(attached); + return errcode; +} + + +/* + * job_detach - Detach a task via the pnotify subscriber reference + * @task: The task to be detached + * @subscriber: The pnotify subscriber reference + * + * Detach the task from the job attached to via the pnotify reference. + * This function will remove the task from the list of attached tasks for the + * job specified via the pnotify data pointer. 
In addition, the link to the + * job provided via the data pointer will also be removed. + * + * The pnotify_subscriber_list should be write locked for task before enterin + * this function (using down_write(&task->pnotify_subscriber_list_sem)). + * + * the function returns 0 uopn success, and -1 uopn failure. + */ +static void +job_detach(struct task_struct *task, struct pnotify_subscriber *subscriber) +{ + struct job_attach *attached = ((struct job_attach *)(subscriber->data)); + struct job_entry *job = attached->job; + struct job_csa csa; + struct job_acctmod *acct; + + /* + * Obtain the lock on the the job_table_sem and the job->sem for + * this job. + */ + down_write(&job_table_sem); + down_write(&job->sem); + + /* - CSA accounting */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_detach: Tried to get non-living acct module\n"); + } + } + if (acct->eop) { + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + acct->eop(task->exit_code, task, &csa); + } + if (acct->module) + module_put(acct->module); + } + job->refcnt--; + list_del(&attached->entry); + subscriber->data = NULL; + kfree(attached); + + if (job->refcnt == 0) { + int waitcnt; + + list_del(&job->entry); + --job_table_refcnt; + + /* + * The job is removed from the job_table. + * We can remove the job_table_sem now since + * nobody can access the job via the table. + */ + up_write(&job_table_sem); + + job->state = ZOMBIE; + job->waitinfo.status = task->exit_code; + + waitcnt = job->waitcnt; + + /* + * Release the job semaphore. You cannot hold + * this lock if you want the wakeup to work + * properly. + */ + up_write(&job->sem); + + if (waitcnt > 0) { + wake_up_interruptible(&job->wait); + wait_event(job->zombie, job->waitcnt == 0); + } + + /* + * Job is exiting, all processes waiting for job to exit + * have been notified. Now we call the accountin + * subscribers. + */ + + /* - CSA accounting */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_detach: Tried to get non-living acct module\n"); + } + } + if (acct->jobend) { + int res = 0; + + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + + res = acct->jobend(JOB_EVENT_END, + &csa); + if (res) { + printk(KERN_WARNING + "job_detach: CSA -" + " jobend failed.\n"); + } + } + if (acct->module) + module_put(acct->module); + } + /* + * Every process attached or waiting on this job should be + * detached and finished waiting, so now we can free the + * memory for the job. + */ + kfree(job); + + } else { + /* This is case where job->refcnt was greater than 1, so + * we were not going to delete the job after the detach. + * Therefore, only the job->sem is being held - the + * job_table_sem was released earlier. + */ + up_write(&job->sem); + up_write(&job_table_sem); + } + + return; +} + +/* + * job_dispatch_create - create a new job and attach the calling process to it + * @create_args: Pointer of job_create struct which stores the create request + * + * Returns 0 on success, and negative on failure (negative errno value). 
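 + *
 + * For reference, a worked example (not text from the original patch): the
 + * 64-bit jid packs the 32-bit host ID (jid_hid) into its upper 32 bits and
 + * a 32-bit sequence counter into its lower 32 bits via the iptr_hid and
 + * iptr_sid macros above.  With a hypothetical jid_hid of 0x0a000001 and a
 + * counter value of 5, the generated jid is 0x0a00000100000005, and
 + * jid_hash() buckets it on the sequence part: 5 % HASH_SIZE = 5.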
+ */ +static int +job_dispatch_create(struct job_create *create_args) +{ + struct job_create create; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + struct pnotify_subscriber *subscriber = NULL; + struct pnotify_subscriber *old_subscriber = NULL; + int errcode = 0; + struct job_acctmod *acct = NULL; + static u32 jid_count = 0; + u32 initial_jid_count; + + /* + * if the job ID - host ID segment is set to DISABLED, we will + * not be creating new jobs. We don't mark it as an error, but + * the jid value returned will be 0. + */ + if (jid_hid == DISABLED) { + errcode = 0; + goto error_return; + } + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto error_return; + } + if (!create_args) { + errcode = -EINVAL; + goto error_return; + } + + if (copy_from_user(&create, create_args, sizeof(create))) { + errcode = -EFAULT; + goto error_return; + } + + /* + * Allocate some of the memory we might need, before we start + * locking + */ + + attached = (struct job_attach *)kmalloc(sizeof(struct job_attach), GFP_KERNEL); + if (!attached) { + /* error */ + errcode = -ENOMEM; + goto error_return; + } + + job = (struct job_entry *)kmalloc(sizeof(struct job_entry), GFP_KERNEL); + if (!job) { + /* error */ + errcode = -ENOMEM; + goto error_return; + } + + /* We keep the old pnotify subscriber reference around in case we need it + * in an error condition. If, for example, a job_getjob call fails because + * the requested JID is already in use, we don't want to detach that job. + * Having this ability is complicated by the locking. + */ + rcu_read_lock(); + down_write(¤t->pnotify_subscriber_list_sem); + old_subscriber = pnotify_get_subscriber(current, events.name); + + /* + * Lock the job_table and add the pointers for the new job. + * Since the job is new, we won't need to lock the job. + */ + down_write(&job_table_sem); + + /* + * Determine if create should use specified JID or one that is + * generated. + */ + if (create.jid != 0) { + /* We use the specified JID value */ + job->jid = create.jid; + /* Does the supplied JID conflict with an existing one? */ + if (job_getjob(job->jid)) { + /* JID already in use, bail. error_return tosses/frees job */ + + /* error_return doesn't do up_write() */ + up_write(&job_table_sem); + /* we haven't allocated a new pnotify subscriber refernce yet so + * error_return won't unlock this. We'll unlock here */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EBUSY; + /* error_return doesn't touch old_subscriber so we don't detach */ + goto error_return; + } + } else { + /* We generate a new JID value using a new JID */ + *(iptr_hid(job->jid)) = jid_hid; + *(iptr_sid(job->jid)) = jid_count; + initial_jid_count = jid_count++; + while (((job->jid == 0) || (job_getjob(job->jid))) && + jid_count != initial_jid_count) { + + /* JID was in use or was zero, try a new one */ + *(iptr_sid(job->jid)) = jid_count++; + } + /* If all the JIDs are in use, fail */ + if (jid_count == initial_jid_count) { + /* error_return tosses/frees job */ + /* error_return doesn't do up_write() */ + up_write(&job_table_sem); + /* we haven't allocated a new pagg yet so error_return won't unlock + * this. 
We'll unlock here */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EBUSY; + /* error_return doesn't touch old_pagg so we don't detach */ + goto error_return; + } + + } + + subscriber = pnotify_subscribe(current, &events); + if (!subscriber) { + /* error */ + up_write(&job_table_sem); /* unlock since error_return doesn't */ + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -ENOMEM; + goto error_return; + } + + /* Initialize job entry values & lists */ + job->refcnt = 1; + job->user = create.user; + job->start = jiffies; + job->csa.corehimem = 0; + job->csa.virthimem = 0; + job->csa.acctfile = NULL; + job->state = RUNNING; + init_rwsem(&job->sem); + INIT_LIST_HEAD(&job->attached); + list_add_tail(&attached->entry, &job->attached); + init_waitqueue_head(&job->wait); + init_waitqueue_head(&job->zombie); + job->waitcnt = 0; + job->waitinfo.status = 0; + + /* set link from entry in attached list to task and job entry */ + attached->task = current; + attached->job = job; + attached->subscriber = subscriber; + subscriber->data = (void *)attached; + + /* Insert new job into front of chain list */ + list_add_tail(&job->entry, &job_table[ jid_hash(job->jid) ]);; + ++job_table_refcnt; + + up_write(&job_table_sem); + /* At this point, the possible error conditions where we would need the + * old pnotify subscriber are gone. So we can remove it. We remove after + * we unlock because the detach function does job table lock of its own. + */ + if (old_subscriber) { + /* + * Detaching subscribers for jobs never has a failure case, + * so we don't need to worry about error codes. + */ + old_subscriber->events->exit(current, old_subscriber); + pnotify_unsubscribe(old_subscriber); + } + up_write(¤t->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + /* Issue callbacks into accounting subscribers */ + + /* - CSA subscriber */ + if (acct_list[JOB_ACCT_CSA]) { + acct = acct_list[JOB_ACCT_CSA]; + if (acct->module) { + if (try_module_get(acct->module) == 0) { + printk(KERN_WARNING + "job_dispatch_create: Tried to get non-living acct module\n"); + } + } + if (acct->jobstart) { + int res; + struct job_csa csa; + + csa.job_id = job->jid; + csa.job_uid = job->user; + csa.job_start = job->start; + csa.job_corehimem = job->csa.corehimem; + csa.job_virthimem = job->csa.virthimem; + csa.job_acctfile = job->csa.acctfile; + + res = acct->jobstart(JOB_EVENT_START, &csa); + if (res < 0) { + printk(KERN_WARNING "job_dispatch_create: CSA -" + " jobstart failed.\n"); + } + } + if (acct->module) + module_put(acct->module); + } + + + create.r_jid = job->jid; + if (copy_to_user(create_args, &create, sizeof(create))) { + return -EFAULT; + } + + return 0; + +error_return: + kfree(attached); + kfree(job); + create.r_jid = 0; + if (copy_to_user(create_args, &create, sizeof(create))) + return -EFAULT; + + return errcode; +} + + +/* + * job_dispatch_getjid - retrieves the job ID (jid) for the specified process (pid) + * @getjid_args: Pointer of job_getjid struct which stores the get request + * + * returns 0 on success, negative errno value on exit. 
+ */ +static int +job_dispatch_getjid(struct job_getjid *getjid_args) +{ + struct job_getjid getjid; + int errcode = 0; + struct task_struct *task; + + if (copy_from_user(&getjid, getjid_args, sizeof(getjid))) + return -EFAULT; + + /* lock the tasklist until we grab the specific task */ + read_lock(&tasklist_lock); + + if (getjid.pid == current->pid) { + task = current; + } else { + task = find_task_by_pid(getjid.pid); + } + if (task) { + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* unlock the task list */ + getjid.r_jid = job_getjid(task); + put_task_struct(task); /* We're done accessing the task */ + if (getjid.r_jid == 0) { + errcode = -ENODATA; + } + } else { + read_unlock(&tasklist_lock); + getjid.r_jid = 0; + errcode = -ESRCH; + } + + + if (copy_to_user(getjid_args, &getjid, sizeof(getjid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_waitjid - allows a process to wait until a job exits + * @waitjid_args: Pointer of job_waitjid struct which stores the wait request + * + * On success returns 0, failure it returns the negative errno value. + */ +static int +job_dispatch_waitjid(struct job_waitjid *waitjid_args) +{ + struct job_waitjid waitjid; + struct job_entry *job; + int retcode = 0; + + if (copy_from_user(&waitjid, waitjid_args, sizeof(waitjid))) + return -EFAULT; + + waitjid.r_jid = waitjid.stat = 0; + + if (waitjid.options != 0) { + retcode = -EINVAL; + goto general_return; + } + + /* Lock the job table so that the current jobs don't change */ + down_read(&job_table_sem); + + if ((job = job_getjob(waitjid.jid)) == NULL ) { + up_read(&job_table_sem); + retcode = -ENODATA; + goto general_return; + } + + /* + * We got the job we need, we can release the job_table_sem + */ + down_write(&job->sem); + up_read(&job_table_sem); + + ++job->waitcnt; + + up_write(&job->sem); + + /* We shouldn't hold any locks at this point! The increment of the + * jobs waitcnt will ensure that the job is not removed without + * first notifying this current task */ + retcode = wait_event_interruptible(job->wait, + job->refcnt == 0); + + if (!retcode) { + /* + * This data is static at this point, we will + * not need a lock to read it. + */ + waitjid.stat = job->waitinfo.status; + waitjid.r_jid = job->jid; + } + + down_write(&job->sem); + --job->waitcnt; + + if (job->waitcnt == 0) { + up_write(&job->sem); + + /* + * We shouldn't hold any locks at this point! Else, the + * last process in the job will not be able to remove the + * job entry. + * + * That process is stuck waiting for this wake_up, so the + * job shouldn't disappear until after this function call. + * The job entry is not longer in the job table, so no + * other process can get to the entry to foul things up. + */ + wake_up(&job->zombie); + } else { + up_write(&job->sem); + } + +general_return: + if (copy_to_user(waitjid_args, &waitjid, sizeof(waitjid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_killjid - send a signal to all processes in a job + * @killjid_args: Pointer of job_killjid struct which stores the kill request + * + * returns 0 on success, negative of errno on failure. 
+ */ +static int +job_dispatch_killjid(struct job_killjid *killjid_args) +{ + struct job_killjid killjid; + struct job_entry *job; + struct list_head *attached_entry; + struct siginfo info; + int retcode = 0; + + if (copy_from_user(&killjid, killjid_args, sizeof(killjid))) { + retcode = -EFAULT; + goto cleanup_0locks_return; + } + + killjid.r_val = -1; + + /* A signal of zero is really a status check and is handled as such + * by send_sig_info. So we have < 0 instead of <= 0 here. + */ + if (killjid.sig < 0) { + retcode = -EINVAL; + goto cleanup_0locks_return; + } + + down_read(&job_table_sem); + job = job_getjob(killjid.jid); + if (!job) { + /* Job not found, copy back data & bail with error */ + retcode = -ENODATA; + goto cleanup_1locks_return; + } + + down_read(&job->sem); + + /* + * Check capability to signal job. The signaling user must be + * the owner of the job or have CAP_SYS_RESOURCE capability. + */ + if (!capable(CAP_SYS_RESOURCE)) { + if (current->uid != job->user) { + retcode = -EPERM; + goto cleanup_2locks_return; + } + } + + info.si_signo = killjid.sig; + info.si_errno = 0; + info.si_code = SI_USER; + info.si_pid = current->pid; + info.si_uid = current->uid; + + /* send_group_sig_info needs the tasklist lock locked */ + read_lock(&tasklist_lock); + list_for_each(attached_entry, &job->attached) { + int err; + struct job_attach *attached; + + attached = list_entry(attached_entry, struct job_attach, entry); + err = send_group_sig_info(killjid.sig, &info, + attached->task); + if (err != 0) { + /* + * XXX - the "prime" process, or initiating process + * for the job may not be owned by the user. So, + * we would get an error in this case. However, we + * ignore the error for that specific process - it + * should exit when all the child processes exit. It + * should ignore all signals from the user. + * + */ + if (attached->entry.prev != &job->attached) { + retcode = err; + } + } + + } + read_unlock(&tasklist_lock); + +cleanup_2locks_return: + up_read(&job->sem); +cleanup_1locks_return: + up_read(&job_table_sem); +cleanup_0locks_return: + killjid.r_val = retcode; + + if (copy_to_user(killjid_args, &killjid, sizeof(killjid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_getjidcnt - return the number of jobs currently on the system + * @jidcnt_args: Pointer of job_jidcnt struct which stores the get request + * + * returns 0 on success & it always succeeds. 
+ */ +static int +job_dispatch_getjidcnt(struct job_jidcnt *jidcnt_args) +{ + struct job_jidcnt jidcnt; + + /* read lock might be overdoing it in this case */ + down_read(&job_table_sem); + jidcnt.r_val = job_table_refcnt; + up_read(&job_table_sem); + + if (copy_to_user(jidcnt_args, &jidcnt, sizeof(jidcnt))) + return -EFAULT; + + return 0; +} + + +/* + * job_dispatch_getjidlst - get the list of all jids currently on the system + * @jidlist_args: Pointer of job_jidlst struct which stores the get request + */ +static int +job_dispatch_getjidlst(struct job_jidlst *jidlst_args) +{ + struct job_jidlst jidlst; + u64 *jid; + struct job_entry *job; + struct list_head *job_entry; + int i; + int count; + + if (copy_from_user(&jidlst, jidlst_args, sizeof(jidlst))) + return -EFAULT; + + if (jidlst.r_val == 0) + return 0; + + jid = (u64 *)kmalloc(sizeof(u64)*jidlst.r_val, GFP_KERNEL); + if (!jid) { + jidlst.r_val = 0; + if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst))) + return -EFAULT; + return -ENOMEM; + } + + count = 0; + down_read(&job_table_sem); + for (i = 0; i < HASH_SIZE && count < jidlst.r_val; i++) { + list_for_each(job_entry, &job_table[i]) { + job = list_entry(job_entry, struct job_entry, entry); + jid[count++] = job->jid; + if (count == jidlst.r_val) { + break; + } + } + } + up_read(&job_table_sem); + + jidlst.r_val = count; + + for (i = 0; i < count; i++) + if (copy_to_user(jidlst.jid+i, &jid[i], sizeof(u64))) + return -EFAULT; + + kfree(jid); + + if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst))) + return -EFAULT; + return 0; +} + + +/* + * job_dispatch_getpidcnt - get the processe count in the specified job + * @pidcnt_args: Pointer of job_pidcnt struct which stores the get request + * + * returns 0 on success, or negative errno value on failure. + */ +static int +job_dispatch_getpidcnt(struct job_pidcnt *pidcnt_args) +{ + struct job_pidcnt pidcnt; + struct job_entry *job; + int retcode = 0; + + if (copy_from_user(&pidcnt, pidcnt_args, sizeof(pidcnt))) + return -EFAULT; + + pidcnt.r_val = 0; + + down_read(&job_table_sem); + job = job_getjob(pidcnt.jid); + if (!job) { + retcode = -ENODATA; + } else { + /* Read lock might be overdoing it for this case */ + down_read(&job->sem); + pidcnt.r_val = job->refcnt; + up_read(&job->sem); + } + up_read(&job_table_sem); + + if (copy_to_user(pidcnt_args, &pidcnt, sizeof(pidcnt))) + return -EFAULT; + return retcode; +} + +/* + * job_getpidlst - get the process list in the specified job + * @pidlst_args: Pointer of job_pidlst struct which stores the get request + * + * This function returns the the list of processes that are part of the job. + * The number of processes provided by this function could be trimmed if + * max size specified in r_val is not large enough to hold the entire list. + * + * returns 0 on success, negative errno value on failure. 
+ */ +static int +job_dispatch_getpidlst(struct job_pidlst *pidlst_args) +{ + struct job_pidlst pidlst; + struct job_entry *job; + struct job_attach *attached; + struct list_head *attached_entry; + pid_t *pid; + int max; + int i; + + if (copy_from_user(&pidlst, pidlst_args, sizeof(pidlst))) + return -EFAULT; + + if (pidlst.r_val == 0) + return 0; + + max = pidlst.r_val; + pidlst.r_val = 0; + pid = (pid_t *)kmalloc(sizeof(pid_t)*max, GFP_KERNEL); + if (!pid) { + if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst))) + return -EFAULT; + return -ENOMEM; + } + + down_read(&job_table_sem); + + job = job_getjob(pidlst.jid); + if (!job) { + up_read(&job_table_sem); + if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst))) + return -EFAULT; + return -ENODATA; + } else { + + down_read(&job->sem); + up_read(&job_table_sem); + + i = 0; + list_for_each(attached_entry, &job->attached) { + if (i == max) { + break; + } + attached = list_entry(attached_entry, struct job_attach, + entry); + pid[i++] = attached->task->pid; + } + pidlst.r_val = i; + + up_read(&job->sem); + } + + for (i = 0; i < pidlst.r_val; i++) + if (copy_to_user(pidlst.pid+i, &pid[i], sizeof(pid_t))) + return -EFAULT; + kfree(pid); + + copy_to_user(pidlst_args, &pidlst, sizeof(pidlst)); + return 0; +} + + +/* + * job_dispatch_getuser - get the uid of the user that owns the job + * @user_args: Pointer of job_user struct which stores the get request + * + * returns 0 on success, returns negative errno on failure. + */ +static int +job_dispatch_getuser(struct job_user *user_args) +{ + struct job_entry *job; + struct job_user user; + int retcode = 0; + + if (copy_from_user(&user, user_args, sizeof(user))) + return(-EFAULT); + user.r_user = 0; + + down_read(&job_table_sem); + + job = job_getjob(user.jid); + if (!job) { + retcode = -ENODATA; + } else { + down_read(&job->sem); + user.r_user = job->user; + up_read(&job->sem); + } + + up_read(&job_table_sem); + + if (copy_to_user(user_args, &user, sizeof(user))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_getprimepid - get the oldest process (primepid) in the job + * @primepid_args: Pointer of job_primepid struct which stores the get request + * + * returns 0 on success, negative errno on failure. + */ +static int +job_dispatch_getprimepid(struct job_primepid *primepid_args) +{ + struct job_primepid primepid; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + int retcode = 0; + + if (copy_from_user(&primepid, primepid_args, sizeof(primepid))) + return -EFAULT; + + primepid.r_pid = 0; + + down_read(&job_table_sem); + + job = job_getjob(primepid.jid); + if (!job) { + up_read(&job_table_sem); + /* Job not found, return INVALID VALUE */ + return -ENODATA; + } + + /* + * Job found, now look at first pid entry in the + * attached list. 
+ */ + down_read(&job->sem); + up_read(&job_table_sem); + if (list_empty(&job->attached)) { + retcode = -ESRCH; + primepid.r_pid = 0; + } else { + attached = list_entry(job->attached.next, struct job_attach, entry); + if (!attached->task) { + retcode = -ESRCH; + } else { + primepid.r_pid = attached->task->pid; + } + } + up_read(&job->sem); + + if (copy_to_user(primepid_args, &primepid, sizeof(primepid))) + return -EFAULT; + return retcode; +} + + +/* + * job_dispatch_sethid - set the host ID segment for the job IDs (jid) + * @sethid_args: Pointer of job_sethid struct which stores the set request + * + * If hid does not get set, then the jids upper 32 bits will be set to + * 0 and the jid cannot be used reliably in a cluster environment. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_sethid(struct job_sethid *sethid_args) +{ + struct job_sethid sethid; + int errcode = 0; + + if (copy_from_user(&sethid, sethid_args, sizeof(sethid))) + return -EFAULT; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + sethid.r_hid = 0; + goto cleanup_return; + } + + /* + * Set job_table_sem, so no jobs can be deleted while doing + * this operation. + */ + down_write(&job_table_sem); + + sethid.r_hid = jid_hid = sethid.hid; + + up_write(&job_table_sem); + +cleanup_return: + if (copy_to_user(sethid_args, &sethid, sizeof(sethid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_detachjid - detach all processes attached to the specified job + * @detachjid_args: Pointer of job_detachjid struct + * + * The job will exit after the detach. The processes are allowed to + * continue running. You need CAP_SYS_RESOURCE capability for this to succeed. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_detachjid(struct job_detachjid *detachjid_args) +{ + struct job_detachjid detachjid; + struct job_entry *job; + struct list_head *entry; + int count; + int errcode = 0; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + + if (copy_from_user(&detachjid, detachjid_args, sizeof(detachjid))) + return -EFAULT; + + detachjid.r_val = 0; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* + * Set job_table_sem, so no jobs can be deleted while doing + * this operation. + */ + down_write(&job_table_sem); + + job = job_getjob(detachjid.jid); + + if (job) { + + down_write(&job->sem); + + /* Mark job as ZOMBIE so no new processes can attach to it */ + job->state = ZOMBIE; + + count = job->refcnt; + + /* Okay, no new processes can attach to the job. We can + * release the locks on the job_table and job since the only + * way for the job to change now is for tasks to detach and + * the job to be removed. And this is what we want to happen + */ + up_write(&job_table_sem); + up_write(&job->sem); + + + /* Walk through list of attached tasks and unset the + * pnotify subscriber entries. + * + * We don't test with list_empty because that actually means NO tasks + * left rather than one task. If we used !list_empty or list_for_each, + * we could reference memory freed by the pnotify hook detach function + * (job_detach). + * + * We know there is only one task left when job->attached.next and + * job->attached.prev both point to the same place. 
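 + *
 + * (With the standard list_head layout, a list holding exactly one
 + * attachment has job->attached.next == job->attached.prev == &that
 + * attachment's entry, while an empty list has both pointing back at
 + * &job->attached itself - in either case the loop below terminates.)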
+ */ + while (job->attached.next != job->attached.prev) { + entry = job->attached.next; + + task = (list_entry(entry, struct job_attach, entry))->task; + subscriber = (list_entry(entry, struct job_attach, entry))->subscriber; + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + } + /* At this point, there is only one task left */ + + entry = job->attached.next; + + task = (list_entry(entry, struct job_attach, entry))->task; + subscriber = (list_entry(entry, struct job_attach, entry))->subscriber; + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + + detachjid.r_val = count; + + } else { + errcode = -ENODATA; + up_write(&job_table_sem); + } + +cleanup_return: + if (copy_to_user(detachjid_args, &detachjid, sizeof(detachjid))) + return -EFAULT; + return errcode; +} + + +/* + * job_dispatch_detachpid - detach a process from the job it is attached to + * @detachpid_args: Pointer of job_detachpid struct. + * + * That process is allowed to continue running. You need + * CAP_SYS_RESOURCE capability for this to succeed. + * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_detachpid(struct job_detachpid *detachpid_args) +{ + struct job_detachpid detachpid; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + int errcode = 0; + + if (copy_from_user(&detachpid, detachpid_args, sizeof(detachpid))) + return -EFAULT; + + detachpid.r_jid = 0; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* Lock the task list while we find a specific task */ + read_lock(&tasklist_lock); + task = find_task_by_pid(detachpid.pid); + if (!task) { + errcode = -ESRCH; + /* We need to unlock the tasklist here too or the lock is held forever */ + read_unlock(&tasklist_lock); + goto cleanup_return; + } + + /* We have a valid task now */ + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + detachpid.r_jid = ((struct job_attach *)subscriber->data)->job->jid; + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + } else { + errcode = -ENODATA; + } + put_task_struct(task); /* Done accessing the task */ + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + +cleanup_return: + if (copy_to_user(detachpid_args, &detachpid, sizeof(detachpid))) + return -EFAULT; + return errcode; +} + +/* + * job_dispatch_attachpid - attach a process to the specified job + * @attachpid_args: Pointer of job_attachpid struct. + * + * The attaching process must not belong to any job and the specified job + * must exist. You need CAP_SYS_RESOURCE capability for this to succeed. 
+ * + * returns -errno value on fail, 0 on success + */ +static int +job_dispatch_attachpid(struct job_attachpid *attachpid_args) +{ + struct job_attachpid attachpid; + struct task_struct *task; + struct pnotify_subscriber *subscriber; + struct job_entry *job = NULL; + struct job_attach *attached = NULL; + int errcode = 0; + + if (copy_from_user(&attachpid, attachpid_args, sizeof(attachpid))) + return -EFAULT; + + if (!capable(CAP_SYS_RESOURCE)) { + errcode = -EPERM; + goto cleanup_return; + } + + /* lock the tasklist until we grab the specific task */ + read_lock(&tasklist_lock); + task = find_task_by_pid(attachpid.pid); + if (!task) { + errcode = -ESRCH; + /* We need to unlock the tasklist here too or the lock is held f +orever */ + read_unlock(&tasklist_lock); + goto cleanup_return; + } + + /* We have a valid task now */ + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + + rcu_read_lock(); + down_write(&task->pnotify_subscriber_list_sem); + /* check if it belongs to a job*/ + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + put_task_struct(task); + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + errcode = -EINVAL; + goto cleanup_return; + } + + /* Alloc subscriber list entry for it */ + subscriber = pnotify_subscribe(task, &events); + if (subscriber) { + down_read(&job_table_sem); + /* Check on the requested job */ + job = job_getjob(attachpid.r_jid); + if (!job) { + pnotify_unsubscribe(subscriber); + errcode = -ENODATA; + } + else { + attached = list_entry(job->attached.next, struct job_attach, entry); + if(attached) { + if (subscriber->events->fork(task, subscriber, attached) != 0) { + pnotify_unsubscribe(subscriber); + errcode = -EFAULT; + } + } + } + up_read(&job_table_sem); + } else + errcode = -ENOMEM; + put_task_struct(task); /* Done accessing the task */ + up_write(&task->pnotify_subscriber_list_sem); + rcu_read_unlock(); + +cleanup_return: + if (copy_to_user(attachpid_args, &attachpid, sizeof(attachpid))) + return -EFAULT; + return errcode; +} + + +/* + * job_register_acct - accounting modules register to job module + * @am: The registering accounting module's job_acctmod pointer + * + * returns -errno value on fail, 0 on success. + */ +int +job_register_acct(struct job_acctmod *am) +{ + if (!am) + return -EINVAL; /* error, invalid value */ + if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1)) + return -EINVAL; /* error, invalid value */ + + down_write(&acct_list_sem); + if (acct_list[am->type] != NULL) { + up_write(&acct_list_sem); + return -EBUSY; /* error, duplicate entry */ + } + + acct_list[am->type] = am; + up_write(&acct_list_sem); + return 0; +} + + +/* + * job_unregister_acct - accounting modules to unregister with the job module + * @am: The unregistering accounting module's job_acctmod pointer + * + * Returns -errno on failure and 0 on success. + */ +int +job_unregister_acct(struct job_acctmod *am) +{ + if (!am) + return -EINVAL; /* error, invalid value */ + if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1)) + return -EINVAL; /* error, invalid value */ + + down_write(&acct_list_sem); + + if (acct_list[am->type] != am) { + up_write(&acct_list_sem); + return -EFAULT; /* error, not matching entry */ + } + + acct_list[am->type] = NULL; + up_write(&acct_list_sem); + return 0; +} + +/* + * job_getjid - return the Job ID for the given task. + * @task: The given task + * + * If the task is not attached to a job, then 0 is returned. 
+ * + */ +u64 job_getjid(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber = NULL; + struct job_entry *job = NULL; + u64 jid = 0; + + rcu_read_lock(); + subscriber = pnotify_get_subscriber(task, events.name); + if (subscriber) { + job = ((struct job_attach *)subscriber->data)->job; + down_read(&job->sem); + jid = job->jid; + up_read(&job->sem); + } + rcu_read_unlock(); + + return jid; +} + + +/* + * job_getacct - accounting subscribers get accounting information about a job. + * @jid: the job id + * @type: the accounting subscriber type + * @data: the accounting data that subscriber wants. + * + * The caller must supply the Job ID (jid) that specifies the job. The + * "type" argument indicates the type of accounting data to be returned. + * The data will be returned in the memory accessed via the data pointer + * argument. The data pointer is void so that this function interface + * can handle different types of accounting data. + * + */ +int job_getacct(u64 jid, int type, void *data) +{ + struct job_entry *job; + + if (!data) + return -EINVAL; + + if (!jid) + return -EINVAL; + + down_read(&job_table_sem); + job = job_getjob(jid); + if (!job) { + up_read(&job_table_sem); + return -ENODATA; + } + + down_read(&job->sem); + up_read(&job_table_sem); + + switch (type) { + case JOB_ACCT_CSA: + { + struct job_csa *csa = (struct job_csa *)data; + + csa->job_id = job->jid; + csa->job_uid = job->user; + csa->job_start = job->start; + csa->job_corehimem = job->csa.corehimem; + csa->job_virthimem = job->csa.virthimem; + csa->job_acctfile = job->csa.acctfile; + break; + } + default: + up_read(&job->sem); + return -EINVAL; + break; + } + up_read(&job->sem); + return 0; +} + +/* + * job_setacct - accounting subscribers set accounting info in the job + * @jid: the job id + * @type: the accounting subscriber type. + * @subfield: the accounting information subfield for this set call + * @data: the accounting information to be set + * + * The job is identified by the jid argument. The type indicates the + * type of accounting the information is associated with. The subfield + * is a bitmask that indicates exactly what subfields are to be changed. + * The data that is used to set the values is supplied by the data pointer. + * The data pointer is a void type so that the interface can be used for + * different types of accounting information. + */ +int job_setacct(u64 jid, int type, int subfield, void *data) +{ + struct job_entry *job; + + if (!data) + return -EINVAL; + + if (!jid) + return -EINVAL; + + down_read(&job_table_sem); + job = job_getjob(jid); + if (!job) { + up_read(&job_table_sem); + return -ENODATA; + } + + down_read(&job->sem); + up_read(&job_table_sem); + + switch (type) { + case JOB_ACCT_CSA: + { + struct job_csa *csa = (struct job_csa *)data; + + if (subfield & JOB_CSA_ACCTFILE) { + job->csa.acctfile = csa->job_acctfile; + } + if (subfield & JOB_CSA_COREHIMEM) { + job->csa.corehimem = csa->job_corehimem; + } + if (subfield & JOB_CSA_VIRTHIMEM) { + job->csa.virthimem = csa->job_virthimem; + } + break; + } + default: + up_read(&job->sem); + return -EINVAL; + break; + } + up_read(&job->sem); + return 0; +} + + + +/* + * job_dispatcher - handles job ioctl requests + * @request: The syscall request type + * @data: The syscall request data + * + * Returns 0 on success and -(ERRNO VALUE) upon failure. 
+ */ +int +job_dispatcher(unsigned int request, unsigned long data) +{ + int rc=0; + + switch (request) { + case JOB_CREATE: + rc = job_dispatch_create((struct job_create *)data); + break; + case JOB_ATTACH: + case JOB_DETACH: + /* RESERVED */ + rc = -EBADRQC; + break; + case JOB_GETJID: + rc = job_dispatch_getjid((struct job_getjid *)data); + break; + case JOB_WAITJID: + rc = job_dispatch_waitjid((struct job_waitjid *)data); + break; + case JOB_KILLJID: + rc = job_dispatch_killjid((struct job_killjid *)data); + break; + case JOB_GETJIDCNT: + rc = job_dispatch_getjidcnt((struct job_jidcnt *)data); + break; + case JOB_GETJIDLST: + rc = job_dispatch_getjidlst((struct job_jidlst *)data); + break; + case JOB_GETPIDCNT: + rc = job_dispatch_getpidcnt((struct job_pidcnt *)data); + break; + case JOB_GETPIDLST: + rc = job_dispatch_getpidlst((struct job_pidlst *)data); + break; + case JOB_GETUSER: + rc = job_dispatch_getuser((struct job_user *)data); + break; + case JOB_GETPRIMEPID: + rc = job_dispatch_getprimepid((struct job_primepid *)data); + break; + case JOB_SETHID: + rc = job_dispatch_sethid((struct job_sethid *)data); + break; + case JOB_DETACHJID: + rc = job_dispatch_detachjid((struct job_detachjid *)data); + break; + case JOB_DETACHPID: + rc = job_dispatch_detachpid((struct job_detachpid *)data); + break; + case JOB_ATTACHPID: + rc = job_dispatch_attachpid((struct job_attachpid *)data); + break; + case JOB_SETJLIMIT: + case JOB_GETJLIMIT: + case JOB_GETJUSAGE: + case JOB_FREE: + default: + rc = -EBADRQC; + break; + } + + return rc; +} + + +/* + * job_ioctl - handles job ioctl call requests + * + * + * Returns 0 on success and -(ERRNO VALUE) upon failure. + */ +int +job_ioctl(struct inode *inode, struct file *file, unsigned int request, + unsigned long data) +{ + return job_dispatcher(request, data); +} + + +/* + * init_module + * + * This function is called when a module is inserted into a kernel. This + * function allocates any necessary structures and sets initial values for + * module data. + * + * If the function succeeds, then 0 is returned. On failure, -1 is returned. + */ +static int __init +init_job(void) +{ + int i,rc; + + + /* Initialize the job table chains */ + for (i = 0; i < HASH_SIZE; i++) { + INIT_LIST_HEAD(&job_table[i]); + } + + /* Get hostID string and fill in jid_template hostID segment */ + if (hid) { + jid_hid = (int)simple_strtoul(hid, &hid, 16); + } else { + jid_hid = 0; + } + + rc = pnotify_register(&events); + if (rc < 0) { + return -1; + } + + /* Setup our /proc entry file */ + job_proc_entry = create_proc_entry(JOB_PROC_ENTRY, + S_IFREG | S_IRUGO, &proc_root); + + if (!job_proc_entry) { + pnotify_unregister(&events); + return -1; + } + + job_proc_entry->proc_fops = &job_file_ops; + job_proc_entry->proc_iops = NULL; + + + return 0; +} +module_init(init_job); + +/* + * cleanup_module + * + * This function is called to cleanup after a module when it is removed. + * All memory allocated for this module will be freed. + * + * This function does not take any inputs or produce and output. 
+ */ +static void __exit +cleanup_job(void) +{ + remove_proc_entry(JOB_PROC_ENTRY, &proc_root); + pnotify_unregister(&events); + return; +} +module_exit(cleanup_job); + +EXPORT_SYMBOL(job_register_acct); +EXPORT_SYMBOL(job_unregister_acct); +EXPORT_SYMBOL(job_getjid); +EXPORT_SYMBOL(job_getacct); +EXPORT_SYMBOL(job_setacct); Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-27 17:46:52.056156447 -0500 +++ linux/kernel/Makefile 2005-09-27 17:53:47.970289106 -0500 @@ -21,6 +21,7 @@ obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_COMPAT) += compat.o obj-$(CONFIG_PNOTIFY) += pnotify.o +obj-$(CONFIG_JOB) += job.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_IKCONFIG_PROC) += configs.o Index: linux/include/linux/jobctl.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/jobctl.h 2005-09-27 17:53:47.977124355 -0500 @@ -0,0 +1,185 @@ +/* + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + * + * + * Description: This file, include/linux/jobctl.h, contains the data + * definitions used by job to communicate with pnotify via the /proc/job + * ioctl interface. 
+ * + */ + +#ifndef _LINUX_JOBCTL_H +#define _LINUX_JOBCTL_H +#ifndef __KERNEL__ +#include +#include +#include +#endif + +#define PNOTIFY_NAMELEN 32 /* Max chars in PNOTIFY module name */ +#define PNOTIFY_NAMESTR PNOTIFY_NAMELEN+1 /* PNOTIFY mod name string including + * room for end-of-string = '\0' */ + +/* + * ======================= + * JOB PNOTIFY definitions + * ======================= + */ +#define PNOTIFY_JOB "job" /* PNOTIFY module identifier string */ + + + +/* + * ================ + * KERNEL INTERFACE + * ================ + */ +#define JOB_PROC_ENTRY "job" /* /proc entry name */ +#define JOB_IOCTL_NUM 'A' + + +/* + * + * Define ioctl options available in the job module + * + */ + +#define JOB_NOOP _IOWR(JOB_IOCTL_NUM, 0, void *) /* No-op options */ + +#define JOB_CREATE _IOWR(JOB_IOCTL_NUM, 1, void *) /* Create a job - uid = 0 only */ +#define JOB_ATTACH _IOWR(JOB_IOCTL_NUM, 2, void *) /* RESERVED */ +#define JOB_DETACH _IOWR(JOB_IOCTL_NUM, 3, void *) /* RESERVED */ +#define JOB_GETJID _IOWR(JOB_IOCTL_NUM, 4, void *) /* Get Job ID for specified pid */ +#define JOB_WAITJID _IOWR(JOB_IOCTL_NUM, 5, void *) /* Wait for job to complete */ +#define JOB_KILLJID _IOWR(JOB_IOCTL_NUM, 6, void *) /* Send signal to job */ +#define JOB_GETJIDCNT _IOWR(JOB_IOCTL_NUM, 9, void *) /* Get number of JIDs on system */ +#define JOB_GETJIDLST _IOWR(JOB_IOCTL_NUM, 10, void *) /* Get list of JIDs on system */ +#define JOB_GETPIDCNT _IOWR(JOB_IOCTL_NUM, 11, void *) /* Get number of PIDs in JID */ +#define JOB_GETPIDLST _IOWR(JOB_IOCTL_NUM, 12, void *) /* Get list of PIDs in JID */ +#define JOB_SETJLIMIT _IOWR(JOB_IOCTL_NUM, 13, void *) /* Future: set job limits info */ +#define JOB_GETJLIMIT _IOWR(JOB_IOCTL_NUM, 14, void *) /* Future: get job limits info */ +#define JOB_GETJUSAGE _IOWR(JOB_IOCTL_NUM, 15, void *) /* Future: get job res.
usage */ +#define JOB_FREE _IOWR(JOB_IOCTL_NUM, 16, void *) /* Future: Free job entry */ +#define JOB_GETUSER _IOWR(JOB_IOCTL_NUM, 17, void *) /* Get owner for job */ +#define JOB_GETPRIMEPID _IOWR(JOB_IOCTL_NUM, 18, void *) /* Get prime pid for job */ +#define JOB_SETHID _IOWR(JOB_IOCTL_NUM, 19, void *) /* Set HID for jid values */ +#define JOB_DETACHJID _IOWR(JOB_IOCTL_NUM, 20, void *) /* Detach all tasks from job */ +#define JOB_DETACHPID _IOWR(JOB_IOCTL_NUM, 21, void *) /* Detach a task from job */ +#define JOB_ATTACHPID _IOWR(JOB_IOCTL_NUM, 22, void *) /* Attach a task to a job */ +#define JOB_OPT_MAX _IOWR(JOB_IOCTL_NUM, 23 , void *) /* Should always be highest number */ + + +/* + * Define ioctl request structures for job module + */ + +struct job_create { + u64 r_jid; /* Return value of JID */ + u64 jid; /* Jid value requested */ + int user; /* UID of user associated with job */ + int options;/* creation options - unused */ +}; + + +struct job_getjid { + u64 r_jid; /* Returned value of JID */ + pid_t pid; /* Info requested for PID */ +}; + + +struct job_waitjid { + u64 r_jid; /* Returned value of JID */ + u64 jid; /* Waiting on specified JID */ + int stat; /* Status information on JID */ + int options;/* Waiting options */ +}; + + +struct job_killjid { + int r_val; /* Return value of kill request */ + u64 jid; /* Sending signal to all PIDs in JID */ + int sig; /* Signal to send */ +}; + + +struct job_jidcnt { + int r_val; /* Number of JIDs on system */ +}; + + +struct job_jidlst { + int r_val; /* Number of JIDs in list */ + u64 *jid; /* List of JIDs */ +}; + + +struct job_pidcnt { + int r_val; /* Number of PIDs in JID */ + u64 jid; /* Getting count of JID */ +}; + + +struct job_pidlst { + int r_val; /* Number of PIDs in list */ + pid_t *pid; /* List of PIDs */ + u64 jid; +}; + + +struct job_user { + int r_user; /* The UID of the owning user */ + u64 jid; /* Get the UID for this job */ +}; + +struct job_primepid { + pid_t r_pid; /* The prime pid */ + u64 jid; /* Get the prime pid for this job */ +}; + +struct job_sethid { + unsigned long r_hid; /* Value that was set */ + unsigned long hid; /* Value to set to */ +}; + + +struct job_detachjid { + int r_val; /* Number of tasks detached from job */ + u64 jid; /* Job to detach processes from */ +}; + +struct job_detachpid { + u64 r_jid; /* Jod ID task was attached to */ + pid_t pid; /* Task to detach from job */ +}; + +struct job_attachpid { + u64 r_jid; /* Job ID task is to attach to */ + pid_t pid; /* Task to be attached */ +}; + +#endif /* _LINUX_JOBCTL_H */ Index: linux/include/linux/job_acct.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/job_acct.h 2005-09-27 17:53:47.978100819 -0500 @@ -0,0 +1,124 @@ +/* + * Linux Job kernel definitions & interfaces using pnotify + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Description: This file, include/linux/job_acct.h, contains the data + * structure definitions and function prototypes used + * by other kernel bits that communicate with the job + * module. One such example is Comprehensive System + * Accounting (CSA). + */ + +#ifndef _LINUX_JOB_ACCT_H +#define _LINUX_JOB_ACCT_H + +/* + * ================ + * GENERAL USE INFO + * ================ + */ + +/* + * The job start/stop events: These will identify the + * reason the jobstart and jobend callbacks are being + * called. + */ +enum { + JOB_EVENT_IGNORE = 0, + JOB_EVENT_START = 1, + JOB_EVENT_RESTART = 2, + JOB_EVENT_END = 3, +}; + + +/* + * ========================================= + * INTERFACE INFO FOR ACCOUNTING SUBSCRIBERS + * ========================================= + */ + +/* To register as a job dependent accounting module */ +struct job_acctmod { + int type; /* CSA or something else */ + int (*jobstart)(int event, void *data); + int (*jobend)(int event, void *data); + void (*eop)(int code, void *data1, void *data2); + struct module *module; +}; + + +/* + * Subscriber type: Each module that registers as an accounting data + * "subscriber" has to have a type. This type will identify the + * appropriate structs and macros to use when exchanging data. + */ +#define JOB_ACCT_CSA 0 +#define JOB_ACCT_COUNT 1 /* Number of entries available */ + + +/* + * -------------- + * CSA ACCOUNTING + * -------------- + */ + +/* + * For data exchange between job and csa.
The embedded defines + * identify the sub-fields + */ +struct job_csa { +#define JOB_CSA_JID 001 + u64 job_id; +#define JOB_CSA_UID 002 + uid_t job_uid; +#define JOB_CSA_START 004 + time_t job_start; +#define JOB_CSA_COREHIMEM 010 + u64 job_corehimem; +#define JOB_CSA_VIRTHIMEM 020 + u64 job_virthimem; +#define JOB_CSA_ACCTFILE 040 + struct file *job_acctfile; +}; + + +/* + * =================== + * FUNCTION PROTOTYPES + * =================== + */ +int job_register_acct(struct job_acctmod *); +int job_unregister_acct(struct job_acctmod *); +u64 job_getjid(struct task_struct *); +int job_getacct(u64, int, void *); +int job_setacct(u64, int, int, void *); + +#endif /* _LINUX_JOB_ACCT_H */ Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-27 17:53:47.137365248 -0500 +++ linux/kernel/fork.c 2005-09-27 17:53:47.979077283 -0500 @@ -122,6 +122,7 @@ if (!profile_handoff_task(tsk)) free_task(tsk); } +EXPORT_SYMBOL_GPL(__put_task_struct); void __init fork_init(unsigned long mempages) { From dipankar@in.ibm.com Thu Sep 29 12:20:40 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 12:20:42 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TJKdO0029219 for ; Thu, 29 Sep 2005 12:20:40 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e36.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8TJGZmP018626 for ; Thu, 29 Sep 2005 15:16:35 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8TJILXk441948 for ; Thu, 29 Sep 2005 13:18:21 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8TJHguo032149 for ; Thu, 29 Sep 2005 13:17:43 -0600 Received: from localhost.localdomain ([9.12.229.186]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8TJHefx032002; Thu, 29 Sep 2005 13:17:41 -0600 Received: by localhost.localdomain (Postfix, from userid 500) id DAF4314337; Fri, 30 Sep 2005 00:41:49 +0530 (IST) Date: Fri, 30 Sep 2005 00:41:49 +0530 From: Dipankar Sarma To: Erik Jacobson Cc: Christoph Hellwig , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: [Lse-tech] [PATCH] RCU subscriber_list Process Notification (pnotify) Message-ID: <20050929191149.GD6646@in.ibm.com> Reply-To: dipankar@in.ibm.com References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> <20050929165328.GA15246@sgi.com> <20050929170946.GC6646@in.ibm.com> <20050929180916.GA18619@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929180916.GA18619@sgi.com> User-Agent: Mutt/1.5.10i X-archive-position: 135 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: dipankar@in.ibm.com Precedence: bulk X-list: pagg On Thu, Sep 29, 2005 at 01:09:16PM -0500, Erik Jacobson wrote: > > Could you use a per-subscriber reference count ? That will allow > > you to drop rcu_read_lock() safely. > > In the performance tests, I showed that I really couldn't measure > the speed difference in AIM7 and fork bomb tests using the original > rwsems and a kernel without pnotify and job. 
The process for the tests and > any children processes all had two subscribers (keyring and job) at the > time. I'm just not convinced that RCU is the right fit for pnotify in > general since I don't see a speed gain. > > If my proof of concept rcu pnotify patch is so bad that it can't even be > used to gage some performance numbers, I'm happy to try new things. > However, it seems to me that most methods for fixing the rcu pnotify patch > would decrease efficiency rather than increase it. > > As was pointed out to me in a discussion in the pagg mailing list, we > could be in a situation where we normally have as many writers as > readers for many situations. I'm not sure the rule of thumb for writers vs > readers points to a good match for RCU. Oh, I am only pointing out RCU problems. It does make sense to do some benchmarking and see if it has benefits over rwsem or not. I would like to see the comparison on one of those SGI behemoths instead of a 2-cpu box :) Thanks Dipankar From erikj@sgi.com Thu Sep 29 12:29:15 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 12:29:18 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TJTFO0029788 for ; Thu, 29 Sep 2005 12:29:15 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8TKTpqo019974 for ; Thu, 29 Sep 2005 13:29:51 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8TJQNDN16661544; Thu, 29 Sep 2005 14:26:23 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8TJQMS94292558; Thu, 29 Sep 2005 14:26:22 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 364186028D21; Thu, 29 Sep 2005 14:26:22 -0500 (CDT) Date: Thu, 29 Sep 2005 14:26:22 -0500 From: Erik Jacobson To: Dipankar Sarma Cc: Erik Jacobson , Christoph Hellwig , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: [Lse-tech] [PATCH] RCU subscriber_list Process Notification (pnotify) Message-ID: <20050929192622.GB23932@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> <20050929165328.GA15246@sgi.com> <20050929170946.GC6646@in.ibm.com> <20050929180916.GA18619@sgi.com> <20050929191149.GD6646@in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929191149.GD6646@in.ibm.com> User-Agent: Mutt/1.5.6i X-archive-position: 136 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > Oh, I am only pointing out RCU problems. It does make sense to do > some benchmarking and see if it has benefits over rwsem or not. > I would like to see the comparison on one of those SGI behemoths > instead of a 2-cpu box :) I can run the same tests on a bigger box, sure. I guess the host name in the AIM output isn't even that exciting for you -- minime1 :) I'll get some time on a larger system and get back to you. Thanks! 
-- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From kingsley@aurema.com Thu Sep 29 15:41:26 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 15:41:28 -0700 (PDT) Received: from smtp.sw.oz.au (alt.aurema.com [203.217.18.57]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TMfOO0009516 for ; Thu, 29 Sep 2005 15:41:26 -0700 Received: from smtp.sw.oz.au (localhost [127.0.0.1]) by smtp.sw.oz.au with ESMTP id j8TMcRc2007536; Fri, 30 Sep 2005 08:38:27 +1000 (EST) Received: (from kingsley@localhost) by smtp.sw.oz.au id j8TMcRNl007535; Fri, 30 Sep 2005 08:38:27 +1000 (EST) Date: Fri, 30 Sep 2005 08:38:27 +1000 From: kingsley@aurema.com To: Erik Jacobson Cc: pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050929223827.GA2737@aurema.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <20050929051627.GC3404@aurema.com> <20050929151205.GB7395@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929151205.GB7395@sgi.com> User-Agent: Mutt/1.4.2.1i X-Scanned-By: MIMEDefang 2.52 on 192.41.203.35 X-archive-position: 137 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kingsley@aurema.com Precedence: bulk X-list: pagg On Thu, Sep 29, 2005 at 10:12:06AM -0500, Erik Jacobson wrote: > > __pnotify_exit() would need to call the exit callback for all clients > > except for the client failing the fork call. To do this wouldn't the > > following be needed in __pnotify_fork()? > > If the fork fails in copy_process, we go to cleanup in copy_process, > not exit.c. So, with the revision, if the the pnotify kernel module > subscriber returns a failure for the process, that failure is passed > along to the copy_process call. There, we go to bad_fork_cleanup_namespace > where a pnotify_exit is done on the process that failed to fork. > > In the old version, the pnotify_exit would have also been run within > __pnotify_fork. In the new version, we don't run the pnotify_exit > from __pnotify_fork because we know it will be run in the fork failure > path of copy_process. That should mean that pnotify_exit will only > execute once. I think that means that your suggestion for adding > a pnotify_unsubscribe isn't necessary? The unsubscribe is done by > __pnotify_exit in this case. Or am I still missing the point here? > Sorry if it isn't getting through my head. That's okay. I've taken a look at the test version and it doesn't seem like its doing what I suggested, so let me try again. My suggestion is that the unsubscribe exit callback should only be done for clients who successfully subscribed. For example, during a fork: 1. Client A subscribes with its subscriber->events->fork() being successful. 2. Client B subscribes with its subscriber->events->fork() being successful. 3. Client C fails the fork and its subscriber->events->fork() returns a failure. Client C has its subscription removed from the pnotify list but its subscriber->events->exit() callback should not be called. 4. Client A & B are unsubscribed and their subscriber->events->exit() callbacks are invoked. Right now, AFAICS, client C would have its subscriber->events->exit() callback invoked in spite of a failure. IMHO the exit() callback should only ever be called for a client for whom its fork() callback succeeded. What do you think? 
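To make the suggested ordering concrete, a rough sketch of a fork-time loop with these semantics might look like the following. This is not the __pnotify_fork() from the posted patch: the list member name (here "list") and the use of the parent subscriber's data pointer are assumptions; only pnotify_subscribe(), pnotify_unsubscribe() and the fork()/exit() callbacks are taken from the patch.

	/* Sketch only: fork-time subscription with the suggested error semantics. */
	static int pnotify_fork_sketch(struct task_struct *child,
				       struct task_struct *parent)
	{
		struct pnotify_subscriber *sub, *new_sub, *tmp;
		int err = 0;

		list_for_each_entry(sub, &parent->pnotify_subscriber_list, list) {
			new_sub = pnotify_subscribe(child, sub->events);
			if (!new_sub) {
				err = -ENOMEM;
				goto undo;
			}
			err = new_sub->events->fork(child, new_sub, sub->data);
			if (err) {
				/* Client C: its fork() failed, so no exit() callback. */
				pnotify_unsubscribe(new_sub);
				goto undo;
			}
		}
		return 0;

	undo:
		/* Clients A and B: their fork() succeeded, so they do get exit(). */
		list_for_each_entry_safe(sub, tmp,
					 &child->pnotify_subscriber_list, list) {
			sub->events->exit(child, sub);
			pnotify_unsubscribe(sub);
		}
		return err;
	}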
> > See the RCU test version of pnotify in the download site under > pnotify-test. My attempts at posting that patch to the list seem to be > eaten by the list server right now. When that's fixed, I'll start posting > stuff here. > > Erik Thanks, -- Kingsley From kaigai@ak.jp.nec.com Thu Sep 29 19:07:46 2005 Received: with ECARTIS (v1.0.0; list pagg); Thu, 29 Sep 2005 19:07:50 -0700 (PDT) Received: from tyo201.gate.nec.co.jp (TYO201.gate.nec.co.jp [210.143.35.51]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U27jO0030287 for ; Thu, 29 Sep 2005 19:07:45 -0700 Received: from mailgate3.nec.co.jp (mailgate53.nec.co.jp [10.7.69.192]) by tyo201.gate.nec.co.jp (8.11.7/3.7W01080315) with ESMTP id j8U24nE24531; Fri, 30 Sep 2005 11:04:49 +0900 (JST) Received: (from root@localhost) by mailgate3.nec.co.jp (8.11.7/3.7W-MAILGATE-NEC) id j8U24mk02875; Fri, 30 Sep 2005 11:04:48 +0900 (JST) Received: from mailsv.linux.bs1.fc.nec.co.jp (namesv2.linux.bs1.fc.nec.co.jp [10.34.125.2]) by mailsv4.nec.co.jp (8.11.7/3.7W-MAILSV4-NEC) with ESMTP id j8U24lb27848; Fri, 30 Sep 2005 11:04:47 +0900 (JST) Received: from [10.34.125.249] (sanma.linux.bs1.fc.nec.co.jp [10.34.125.249]) by mailsv.linux.bs1.fc.nec.co.jp (Postfix) with ESMTP id 7C2E52FE18; Fri, 30 Sep 2005 11:04:32 +0900 (JST) Message-ID: <433C9D38.8060801@ak.jp.nec.com> Date: Fri, 30 Sep 2005 11:04:40 +0900 From: Kaigai Kohei User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317) X-Accept-Language: ja, en-us, en MIME-Version: 1.0 To: Erik Jacobson Cc: Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> <433B80B6.2010604@ak.jp.nec.com> <20050929144738.GA7395@sgi.com> In-Reply-To: <20050929144738.GA7395@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 138 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: kaigai@ak.jp.nec.com Precedence: bulk X-list: pagg Hi, > Hi. I'm confused with the above paragraph. My initial plan was only to > convert how job uses pnotify to be rcu-aware. So this mainly relates to > the pnotify_subscriber_list. > > job_dispatch_attachpid adds the kernel module as a subscriber to the given > process. It doesn't access the list directly but does (in my updated version) > have the write lock held and the rcu_read_lock/unlock calls. But it calls > pnotify_subscribe to actually add the kernel module to the subscriber list > of the task. > > pnotify_subscribe uses list_add_tail_rcu to do this. Ahh, It might be my misunderstandings. Since the imeplementation of job does not allow to move a process from one job to another one directly, The above-mentioned worry is useless. When a process moves between jobs, job was detached at first. Right ? Therefore, it's hidden by pnotify_subscribe(). OK. > Maybe there are other examples that need adjustments? On the download site, > in pnotify-test, you'll see my first pass at an RCU versoin of pnotify. > In job-test, you'll see a first cut at a rcu-pnotify version of Job. Now, I'm trying to read your new pnotify/job patch in this morning. 
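For reference, the generic pattern behind list_add_tail_rcu() looks like the sketch below (not code from the pnotify patch; the "list" member name is an assumption). Writers still serialize against each other, here with the per-task rwsem, while readers traverse under rcu_read_lock() and therefore must not sleep.

	/* Writer side: serialized by the per-task rwsem, published via RCU. */
	static void pnotify_add_sketch(struct task_struct *task,
				       struct pnotify_subscriber *new_sub)
	{
		down_write(&task->pnotify_subscriber_list_sem);
		list_add_tail_rcu(&new_sub->list, &task->pnotify_subscriber_list);
		up_write(&task->pnotify_subscriber_list_sem);
	}

	/* Reader side: lockless traversal, but nothing may block in here. */
	static void pnotify_walk_sketch(struct task_struct *task)
	{
		struct pnotify_subscriber *sub;

		rcu_read_lock();
		list_for_each_entry_rcu(sub, &task->pnotify_subscriber_list, list) {
			/* no GFP_KERNEL kmalloc(), no rwsems, no schedule() here */
		}
		rcu_read_unlock();
	}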
>>When anyone tries to associate a job with a running multithread-process, >>it's required to scan for each thread in this process under >>read_lock(&tasklist_lock), because job is an aggregation of processes, >>not an aggregation of threads. > > > OK; I think I see some of what you're saying here now. If it isn't > urgent, let's defer this until we know what's happening with pnotify. It's not urgent problem for me. > I guess I'm most interested in any logic problems I have in the way I used > pnotify RCU with job, not problems with how Job might have a flaw that has > always been there. Let's address those later. Yes, I agree. But would you remind that there is a difficulty to implement an pnotify/PAGG client, if its private-members are protected by semaphore. I think implementing a new pnotify's client becomes easier, if pnotify_subscriber_list would be protected by rwlock. > If there is no measurable difference, it seems that RCU might not be the > best answer because we're increasing complexity for no good resaon. I saw what you posted in LSE-tech. In my opinion, there is no significant difference between two versions. fork() and exit() are originally so complex processing which acquires many locking-objects. Thus, pnotify's cost might be small enough. Thanks, -- Linux Promotion Center, NEC KaiGai Kohei From erikj@sgi.com Fri Sep 30 07:08:29 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 30 Sep 2005 07:08:34 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UE8TO0028873 for ; Fri, 30 Sep 2005 07:08:29 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8UF99Jf015771 for ; Fri, 30 Sep 2005 08:09:10 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8UE5ZDN16716089; Fri, 30 Sep 2005 09:05:35 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8UE5XS94367693; Fri, 30 Sep 2005 09:05:34 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id CD1906022F49; Fri, 30 Sep 2005 09:05:33 -0500 (CDT) Date: Fri, 30 Sep 2005 09:05:33 -0500 From: Erik Jacobson To: Kaigai Kohei Cc: Erik Jacobson , Kingsley Cheung , pagg@oss.sgi.com, tonyt@aurema.com, paulmck@us.ibm.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050930140533.GC18478@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <433A7FE4.5040109@ak.jp.nec.com> <20050928141831.GA24110@sgi.com> <433B80B6.2010604@ak.jp.nec.com> <20050929144738.GA7395@sgi.com> <433C9D38.8060801@ak.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <433C9D38.8060801@ak.jp.nec.com> User-Agent: Mutt/1.5.6i X-archive-position: 139 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg > I think implementing a new pnotify's client becomes easier, > if pnotify_subscriber_list would be protected by rwlock. The issue I have here is an rwlock is really a spinlock... So that would forbid kernel modules from sleeping. 
This includes, for example, waiting for memory to become free with kmalloc. The modules wouldn't be able to use semaphores on their own either. We could investigate trying to convert the job code, for example, to never sleep with the lock held but I'm not sure how easy that would be. On the other hand, I understand the community is anxious about modules being able to sleep in the fork or exit path. I don't think we can do everything we want to do if the subscriber list is locked with rwlock. But I'm open to being convinced otherwise :) Erik From erikj@sgi.com Fri Sep 30 07:52:53 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 30 Sep 2005 07:53:01 -0700 (PDT) Received: from omx1.americas.sgi.com (omx1-ext.sgi.com [192.48.179.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UEqqO0031838 for ; Fri, 30 Sep 2005 07:52:53 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx1.americas.sgi.com (8.12.10/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8UEo0xT025984 for ; Fri, 30 Sep 2005 09:50:00 -0500 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8UEo0DN16720698 for ; Fri, 30 Sep 2005 09:50:00 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8UEo0S94358860 for ; Fri, 30 Sep 2005 09:50:00 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 293746022F49; Fri, 30 Sep 2005 09:50:00 -0500 (CDT) Date: Fri, 30 Sep 2005 09:50:00 -0500 From: Erik Jacobson To: pagg@oss.sgi.com Subject: Small fix to job patch Message-ID: <20050930145000.GB22845@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.6i X-archive-position: 140 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg I posted job-2.6.14-rc2-patch. It will apply fine to 2.6.13. The only real difference is somehow some brackets went missing in job_dispatch_getpidlst... resulting in that function always returning -ENOMEM. It turns out my mini job test suite doesn't catch this. FYI.
-- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From erikj@sgi.com Fri Sep 30 10:02:15 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 30 Sep 2005 10:02:18 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UH2EO0012452 for ; Fri, 30 Sep 2005 10:02:14 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8UI2wrG014480 for ; Fri, 30 Sep 2005 11:02:58 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8UGxMDN16726346; Fri, 30 Sep 2005 11:59:22 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8UGxLS94364023; Fri, 30 Sep 2005 11:59:21 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 651B56022F49; Fri, 30 Sep 2005 11:59:21 -0500 (CDT) Date: Fri, 30 Sep 2005 11:59:21 -0500 From: Erik Jacobson To: Erik Jacobson Cc: Christoph Hellwig , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: Process Notification (pnotify) performance comparisons Message-ID: <20050930165921.GA30608@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> <20050929170325.GD15246@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929170325.GD15246@sgi.com> User-Agent: Mutt/1.5.6i X-archive-position: 141 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg New performance tests. These tests were run on a 32-processor Altix system. It has 1300 MHz CPUs. Note that the individual CPUs are slower than my earlier testing with the 2p box. fork-bomb-like tests are slower on big NUMA machines like this unless you restrict them to running CPUs within the same node (and this is why the 2p box does better in that respect). I know it was requested that I run this on a behemoth - I'm just hoping 32p is big enough. The bigger the machine, the harder it is to reserve time... As seen in the thread, my first pass at an RCU implementation isn't good enough to use for real. I'm hoping it is good enough to get us some numbers. Conclusions: 1. My feeling is that RCU isn't buying us much here. In fact, it restricts what kernel modules can do in that you can't sleep in many interesting places. This affects things such as kmalloc's (unless you say you aren't willing to wait)... It also makes it so you can't use rwsems in all the places you might need to. 2. In the data below, unlike the 2p test, there are some differences. My conclusion here, and the reason I added an additional test, is that the Job infrastructure is the cause for a small slow down. So I added a kernel test where pnotify still has two kernel module subscribers per task, but Job isn't one of them. Job is replaced by a tiny kernel module that just simply counts the number of times the fork, exit, and exec hooks fire. This takes Job out of the equation. That kernel closely tracks the stock kernel. That means pnotify isn't at fault, Job is. 
In other words, if Job implemented its own hooks and was configured on, we'd probably see the same small dip. 3. The fork/exit tests (jump to the end for a summary) show stock performing a tiny bit better than pnotify with two subscribers and no job (keyrings and a test module). I suspect if I did a test where pnotify only had keyrings as a subscriber, the numbers would be nearly the same. The AIM7 data shows these two kernels as very similar. 4. So I feel pnotify isn't really costing us but the data does show we need to keep an eye on pnotify users. We need to treat new pnotify users in a similar way as new callouts in exit, copy_process, and exec when the pnotify user plans to associate with all or many running tasks (at least for performance reasons) I'm going to integrate other comments I got on the pnotify patch now and send a new version of the non-rcu pnotify patch soon. Some info on the tests: jobtest: a mini job test suite that includes forking enough processes to make the PID numbers roll. Output trimmed. forkexit: just forks and exits the specified number of times. fork-wait-exit: is the same, but the parent waits for the child. AIM7: The version of aim7 we use is not tracking the current community release. All kernels had kdb patches applied and kdb was enabled. 4 kernels tested: 2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, original NON-RCU 2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, RCU Testing 2.6.14-rc2 stock (no pnotify, no job, non-modified keyrings enabled) 2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module In the last kernel, I'm showing a pnotify user that only does atomic adds and subtracts on a few variables and nothing else. The test module, like keyrings, is a subscriber to all processes. The purpose of this test is to take the Job infrastructure out of the performance picture to just focus on pnotify. This tiny test module counts the number of times fork, exit, and exec fires. Output is provided. 2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation NOT using rcu ------------------------------------------------------------------------------ === jobtest === belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m26.222s user 0m1.736s sys 0m21.380s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.574s user 0m2.512s sys 0m31.356s belay:~ # time jobtest [snip] Great. 
All tests completed, no errors real 0m35.696s user 0m2.556s sys 0m31.128s === forkexit === belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m14.665s user 0m0.072s sys 0m14.464s belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m15.439s user 0m0.104s sys 0m15.212s belay:~ # time ./forkexit 40000 Fork returned an error: 7720 times real 0m15.115s user 0m0.068s sys 0m14.924s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.494s user 0m2.136s sys 0m31.588s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.208s user 0m1.760s sys 0m28.208s === AIM7 === Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 15:01:33 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = pnotify+job+new-keyring, NON-RCU Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 16:03:37 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1907.6 100 3.1 0.6 31.7929 2 3694.1 96 3.2 1.2 30.7839 3 5643.2 96 3.1 1.5 31.3510 4 7994.5 99 2.9 1.8 33.3104 5 9417.5 97 3.1 2.4 31.3916 10 18976.2 97 3.1 4.7 31.6270 20 37439.7 98 3.1 9.1 31.1997 50 74348.5 94 3.9 23.6 24.7828 100 101694.9 90 5.7 46.5 16.9492 150 124643.1 87 7.0 67.8 13.8492 200 130669.1 86 8.9 90.3 10.8891 500 154877.9 83 18.8 227.9 5.1626 1000 150566.6 80 38.7 453.3 2.5094 2000 154665.9 78 75.3 937.6 1.2889 2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation with RCU ------------------------------------------------------------------------------ === jobtest === belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m26.501s user 0m1.608s sys 0m20.596s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.501s user 0m2.320s sys 0m30.148s belay:~ # time jobtest [snip] Great. All tests completed, no errors real 0m35.563s user 0m2.576s sys 0m30.632s === forkexit === --> Note: The first attempt took 1 minutes 23 seconds. This only --> happened once. With the stock kernel test below, one attempt was --> an outlier taking more than 5 minutes. Therefore, I don't think it --> is anything I changed that caused this. 
belay:~ # time ./forkexit 40000 Fork returned an error: 7701 times real 1m22.814s user 0m0.064s sys 1m22.740s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m14.557s user 0m0.076s sys 0m14.368s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m14.774s user 0m0.076s sys 0m14.580s belay:~ # time ./forkexit 40000 Fork returned an error: 7693 times real 0m15.218s user 0m0.112s sys 0m14.992s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.922s user 0m1.560s sys 0m28.060s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.178s user 0m1.680s sys 0m27.988s === AIM7 === nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #5 SMP PREEMPT Thu Sep 29 15:00:00 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = RCU pnotify+job+new-keyring Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 16:42:08 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1903.8 100 3.1 0.6 31.7305 2 3655.8 96 3.2 1.2 30.4648 3 5628.6 96 3.1 1.6 31.2701 4 7553.5 96 3.1 2.1 31.4731 5 10003.4 99 2.9 2.2 33.3448 10 18945.3 97 3.1 4.7 31.5755 20 36823.8 97 3.2 9.2 30.6865 50 72532.4 94 4.0 23.4 24.1775 100 97552.8 90 6.0 46.7 16.2588 150 120981.2 87 7.2 68.4 13.4424 200 130581.1 86 8.9 88.7 10.8818 500 151665.2 83 19.2 228.0 5.0555 1000 153586.3 81 37.9 467.5 2.5598 2000 153454.7 79 75.9 941.7 1.2788 2.6.14-rc2 stock+kdb, no pnotify, no job, unmodified keyrings enabled ------------------------------------------------------------------------------ === jobtest === None: kernel doesn't have pnotify or job === forkexit === --> My first attempt on a stock kernel+kdb took more than 5 minutes. The --> data isn't useful because I escaped in to kdb to check on some things. --> There was a similar outlier for my RCU pnotify kernel above that took --> 1 minute 23 above. Later runs were normal. belay:~ # time ./forkexit 40000 Fork returned an error: 7698 times real 0m14.421s user 0m0.088s sys 0m14.220s belay:~ # time ./forkexit 40000 Fork returned an error: 7699 times real 0m14.282s user 0m0.064s sys 0m14.100s belay:~ # time ./forkexit 40000 Fork returned an error: 7699 times real 0m15.736s user 0m0.072s sys 0m15.648s === fork-wait-exit === --> The 16.838 was an outlier I was never able to duplicate. --> I tried many times. Of course, if I resrict it to just two processors --> on a single node, it's faster. Perhaps we got lucky once. 
belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m16.838s user 0m1.080s sys 0m17.092s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.599s user 0m1.728s sys 0m28.248s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.742s user 0m1.668s sys 0m27.332s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.777s user 0m1.720s sys 0m26.548s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.829s user 0m1.644s sys 0m27.168s === AIM7 === nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP PREEMPT Wed Sep 28 14:53:02 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = stock, non-modified-keyrings enabled Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 29 17:38:42 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1909.4 100 3.0 0.6 31.8241 2 3670.8 96 3.2 1.2 30.5897 3 5597.9 96 3.1 1.6 31.0997 4 7543.7 96 3.1 2.1 31.4323 5 9445.0 97 3.1 2.4 31.4833 10 19088.2 97 3.0 4.5 31.8137 20 37977.2 98 3.1 9.0 31.6476 50 73559.2 95 4.0 23.6 24.5197 100 101147.0 90 5.8 47.1 16.8578 150 120496.9 86 7.2 68.5 13.3885 200 129003.7 86 9.0 90.5 10.7503 500 149737.6 83 19.4 222.8 4.9913 1000 158010.5 80 36.8 439.7 2.6335 2000 156453.7 77 74.4 885.3 1.3038 2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module ------------------------------------------------------------------------------ === jobtest === None: kernel doesn't have job === forkexit === --> Like the other tests, the first run of this took longer than the other --> runs of it. belay:~ # time ./forkexit 40000 Fork returned an error: 7686 times real 1m35.260s user 0m0.076s sys 1m35.064s belay:~ # time ./forkexit 40000 Fork returned an error: 7686 times real 0m15.843s user 0m0.068s sys 0m15.652s belay:~ # time ./forkexit 40000 Fork returned an error: 7687 times real 0m14.404s user 0m0.064s sys 0m14.220s belay:~ # time ./forkexit 40000 Fork returned an error: 7687 times real 0m14.487s user 0m0.060s sys 0m14.304s === fork-wait-exit === belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.205s user 0m2.112s sys 0m30.684s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m25.045s user 0m1.872s sys 0m30.208s belay:~ # time ./fork-wait-exit 40000 Fork returned an error: 0 times real 0m24.927s user 0m1.964s sys 0m29.200s === AIM7 === Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #3 SMP PREEMPT Fri Sep 30 09:21:15 CDT 2005 HOST = belay CPUS = 32 DIRS = 1 DISKS= 0 FS = xfs SCSI = non-xscsi ID = non-rcu pnotify, NO Job, pnotify keyrings, tiny pnotify test module Run 1 of 1 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" AIM7 Run Sep 30 10:19:29 2005 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 1909.4 100 3.0 0.6 31.8241 2 3694.1 96 3.2 1.2 30.7839 3 5679.9 97 3.1 1.5 31.5550 4 7538.9 96 3.1 2.2 31.4119 5 10031.0 99 2.9 2.2 33.4367 10 19088.2 97 3.0 4.5 31.8137 20 36964.1 97 3.1 9.2 30.8034 50 72424.1 95 4.0 23.9 24.1414 100 102410.7 90 5.7 46.1 17.0684 150 120148.6 86 7.3 67.4 13.3498 200 131927.9 87 8.8 89.5 10.9940 500 154229.4 83 18.9 220.7 5.1410 1000 158552.9 80 36.7 440.5 2.6425 2000 152884.3 78 76.1 882.8 1.2740 === Special note for this kernel === After all the testing, I removed my tiny test kernel module that did atomic increments to count the number of times various hooks happened. 
Here is the info from that (from dmesg): Unregistering pnotify support for (name=pnotify-test) exit called 797136 times... fork called 796785 times... init called 351 times... exec called 97675 times ... Good - fork count + init count equals exit count. ------------------------------------------------------------------------------ I really would need more trials to see if these converge but I think we're covered on data with the AIM runs. Let me know if more data is requested. It appears stock is performing best but pnotify with two subscribers (keyrings, which is present in stock and my test module that atomic increments counters for the callbacks) is very close. I suspect if I ran another test where only one subscriber was present to match what was in stock, the numbers would be nearly the same. forkexit summary (real time average minus outliers): pnotify+job+pnotify keyrings, NON-RCU: 15.07 pnotify+job+pnotify keyrings, RCU: 14.85 stock: 14.813 pnotify+pnotify keyrings+test mod, non-rcu: 14.911 fork-wait-exit summary (real time average minus outlier): pnotify+job+pnotify keyrings, NON-RCU: 25.351 pnotify+job+pnotify keyrings, RCU: 25.05 stock: 24.703 pnotify+pnotify keyrings+test mod, non-rcu: 25.059 From erikj@sgi.com Fri Sep 30 14:02:51 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 30 Sep 2005 14:02:58 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UL2pO0029900 for ; Fri, 30 Sep 2005 14:02:51 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8UM3a4W020141 for ; Fri, 30 Sep 2005 15:03:36 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8UKwxDN16738719; Fri, 30 Sep 2005 15:58:59 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8UKwwS94387648; Fri, 30 Sep 2005 15:58:58 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 9FE4E6022F49; Fri, 30 Sep 2005 15:58:58 -0500 (CDT) Date: Fri, 30 Sep 2005 15:58:58 -0500 From: Erik Jacobson To: kingsley@aurema.com Cc: Erik Jacobson , pagg@oss.sgi.com, tonyt@aurema.com Subject: Re: [patch] Minor PAGG attach/detach semantic change for 2.6.11 Message-ID: <20050930205858.GA10565@sgi.com> References: <20050617014512.GA10285@aurema.com> <20050927201020.GA30433@sgi.com> <20050929051627.GC3404@aurema.com> <20050929151205.GB7395@sgi.com> <20050929223827.GA2737@aurema.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050929223827.GA2737@aurema.com> User-Agent: Mutt/1.5.6i X-archive-position: 142 X-ecartis-version: Ecartis v1.0.0 Sender: pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg Ok, I think I understand now. You're saying since the process never really was running, it doesn't make sense to run __pnotify_exit. In __pnotify_fork, I think I agree with you. If there was a failure forking, the kernel module wouldn't expect to be notified for exit. Also, if the fork callout indicated that the process should fail as a fork failure, I agree, the kernel module wouldn't expect the the exit callout. 
However, in fork.c's copy_process function, I'm not sure I'm comfortable changing the pnotify_exit call there to pnotify_unsubscribe. I don't want to cause any sort of memory leak where structures a kernel module allocated are not freed. However, we know if our own pnotify_fork call failed, that we don't have anything to cleanup. So I made that jump past the pnotify_exit in the cleanup section. But if anything after pnotify_fork fails, pnotify_exit will be called. This seems pretty normal for copy_process. Does that sound ok? I'll post an updated patch shortly with this and some other feedback. > My suggestion is that the unsubscribe exit callback should only be > done for clients who successfully subscribed. For example, during a > fork: > > 1. Client A subscribes with its subscriber->events->fork() being > successful. > > 2. Client B subscribes with its subscriber->events->fork() being > successful. > > 3. Client C fails the fork and its subscriber->events->fork() returns > a failure. Client C has its subscription removed from the pnotify > list but its subscriber->events->exit() callback should not be called. > > 4. Client A & B are unsubscribed and their subscriber->events->exit() > callbacks are invoked. > > > Right now, AFAICS, client C would have its subscriber->events->exit() > callback invoked in spite of a failure. IMHO the exit() callback > should only ever be called for a client for whom its fork() callback > succeeded. What do you think? > > > > > See the RCU test version of pnotify in the download site under > > pnotify-test. My attempts at posting that patch to the list seem to be > > eaten by the list server right now. When that's fixed, I'll start posting > > stuff here. > > > > Erik > > Thanks, > -- > Kingsley -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota From erikj@sgi.com Fri Sep 30 14:46:45 2005 Received: with ECARTIS (v1.0.0; list pagg); Fri, 30 Sep 2005 14:46:51 -0700 (PDT) Received: from omx2.sgi.com (omx2-ext.sgi.com [192.48.171.19]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8ULkiO0031754 for ; Fri, 30 Sep 2005 14:46:44 -0700 Received: from flecktone.americas.sgi.com (flecktone.americas.sgi.com [198.149.16.15]) by omx2.sgi.com (8.12.11/8.12.9/linux-outbound_gateway-1.1) with ESMTP id j8UMlUGs025886 for ; Fri, 30 Sep 2005 15:47:30 -0700 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by flecktone.americas.sgi.com (8.12.9/8.12.10/SGI_generic_relay-1.2) with ESMTP id j8ULgqDN16738232; Fri, 30 Sep 2005 16:42:52 -0500 (CDT) Received: from snoot.americas.sgi.com (hoot.americas.sgi.com [128.162.233.104]) by thistle-e236.americas.sgi.com (8.12.9/SGI-server-1.8) with ESMTP id j8ULgpS94415492; Fri, 30 Sep 2005 16:42:51 -0500 (CDT) Received: by snoot.americas.sgi.com (Postfix, from userid 31161) id 28CBC6022F49; Fri, 30 Sep 2005 16:42:51 -0500 (CDT) Date: Fri, 30 Sep 2005 16:42:51 -0500 From: Erik Jacobson To: Christoph Hellwig Cc: Erik Jacobson , lse-tech@lists.sourceforge.net, akpm@osdl.org, kingsley@aurema.com, canon@nersc.gov, pagg@oss.sgi.com Subject: Re: [Lse-tech] [PATCH] Process Notification (pnotify) Message-ID: <20050930214250.GA13326@sgi.com> References: <20050921213645.GB28239@sgi.com> <20050922151647.GA30784@infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050922151647.GA30784@infradead.org> User-Agent: Mutt/1.5.6i X-archive-position: 143 X-ecartis-version: Ecartis v1.0.0 Sender: 
pagg-bounce@oss.sgi.com Errors-to: pagg-bounce@oss.sgi.com X-original-sender: erikj@sgi.com Precedence: bulk X-list: pagg Based on feedback obtained so far, I have posted a number of different versions of a few patches. In the performance discussion, I show that I don't belive RCU is faster by showing the numbers for RCU based job are similar to those of original rwsem version. The grain of salt there is that my RCU implementation wasn't complete as my Job patch couldn't fully use it within the restrictions of 'no sleep during rcu_read_locks'. The test runs were done on a 2p and 32p ia64 box - see the performance subthread of this thread. I implemented two versions of a keyring patch showing how an existing piece of the kernel may make use of pnotify - one for the RCU version and one for the original version. Because my feeling, after doing this research, is that the RCU version of pnotify may not be the best fit, I went ahead and implemented the other feedback gathered from this list, the pagg mailing list, and a co-worker. Here is a cleaned up pnotify patch that is _not_ using RCU. Signed-off-by: Erik Jacobson --- Documentation/pnotify.txt | 368 +++++++++++++++++++++++++++++ fs/exec.c | 2 include/linux/init_task.h | 2 include/linux/pnotify.h | 227 ++++++++++++++++++ include/linux/sched.h | 5 init/Kconfig | 8 kernel/Makefile | 1 kernel/exit.c | 4 kernel/fork.c | 17 + kernel/pnotify.c | 568 ++++++++++++++++++++++++++++++++++++++++++++++ 10 files changed, 1201 insertions(+), 1 deletion(-) Index: linux/fs/exec.c =================================================================== --- linux.orig/fs/exec.c 2005-09-30 14:57:55.097213456 -0500 +++ linux/fs/exec.c 2005-09-30 14:57:57.629184199 -0500 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -1203,6 +1204,7 @@ retval = search_binary_handler(bprm,regs); if (retval >= 0) { free_arg_pages(bprm); + pnotify_exec(current); /* execve success */ security_bprm_free(bprm); Index: linux/include/linux/init_task.h =================================================================== --- linux.orig/include/linux/init_task.h 2005-09-30 14:57:55.098189920 -0500 +++ linux/include/linux/init_task.h 2005-09-30 14:57:57.636019445 -0500 @@ -2,6 +2,7 @@ #define _LINUX__INIT_TASK_H #include +#include #include #define INIT_FDTABLE \ @@ -121,6 +122,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ + INIT_TASK_PNOTIFY(tsk) \ .fs_excl = ATOMIC_INIT(0), \ } Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h 2005-09-30 14:57:55.098189920 -0500 +++ linux/include/linux/sched.h 2005-09-30 15:25:34.616252251 -0500 @@ -795,6 +795,11 @@ struct mempolicy *mempolicy; short il_next; #endif +#ifdef CONFIG_PNOTIFY +/* List of pnotify kernel module subscribers */ + struct list_head pnotify_subscriber_list; + struct rw_semaphore pnotify_subscriber_list_sem; +#endif #ifdef CONFIG_CPUSETS struct cpuset *cpuset; nodemask_t mems_allowed; Index: linux/init/Kconfig =================================================================== --- linux.orig/init/Kconfig 2005-09-30 14:57:55.099166384 -0500 +++ linux/init/Kconfig 2005-09-30 15:25:34.489311959 -0500 @@ -162,6 +162,14 @@ for processing it. A preliminary version of these tools is available at . +config PNOTIFY + bool "Support for Process Notification" + help + Say Y here if you will be loading modules which provide support + for process notification. 
Examples of such modules include the + Linux Jobs module and the Linux Array Sessions module. If you will not + be using such modules, say N. + config SYSCTL bool "Sysctl support" ---help--- Index: linux/kernel/Makefile =================================================================== --- linux.orig/kernel/Makefile 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/Makefile 2005-09-30 15:25:34.490288423 -0500 @@ -20,6 +20,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_COMPAT) += compat.o +obj-$(CONFIG_PNOTIFY) += pnotify.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_IKCONFIG_PROC) += configs.o Index: linux/kernel/fork.c =================================================================== --- linux.orig/kernel/fork.c 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/fork.c 2005-09-30 15:54:50.502255817 -0500 @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -151,6 +152,9 @@ init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; init_task.signal->rlim[RLIMIT_SIGPENDING] = init_task.signal->rlim[RLIMIT_NPROC]; + + /* Initialize the pnotify list in pid 0 before it can clone itself. */ + INIT_PNOTIFY_LIST(current); } static struct task_struct *dup_task_struct(struct task_struct *orig) @@ -1039,6 +1043,15 @@ p->exit_state = 0; /* + * Call pnotify kernel module subscribers and add the same subscribers the + * parent has to the new process. + * Fail the fork on error. + */ + retval = pnotify_fork(p, current); + if (retval) + goto bad_fork_cleanup_namespace; + + /* * Ok, make it visible to the rest of the system. * We dont wake it up yet. */ @@ -1073,7 +1086,7 @@ if (sigismember(¤t->pending.signal, SIGKILL)) { write_unlock_irq(&tasklist_lock); retval = -EINTR; - goto bad_fork_cleanup_namespace; + goto bad_fork_cleanup_pnotify; } /* CLONE_PARENT re-uses the old parent */ @@ -1159,6 +1172,8 @@ return ERR_PTR(retval); return p; +bad_fork_cleanup_pnotify: + pnotify_exit(p); bad_fork_cleanup_namespace: exit_namespace(p); bad_fork_cleanup_keys: Index: linux/kernel/exit.c =================================================================== --- linux.orig/kernel/exit.c 2005-09-30 14:57:55.100142848 -0500 +++ linux/kernel/exit.c 2005-09-30 15:25:34.617228715 -0500 @@ -29,6 +29,7 @@ #include #include #include +#include #include #include @@ -866,6 +867,9 @@ module_put(tsk->binfmt->module); tsk->exit_code = code; + + pnotify_exit(tsk); + exit_notify(tsk); #ifdef CONFIG_NUMA mpol_free(tsk->mempolicy); Index: linux/kernel/pnotify.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/kernel/pnotify.c 2005-09-30 16:01:50.518408112 -0500 @@ -0,0 +1,568 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + */ + +#include +#include +#include +#include +#include +#include + +/* list of pnotify event list entries that reference the "module" + * implementations */ +static LIST_HEAD(pnotify_event_list); +static DECLARE_RWSEM(pnotify_event_list_sem); + + +/** + * pnotify_get_subscriber - get a pnotify subscriber given a search key + * @task: We examine the pnotify_subscriber_list from the given task + * @key: Key name of kernel module subscriber we wish to retrieve + * + * Given a pnotify_subscriber_list structure, this function will return + * a pointer to the kernel module pnotify_subsciber struct that matches the + * search key. If the key is not found, the function will return NULL. + * + * Locking: This is a pnotify_subscriber_list reader. This function should + * be called with at least a read lock on the pnotify_subscriber_list using + * down_read(&task->pnotify_subscriber_list_sem). + * + */ +struct pnotify_subscriber * +pnotify_get_subscriber(struct task_struct *task, char *key) +{ + struct pnotify_subscriber *subscriber; + + list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) { + if (!strcmp(subscriber->events->name,key)) + return subscriber; + } + return NULL; +} + + +/** + * pnotify_subscribe - Add kernel module to the subscriber list for process + * @task: Task that gets the new kernel module subscriber added to the list + * @events: pnotify_events structure to associate with kernel module + * + * Given a task and a pnotify_events structure, this function will allocate + * a new pnotify_subscriber, initialize the settings, and insert it into + * the pnotify_subscriber_list for the task. + * + * Locking: + * The caller for this function should hold at least a read lock on the + * pnotify_event_list_sem - or ensure that the pnotify_events entry cannot be + * removed. If this function was called from the pnotify module (usually the + * case), then the caller need not hold this lock because the event + * structure won't disappear until pnotify_unregister is called. + * + * This is a pnotify_subscriber_list WRITER. The caller must hold a write + * lock on for the tasks pnotify_subscriber_list_sem. This can be locked + * using down_write(&task->pnotify_subscriber_list_sem). + */ +struct pnotify_subscriber * +pnotify_subscribe(struct task_struct *task, struct pnotify_events *events) +{ + struct pnotify_subscriber *subscriber; + + subscriber = kmalloc(sizeof(struct pnotify_subscriber), GFP_KERNEL); + if (!subscriber) + return NULL; + + subscriber->events = events; + subscriber->data = NULL; + atomic_inc(&events->refcnt); /* Increase hook's reference count */ + list_add_tail(&subscriber->entry, &task->pnotify_subscriber_list); + return subscriber; +} + + +/** + * pnotify_unsubscribe - Remove kernel module subscriber from process + * @subscriber: The subscriber to remove + * + * This function will ensure the subscriber is deleted form + * the list of subscribers for the task. Finally, the memory for the + * subscriber is discarded. + * + * Prior to calling pnotify_unsubscribe, the subscriber should have been + * detached from any uses the kernel module may have. 
This is often done using + * p->events->exit(task, subscriber); + * + * Locking: + * This is a pnotify_subscriber_list WRITER. The caller of this function must + * hold a write lock on the pnotify_subscriber_list_sem for the task. This can + * be locked using down_write(&task->pnotify_subscriber_list_sem). Because + * events are referenced, the caller should ensure the events structure + * doesn't disappear. If the caller is a pnotify module, the events + * structure won't disappear until pnotify_unregister is called so it's safe + * not to lock the pnotify_event_list_sem. + * + * + */ +void +pnotify_unsubscribe(struct pnotify_subscriber *subscriber) +{ + atomic_dec(&subscriber->events->refcnt); /* dec events ref count */ + list_del(&subscriber->entry); + kfree(subscriber); +} + + +/** + * pnotify_get_events - Get the pnotify_events struct matching requested name + * @key: The name of the events structure to get + * + * Given a pnotify_events struct name that represents the kernel module name, + * this function will return a pointer to the pnotify_events structure that + * matches the name. + * + * Locking: + * You should hold either the write or read lock for pnotify_event_list_sem + * before using this function. This will ensure that the pnotify_event_list + * does not change while iterating through the list entries. + * + */ +static struct pnotify_events * +pnotify_get_events(char *key) +{ + struct pnotify_events *events; + + list_for_each_entry(events, &pnotify_event_list, entry) { + if (!strcmp(events->name, key)) { + return events; + } + } + return NULL; +} + +/** + * remove_subscriber_from_all_tasks - Remove subscribers for given events struct + * @events: pnotify_events struct for subscribers to remove + * + * Given a kernel module events struct registered with pnotify, + * this function will remove all subscribers matching the events struct from + * all tasks. + * + * If there is a exit function associated with the subscriber, it is called + * before the subscriber is unsubscribed/freed. + * + * This is meant to be used by pnotify_register and pnotify_unregister + * + * Locking: This is a pnotify_subscriber_list WRITER and this function + * handles locking of the pnotify_subscriber_list_sem so callers don't + * need to. + * + */ +static void +remove_subscriber_from_all_tasks(struct pnotify_events *events) +{ + if (events == NULL) + return; + + /* Because of internal race conditions we can't guarantee + * getting every task in just one pass so we just keep going + * until there are no tasks with subscribers from this events struct + * attached. The inefficiency of this should be tempered by the fact + * that this happens at most once for each registered client. + */ + while (atomic_read(&events->refcnt) != 0) { + struct task_struct *g = NULL, *p = NULL; + + read_lock(&tasklist_lock); + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, events->name); + if (subscriber != NULL) { + (void)events->exit(p, subscriber); + pnotify_unsubscribe(subscriber); + } + up_write(&p->pnotify_subscriber_list_sem); + read_lock(&tasklist_lock); + /* If a task exited while we were looping, its sibling + * list would be empty. In that case, we jump out of + * the do_each_thread and loop again in the outter + * while because the reference count probably isn't + * zero for the pnotify events yet. 
Doing it this way + * makes it so we don't hold the tasklist lock too + * long. + */ + + + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (task_exited) + goto endloop; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + } +} + +/** + * pnotify_register - Register a new module subscriber and enter it in the list + * @events_new: The new pnotify events structure to register. + * + * Used to register a new module subscriber pnotify_events structure and enter + * it into the pnotify_event_list. The service name for a pnotify_events + * struct is restricted to 32 characters. + * + * If an "init()" function is supplied in the events struct being registered + * then the kernel module will be subscribed to all existing tasks and the + * supplied "init()" function will be applied to it. If any call to the + * supplied "init()" function returns a non zero result, the registration will + * be aborted. As part of the abort process, all subscribers belonging to the + * new client will be removed from all tasks and the supplied "detach()" + * function will be called on them. + * + * If a memory error is encountered, the module (pnotify_events structure) + * is unregistered and any tasks we became subscribed to are detached. + * + * Locking: This function is an event list writer as well as a + * pnotify_subscriber_list writer. This function does the locks itself. + * Callers don't need to. + * + */ +int +pnotify_register(struct pnotify_events *events_new) +{ + struct pnotify_events *events = NULL; + + /* Add new pnotify module to access list */ + if (!events_new) + return -EINVAL; /* error */ + if (!list_empty(&events_new->entry)) + return -EINVAL; /* error */ + if (events_new->name == NULL || strlen(events_new->name) > + PNOTIFY_NAMELN) + return -EINVAL; /* error */ + if (!events_new->fork || !events_new->exit) + return -EINVAL; /* error */ + + /* Try to insert new events entry into the events list */ + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_new->name); + + if (events) { + up_write(&pnotify_event_list_sem); + printk(KERN_WARNING "Attempt to register duplicate" + " pnotify support (name=%s)\n", events_new->name); + return -EBUSY; + } + + /* Okay, we can insert into the events list */ + list_add_tail(&events_new->entry, &pnotify_event_list); + /* set the ref count to zero */ + atomic_set(&events_new->refcnt, 0); + + /* Now we can call the init function (if present) for each task */ + if (events_new->init != NULL) { + struct task_struct *g = NULL, *p = NULL; + int init_result = 0; + + /* Because of internal race conditions we can't guarantee + * getting every task in just one pass so we just keep going + * until we don't find any unitialized tasks. The inefficiency + * of this should be tempered by the fact that this happens + * at most once for each registered client. + */ + read_lock(&tasklist_lock); + repeat: + do_each_thread(g, p) { + struct pnotify_subscriber *subscriber; + int task_exited; + + get_task_struct(p); + read_unlock(&tasklist_lock); + down_write(&p->pnotify_subscriber_list_sem); + subscriber = pnotify_get_subscriber(p, + events_new->name); + if (!subscriber && !(p->flags & PF_EXITING)) { + subscriber = pnotify_subscribe(p, events_new); + if (subscriber != NULL) { + init_result = events_new->init(p, + subscriber); + + /* Success, but init function pointer + * doesn't want kernel module on the + * subscriber list. 
*/ + if (init_result > 0) { + pnotify_unsubscribe(subscriber); + } + } + else { + init_result = -ENOMEM; + } + } + up_write(&p->pnotify_subscriber_list_sem); + read_lock(&tasklist_lock); + /* Like in remove_subscriber_from_all_tasks, if the + * task disappeared on us while we were going through + * the for_each_thread loop, we need to start over + * with that loop. That's why we have the list_empty + * here */ + task_exited = list_empty(&p->sibling); + put_task_struct(p); + if (init_result < 0) + goto endloop; + if (task_exited) + goto repeat; + } while_each_thread(g, p); + endloop: + read_unlock(&tasklist_lock); + + /* + * if anything went wrong during initialisation abandon the + * registration process + */ + if (init_result < 0) { + remove_subscriber_from_all_tasks(events_new); + list_del_init(&events_new->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Registering pnotify support for" + " (name=%s) failed\n", events_new->name); + + return init_result; /* init function error result */ + } + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Registering pnotify support for (name=%s)\n", + events_new->name); + + return 0; /* success */ + +} + +/** + * pnotify_unregister - Unregister kernel module/pnotify_event struct + * @event_old: pnotify_event struct for the kernel module we're unregistering + * + * Used to unregister kernel module subscribers indicated by + * pnotify_events struct. Removes them from the list of kernel modules + * in pnotify_event_list. + * + * Once the events entry in the pnotify_event_list is found, subscribers for + * this kernel module have their exit functions called and will then be + * removed from the list. + * + * Locking: This functoin is a pnotify_event_list writer. It also calls + * remove_subscriber_from_all_tasks, which is a pnotify_subscriber_list + * writer. Callers don't need to hold these locks ahead of calling this + * function. + * + */ +int +pnotify_unregister(struct pnotify_events *events_old) +{ + struct pnotify_events *events; + + /* Check the validity of the arguments */ + if (!events_old) + return -EINVAL; /* error */ + if (list_empty(&events_old->entry)) + return -EINVAL; /* error */ + if (events_old->name == NULL) + return -EINVAL; /* error */ + + down_write(&pnotify_event_list_sem); + + events = pnotify_get_events(events_old->name); + + if (events && events == events_old) { + remove_subscriber_from_all_tasks(events); + list_del_init(&events->entry); + up_write(&pnotify_event_list_sem); + + printk(KERN_INFO "Unregistering pnotify support for" + " (name=%s)\n", events_old->name); + + return 0; /* success */ + } + + up_write(&pnotify_event_list_sem); + + printk(KERN_WARNING "Attempt to unregister pnotify support (name=%s)" + " failed - not found\n", events_old->name); + + return -EINVAL; /* error */ +} + + +/** + * __pnotify_fork - Add kernel module subscribe to same subscribers as parent + * @to_task: The child task that will inherit the parent's subscribers + * @from_task: The parent task + * + * Make it so a new task being constructed has the same kernel module + * subscribers of its parent. + * + * The "from" argument is the parent task. The "to" argument is the child + * task. + * + * See Documentation/pnotify.txt * for details on + * how to handle return codes from the attach function pointer. + * + * Locking: The to_task is currently in-construction, so we don't + * need to worry about write-locks. We do need to be sure the parent's + * subscriber list, which we copy here, doesn't go away on us. 
This function + * read-locks the pnotify_subscriber_list. Callers don't need to lock. + * + */ +int +__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task) +{ + struct pnotify_subscriber *from_subscriber; + struct pnotify_subscriber *subscriber, *subscribertmp; + int ret; + + /* lock the parent's subscriber list we are copying from */ + down_read(&from_task->pnotify_subscriber_list_sem); + + list_for_each_entry(from_subscriber, + &from_task->pnotify_subscriber_list, entry) { + struct pnotify_subscriber *to_subscriber = NULL; + + to_subscriber = pnotify_subscribe(to_task, + from_subscriber->events); + if (!to_subscriber) { + /* Failed to get memory. + * We don't force __pnotify_exit to run here because + * the child is in construction and not running yet; + * the subscriptions made so far are unwound below. + */ + ret = -ENOMEM; + goto fail_unwind; + } + ret = to_subscriber->events->fork(to_task, to_subscriber, + from_subscriber->data); + + if (ret < 0) { + /* Propagates to copy_process as a fork failure. + * __pnotify_exit isn't run because the child + * never got running, so exit doesn't make sense; + * the subscriptions made so far are unwound below. + */ + goto fail_unwind; /* Fork failure */ + } + else if (ret > 0) { + /* Success, but fork function pointer in the + * pnotify_events structure doesn't want the kernel + * module subscribed. This is an in-construction + * child so we don't need to write lock */ + pnotify_unsubscribe(to_subscriber); + } + } + + /* unlock parent's subscriber list */ + up_read(&from_task->pnotify_subscriber_list_sem); + + return 0; /* success */ + +fail_unwind: + /* Drop every subscriber already added to the in-construction child. + * The exit callbacks are intentionally not called and no write lock + * on the child's list is needed because the child never started + * running. + */ + list_for_each_entry_safe(subscriber, subscribertmp, + &to_task->pnotify_subscriber_list, entry) + pnotify_unsubscribe(subscriber); + up_read(&from_task->pnotify_subscriber_list_sem); + return ret; +} + +/** + * __pnotify_exit - Remove all subscribers from given task + * @task: Task to remove subscribers from + * + * For each subscriber for the given task, we run the function pointer + * for exit in the associated pnotify_events structure and then remove + * it from the task's subscriber list until all subscribers are gone. + * + * Locking: This is a pnotify_subscriber_list writer. This function + * write locks the pnotify_subscriber_list. Callers don't have to do their own + * locking. The exit function referenced by the pnotify_events structure is + * called with the pnotify_subscriber_list write lock held. + * + */ +void +__pnotify_exit(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + struct pnotify_subscriber *subscribertmp; + + /* Remove ref. to subscribers from task immediately */ + down_write(&task->pnotify_subscriber_list_sem); + + list_for_each_entry_safe(subscriber, subscribertmp, + &task->pnotify_subscriber_list, entry) { + subscriber->events->exit(task, subscriber); + pnotify_unsubscribe(subscriber); + } + + up_write(&task->pnotify_subscriber_list_sem); +} + + +/** + * __pnotify_exec - Execute exec callback for each subscriber in this task + * @task: We go through the subscriber list in the given task + * + * Called when a process that has a subscriber list does an exec. + * The exec pointer in the events structure is optional. + * + * Locking: This is a pnotify_subscriber_list reader and implements the + * read locks itself. Callers don't need to do their own locking. The + * pnotify_events referenced exec function pointer is called in an + * environment where the pnotify_subscriber_list is read locked.
+ * + */ +int +__pnotify_exec(struct task_struct *task) +{ + struct pnotify_subscriber *subscriber; + + down_read(&task->pnotify_subscriber_list_sem); + + list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) { + if (subscriber->events->exec) /* exec funct. ptr is optional */ + subscriber->events->exec(task, subscriber); + } + + up_read(&task->pnotify_subscriber_list_sem); + return 0; +} + + +EXPORT_SYMBOL_GPL(pnotify_get_subscriber); +EXPORT_SYMBOL_GPL(pnotify_subscribe); +EXPORT_SYMBOL_GPL(pnotify_unsubscribe); +EXPORT_SYMBOL_GPL(pnotify_register); +EXPORT_SYMBOL_GPL(pnotify_unregister); Index: linux/include/linux/pnotify.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/include/linux/pnotify.h 2005-09-30 14:57:57.651642867 -0500 @@ -0,0 +1,227 @@ +/* + * Process Notification (pnotify) interface + * + * + * Copyright (c) 2000-2002, 2004-2005 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * + * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +/* + * Data structure definitions and function prototypes used to implement + * process notification (pnotify). + * + * pnotify provides a method (service) for kernel modules to be notified when + * certain events happen in the life of a process. It also provides a + * data pointer that is associated with a given process. See + * Documentation/pnotify.txt for a full description. + */ + +#ifndef _LINUX_PNOTIFY_H +#define _LINUX_PNOTIFY_H + +#include + +#ifdef CONFIG_PNOTIFY + +#define PNOTIFY_NAMELN 32 /* Max chars in PNOTIFY kernel module name */ + +#define PNOTIFY_ERROR -1 /* Error. Fork fail for pnotify_fork */ +#define PNOTIFY_OK 0 /* All is well, stay subscribed */ +#define PNOTIFY_NOSUB 1 /* All is well but don't subscribe module + * to subscriber list for the process */ + + +/** + * INIT_PNOTIFY_LIST - init a pnotify subscriber list struct after declaration + * @_l: Task struct to init the pnotify_module_subscriber_list and semaphore + * + */ +#define INIT_PNOTIFY_LIST(_l) \ +do { \ + INIT_LIST_HEAD(&(_l)->pnotify_subscriber_list); \ + init_rwsem(&(_l)->pnotify_subscriber_list_sem); \ +} while(0) + +/* + * Used by task_struct to manage list of subscriber kernel modules for the + * process. Each pnotify_subscriber provides the link between the process + * and the correct kernel module subscriber. + * + * STRUCT MEMBERS: + * pnotify_events: events: Reference to pnotify_events structure, which + * holds the name key and function pointers. + * data: Opaque data pointer - defined by pnotify kernel modules. 
+ * entry: List pointers + */ +struct pnotify_subscriber { + struct pnotify_events *events; + void *data; + struct list_head entry; +}; + +/* + * Used by pnotify modules to define the callback functions into the + * module. See Documentation/pnotify.txt for details. + * + * STRUCT MEMBERS: + * name: The name of the pnotify container type provided by + * the module. This will be set by the pnotify module. + * fork: Function pointer to function used when associating + * a forked process with a kernel module referenced by + * this struct. pnotify.txt will provide details on + * special return codes interpreted by pnotify. + * + * exit: Function pointer to function used when a process + * associated with the kernel module owning this struct + * exits. + * + * init: Function pointer to initialization function. This + * function is used when the module registers with pnotify + * to associate existing processes with the referring + * kernel module. This is optional and may be set to NULL + * if it is not needed by the pnotify kernel module. + * + * Note: The return values are managed the same way as in + * attach above. Except, of course, an error doesn't + * result in a fork failure. + * + * Note: The implementation of pnotify_register causes + * us to evaluate some tasks more than once in some cases. + * See the comments in pnotify_register for why. + * Therefore, if the init function pointer returns + * PNOTIFY_NOSUB, which means that it doesn't want this + * process associated with the kernel module, that init + * function must be prepared to possibly look at the same + * "skipped" task more than once. + * + * data: Opaque data pointer - defined by pnotify modules. + * module: Pointer to kernel module struct. Used to increment & + * decrement the use count for the module. + * entry: List pointers + * exec: Function pointer to function used when a process + * this kernel module is subscribed to execs. This + * is optional and may be set to NULL if it is not + * needed by the pnotify module. 
+ * refcnt: Keep track of user count of pnotify_events + */ +struct pnotify_events { + struct module *module; + char *name; /* Name Key - restricted to 32 chars */ + void *data; /* Opaque module specific data */ + struct list_head entry; /* List pointers */ + atomic_t refcnt; /* usage counter */ + int (*init)(struct task_struct *, struct pnotify_subscriber *); + int (*fork)(struct task_struct *, struct pnotify_subscriber *, void*); + void (*exit)(struct task_struct *, struct pnotify_subscriber *); + void (*exec)(struct task_struct *, struct pnotify_subscriber *); +}; + + +/* Kernel service functions for providing pnotify support */ +extern struct pnotify_subscriber *pnotify_get_subscriber(struct task_struct + *task, char *key); +extern struct pnotify_subscriber *pnotify_subscribe(struct task_struct *task, + struct pnotify_events *pt); +extern void pnotify_unsubscribe(struct pnotify_subscriber *subscriber); +extern int pnotify_register(struct pnotify_events *pt_new); +extern int pnotify_unregister(struct pnotify_events *pt_old); +extern int __pnotify_fork(struct task_struct *to_task, + struct task_struct *from_task); +extern void __pnotify_exit(struct task_struct *task); +extern int __pnotify_exec(struct task_struct *task); + +/** + * pnotify_fork - child inherits subscriber list associations of its parent + * @child: child task - to inherit + * @parent: parenet task - child inherits subscriber list from this parent + * + * function used when a child process must inherit subscriber list assocation + * from the parent. Return code is propagated as a fork fail. + * + */ +static inline int pnotify_fork(struct task_struct *child, + struct task_struct *parent) +{ + INIT_PNOTIFY_LIST(child); + if (!list_empty(&parent->pnotify_subscriber_list)) + return __pnotify_fork(child, parent); + + return 0; +} + + +/** + * pnotify_exit - Detach subscriber kernel modules from this process + * @task: The task the subscribers will be detached from + * + */ +static inline void pnotify_exit(struct task_struct *task) +{ + if (!list_empty(&task->pnotify_subscriber_list)) + __pnotify_exit(task); +} + +/** + * pnotify_exec - Used when a process exec's + * @task: The process doing the exec + * + */ +static inline void pnotify_exec(struct task_struct *task) +{ + if (!list_empty(&task->pnotify_subscriber_list)) + __pnotify_exec(task); +} + +/** + * INIT_TASK_PNOTIFY - Used in INIT_TASK to set head and sem of subscriber list + * @tsk: The task work with + * + * Marco Used in INIT_TASK to set the head and sem of pnotify_subscriber_list + * If CONFIG_PNOTIFY is off, it is defined as an empty macro below. + * + */ +#define INIT_TASK_PNOTIFY(tsk) \ + .pnotify_subscriber_list = LIST_HEAD_INIT(tsk.pnotify_subscriber_list),\ + .pnotify_subscriber_list_sem = \ + __RWSEM_INITIALIZER(tsk.pnotify_subscriber_list_sem), + +#else /* CONFIG_PNOTIFY */ + +/* + * Replacement macros used when pnotify (Process Notification) support is not + * compiled into the kernel. 
+ */ +#define INIT_TASK_PNOTIFY(tsk) +#define INIT_PNOTIFY_LIST(l) do { } while(0) +#define pnotify_fork(ct, pt) ({ 0; }) +#define pnotify_exit(t) do { } while(0) +#define pnotify_exec(t) do { } while(0) +#define pnotify_unsubscribe(t) do { } while(0) + +#endif /* CONFIG_PNOTIFY */ + +#endif /* _LINUX_PNOTIFY_H */ Index: linux/Documentation/pnotify.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/Documentation/pnotify.txt 2005-09-30 14:57:57.655548722 -0500 @@ -0,0 +1,368 @@ +Process Notification (pnotify) +-------------------- +pnotify provides a method (service) for kernel modules to be notified when +certain events happen in the life of a process. Events we support include +fork, exit, and exec. A special init event is also supported (see events +below). More events could be added. pnotify also provides a generic data +pointer for the modules to work with so that data can be associated per +process. + +A kernel module registers a service request (a pnotify_events structure) +describing the events it cares about by calling pnotify_register. The request +tells pnotify which notifications the kernel module wants. The kernel module +passes along function pointers to be called for these events (exit, fork, exec) +in the pnotify_events service request. + +From the process point of view, each process has a kernel module subscriber +list (pnotify_subscriber_list). These kernel modules are the ones who want +notification about the life of the process. As described above, each kernel +module subscriber on the list has a generic data pointer to point to data +associated with the process. + +In the case of fork, pnotify will allocate the same kernel module subscriber +list for the new child that existed for the parent. The kernel module's +function pointer for fork is also called for the child being constructed so +the kernel module can do whatever it needs to do when a parent forks this +child. Special return values apply to the fork and init events that do not +apply to the others. They are described in the fork and init examples below. + +For exit, similar things happen but the exit function pointer for each +kernel module subscriber is called and the kernel module subscriber entry for +that process is deleted. + + +Events +------ +Events are stages of a process's life that kernel modules care about. The +fork event is triggered in a certain location in copy_process when a parent +forks. The exit event happens when a process is going away. We also support +an exec event, which happens when a process execs. Finally, there is an init +event. This special event makes it so the kernel module will be associated +with all current processes in the system at the time of registration. This is +used when a kernel module wants to keep track of all current processes as +opposed to just those it associates by itself (and children that follow). The +events a kernel module cares about are set up in the pnotify_events +structure - see usage below. + +When setting up a pnotify_events, you designate which events you care about +by either associating NULL (meaning you don't care about that event) or a +pointer to the function to run when the event is triggered. The fork +and exit events are currently required. + + +How do processes become associated with kernel modules?
+------------------------------------------------------- +Your kernel module itself can use the pnotify_subscribe function to associate +a given process with a given pnotify_events structure. This adds +your kernel module to the subscriber list of the process. In the case +of inescapable job containers making use of PAM, when PAM allows a person to +log in, PAM contacts job (via a PAM job module which uses the job userland +library) and the kernel Job code will call pnotify_subscribe to associate the +process with pnotify. From that point on, the kernel module will be notified +about events in the process's life that the module cares about (as well, +as any children that process may later have). + +Likewise, your kernel module can remove an association between it and +a given process by using pnotify_unsubscribe. + + +Example Usage +------------- + +=== filling out the pnotify_events structure === + +A kernel module wishing to use pnotify needs to set up a pnotify_events +structure. This structure tells pnotify which events you care about and what +functions to call when those events are triggered. In addition, you supply a +name (usually the kernel module name). The entry is always filled out as +shown below. .module is usually set to THIS_MODULE. data can be optionally +used to store a pointer with the pnotify_events structure. + +Example of a filled out pnotify_events: + +static struct pnotify_events pnotify_events = { + .module = THIS_MODULE, + .name = "test_module", + .data = NULL, + .entry = LIST_HEAD_INIT(pnotify_events.entry), + .init = test_init, + .fork = test_attach, + .exit = test_detach, + .exec = test_exec, +}; + +The above pnotify_events structure says the kernel module "test_module" cares +about events fork, exit, exec, and init. In fork, call the kernel module's +test_attach function. In exec, call test_exec. In exit, call test_detach. +The init event is specified, so all processes on the system will be associated +with this kernel module during registration and the test_init function will +be run for each. + + +=== Registering with pnotify === + +You will likely register with pnotify in your kernel module's module_init +function. Here is an example: + +static int __init test_module_init(void) +{ + int rc = pnotify_register(&pnotify_events); + if (rc < 0) { + return -1; + } + + return 0; +} + + +=== Example init event function ==== + +Since the init event is defined, it means this kernel module is added +to the subscriber list of all processes -- it will receive notification +about events it cares about for all processes and all children that +follow. + +Of course, if a kernel module doesn't need to know about all current +processes, that module shouldn't implement this and '.init' in the +pnotify_events structure would be NULL. + +This is as opposed to the normal method where the kernel module adds itself +to the subscriber list of a process using pnotify_subscribe. + +Important: +Note: The implementation of pnotify_register causes us to evaluate some tasks +more than once in some cases. See the comments in pnotify_register for why. +Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means +that it doesn't want a process association, that init function must be +prepared to possibly look at the same "skipped" task more than once. + +Note that the return value here is similar to the fork function pointer +below except there is no notion of failing the fork since existing processes +aren't forking. 
+ +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + +static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + if (pnotify_get_subscriber(tsk, "test_module") == NULL) + dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid); + + dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid); + atomic_inc(&init_count); + return 0; +} + + +=== Example fork (test_attach) function === + +This function is executed when a process forks - this is associated +with the pnotify_callout callout in copy_process. There would be a very +similar test_detach function (not shown). + +pnotify will add the kernel module to the notification list for the child +process automatically and then execute this fork function pointer (test_attach +in this example). However, the kernel module can control whether the kernel +module stays on the process's subscriber list and wants notification by the +return value. + +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + + +static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp) +{ + dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid); + atomic_inc(&attach_count); + + return PNOTIFY_OK; +} + + +=== Example exec event function === + +And here is an example function to run when a task gets to exec. So any +time a "tracked" process gets to exec, this would execute. + +static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber) +{ + dprintk("pnotify exec hook fired for PID %d\n", tsk->pid); + atomic_inc(&exec_count); +} + + +=== Unregistering with pnotify === + +You will likely wish to unregister with pnotify in the kernel module's +module_exit function. Here is an example: + +static void __exit test_module_cleanup(void) +{ + pnotify_unregister(&pnotify_events); + printk("detach called %d times...\n", atomic_read(&detach_count)); + printk("attach called %d times...\n", atomic_read(&attach_count)); + printk("init called %d times...\n", atomic_read(&init_count)); + printk("exec called %d times ...\n", atomic_read(&exec_count)); + if (atomic_read(&attach_count) + atomic_read(&init_count) != + atomic_read(&detach_count)) + printk("pnotify PROBLEM: attach count + init count SHOULD equal detach cound and doesn't\n"); + else + printk("Good - attach count + init count equals detach count.\n"); +} + + + +=== Actually using data associated with the process in your module === + +The above examples show you how to create an example kernel module using +pnotify, but they didn't show what you might do with the data pointer +associated with a given process. Below, find an example of accessing +the data pointer for a given process from within a kernel module making use +of pnotify. + +pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given +process and kernel module. Like this: + +subscriber = pnotify_get_subscriber(task, name); + +Where name is your kernel module's name (as provided in the pnotify_events +structure) and task is the process you're interested +in. + +Please be careful about locking. The task structure has a +pnotify_subscriber_list_sem to be used for locking. 
This example retrieves +a given task in a way that ensures it doesn't disappear while we try to +access it (that's why we do locking for the tasklist_lock and take a reference +on the task). The +pnotify subscriber list is locked to ensure the list doesn't change as we +search it with pnotify_get_subscriber. + + read_lock(&tasklist_lock); + get_task_struct(task); /* Ensure the task doesn't vanish on us */ + read_unlock(&tasklist_lock); /* Unlock the tasklist */ + down_read(&task->pnotify_subscriber_list_sem); /* readlock subscriber list */ + + subscriber = pnotify_get_subscriber(task, name); + if (subscriber) { + /* Get the widgitId associated with this task */ + widgitId = ((widgitId_t *)subscriber->data); + } + up_read(&task->pnotify_subscriber_list_sem); /* unlock subscriber list */ + put_task_struct(task); /* Done accessing the task */ + + +Future Events +------------- +Kingsley Cheung suggested that we add events for uid and gid changes and this +may inspire broader use. Depending on how the discussion goes, I'll post a +patch to add this functionality in the next day or two. + +History +------- +Process Notification used to be known as PAGG (Process Aggregates). +It was re-written to be called Process Notification because we believe this +better describes its purpose. Structures and functions were re-named to +be more clear and to reflect the new name. + + +Why Not Notifier Lists? +----------------------- +We investigated the use of notifier lists, available in newer kernels. + +Notifier lists would not be as efficient as pnotify for kernel modules +wishing to associate data with processes. With pnotify, if the +pnotify_subscriber_list of a given task is empty, we can instantly know +there are no kernel modules that care about the process. Further, the +callbacks happen in places where the task struct is likely to be cached. +So this is a quick operation. With notifier lists, the scope is system +wide rather than per process. As long as one kernel module wants to be +notified, we have to walk the notifier list and potentially waste cycles. +In the case of pnotify, we only walk lists if we're interested in +a specific task. + +On a system where pnotify is used to track only a few processes, the +overhead of walking the notifier list is high compared to the overhead +of walking the kernel module subscriber list only when a kernel module +is interested in a given process. + +I don't believe this is easily solved in notifier lists themselves as +they are meant to be global resources, not per-task resources. + +Overlooking performance issues, notifier lists in and of themselves wouldn't +solve the problem pnotify solves anyway. Although you could argue notifier +lists can implement the callback portion of pnotify, there is no association +of data with a given process. This is needed for kernel modules to +efficiently associate a task with a data pointer without cluttering up +the task struct. + +In addition to data associated with a process, we desire the ability for +kernel modules to add themselves to the subscriber list for any arbitrary +process - not just current or a child of current (see the sketch at the end +of this mail). + + +Some Justification +------------------ +We feel that pnotify could be used to reduce the size of the task struct or +the number of functions in copy_process. For example, if another part of the +kernel needs to know when a process is forking or exiting, they could use +pnotify instead of adding additional code to task struct, copy_process, or +exit.
+ +Some have argued that PAGG in the past shouldn't be used because it will +allow interesting things to be implemented outside of the kernel. While this +might be a small risk, having these in place allows customers and users to +implement kernel components that you don't want to see in the kernel anyway. + +For example, a certain vendor may have an urgent need to implement kernel +functionality or special types of accounting that nobody else is interested +in. That doesn't mean the code isn't open-source, it just means it isn't +applicable to all of Linux because it satisfies a niche. + +All of pnotify's functionality that needs to be exported is exported with +EXPORT_SYMBOL_GPL to discourage abuse. + +The risk already exists in the kernel for people to implement modules outside +the kernel that suffer from less peer review and possibly bad programming +practice. pnotify could add more oppurtunities for out-of-tree kernel module +authors to make new modules. I believe this is somewhat mitigated by the +already-existing 'tainted' warnings in the kernel. + +Other Ideas? +------------ +There have been similar proposals to provide pieces of the pnotify +functionality. If there is a better proposal out there, let's explore it. +Here are some key functions I hope to see in any proposal: + + - Ability to have notification for exec, fork, exit at minimum + - Ability to extend to other callouts later (such as uid/gid changes as + I described earlier) + - Ability for pnotify user modules to implement code that ends up adding + a kernel module subscriber to any arbitrary process (not just current and + its children). + +I believe, if the above are more or less met, we should be in good shape for +our other open source projects such as linux job. + +Variable Name Changes from PAGG to pnotify +------------------------------------------ +PAGG_NAMELEN -> PNOTIFY_NAMELEN +struct pagg -> pnotify_subscriber +pagg_get -> pnotify_get_subscriber +pagg_alloc -> pnotify_subscribe +pagg_free -> pnotify_unsubscribe +pagg_hook_register -> pnotify_register +pagg_hook_unregister -> pnotify_unregister +pagg_attach -> pnotify_fork +pagg_detach -> pnotify_exit +pagg_exec -> pnotify_exec +struct pagg_hook -> pnotify_events + +With pnotify_events (formerly pagg_hook): + attach -> fork + detach -> exit + +Return codes for the init and fork function pointers should use: +PNOTIFY_ERROR - prevent the process from continuing - failing the fork +PNOTIFY_OK - good, adds the kernel module to the subscriber list for process +PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process + -- Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota
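Appendix (illustrative only, not part of the patch above): pnotify.txt describes how a module can subscribe itself to an arbitrary, already-running process and which locks that requires, but it does not show the whole sequence in one place. Here is a minimal sketch under those assumptions. The names my_events and my_track_task are invented; my_events is assumed to be fully filled out and registered as in the pnotify.txt examples, and the caller is assumed to have found a valid task_struct (for example via find_task_by_pid under tasklist_lock) and to hold a reference on it:

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/pnotify.h>

static struct pnotify_events my_events;	/* filled out and registered elsewhere */

/*
 * Subscribe this module to an arbitrary task and attach per-task data.
 * The caller holds a reference on 'task' (get_task_struct), so the task
 * cannot vanish while we hold its subscriber list semaphore.  Because
 * this module owns my_events, the events structure cannot be
 * unregistered out from under us, which is what the pnotify_subscribe()
 * locking rules ask for.  A real caller might also skip tasks with
 * PF_EXITING set, as pnotify_register() does.
 */
static int my_track_task(struct task_struct *task, void *data)
{
	struct pnotify_subscriber *sub;
	int ret = 0;

	down_write(&task->pnotify_subscriber_list_sem);
	sub = pnotify_get_subscriber(task, my_events.name);
	if (!sub)
		sub = pnotify_subscribe(task, &my_events);
	if (sub)
		sub->data = data;	/* module-defined per-task data */
	else
		ret = -ENOMEM;
	up_write(&task->pnotify_subscriber_list_sem);

	return ret;
}

This is the path a job-style module would take when, for example, PAM asks it to start tracking an existing login shell; from then on the fork, exec, and exit callouts cover that process and its descendants.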