From owner-LinuxScalability@oss.sgi.com Fri Aug 4 09:05:22 2000 Received: by oss.sgi.com id ; Fri, 4 Aug 2000 09:05:12 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:12622 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Fri, 4 Aug 2000 09:04:49 -0700 Received: from cthulhu.engr.sgi.com (cthulhu.engr.sgi.com [192.26.80.2]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id JAA05212 for ; Fri, 4 Aug 2000 09:10:16 -0700 (PDT) mail_from (qarce@pc-griffin2.engr.sgi.com) From: qarce@pc-griffin2.engr.sgi.com Received: from pc-griffin2.engr.sgi.com (pc-griffin2.engr.sgi.com [163.154.5.74]) by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id JAA74079 for ; Fri, 4 Aug 2000 09:04:04 -0700 (PDT) mail_from (qarce@pc-griffin2.engr.sgi.com) Received: from localhost (qarce@localhost) by pc-griffin2.engr.sgi.com (8.9.3/8.9.3/SuSE Linux 8.9.3-0.1) with ESMTP id JAA17501 for ; Fri, 4 Aug 2000 09:00:56 -0700 Date: Fri, 4 Aug 2000 09:00:56 -0700 (PDT) To: linux-scalability@oss.sgi.com Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing testing -------------------- Quentin Arce Linux Infrastructure qarce@engr.sgi.com (650) 993 - 3771 From owner-LinuxScalability@oss.sgi.com Mon Aug 7 22:32:22 2000 Received: by oss.sgi.com id ; Mon, 7 Aug 2000 22:32:12 -0700 Received: from chmls06.mediaone.net ([24.147.1.144]:65190 "EHLO chmls06.mediaone.net") by oss.sgi.com with ESMTP id ; Mon, 7 Aug 2000 22:31:41 -0700 Received: from pianoman.cluster.toy (h00001c600ed5.ne.mediaone.net [24.147.29.131]) by chmls06.mediaone.net (8.8.7/8.8.7) with ESMTP id BAA24128 for ; Tue, 8 Aug 2000 01:31:08 -0400 (EDT) Date: Tue, 8 Aug 2000 01:31:32 -0400 (EDT) From: PianoMan X-Sender: clemej@pianoman.cluster.toy To: linux-scalability@oss.sgi.com Subject: [RFC] Adding the notion of System vs. 
Application processors to Linux Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing This is a compilation of some ideas I've been kicking around and have started to implement. It is not a win in all cases.. please listen to my arguments carefully, and all comments are welcome. Some issues have already been raised by private conversations with people on this list, and I hope to address them here. ---- As Linux scales to larger and larger machines, it is becoming more of a force in the technical and high performance computing areas. These areas generate a workload that is generally CPU bound, as opposed to the I/O bound loads often generated by large scale servers for E-Commerce and the like. These programs want every last possible cycle of CPU power available to them, because they often take hours, days, or weeks to run.. and even a 1 or 2% performance improvement could save hours of computing time. The problem is that Linux is optimised as a general purpose OS, and does not contain very many ways of tuning for this particular case... The problems: (correct me if my view of reality is wrong, please) - The standard Linux scheduler uses a modified Round Robin approach with quanta. The scheduler schedules processes to run on all the CPUs, and then when a small period of time, or "quantum", expires, the scheduler interrupts the running process and sees if anything else needs to get run. Even if your CPU bound process is the only one running, it will still be subject to this quantum interruption, losing a few cycles each time. Also, while a process is weighted towards running on the same CPU, it is not guaranteed to always be scheduled on the same CPU, thus losing any cache it had already warmed in the first processor. (This may have changed with some recent patches from SGI).
Finally, on a multi-user system where more processes are running than there are processors available, the scheduler will schedule other tasks on the current processor, even if your job is "more important". - Interrupts are by default turned on and handled by all processors.. as well as tasklets, bottom halves, and soft IRQs. Each of these will interrupt the flow of your program and pollute your current local cache. (This too turns out to be mostly tunable from the new /proc/irq/#/smp_affinity file in later kernels). - System, as well as user processes can be scheduled on any cpu at any time. While you may be able to guarantee that your process will always use one processor, there is no guarantee to keep other processes from using that processor either. Even on an otherwise idle system, kernel threads can get scheduled on your CPUs. My proposed solution: Add to Linux's flexibility by allowing the administrator of a large machine to declare a certain number of processors on a large system to be "application" processors, and the others to be "system" processors. When Linux boots, it will run on the system processors, assigning user space programs and IRQs and tasklets and system threads to ONLY the system processors. The processors which are designated to be application processors are spun up but NEVER ASSIGNED ANY TASKS, and all (maybe not timer/fpu?) interrupts are disabled. Then modify the scheduler to add a new scheduling class called "SCHED_BATCH", and any processes that are started with "SCHED_BATCH" scheduling would enter a special simple batch scheduler which would hand the task to an application CPU and let it run until it is complete, then run the next SCHED_BATCH process in the queue. If no application processors are available, we can either queue the task to run when the current tasks are complete, or turn the task back over to the system CPU pool to be run as a regular process. (does this answer your question Dimitris?
get rid of the quanta by having a separate scheduler for those CPUs) Essentially, this turns a configurable number of processors into special purpose "co-processors". The same effect could be accomplished by building a PCI card and putting a few PPC chips on it, and telling Linux about it... except that these processors can access all the resources regular Linux processes can, without any penalty. To put it another way, you're turning the machine into two separate machines with unique characteristics, sharing memory and disk; one a general purpose multi-user front end, and the other a CPU intensive farm as a backend. As an example, let's take an 8 CPU server and partition it into 4 system processors and 4 application processors. The system processors would be governed by the standard scheduler, and the 4 application processors would be waiting there for a task. Say a user then wanted to run 7 copies of SETI@Home. They could mark the executable to use the SCHED_BATCH scheduler (maybe use a user space program to modify a field in the ELF header?). The general scheduler sees these 7 tasks are a special case and sends them off to the batch scheduler, which assigns 4 SETI processes to the four application processors. These processes now proceed as fast as the hardware will allow them to, without interruption from the OS, until they are complete (or killed). Upon seeing that all application processors are in use, the scheduler would then change the remaining SETI processes to SCHED_RR and send them back to the normal scheduler to be run on the user processors and fight for their time as they would have before. (does this answer your question Kanoj? we're not limiting the # of CPUs available to the process, we're making a few of the available processors "better".)
The other benefit is that if someone logs in and runs 20 copies of the CPU-Eater background program (see IRIX, circa 1994 :-), the 4 application processors will not be affected at all, and will happily go along doing their useful work. The SETI@Home clients running on the application processors would be close to the theoretical limits of the hardware in speed, and thus run faster than the ones in the user pool. If you switch SETI@home with your favorite CPU intensive single threaded or multi-threaded program, then you have a winning situation. This is by no means a win in all situations. Firstly, one needs a large number of CPUs, I would guess NCPUS >= 8. Secondly, if the OS needs 8 CPUs to keep up with the system load, then taking a few CPUs out of the general OS pool may help the CPU intensive tasks but may very well hurt the rest of the system. This is also something that I feel should NEVER be merged into the main Linux kernel. This would be a patch to handle a specific, special purpose goal, and would go against the "optimise for the general case" philosophy of Linux. However, I do feel that this scheme should be maintained as a patch outside of the Linux tree so that it would be available for those who wished to take advantage of it. There are still a lot of details to be hammered out, and to be quite honest, I'm not sure this is a win in ANY situation.. but that's why I'm in the middle of coding it up to show everyone... It seems to me this is very indicative of the way that Cray does things on its large T3x series. (Of course, there it's hundreds of processors, not tens.).. But I feel that in certain situations this could be a big win, especially in computation servers which require some interactive "traditional" server capabilities but also need to provide hard core CPU crunching. It could also be very useful for real-time applications, as Kanoj pointed out to me... however I have not examined this issue.
Here are some possible issues I myself could see being a problem: - Tasks on the application processors that block waiting for I/O can't do any useful work. (doesn't pollute cache, but seems somewhat wasteful) - I've not talked about gang scheduling on the application CPUs... could be a big win, but I don't know how to handle the case of threads of the same program on application processors and user processors at the same time. - Should I disable the timer on the application CPUs? - the simple batch scheduler could be used as a denial-of-service (does anyone really care?) - If implemented correctly, then there should be little impact on the user space side of things. An extra compare in the scheduler, and a creative config option. If not desired, then the option can be configured out and not used at compile time at all. - It seems kludgey, but a hell of a lot more elegant than running DOS. Any comments/flames/whatever would be greatly appreciated. I will be working on a patch that implements this on ia32 (the only SMP machine I have).. let's see if I can get some real numbers to back up my claims. or you can just go tell me I'm insane. and I know my spelling's atrocious, but it's also 1:30am.. I need some sleep. john.c - -- John Clemens RPI Computer Engineering 2000 clemej@alum.rpi.edu http://pianoman.penguinpowered.com/ "I Hate Quotes" -- Samuel L.
Clemens From owner-LinuxScalability@oss.sgi.com Tue Aug 8 12:57:16 2000 Received: by oss.sgi.com id ; Tue, 8 Aug 2000 12:56:57 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:39261 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 8 Aug 2000 12:56:38 -0700 Received: from cthulhu.engr.sgi.com (gate3-relay.engr.sgi.com [130.62.1.234]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id NAA07957 for ; Tue, 8 Aug 2000 13:02:09 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: from darkside.engr.sgi.com (darkside.engr.sgi.com [163.154.5.83]) by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id MAA07489; Tue, 8 Aug 2000 12:55:51 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: (from dimitris@localhost) by darkside.engr.sgi.com (8.9.3/8.8.7) id MAA04034; Tue, 8 Aug 2000 12:55:51 -0700 Message-ID: X-Mailer: XFMail 1.4.4 on Linux X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Tue, 08 Aug 2000 12:55:50 -0700 (PDT) Organization: SGI From: Dimitris Michailidis To: PianoMan Subject: RE: [RFC] Adding the notion of System vs. Application processors Cc: linux-scalability@oss.sgi.com Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing On 08-Aug-2000 PianoMan wrote: > - System, as well as user processes can be scheduled on any cpu at any > time. While you may be able to guarantee that your process will always > use one processor, there is no guarantee to keep other processes from > using that processor either. Even on an otherwise idle system, kernel > threads can get scheduled on your CPUs. Only if you allow them to. The pinning mechanism in Linux is a lot more flexible than the one in IRIX (and a lot simpler). By being able to pin to arbitrary sets of CPUs you can do many interesting things. 
You want kernel threads to run only on some of the CPUs? Fine, pin them to those CPUs. You have 8 CPUs and you want to designate CPUs 4-7 as application processors? You could add a boot option to pass a mask to the kernel and the kernel would set init_task.cpus_allowed to this mask. Boot with mask=0xf and then nothing would run on CPUs 4-7 unless explicitly requested by the admin. > My proposed solution: > > Add to Linux's flexibility by allowing the administrator of a > large machine to declare a certain number of processors on a large system > to be "application" processors, and the others to be "system" processors. Can be done with a boot option or later while the system is running. Small amount of code required on top of existing infrastructure. The boot option itself is a handful of lines. > When Linux boots, it will run on the system processors, assigning user space > programs and IRQs and tasklets and system threads to ONLY the system > processors. The processors which are designated to be application > processors are spun up but NEVER ASSIGNED ANY TASKS, and all (maybe not > timer/fpu?) interrupts are disabled. Don't turn off the timer interrupt. It would mess up time accounting. We also need it for slab cache management. IPIs must also be left on. > Then modify the scheduler to add a new > scheduling class called "SCHED_BATCH", and any processes that are started > with "SCHED_BATCH" scheduling would enter a special simple batch scheduler > which would hand the task to an application CPU and let it run until it is > complete, then run the next SCHED_BATCH process in the queue. If no > application processors are available, we can either queue the task to run > when the current tasks are complete, or turn the task back over to the > system CPU pool to be run as a regular process. (does this answer > your question Dimitris?
get rid of the quanta by having a separate > scheduler for those CPUs) I believe that all of this can be done in user space without any scheduler changes. To avoid quanta you can make the process SCHED_FIFO. So have a user level application-processor-administration thingy to which you submit your processes. If it finds a free AP it can pin a process to it and make it SCHED_FIFO to avoid quanta and that's all. If you're careful and don't oversubscribe the APs you don't even need to pin, processor affinity will do it for free. And if your process will be blocking often and you don't want to let the CPU idle, you could assign a second process to the same CPU at a lower SCHED_FIFO priority and it would get to run whenever the main thread is blocked. > As an example, let's take an 8 CPU server and partition it into 4 > system processors and 4 application processors. The system processors > would be governed by the standard scheduler, and the 4 application > processors would be waiting there for a task. Say a user then wanted to > run 7 copies of SETI@Home. > ... The only thing you need from the kernel that isn't there today is a pinning system call (coming). The rest belongs to user space IMO. > This is by no means a win in all situations. Firstly, one needs a > large number of CPUs, I would guess NCPUS >= 8. It's easy to make it configurable at run time. > - Should I disable the timer on the application CPUs? No. > - the simple batch scheduler could be used as a denial-of-service > (does anyone really care?) No. > - If implemented correctly, then there should be little impact on > the user space side of things. An extra compare in the > scheduler, and a creative config option. If not desired, then > the option can be configured out and not used at compile time at > all. I don't see that the kernel needs to be aware of the partitioning at all, at least as far as the scheduling component is concerned. > - It seems kludgey, but a hell of a lot more elegant than running > DOS.
If you do it in user space it won't be. It can be quite elegant actually. > Any comments/flames/whatever would be greatly appreciated. I will be > working on a patch that implements this on ia32 (the only SMP machine I > have).. let's see if I can get some real numbers to back up my claims. > > or you can just go tell me I'm insane. I don't doubt that you'll see a measurable difference, I doubt that you need to change the kernel though. -- Dimitris Michailidis dimitris@engr.sgi.com From owner-LinuxScalability@oss.sgi.com Tue Aug 8 13:39:06 2000 Received: by oss.sgi.com id ; Tue, 8 Aug 2000 13:38:47 -0700 Received: from chmls05.mediaone.net ([24.147.1.143]:32477 "EHLO chmls05.mediaone.net") by oss.sgi.com with ESMTP id ; Tue, 8 Aug 2000 13:38:12 -0700 Received: from pianoman.cluster.toy (h00001c600ed5.ne.mediaone.net [24.147.29.131]) by chmls05.mediaone.net (8.8.7/8.8.7) with ESMTP id QAA16113; Tue, 8 Aug 2000 16:37:22 -0400 (EDT) Date: Tue, 8 Aug 2000 16:37:49 -0400 (EDT) From: PianoMan X-Sender: clemej@pianoman.cluster.toy To: Dimitris Michailidis cc: linux-scalability@oss.sgi.com Subject: RE: [RFC] Adding the notion of System vs. Application processors In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing Dimitris, we agree almost completely. The more I have looked into this (and the recent changes put into the scheduler), the more it seems it CAN almost all be done from user space... however: On Tue, 8 Aug 2000, Dimitris Michailidis wrote: > > Only if you allow them to. The pinning mechanism in Linux is a lot more > flexible than the one in IRIX (and a lot simpler). By being able to pin to > arbitrary sets of CPUs you can do many interesting things. You want kernel > threads to run only on some of the CPUs? Fine, pin them to those CPUs. You > have 8 CPUs and you want to designate CPUs 4-7 as application processors?
You > could add a boot option to pass a mask to the kernel and the kernel would set > init_task.cpus_allowed to this mask. Boot with mask=0xf and then nothing > would run on CPUs 4-7 unless explicitly requested by the admin. except, like you said, you need the system call .... Also, I have already done this (with 2.4.0-test5 on a dual proc ia32 machine)... and I have still noticed, occasionally, even with all IRQs' /proc/irq/#/smp_affinity flags set to give all interrupts to the "system" processor, occasional spikes in "top" indicating that every once in a while the CPU, which was supposed to be completely idle except for timer, cascade, and fpu interrupts, still had activity to the point of being 5% utilized for a moment. Now, either I'm being lied to by top (quite possible), or something is still getting scheduled on that processor.. I'm guessing bhs or softirqs (or system calls?) (since they don't use the "can_schedule" macro). If that's the case, more kernel changes are still needed. (I'm currently tracing the paths now to see if there's something I've missed.) > Can be done with a boot option or later while the system is running. Small > amount of code required on top of existing infrastructure. The boot option > itself is a handful of lines. yes, especially with the addition of the cpus_allowed entry in task_struct. > Don't turn off the timer interrupt. It would mess up time accounting. We > also need it for slab cache management. IPIs must also be left on. Excuse a novice kernel hacker, why would slab cache management be taking place on these processors? or is each processor marking the pages it uses with its own timestamp.. > I believe that all of this can be done in user space without any scheduler > changes. To avoid quanta you can make the process SCHED_FIFO. So have a > user level application-processor-administration thingy to which you submit > your processes.
If it finds a free AP it can pin a process to it and make it > SCHED_FIFO to avoid quanta and that's all. If you're careful and don't > oversubscribe the APs you don't even need to pin, processor affinity will do > it for free. And if your process will be blocking often and you don't want to > let the CPU idle, you could assign a second process to the same CPU at a > lower SCHED_FIFO priority and it would get to run whenever the main thread is > blocked. a SCHED_FIFO process doesn't get interrupted and switched out at a quantum??? you mean I could write a process now that does an infinite loop on a uniprocessor and it would block the whole machine? I think you are correct in that the kernel changes would be minimal, but I still think there would need to be some minor re-working of the scheduler. > The only thing you need from the kernel that isn't there today is a pinning > system call (coming). The rest belongs to user space IMO. And I agree for the most part. but we need to add the bootup code to set the CPU mask, add the new pinning syscall, (arguably) make some minor modifications to the scheduler, and then look at softirqs and the like to see where these "other" artifacts I'm seeing are coming from. you are proposing doing what I am proposing, essentially. Once all those are in place, then you're right, we have all the mechanisms we need to do the rest in user space. Please realize that I do not work for SGI, I don't know what you've got up your sleeves... I did not know you were getting ready to add a syscall to do the processor pinning.... > > This is by no means a win in all situations. Firstly, one needs a > > large number of CPUs, I would guess NCPUS >= 8. > > It's easy to make it configurable at run time. agreed. > > - Should I disable the timer on the application CPUs? > > No. ok, I tend to agree, but not for the same reason...
> I don't see that the kernel needs to be aware of the partitioning at all, at > least as far as the scheduling component is concerned. I'm still not 100% convinced... > > - It seems kludgey, but a hell of a lot more elegant than running > > DOS. > > If you do it in user space it won't be. It can be quite elegant actually. agreed again. > I don't doubt that you'll see a measurable difference, I doubt that you need to > change the kernel though. but, you just admitted we still had to change the kernel, even if just to add the syscall :-) thanks for the input.. I think we're both on the same wavelength, for the most part at least. john.c From owner-LinuxScalability@oss.sgi.com Tue Aug 8 13:58:17 2000 Received: by oss.sgi.com id ; Tue, 8 Aug 2000 13:58:08 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:26724 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 8 Aug 2000 13:57:54 -0700 Received: from cthulhu.engr.sgi.com (gate3-relay.engr.sgi.com [130.62.1.234]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id OAA01847 for ; Tue, 8 Aug 2000 14:03:26 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: from darkside.engr.sgi.com (darkside.engr.sgi.com [163.154.5.83]) by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id NAA64605; Tue, 8 Aug 2000 13:57:08 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: (from dimitris@localhost) by darkside.engr.sgi.com (8.9.3/8.8.7) id NAA04142; Tue, 8 Aug 2000 13:57:08 -0700 Message-ID: X-Mailer: XFMail 1.4.4 on Linux X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Tue, 08 Aug 2000 13:57:08 -0700 (PDT) Organization: SGI From: Dimitris Michailidis To: PianoMan Subject: RE: [RFC] Adding the notion of System vs.
Application processors Cc: linux-scalability@oss.sgi.com Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing On 08-Aug-2000 PianoMan wrote: > Now, either I'm being lied to by > top (quite possible), or something is still getting scheduled on that > processor.. I'm guessing bhs or softirqs (or system calls?) (since they > don't use the "can_schedule" macro). Tasklets probably. >> Don't turn off the timer interrupt. It would mess up time accounting. We >> also need it for slab cache management. IPIs must also be left on. > > Excuse a novice kernel hacker, why would slab cache management be taking > place on these processors? or is each processor marking the pages it uses > with its own timestamp.. When we need to shrink slab caches we ask each cpu to shrink its part (caches are per-CPU these days). Then each CPU does it in its next timer interrupt. We also need IPIs or we won't be able to send signals, flush TLBs, etc. > a SCHED_FIFO process doesn't get interrupted and switched out at a > quantum??? No, they keep executing until something higher priority comes along. Same with SCHED_RR. When they run out of time they get a new quantum and continue executing if they are the highest priority process around. > you mean I could write a process now that does an infinite loop > on a uniprocessor and it would block the whole machine? Absolutely. That's why you need to be root to launch a real-time process. > And I agree for the most part. but we need to add the bootup code to set > the CPU mask, add the new pinning syscall, (arguably) make some minor > modifications to the scheduler, and then look at softirqs and the like to > see where these "other" artifacts I'm seeing are coming from. you are > proposing doing what I am proposing, essentially. There are also a number of bugs that need to be fixed.
-- Dimitris Michailidis dimitris@engr.sgi.com From owner-LinuxScalability@oss.sgi.com Tue Aug 8 15:42:37 2000 Received: by oss.sgi.com id ; Tue, 8 Aug 2000 15:42:25 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:35191 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 8 Aug 2000 15:39:45 -0700 Received: from nodin.corp.sgi.com (nodin.corp.sgi.com [192.26.51.193]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id PAA05937 for ; Tue, 8 Aug 2000 15:44:37 -0700 (PDT) mail_from (kanoj@google.engr.sgi.com) Received: from google.engr.sgi.com ([163.154.10.145]) by nodin.corp.sgi.com (980427.SGI.8.8.8/980728.SGI.AUTOCF) via ESMTP id PAA47338 for ; Tue, 8 Aug 2000 15:38:04 -0700 (PDT) Received: (from kanoj@localhost) by google.engr.sgi.com (SGI-8.9.3/8.9.3) id PAA48216; Tue, 8 Aug 2000 15:35:32 -0700 (PDT) From: Kanoj Sarcar Message-Id: <200008082235.PAA48216@google.engr.sgi.com> Subject: Re: [RFC] Adding the notion of System vs. Application processors To: clemej@alum.rpi.edu (PianoMan) Date: Tue, 8 Aug 2000 15:35:32 -0700 (PDT) Cc: dimitris@cthulhu.engr.sgi.com (Dimitris Michailidis), linux-scalability@oss.sgi.com In-Reply-To: from "PianoMan" at Aug 08, 2000 04:37:49 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing > > except, like you said, you need the system call .... Also, I have already As I noted in my first response to you, I did everything I could to get a system call to do pinning. It is not that hard to do, and I have created primitive patches towards this in the past, except that it will not go into 2.4. No SGI secret there ... > > Don't turn off the timer interrupt. It would mess up time accounting. We > > also need it for slab cache management. IPIs must also be left on.
FWIW, on IRIX, you _can_ turn off the timer interrupt on specific cpus, as well as IPIs, but these were a headache to get right ... probably worth it, just to be able to claim realtime latencies .... > mechanisms we need to do the rest in user space. Please realize that I > do not work for SGI, I don't know what you've got up your sleeves... I did > not know you were getting ready to add a syscall to do the processor > pinning.... See above ... Kanoj From owner-LinuxScalability@oss.sgi.com Mon Aug 14 21:34:37 2000 Received: by oss.sgi.com id ; Mon, 14 Aug 2000 21:34:26 -0700 Received: from chmls05.mediaone.net ([24.147.1.143]:50377 "EHLO chmls05.mediaone.net") by oss.sgi.com with ESMTP id ; Mon, 14 Aug 2000 21:34:15 -0700 Received: from pianoman.cluster.toy (h00001c600ed5.ne.mediaone.net [24.147.29.131]) by chmls05.mediaone.net (8.8.7/8.8.7) with ESMTP id AAA14661 for ; Tue, 15 Aug 2000 00:34:13 -0400 (EDT) Date: Tue, 15 Aug 2000 00:34:37 -0400 (EDT) From: PianoMan X-Sender: clemej@pianoman.cluster.toy To: linux-scalability@oss.sgi.com Subject: app/sys stuff again... Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing Just wanted to let you gentlemen know that I'm still alive.. and I'm in the process of writing the patch (and userspace) to implement my idea. in fact, it's 90% done. then I'll be able to do some benchmarks and see if it's a GoodThing(tm)... The problem I'm having now is that I can pin a process to a cpu (code to add the functionality to the prctl call was stolen from a patch Dimitris made to the kernel list a few months ago), and shut off all other scheduling/interrupts/tasklets (not efficiently, but it seems to work)... and it works like a charm if I don't change the scheduler.
But when I change the process to SCHED_FIFO, with a priority of 99, it apparently gets scheduled on the System CPU, and softlocks the system (I have a dual proc machine, 1 system cpu, and one app cpu.. so if it gets scheduled on the system cpu, then nothing else can run until it's done... it's supposed to go on the app cpu)..... ugh.. so close... So once I figure that out, I'll be posting a preliminary patch up here to see what people think... then I'll post some benchmarks and release it all for real, for better or for worse... john.c From owner-LinuxScalability@oss.sgi.com Mon Aug 14 21:51:06 2000 Received: by oss.sgi.com id ; Mon, 14 Aug 2000 21:50:56 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:5486 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Mon, 14 Aug 2000 21:50:55 -0700 Received: from cthulhu.engr.sgi.com (gate3-relay.engr.sgi.com [130.62.1.234]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id VAA20635 for ; Mon, 14 Aug 2000 21:43:20 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: from darkside.engr.sgi.com (darkside.engr.sgi.com [163.154.5.83]) by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id VAA81542; Mon, 14 Aug 2000 21:50:39 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: (from dimitris@localhost) by darkside.engr.sgi.com (8.9.3/8.8.7) id VAA15845; Mon, 14 Aug 2000 21:50:39 -0700 Message-ID: X-Mailer: XFMail 1.4.4 on Linux X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 14 Aug 2000 21:50:39 -0700 (PDT) Organization: SGI From: Dimitris Michailidis To: PianoMan Subject: RE: app/sys stuff again...
Cc: linux-scalability@oss.sgi.com Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing On 15-Aug-2000 PianoMan wrote: > The problem I'm having now is that I can pin a process to a cpu (code to > add the functionality to the prctl call was stolen from a patch Dimitris > made to the kernel list a few months ago), and shut off all other > scheduling/interrupts/tasklets (not efficiently, but it seems to > work)... and it works like a charm if I don't change the scheduler. > But when I change the process to SCHED_FIFO, with a priority of 99, it > apparently gets scheduled on the System CPU, and softlocks the system (I > have a dual proc machine, 1 system cpu, and one app cpu.. so if it gets > scheduled on the system cpu, then nothing else can run until it's > done... it's supposed to go on the app cpu)..... ugh.. so close... Heh. I think I know what's doing this. There's a bug in the current scheduler that allows a runnable process to be considered for scheduling by its current CPU even if it can no longer run on this CPU. Basically this means that if you try to pin a process to a CPU other than its current one, this will not take effect until the CPU is taken away from the process somehow (sleeping, preemption). With a SCHED_FIFO process it won't get preempted and I suppose it doesn't sleep either, so you're out of luck. Try having the process sleep for a moment and see if it behaves.
-- Dimitris Michailidis dimitris@engr.sgi.com From owner-LinuxScalability@oss.sgi.com Wed Aug 16 08:38:01 2000 Received: by oss.sgi.com id ; Wed, 16 Aug 2000 08:37:51 -0700 Received: from mail.rpi.alumlink.com ([207.92.136.80]:45833 "EHLO mail.alum.rpi.edu") by oss.sgi.com with ESMTP id ; Wed, 16 Aug 2000 08:37:47 -0700 Date: Wed, 16 Aug 2000 11:38:37 -0400 Message-Id: <200008161138.AA66716578@mail.alum.rpi.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii From: "clemej" Reply-To: To: Dimitris Michailidis CC: Subject: RE: app/sys stuff again... X-Mailer: Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing It worked like a charm by putting a "sleep(2)" between the prctl call and the call to sched_setscheduler(). I have a few more features I'd like to add, but I hope to have something concrete out by the end of the weekend. john.c ---------- Original Message ------------------- From: Dimitris Michailidis Date: Mon, 14 Aug 2000 21:50:39 -0700 (PDT) > >Heh. I think I know what's doing this. There's a bug in the current >scheduler that allows a runnable process to be considered for scheduling by >its current CPU even if it can no longer run on this CPU. Basically this >means that if you try to pin a process to a CPU other than its current this >will not take effect until the CPU is taken away from the process somehow >(sleeping, preemption). With a SCHED_FIFO it won't get preempted and I >suppose it doesn't sleep either so you're out of luck. Try having the >process sleep for a moment and see if it behaves.
> >-- >Dimitris Michailidis dimitris@engr.sgi.com > From owner-LinuxScalability@oss.sgi.com Wed Aug 23 17:15:58 2000 Received: by oss.sgi.com id ; Wed, 23 Aug 2000 17:15:48 -0700 Received: from chmls05.mediaone.net ([24.147.1.143]:32224 "EHLO chmls05.mediaone.net") by oss.sgi.com with ESMTP id ; Wed, 23 Aug 2000 17:15:21 -0700 Received: from pianoman.cluster.toy (h00001c600ed5.ne.mediaone.net [24.147.29.131]) by chmls05.mediaone.net (8.8.7/8.8.7) with ESMTP id UAA19329; Wed, 23 Aug 2000 20:14:49 -0400 (EDT) Date: Wed, 23 Aug 2000 20:15:05 -0400 (EDT) From: PianoMan X-Sender: clemej@pianoman.cluster.toy To: linux-scalability@oss.sgi.com cc: kanoj@google.engr.sgi.com, dimitris@engr.sgi.com Subject: App/Sys Proc Experiment an insignificant success... Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing Apologies if this is sent twice; I sent it last night but it neither bounced back to me (and I'm subscribed to the list) nor appeared in the archive, so I'm sending it again.. and CCing Kanoj and Dimitris JIC. john.c ------------------------- Well, for the three people who may actually subscribe to this list, and the one (me) that cares, the experiment that I did with setting some processors up as strictly application processors and some as system processors was a minor success. The process consists of: - applying the attached patch (to 2.4.0-test6) (it is by no means pretty, this is a proof of concept attempt) - configure, compile, and reboot the kernel.
- turn off all IRQs on the application processor(s) by changing each IRQ's smp_affinity bitmask in /proc/irq/(irq #)/smp_affinity (all except the timer, cascade, and FPU interrupts) - make a CPU-intensive program, and add the following calls to the beginning of it:

#include <sched.h>
#include <sys/prctl.h>
#include <unistd.h>

struct sched_param p;
prctl(9, 0, cpu_mask);  /* 9 = PR_SET_CPUSALLOWED from the patch; cpu_mask is the app CPU bitmask */
sleep(1);
p.sched_priority = 99;
sched_setscheduler(0, SCHED_FIFO, &p);

- become root and run the program... On a dual-processor PIIIEB 666MHz, I created a simple FPU-intensive program that operated on 60MB of matrices. It takes ~26 minutes to run completely. When scheduled on the specially designated application processor, it takes approximately 1 to 2% less wall clock time than when scheduled on the one system processor. Thus, the performance benefit is probably too little to be worth it (or even to measure, given measurement inaccuracy). Had it been 5%, I would feel differently ;-) The only real improvement is that if you heavily load down the system, the performance of the app on the application processor will change very little. It will still be fighting for memory and I/O, but it will always have the CPU to itself. Where this could also be of some benefit is extending the idea to handle ccNUMA architectures. For example, assign processes to a group of processors in a single SMP node, thus helping to avoid the effects of migration... I don't know, I'm talking off the top of my head here... Since I don't have a machine like that to even run some tests on, there is no way I can see if there's any benefit to it or not. So here's what I've done.. I'll write up the hard numbers in a little bit and put them on the web somewhere.... but for right now, here at least is my little hacked patch if anyone wants to play with it. Maybe someday it'll turn into something useful... if anyone is even remotely interested in this work I'll continue tinkering with it; if not.. I may just drop it.
NOTES ABOUT THE PATCH: - The prctl stuff was based on Dimitris's patch to the LKML a few months back. - The way the syscall is set up, it SHOULD allow changing another process's cpus_allowed field et al... however, that doesn't work (it seems when tried that the process never gets scheduled again). It's also a major security hole, I guess ;) but who cares ? patch attached.. and thanks Dimitris and Kanoj for the help/insight. First forays into the Linux scheduler are never easy, and I appreciate the pointers and gentle nudges in the right direction. later, john.c diff -u --recursive --new-file linux/arch/i386/config.in /raiddisk/john/linux/arch/i386/config.in --- linux/arch/i386/config.in Fri Feb 14 18:56:20 1997 +++ /raiddisk/john/linux/arch/i386/config.in Mon Aug 14 20:12:45 2000 @@ -157,6 +157,14 @@ mainmenu_option next_comment comment 'General setup' +if [ "$CONFIG_SMP" = "y" ]; then + bool 'Support for Application/System Processor Splitting' CONFIG_APPPROC + if [ "$CONFIG_APPPROC" = "y" ]; then + int ' Total Number of Processors in the System' CONFIG_APPPROC_TOTALNUM 2 + int ' Number of Processors to reserve as Application Processors' CONFIG_APPPROC_APPNUM 1 + fi +fi + bool 'Networking support' CONFIG_NET bool 'SGI Visual Workstation support' CONFIG_VISWS if [ "$CONFIG_VISWS" = "y" ]; then diff -u --recursive --new-file linux/include/linux/interrupt.h /raiddisk/john/linux/include/linux/interrupt.h --- linux/include/linux/interrupt.h Mon Aug 21 14:36:05 2000 +++ /raiddisk/john/linux/include/linux/interrupt.h Mon Aug 14 22:54:37 2000 @@ -159,7 +159,12 @@ if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { int cpu = smp_processor_id(); unsigned long flags; - + +#ifdef CONFIG_APPPROC /* there has to be a better way....
*/ + if ((1<next = tasklet_vec[cpu].list; tasklet_vec[cpu].list = t; @@ -173,6 +178,11 @@ if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { int cpu = smp_processor_id(); unsigned long flags; + +#ifdef CONFIG_APPPROC /* there has to be a batter way.... */ + if ((1<next = tasklet_hi_vec[cpu].list; diff -u --recursive --new-file linux/include/linux/prctl.h /raiddisk/john/linux/include/linux/prctl.h --- linux/include/linux/prctl.h Sun Mar 19 14:15:32 2000 +++ /raiddisk/john/linux/include/linux/prctl.h Mon Aug 14 21:16:43 2000 @@ -20,4 +20,8 @@ #define PR_GET_KEEPCAPS 7 #define PR_SET_KEEPCAPS 8 +/* Get/Set cpus_allowed */ +#define PR_SET_CPUSALLOWED 9 +#define PR_GET_CPUSALLOWED 10 + #endif /* _LINUX_PRCTL_H */ diff -u --recursive --new-file linux/include/linux/sched.h /raiddisk/john/linux/include/linux/sched.h --- linux/include/linux/sched.h Mon Aug 21 14:36:05 2000 +++ /raiddisk/john/linux/include/linux/sched.h Mon Aug 14 22:56:15 2000 @@ -66,6 +66,12 @@ #define CT_TO_SECS(x) ((x) / HZ) #define CT_TO_USECS(x) (((x) % HZ) * 1000000/HZ) +#ifdef CONFIG_APPPROC +# define APPPROC_MASK ( ((1<has_cpu = 0; p->processor = current->processor; +#ifdef CONFIG_APPPROC + p->cpus_allowed = current->cpus_allowed; +#endif /* ?? should we just memset this ?? 
*/ for(i = 0; i < smp_num_cpus; i++) p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0; diff -u --recursive --new-file linux/kernel/sys.c /raiddisk/john/linux/kernel/sys.c --- linux/kernel/sys.c Mon Aug 21 14:36:05 2000 +++ /raiddisk/john/linux/kernel/sys.c Sun Aug 20 18:08:14 2000 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -1155,7 +1156,8 @@ { int error = 0; int sig; - + struct task_struct *p; + switch (option) { case PR_SET_PDEATHSIG: sig = arg2; @@ -1206,6 +1208,33 @@ } current->keep_capabilities = arg2; break; +#ifdef CONFIG_APPPROC + case PR_SET_CPUSALLOWED: + if (arg2 == 0) { + current->cpus_allowed = arg3; + current->need_resched = 1; + } else { + p = find_task_by_pid(arg2); + if (!p) { + error = -EINVAL; break; + } + p->cpus_allowed = arg3; + p->need_resched = 1; + } + break; + case PR_GET_CPUSALLOWED: + if (arg2==0) { + error = put_user(current->cpus_allowed, (long *)arg3); + } else { + p = find_task_by_pid(arg2); + if (!p) { + error = -EINVAL; + break; + } + error = put_user(p->cpus_allowed, (long *)arg3); + } + break; +#endif default: error = -EINVAL; break; - -- John Clemens RPI Computer Engineering 2000 clemej@alum.rpi.edu http://www.rpi.edu/~clemej/ "I Hate Quotes" -- Samual L. 
Clemens From owner-LinuxScalability@oss.sgi.com Wed Aug 23 17:31:57 2000 Received: by oss.sgi.com id ; Wed, 23 Aug 2000 17:31:47 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:48995 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Wed, 23 Aug 2000 17:31:24 -0700 Received: from cthulhu.engr.sgi.com (gate3-relay.engr.sgi.com [130.62.1.234]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via ESMTP id RAA03176 for ; Wed, 23 Aug 2000 17:37:11 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: from darkside.engr.sgi.com (darkside.engr.sgi.com [163.154.5.83]) by cthulhu.engr.sgi.com (980427.SGI.8.8.8/970903.SGI.AUTOCF) via ESMTP id RAA64862; Wed, 23 Aug 2000 17:30:38 -0700 (PDT) mail_from (dimitris@darkside.engr.sgi.com) Received: (from dimitris@localhost) by darkside.engr.sgi.com (8.9.3/8.8.7) id RAA12440; Wed, 23 Aug 2000 17:30:37 -0700 Message-ID: X-Mailer: XFMail 1.4.4 on Linux X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Wed, 23 Aug 2000 17:30:37 -0700 (PDT) Organization: SGI From: Dimitris Michailidis To: PianoMan Subject: RE: App/Sys Proc Experiment an insignificant success... Cc: kanoj@google.engr.sgi.com, linux-scalability@oss.sgi.com Sender: owner-LinuxScalability@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;linux-scalability-outgoing On 24-Aug-2000 PianoMan wrote: > On a dual processor PIIIEB 666Mhz, I created a simple FPU intensive > program that operated on 60MB of matricies. It takes ~26 minutes to run > completely. When scheduled on the special designated application > processor, it takes approximately 1 to 2% less wall clock time then when > scheduled on the one system processor. Thus, the performance benefit is > probably too little to be worth it (or even to measure given measurement > inaccuracy). 
Had it been 5%, I would feel differently ;-) Did you try loads that included a lot of network activity? That's probably where you'd see the biggest difference. A lot of the network processing happens in softirqs and your patch would take all this processing away from the application cpu. -- Dimitris Michailidis dimitris@engr.sgi.com