This is a compilation of some ideas I've been kicking around and have
started to implement. It is not a win in all cases... please consider
my arguments carefully; all comments are welcome. Some issues have
already been raised in private conversations with people on this list,
and I hope to address them here.
----
As Linux scales to larger and larger machines, it is becoming more of
a force in the technical and high performance computing areas. These
areas generate workloads that are generally CPU bound, as opposed to
the I/O bound loads often generated by large scale servers for
E-Commerce and the like. These programs want every last possible cycle
of CPU power available to them, because they often take hours, days,
or weeks to run... even a 1 or 2% performance improvement could save
hours of computing time.
The problem is that Linux is optimised as a general purpose OS, and
does not contain very many ways of tuning for this particular case...
The problems: (correct me if my view of reality is wrong, please)
- The standard Linux scheduler uses a modified round robin approach
with time quanta. The scheduler schedules processes to run on all the
CPUs, and when a small period of time, or "quantum", expires, the
scheduler interrupts the running process and sees if anything else
needs to get run. Even if your CPU bound process is the only one
running, it will still be subject to this quantum interruption, losing
a few cycles each time. Also, while a process is weighted towards
running on the same CPU, it is not guaranteed to always be scheduled
on the same CPU, thus losing any cache it had already warmed in the
first processor. (This may have changed with some recent patches from
SGI.) Finally, on a multi-user system where more processes are running
than there are processors available, the scheduler will schedule other
tasks on the current processor, even if your job is "more important".
- Interrupts are by default turned on and handled by all
processors... as are tasklets, bottom halves, and soft IRQs. Each of
these will interrupt the flow of your program and pollute your current
local cache. (This too turns out to be mostly tunable through the
/proc/irq/#/smp_affinity file in later kernels; see the sketch after
this list.)
- System processes, as well as user processes, can be scheduled on
any CPU at any time. While you may be able to guarantee that your
process will always use one processor, there is no guarantee to keep
other processes from using that processor. Even on an otherwise idle
system, kernel threads can get scheduled on your CPUs.
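As a quick illustration of that smp_affinity knob: the file takes a
hex bitmask of CPUs allowed to service the IRQ, so a small program (or
an echo from a shell) can steer an interrupt away from your compute
CPUs. The IRQ number and mask below are made-up example values.

	/* irq_affinity.c -- restrict IRQ 14 to CPUs 0-3 (mask 0x0f).
	 * Illustrative values only; pick your own IRQ and mask. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/irq/14/smp_affinity", "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Bitmask of CPUs that may handle this IRQ:
		 * bit N set => CPU N may take the interrupt. */
		fprintf(f, "%x\n", 0x0f);
		return fclose(f) ? 1 : 0;
	}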
My proposed solution:
	Add to Linux's flexibility by allowing the administrator of a
large machine to declare a certain number of processors on that system
to be "application" processors, and the others to be "system"
processors. When Linux boots, it will run on the system processors,
assigning user space programs, IRQs, tasklets, and system threads to
ONLY the system processors. The processors which are designated to be
application processors are spun up but NEVER ASSIGNED ANY TASKS, and
all (maybe not timer/fpu?) interrupts are disabled on them. Then
modify the scheduler to add a new scheduling class called
"SCHED_BATCH"; any process that is started with SCHED_BATCH scheduling
would enter a special simple batch scheduler, which would hand the
task to an application CPU and let it run until it is complete, then
run the next SCHED_BATCH process in the queue. If no application
processors are available, we can either queue the task to run once the
current tasks are complete, or turn the task back over to the system
CPU pool to be run as a regular process. (Does this answer your
question, Dimitris? We get rid of the quantum by having a separate
scheduler for those CPUs.)
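To make the dispatch logic concrete, here is a tiny self-contained C
sketch that models the proposed policy: a FIFO of SCHED_BATCH jobs, a
fixed pool of application CPUs that run jobs to completion with no
quantum, and overflow jobs demoted back to the normal round robin
pool. All the names here are mine and purely illustrative; this is a
model of the idea, not kernel code.

	/* batch_sim.c -- toy model of the proposed SCHED_BATCH
	 * dispatch; illustrative names, not kernel code. */
	#include <stdio.h>

	#define APP_CPUS 4	/* CPUs reserved as "application" CPUs */

	enum policy { POLICY_BATCH, POLICY_RR };

	/* Hand each batch job to a free application CPU; once the
	 * pool is full, demote the rest to the system (RR) pool. */
	static void dispatch(enum policy *jobs, int njobs)
	{
		int i, busy = 0;

		for (i = 0; i < njobs; i++) {
			if (jobs[i] != POLICY_BATCH)
				continue;
			if (busy < APP_CPUS) {
				/* Pinned to one CPU, runs to
				 * completion: no quantum. */
				printf("job %d -> application CPU %d\n",
				       i, busy++);
			} else {
				jobs[i] = POLICY_RR;
				printf("job %d -> system pool (RR)\n", i);
			}
		}
	}

	int main(void)
	{
		enum policy jobs[7];
		int i;

		for (i = 0; i < 7; i++)	/* e.g. 7 SETI@Home clients */
			jobs[i] = POLICY_BATCH;
		dispatch(jobs, 7);
		return 0;
	}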
	Essentially, this turns a configurable number of processors
into special purpose "co-processors". The same effect could be
accomplished by building a PCI card, putting a few PPC chips on it,
and telling Linux about it... except that these processors can access
all the resources regular Linux processes can, without any penalty. To
put it another way, you're turning the machine into two separate
machines with unique characteristics, sharing memory and disk: one a
general purpose multi-user front end, and the other a CPU intensive
farm as a backend.
	As an example, let's take an 8 CPU server and partition it
into 4 system processors and 4 application processors. The system
processors would be governed by the standard scheduler, and the 4
application processors would be waiting there for a task. Say a user
then wanted to run 7 copies of SETI@Home. They could mark the
executable to use the SCHED_BATCH scheduler (maybe use a user space
program to modify a field in the ELF header? see the sketch after this
paragraph for another possibility). The general scheduler sees these 7
tasks are a special case and sends them off to the batch scheduler,
which assigns 4 SETI processes to the four application processors.
These processes now proceed as fast as the hardware will allow,
without interruption from the OS, until they are complete (or killed).
Upon seeing that all application processors are in use, the scheduler
would then change the remaining SETI processes to SCHED_RR and send
them back to the normal scheduler, to be run on the user processors
and fight for their time as they would have before. (Does this answer
your question, Kanoj? We're not limiting the number of CPUs available
to the process, we're making a few of the available processors
"better".) The other benefit is that if someone logs in and runs 20
copies of the CPU-Eater background (see IRIX, circa 1994 :-), the 4
application processors will not be affected at all, and will happily
go along doing their useful work. The SETI@Home clients running on the
application processors would be close to the theoretical limits of the
hardware in speed, and thus run faster than the ones in the user pool.
If you swap SETI@Home for your favorite CPU intensive single threaded
or multi-threaded program, then you have a winning situation.
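Instead of (or in addition to) the ELF header trick, an ordinary
launcher could request the new policy through the existing
sched_setscheduler() interface. The SCHED_BATCH value below is a
hypothetical constant for the proposed policy; the rest is the
standard call.

	/* batchrun.c -- launch a command under the proposed
	 * SCHED_BATCH policy. The policy value is a hypothetical
	 * assumption. Usage: batchrun ./setiathome */
	#include <sched.h>
	#include <stdio.h>
	#include <unistd.h>

	#ifndef SCHED_BATCH
	#define SCHED_BATCH 3	/* assumed value for the new policy */
	#endif

	int main(int argc, char **argv)
	{
		struct sched_param sp;

		sp.sched_priority = 0;
		if (argc < 2) {
			fprintf(stderr, "usage: %s command [args]\n",
				argv[0]);
			return 1;
		}
		if (sched_setscheduler(0, SCHED_BATCH, &sp) < 0) {
			perror("sched_setscheduler");
			return 1;
		}
		execvp(argv[1], argv + 1); /* policy survives exec */
		perror("execvp");
		return 1;
	}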
	This is by no means a win in all situations. Firstly, one
needs a large number of CPUs; I would guess NCPUS >= 8. Secondly, if
the OS needs 8 CPUs to keep up with the system load, then taking a few
CPUs out of the general OS pool may help the CPU intensive tasks but
may very well hurt the rest of the system. This is also something that
I feel should NEVER be merged into the main Linux kernel. This would
be a patch to handle a specific, special purpose goal, and would go
against the "optimise for the general case" philosophy of Linux.
Instead, I feel that this scheme should be maintained as a patch
outside of the Linux tree, so that it would be available for those who
wished to take advantage of it.
	There are still a lot of details to be hammered out, and to be
quite honest, I'm not sure this is a win in ANY situation... but
that's why I'm in the middle of coding it up to show everyone. It
seems to me this is very similar to the way that Cray does things on
its large T3x series (of course, there it's hundreds of processors,
not tens). But I feel that in certain situations this could be a big
win, especially on computation servers which require some interactive
"traditional" server capabilities but also need to provide hard core
CPU crunching. It could also be very useful for real-time
applications, as Kanoj pointed out to me... however, I have not
examined this issue.
Here are some possible issues I can see myself:
	- Tasks on the application processors that block waiting for
	  I/O can't do any useful work. (This doesn't pollute the
	  cache, but it seems somewhat wasteful.)
	- I've not talked about gang scheduling on the application
	  CPUs... it could be a big win, but I don't know how to
	  handle the case of threads of the same program running on
	  application processors and user processors at the same time.
- Should I disable the timer on the application CPUs?
	- The simple batch scheduler could be used as a
	  denial-of-service attack (does anyone really care?).
	- If implemented correctly, there should be little impact on
	  the user space side of things: an extra compare in the
	  scheduler, and a creative config option (a sketch follows
	  this list). If not desired, the option can be configured out
	  at compile time and not used at all.
	- It seems kludgey, but it's a hell of a lot more elegant than
	  running DOS.
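On the "extra compare" point: here is a runnable toy showing the only
cost the normal scheduler would ever see, and how the config option
compiles it away entirely. CONFIG_BATCH_CPUS and the policy value are
hypothetical names of mine.

	/* extra_compare.c -- toy illustration of the fast-path cost:
	 * one compare, removable at compile time. The config symbol
	 * and policy value are hypothetical. */
	#include <stdio.h>

	#define CONFIG_BATCH_CPUS	/* the "creative config option" */
	#define SCHED_BATCH 3		/* assumed new policy value */
	#define SCHED_OTHER 0

	static void batch_enqueue(int pid)
	{
		printf("pid %d handed to the batch dispatcher\n", pid);
	}

	/* The hook schedule() would call; with CONFIG_BATCH_CPUS
	 * undefined it compiles down to nothing at all. */
	static int maybe_hand_to_batch(int pid, int policy)
	{
	#ifdef CONFIG_BATCH_CPUS
		if (policy == SCHED_BATCH) {	/* the extra compare */
			batch_enqueue(pid);
			return 1;
		}
	#endif
		return 0;
	}

	int main(void)
	{
		maybe_hand_to_batch(101, SCHED_OTHER); /* normal task */
		maybe_hand_to_batch(102, SCHED_BATCH); /* batch task */
		return 0;
	}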
Any comments/flames/whatever would be greatly appreciated. I will be
working on a patch that implements this on ia32 (the only SMP machine
I have)... let's see if I can get some real numbers to back up my
claims.
Or you can just tell me I'm insane.
And I know my spelling's atrocious, but it's also 1:30am... I need
some sleep.
john.c
--
John Clemens RPI Computer Engineering 2000 clemej@xxxxxxxxxxxx
http://pianoman.penguinpowered.com/ "I Hate Quotes" -- Samuel L. Clemens