pagg
[Top] [All Lists]

Re: Process Notification (pnotify) performance comparisons

To: Erik Jacobson <erikj@xxxxxxx>
Subject: Re: Process Notification (pnotify) performance comparisons
From: Erik Jacobson <erikj@xxxxxxx>
Date: Fri, 30 Sep 2005 11:59:21 -0500
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, lse-tech@xxxxxxxxxxxxxxxxxxxxx, akpm@xxxxxxxx, kingsley@xxxxxxxxxx, canon@xxxxxxxxx, pagg@xxxxxxxxxxx
In-reply-to: <20050929170325.GD15246@xxxxxxx>
References: <20050921213645.GB28239@xxxxxxx> <20050922151647.GA30784@xxxxxxxxxxxxx> <20050929170325.GD15246@xxxxxxx>
Sender: pagg-bounce@xxxxxxxxxxx
User-agent: Mutt/1.5.6i
New performance tests.  

These tests were run on a 32-processor Altix system.

It has 1300 MHz CPUs.  Note that the individual CPUs are slower than my
earlier testing with the 2p box.  fork-bomb-like tests are slower on 
big NUMA machines like this unless you restrict them to running CPUs 
within the same node (and this is why the 2p box does better in that
respect).

I know it was requested that I run this on a behemoth - I'm just hoping
32p is big enough.  The bigger the machine, the harder it is to reserve
time...

As seen in the thread, my first pass at an RCU implementation isn't good 
enough to use for real.  I'm hoping it is good enough to get us some numbers.

Conclusions:
1. My feeling is that RCU isn't buying us much here.  In fact, it 
   restricts what kernel modules can do in that you can't sleep in many
   interesting places.  This affects things such as kmalloc's (unless you 
   say you aren't willing to wait)... It also makes it so you can't use 
   rwsems in all the places you might need to.

2. In the data below, unlike the 2p test, there are some differences.
   My conclusion here, and the reason I added an additional test, is
   that the Job infrastructure is the cause for a small slow down.
   So I added a kernel test where pnotify still has two kernel module
   subscribers per task, but Job isn't one of them.  Job is replaced by
   a tiny kernel module that just simply counts the number of times 
   the fork, exit, and exec hooks fire.  This takes Job out of the 
   equation.  That kernel closely tracks the stock kernel.  That means
   pnotify isn't at fault, Job is.  In other words, if Job implemented its
   own hooks and was configured on, we'd probably see the same small dip.

3. The fork/exit tests (jump to the end for a summary) show stock performing
   a tiny bit better than pnotify with two subscribers and no job (keyrings 
   and a test module).  I suspect if I did a test where pnotify only had 
   keyrings as a subscriber, the numbers would be nearly the same.  The AIM7 
   data shows these two kernels as very similar.

4. So I feel pnotify isn't really costing us but the data does show we need
   to keep an eye on pnotify users.  We need to treat new pnotify users in a 
   similar way as new callouts in exit, copy_process, and exec when the 
   pnotify user plans to associate with all or many running tasks (at least
   for performance reasons)

I'm going to integrate other comments I got on the pnotify patch now and 
send a new version of the non-rcu pnotify patch soon.

Some info on the tests:

jobtest: a mini job test suite that includes forking enough processes
         to make the PID numbers roll.  Output trimmed.

forkexit: just forks and exits the specified number of times.

fork-wait-exit: is the same, but the parent waits for the child.

AIM7: The version of aim7 we use is not tracking the current community
release. 

All kernels had kdb patches applied and kdb was enabled.

4 kernels tested:
2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, original NON-RCU
2.6.14-rc2 with pnotify, job, and pnotify-aware keyrings, RCU Testing
2.6.14-rc2 stock (no pnotify, no job, non-modified keyrings enabled)
2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module

In the last kernel, I'm showing a pnotify user that only does atomic adds
and subtracts on a few variables and nothing else. The test module, like
keyrings, is a subscriber to all processes.   The purpose of this test is to
take the Job infrastructure out of the performance picture to just focus
on pnotify.  This tiny test module counts the number of times fork, exit,
and exec fires.  Output is provided.


2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation NOT using rcu
------------------------------------------------------------------------------

=== jobtest ===
belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m26.222s
user    0m1.736s
sys     0m21.380s

belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m35.574s
user    0m2.512s
sys     0m31.356s

belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m35.696s
user    0m2.556s
sys     0m31.128s


=== forkexit ===

belay:~ # time ./forkexit 40000
Fork returned an error: 7720 times
 
real    0m14.665s
user    0m0.072s
sys     0m14.464s
belay:~ # time ./forkexit 40000
Fork returned an error: 7720 times
 
real    0m15.439s
user    0m0.104s
sys     0m15.212s
belay:~ # time ./forkexit 40000
Fork returned an error: 7720 times
 
real    0m15.115s
user    0m0.068s
sys     0m14.924s

=== fork-wait-exit ===

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m25.494s
user    0m2.136s
sys     0m31.588s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m25.208s
user    0m1.760s
sys     0m28.208s

=== AIM7 ===

Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP 
PREEMPT Wed Sep 28 15:01:33 CDT 2005
HOST = belay
CPUS = 32
DIRS = 1
DISKS= 0
FS   = xfs
SCSI = non-xscsi
ID   = pnotify+job+new-keyring, NON-RCU
Run 1 of 1
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   AIM7 Run        Sep 29 16:03:37 
2005
 
Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       1907.6          100     3.1     0.6     31.7929
2       3694.1          96      3.2     1.2     30.7839
3       5643.2          96      3.1     1.5     31.3510
4       7994.5          99      2.9     1.8     33.3104
5       9417.5          97      3.1     2.4     31.3916
10      18976.2         97      3.1     4.7     31.6270
20      37439.7         98      3.1     9.1     31.1997
50      74348.5         94      3.9     23.6    24.7828
100     101694.9                90      5.7     46.5    16.9492
150     124643.1                87      7.0     67.8    13.8492
200     130669.1                86      8.9     90.3    10.8891
500     154877.9                83      18.8    227.9   5.1626
1000    150566.6                80      38.7    453.3   2.5094
2000    154665.9                78      75.3    937.6   1.2889




2.6.14-rc2 pnotify, job, pnotify-aware keyrings implementation with RCU
------------------------------------------------------------------------------
=== jobtest ===

belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m26.501s
user    0m1.608s
sys     0m20.596s

belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m35.501s
user    0m2.320s
sys     0m30.148s

belay:~ # time jobtest
[snip]
Great.  All tests completed, no errors
 
real    0m35.563s
user    0m2.576s
sys     0m30.632s

=== forkexit ===

--> Note: The first attempt took 1 minutes 23 seconds.  This only 
--> happened once.  With the stock kernel test below, one attempt was
--> an outlier taking more than 5 minutes.  Therefore, I don't think it
--> is anything I changed that caused this.

belay:~ # time ./forkexit 40000
Fork returned an error: 7701 times
 
real    1m22.814s
user    0m0.064s
sys     1m22.740s

belay:~ # time ./forkexit 40000
Fork returned an error: 7693 times
 
real    0m14.557s
user    0m0.076s
sys     0m14.368s
belay:~ # time ./forkexit 40000
Fork returned an error: 7693 times
 
real    0m14.774s
user    0m0.076s
sys     0m14.580s
belay:~ # time ./forkexit 40000
Fork returned an error: 7693 times
 
real    0m15.218s
user    0m0.112s
sys     0m14.992s


=== fork-wait-exit ===

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.922s
user    0m1.560s
sys     0m28.060s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m25.178s
user    0m1.680s
sys     0m27.988s


=== AIM7 ===

nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #5 SMP 
PREEMPT Thu Sep 29 15:00:00 CDT 2005
HOST = belay
CPUS = 32
DIRS = 1
DISKS= 0
FS   = xfs
SCSI = non-xscsi
ID   = RCU pnotify+job+new-keyring
Run 1 of 1
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   AIM7 Run        Sep 29 16:42:08 
2005
 
Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       1903.8          100     3.1     0.6     31.7305
2       3655.8          96      3.2     1.2     30.4648
3       5628.6          96      3.1     1.6     31.2701
4       7553.5          96      3.1     2.1     31.4731
5       10003.4         99      2.9     2.2     33.3448
10      18945.3         97      3.1     4.7     31.5755
20      36823.8         97      3.2     9.2     30.6865
50      72532.4         94      4.0     23.4    24.1775
100     97552.8         90      6.0     46.7    16.2588
150     120981.2                87      7.2     68.4    13.4424
200     130581.1                86      8.9     88.7    10.8818
500     151665.2                83      19.2    228.0   5.0555
1000    153586.3                81      37.9    467.5   2.5598
2000    153454.7                79      75.9    941.7   1.2788



2.6.14-rc2 stock+kdb, no pnotify, no job, unmodified keyrings enabled
------------------------------------------------------------------------------
=== jobtest ===
None: kernel doesn't have pnotify or job

=== forkexit ===

--> My first attempt on a stock kernel+kdb took more than 5 minutes.  The
--> data isn't useful because I escaped in to kdb to check on some things.
--> There was a similar outlier for my RCU pnotify kernel above that took
--> 1 minute 23 above.  Later runs were normal.

belay:~ # time ./forkexit 40000
Fork returned an error: 7698 times
 
real    0m14.421s
user    0m0.088s
sys     0m14.220s
belay:~ # time ./forkexit 40000
Fork returned an error: 7699 times
 
real    0m14.282s
user    0m0.064s
sys     0m14.100s
belay:~ # time ./forkexit 40000
Fork returned an error: 7699 times
 
real    0m15.736s
user    0m0.072s
sys     0m15.648s


=== fork-wait-exit ===

--> The 16.838 was an outlier I was never able to duplicate.
--> I tried many times.  Of course, if I resrict it to just two processors
--> on a single node, it's faster.  Perhaps we got lucky once.

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m16.838s
user    0m1.080s
sys     0m17.092s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.599s
user    0m1.728s
sys     0m28.248s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.742s
user    0m1.668s
sys     0m27.332s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.777s
user    0m1.720s
sys     0m26.548s

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.829s
user    0m1.644s
sys     0m27.168s


=== AIM7 ===

nux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #2 SMP 
PREEMPT Wed Sep 28 14:53:02 CDT 2005
HOST = belay
CPUS = 32
DIRS = 1
DISKS= 0
FS   = xfs
SCSI = non-xscsi
ID   = stock, non-modified-keyrings enabled
Run 1 of 1
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   AIM7 Run        Sep 29 17:38:42 
2005
 
Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       1909.4          100     3.0     0.6     31.8241
2       3670.8          96      3.2     1.2     30.5897
3       5597.9          96      3.1     1.6     31.0997
4       7543.7          96      3.1     2.1     31.4323
5       9445.0          97      3.1     2.4     31.4833
10      19088.2         97      3.0     4.5     31.8137
20      37977.2         98      3.1     9.0     31.6476
50      73559.2         95      4.0     23.6    24.5197
100     101147.0                90      5.8     47.1    16.8578
150     120496.9                86      7.2     68.5    13.3885
200     129003.7                86      9.0     90.5    10.7503
500     149737.6                83      19.4    222.8   4.9913
1000    158010.5                80      36.8    439.7   2.6335
2000    156453.7                77      74.4    885.3   1.3038



2.6.14-rc2 non-rcu pnotify, NO Job, pnotify-aware keyrings, tiny pnotify module
------------------------------------------------------------------------------
=== jobtest ===
None: kernel doesn't have job

=== forkexit ===

--> Like the other tests, the first run of this took longer than the other
--> runs of it.

belay:~ # time ./forkexit 40000
Fork returned an error: 7686 times
 
real    1m35.260s
user    0m0.076s
sys     1m35.064s
belay:~ # time ./forkexit 40000
Fork returned an error: 7686 times
 
real    0m15.843s
user    0m0.068s
sys     0m15.652s
belay:~ # time ./forkexit 40000
Fork returned an error: 7687 times
 
real    0m14.404s
user    0m0.064s
sys     0m14.220s
belay:~ # time ./forkexit 40000
Fork returned an error: 7687 times
 
real    0m14.487s
user    0m0.060s
sys     0m14.304s

=== fork-wait-exit ===

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m25.205s
user    0m2.112s
sys     0m30.684s

belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m25.045s
user    0m1.872s
sys     0m30.208s
belay:~ # time ./fork-wait-exit 40000
Fork returned an error: 0 times
 
real    0m24.927s
user    0m1.964s
sys     0m29.200s


=== AIM7 ===

Linux version 2.6.14-rc2 (erikj@attica) (gcc version 3.3.3 (SuSE Linux)) #3 SMP 
PREEMPT Fri Sep 30 09:21:15 CDT 2005
HOST = belay
CPUS = 32
DIRS = 1
DISKS= 0
FS   = xfs
SCSI = non-xscsi
ID   = non-rcu pnotify, NO Job, pnotify keyrings, tiny pnotify test module
Run 1 of 1
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   AIM7 Run        Sep 30 10:19:29 
2005
 
Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       1909.4          100     3.0     0.6     31.8241
2       3694.1          96      3.2     1.2     30.7839
3       5679.9          97      3.1     1.5     31.5550
4       7538.9          96      3.1     2.2     31.4119
5       10031.0         99      2.9     2.2     33.4367
10      19088.2         97      3.0     4.5     31.8137
20      36964.1         97      3.1     9.2     30.8034
50      72424.1         95      4.0     23.9    24.1414
100     102410.7                90      5.7     46.1    17.0684
150     120148.6                86      7.3     67.4    13.3498
200     131927.9                87      8.8     89.5    10.9940
500     154229.4                83      18.9    220.7   5.1410
1000    158552.9                80      36.7    440.5   2.6425
2000    152884.3                78      76.1    882.8   1.2740


=== Special note for this kernel ===

After all the testing, I removed my tiny test kernel module that did 
atomic increments to count the number of times various hooks happened.
Here is the info from that (from dmesg):

Unregistering pnotify support for (name=pnotify-test)
exit called 797136 times...
fork called 796785 times...
init called 351 times...
exec called 97675 times ...
Good - fork count + init count equals exit count.

------------------------------------------------------------------------------
I really would need more trials to see if these converge but I think we're
covered on data with the AIM runs.  Let me know if more data is requested.
It appears stock is performing best but pnotify with two subscribers
(keyrings, which is present in stock and my test module that atomic increments
counters for the callbacks) is very close.  I suspect if I ran another
test where only one subscriber was present to match what was in stock,
the numbers would be nearly the same.

forkexit summary (real time average minus outliers):
     pnotify+job+pnotify keyrings, NON-RCU:  15.07
         pnotify+job+pnotify keyrings, RCU:  14.85
                                     stock:  14.813
pnotify+pnotify keyrings+test mod, non-rcu:  14.911


fork-wait-exit summary (real time average minus outlier):
     pnotify+job+pnotify keyrings, NON-RCU:  25.351 
         pnotify+job+pnotify keyrings, RCU:  25.05 
                                     stock:  24.703 
pnotify+pnotify keyrings+test mod, non-rcu:  25.059 

<Prev in Thread] Current Thread [Next in Thread>